diff --git a/README.md b/README.md
index 4e0e087..1b5e733 100644
--- a/README.md
+++ b/README.md
@@ -2,93 +2,143 @@
 
 ## Overview
 
-This project is designed to automate the process of handling, extracting, and processing documents. It integrates various services, such as AWS Textract, OpenAI, Dropbox, Nextcloud, and Paperless NGX, to extract metadata, process document contents, and store the results. The system is flexible and configurable via environment variables, allowing easy customization for different workflows.
+This project automates the handling, extraction, and processing of documents using a combination of services such as OpenAI, Dropbox, Nextcloud, and Paperless NGX. The system extracts metadata, processes document contents, and stores the results efficiently. It is designed for flexibility and configurability through environment variables, making it easily customizable for different workflows.
 
 ## Features
 
-- **Document Upload & Storage**: Upload documents to S3 and manage them with Dropbox and Nextcloud.
-- **OCR Processing**: Use AWS Textract to extract text from scanned documents.
+- **Document Upload & Storage**: Upload and manage documents via Dropbox and Nextcloud.
+- **OCR Processing**: Extract text from scanned documents.
 - **Metadata Extraction**: Automatically extract key information using OpenAI's API.
 - **Document Management**: Store processed documents and metadata in Paperless NGX for easy retrieval.
 - **IMAP Integration**: Fetch documents from multiple IMAP email accounts for processing.
 
 ## Environment Variables
 
-The project is configured through the `.env` file, where you define credentials and settings such as:
-
-- AWS access keys for Textract
-- Dropbox and Nextcloud credentials
-- Paperless NGX API token
-- IMAP settings for fetching documents
-
-
-# `.env` Configuration Explanation
-
-This section provides details about each configuration option in the `.env` file. It also explains how to retrieve the necessary credentials and information for each setting.
-
-| **Variable** | **Description** | **How to Obtain** |
-|---------------------------------------|-----------------|-------------------|
-| `DATABASE_URL` | Path to the SQLite database. | This is typically set to a local path where your database file will be stored. Example: `sqlite:///./app/database.db` |
-| `REDIS_URL` | URL for connecting to Redis. | Set the URL for your Redis instance (e.g., `redis://localhost:6379/0`). Install Redis locally or use a cloud service like Redis Labs. |
-| `WORKDIR` | Working directory for the application. | Choose a directory where your application will read and write files, such as `/workdir`. |
-| `AWS_REGION` | AWS region for services like Textract and S3. | Set this to your AWS region (e.g., `eu-central-1`). You can find your region in the AWS Management Console. |
-| `S3_BUCKET_NAME` | Name of your S3 bucket. | Create an S3 bucket in your AWS account and use its name (e.g., `my-bucket-name`). |
-| `NEXTCLOUD_UPLOAD_URL` | URL for uploading files to Nextcloud. | Format the URL for Nextcloud's WebDAV endpoint. Example: `https://nextcloud.example.com/remote.php/dav/files/` |
-| `NEXTCLOUD_FOLDER` | Folder in Nextcloud where files are uploaded. | Choose the folder path where documents will be uploaded (e.g., `/Documents/Uploads`). |
-| `PAPERLESS_NGX_URL` | URL for Paperless NGX API endpoint for document uploads. | Obtain this from your Paperless NGX instance (e.g., `https://paperless.example.com/api/documents/post_document/`). |
-| `PAPERLESS_HOST` | Host URL for Paperless NGX. | Set this to the root URL of your Paperless NGX instance (e.g., `https://paperless.example.com`). |
-| **Tokens/API Credentials** | **Tokens and keys for third-party services** | **How to Obtain** |
-| `AWS_ACCESS_KEY_ID` | AWS Access Key ID for authentication with AWS services. | Create a new access key from the AWS IAM console under **Access Management** → **Users**. |
-| `AWS_SECRET_ACCESS_KEY` | AWS Secret Access Key for AWS authentication. | This is generated alongside the access key ID in the AWS IAM console. Keep this secure. |
-| `OPENAI_API_KEY` | API key for accessing OpenAI services. | Get this from the [OpenAI platform](https://platform.openai.com/account/api-keys). |
-| `PAPERLESS_NGX_API_TOKEN` | API token for Paperless NGX. | Obtain this from your Paperless NGX instance or generate a new token from its settings. |
-| `DROPBOX_APP_KEY` | Dropbox App Key for Dropbox API. | Generate this in the [Dropbox Developer Console](https://www.dropbox.com/developers/apps/create). |
-| `DROPBOX_APP_SECRET` | Dropbox App Secret for Dropbox API. | This can be found in the Dropbox Developer Console after creating your app. |
-| `DROPBOX_REFRESH_TOKEN` | Dropbox Refresh Token for accessing Dropbox. | You can get this by going through the OAuth process using Dropbox's CLI or API. Check the [OAuth guide](https://www.dropbox.com/developers/reference/oauth-guide). |
-| **User Credentials** | **Authentication for various services** | **How to Obtain** |
-| `ADMIN_USERNAME` | Username for admin access to the application. | This is typically chosen by you and set for managing your application. |
-| `ADMIN_PASSWORD` | Password for admin access. | Set this to a strong password for admin authentication. |
-| `NEXTCLOUD_USERNAME` | Username for accessing Nextcloud. | Your Nextcloud username (e.g., `username@example.com`). |
-| `NEXTCLOUD_PASSWORD` | Password for accessing Nextcloud. | The password associated with your Nextcloud account. |
-| `IMAP1_USERNAME` | IMAP username for the first email account. | The email address associated with the first IMAP account. |
-| `IMAP1_PASSWORD` | IMAP password for the first email account. | The password for the above IMAP account. |
-| `IMAP2_USERNAME` | IMAP username for the second email account. | The email address associated with the second IMAP account. |
-| `IMAP2_PASSWORD` | IMAP password for the second email account. | The password for the second IMAP account. |
-| **IMAP Settings** | **IMAP configuration for email fetching** | **How to Obtain** |
-| `IMAP1_HOST` | Hostname of the first IMAP server. | The hostname for the first email provider (e.g., `mail.example.com` or `imap.gmail.com`). |
-| `IMAP1_PORT` | Port for the first IMAP server. | Typically `993` for SSL/TLS. |
-| `IMAP1_SSL` | Enable SSL for the first IMAP connection. | Set to `true` to enable SSL. |
-| `IMAP1_POLL_INTERVAL_MINUTES` | How often to poll the first IMAP server (in minutes). | Set to `5` or adjust as needed. |
-| `IMAP1_DELETE_AFTER_PROCESS` | Delete emails after processing in the first IMAP account. | Set to `false` to leave emails or `true` to delete after processing. |
-| `IMAP2_HOST` | Hostname of the second IMAP server. | The hostname for the second email provider (e.g., `imap.gmail.com`). |
-| `IMAP2_PORT` | Port for the second IMAP server. | Typically `993` for SSL/TLS. |
-| `IMAP2_SSL` | Enable SSL for the second IMAP connection. | Set to `true` to enable SSL. |
-| `IMAP2_POLL_INTERVAL_MINUTES` | How often to poll the second IMAP server (in minutes). | Set to `10` or adjust as needed. |
-| `IMAP2_DELETE_AFTER_PROCESS` | Delete emails after processing in the second IMAP account. | Set to `false` to leave emails or `true` to delete after processing. |
-
-
-## Setup
-
-1. **Clone the repository**:
+The project is configured via the `.env` file, where credentials and settings for different services are defined. Below is a breakdown of key configuration variables:
+
+### General Configuration
+
+| **Variable** | **Description** | **How to Obtain** |
+|-------------|----------------|-------------------|
+| `DATABASE_URL` | Path to the SQLite database. | Example: `sqlite:///./app/database.db` |
+| `REDIS_URL` | URL for Redis connection. | Example: `redis://redis:6379/0` |
+| `WORKDIR` | Working directory for the application. | Example: `/workdir` |
+| `NEXTCLOUD_UPLOAD_URL` | Nextcloud WebDAV upload URL. | Example: `https://nextcloud.example.com/remote.php/dav/files/` |
+| `NEXTCLOUD_FOLDER` | Folder in Nextcloud for file uploads. | Example: `/Documents/Uploads` |
+| `PAPERLESS_NGX_URL` | Paperless NGX API endpoint. | Example: `https://paperless.example.com/api/documents/post_document/` |
+| `PAPERLESS_HOST` | Root URL for Paperless NGX. | Example: `https://paperless.example.com` |
+
+### Tokens/API Credentials
+
+| **Variable** | **Description** | **How to Obtain** |
+|-------------|----------------|-------------------|
+| `OPENAI_API_KEY` | API key for OpenAI services. | Get from [OpenAI platform](https://platform.openai.com/account/api-keys). |
+| `PAPERLESS_NGX_API_TOKEN` | API token for Paperless NGX. | Obtain from your Paperless NGX instance. |
+| `DROPBOX_APP_KEY` | Dropbox API key. | Generate from the [Dropbox Developer Console](https://www.dropbox.com/developers/apps/create). |
+| `DROPBOX_APP_SECRET` | Dropbox API secret. | Available in the Dropbox Developer Console. |
+| `DROPBOX_REFRESH_TOKEN` | Dropbox OAuth refresh token. | Obtain by following Dropbox's OAuth flow. |
+
+### User Credentials
+
+| **Variable** | **Description** |
+|-------------|----------------|
+| `ADMIN_USERNAME` | Admin username for system access. |
+| `ADMIN_PASSWORD` | Admin password for system access. |
+| `NEXTCLOUD_USERNAME` | Username for Nextcloud authentication. |
+| `NEXTCLOUD_PASSWORD` | Password for Nextcloud authentication. |
+
+### IMAP Configuration
+
+| **Variable** | **Description** |
+|-------------|----------------|
+| `IMAP1_USERNAME` | IMAP username for the first email account. |
+| `IMAP1_PASSWORD` | IMAP password for the first email account. |
+| `IMAP1_HOST` | Hostname of the first IMAP server. |
+| `IMAP1_PORT` | IMAP server port (typically `993`). |
+| `IMAP1_SSL` | Enable SSL (`true` or `false`). |
+| `IMAP1_POLL_INTERVAL_MINUTES` | Polling interval for IMAP server. |
+| `IMAP1_DELETE_AFTER_PROCESS` | Delete emails after processing (`true` or `false`). |
+
+### Additional Services
+
+| **Variable** | **Description** |
+|-------------|----------------|
+| `GOTENBERG_URL` | URL for Gotenberg PDF processing. |
+
+## Running as a Docker Container
+
+This project includes a `docker-compose.yml` file that allows for easy deployment using Docker. The following services are defined:
+
+- **API Service**: Runs the document processing API using `uvicorn`.
+- **Worker Service**: Runs the Celery worker for handling document processing tasks.
+- **Redis**: Used as a message broker for Celery.
+- **Gotenberg**: Provides PDF processing capabilities.
+
+### Running the Application with Docker Compose
+
+1. **Ensure Docker and Docker Compose are installed**.
+2. **Clone the repository and navigate to the directory**:
    ```bash
    git clone
    cd
    ```
-
-2. **Install dependencies**:
-   ```bash
-   pip install -r requirements.txt
-   ```
-
-3. **Configure the environment**:
-   - Create a `.env` file based on the provided example and fill in the required fields (AWS, Dropbox, IMAP, etc.).
-
-4. **Run the application**:
+3. **Create and configure the `.env` file**.
+4. **Start the services**:
    ```bash
-   python app.py
+   docker-compose up -d
    ```
 
+5. The API will be available at `http://localhost:8000`.
+
+### Services in `docker-compose.yml`
+
+```yaml
+services:
+  api:
+    image: christianlouis/document-processor:latest
+    container_name: document_api
+    working_dir: /workdir
+    command: ["sh", "-c", "cd /app && uvicorn app.main:app --host 0.0.0.0 --port 8000"]
+    environment:
+      - PYTHONPATH=/app
+    env_file:
+      - .env
+    ports:
+      - "8000:8000"
+    depends_on:
+      - redis
+      - worker
+    volumes:
+      - /var/docparse/workdir:/workdir
+
+  worker:
+    image: christianlouis/document-processor:latest
+    container_name: document_worker
+    working_dir: /workdir
+    command: ["celery", "-A", "app.celery_worker", "worker", "-B", "--loglevel=info", "-Q", "document_processor,default,celery"]
+    env_file:
+      - .env
+    environment:
+      - PYTHONPATH=/app
+    depends_on:
+      - redis
+      - gotenberg
+    volumes:
+      - /var/docparse/workdir:/workdir
+
+  gotenberg:
+    image: gotenberg/gotenberg:latest
+    container_name: gotenberg
+
+  redis:
+    image: redis:alpine
+    container_name: document_redis
+    restart: always
+```
+
+## To-Do List
+
+- Refactor AWS-related code to Azure.
+- Remove unnecessary environment variables.
+- Make upload targets configurable.
+- Remove S3 upload functionality.
-## Notes
-
-- The system requires access to AWS Textract, Dropbox, Nextcloud, and Paperless NGX for full functionality.
-- Ensure you have proper permissions set up for these services (IAM roles, API tokens, etc.).
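
Because the updated README configures both containers through `env_file: .env`, it can help to check the file before running `docker-compose up -d`. The following is a minimal sketch, not part of this change: `parse_env`, `missing_vars`, and the `REQUIRED_VARS` selection are hypothetical names introduced here. The variable names come from the README tables, but which of them the application actually treats as mandatory is an assumption.

```python
"""Check that a .env file defines variables named in the README tables."""
from pathlib import Path

# Names copied from the README tables. Treating exactly these as
# mandatory is an assumption made for illustration.
REQUIRED_VARS = [
    "DATABASE_URL",
    "REDIS_URL",
    "WORKDIR",
    "OPENAI_API_KEY",
    "PAPERLESS_NGX_URL",
    "PAPERLESS_NGX_API_TOKEN",
    "GOTENBERG_URL",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and '#' comments are skipped."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_vars(env: dict[str, str], required=REQUIRED_VARS) -> list[str]:
    """Return required names that are absent or empty, in the order listed."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    env_path = Path(".env")
    env = parse_env(env_path.read_text()) if env_path.exists() else {}
    missing = missing_vars(env)
    print("missing:", ", ".join(missing) if missing else "none")
```

This parser handles only plain `KEY=VALUE` lines with optional surrounding quotes. It does not cover `export` prefixes, multi-line values, or variable interpolation, so a dedicated dotenv library may be the better choice for real use.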