Skip to content

Commit

Permalink
Merge pull request #1 from christianlouis/main
Browse files Browse the repository at this point in the history
Potential 1.1 release
  • Loading branch information
christianlouis authored Feb 12, 2025
2 parents f46224e + 7e0044e commit d81bc03
Show file tree
Hide file tree
Showing 6 changed files with 139 additions and 42 deletions.
27 changes: 0 additions & 27 deletions .env

This file was deleted.

46 changes: 39 additions & 7 deletions .env.demo
Original file line number Diff line number Diff line change
@@ -1,10 +1,42 @@
# **Config Variables**
DATABASE_URL=sqlite:///./app/database.db
REDIS_URL=redis://redis:6379/0
WORKDIR=/workdir
AWS_REGION="eu-central-1"
S3_BUCKET_NAME=<your_bucket_name>
NEXTCLOUD_UPLOAD_URL=https://nextcloud.example.com/remote.php/dav/files/<USERNAME>
NEXTCLOUD_FOLDER="<NEXTCLOUD_FOLDER_PATH>"
PAPERLESS_NGX_URL=https://paperless.example.com/api/documents/post_document/
PAPERLESS_HOST=https://paperless.example.com

# **Tokens/API Credentials**
AWS_ACCESS_KEY_ID="<AWS_ACCESS_KEY>"
AWS_SECRET_ACCESS_KEY="<AWS_SECRET_ACCESS_KEY>"
OPENAI_API_KEY="<OPENAI_API_KEY>"
PAPERLESS_NGX_API_TOKEN=<PAPERLESS_API_TOKEN>
DROPBOX_APP_KEY=<DROPBOX_APP_KEY>
DROPBOX_APP_SECRET=<DROPBOX_APP_SECRET>
DROPBOX_REFRESH_TOKEN=<DROPBOX_REFRESH_TOKEN>

# **User Credentials**
ADMIN_USERNAME=admin
ADMIN_PASSWORD=securepassword
ADMIN_PASSWORD=your_secure_password
NEXTCLOUD_USERNAME=<NEXTCLOUD_USERNAME>
NEXTCLOUD_PASSWORD=<NEXTCLOUD_PASSWORD>
IMAP1_USERNAME=<IMAP1_USERNAME>
IMAP1_PASSWORD=<IMAP1_PASSWORD>
IMAP2_USERNAME=<IMAP2_USERNAME>
IMAP2_PASSWORD=<IMAP2_PASSWORD>

AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=your_aws_region
AWS_TEXTRACT_ROLE_ARN=your_textract_role
# **IMAP Settings**
IMAP1_HOST=mail.example.com
IMAP1_PORT=993
IMAP1_SSL=true
IMAP1_POLL_INTERVAL_MINUTES=5
IMAP1_DELETE_AFTER_PROCESS=false

DATABASE_URL=sqlite:///./app/database.db
REDIS_URL=redis://redis:6379/0
IMAP2_HOST=imap.gmail.com
IMAP2_PORT=993
IMAP2_SSL=true
IMAP2_POLL_INTERVAL_MINUTES=10
IMAP2_DELETE_AFTER_PROCESS=false
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ share/python-wheels/
.installed.cfg
*.egg
MANIFEST
.env

# PyInstaller
# Usually these files are written by a python script from a template
Expand Down
94 changes: 94 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
# Document Processing System

## Overview

This project is designed to automate the process of handling, extracting, and processing documents. It integrates various services, such as AWS Textract, OpenAI, Dropbox, Nextcloud, and Paperless NGX, to extract metadata, process document contents, and store the results. The system is flexible and configurable via environment variables, allowing easy customization for different workflows.

## Features

- **Document Upload & Storage**: Upload documents to S3 and manage them with Dropbox and Nextcloud.
- **OCR Processing**: Use AWS Textract to extract text from scanned documents.
- **Metadata Extraction**: Automatically extract key information using OpenAI's API.
- **Document Management**: Store processed documents and metadata in Paperless NGX for easy retrieval.
- **IMAP Integration**: Fetch documents from multiple IMAP email accounts for processing.

## Environment Variables

The project is configured through the `.env` file, where you define credentials and settings such as:

- AWS access keys for Textract
- Dropbox and Nextcloud credentials
- Paperless NGX API token
- IMAP settings for fetching documents


# `.env` Configuration Explanation

This section provides details about each configuration option in the `.env` file. It also explains how to retrieve the necessary credentials and information for each setting.

| **Variable** | **Description** | **How to Obtain** |
|---------------------------------------|-----------------|-------------------|
| `DATABASE_URL` | Path to the SQLite database. | This is typically set to a local path where your database file will be stored. Example: `sqlite:///./app/database.db` |
| `REDIS_URL` | URL for connecting to Redis. | Set the URL for your Redis instance (e.g., `redis://localhost:6379/0`). Install Redis locally or use a cloud service like Redis Labs. |
| `WORKDIR` | Working directory for the application. | Choose a directory where your application will read and write files, such as `/workdir`. |
| `AWS_REGION` | AWS region for services like Textract and S3. | Set this to your AWS region (e.g., `eu-central-1`). You can find your region in the AWS Management Console. |
| `S3_BUCKET_NAME` | Name of your S3 bucket. | Create an S3 bucket in your AWS account and use its name (e.g., `my-bucket-name`). |
| `NEXTCLOUD_UPLOAD_URL` | URL for uploading files to Nextcloud. | Format the URL for Nextcloud's WebDAV endpoint. Example: `https://nextcloud.example.com/remote.php/dav/files/<USERNAME>` |
| `NEXTCLOUD_FOLDER` | Folder in Nextcloud where files are uploaded. | Choose the folder path where documents will be uploaded (e.g., `/Documents/Uploads`). |
| `PAPERLESS_NGX_URL` | URL for Paperless NGX API endpoint for document uploads. | Obtain this from your Paperless NGX instance (e.g., `https://paperless.example.com/api/documents/post_document/`). |
| `PAPERLESS_HOST` | Host URL for Paperless NGX. | Set this to the root URL of your Paperless NGX instance (e.g., `https://paperless.example.com`). |
| **Tokens/API Credentials** | **Tokens and keys for third-party services** | **How to Obtain** |
| `AWS_ACCESS_KEY_ID` | AWS Access Key ID for authentication with AWS services. | Create a new access key from the AWS IAM console under **Access Management****Users**. |
| `AWS_SECRET_ACCESS_KEY` | AWS Secret Access Key for AWS authentication. | This is generated alongside the access key ID in the AWS IAM console. Keep this secure. |
| `OPENAI_API_KEY` | API key for accessing OpenAI services. | Get this from the [OpenAI platform](https://platform.openai.com/account/api-keys). |
| `PAPERLESS_NGX_API_TOKEN` | API token for Paperless NGX. | Obtain this from your Paperless NGX instance or generate a new token from its settings. |
| `DROPBOX_APP_KEY` | Dropbox App Key for Dropbox API. | Generate this in the [Dropbox Developer Console](https://www.dropbox.com/developers/apps/create). |
| `DROPBOX_APP_SECRET` | Dropbox App Secret for Dropbox API. | This can be found in the Dropbox Developer Console after creating your app. |
| `DROPBOX_REFRESH_TOKEN` | Dropbox Refresh Token for accessing Dropbox. | You can get this by going through the OAuth process using Dropbox's CLI or API. Check the [OAuth guide](https://www.dropbox.com/developers/reference/oauth-guide). |
| **User Credentials** | **Authentication for various services** | **How to Obtain** |
| `ADMIN_USERNAME` | Username for admin access to the application. | This is typically chosen by you and set for managing your application. |
| `ADMIN_PASSWORD` | Password for admin access. | Set this to a strong password for admin authentication. |
| `NEXTCLOUD_USERNAME` | Username for accessing Nextcloud. | Your Nextcloud username (e.g., `username@example.com`). |
| `NEXTCLOUD_PASSWORD` | Password for accessing Nextcloud. | The password associated with your Nextcloud account. |
| `IMAP1_USERNAME` | IMAP username for the first email account. | The email address associated with the first IMAP account. |
| `IMAP1_PASSWORD` | IMAP password for the first email account. | The password for the above IMAP account. |
| `IMAP2_USERNAME` | IMAP username for the second email account. | The email address associated with the second IMAP account. |
| `IMAP2_PASSWORD` | IMAP password for the second email account. | The password for the second IMAP account. |
| **IMAP Settings** | **IMAP configuration for email fetching** | **How to Obtain** |
| `IMAP1_HOST` | Hostname of the first IMAP server. | The hostname for the first email provider (e.g., `mail.example.com` or `imap.gmail.com`). |
| `IMAP1_PORT` | Port for the first IMAP server. | Typically `993` for SSL/TLS. |
| `IMAP1_SSL` | Enable SSL for the first IMAP connection. | Set to `true` to enable SSL. |
| `IMAP1_POLL_INTERVAL_MINUTES` | How often to poll the first IMAP server (in minutes). | Set to `5` or adjust as needed. |
| `IMAP1_DELETE_AFTER_PROCESS` | Delete emails after processing in the first IMAP account. | Set to `false` to leave emails or `true` to delete after processing. |
| `IMAP2_HOST` | Hostname of the second IMAP server. | The hostname for the second email provider (e.g., `imap.gmail.com`). |
| `IMAP2_PORT` | Port for the second IMAP server. | Typically `993` for SSL/TLS. |
| `IMAP2_SSL` | Enable SSL for the second IMAP connection. | Set to `true` to enable SSL. |
| `IMAP2_POLL_INTERVAL_MINUTES` | How often to poll the second IMAP server (in minutes). | Set to `10` or adjust as needed. |
| `IMAP2_DELETE_AFTER_PROCESS` | Delete emails after processing in the second IMAP account. | Set to `false` to leave emails or `true` to delete after processing. |


## Setup

1. **Clone the repository**:
```bash
git clone <repository_url>
cd <repository_name>
```

2. **Install dependencies**:
```bash
pip install -r requirements.txt
```

3. **Configure the environment**:
- Create a `.env` file based on the provided example and fill in the required fields (AWS, Dropbox, IMAP, etc.).

4. **Run the application**:
```bash
python app.py
```

## Notes

- The system requires access to AWS Textract, Dropbox, Nextcloud, and Paperless NGX for full functionality.
- Ensure you have proper permissions set up for these services (IAM roles, API tokens, etc.).
3 changes: 0 additions & 3 deletions app/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,22 +9,19 @@ class Settings(BaseSettings):
aws_access_key_id: str
aws_secret_access_key: str
aws_region: str
aws_textract_role_arn: str
database_url: str
redis_url: str
s3_bucket_name: str
openai_api_key: str
workdir: str
dropbox_app_key: str
dropbox_app_secret: str
dropbox_token_file_path: str
dropbox_folder: str
dropbox_refresh_token: str
nextcloud_upload_url: str
nextcloud_username: str
nextcloud_password: str
nextcloud_folder: str
paperless_ngx_url: str
paperless_ngx_api_token: str
paperless_host: str

Expand Down
10 changes: 5 additions & 5 deletions docker-compose.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ services:
container_name: document_api

# We'll keep the code in /app, but set working_dir to the shared data directory
working_dir: /var/docparse/working
working_dir: /workdir

# We'll run uvicorn from the container's /app code
command: ["sh", "-c", "cd /app && uvicorn app.main:app --host 0.0.0.0 --port 8000"]
Expand All @@ -28,16 +28,16 @@ services:
volumes:
# optional: mount your code if you want local dev changes to reflect
# - ./app:/app
- /var/docparse/working:/var/docparse/working
- /var/docparse/workdir:/workdir

worker:
image: christianlouis/document-processor:latest
container_name: document_worker

# same shared working directory
working_dir: /var/docparse/working
working_dir: /workdir

command: ["celery", "-A", "app.celery_worker", "worker", "--loglevel=info", "-Q", "document_processor,default,celery"]
command: ["celery", "-A", "app.celery_worker", "worker", "-B", "--loglevel=info", "-Q", "document_processor,default,celery"]
env_file:
- .env
environment:
Expand All @@ -50,7 +50,7 @@ services:
volumes:
# optional: mount your code if you want local dev changes
# - ./app:/app
- /var/docparse/working:/var/docparse/working
- /var/docparse/workdir:/workdir

redis:
image: redis:alpine
Expand Down

0 comments on commit d81bc03

Please sign in to comment.