-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added text detection in upload_to_s3 #12
Conversation
WalkthroughThe changes modify the PDF file handling logic in the upload function. Specifically, the function now checks for embedded text within the PDF. If text is present, it bypasses Textract processing, instead using the Changes
Sequence Diagram(s)sequenceDiagram
participant Client as Client
participant Uploader as upload_to_s3()
participant PDF as PDF File
participant Fitz as fitz Library
participant GPT as extract_metadata_with_gpt()
participant S3 as Amazon S3
Client->>Uploader: Call upload_to_s3(file)
Uploader->>PDF: Read file & check for embedded text
alt Embedded Text Present
Uploader->>Fitz: Extract text locally
Fitz-->>Uploader: Extracted text
Uploader->>GPT: Process text for metadata
GPT-->>Uploader: Metadata response
else No Embedded Text
Uploader->>Textract: Process file via Textract (default flow)
end
Uploader->>S3: Upload file using UUID filename
S3-->>Uploader: Upload confirmation
Uploader->>Client: Return "Uploaded to S3 for OCR"
Poem
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
app/tasks/upload_to_s3.py (3)
7-7
: Consider pinning the PyMuPDF version.
While importingfitz
is appropriate for checking embedded text, please consider pinning the library version in your requirements for better reproducibility and to avoid compatibility surprises in the future.
55-59
: Avoid reopening the same PDF multiple times.
Currently, the code opens and scans the PDF for text, closes it, and then reopens. You could streamline this by doing both the “has embedded text” check and text extraction in a single pass to reduce I/O and improve performance.def upload_to_s3(original_local_file: str): ... - pdf_doc = fitz.open(new_local_path) - has_text = any(page.get_text() for page in pdf_doc) - pdf_doc.close() + doc = fitz.open(new_local_path) + text_fragments = [page.get_text() for page in doc] + has_text = any(text_fragments) + doc.close() if has_text: ...
83-83
: Consider unified return structure.
In the embedded-text case, the return dictionary lacks the"s3_key"
field. For more consistent post-processing or logging, consider adding"s3_key"
or other relevant fields even if skipping S3 upload.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite
📒 Files selected for processing (1)
app/tasks/upload_to_s3.py
(3 hunks)
🔇 Additional comments (4)
app/tasks/upload_to_s3.py (4)
11-11
: Import for GPT metadata extraction is well-structured.
Your import statement is clearly defined, and the usage seems straightforward. No immediate concerns.
27-29
: Docstring accurately reflects the updated behavior.
Appreciate that you documented the new logic regarding embedded text detection and Textract skipping.
60-74
: Watch out for Celery message size limits with large PDFs.
When extracting all text from large PDFs, the resulting string might exceed Celery’s message size limit, leading to task failures. Consider chunking or storing the text somewhere (e.g., S3, database) and passing only a reference to the next task.
80-80
: Clear comment for Textract trigger.
The explanatory comment nicely clarifies when Textract is used.
Summary by CodeRabbit