Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added text detection in upload_to_s3 #12

Merged
merged 1 commit into from
Feb 14, 2025

Conversation

christianlouis
Copy link
Owner

@christianlouis christianlouis commented Feb 14, 2025

Summary by CodeRabbit

  • New Features
    • Enhanced PDF processing to detect embedded text before applying OCR, with local text extraction when applicable.
    • Updated status messaging to clearly indicate when files are processed for OCR.

Copy link

coderabbitai bot commented Feb 14, 2025

Walkthrough

The changes modify the PDF file handling logic in the upload function. Specifically, the function now checks for embedded text within the PDF. If text is present, it bypasses Textract processing, instead using the fitz library to extract the text locally before passing it to the metadata extraction function (extract_metadata_with_gpt). Other aspects such as error handling for missing S3 bucket names and file existence, UUID-based filename generation, and the return message have been updated accordingly.

Changes

File Path Change Summary
app/.../upload_to_s3.py Modified upload_to_s3: now checks for embedded PDF text; uses fitz for local extraction; routes extracted text to extract_metadata_with_gpt; updated return message.
app/.../extract_metadata_with_gpt.py Added extract_metadata_with_gpt method to handle further processing of locally extracted text.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant Uploader as upload_to_s3()
    participant PDF as PDF File
    participant Fitz as fitz Library
    participant GPT as extract_metadata_with_gpt()
    participant S3 as Amazon S3

    Client->>Uploader: Call upload_to_s3(file)
    Uploader->>PDF: Read file & check for embedded text
    alt Embedded Text Present
        Uploader->>Fitz: Extract text locally
        Fitz-->>Uploader: Extracted text
        Uploader->>GPT: Process text for metadata
        GPT-->>Uploader: Metadata response
    else No Embedded Text
        Uploader->>Textract: Process file via Textract (default flow)
    end
    Uploader->>S3: Upload file using UUID filename
    S3-->>Uploader: Upload confirmation
    Uploader->>Client: Return "Uploaded to S3 for OCR"
Loading

Poem

I'm a rabbit with a code-filled heart,
Hopping through PDFs, playing my part.
When text is found, I skip the long route,
With fitz I extract, to metadata I scoot.
S3 gets my package with a bounding leap,
In this codey field, my joy runs deep!
🚀🐇 Happy coding, let's make magic leap!

✨ Finishing Touches
  • 📝 Generate Docstrings (Beta)

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
app/tasks/upload_to_s3.py (3)

7-7: Consider pinning the PyMuPDF version.
While importing fitz is appropriate for checking embedded text, please consider pinning the library version in your requirements for better reproducibility and to avoid compatibility surprises in the future.


55-59: Avoid reopening the same PDF multiple times.
Currently, the code opens and scans the PDF for text, closes it, and then reopens. You could streamline this by doing both the “has embedded text” check and text extraction in a single pass to reduce I/O and improve performance.

 def upload_to_s3(original_local_file: str):
     ...
-    pdf_doc = fitz.open(new_local_path)
-    has_text = any(page.get_text() for page in pdf_doc)
-    pdf_doc.close()
+    doc = fitz.open(new_local_path)
+    text_fragments = [page.get_text() for page in doc]
+    has_text = any(text_fragments)
+    doc.close()
 
     if has_text:
         ...

83-83: Consider unified return structure.
In the embedded-text case, the return dictionary lacks the "s3_key" field. For more consistent post-processing or logging, consider adding "s3_key" or other relevant fields even if skipping S3 upload.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite

📥 Commits

Reviewing files that changed from the base of the PR and between e0365e5 and 40fe288.

📒 Files selected for processing (1)
  • app/tasks/upload_to_s3.py (3 hunks)
🔇 Additional comments (4)
app/tasks/upload_to_s3.py (4)

11-11: Import for GPT metadata extraction is well-structured.
Your import statement is clearly defined, and the usage seems straightforward. No immediate concerns.


27-29: Docstring accurately reflects the updated behavior.
Appreciate that you documented the new logic regarding embedded text detection and Textract skipping.


60-74: Watch out for Celery message size limits with large PDFs.
When extracting all text from large PDFs, the resulting string might exceed Celery’s message size limit, leading to task failures. Consider chunking or storing the text somewhere (e.g., S3, database) and passing only a reference to the next task.


80-80: Clear comment for Textract trigger.
The explanatory comment nicely clarifies when Textract is used.

@christianlouis christianlouis merged commit 230c59f into main Feb 14, 2025
4 checks passed
@christianlouis christianlouis deleted the skip-ocr-for-pdf-with-text branch February 14, 2025 11:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant