Azure ocr #14

Merged
merged 2 commits into main from azure-ocr
Feb 14, 2025

Conversation

christianlouis (Owner) commented Feb 14, 2025

fixes #3

Summary by CodeRabbit

  • New Features
    • Enhanced PDF document processing now leverages Azure Document Intelligence for more accurate OCR and conversion to searchable formats.
    • New configuration options enable seamless integration with Azure services, ensuring improved performance, stability, and error handling.
    • These improvements deliver a significantly faster and more efficient document scanning and searching experience.

coderabbitai bot commented Feb 14, 2025

Walkthrough

The changes update the project configuration and document processing functionality to integrate Azure Document Intelligence. The environment file now includes Azure-specific variables, and the application's settings class is updated accordingly. The document processing logic in the process_with_textract module is refactored to replace AWS Textract operations with Azure operations using the new configuration. Additionally, a new dependency for Azure's Document Intelligence package has been added to the requirements file.

Changes

| File(s) | Change Summary |
| --- | --- |
| .env.demo | Added new configuration variables: AZURE_REGION="eastus", AZURE_ENDPOINT="https://<yourendpoint>.cognitiveservices.azure.com/", and AZURE_AI_KEY=<AZURE_AI_KEY>. |
| app/config.py | Updated the Settings class by adding new string attributes: azure_ai_key, azure_region, and azure_endpoint for Azure services configuration. |
| app/tasks/process_with_textract.py | Overhauled document processing: replaced AWS Textract with Azure Document Intelligence. Updated client initialization, error handling, docstring, and logic flow. |
| requirements.txt | Included new dependency azure-ai-documentintelligence to support Azure integration. |
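
For illustration, the three new settings could be loaded roughly as follows. This is a minimal sketch using a plain dataclass and `os.environ`; the actual `Settings` class in app/config.py is not shown in this conversation, so the `AzureSettings` name, the `from_env` helper, and the missing-variable check are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AzureSettings:
    """Hypothetical stand-in for the Azure fields added to Settings."""

    azure_ai_key: str
    azure_region: str
    azure_endpoint: str

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "AzureSettings":
        # Fail early with a clear message if any variable is missing or empty.
        required = ("AZURE_AI_KEY", "AZURE_REGION", "AZURE_ENDPOINT")
        missing = [name for name in required if not env.get(name)]
        if missing:
            raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
        return cls(
            azure_ai_key=env["AZURE_AI_KEY"],
            azure_region=env["AZURE_REGION"],
            azure_endpoint=env["AZURE_ENDPOINT"],
        )
```

Passing the environment as a mapping keeps the loader easy to exercise in tests without touching the real process environment.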

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Process as process_with_textract
    participant Azure as Azure Document Intelligence

    User->>Process: Trigger document processing (PDF file)
    Process->>Azure: Initialize client with AZURE_ENDPOINT & AZURE_AI_KEY
    Process->>Azure: Submit document for OCR processing
    Azure-->>Process: Return processed (searchable) PDF
    Process->>Process: Save processed PDF locally
    Process-->>User: Return processed document details

Poem

Oh, what a change, so fresh and bright,
A rabbit hops with pure delight.
From AWS to Azure we now flow,
Configured well with settings to show.
With keys and endpoints in a neat array,
I cheer and hop—hip-hip-hooray!
🐰✨

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
app/tasks/process_with_textract.py (3)

48-55: Handling of the OCR-processed PDF is well-structured.
Overwriting the existing file is convenient, though watch for conflicting processes trying to read or write the file simultaneously.
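
One way to reduce the risk of a reader seeing a half-written file is to write to a temporary file and rename it over the target. The sketch below assumes the OCR result is available as bytes; `write_pdf_atomically` is a hypothetical helper, not code from this PR.

```python
import os
import tempfile


def write_pdf_atomically(target_path: str, pdf_bytes: bytes) -> None:
    """Replace target_path with pdf_bytes without exposing a partial file."""
    directory = os.path.dirname(os.path.abspath(target_path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(pdf_bytes)
        # os.replace overwrites the target; the rename is atomic on POSIX.
        os.replace(tmp_path, target_path)
    except BaseException:
        # Remove the temp file if the write or the rename failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

This does not serialize concurrent writers; if two tasks can process the same file, a lock would still be needed.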


57-62: Extraction and handoff of text is straightforward.
If documents become very large, consider streaming approaches to reduce memory usage when reading content into extracted_text.
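
A streaming approach could look like the generator below. It is a sketch that assumes the extracted text has already been written to a UTF-8 file; where extracted_text actually comes from in the PR is not shown here, and `iter_text_chunks` is a hypothetical name.

```python
from typing import Iterator


def iter_text_chunks(path: str, chunk_size: int = 64 * 1024) -> Iterator[str]:
    """Yield the file's text in pieces of at most chunk_size characters."""
    with open(path, "r", encoding="utf-8") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                # An empty string means end of file.
                break
            yield chunk
```

A downstream step can then consume the chunks one at a time instead of holding the whole document in memory.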


66-66: Error logging is appropriately granular.
Catch-all exceptions can mask specific Azure exceptions. You may want to handle network errors, authentication errors, etc., separately for clarity.
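
A possible shape for this, sketched with built-in exception types as stand-ins so the example runs without the Azure SDK. With the SDK installed, one would presumably pass classes from `azure.core.exceptions` (for example `ClientAuthenticationError` and `ServiceRequestError`) instead; `run_ocr_step` and its status strings are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def run_ocr_step(
    step,
    *,
    auth_errors=(PermissionError,),
    network_errors=(ConnectionError, TimeoutError),
):
    """Call step() and return (status, result), separating known failure kinds."""
    try:
        return "ok", step()
    except auth_errors as exc:
        # Bad or expired credentials: retrying will not help.
        logger.error("Authentication failed: %s", exc)
        return "auth_error", None
    except network_errors as exc:
        # Transient connectivity problems: a caller may choose to retry.
        logger.warning("Network problem: %s", exc)
        return "network_error", None
    except Exception:
        # Anything else is unexpected; log the traceback and re-raise.
        logger.exception("Unexpected OCR failure")
        raise
```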

app/config.py (1)

27-29: New Azure settings are consistent and align with environment variables.
Consider leveraging managed secrets services rather than storing keys in environment variables when operating in production.

requirements.txt (1)

15-15: Pin or verify the version for better stability.
Having a pinned version (e.g., azure-ai-documentintelligence==x.y.z) can help avoid unexpected breaking changes.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite

📥 Commits

Reviewing files that changed from the base of the PR and between 9fc27d6 and 9ebac99.

📒 Files selected for processing (4)
  • .env.demo (2 hunks)
  • app/config.py (1 hunks)
  • app/tasks/process_with_textract.py (2 hunks)
  • requirements.txt (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • .env.demo
🔇 Additional comments (6)
app/tasks/process_with_textract.py (6)

3-5: Imports look good and necessary for Azure integration.
These additions correctly import the Azure Document Intelligence modules needed for OCR functionality.


14-18: Client initialization is appropriate.
Using environment-based credentials for the Azure endpoint key is a best practice. Ensure the endpoint includes the correct region or domain.
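
A quick sanity check on the endpoint could catch a leftover placeholder from .env.demo before the client is created. This is a sketch; `validate_azure_endpoint` is a hypothetical helper and the rules it enforces are assumptions, not requirements stated by the SDK.

```python
from urllib.parse import urlparse


def validate_azure_endpoint(endpoint: str) -> str:
    """Return the endpoint unchanged if it looks like a usable HTTPS URL."""
    # Reject the template value from .env.demo, e.g. https://<yourendpoint>...
    if "<" in endpoint or ">" in endpoint:
        raise ValueError("Endpoint still contains a placeholder such as <yourendpoint>")
    parsed = urlparse(endpoint)
    if parsed.scheme != "https" or not parsed.hostname:
        raise ValueError(f"Endpoint must be an https:// URL, got {endpoint!r}")
    return endpoint
```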


23-31: Docstring updates align well with the new Azure logic.
The step-by-step explanation is helpful for maintenance and onboarding.


36-36: More explicit error handling and logging is beneficial.
Raising a FileNotFoundError is more informative than a generic Exception, and the additional log statement at line 38 aids in debugging.

Also applies to: 38-38
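
The pattern described here can be sketched as below. `require_local_file` is a hypothetical helper; the PR's actual check and log message are not shown in this conversation.

```python
import logging
import os

logger = logging.getLogger(__name__)


def require_local_file(path: str) -> str:
    """Return path if it points to an existing file, else log and raise."""
    if not os.path.isfile(path):
        logger.error("Local file not found: %s", path)
        # A specific exception type lets callers distinguish this from OCR failures.
        raise FileNotFoundError(f"Local file not found: {path}")
    return path
```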


64-64: Return statement is clear and descriptive.
This return structure is helpful for debugging, as it encapsulates all relevant paths in a single dictionary.


40-47:

❓ Verification inconclusive

Verify the stability of the poller’s details.
If poller.details["operation_id"] becomes unavailable or changes in the future, this might break the flow. Consider adding checks or try/except around the retrieval of operation_id.


🏁 Script executed:

#!/bin/bash
# Check for usage of operation_id and confirm "operation_id" is a valid key in poller.details.
rg -A 3 'poller\.details\[\"operation_id\"\]' 

Length of output: 166


Action Required: Enhance Error Handling for API Response

In app/tasks/process_with_textract.py (lines 40–47), the code directly accesses

operation_id = poller.details["operation_id"]

which could fail if the API changes its response structure. Please add error handling (e.g., a try/except block or explicit key checks) to ensure that the key "operation_id" exists before attempting to use it. Manual verification of the API response is advised to confirm that the key remains stable over time.
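
An explicit key check might look like this. The sketch treats poller.details as a plain mapping, which is an assumption about its type, and `get_operation_id` is a hypothetical helper.

```python
from typing import Any, Mapping


def get_operation_id(details: Mapping[str, Any]) -> str:
    """Return details["operation_id"], or raise a descriptive error if absent."""
    operation_id = details.get("operation_id")
    if not operation_id:
        # Listing the keys that are present makes an API change easier to diagnose.
        raise RuntimeError(
            "Poller details did not include 'operation_id'; "
            f"available keys: {sorted(details)}"
        )
    return str(operation_id)
```

In the task, the direct lookup would then become `operation_id = get_operation_id(poller.details)`.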

@christianlouis christianlouis merged commit 9156c62 into main Feb 14, 2025
4 checks passed
@christianlouis christianlouis deleted the azure-ocr branch February 14, 2025 22:08
Development

Successfully merging this pull request may close these issues.

Analyze why only a few pages are processed by Textract, not the entire file