Azure ocr #14
Walkthrough
The changes update the project configuration and document processing functionality to integrate Azure Document Intelligence. The environment file now includes Azure-specific variables, and the application's settings class is updated accordingly. The document processing logic in the
Changes
Sequence Diagram(s)
sequenceDiagram
participant User
participant Process as process_with_textract
participant Azure as Azure Document Intelligence
User->>Process: Trigger document processing (PDF file)
Process->>Azure: Initialize client with AZURE_ENDPOINT & AZURE_AI_KEY
Process->>Azure: Submit document for OCR processing
Azure-->>Process: Return processed (searchable) PDF
Process->>Process: Save processed PDF locally
Process-->>User: Return processed document details
Actionable comments posted: 0
🧹 Nitpick comments (5)
app/tasks/process_with_textract.py (3)
48-55: Handling of the OCR-processed PDF is well-structured.
Overwriting the existing file is convenient, but watch for other processes reading or writing the file at the same time.
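One common way to reduce the risk of a reader seeing a half-written PDF is to write to a temporary file and swap it into place. The sketch below is only an illustration of that idea: `write_atomically` is a hypothetical helper, not code from this PR, and it assumes the processed PDF is available as bytes. It does not coordinate multiple writers; it only keeps readers from observing a partial file.

```python
import os
import tempfile


def write_atomically(target_path: str, data: bytes) -> None:
    """Write data to target_path so readers never observe a partial file.

    Hypothetical helper for illustration; not part of the PR.
    """
    directory = os.path.dirname(os.path.abspath(target_path))
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())
        # Swap the finished file into place in a single step.
        os.replace(tmp_path, target_path)
    except BaseException:
        # Remove the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If several workers may write the same path, a lock or per-task output names would still be needed on top of this.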
57-62: Extraction and handoff of text is straightforward.
If documents become very large, consider a streaming approach to reduce memory usage when reading content into extracted_text.
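As a rough illustration of the streaming suggestion, the sketch below reads text in fixed-size chunks instead of loading it all at once. It assumes the text comes from a UTF-8 file on disk, which may not match how extracted_text is actually produced; `iter_text_chunks` and the default chunk size are hypothetical.

```python
from typing import Iterator


def iter_text_chunks(path: str, chunk_size: int = 64 * 1024) -> Iterator[str]:
    """Yield the file's text in pieces of at most chunk_size characters.

    Hypothetical helper for illustration; not part of the PR.
    """
    with open(path, "r", encoding="utf-8") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                # An empty string means end of file.
                break
            yield chunk
```

A downstream step could then consume the chunks one at a time, provided it does not need the whole text in memory.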
66-66: Error logging is appropriately granular.
Catch-all exceptions can mask specific Azure exceptions. You may want to handle network errors, authentication errors, etc., separately for clarity.
app/config.py (1)
27-29: New Azure settings are consistent and align with environment variables.
Consider using a managed secrets service rather than storing keys in environment variables when operating in production.
requirements.txt (1)
15-15: Pin or verify the version for better stability.
Having a pinned version (e.g., azure-ai-documentintelligence==x.y.z) can help avoid unexpected breaking changes.
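For the catch-all exception nitpick on app/tasks/process_with_textract.py (line 66), one possible shape is sketched below. The exception classes come from azure.core.exceptions; the local stand-ins exist only so the sketch runs without azure-core installed. `run_ocr_step` and the `analyze` callable are hypothetical names, not code from this PR.

```python
import logging

try:
    from azure.core.exceptions import (
        ClientAuthenticationError,
        HttpResponseError,
        ServiceRequestError,
    )
except ImportError:
    # Stand-ins (assumption) so the sketch runs without azure-core installed.
    class HttpResponseError(Exception):
        pass

    class ClientAuthenticationError(HttpResponseError):
        pass

    class ServiceRequestError(Exception):
        pass

logger = logging.getLogger(__name__)


def run_ocr_step(analyze):
    """Run `analyze`, a hypothetical zero-argument callable wrapping the Azure request."""
    try:
        return analyze()
    except ClientAuthenticationError:
        # Checked first because it is a subclass of HttpResponseError.
        logger.error("Azure rejected the credentials; check AZURE_AI_KEY.")
        raise
    except ServiceRequestError:
        logger.error("Could not reach Azure; check AZURE_ENDPOINT and network access.")
        raise
    except HttpResponseError as exc:
        logger.error("Azure returned an error response: %s", exc)
        raise
```

Re-raising keeps the task's failure behaviour unchanged while making the log line say which kind of failure occurred.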
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite
📒 Files selected for processing (4)
- .env.demo (2 hunks)
- app/config.py (1 hunk)
- app/tasks/process_with_textract.py (2 hunks)
- requirements.txt (1 hunk)
✅ Files skipped from review due to trivial changes (1)
- .env.demo
🔇 Additional comments (6)
app/tasks/process_with_textract.py (6)
3-5: Imports look good and necessary for Azure integration.
These additions correctly import the Azure Document Intelligence modules needed for OCR functionality.
14-18: Client initialization is appropriate.
Using environment-based credentials for the Azure endpoint key is a best practice. Ensure the endpoint includes the correct region or domain.
23-31: Docstring updates align well with the new Azure logic.
The step-by-step explanation is helpful for maintenance and onboarding.
36-36: More explicit error handling and logging is beneficial.
Raising a FileNotFoundError is more informative than a generic Exception, and the additional log statement at line 38 aids in debugging.
Also applies to: 38-38
64-64: Return statement is clear and descriptive.
This return structure is helpful for debugging, as it encapsulates all relevant paths in a single dictionary.
40-47: ❓ Verification inconclusive
Verify the stability of the poller's details.
If poller.details["operation_id"] becomes unavailable or changes in the future, this might break the flow. Consider adding checks or try/except around the retrieval of operation_id.
🏁 Script executed:
#!/bin/bash
# Check for usage of operation_id and confirm "operation_id" is a valid key in poller.details.
rg -A 3 'poller\.details\[\"operation_id\"\]'
Length of output: 166
Action Required: Enhance Error Handling for API Response
In app/tasks/process_with_textract.py (lines 40–47), the code directly accesses operation_id = poller.details["operation_id"], which could fail if the API changes its response structure. Please add error handling (e.g., a try/except block or explicit key checks) to ensure that the key "operation_id" exists before attempting to use it. Manual verification of the API response is advised to confirm that the key remains stable over time.
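A minimal sketch of the explicit key check, assuming poller.details behaves like a mapping as the existing code implies. `get_operation_id` is a hypothetical helper, and the choice of RuntimeError is an assumption rather than anything the PR prescribes.

```python
from typing import Any, Mapping


def get_operation_id(details: Mapping[str, Any]) -> str:
    """Return details["operation_id"], failing with a descriptive error if absent.

    Hypothetical helper for illustration; not part of the PR.
    """
    operation_id = details.get("operation_id")
    if not operation_id:
        # Include the keys that were present to make the failure easier to diagnose.
        available = ", ".join(sorted(details)) or "<none>"
        raise RuntimeError(
            f"Poller details contain no 'operation_id' (available keys: {available})"
        )
    return str(operation_id)
```

In the task this would be called as operation_id = get_operation_id(poller.details), so a changed response shape surfaces as a clear error instead of a bare KeyError.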
fixes #3
Summary by CodeRabbit