Images a little larger than the page size are being ignored #251

Closed
heitorneves91 opened this issue Apr 15, 2025 · 1 comment
Labels
enhancement (New feature or request), fix developed

Comments

heitorneves91 commented Apr 15, 2025

Hello, I have run into some strange behaviour with a PDF that has a page whose only content is an image.
This image is a little bigger than the page, so it gets filtered out on the line below:

https://github.com/pymupdf/RAG/blob/main/pymupdf4llm/pymupdf4llm/helpers/pymupdf_rag.py#L884

I changed the code as shown below and the image returned successfully.

--- a/pymupdf4llm/pymupdf4llm/helpers/pymupdf_rag.py
+++ b/pymupdf4llm/pymupdf4llm/helpers/pymupdf_rag.py
@@ -881,7 +881,7 @@ def to_markdown(
             for i in img_info
             if i["bbox"].width >= image_size_limit * parms.clip.width
             and i["bbox"].height >= image_size_limit * parms.clip.height
-            and i["bbox"] in parms.clip
+            and i["bbox"].intersects(parms.clip)
             and i["bbox"].width > 3
             and i["bbox"].height > 3
         ]

Wouldn't it be more correct to exclude only images that have no intersection with the page rect, instead of requiring each image to be 100% contained inside this rect?
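
To illustrate the difference between the two tests, here is a minimal sketch. It uses a hypothetical `Box` class as a stand-in for a rectangle type (it is not PyMuPDF's `Rect`), and the page and image coordinates are made up for the example:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical rectangle (x0, y0, x1, y1); a stand-in, not PyMuPDF's Rect."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Box") -> bool:
        # True only if `other` lies entirely inside this box.
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and other.x1 <= self.x1 and other.y1 <= self.y1)

    def intersects(self, other: "Box") -> bool:
        # True if the two boxes share a region of non-zero area.
        return (max(self.x0, other.x0) < min(self.x1, other.x1)
                and max(self.y0, other.y0) < min(self.y1, other.y1))


page = Box(0, 0, 595, 842)          # assumed A4-sized clip, in points
oversized = Box(-5, -5, 600, 847)   # image slightly larger than the page
off_page = Box(700, 0, 900, 200)    # image entirely outside the page

contained_oversized = page.contains(oversized)    # containment test rejects it
intersects_oversized = page.intersects(oversized)  # intersection test keeps it
intersects_off_page = page.intersects(off_page)    # still filtered out
```

Under these assumptions, the containment test drops the slightly oversized image, while the intersection test keeps it and still rejects an image that lies completely off the page.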

Thank you!
Heitor

JorjMcKie added the "enhancement" and "fix developed" labels on Apr 26, 2025
@JorjMcKie
Collaborator

Fixed with v0.0.22.
