Images a little larger than the page size are being ignored #251

heitorneves91 · 2025-04-15T17:46:40Z

Hello, I have run into a strange behaviour with a PDF that has a page whose only content is an image.
This image is slightly larger than the page, so it gets filtered out by the line below:

https://github.com/pymupdf/RAG/blob/main/pymupdf4llm/pymupdf4llm/helpers/pymupdf_rag.py#L884

I changed the code as shown below, and the image was then returned successfully.

--- a/pymupdf4llm/pymupdf4llm/helpers/pymupdf_rag.py
+++ b/pymupdf4llm/pymupdf4llm/helpers/pymupdf_rag.py
@@ -881,7 +881,7 @@ def to_markdown(
             for i in img_info
             if i["bbox"].width >= image_size_limit * parms.clip.width
             and i["bbox"].height >= image_size_limit * parms.clip.height
-            and i["bbox"] in parms.clip
+            and i["bbox"].intersects(parms.clip)
             and i["bbox"].width > 3
             and i["bbox"].height > 3
         ]

Wouldn't it be more correct to exclude only images that have no intersection with the page rect, instead of requiring each image to be 100% contained inside this rect?
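
To illustrate the difference between the two checks, here is a minimal sketch in plain Python. The `Box` class is a hypothetical stand-in written for this example, not PyMuPDF's own rectangle class, and the page and image coordinates are made up. It only shows that an image bleeding slightly past the page edges fails a containment test but passes an intersection test, while an image lying fully off the page fails both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical rectangle for illustration; not PyMuPDF's Rect."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Box") -> bool:
        # True only if `other` lies entirely inside this box.
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and self.x1 >= other.x1 and self.y1 >= other.y1)

    def intersects(self, other: "Box") -> bool:
        # True if the two boxes overlap in an area of non-zero size.
        return (min(self.x1, other.x1) > max(self.x0, other.x0)
                and min(self.y1, other.y1) > max(self.y0, other.y0))


page = Box(0, 0, 595, 842)         # assumed A4-sized page, in points
oversized = Box(-2, -2, 597, 844)  # image 2 pt larger on every side
off_page = Box(700, 0, 900, 200)   # image entirely outside the page

# Containment rejects the oversized image; intersection keeps it.
print(page.contains(oversized), page.intersects(oversized))
# Both checks reject an image that does not touch the page at all.
print(page.contains(off_page), page.intersects(off_page))
```

With the intersection check, the existing width and height thresholds in the list comprehension would still filter out tiny images.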

Thank you!
Heitor

JorjMcKie · 2025-04-28T09:23:57Z

Fixed with v0.0.22.

JorjMcKie added the "enhancement" (New feature or request) and "fix developed" labels Apr 26, 2025

JorjMcKie closed this as completed Apr 28, 2025
