Hello, I have been dealing with a strange behaviour where a PDF has a page whose only content is an image.
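This image is slightly larger than the page, so it gets filtered out by the containment check on the line below:
https://github.com/pymupdf/RAG/blob/main/pymupdf4llm/pymupdf4llm/helpers/pymupdf_rag.py#L884
I changed the code as shown below and the image was returned successfully.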
--- a/pymupdf4llm/pymupdf4llm/helpers/pymupdf_rag.py
+++ b/pymupdf4llm/pymupdf4llm/helpers/pymupdf_rag.py
@@ -881,7 +881,7 @@ def to_markdown(
                 for i in img_info
                 if i["bbox"].width >= image_size_limit * parms.clip.width
                 and i["bbox"].height >= image_size_limit * parms.clip.height
-                and i["bbox"] in parms.clip
+                and i["bbox"].intersects(parms.clip)
                 and i["bbox"].width > 3
                 and i["bbox"].height > 3
             ]
Wouldn't it be more correct to exclude only images that have no intersection with the page rect, instead of requiring each image to be 100% contained inside this rect?
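To illustrate the difference between the two checks, here is a minimal sketch in plain Python. The `Box` class, its methods, and the coordinates are hypothetical stand-ins chosen for illustration; they are not PyMuPDF's `Rect` implementation. They only mimic the idea of "fully contained" versus "overlaps at all".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned rectangle (not PyMuPDF's Rect)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Box") -> bool:
        # True only if `other` lies entirely inside this box.
        return (
            self.x0 <= other.x0
            and self.y0 <= other.y0
            and self.x1 >= other.x1
            and self.y1 >= other.y1
        )

    def intersects(self, other: "Box") -> bool:
        # True if the two boxes share a region of non-zero area.
        return (
            min(self.x1, other.x1) > max(self.x0, other.x0)
            and min(self.y1, other.y1) > max(self.y0, other.y0)
        )


# Assumed example values: an A4-sized page and three image boxes.
page = Box(0, 0, 595, 842)
slightly_larger = Box(-2, -2, 597, 844)  # overflows the page by 2 pt per side
inside = Box(50, 50, 300, 300)           # entirely on the page
off_page = Box(700, 0, 900, 200)         # entirely outside the page

for name, img in [
    ("slightly_larger", slightly_larger),
    ("inside", inside),
    ("off_page", off_page),
]:
    print(name, "contained:", page.contains(img), "intersects:", page.intersects(img))
```

With the containment check, the slightly larger image is dropped even though it covers the whole page. With the intersection check, it is kept, and only the image that lies entirely off the page is excluded.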
Thank you!
Heitor