Pymupdf4llm to_markdown crashes on some documents #258
Comments
This is an upstream error (base library). The PDF is not built according to specifications:
<<
  /Type /Page
  /Contents [ 16 0 R 17 0 R 18 0 R 19 0 R 26 0 R 27 0 R 49 0 R 50 0 R ]
  /CropBox [ -21 -21 616.276 862.89 ] # <== this is wrong: larger than MediaBox!
  /Group 61 0 R
  /MediaBox [ 0 0 595.28 841.89 ]
  /Parent 1 0 R
  /Resources <<
    /ColorSpace 5 0 R
    /ExtGState 6 0 R
    /Font 7 0 R
    /Pattern 8 0 R
    /ProcSet [ /PDF /ImageC /Text ]
    /Shading 9 0 R
    /XObject 10 0 R
  >>
  /Rotate 0
>>
However, the internal correction that is actually required is not done by MuPDF. The resulting incorrect value of
For debugging purposes, I am attaching the sub-PDF with page 0 here:
Link to MuPDF bug item: https://bugs.ghostscript.com/show_bug.cgi?id=708497
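To make the mismatch in the page object above concrete, here is a minimal sketch in plain Python (it does not use PyMuPDF). It checks whether the CropBox lies inside the MediaBox and clips it if not. The `Box` type and the helpers `contains` and `clip_to` are hypothetical names for illustration only; this shows the kind of correction meant, not what MuPDF or PyMuPDF actually do internally.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """A PDF rectangle given as [x0 y0 x1 y1] (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def contains(outer: Box, inner: Box) -> bool:
    """Return True if `inner` lies completely inside `outer`."""
    return (
        outer.x0 <= inner.x0
        and outer.y0 <= inner.y0
        and inner.x1 <= outer.x1
        and inner.y1 <= outer.y1
    )


def clip_to(outer: Box, inner: Box) -> Box:
    """Return the intersection of `inner` with `outer`.

    Assumes the two boxes overlap, which is the case for the page above.
    """
    return Box(
        max(inner.x0, outer.x0),
        max(inner.y0, outer.y0),
        min(inner.x1, outer.x1),
        min(inner.y1, outer.y1),
    )


# Values copied from the page object quoted above.
media_box = Box(0, 0, 595.28, 841.89)
crop_box = Box(-21, -21, 616.276, 862.89)

crop_is_valid = contains(media_box, crop_box)
effective_crop = clip_to(media_box, crop_box)

print("CropBox inside MediaBox:", crop_is_valid)
print("Clipped CropBox:", tuple(effective_crop))
```

For this page the CropBox extends beyond the MediaBox on all four sides, so the clipped result is simply the MediaBox itself.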
While we are dealing with this: the PDF can of course be repaired without major problems. Please let us know if you need that here.
Thank you for the answer.
Discussing with the MuPDF team revealed that there is no MuPDF problem here, but a bug in PyMuPDF (not PyMuPDF4LLM).
As an immediate fix, I have changed PyMuPDF4LLM to prevent this error from happening. Therefore I am going to transfer this issue to that repository's issue list.
Fixed with v0.0.22.
Description of the bug
Calling to_markdown on this document crashes Python. Output:
Changing the Python version does not help.
Changing the pymupdf4llm version does not help.
Just loading the PDF with pymupdf does work.
How to reproduce the bug
PyMuPDF version
1.25.5
Operating system
Linux
Python version
3.10