Pymupdf4llm to_markdown crashes on some documents #258

Closed
4rrw opened this issue Apr 23, 2025 · 6 comments
Labels
bug, fix developed

Comments

@4rrw

4rrw commented Apr 23, 2025

Description of the bug

Calling to_markdown on this document crashes Python.
Output:

'example.py' terminated by signal SIGSEGV (Address boundary error)

Changing the Python version does not help.
Changing the pymupdf4llm version does not help.
Simply loading the PDF with pymupdf works.

How to reproduce the bug

import pymupdf4llm

document_filepath = "documents/example-document.pdf"
pages = pymupdf4llm.to_markdown(
    document_filepath,
)

PyMuPDF version

1.25.5

Operating system

Linux

Python version

3.10

@JorjMcKie
Collaborator

JorjMcKie commented Apr 23, 2025

This is an upstream error (base library). The PDF is not built according to specifications:
The CropBox of all pages is not contained in the MediaBox - which is wrong. Example for page 0:

<<
  /Type /Page
  /Contents [ 16 0 R 17 0 R 18 0 R 19 0 R 26 0 R 27 0 R 49 0 R
      50 0 R ]
  /CropBox [ -21 -21 616.276 862.89 ]  # <== this is wrong: larger than MediaBox!
  /Group 61 0 R
  /MediaBox [ 0 0 595.28 841.89 ]
  /Parent 1 0 R
  /Resources <<
    /ColorSpace 5 0 R
    /ExtGState 6 0 R
    /Font 7 0 R
    /Pattern 8 0 R
    /ProcSet [ /PDF /ImageC /Text ]
    /Shading 9 0 R
    /XObject 10 0 R
  >>
  /Rotate 0
>>

However, MuPDF does not perform the internal correction that this requires. The resulting incorrect value of page.rect is unexpected and leads to a storage violation (the segmentation fault reported above) when building certain internal Pixmaps.
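The containment rule described here can be checked with plain arithmetic. The following is a minimal sketch, not MuPDF's or PyMuPDF's own logic: the helper name `box_contains` is hypothetical, and the two boxes are copied by hand from the page 0 object quoted above.

```python
from typing import Tuple

# A PDF box as (x0, y0, x1, y1), in PDF units, lower-left origin.
Box = Tuple[float, float, float, float]


def box_contains(outer: Box, inner: Box) -> bool:
    """Return True if `inner` lies completely inside `outer` (hypothetical helper)."""
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


# Values copied from the page 0 object shown above.
media_box: Box = (0, 0, 595.28, 841.89)
crop_box: Box = (-21, -21, 616.276, 862.89)

cropbox_ok = box_contains(media_box, crop_box)
print(cropbox_ok)  # False: the CropBox extends beyond the MediaBox on all four sides
```

In a real script the two boxes would be read from the page dictionary rather than typed in, for example through PyMuPDF's `Document.xref_get_key`; that step is left out here so the sketch runs without the problematic file.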

For debugging purposes, I am attaching the sub-pdf with page 0 here:
page0.pdf

Link to MuPDF bug item: https://bugs.ghostscript.com/show_bug.cgi?id=708497

@JorjMcKie
Collaborator

While we are dealing with this: the PDF can of course be repaired without major problems. Please let us know if you need that here.
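One plausible repair, sketched here as an assumption rather than as the procedure the maintainers had in mind, is to clamp each page's CropBox to its MediaBox. The helper `clamp_box` is hypothetical and works on plain tuples; the values are again those of page 0.

```python
from typing import Tuple

# A PDF box as (x0, y0, x1, y1), in PDF units.
Box = Tuple[float, float, float, float]


def clamp_box(crop: Box, media: Box) -> Box:
    """Intersect `crop` with `media`; fall back to `media` if they do not overlap.

    Hypothetical helper for illustration, not part of PyMuPDF.
    """
    x0 = max(crop[0], media[0])
    y0 = max(crop[1], media[1])
    x1 = min(crop[2], media[2])
    y1 = min(crop[3], media[3])
    if x1 <= x0 or y1 <= y0:
        # No usable overlap: the MediaBox is the safest visible area.
        return media
    return (x0, y0, x1, y1)


media_box: Box = (0, 0, 595.28, 841.89)
crop_box: Box = (-21, -21, 616.276, 862.89)

repaired = clamp_box(crop_box, media_box)
print(repaired)  # (0, 0, 595.28, 841.89)
```

Writing the repaired box back into the file is a separate step. With PyMuPDF this might be done through `Page.set_cropbox` or `Document.xref_set_key`, but neither has been tried against this particular document here.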

@4rrw
Author

4rrw commented Apr 23, 2025

Thank you for the answer.

JorjMcKie added the bug label Apr 24, 2025
@JorjMcKie
Collaborator

Discussion with the MuPDF team revealed that there is no MuPDF problem here, but a bug in PyMuPDF (not PyMuPDF4LLM).
Part of the extraction logic of PyMuPDF4LLM involves making some small Pixmaps for background color checking. Due to the wrong CropBox values in this PDF, some of these Pixmaps were empty. This was not checked correctly and ultimately resulted in the storage violations you experienced.
We are going to improve both PyMuPDF and PyMuPDF4LLM to avoid this.
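The kind of guard described here can be illustrated as follows. This is only a sketch of the general pattern, not the actual fix: `FakePixmap` is a hypothetical stand-in that borrows the attribute names `width`, `height`, `n` and `samples` from PyMuPDF's Pixmap so the example runs without the library, and `first_pixel` is an invented helper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FakePixmap:
    """Hypothetical stand-in using the attribute names of a PyMuPDF Pixmap."""
    width: int
    height: int
    n: int          # bytes per pixel
    samples: bytes  # raw pixel data, expected length width * height * n


def first_pixel(pix) -> Optional[Tuple[int, ...]]:
    """Return the first pixel's components, or None if the pixmap is empty."""
    # Check for emptiness before touching the sample buffer.
    if pix.width < 1 or pix.height < 1 or len(pix.samples) < pix.n:
        return None
    return tuple(pix.samples[: pix.n])


empty = FakePixmap(width=0, height=0, n=3, samples=b"")
white = FakePixmap(width=1, height=1, n=3, samples=bytes([255, 255, 255]))

print(first_pixel(empty))  # None
print(first_pixel(white))  # (255, 255, 255)
```

A caller that receives None can skip the background color check for that region instead of reading from an empty buffer.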

@JorjMcKie
Collaborator

As an immediate fix, I have changed PyMuPDF4LLM to prevent this error from happening. Therefore I am going to transfer this issue to that repository's issue list.

JorjMcKie transferred this issue from pymupdf/PyMuPDF Apr 24, 2025
@JorjMcKie
Collaborator

Fixed with v0.0.22.
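To find out whether an environment already has the fixed release, the installed version can be compared against 0.0.22. This is a rough sketch using only the standard library. The helper names are hypothetical, and the parsing assumes simple dotted version strings, keeping only the leading digits of each part.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

FIXED_IN = (0, 0, 22)  # release named in this thread


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '0.0.22' into (0, 0, 22); non-numeric suffixes are ignored."""
    parts = []
    for part in text.split(".")[:3]:
        match = re.match(r"\d+", part)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def has_fix(installed: str) -> bool:
    """Return True if `installed` is at least the fixed release."""
    return parse_version(installed) >= FIXED_IN


def installed_has_fix() -> Optional[bool]:
    """Check the installed pymupdf4llm; None if it is not installed."""
    try:
        return has_fix(version("pymupdf4llm"))
    except PackageNotFoundError:
        return None


print(has_fix("0.0.21"))  # False
print(has_fix("0.0.22"))  # True
```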
