Pymupdf4llm to_markdown crashes on some documents #258

4rrw · 2025-04-23T08:14:08Z

Description of the bug

Calling to_markdown on this document crashes Python.
Output:

'example.py' terminated by signal SIGSEGV (Address boundary error)

Changing the Python version does not help.
Changing the pymupdf4llm version does not help.
Just loading the PDF with pymupdf works.

How to reproduce the bug

import pymupdf4llm

document_filepath = "documents/example-document.pdf"
pages = pymupdf4llm.to_markdown(
    document_filepath,
)

PyMuPDF version

1.25.5

Operating system

Linux

Python version

3.10


JorjMcKie · 2025-04-23T10:00:34Z

This is an upstream error (base library). The PDF is not built according to specifications:
The CropBox of all pages is not contained in the MediaBox - which is wrong. Example for page 0:

<<
  /Type /Page
  /Contents [ 16 0 R 17 0 R 18 0 R 19 0 R 26 0 R 27 0 R 49 0 R
      50 0 R ]
  /CropBox [ -21 -21 616.276 862.89 ]  # <== this is wrong: larger than MediaBox!
  /Group 61 0 R
  /MediaBox [ 0 0 595.28 841.89 ]
  /Parent 1 0 R
  /Resources <<
    /ColorSpace 5 0 R
    /ExtGState 6 0 R
    /Font 7 0 R
    /Pattern 8 0 R
    /ProcSet [ /PDF /ImageC /Text ]
    /Shading 9 0 R
    /XObject 10 0 R
  >>
  /Rotate 0
>>

However, the actually required internal correction is not done by MuPDF. The resulting incorrect value of page.rect is unexpected and leads to a storage violation when building certain internal Pixmaps.
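To make the containment problem concrete, here is a minimal sketch in plain Python using the two boxes from page 0 above. The PDF specification treats a CropBox that extends beyond the MediaBox as reduced to its intersection with the MediaBox. The helper names `contains` and `clamp_to` are hypothetical, chosen for illustration only; they are not part of MuPDF or PyMuPDF.

```python
from typing import Tuple

# A box is (x0, y0, x1, y1) in PDF user space, as written in the page object.
Box = Tuple[float, float, float, float]


def contains(outer: Box, inner: Box) -> bool:
    """Return True if `inner` lies completely inside `outer`."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


def clamp_to(outer: Box, inner: Box) -> Box:
    """Reduce `inner` to its intersection with `outer`."""
    return (
        max(outer[0], inner[0]),
        max(outer[1], inner[1]),
        min(outer[2], inner[2]),
        min(outer[3], inner[3]),
    )


# Values copied from the page 0 object shown above.
mediabox: Box = (0.0, 0.0, 595.28, 841.89)
cropbox: Box = (-21.0, -21.0, 616.276, 862.89)

print(contains(mediabox, cropbox))   # False: the CropBox sticks out on all four sides
print(clamp_to(mediabox, cropbox))   # (0.0, 0.0, 595.28, 841.89)
```

For this page the clamped CropBox is simply the MediaBox, because the CropBox exceeds it on every side.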

For debugging purposes, I am attaching the sub-pdf with page 0 here:
page0.pdf

Link to MuPDF bug item: https://bugs.ghostscript.com/show_bug.cgi?id=708497

JorjMcKie · 2025-04-23T10:27:30Z

While we are dealing with this: the PDF can of course be repaired without major problems. Please let us know if you need that here.

4rrw · 2025-04-23T15:03:56Z

Thank you for the answer.

JorjMcKie · 2025-04-24T08:19:26Z

Discussing with the MuPDF team revealed that there is no MuPDF problem here, but a bug in PyMuPDF (not PyMuPDF4LLM).
Part of the extraction logic of PyMuPDF4LLM involves making some small Pixmaps for background color checking. Due to the wrong CropBox values in this PDF, some of these Pixmaps were empty. This was not checked correctly and ultimately resulted in the storage violations you have experienced.
We are going to improve both PyMuPDF and PyMuPDF4LLM to avoid this.
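The kind of guard described here can be sketched in plain Python as follows. This is not the actual fix in either library, only an illustration of the idea: intersect the requested sample area with the page rectangle and skip the Pixmap when nothing is left. The names `intersect` and `safe_clip` are hypothetical.

```python
from typing import Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def intersect(a: Rect, b: Rect) -> Optional[Rect]:
    """Return the overlap of two rectangles, or None if it has no area."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    if x1 <= x0 or y1 <= y0:
        return None  # empty or degenerate: nothing to render
    return (x0, y0, x1, y1)


def safe_clip(page_rect: Rect, clip: Rect) -> Optional[Rect]:
    """Clip area that is safe to render, or None if rendering should be skipped."""
    return intersect(page_rect, clip)


page_rect: Rect = (0.0, 0.0, 595.28, 841.89)

inside = safe_clip(page_rect, (10.0, 10.0, 20.0, 20.0))
outside = safe_clip(page_rect, (-21.0, -21.0, -1.0, -1.0))

print(inside)   # (10.0, 10.0, 20.0, 20.0)
print(outside)  # None: the caller should not build a Pixmap for this area
```

A caller would test the result for None before requesting a Pixmap, rather than reading pixels from an empty one.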

JorjMcKie · 2025-04-24T09:52:55Z

As an immediate fix, I have changed PyMuPDF4LLM to prevent this error from happening. Therefore I am going to transfer this issue to that repository's issue list.

JorjMcKie · 2025-04-28T09:23:26Z

Fixed with v0.0.22.

JorjMcKie added the bug label Apr 24, 2025

JorjMcKie transferred this issue from pymupdf/PyMuPDF Apr 24, 2025

JorjMcKie added the fix developed label Apr 26, 2025

JorjMcKie closed this as completed Apr 28, 2025
