Single-row/column tables are skipped #255

c01o · 2025-04-19T09:02:48Z

related: #252

Problem

Single-row or single-column tables should be interpreted as single-line text, but they are simply being ignored.

Code to Reproduce

# cell 1
import reportlab
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle

def create_pdf(output_path, data = None):
    if data is None:
        data = []

    pdf = SimpleDocTemplate(output_path, pagesize=letter)

    table = Table(data)
    table.setStyle(TableStyle([('GRID', (0, 0), (-1, -1), 1, colors.black)]))

    pdf.build([table])

# cell 2
import pymupdf
import pymupdf4llm

def parse_with_pymupdf(pdf_path):
    _pdf = pymupdf.open(pdf_path)
    print(f"--pymupdf--\n{_pdf[0].get_text()}")


def parse_with_pymupdf4llm(pdf_path):
    _content = pymupdf4llm.to_markdown(pdf_path)
    print(f"--pymupdf4llm {pymupdf4llm.__version__}--\n{_content}")

# cell 3
testcases = [
    ['1x1.pdf', [["0,0/1x1"]]],
    ['1x2.pdf', [["0,0/1x2", "0,1/1x2"]]],
    ['2x1.pdf', [["0,0/2x1"], ["1,0/2x1"]]],
    ['2x2.pdf', [["0,0/2x2", "0,1/2x2"], ["1,0/2x2", "1,1/2x2"]]]
]

from pathlib import Path

testdir = Path('./test')
testdir.mkdir(exist_ok=True)

for name, data in testcases:
    pdf_path = testdir / name
    
    print(name + ':')
    create_pdf(str(pdf_path), data)
    parse_with_pymupdf(pdf_path)
    parse_with_pymupdf4llm(pdf_path)
    print('-----------------------------')

Result

For tables smaller than 2x2 (1x1, 1x2 and 2x1), pymupdf4llm drops the table content entirely, while pymupdf's get_text() still extracts it. Only the 2x2 table is converted.

1x1.pdf:
--pymupdf--
0,0/1x1

--pymupdf4llm 0.0.21--
-----


-----------------------------
1x2.pdf:
--pymupdf--
0,0/1x2
0,1/1x2

--pymupdf4llm 0.0.21--
-----


-----------------------------
2x1.pdf:
--pymupdf--
0,0/2x1
1,0/2x1

--pymupdf4llm 0.0.21--
-----


-----------------------------
2x2.pdf:
--pymupdf--
0,0/2x2
0,1/2x2
1,0/2x2
1,1/2x2

--pymupdf4llm 0.0.21--
|0,0/2x2|0,1/2x2|
|---|---|
|1,0/2x2|1,1/2x2|


-----


-----------------------------
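
The comparison above can also be checked mechanically. A minimal sketch, assuming a hypothetical helper `missing_cells` (not part of pymupdf or pymupdf4llm) that takes the table data passed to create_pdf() and the Markdown string returned by pymupdf4llm.to_markdown(), and reports which cell texts are absent. The two output strings approximate the 0.0.21 results listed above; their exact whitespace is an assumption.

```python
def missing_cells(data, markdown):
    """Return the cell texts from `data` that do not occur in `markdown`.

    Hypothetical helper for this report, not a pymupdf4llm API.
    It performs a plain substring check and ignores table structure.
    """
    return [cell for row in data for cell in row if cell not in markdown]


# Outputs approximated from the Result section (pymupdf4llm 0.0.21).
output_1x2 = "-----\n\n\n"
output_2x2 = "|0,0/2x2|0,1/2x2|\n|---|---|\n|1,0/2x2|1,1/2x2|\n\n\n-----\n\n\n"

# Both cells of the 1x2 table are missing from the Markdown output.
print(missing_cells([["0,0/1x2", "0,1/1x2"]], output_1x2))

# The 2x2 table is converted, so nothing is missing.
print(missing_cells([["0,0/2x2", "0,1/2x2"], ["1,0/2x2", "1,1/2x2"]], output_2x2))
```

In the reproduction loop, the same check could be applied to the return value of pymupdf4llm.to_markdown(pdf_path) for each test case.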

Version

0.0.21

JorjMcKie · 2025-04-26T12:23:35Z

Please provide a reproducing file!

c01o · 2025-04-26T12:31:00Z

The script above includes a create_pdf() function that generates the reproducing files. Is that not enough?

JorjMcKie · 2025-04-26T13:44:30Z

Ah, sorry, didn't understand that.

JorjMcKie · 2025-04-27T13:57:53Z

The problem here is insufficient logic for classifying vector graphics as "insignificant" (= just appearance sugar).
I hope the fix will cover this more precisely.

JorjMcKie · 2025-04-28T09:23:42Z

Fixed with v0.0.22.
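
To confirm that an environment already has the fixed release, the installed version can be compared against 0.0.22. A minimal sketch, assuming version strings are plain dotted integers as in "0.0.21"; `has_fix` is a hypothetical helper, not a pymupdf4llm function.

```python
def has_fix(version, fixed_in=(0, 0, 22)):
    """Return True if a dotted numeric version string is at or above `fixed_in`.

    Hypothetical helper; assumes every component is an integer
    (no suffixes such as "rc1").
    """
    parts = tuple(int(p) for p in version.split("."))
    return parts >= fixed_in


print(has_fix("0.0.21"))  # False: the version this issue was reported against
print(has_fix("0.0.22"))  # True: the release that contains the fix
```

With the package installed, pymupdf4llm.__version__ (already used in the reproduction script) can be passed as the argument.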

JorjMcKie added the waiting for information label Apr 26, 2025

JorjMcKie added the bug and fix developed labels and removed the waiting for information label Apr 26, 2025

JorjMcKie closed this as completed Apr 28, 2025
