Single-row/column tables are skipped #255

Closed
c01o opened this issue Apr 19, 2025 · 5 comments
Labels
bug (Something isn't working), fix developed

Comments

c01o commented Apr 19, 2025

related: #252

Problem

Single-row or single-column tables should be interpreted as single-line text, but they are simply being ignored.

Code to Reproduce

# cell 1
import reportlab
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle

def create_pdf(output_path, data = None):
    if data is None:
        data = []

    pdf = SimpleDocTemplate(output_path, pagesize=letter)

    table = Table(data)
    table.setStyle(TableStyle([('GRID', (0, 0), (-1, -1), 1, colors.black)]))

    pdf.build([table])

# cell 2
import pymupdf
import pymupdf4llm

def parse_with_pymupdf(pdf_path):
    _pdf = pymupdf.open(pdf_path)
    print(f"--pymupdf--\n{_pdf[0].get_text()}")


def parse_with_pymupdf4llm(pdf_path):
    _content = pymupdf4llm.to_markdown(pdf_path)
    print(f"--pymupdf4llm {pymupdf4llm.__version__}--\n{_content}")

# cell 3
testcases = [
    ['1x1.pdf', [["0,0/1x1"]]],
    ['1x2.pdf', [["0,0/1x2", "0,1/1x2"]]],
    ['2x1.pdf', [["0,0/2x1"], ["1,0/2x1"]]],
    ['2x2.pdf', [["0,0/2x2", "0,1/2x2"], ["1,0/2x2", "1,1/2x2"]]]
]

from pathlib import Path

testdir = Path('./test')
testdir.mkdir(exist_ok=True)

for name, data in testcases:
    pdf_path = testdir / name
    
    print(name + ':')
    create_pdf(str(pdf_path), data)
    parse_with_pymupdf(pdf_path)
    parse_with_pymupdf4llm(pdf_path)
    print('-----------------------------')

Result

For PDFs whose table is smaller than 2x2, pymupdf4llm emits no table content at all, although plain pymupdf extracts the cell text.

1x1.pdf:
--pymupdf--
0,0/1x1

--pymupdf4llm 0.0.21--
-----


-----------------------------
1x2.pdf:
--pymupdf--
0,0/1x2
0,1/1x2

--pymupdf4llm 0.0.21--
-----


-----------------------------
2x1.pdf:
--pymupdf--
0,0/2x1
1,0/2x1

--pymupdf4llm 0.0.21--
-----


-----------------------------
2x2.pdf:
--pymupdf--
0,0/2x2
0,1/2x2
1,0/2x2
1,1/2x2

--pymupdf4llm 0.0.21--
|0,0/2x2|0,1/2x2|
|---|---|
|1,0/2x2|1,1/2x2|


-----


-----------------------------
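
A quick way to flag this kind of loss automatically is to compare the two outputs line by line. The following is only a minimal sketch: `missing_lines` is a hypothetical helper (not part of pymupdf or pymupdf4llm), and it assumes each cell text appears verbatim in the Markdown output when it is not skipped. The sample strings are copied from the 1x2.pdf and 2x2.pdf results above.

```python
def missing_lines(plain_text: str, markdown: str) -> list[str]:
    """Return the non-empty lines of plain_text that occur nowhere in markdown.

    Hypothetical helper: a plain substring check, so it only works when the
    Markdown output keeps the cell text unchanged.
    """
    return [
        line.strip()
        for line in plain_text.splitlines()
        if line.strip() and line.strip() not in markdown
    ]


# Outputs copied from the results above (pymupdf4llm 0.0.21).
plain_1x2 = "0,0/1x2\n0,1/1x2\n"
md_1x2 = "-----\n\n\n"

plain_2x2 = "0,0/2x2\n0,1/2x2\n1,0/2x2\n1,1/2x2\n"
md_2x2 = "|0,0/2x2|0,1/2x2|\n|---|---|\n|1,0/2x2|1,1/2x2|\n\n\n-----\n"

print(missing_lines(plain_1x2, md_1x2))  # ['0,0/1x2', '0,1/1x2']
print(missing_lines(plain_2x2, md_2x2))  # []
```

In the reproduction loop, the same check could be fed with `_pdf[0].get_text()` and the return value of `pymupdf4llm.to_markdown(...)` instead of the hard-coded strings.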

Version

0.0.21

@JorjMcKie
Collaborator

Please provide a reproducing file!

c01o commented Apr 26, 2025

The script above has a create_pdf() function that generates the reproducing files. Is that not enough?

@JorjMcKie
Collaborator

Ah, sorry, didn't understand that.

JorjMcKie added the "bug" and "fix developed" labels and removed the "waiting for information" label on Apr 26, 2025
@JorjMcKie
Collaborator

The problem here is insufficient logic to classify vector graphics as "insignificant" (= just appearance sugar).
I hope this fix will cover this more precisely.

@JorjMcKie
Collaborator

Fixed with v0.0.22.
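
For scripts that depend on single-row or single-column tables being extracted, a version guard may help. This is a rough sketch, not an official API: `parse_version` and `has_single_row_fix` are hypothetical names, and the parsing assumes plain dotted-integer version strings such as "0.0.21".

```python
# Version reported in this thread as containing the fix.
FIXED_IN = (0, 0, 22)


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '0.0.21' into (0, 0, 21). Assumes dotted integers only."""
    return tuple(int(part) for part in version.split("."))


def has_single_row_fix(version: str) -> bool:
    """True if the given version is at or after the fixed release."""
    return parse_version(version) >= FIXED_IN


print(has_single_row_fix("0.0.21"))  # False
print(has_single_row_fix("0.0.22"))  # True
```

In practice the argument would be `pymupdf4llm.__version__`, as printed by the reproduction script.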
