Null encoding causes parse failure. #3295

Open
nihohit opened this issue May 22, 2025 · 0 comments
Labels
needs-pdf: The issue needs a PDF file to show the problem
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow

Comments

nihohit commented May 22, 2025

Environment

$ python -m platform
macOS-15.3.1-arm64-arm-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.5.0, crypt_provider=('cryptography', '44.0.0'), PIL=none

Also reproduced on Ubuntu 22.04 and in a Jupyter notebook.

Code + PDF

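import math
from io import BytesIO

from pypdf import PdfReader


# The wrapper name and signature are illustrative; the original snippet was only the function body.
def extract_worker_pages(bytes_stream: BytesIO, worker: int, workers: int) -> list[str]:
    pdf = PdfReader(bytes_stream)
    pages = pdf.get_num_pages()

    # Each worker handles one contiguous slice of the pages.
    pages_per_worker = math.ceil(pages / workers)
    start_page = pages_per_worker * worker
    end_page = min(pages_per_worker * (worker + 1), pages)

    return [
        pdf.get_page(i)
        .extract_text(extraction_mode="plain")
        .encode(encoding="utf-8", errors="replace")
        .decode("utf-8")
        for i in range(start_page, end_page)
    ]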
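To rule out the page slicing as a cause, here is a small standalone sketch of the same arithmetic. The helper name `page_range` is hypothetical and not part of the reported code; it only mirrors the `start_page` / `end_page` computation to show that the slices are contiguous, non-overlapping, and never exceed the page count.

```python
import math


def page_range(pages: int, workers: int, worker: int) -> range:
    """Hypothetical helper: same slicing arithmetic as the snippet above."""
    pages_per_worker = math.ceil(pages / workers)
    start_page = pages_per_worker * worker
    end_page = min(pages_per_worker * (worker + 1), pages)
    # When start_page >= end_page the range is simply empty.
    return range(start_page, end_page)


# 10 pages over 4 workers: three pages each, the last worker gets the remainder.
slices = [list(page_range(10, 4, w)) for w in range(4)]

# 5 pages over 4 workers: the last worker gets an empty slice, not an invalid index.
small = [list(page_range(5, 4, w)) for w in range(4)]
```

Every page index is requested exactly once and always stays below `pages`, so the failure appears to come from the content of a page rather than from an out-of-range `get_page` call.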

Sorry, the PDF contains proprietary information and can't be shared. The file was analyzed using pdf-online.com:

Compliance | pdfa-3u
-- | --
Result | Document validated successfully.
Details | Validating file "problematic.pdf" for conformance level pdfa-3u. The document does conform to the PDF/A-3u standard. Done.

Traceback

  File "/Users/shachar/venv/lib/python3.12/site-packages/pypdf/_page.py", line 2378, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/shachar/venv/lib/python3.12/site-packages/pypdf/_page.py", line 1859, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shachar/venv/lib/python3.12/site-packages/pypdf/_cmap.py", line 34, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shachar/venv/lib/python3.12/site-packages/pypdf/_cmap.py", line 57, in build_char_map_from_dict
    encoding, map_dict = get_encoding(ft)
                         ^^^^^^^^^^^^^^^^
  File "/Users/shachar/venv/lib/python3.12/site-packages/pypdf/_cmap.py", line 129, in get_encoding
    encoding = _parse_encoding(ft)
               ^^^^^^^^^^^^^^^^^^^
  File "/Users/shachar/venv/lib/python3.12/site-packages/pypdf/_cmap.py", line 183, in _parse_encoding
    if "/Differences" in enc:
       ^^^^^^^^^^^^^^^^^^^^^
TypeError: argument of type 'NullObject' is not iterable
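
Since the PDF cannot be shared, the following sketch tries to illustrate the failure mode without pypdf installed. It assumes the font's `/Encoding` entry resolves to a null object, which is what the traceback and the issue title suggest. The `NullObject` class here is a stand-in for `pypdf.generic.NullObject`, and `has_differences` is a hypothetical guard, not pypdf's actual code or fix.

```python
class NullObject:
    """Stand-in for pypdf.generic.NullObject (assumption: no container protocol)."""

    def __repr__(self) -> str:
        return "NullObject"


def naive_check(enc: object) -> bool:
    # Mirrors the failing line from the traceback: `if "/Differences" in enc:`
    return "/Differences" in enc


def has_differences(enc: object) -> bool:
    # Hypothetical guard: only run the membership test on dict-like encodings.
    return isinstance(enc, dict) and "/Differences" in enc


# Assumed font dictionary layout: /Encoding is present but null.
font = {"/Subtype": "/Type1", "/Encoding": NullObject()}
enc = font["/Encoding"]

try:
    naive_check(enc)
    naive_failed = False
except TypeError:
    # Same exception type as in the traceback above.
    naive_failed = True

guarded_result = has_differences(enc)
with_differences = has_differences({"/Differences": [32, "/space"]})
```

With the guard, a null encoding is treated the same as an encoding dictionary without `/Differences`. Whether that is the right fallback for pypdf is a question for the maintainers.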

stefan6419846 added the workflow-text-extraction and needs-pdf labels on May 22, 2025