Null encoding causes parse failure. #3295

nihohit · 2025-05-22T13:32:16Z

Environment

$ python -m platform
macOS-15.3.1-arm64-arm-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.5.0, crypt_provider=('cryptography', '44.0.0'), PIL=none

Also reproduced on Ubuntu 22.04 and in a Jupyter notebook.

Code + PDF

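import math

from pypdf import PdfReader


# Wrapper name and parameters are placeholders; the original report showed only the body.
def extract_worker_pages(bytes_stream, workers, worker):
    pdf = PdfReader(bytes_stream)
    pages = pdf.get_num_pages()

    # Split the page range evenly; `worker` is the 0-based index of this worker.
    pages_per_worker = math.ceil(pages / workers)
    start_page = pages_per_worker * worker
    end_page = min(pages_per_worker * (worker + 1), pages)

    return [
        pdf.get_page(i)
        .extract_text(extraction_mode="plain")
        .encode(encoding="utf-8", errors="replace")
        .decode("utf-8")
        for i in range(start_page, end_page)
    ]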
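
For readers following the page-splitting arithmetic in the snippet above, here is a small standalone sketch of the `start_page`/`end_page` computation. The helper name `page_range` is hypothetical and is not part of the reported code or of pypdf; it assumes `workers` is at least 1 and `worker` is a 0-based index.

```python
import math


def page_range(pages: int, workers: int, worker: int) -> range:
    """Hypothetical helper: the page indices handled by one worker."""
    pages_per_worker = math.ceil(pages / workers)
    start_page = pages_per_worker * worker
    end_page = min(pages_per_worker * (worker + 1), pages)
    # range() is empty when start_page >= end_page,
    # so surplus workers simply receive no pages.
    return range(start_page, end_page)


# 10 pages split over 3 workers
print([list(page_range(10, 3, w)) for w in range(3)])
# → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The splitting itself does not appear to be related to the failure; the exception is raised inside `extract_text` for the affected page.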

Sorry, the PDF contains proprietary information and can't be shared. The file was analyzed using pdf-online.com:

Compliance | pdfa-3u
-- | --
Result | Document validated successfully.
Details | Validating file "problematic.pdf" for conformance level pdfa-3u. The document does conform to the PDF/A-3u standard. Done.

Traceback

  File "/Users/shachar/venv/lib/python3.12/site-packages/pypdf/_page.py", line 2378, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/shachar/venv/lib/python3.12/site-packages/pypdf/_page.py", line 1859, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shachar/venv/lib/python3.12/site-packages/pypdf/_cmap.py", line 34, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shachar/venv/lib/python3.12/site-packages/pypdf/_cmap.py", line 57, in build_char_map_from_dict
    encoding, map_dict = get_encoding(ft)
                         ^^^^^^^^^^^^^^^^
  File "/Users/shachar/venv/lib/python3.12/site-packages/pypdf/_cmap.py", line 129, in get_encoding
    encoding = _parse_encoding(ft)
               ^^^^^^^^^^^^^^^^^^^
  File "/Users/shachar/venv/lib/python3.12/site-packages/pypdf/_cmap.py", line 183, in _parse_encoding
    if "/Differences" in enc:
       ^^^^^^^^^^^^^^^^^^^^^
TypeError: argument of type 'NullObject' is not iterable
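
The last frame suggests that the font's encoding entry resolved to a null object, and that the membership test `"/Differences" in enc` fails because that object supports neither `__contains__` nor iteration. The sketch below reproduces that failure mode without pypdf. The `NullObject` class here is only a stand-in for pypdf's class, and `has_differences` is a hypothetical guard, not pypdf's actual fix.

```python
class NullObject:
    """Stand-in for pypdf's NullObject: no __contains__, __iter__ or __getitem__."""


def unguarded(enc) -> bool:
    # Mirrors the failing check from the traceback.
    return "/Differences" in enc


def has_differences(enc) -> bool:
    # Hypothetical guard: only dictionary-like encodings can carry /Differences.
    return isinstance(enc, dict) and "/Differences" in enc


try:
    unguarded(NullObject())
except TypeError as exc:
    print("unguarded check failed:", type(exc).__name__)

print(has_differences(NullObject()))            # null encoding is treated as "no differences"
print(has_differences({"/Differences": [1]}))   # dictionary encoding still works
```

Whether a null encoding should fall back to a default encoding or be skipped entirely is a design decision for the library; the guard only shows where the type check would need to happen.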

stefan6419846 added the labels workflow-text-extraction (from a user's perspective, text extraction is the affected feature/workflow) and needs-pdf (the issue needs a PDF file to show the problem) on May 22, 2025.