Skip to content

Internal links fail to work after page is merged in #3290

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
larsga opened this issue May 20, 2025 · 9 comments
Open

Internal links fail to work after page is merged in #3290

larsga opened this issue May 20, 2025 · 9 comments
Labels
is-feature A feature request workflow-annotation Everything about annotating PDF files workflow-merge From a users perspective, merging is the affected feature/workflow

Comments

@larsga
Copy link
Contributor

larsga commented May 20, 2025

I'm appending one PDF to another, and internal links in the second PDF fail to work after the merge.

Environment

$ python -m platform
macOS-12.7.1-x86_64-i386-64bit-Mach-O

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.5.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=11.2.1

Code + PDF

If I run this code I can reproduce the output. The two input PDFs are attached below.

import pypdf

input_file = 'input.pdf'
merge_file = 'tst.pdf'
output_file = 'output.pdf'

src_reader = pypdf.PdfReader(input_file)
target = pypdf.PdfWriter(clone_from = input_file)

merge_reader = pypdf.PdfReader(merge_file)

for page in merge_reader.pages:
    current_page = page.create_blank_page(
        target,
        width = page.mediabox.width,
        height = page.mediabox.height
    )
    current_page.merge_page(page)
    target.add_page(current_page)

target.write(output_file)

input.pdf
tst.pdf

Further thoughts

If I skip the create_blank_page and merge_page and just instead do target.add_page(page) I get the same problem.

To debug the issue I wrote a little link dumper and here's what it produces on the output:

--- page 1 : IndirectObject(3, 0, 4552833552)
--- page 2 : IndirectObject(18, 0, 4552833552)
--- page 3 : IndirectObject(20, 0, 4552833552)
/A __Realta_FO_d2428e34 UNFOUND!!
/A __Realta_FO_d2428e34 UNFOUND!!
/A __Realta_FO_d2428e41 UNFOUND!!
/A __Realta_FO_d2428e41 UNFOUND!!
/A __Realta_FO_d2428e52 UNFOUND!!
/A __Realta_FO_d2428e52 UNFOUND!!
--- page 4 : IndirectObject(35, 0, 4552833552)
--- page 5 : IndirectObject(37, 0, 4552833552)
--- page 6 : IndirectObject(199, 0, 4552833552)
/Dest: [IndirectObject(214, 0, 4552833552), '/XYZ', 56.69292, 786.1971, 0], page: UNFOUND!!
--- page 7 : IndirectObject(220, 0, 4552833552)
--- page 8 : IndirectObject(221, 0, 4552833552)

The page we care about is page 6. As you can see, the /Dest of the link is a page object 214, but there is no such page object.

I've dug into the code and merge_page as far as I can see just copies all annotations from the source, which means that the link /Dest objects will be the same as they were before, but the new PDF will presumably consist of all new pages? So I don't understand how this was intended to work. If I can get some guidance on what the thinking is I can try to work on this myself.

@stefan6419846
Copy link
Collaborator

Thanks for the report.

So I don't understand how this was intended to work.

I can just guess, but it seems like nobody stumbled upon this and cared about this for now. The goal should probably be to merge these properly as well as far as possible.

If I can get some guidance on what the thinking is I can try to work on this myself.

I appreciate any PRs which improve the current situation.

@stefan6419846 stefan6419846 added is-feature A feature request workflow-merge From a users perspective, merging is the affected feature/workflow workflow-annotation Everything about annotating PDF files labels May 20, 2025
@larsga
Copy link
Contributor Author

larsga commented May 21, 2025

I am fixing this in our software, outside of PyPDF now first, but I will try to make a PyPDF PR afterwards.

Here is my current understanding, which will guide the implementation of the PR. Feedback welcome.

PageObject.merge: This operation just copies the links without making any changes to them.
PdfWriter.add_page: The added page is cloned, causing objects (links and pages) to change identity.

I tried reading the source code of PdfWriter.add_page and as far as I can tell nothing is done about links there. Instead, it seems the page structures are just cloned and added to the new PDF without any restructuring. It's weird, because sometimes old links keep working, and I can't tell why, so I'm worried I've overlooked something. Tips welcome.

The full complexity here is substantial, because both of the operations above are relevant, so the full implementation needs to cover a number of different cases and it needs to collect information first, then act on it later. Let's picture case, just to get a sense of what's going on.

We are writing to a new PDF, called T, pages T1, T2, ... We are merging PDFs A and B into T, and new pages are created by:

  • create new empty page X1,
  • merge into it page A1,
  • merge into it page B1,
  • then add it to T as T1 (the page gets cloned, so X1 becomes T1).

Now, let's say that Ax has a link to Ay, which is further out in A. There are three relevant operations for us:

  • Ax is merged into Xx. The link refers to Ay, which at this point has not been added to the document (and may in theory never be added), so there is no way to know where the link should point in the new document.
  • Xx is added to T as Tx. Ay still hasn't been added, still no way to resolve the link.
  • T is written to file. At this point, we can tell whether Ay was ever added, and fix the links before saving.

If we are going to implement this we will need to:

  • In each PageObject, track the identity of the pages that were merged into this page. (This is how we know that Ty had Ay merged into it, so we can tell that the link refers there. If Ay gets merged multiple times, the link will refer to one of the copies. If Ay never gets merged, the link will be broken, as it must be.)
  • In PdfWriter.add_page record the links that will need to be resolved at saving time.
  • In PdfWriter.write walk through the links that need to be resolved, and use the tracked page identities to work out if the page the link refers to has been added to the document. If it has, make the link point to the new page. (For direct references we simply patch the ArrayObject that holds the reference as the first element (before /XYZ), while for named references we call PdfWriter.add_named_destination to make the name point to the desired page.)

I think this will work and will cover all the corner cases, although it will probably be relatively complex. Again, feedback on this plan very much welcome, so I don't spend time implementing something that will not work, or will be rejected.

@stefan6419846
Copy link
Collaborator

I have to admit that I do not have a complete overview of all the code. It mostly is learning by doing/fixing for me as well. As long as a change improves the current situation and ideally is not too complex for review, chances are high that it gets merged. This being said, we might want to tackle this in smaller parts where possible/feasible.

It's weird, because sometimes old links keep working, and I can't tell why, so I'm worried I've overlooked something. Tips welcome.

As most of the time, without actually looking at the specific details of the files this is just guessing. Maybe it is just a coincidence that it keeps working out of the box due to how the object numbers are defined in the different documents.

@larsga
Copy link
Contributor Author

larsga commented May 21, 2025

Thanks. I will do what I can to keep complexity down. I'm not sure it's possible to break this up much, but we could drop tracking of page identities from the first iteration. That version will not work for us, but it may be a good idea, anyway, to make overall review faster.

@j-t-1
Copy link
Contributor

j-t-1 commented May 21, 2025

pypdf/pypdf/_page.py

Lines 1151 to 1155 in 15c42ff

for page in (self, page2):
if PG.ANNOTS in page:
annots = page[PG.ANNOTS]
if isinstance(annots, ArrayObject):
new_annots.extend(annots)

Do the annotations of page2 need to be changed? We need the destination after the merge not before?

@larsga
Copy link
Contributor Author

larsga commented May 21, 2025

@j-t-1 I'm not sure if there is any point in changing the links there. All that's happened is that one page has been merged into another. We don't yet know what PDF this will be written to in the end. And where the links should go depends on what writer the page is eventually added to and finally written out.

So my proposal (described in the long comment) is to do nothing at this stage except track necessary info. Then we resolve when the PDF is written.

I have a draft PR of that, but getting side-tracked by other bugs, so I haven't been able to finish it yet.

@j-t-1
Copy link
Contributor

j-t-1 commented May 21, 2025

@larsga I reread what you wrote.

PageObject.merge: This operation just copies the links without making any changes to them.

So I think my comment was just saying something you had already said! Thanks for highlighting this.

So my proposal (described in the long comment) is to do nothing at this stage except track necessary info. Then we resolve when the PDF is written.

Definitely seems way forward.

_merge_page should have a comment that the new annotation added to the page may not work. I will read the specification as unsure if this is an issue for Link annotations only.

@pubpub-zz
Copy link
Collaborator

after a quick look:

  • in your example you are just trying to merge pages from the same document : I'd rather use the merge() / append() to add all the pages at once and where the links will not be broken
  • the link is broken because the pointed page can not be found;
  • when you use the merge_page, the pointed page is completely rewritten and there is no way to identify that the created page is the good one : I see no option to have this fixed : you do not know what has been don in the new page.
  • When you use the add_page, instead of using the standard cloning process that should prevent duplication the existing entry is deleted:

    pypdf/pypdf/_writer.py

    Lines 475 to 482 in 15c42ff

    # Acrobat does not accept two indirect references pointing on the same
    # page; therefore in order to add multiple copies of the same
    # page, we need to create a new dictionary for the page, however the
    # objects below (including content) are not duplicated:
    try: # delete an already existing page
    del self._id_translated[id(page_org.indirect_reference.pdf)][ # type: ignore
    page_org.indirect_reference.idnum # type: ignore
    ]

this deletion places the pointed page/object in the "garbage collector" this is why the link is broken.

As said in the comment in the code we have to be carefull. If we want to keep this approach - not my recommendation - I would more likely add an argument "ignore_duplicates" with a default value to false to keep the existing behavior when calling "add_page":
in your original example, if I insert the pointed page many times, where should I jump to ?

@larsga
Copy link
Contributor Author

larsga commented May 22, 2025

in your example you are just trying to merge pages from the same document : I'd rather use the merge() / append() to add all the pages at once and where the links will not be broken

The example does not reflect what the code actually does. The real workflow is:

  • create empty page,
  • merge page from source PDF 1 into it,
  • merge page from source PDF 2 into it,
  • then add to writer.

I'd love to skip the first step, but if I do PyPDF produces broken output. Of course, I'd love to figure out why, and report the bug, but we just recently adopted PyPDF in production, and now I spend all my time chasing different PyPDF bugs and fixing them or coming up with workarounds. So what gets reported to the project is just a subset of what I'm dealing with. I hope in time to be able to report everything and smooth out all the wrinkles, but for now I'm just sprinting all the time.

the link is broken because the pointed page can not be found

Sure, but the page pointed to gets added later, so it's perfectly possible to get this to work. PDFTools, a commercial tool, does it.

when you use the merge_page, the pointed page is completely rewritten and there is no way to identify that the created page is the good one : I see no option to have this fixed : you do not know what has been don in the new page

I have fixed it already, but haven't had time to finish the PR yet. I will submit as soon as I have time. See the long comment for an explanation of how I fix it.

When you use the add_page, instead of using the standard cloning process that should prevent duplication

I don't understand what you mean. I use add_page, yes. What are you suggesting I do instead?

this deletion places the pointed page/object in the "garbage collector" this is why the link is broken.

I don't think this is correct. The link refers to the indirect object of a page in a different PDF, so the link has to be rewritten to point to the new page in this PDF in order for the link to work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
is-feature A feature request workflow-annotation Everything about annotating PDF files workflow-merge From a users perspective, merging is the affected feature/workflow
Projects
None yet
Development

No branches or pull requests

4 participants