-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Internal links fail to work after page is merged in #3290
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Thanks for the report.
I can just guess, but it seems like nobody stumbled upon this and cared about this for now. The goal should probably be to merge these properly as well as far as possible.
I appreciate any PRs which improve the current situation. |
I am fixing this in our software, outside of PyPDF now first, but I will try to make a PyPDF PR afterwards. Here is my current understanding, which will guide the implementation of the PR. Feedback welcome.
I tried reading the source code of The full complexity here is substantial, because both of the operations above are relevant, so the full implementation needs to cover a number of different cases and it needs to collect information first, then act on it later. Let's picture case, just to get a sense of what's going on. We are writing to a new PDF, called T, pages T1, T2, ... We are merging PDFs A and B into T, and new pages are created by:
Now, let's say that Ax has a link to Ay, which is further out in A. There are three relevant operations for us:
If we are going to implement this we will need to:
I think this will work and will cover all the corner cases, although it will probably be relatively complex. Again, feedback on this plan very much welcome, so I don't spend time implementing something that will not work, or will be rejected. |
I have to admit that I do not have a complete overview of all the code. It mostly is learning by doing/fixing for me as well. As long as a change improves the current situation and ideally is not too complex for review, chances are high that it gets merged. This being said, we might want to tackle this in smaller parts where possible/feasible.
As most of the time, without actually looking at the specific details of the files this is just guessing. Maybe it is just a coincidence that it keeps working out of the box due to how the object numbers are defined in the different documents. |
Thanks. I will do what I can to keep complexity down. I'm not sure it's possible to break this up much, but we could drop tracking of page identities from the first iteration. That version will not work for us, but it may be a good idea, anyway, to make overall review faster. |
Lines 1151 to 1155 in 15c42ff
Do the annotations of |
@j-t-1 I'm not sure if there is any point in changing the links there. All that's happened is that one page has been merged into another. We don't yet know what PDF this will be written to in the end. And where the links should go depends on what writer the page is eventually added to and finally written out. So my proposal (described in the long comment) is to do nothing at this stage except track necessary info. Then we resolve when the PDF is written. I have a draft PR of that, but getting side-tracked by other bugs, so I haven't been able to finish it yet. |
@larsga I reread what you wrote.
So I think my comment was just saying something you had already said! Thanks for highlighting this.
Definitely seems way forward.
|
after a quick look:
this deletion places the pointed page/object in the "garbage collector" this is why the link is broken. As said in the comment in the code we have to be carefull. If we want to keep this approach - not my recommendation - I would more likely add an argument "ignore_duplicates" with a default value to false to keep the existing behavior when calling "add_page": |
The example does not reflect what the code actually does. The real workflow is:
I'd love to skip the first step, but if I do PyPDF produces broken output. Of course, I'd love to figure out why, and report the bug, but we just recently adopted PyPDF in production, and now I spend all my time chasing different PyPDF bugs and fixing them or coming up with workarounds. So what gets reported to the project is just a subset of what I'm dealing with. I hope in time to be able to report everything and smooth out all the wrinkles, but for now I'm just sprinting all the time.
Sure, but the page pointed to gets added later, so it's perfectly possible to get this to work. PDFTools, a commercial tool, does it.
I have fixed it already, but haven't had time to finish the PR yet. I will submit as soon as I have time. See the long comment for an explanation of how I fix it.
I don't understand what you mean. I use
I don't think this is correct. The link refers to the indirect object of a page in a different PDF, so the link has to be rewritten to point to the new page in this PDF in order for the link to work. |
I'm appending one PDF to another, and internal links in the second PDF fail to work after the merge.
Environment
Code + PDF
If I run this code I can reproduce the output. The two input PDFs are attached below.
input.pdf
tst.pdf
Further thoughts
If I skip the
create_blank_page
andmerge_page
and just instead dotarget.add_page(page)
I get the same problem.To debug the issue I wrote a little link dumper and here's what it produces on the output:
The page we care about is page 6. As you can see, the
/Dest
of the link is a page object 214, but there is no such page object.I've dug into the code and
merge_page
as far as I can see just copies all annotations from the source, which means that the link/Dest
objects will be the same as they were before, but the new PDF will presumably consist of all new pages? So I don't understand how this was intended to work. If I can get some guidance on what the thinking is I can try to work on this myself.The text was updated successfully, but these errors were encountered: