Feature: Outputting an annotation and the entire sentence where the annotation is located #87
Comments
Capturing text before/after an annotation is implemented in the code as "context", but it is currently used only for strikeout annotations. My expectation was that anyone adding a comment on a specific sentence would use a highlight annotation, where the highlight covers the text you want to include with the annotation. If you did want to include context with other annotation types, you could probably modify the code to do so, but I'm not sure I'd accept such a change in this repo. It sounds pretty hard to manage: in particular, identifying sentence boundaries reliably is likely to be problematic, so this could easily produce undesired output, and if you want sentences, the next user will want paragraphs, etc. I think perhaps you should be willing to do a bit more work when annotating the document in the first place :)
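To illustrate why sentence boundaries are hard to get right, here is a minimal sketch of the naive approach, assuming the page text is already available as a plain string and the annotation is known as a character span. The function name `enclosing_sentence` and the sample strings are hypothetical; none of this is taken from pdfannots.

```python
import re

# Naive boundary rule (an assumption for this sketch): sentence-ending
# punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"[.!?]\s+")


def enclosing_sentence(text: str, start: int, end: int) -> str:
    """Return the naive 'sentence' of `text` that contains the span [start, end)."""
    left = 0
    # The last boundary that lies entirely before the span marks the left edge.
    for m in _SENTENCE_END.finditer(text, 0, start):
        left = m.end()
    # The first boundary at or after the span end marks the right edge.
    m = _SENTENCE_END.search(text, end)
    right = m.start() + 1 if m else len(text)
    return text[left:right].strip()


good = "First one. The key phrase is here. Last one."
s = good.index("key phrase")
print(enclosing_sentence(good, s, s + len("key phrase")))
# -> The key phrase is here.

bad = "We met Dr. Smith today. He was late."
s = bad.index("Smith")
print(enclosing_sentence(bad, s, s + len("Smith")))
# -> Smith today.   (the abbreviation "Dr." is mistaken for a sentence end)
```

The second call shows the kind of undesired output described above: abbreviations, initials, and decimal numbers all defeat a simple punctuation rule, and text extracted from a PDF adds line breaks and hyphenation on top of that.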
Thanks for your quick reply. I understand what you mean and could implement the code as you said. As a Ph.D. candidate, my main task involves reading and annotating literature. Your tool has been helpful in exporting my annotations in a specific format, which has significantly aided my work. However, I also use annotations to mark well-used words and phrases, and I would like to export those annotations together with their context (i.e., the whole sentence). This would help me better understand the meaning and usage of each phrase when I review my notes. I have also looked at the following pdfminer snippet:

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    # Walk the layout of every page and print the text of each text container.
    for page_layout in extract_pages("test.pdf"):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                print(element.get_text())

How can I use the
That's basically the problem at the core of pdfannots :) Most page elements have x/y coordinates, and each annotation consists of one or more bounding boxes, so the problem mostly boils down to processing the text elements and then checking for intersections between them and the annotation boxes. However, you can't just use LTTextContainers for this, as those are too large (e.g. entire boxes or lines); rather, you have to look at the characters inside them. The logic for this is in
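The intersection idea described above can be sketched roughly as follows. This is an illustrative outline only, not the pdfannots implementation: the names `boxes_intersect` and `chars_under_annotation` are hypothetical, and characters are modelled as plain `(text, box)` tuples so the example runs without a PDF. With pdfminer.six, such pairs could be built from `LTChar` objects via `get_text()` and the `bbox` attribute.

```python
from typing import Iterable, List, Tuple

# (x0, y0, x1, y1) in page coordinates, the same layout as pdfminer's bbox.
Box = Tuple[float, float, float, float]


def boxes_intersect(a: Box, b: Box) -> bool:
    """Return True if two axis-aligned boxes overlap with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def chars_under_annotation(
    chars: Iterable[Tuple[str, Box]], annot_boxes: List[Box]
) -> str:
    """Join, in input order, the characters whose box hits any annotation box."""
    return "".join(
        text
        for text, box in chars
        if any(boxes_intersect(box, ab) for ab in annot_boxes)
    )


# Hypothetical data: five characters on one line, each 10 units wide.
chars = [(c, (10.0 * i, 0.0, 10.0 * i + 10.0, 12.0)) for i, c in enumerate("hello")]
highlight = [(10.0, 0.0, 40.0, 12.0)]  # one annotation box covering "ell"
print(chars_under_annotation(chars, highlight))  # -> ell
```

A plain overlap test like this is deliberately simple. Real highlights are often drawn slightly too wide or too tall, so a practical version would probably need a stricter criterion, for example requiring the centre of each character to fall inside the annotation box.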
I will give it a try. Thank you for your kind guidance.
I have finished a preliminary implementation. Here is my repo.
Thank you so much for developing this module; it's fantastic. Is it possible to output each annotation together with the entire sentence in which it is located?
If possible, please guide me on the general principle. Thanks.