Non-standard annotations are not deleted with remove_annotations() #2438
Comments
Would #1831 solve your issue? |
It doesn't seem to work. I used @MrTomRod's _utils.py and _writer.py (also had to import
writer = PdfWriter()
writer.clone_document_from_reader(reader)
writer.remove_annotations(subtypes=None)
writer.remove_annotations(annotation_filter_function="\SMask")
But the highlighting is still there. I'm guessing that's because
Due to my limited understanding of the code, I'm not sure if defining my own function, like @MrTomRod's provided example, would make a difference:
def is_google_link(
    page: DictionaryObject,
    annotation: ArrayObject,
    obj: DictionaryObject
) -> bool:
    try:
        uri = obj['/A']['/URI']
        return uri.startswith('https://google.com/')
    except KeyError:
        return False
but I'm guessing probably not. I believe an option similar to |
From what I've seen, this file does not contain annotations, just some drawings. I can't think of a general solution for removing them. The proposed "/SMask" filter does not seem like a good idea either. |
I think I'm able to loop through the resources, but what would a deletion command for that look like with the current code? Wouldn't it be necessary to have an |
I managed to delete the unwanted objects like this:
from pypdf import PdfReader, PdfWriter

def remove_specific_xobjects(page, name_fragment):
    # Drop every XObject whose resource name contains name_fragment
    if '/XObject' in page['/Resources']:
        xobjects = page['/Resources']['/XObject']
        specific_xobjects = [key for key in xobjects.keys() if name_fragment in key]
        for key in specific_xobjects:
            del xobjects[key]  # Remove the identified XObjects

reader = PdfReader("input.pdf")
writer = PdfWriter()
# Iterate through each page of the PDF
for page in reader.pages:
    remove_specific_xobjects(page, '/FXX1')  # Remove XObjects containing "/FXX1" from the page's resources
    writer.add_page(page)
# Write the modified PDF to a new file
writer.write('output.pdf')
but it's of course a rough way of doing it, because it produces a damaged PDF that needs to be repaired (I used Ghostscript to rebuild it). It would be nice if something like this could be implemented properly in the writer. |
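The damage presumably comes from the page's content stream still containing `Do` operators that paint the names just deleted from `/Resources`. A minimal sketch of removing those operators as well is shown here; it is not the proposal referred to in the next comment, and `drop_xobject_draws` and `remove_xobjects` are hypothetical helper names, not pypdf functions. It assumes the page has its own `/Resources` entry (inherited resources are not handled) and that pypdf is installed for the second helper.

```python
from typing import Any, Iterable, List, Tuple

# (operands, operator) pairs, the shape used by pypdf's ContentStream.operations
Operation = Tuple[List[Any], bytes]


def drop_xobject_draws(
    operations: Iterable[Operation], names: Iterable[str]
) -> List[Operation]:
    """Return the operations without the `Do` calls that paint the given XObjects."""
    unwanted = set(names)
    return [
        (operands, operator)
        for operands, operator in operations
        if not (operator == b"Do" and operands and operands[0] in unwanted)
    ]


def remove_xobjects(page, names, pdf) -> None:
    """Hypothetical helper: stop a pypdf page from drawing `names`, then delete them.

    `pdf` is the PdfReader (or PdfWriter) the page belongs to.
    """
    # Imported lazily so drop_xobject_draws stays usable without pypdf installed.
    from pypdf.generic import ContentStream, NameObject

    contents = page.get_contents()
    if contents is not None:
        stream = ContentStream(contents, pdf)
        stream.operations = drop_xobject_draws(stream.operations, names)
        page[NameObject("/Contents")] = stream

    # Assumption: the page carries its own /Resources dictionary.
    if "/Resources" in page and "/XObject" in page["/Resources"]:
        xobjects = page["/Resources"]["/XObject"]
        for name in list(names):
            if name in xobjects:
                del xobjects[name]
```

In the loop above, this would be called as `remove_xobjects(page, ['/FXX1'], reader)` before `writer.add_page(page)`. Unlike the substring match in `remove_specific_xobjects`, the names here must match exactly.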
I agree, my proposal was not good because of the damage it causes.
This one should be good |
@pubpub-zz Looks neat! The only problem I have with this is that compression doesn't seem efficient when writing the output file. It reduces the size to 1/4 of the starting size (193 MiB to 49 MiB), but that's still a lot compared to the original compressed file with the FXX1 images (1.17 MiB) or to what Ghostscript produces after my
I also tried to add p.compress_content_streams(level=9) at the end of the outer for loop, but it didn't seem to make a difference. I realize this might have nothing to do with the original issue, so feel free to close this. Thank you. :)
Edit: if I choose the original compressed PDF as the input file, the output is just 683 KiB, so pypdf behaves correctly in that scenario. The original PDF was first uncompressed with pdftk, which generated the 193 MiB file, so of course it would be pdftk's job to recompress it, not pypdf's. Sorry for the hassle. |
@pubpub-zz Is there something we should document/implement here or do you consider this resolved? |
The annotation is not actually an annotation. It is just some painting over the text. I don't think any documentation is required.
Explanation
In this PDF, PdfWriter.remove_annotations() doesn't succeed in removing the highlighting because apparently it is stored as an image ('/Subtype': '/Image', '/Type': '/XObject'). PdfWriter.remove_images(to_delete=ImageType.XOBJECT_IMAGES) succeeds of course, but it also removes the actual images. What distinguishes the two is the /SMask attribute in the highlighting. Now, I can easily fix the problem by running a regex that removes everything that's between "obj" and "endobj" when /SMask is found and then repairing the resulting PDF:
But I can't find a way to get pypdf to remove just the objects that have /SMask. It would be nice if we could remove all objects that have a particular ImageAttributes. Or maybe make PdfWriter.remove_annotations(subtypes=None) also remove all objects that have /SMask (I have no idea if /SMask is also used for something else though).
Code Example
Possibly something like this:
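For illustration only, and not an existing pypdf option or the reporter's own snippet: a rough sketch of the selection step described in the Explanation, listing the image XObjects on each page that carry an /SMask entry. `masked_image_names` and `masked_images_per_page` are hypothetical names. The sketch assumes each page has its own /Resources dictionary and does not look inside form XObjects. Note that /SMask is the standard soft-mask entry for any image with transparency, so a match is a candidate for removal, not proof that the image is a highlight.

```python
from typing import Any, Dict, List, Mapping


def masked_image_names(xobjects: Mapping[str, Any]) -> List[str]:
    """Return the names of image XObjects that have an /SMask entry."""
    names = []
    for name in xobjects:
        # Item access on a pypdf dictionary resolves indirect references.
        obj = xobjects[name]
        is_image = "/Subtype" in obj and obj["/Subtype"] == "/Image"
        if is_image and "/SMask" in obj:
            names.append(name)
    return names


def masked_images_per_page(path: str) -> Dict[int, List[str]]:
    """Hypothetical helper: map page index to the masked image names on that page."""
    # Imported lazily so masked_image_names can be used without pypdf installed.
    from pypdf import PdfReader

    found: Dict[int, List[str]] = {}
    for index, page in enumerate(PdfReader(path).pages):
        # Assumption: resources are stored on the page itself, not inherited.
        if "/Resources" in page and "/XObject" in page["/Resources"]:
            found[index] = masked_image_names(page["/Resources"]["/XObject"])
        else:
            found[index] = []
    return found
```

The resulting names could then be checked by hand (for example against the /FXX1 prefix mentioned in the comments) before anything is deleted.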