Non-standard annotations are not deleted with remove_annotations() #2438

whitesnakeftw · 2024-02-03T02:55:53Z

Explanation

In this PDF, PdfWriter.remove_annotations() does not remove the highlighting, apparently because it is stored as an image ('/Subtype': '/Image', '/Type': '/XObject').

PdfWriter.remove_images(to_delete=ImageType.XOBJECT_IMAGES) does remove it, of course, but it also removes the actual images. What distinguishes the two is the /SMask attribute on the highlighting. I can work around the problem by running a regex that removes everything between "obj" and "endobj" whenever /SMask is found, and then repairing the resulting PDF:

33 0 obj 
<<
/Width 1800
/BitsPerComponent 8
/SMask 89 0 R
/Height 2542
/Subtype /Image
/Length 13726800
/Type /XObject
/ColorSpace /DeviceRGB
>>
stream
....
endstream 
endobj

But I can't find a way to get pypdf to remove only the objects that have /SMask. It would be nice if we could remove all objects that have a particular attribute from ImageAttributes. Alternatively, PdfWriter.remove_annotations(subtypes=None) could also remove all objects that have /SMask (though I have no idea whether /SMask is also used for something else).
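To make the /SMask distinction concrete, here is a minimal sketch of how the affected image XObjects could be listed. The helper name `find_smask_images` and the demo data are hypothetical. The function only assumes a page that behaves like a nested mapping (as pypdf page objects do), and it does not handle resources inherited from a parent page node.

```python
from typing import Any, List, Mapping


def find_smask_images(page: Mapping[str, Any]) -> List[str]:
    """Return the names of image XObjects on `page` that carry an /SMask entry."""
    if "/Resources" not in page:
        return []
    resources = page["/Resources"]
    if "/XObject" not in resources:
        return []
    xobjects = resources["/XObject"]
    names = []
    for name in xobjects:
        # Index by key instead of iterating over .items(): with pypdf
        # dictionaries, item access is what resolves indirect references.
        xobj = xobjects[name]
        is_image = "/Subtype" in xobj and xobj["/Subtype"] == "/Image"
        if is_image and "/SMask" in xobj:
            names.append(name)
    return names


# Stand-in for a real page, shaped like the object dump above (hypothetical data).
demo_page = {
    "/Resources": {
        "/XObject": {
            "/Im0": {"/Type": "/XObject", "/Subtype": "/Image"},
            "/FXX1": {"/Type": "/XObject", "/Subtype": "/Image", "/SMask": "89 0 R"},
        }
    }
}
print(find_smask_images(demo_page))
```

With a real file, the same function could presumably be called on each entry of `reader.pages`. Whether /SMask alone is a safe criterion is a separate question, since ordinary images with transparency may also use it.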

Code Example

Possibly something like this:

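from pypdf import PdfWriter
from pypdf.constants import ImageAttributes

writer = PdfWriter()
# Proposed behavior, not supported today: select images to delete by attribute
writer.remove_images(to_delete=ImageAttributes.S_MASK)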


MartinThoma · 2024-02-03T07:27:00Z

Would #1831 solve your issue?

whitesnakeftw · 2024-02-03T13:48:07Z

> Would #1831 solve your issue?

It doesn't seem to work. I used @MrTomRod's _utils.py and _writer.py (I also had to import logger_error to make them compatible with current pypdf) and ran:

writer = PdfWriter()
writer.clone_document_from_reader(reader)
writer.remove_annotations(subtypes=None)
writer.remove_annotations(annotation_filter_function="\SMask")

But the highlighting is still there. I'm guessing that's because annotation_filter_function is a further filter on top of subtypes, meaning we're only filtering inside /Annots, but in my document there are no /Annots at all.
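A quick way to confirm that a document really has no /Annots entries is to count them per page. The sketch below is illustrative only: `count_annotations` is a hypothetical helper, and it assumes pages that behave like mappings, which is how pypdf page objects can be accessed.

```python
from typing import Any, Iterable, Mapping


def count_annotations(pages: Iterable[Mapping[str, Any]]) -> int:
    """Count the entries of every /Annots array across the given pages."""
    total = 0
    for page in pages:
        # A page without /Annots simply has no annotations to filter.
        if "/Annots" in page:
            total += len(page["/Annots"])
    return total


# Stand-in pages (hypothetical data): one without /Annots, one with two entries.
demo_pages = [
    {"/Type": "/Page"},
    {"/Type": "/Page", "/Annots": ["link annotation", "highlight annotation"]},
]
print(count_annotations(demo_pages))
```

With a real file, this could presumably be called as `count_annotations(reader.pages)`. A result of zero would mean that no annotation filter, however it is written, has anything to act on.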

Given my limited understanding of the code, I'm not sure whether defining my own function, like the example @MrTomRod provided, would make a difference:

from pypdf.generic import ArrayObject, DictionaryObject

def is_google_link(
    page: DictionaryObject,
    annotation: ArrayObject,
    obj: DictionaryObject
) -> bool:
    try:
        uri = obj['/A']['/URI']
        return uri.startswith('https://google.com/')
    except KeyError:
        return False

but I'm guessing probably not. I believe an option similar to annotation_filter_function should possibly be implemented inside remove_objects_from_page().

pubpub-zz · 2024-02-04T08:09:54Z

From what I've seen, this file does not contain annotations but just some drawings. I cannot imagine a global solution to remove them. The proposed "/SMask" approach does not seem like a good idea either.
The only solution I would propose is to loop through the resources and delete the images "FXX1".

whitesnakeftw · 2024-02-04T13:25:42Z

> From what I've seen, this file does not contain annotations but just some drawings. I cannot imagine a global solution to remove them. The proposed "/SMask" approach does not seem like a good idea either. The only solution I would propose is to loop through the resources and delete the images "FXX1".

I think I'm able to loop through the resources, but what would a deletion command for that look like with the current code?

Wouldn't it be necessary to have an ImageType.FXX1 defined, and specific code to handle it inside clean_forms?

whitesnakeftw · 2024-02-11T14:20:39Z

I managed to delete the unwanted objects like this:

from pypdf import PdfReader, PdfWriter


def remove_specific_xobjects(page, name_fragment):
    """Delete XObject resource entries whose name contains name_fragment."""
    resources = page['/Resources']
    if '/XObject' in resources:
        xobjects = resources['/XObject']
        # Collect the keys first so the dictionary is not modified while iterating
        matching = [key for key in xobjects.keys() if name_fragment in key]
        for key in matching:
            del xobjects[key]  # Remove the identified XObjects


reader = PdfReader("input.pdf")
writer = PdfWriter()

# Iterate through each page of the PDF
for page in reader.pages:
    remove_specific_xobjects(page, '/FXX1')  # Remove XObjects whose name contains "/FXX1"
    writer.add_page(page)

# Write the modified PDF to a new file
writer.write('output.pdf')

but it's of course a rough way of doing it, because it produces a damaged PDF that needs to be repaired (I used Ghostscript to rebuild it). It would be nice if something like this could be implemented properly in the writer.
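One plausible cause of the damage, which is an assumption and not confirmed in this thread, is that the page content streams still invoke the deleted XObjects through the `Do` operator. The sketch below shows only the filtering step. It assumes the operations are available as (operands, operator) pairs with a bytes operator, which is how pypdf's `ContentStream.operations` appears to be shaped. The helper name `drop_xobject_invocations` and the demo data are hypothetical.

```python
from typing import Any, List, Sequence, Tuple

Operation = Tuple[Sequence[Any], bytes]


def drop_xobject_invocations(
    operations: Sequence[Operation], name_fragment: str
) -> List[Operation]:
    """Return the operations without `Do` calls whose XObject name matches."""
    kept: List[Operation] = []
    for operands, operator in operations:
        is_target = (
            operator == b"Do"
            and len(operands) == 1
            and name_fragment in str(operands[0])
        )
        if not is_target:
            kept.append((operands, operator))
    return kept


# Stand-in for a parsed content stream (hypothetical data).
demo_ops = [
    (["/Im0"], b"Do"),   # a regular image, kept
    ([], b"q"),
    (["/FXX1"], b"Do"),  # the painted-on highlighting, dropped
    ([], b"Q"),
]
print(drop_xobject_invocations(demo_ops, "FXX"))
```

With pypdf, the filtered list would still need to be written back to the page contents before the resource entries are deleted. That wiring is left out here because it depends on the pypdf version in use.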

pubpub-zz · 2024-02-12T18:24:41Z

I agree, my earlier proposal was not good because of the damage it causes.
Here is a new proposal:

from pypdf import PdfWriter
from PIL import Image

w = PdfWriter(clone_from="laradiceuncompressed.pdf")
for p in w.pages:
    for i in p.images:
        if "FXX" in i.name:
            i.replace(Image.new("RGBA",(1,1)))

w.write("output.pdf")

This one should be good.

whitesnakeftw · 2024-02-12T22:50:44Z

@pubpub-zz Looks neat! The only problem I have with this is that compression doesn't seem efficient when writing the output file. It reduces the size to about 1/4 of the starting size (193 MiB to 49 MiB), but that's still a lot compared to the original compressed file with the FXX1 images (1.17 MiB) or to what Ghostscript produces after my remove_specific_xobjects() function (364 KiB).

I also tried to add

p.compress_content_streams(level=9)

at the end of the outer for loop, but it didn't seem to make a difference.

I realize this might have nothing to do with the original issue, so feel free to close this. Thank you. :)

Edit: if I choose the original compressed PDF as input file, the output is just 683 KiB, so pypdf behaves correctly in that scenario. The original PDF was first uncompressed with pdftk, which generated the 193 MiB file, so of course it would be pdftk's job to recompress it again, not pypdf's. Sorry for the hassle.

stefan6419846 · 2024-02-20T19:37:09Z

@pubpub-zz Is there something we should document/implement here or do you consider this resolved?

pubpub-zz · 2024-02-21T19:59:19Z

The "annotation" is not actually an annotation. It is just some painting over the text. I don't think any documentation is required.
I did not see the edit. We can close it 😀

whitesnakeftw assigned MartinThoma Feb 3, 2024

MartinThoma added the workflow-annotation label (Everything about annotating PDF files) Feb 3, 2024

stefan6419846 unassigned MartinThoma Feb 20, 2024

pubpub-zz closed this as completed Feb 21, 2024
