Non-standard annotations are not deleted with remove_annotations() #2438

Closed
whitesnakeftw opened this issue Feb 3, 2024 · 9 comments
Labels
workflow-annotation Everything about annotating PDF files

Comments

@whitesnakeftw

Explanation

In this PDF, PdfWriter.remove_annotations() fails to remove the highlighting, apparently because it is stored as an image ('/Subtype': '/Image', '/Type': '/XObject').

PdfWriter.remove_images(to_delete=ImageType.XOBJECT_IMAGES) succeeds of course, but it also removes the actual images. What distinguishes the two is the /SMask attribute in the highlighting. Now, I can easily fix the problem by running a regex that removes everything that's between "obj" and "endobj" when /SMask is found and then repairing the resulting PDF:

33 0 obj 
<<
/Width 1800
/BitsPerComponent 8
/SMask 89 0 R
/Height 2542
/Subtype /Image
/Length 13726800
/Type /XObject
/ColorSpace /DeviceRGB
>>
stream
....
endstream 
endobj

But I can't find a way to get pypdf to remove just the objects that have /SMask. It would be nice if we could remove all objects that have a particular ImageAttributes key. Alternatively, PdfWriter.remove_annotations(subtypes=None) could also remove all objects that have /SMask (though I have no idea whether /SMask is also used for something else).
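To see which image XObjects on a page carry /SMask before deciding how to remove them, a small helper can walk the page resources. This is a minimal sketch and not part of pypdf: `find_smask_images` and `report` are hypothetical names, and the helper assumes only that a page behaves like a dictionary, which pypdf's PageObject does. It inspects page-level resources only, not images nested inside form XObjects.

```python
def find_smask_images(page):
    """Return the names of image XObjects on `page` that have an /SMask entry."""
    if "/Resources" not in page:
        return []
    resources = page["/Resources"]
    if "/XObject" not in resources:
        return []
    xobjects = resources["/XObject"]
    names = []
    for name in xobjects:
        obj = xobjects[name]  # item access resolves indirect references in pypdf
        is_image = "/Subtype" in obj and obj["/Subtype"] == "/Image"
        if is_image and "/SMask" in obj:
            names.append(str(name))
    return names


def report(path):
    # Requires pypdf; imported here so the helper above stays dependency-free.
    from pypdf import PdfReader

    reader = PdfReader(path)
    for index, page in enumerate(reader.pages):
        print(index, find_smask_images(page))
```

Calling `report("input.pdf")` (file name assumed) prints, per page, the resource names that match, for example the "FXX1" entries discussed later in this thread.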

Code Example

Possibly something like this:

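from pypdf import PdfWriter
from pypdf.constants import ImageAttributes

writer = PdfWriter()
# Proposed API, not existing pypdf behavior: select images by attribute.
writer.remove_images(to_delete=ImageAttributes.S_MASK)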
@MartinThoma
Member

Would #1831 solve your issue?

@MartinThoma MartinThoma added the workflow-annotation Everything about annotating PDF files label Feb 3, 2024
@whitesnakeftw
Author

whitesnakeftw commented Feb 3, 2024

> Would #1831 solve your issue?

It doesn't seem to work. I used @MrTomRod's _utils.py and _writer.py (I also had to import logger_error to make them compatible with current pypdf) and ran:

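from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()
writer.clone_document_from_reader(reader)
writer.remove_annotations(subtypes=None)
# annotation_filter_function comes from the patched _writer.py (#1831)
writer.remove_annotations(annotation_filter_function="/SMask")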

But the highlighting is still there. I'm guessing that's because annotation_filter_function is a further filter on top of subtypes, meaning we're only filtering inside /Annots, but my document has no /Annots at all.

Given my limited understanding of the code, I'm not sure whether defining my own function, like the example @MrTomRod provided, would make a difference:

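from pypdf.generic import ArrayObject, DictionaryObject


def is_google_link(
    page: DictionaryObject,
    annotation: ArrayObject,
    obj: DictionaryObject
) -> bool:
    # True when the annotation's URI action points at google.com
    try:
        uri = obj['/A']['/URI']
        return uri.startswith('https://google.com/')
    except KeyError:
        return False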

but I'm guessing probably not. I believe an option similar to annotation_filter_function should possibly be implemented inside remove_objects_from_page().
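The guess that the document has no /Annots entries can be checked directly before trying any annotation filter. The following is a minimal sketch under that assumption: `pages_with_annotations` and `report` are hypothetical helper names, not pypdf API, and the helper only relies on pages behaving like dictionaries.

```python
def pages_with_annotations(pages):
    """Return (page_index, annotation_count) for every page that has /Annots."""
    found = []
    for index, page in enumerate(pages):
        if "/Annots" in page:
            found.append((index, len(page["/Annots"])))
    return found


def report(path):
    # Requires pypdf; imported here so the helper above stays dependency-free.
    from pypdf import PdfReader

    reader = PdfReader(path)
    hits = pages_with_annotations(reader.pages)
    if not hits:
        print("No /Annots entries: remove_annotations() has nothing to act on.")
    for index, count in hits:
        print(f"page {index}: {count} annotation(s)")
```

If the helper returns an empty list, no annotation-based filter can reach the highlighting, whichever filter function is supplied.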

@pubpub-zz
Collaborator

From what I've seen, this file does not contain annotations, just some drawings. I cannot imagine a global solution to remove them. The proposed "/SMask" filter does not seem like a good idea either.
The only solution I would propose is to loop through the resources and delete the "FXX1" images.

@whitesnakeftw
Author

> From what I've seen, this file does not contain annotations, just some drawings. I cannot imagine a global solution to remove them. The proposed "/SMask" filter does not seem like a good idea either. The only solution I would propose is to loop through the resources and delete the "FXX1" images.

I think I'm able to loop through the resources, but what would a deletion command for that look like with the current code?

Wouldn't it be necessary to have an ImageType.FXX1 defined, and specific code to handle it inside clean_forms?

@whitesnakeftw
Author

whitesnakeftw commented Feb 11, 2024

I managed to delete the unwanted objects like this:

from pypdf import PdfReader, PdfWriter


def remove_specific_xobjects(page, name_fragment):
    """Drop every XObject resource whose name contains name_fragment."""
    if '/XObject' in page['/Resources']:
        xobjects = page['/Resources']['/XObject']
        # Collect matching keys first so the dictionary is not mutated while iterating
        specific_xobjects = [key for key in xobjects.keys() if name_fragment in key]
        for key in specific_xobjects:
            del xobjects[key]  # Remove the identified XObjects


reader = PdfReader("input.pdf")
writer = PdfWriter()

# Iterate through each page of the PDF
for page in reader.pages:
    remove_specific_xobjects(page, '/FXX1')  # Remove XObjects containing "/FXX1" from the page's resources
    writer.add_page(page)

# Write the modified PDF to a new file
writer.write('output.pdf')

but it's of course a rough way of doing it, because it produces a damaged PDF that needs to be repaired (I used Ghostscript to rebuild it). It would be nice if something like this could be implemented properly in the writer.
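One likely reason for the damage, stated here as an assumption rather than a confirmed diagnosis: the function deletes the resource entries but leaves the page content stream untouched, so the stream may still contain `Do` operators that name the removed XObjects. The sketch below lists such dangling references. `dangling_xobject_calls` and `check_page` are hypothetical names; the helper expects the (operands, operator) pairs that pypdf exposes as `ContentStream.operations`, and it only looks at the page-level content stream.

```python
def dangling_xobject_calls(operations, xobject_names):
    """Return names painted with the Do operator that are missing from xobject_names."""
    missing = []
    for operands, operator in operations:
        if operator == b"Do" and operands:
            name = str(operands[0])
            if name not in xobject_names:
                missing.append(name)
    return missing


def check_page(page):
    # `page` is assumed to be a pypdf PageObject.
    contents = page.get_contents()
    if contents is None:
        return []
    resources = page["/Resources"] if "/Resources" in page else {}
    xobjects = resources["/XObject"] if "/XObject" in resources else {}
    names = {str(key) for key in xobjects}
    return dangling_xobject_calls(contents.operations, names)
```

Running `check_page` on each page after `remove_specific_xobjects` would show whether the content stream still refers to "/FXX1" resources that no longer exist.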

@pubpub-zz
Collaborator

I agree, my proposal was not good because of the damage it causes.
This is a new proposal:

from PIL import Image
from pypdf import PdfWriter

writer = PdfWriter(clone_from="laradiceuncompressed.pdf")
for page in writer.pages:
    for image in page.images:
        # Swap each highlight image for a 1x1 transparent placeholder
        if "FXX" in image.name:
            image.replace(Image.new("RGBA", (1, 1)))

writer.write("output.pdf")

This one should be good.

@whitesnakeftw
Author

whitesnakeftw commented Feb 12, 2024

@pubpub-zz Looks neat! My only problem is that the output file does not seem to be compressed efficiently. The size drops to about a quarter of the input (193 MiB to 49 MiB), but that is still a lot compared to the original compressed file with the FXX1 images (1.17 MiB) or to what Ghostscript produces after my remove_specific_xobjects() function (364 KiB).

I also tried to add

p.compress_content_streams(level=9)

at the end of the outer for loop, but it didn't seem to make a difference.

I realize this might have nothing to do with the original issue, so feel free to close this. Thank you. :)

Edit: if I choose the original compressed PDF as input file, the output is just 683 KiB, so pypdf behaves correctly in that scenario. The original PDF was first uncompressed with pdftk, which generated the 193 MiB file, so of course it would be pdftk's job to recompress it again, not pypdf's. Sorry for the hassle.

@stefan6419846
Collaborator

@pubpub-zz Is there something we should document/implement here or do you consider this resolved?

@pubpub-zz
Collaborator

The annotation is not actually an annotation. It is just some painting over the text. I don't think any documentation is required.
I did not see the edit. We can close it 😀
