More flexible remove_annots_from_page function #1829

Open
MrTomRod opened this issue May 3, 2023 · 6 comments
Labels
is-feature (A feature request), workflow-annotation (Everything about annotating PDF files)

Comments

@MrTomRod
Contributor

MrTomRod commented May 3, 2023

I would like to dynamically remove certain annotations from a page but not others. I solved it like this:

from typing import Callable, Optional, Union, cast

from pypdf import PageObject, PdfWriter
from pypdf.constants import PageAttributes as PG
from pypdf.generic import ArrayObject, DictionaryObject, IndirectObject, NullObject


class MyPdfWriter(PdfWriter):
    def remove_annots_from_page(
        self,
        page: Union[IndirectObject, PageObject, DictionaryObject],
        delete_decide_function: Optional[Callable] = None,
    ) -> None:
        """
        Remove annotations by custom delete_decide_function.

        Args:
            page: The page whose /Annots array is filtered in place.
            delete_decide_function: Function that takes two arguments, the
                raw /Annots entry (usually an IndirectObject) and the resolved
                annotation DictionaryObject, and returns True if the annotation
                should be removed from the page. If None, all annotations are
                removed. For example:

                def is_google_link(an, obj: DictionaryObject) -> bool:
                    try:
                        uri = obj['/A']['/URI']
                        return uri.startswith('https://google.com/')
                    except KeyError:
                        return False
        """
        # based on https://github.com/py-pdf/pypdf/blob/3de03b75bc6c63e97dc682428eac8e4e8aa9276c/pypdf/_writer.py#L1922
        page = cast(DictionaryObject, page.get_object())
        if PG.ANNOTS not in page:
            return
        annots = cast(ArrayObject, page[PG.ANNOTS])
        i = 0
        while i < len(annots):
            an = annots[i]
            obj = cast(DictionaryObject, an.get_object())
            if delete_decide_function is None or delete_decide_function(an, obj):
                if isinstance(an, IndirectObject):
                    self._objects[an.idnum - 1] = NullObject()  # to reduce PDF size
                # deleting shifts the following entries left, so keep i unchanged
                del annots[i]
            else:
                i += 1


def is_sciwheel(an: Union[IndirectObject, DictionaryObject], obj: DictionaryObject) -> bool:
    """Return True for annotations whose action URI points to sciwheel.com."""
    try:
        uri = obj['/A']['/URI']
        return uri.startswith('https://sciwheel.com/')
    except KeyError:
        # no action or no URI: not a hyperlink we care about
        return False


def remove_pdf_links(in_pdf, out_pdf):
    pdf = MyPdfWriter(clone_from=in_pdf)

    for page in pdf.pages:
        # print first line of page
        print(page.extract_text().split('\n')[0])

        # remove sciwheel.com hyperlinks from page
        # the existing private method would remove every link instead:
        # pdf._remove_annots_from_page(page, subtypes=("/Link",))
        pdf.remove_annots_from_page(page, is_sciwheel)

    pdf.write(out_pdf)

I think my remove_annots_from_page function is superior to the existing _remove_annots_from_page, so I thought I'd share it.
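
The core of the approach is the index-based loop that deletes matching entries in place and only advances the index when an entry is kept. The following is a minimal sketch of that pattern without pypdf, assuming plain lists and dicts as stand-ins for the /Annots array and the annotation dictionaries; `remove_matching` and the sample data are hypothetical and only serve as illustration.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical stand-in for a resolved annotation dictionary.
Annotation = Dict[str, Any]


def remove_matching(
    annots: List[Annotation],
    should_delete: Optional[Callable[[Annotation], bool]] = None,
) -> int:
    """Delete matching entries from annots in place; return how many were removed."""
    removed = 0
    i = 0
    while i < len(annots):
        if should_delete is None or should_delete(annots[i]):
            # Deleting shifts later entries left, so the index stays put.
            del annots[i]
            removed += 1
        else:
            i += 1
    return removed


def is_sciwheel(annot: Annotation) -> bool:
    """Match link annotations whose URI starts with the sciwheel.com prefix."""
    try:
        return annot["/A"]["/URI"].startswith("https://sciwheel.com/")
    except KeyError:
        return False


# Made-up sample data: two sciwheel links, one other link, one highlight.
annots = [
    {"/Subtype": "/Link", "/A": {"/URI": "https://sciwheel.com/work/1"}},
    {"/Subtype": "/Link", "/A": {"/URI": "https://example.org/"}},
    {"/Subtype": "/Highlight"},
    {"/Subtype": "/Link", "/A": {"/URI": "https://sciwheel.com/work/2"}},
]
removed = remove_matching(annots, is_sciwheel)
```

Passing no predicate removes everything, which mirrors the `delete_decide_function is None` case in the method above.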

@pubpub-zz
Collaborator

@MrTomRod
I would recommend that you first open the PDF directly in a PdfWriter object using the clone_from parameter.
Once it is loaded, you will be able to remove the annotations you want. You should have a look at https://pypdf.readthedocs.io/en/stable/_modules/pypdf/_writer.html#PdfWriter.remove_links for inspiration.

@MrTomRod
Contributor Author

MrTomRod commented May 3, 2023

I don't think the existing API enables me to do what I want, i.e., to remove only certain hyperlinks, namely those that start with https://sciwheel.com/.

Thanks for the clone_from hint, it's much cleaner now. I adapted the code above.
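
Since the only thing that varies between such filters is the URL prefix, the predicate could be generated rather than hand-written. This is a rough sketch, assuming the two-argument predicate signature from the code above; `uri_prefix_predicate` is a hypothetical helper, not part of pypdf. It only relies on dictionary-style access, so it can be tried with plain dicts.

```python
from typing import Any, Callable, Mapping


def uri_prefix_predicate(prefix: str) -> Callable[[Any, Mapping[str, Any]], bool]:
    """Build a predicate matching annotations whose action URI starts with prefix."""

    def predicate(an: Any, obj: Mapping[str, Any]) -> bool:
        # 'an' is the raw array entry; it is unused here but kept so the
        # signature matches delete_decide_function.
        try:
            uri = obj["/A"]["/URI"]
        except (KeyError, TypeError):
            # No action, no URI, or an action that is not a dictionary.
            return False
        return isinstance(uri, str) and uri.startswith(prefix)

    return predicate


is_sciwheel = uri_prefix_predicate("https://sciwheel.com/")
```

The resulting function could be passed as the second argument of remove_annots_from_page in place of a hand-written predicate.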

@pubpub-zz
Collaborator

I agree that the existing functions may not be adequate for your use case, but you can copy _remove_annots_from_page() and adjust it as you wish.

@MrTomRod
Contributor Author

MrTomRod commented May 3, 2023

Yes, I already managed to do what I wanted. The point is that my solution is more flexible and would imo improve your library.

Feel free to close this issue if you don't think it's a good suggestion.

@pubpub-zz
Collaborator

If you think you can propose a solution that will improve pypdf, feel free to open a PR.

MrTomRod added a commit to MrTomRod/pypdf that referenced this issue May 4, 2023
MartinThoma added the is-feature (A feature request) label Jun 11, 2023
stefan6419846 added the workflow-annotation (Everything about annotating PDF files) label Feb 20, 2024
@stefan6419846
Collaborator

@MrTomRod Are you still interested in enhancing pypdf this way? Then feel free to open a corresponding PR. Otherwise, we will probably close this issue in the near future.

4 participants