ENH: Make remove_annots_from_page more flexible #1831

MrTomRod · 2023-05-04T10:27:21Z

Changes:

_remove_annots_from_page -> remove_annots_from_page
Replaced the subtypes: Optional[Iterable[str]] parameter by delete_decide_function: Optional[Callable] = None,

Closes #1829

py-pdf#1829) - add pytest-socket to pyproject.toml

codecov · 2023-05-04T11:33:28Z

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (ec85a27) 94.54% compared to head (6dc1c5b) 94.51%.
Report is 1 commits behind head on main.

Files	Patch %	Lines
pypdf/_writer.py	80.00%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1831      +/-   ##
==========================================
- Coverage   94.54%   94.51%   -0.04%     
==========================================
  Files          43       43              
  Lines        7549     7561      +12     
  Branches     1490     1497       +7     
==========================================
+ Hits         7137     7146       +9     
- Misses        253      255       +2     
- Partials      159      160       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

MartinThoma · 2023-06-11T09:01:05Z

Thank you for the contribution <3

Overall, I tend to add the improvement to pypdf, but I think a few decisions need to be discussed and potentially refined:

Method name: I'd rather not add a new method PdfWriter.remove_annots_from_page as we already have PdfWriter.remove_annotations. The parameter delete_decide_function: Optional[Callable] = None, could be added there.
Callback parameters: We could give the page / page_index to the delete_decide_function callback. This would remove the need for remove_annots_from_page, wouldn't it?
Documentation: The documentation regarding the delete_decide_function could be better. What is the first parameter?
Naming: delete_decide_function doesn't sound good. Maybe annotation_filter_function? In Django, what you filter for remains. Meaning a return value of True would mean that it's not deleted. Other ideas: should_remove, should_delete, should_remain, should_keep,

- rename delete_decide_function to annotation_filter_function - add page to callback for annotation_filter_function

MrTomRod · 2023-06-17T15:07:30Z

I'm glad you like it!

1 Method name

You're right, it's confusing. I changed remove_annots_from_page back to _remove_annots_from_page.

2 Callback parameters

I gave the page as callback.

I didn't remove _remove_annots_from_page for two reasons:

Removing annotations is done by processing each page separately using a loop. It makes sense to have a separate function to reduce nesting.
_remove_annots_from_page is heavily used in remove_objects_from_page. If we used remove_annotations there, this function would always loop over all pages unnecessarily.

3 Documentation

Is it better now?

4 Naming

I also didn't like my names. I settled on annotation_filter_function.

MartinThoma · 2023-08-13T07:12:01Z

@pubpub-zz What is your opinion about this PR?

tests/test_writer.py

pypdf/_utils.py

tests/test_writer.py

MartinThoma · 2023-08-20T19:17:27Z

pypdf/_writer.py

+        subtypes: Optional[
+            Union[AnnotationSubtype, Iterable[AnnotationSubtype]]
+        ] = None,
+        annotation_filter_function: Optional[Callable] = None,


Suggested change

annotation_filter_function: Optional[Callable] = None,

annotation_filter_function: Optional[Callable[[DictionaryObject, ArrayObject, DictionaryObject], bool]] = None,

see below bout the 2nd param

MartinThoma · 2023-08-20T19:18:27Z

pypdf/_writer.py

+                        annotation: ArrayObject,
+                        obj: DictionaryObject


Maybe you can add a little bit more documentation:

Why is the annotation an ArrayObject?

What does obj represent?

I don't actually properly understand these objects. I just kept the original names and annotations, see this part of the old code where annotations were filtered. I just passed all the relevant context objects to the filter function, in case they might be useful for filtering. :)

In short, I don't know what to write myself, sorry. Can you do this?

MartinThoma · 2023-08-20T19:26:13Z

pypdf/_writer.py

+        subtypes: Optional[
+            Union[AnnotationSubtype, Iterable[AnnotationSubtype]]
+        ] = None,
+        annotation_filter_function: Optional[Callable] = None,


Just an idea: We could call the parameter exclude. That (in combination with the type annotation and the docs) seems pretty clear to me. Whenever it evaluates to True, the entry is excluded. That would inverse the logic.

I'm not happy with annotation_filter_function as it's pretty lengthy + I never know if we "filter out" (remove) or "filter to keep".

We could then make the default the constant-False function (which would not remove anything). This would mean the annotation would become simpler as the "Optional" could be removed.

Then the meaning of subtypes in combination with the new parameter has to be clear. I think it would be best if the both are independent filter functions. If both are set, we can create a new function that combines both internally. That means we don't have to ignore one parameter. And we don't have to log anything.

Hm, I don't really like exclude, because it feels more natural to return True to keep something. Maybe include or filter?

We could then make the default the constant-False function (which would not remove anything). This would mean the annotation would become simpler as the "Optional" could be removed.

I'm not sure I get this. I wanted to set the default_annotation_filter_function as the annotation_filter_function by default, too, without Optional[Callable] = None. But mypy didn't let me iirc.

I think it would be best if the both are independent filter functions. If both are set, we can create a new function that combines both internally. That means we don't have to ignore one parameter. And we don't have to log anything.

Agreed.

pubpub-zz · 2023-08-23T20:04:10Z

pypdf/_writer.py

+        """
+        if annotation_filter_function is None:
+            annotation_filter_function = default_annotation_filter_function
+
        page = cast(DictionaryObject, page.get_object())
        if PG.ANNOTS in page:
            i = 0
            while i < len(cast(ArrayObject, page[PG.ANNOTS])):
                an = cast(ArrayObject, page[PG.ANNOTS])[i]


an normally contains the IndirectObject pointing onto obj. I do not see any use case where this is required. However if somebody needs it, it will be available through indirect_reference property.
For me, just the page, and the annotation (Just to be renamed in the example) which may be clearer are sufficient.

Not sure I properly understood your comment. You're basically suggesting to have a simpler filter function?

def default_annotation_filter_function(page: Any, an: Any, obj: Any) -> bool: becomes: def default_annotation_filter_function(page: Any, an: Any) -> bool:

And obj would still be there in an.indirect_reference?

I never truly understood these objects. Can you maybe write a useful documentation for annotation_filter_function as suggested here?

pubpub-zz · 2023-08-23T20:06:26Z

pypdf/_utils.py

@@ -588,3 +588,8 @@ def replace(self, new_image: Any, **kwargs: Any) -> None:
        self.name = self.name[: self.name.rfind(".")] + extension
        self.data = byte_stream
        self.image = img
+
+
+def default_annotation_filter_function(page: Any, an: Any, obj: Any) -> bool:


I do not see any advantage to have this function in _utils. It could be sufficient as a internal function next to _annotation_filter_function

pubpub-zz · 2023-08-23T20:07:05Z

pypdf/_writer.py

+        subtypes: Optional[
+            Union[AnnotationSubtype, Iterable[AnnotationSubtype]]
+        ] = None,
+        annotation_filter_function: Optional[Callable] = None,


see below bout the 2nd param

pubpub-zz · 2023-08-23T20:14:34Z

pypdf/_writer.py

+        """
+        if subtypes is not None:
+            if annotation_filter_function is not None:
+                logger_warning(


I would more likely raise an ValueError

pubpub-zz · 2023-09-03T19:21:26Z

@MrTomRod
+1 about my comments?

MrTomRod · 2023-09-04T08:46:59Z

Sorry, I've lost the plot a bit, I'm very busy atm.

What comment does "see below bout the 2nd param" refer to exactly?

I think most of the suggestions make sense. Except if

I think it would be best if the both are independent filter functions. If both are set, we can create a new function that combines both internally.

Somehow destroys backwards compatibility.

pubpub-zz · 2023-09-04T18:42:21Z

`> Sorry, I've lost the plot a bit, I'm very busy atm.

What comment does "see below bout the 2nd param" refer to exactly?

I meant an seems for me useless and we should have only page and obj as parameters in the filter.
It would be great to have those as elements in brackets behind callable to improve annotation

I think most of the suggestions make sense. Except if

I think it would be best if the both are independent filter functions. If both are set, we can create a new function that combines both internally.

from my understanding @MartinThoma means, as your new parameter is exclusive from the old one, we should have 2 functions:
legacy
def remove_annotations(self, subtypes: Optional[Union[AnnotationSubtype, Iterable[AnnotationSubtype]]])
and your new function
def remove_filtered_annotations(self, subtypes: Optional[Union[AnnotationSubtype, Iterable[AnnotationSubtype]]])

but in order to keep code simple remove_annotation could call remove_filtered_annotations with a lambda function.

Somehow destroys backwards compatibility.

backwards compatibility would be kept 😀

MartinThoma · 2023-12-24T09:30:57Z

I'm closing this PR now without merging. I have the impression that this is a rather unusual use case and that the API might be confusing.

I would prefer a numpy-style API def remove_annotations(indices) -> None and def annotations() -> Iterable[Annotation] to get the indices. That seems way clearer + we would not have to add parameters to such a function.

MartinThoma · 2023-12-24T09:31:47Z

@MrTomRod Thank you for all the work you put into this PR. Although I didn't merge it, I do value it. If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

MrTomRod · 2023-12-30T21:52:00Z

No worries, and no need to add me to the list. Thanks for your effort!

MrTomRod added 3 commits May 4, 2023 12:25

- New, more flexible and public remove_annots_from_page function (issue

e2fa715

py-pdf#1829) - add pytest-socket to pyproject.toml

fix mypy error

7f31e10

fix ruff styling error

7ac4e68

fix mypy error

ef3595a

MartinThoma changed the title ~~- New, more flexible and public remove_annots_from_page function (iss…~~ ENH: Make remove_annots_from_page more flexible Jun 11, 2023

MartinThoma mentioned this pull request Jun 11, 2023

MAINT: Add pytest-socket to pyproject.toml #1877

Merged

Merge branch 'main' into main

94d30e3

MrTomRod and others added 3 commits June 17, 2023 16:49

implement comments from py-pdf#1831:

98ecc0a

- rename delete_decide_function to annotation_filter_function - add page to callback for annotation_filter_function

rebase master

e6f4d44

Merge branch 'main' into main

498b8b1

MrTomRod and others added 2 commits June 17, 2023 17:10

fix logger

2c33c3b

Merge branch 'main' into main

ab84e93

MartinThoma added the soon PRs that are almost ready to be merged, issues that get solved pretty soon label Aug 13, 2023

MartinThoma requested a review from pubpub-zz August 13, 2023 07:11