Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Make remove_annots_from_page more flexible #1831

Closed
wants to merge 16 commits into from

Conversation

MrTomRod
Copy link
Contributor

@MrTomRod MrTomRod commented May 4, 2023

Changes:

  • _remove_annots_from_page -> remove_annots_from_page
  • Replaced the subtypes: Optional[Iterable[str]] parameter by delete_decide_function: Optional[Callable] = None,

Closes #1829

@codecov
Copy link

codecov bot commented May 4, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (ec85a27) 94.54% compared to head (6dc1c5b) 94.51%.
Report is 1 commits behind head on main.

Files Patch % Lines
pypdf/_writer.py 80.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1831      +/-   ##
==========================================
- Coverage   94.54%   94.51%   -0.04%     
==========================================
  Files          43       43              
  Lines        7549     7561      +12     
  Branches     1490     1497       +7     
==========================================
+ Hits         7137     7146       +9     
- Misses        253      255       +2     
- Partials      159      160       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@MartinThoma MartinThoma changed the title - New, more flexible and public remove_annots_from_page function (iss… ENH: Make remove_annots_from_page more flexible Jun 11, 2023
@MartinThoma
Copy link
Member

Thank you for the contribution <3

Overall, I tend to add the improvement to pypdf, but I think a few decisions need to be discussed and potentially refined:

  1. Method name: I'd rather not add a new method PdfWriter.remove_annots_from_page as we already have PdfWriter.remove_annotations. The parameter delete_decide_function: Optional[Callable] = None, could be added there.
  2. Callback parameters: We could give the page / page_index to the delete_decide_function callback. This would remove the need for remove_annots_from_page, wouldn't it?
  3. Documentation: The documentation regarding the delete_decide_function could be better. What is the first parameter?
  4. Naming: delete_decide_function doesn't sound good. Maybe annotation_filter_function? In Django, what you filter for remains. Meaning a return value of True would mean that it's not deleted. Other ideas: should_remove, should_delete, should_remain, should_keep,

MrTomRod and others added 3 commits June 17, 2023 16:49
- rename delete_decide_function to annotation_filter_function
- add page to callback for annotation_filter_function
@MrTomRod
Copy link
Contributor Author

MrTomRod commented Jun 17, 2023

I'm glad you like it!

1 Method name

You're right, it's confusing. I changed remove_annots_from_page back to _remove_annots_from_page.

2 Callback parameters

I gave the page as callback.

I didn't remove _remove_annots_from_page for two reasons:

  • Removing annotations is done by processing each page separately using a loop. It makes sense to have a separate function to reduce nesting.
  • _remove_annots_from_page is heavily used in remove_objects_from_page. If we used remove_annotations there, this function would always loop over all pages unnecessarily.

3 Documentation

Is it better now?

4 Naming

I also didn't like my names. I settled on annotation_filter_function.

@MartinThoma MartinThoma added the soon PRs that are almost ready to be merged, issues that get solved pretty soon label Aug 13, 2023
@MartinThoma MartinThoma requested a review from pubpub-zz August 13, 2023 07:11
@MartinThoma
Copy link
Member

@pubpub-zz What is your opinion about this PR?

tests/test_writer.py Outdated Show resolved Hide resolved
pypdf/_utils.py Outdated Show resolved Hide resolved
tests/test_writer.py Outdated Show resolved Hide resolved
tests/test_writer.py Outdated Show resolved Hide resolved
tests/test_writer.py Outdated Show resolved Hide resolved
tests/test_writer.py Outdated Show resolved Hide resolved
tests/test_writer.py Outdated Show resolved Hide resolved
tests/test_writer.py Outdated Show resolved Hide resolved
subtypes: Optional[
Union[AnnotationSubtype, Iterable[AnnotationSubtype]]
] = None,
annotation_filter_function: Optional[Callable] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
annotation_filter_function: Optional[Callable] = None,
annotation_filter_function: Optional[Callable[[DictionaryObject, ArrayObject, DictionaryObject], bool]] = None,

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see below bout the 2nd param

Comment on lines +2122 to +2123
annotation: ArrayObject,
obj: DictionaryObject
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can add a little bit more documentation:

  • Why is the annotation an ArrayObject?
  • What does obj represent?

Copy link
Contributor Author

@MrTomRod MrTomRod Sep 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't actually properly understand these objects. I just kept the original names and annotations, see this part of the old code where annotations were filtered. I just passed all the relevant context objects to the filter function, in case they might be useful for filtering. :)

In short, I don't know what to write myself, sorry. Can you do this?

subtypes: Optional[
Union[AnnotationSubtype, Iterable[AnnotationSubtype]]
] = None,
annotation_filter_function: Optional[Callable] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an idea: We could call the parameter exclude. That (in combination with the type annotation and the docs) seems pretty clear to me. Whenever it evaluates to True, the entry is excluded. That would inverse the logic.

I'm not happy with annotation_filter_function as it's pretty lengthy + I never know if we "filter out" (remove) or "filter to keep".

We could then make the default the constant-False function (which would not remove anything). This would mean the annotation would become simpler as the "Optional" could be removed.

Then the meaning of subtypes in combination with the new parameter has to be clear. I think it would be best if the both are independent filter functions. If both are set, we can create a new function that combines both internally. That means we don't have to ignore one parameter. And we don't have to log anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I don't really like exclude, because it feels more natural to return True to keep something. Maybe include or filter?

We could then make the default the constant-False function (which would not remove anything). This would mean the annotation would become simpler as the "Optional" could be removed.

I'm not sure I get this. I wanted to set the default_annotation_filter_function as the annotation_filter_function by default, too, without Optional[Callable] = None. But mypy didn't let me iirc.

I think it would be best if the both are independent filter functions. If both are set, we can create a new function that combines both internally. That means we don't have to ignore one parameter. And we don't have to log anything.

Agreed.

"""
if annotation_filter_function is None:
annotation_filter_function = default_annotation_filter_function

page = cast(DictionaryObject, page.get_object())
if PG.ANNOTS in page:
i = 0
while i < len(cast(ArrayObject, page[PG.ANNOTS])):
an = cast(ArrayObject, page[PG.ANNOTS])[i]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an normally contains the IndirectObject pointing onto obj. I do not see any use case where this is required. However if somebody needs it, it will be available through indirect_reference property.
For me, just the page, and the annotation (Just to be renamed in the example) which may be clearer are sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I properly understood your comment. You're basically suggesting to have a simpler filter function?

def default_annotation_filter_function(page: Any, an: Any, obj: Any) -> bool:
    becomes:
def default_annotation_filter_function(page: Any, an: Any) -> bool:

And obj would still be there in an.indirect_reference?

I never truly understood these objects. Can you maybe write a useful documentation for annotation_filter_function as suggested here?

@@ -588,3 +588,8 @@ def replace(self, new_image: Any, **kwargs: Any) -> None:
self.name = self.name[: self.name.rfind(".")] + extension
self.data = byte_stream
self.image = img


def default_annotation_filter_function(page: Any, an: Any, obj: Any) -> bool:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see any advantage to have this function in _utils. It could be sufficient as a internal function next to _annotation_filter_function

subtypes: Optional[
Union[AnnotationSubtype, Iterable[AnnotationSubtype]]
] = None,
annotation_filter_function: Optional[Callable] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see below bout the 2nd param

"""
if subtypes is not None:
if annotation_filter_function is not None:
logger_warning(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would more likely raise an ValueError

@MartinThoma MartinThoma removed the soon PRs that are almost ready to be merged, issues that get solved pretty soon label Aug 27, 2023
@pubpub-zz
Copy link
Collaborator

pubpub-zz commented Sep 3, 2023

@MrTomRod
+1 about my comments?

@MrTomRod
Copy link
Contributor Author

MrTomRod commented Sep 4, 2023

Sorry, I've lost the plot a bit, I'm very busy atm.

What comment does "see below bout the 2nd param" refer to exactly?

I think most of the suggestions make sense. Except if

I think it would be best if the both are independent filter functions. If both are set, we can create a new function that combines both internally.

Somehow destroys backwards compatibility.

@pubpub-zz
Copy link
Collaborator

pubpub-zz commented Sep 4, 2023

`> Sorry, I've lost the plot a bit, I'm very busy atm.

What comment does "see below bout the 2nd param" refer to exactly?

I meant an seems for me useless and we should have only page and obj as parameters in the filter.
It would be great to have those as elements in brackets behind callable to improve annotation

I think most of the suggestions make sense. Except if

I think it would be best if the both are independent filter functions. If both are set, we can create a new function that combines both internally.

from my understanding @MartinThoma means, as your new parameter is exclusive from the old one, we should have 2 functions:
legacy
def remove_annotations(self, subtypes: Optional[Union[AnnotationSubtype, Iterable[AnnotationSubtype]]])
and your new function
def remove_filtered_annotations(self, subtypes: Optional[Union[AnnotationSubtype, Iterable[AnnotationSubtype]]])

but in order to keep code simple remove_annotation could call remove_filtered_annotations with a lambda function.

Somehow destroys backwards compatibility.

backwards compatibility would be kept 😀

@MartinThoma
Copy link
Member

I'm closing this PR now without merging. I have the impression that this is a rather unusual use case and that the API might be confusing.

I would prefer a numpy-style API def remove_annotations(indices) -> None and def annotations() -> Iterable[Annotation] to get the indices. That seems way clearer + we would not have to add parameters to such a function.

@MartinThoma
Copy link
Member

@MrTomRod Thank you for all the work you put into this PR. Although I didn't merge it, I do value it. If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

@MrTomRod
Copy link
Contributor Author

No worries, and no need to add me to the list. Thanks for your effort!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

More flexible remove_annots_from_page function
3 participants