BUG: text position detection works worse than in PyPDF2 #2200

Closed
vors opened this issue Sep 18, 2023 · 7 comments · Fixed by #2206

Comments

@vors

vors commented Sep 18, 2023

I'm trying to add a highlight annotation to a document, using the text visitor to find the coordinates where the annotation should go.
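For readers unfamiliar with the approach, here is a minimal sketch of a text visitor that records where each text fragment starts. It assumes pypdf's documented `visitor_text` callback, which receives `(text, cm, tm, font_dict, font_size)`. The helper names `combine` and `make_position_collector` are hypothetical, and combining the text matrix with the current transformation matrix is an assumption about how to get page coordinates; it is not necessarily what the repro repo does.

```python
from typing import Any, Callable, List, Sequence, Tuple

Matrix = Sequence[float]  # PDF matrix in [a, b, c, d, e, f] form


def combine(tm: Matrix, cm: Matrix) -> List[float]:
    """Multiply two PDF matrices (tm x cm), mapping text space to user space."""
    a1, b1, c1, d1, e1, f1 = tm
    a2, b2, c2, d2, e2, f2 = cm
    return [
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    ]


def make_position_collector(
    found: List[Tuple[str, float, float]],
) -> Callable[[str, Matrix, Matrix, Any, Any], None]:
    """Return a visitor that appends (text, x, y) for each non-blank fragment."""

    def visitor(text: str, cm: Matrix, tm: Matrix, font_dict: Any, font_size: Any) -> None:
        if text.strip():
            m = combine(tm, cm)
            # m[4] and m[5] are the translation part: the origin of the text
            found.append((text, m[4], m[5]))

    return visitor


# Usage with pypdf (assumes a local file named "example.pdf"):
#   from pypdf import PdfReader
#   found = []
#   page = PdfReader("example.pdf").pages[0]
#   page.extract_text(visitor_text=make_position_collector(found))
```

The recorded coordinates can then be used to build the rectangle for the highlight annotation, so any error in the reported matrices shows up directly as a misplaced highlight.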

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-13.5.2-x86_64-i386-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.16.1, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.0.1

Code + PDF

This is a minimal, complete example that shows the issue:

repro repo https://github.com/vors/pypdf-highlighting-repro

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

You can add them to your tests.

Visuals

image

@pubpub-zz
Collaborator

This is a regression that seems to be due to #2060.
I'm analyzing it to find a fix.

@MartinThoma
Member

I think we might need a rendered image for testing.

@pubpub-zz
Collaborator

pubpub-zz commented Sep 18, 2023

> I think we might need a rendered image for testing.

@MartinThoma, Can you clarify which image you are talking about?

@pubpub-zz
Collaborator

test file:
page_178.pdf

@stefan6419846
Collaborator

> > I think we might need a rendered image for testing.
>
> Can you clarify which image you are talking about?

This probably still needs to be generated. Similar to the watermarking tests, render an image from the page with the corresponding highlighting as in the original issue description and check that it matches the expected position.

Nevertheless, I am not sure whether this really is required to detect such issues, as a plain text position test with a set of words/text snippets and their positions should do the same while not requiring any outside rendering.
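A plain text position check along these lines could look like the sketch below. The helper name `position_mismatches`, the dictionary layout, and the default tolerance of one unit are all assumptions for illustration and are not taken from pypdf's test suite.

```python
from typing import Dict, List, Tuple

Position = Tuple[float, float]


def position_mismatches(
    found: Dict[str, Position],
    expected: Dict[str, Position],
    tolerance: float = 1.0,
) -> List[str]:
    """Describe every expected snippet that is missing or too far off."""
    problems: List[str] = []
    for snippet, (ex, ey) in expected.items():
        if snippet not in found:
            problems.append(f"{snippet!r}: not found")
            continue
        fx, fy = found[snippet]
        # Compare each axis separately against the allowed tolerance
        if abs(fx - ex) > tolerance or abs(fy - ey) > tolerance:
            problems.append(f"{snippet!r}: expected ({ex}, {ey}), got ({fx}, {fy})")
    return problems
```

A test would fill `found` from the positions reported by the text visitor, keep `expected` as hand-verified coordinates for a few snippets, and assert that the returned list is empty.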

@pubpub-zz
Collaborator

pubpub-zz commented Sep 21, 2023

> Nevertheless, I am not sure whether this really is required to detect such issues, as a plain text position test with a set of words/text snippets and their positions should do the same while not requiring any outside rendering.

I was able to check the positions by analyzing the PDF's inline content, and confirmed them with PDF-XChange Editor, which shows the coordinates of the cursor.
Note: this showed that the positions reported by PyPDF2 were not very accurate.

@MartinThoma
Member

I meant an image which doesn't exist so far. Something similar to the merge page rendering tests.

MartinThoma changed the title from "text position detection works worse then PyPDF2" to "BUG: text position detection works worse than in PyPDF2" on Sep 24, 2023
MartinThoma pushed a commit that referenced this issue Oct 8, 2023
Reworks and is still valid to close #2059

Closes #2200
Closes #2075