BUG: text position detection works worse than in PyPDF2 #2200

vors · 2023-09-18T15:28:49Z

I'm trying to add a highlighting annotation to the doc using the text visitor to identify the coordinates to add it.
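For readers following along, here is a minimal sketch of the kind of visitor described above. It assumes the visitor receives `(text, cm, tm, font_dict, font_size)` as with pypdf's `visitor_text` hook, and that `cm` and `tm` hold the current transformation matrix and the text matrix in the `[a, b, c, d, e, f]` form from the PDF specification. The helper names `to_user_space` and `make_collector` are hypothetical, and the matrices in the demo call are made up. What a given pypdf version actually passes in `cm`/`tm` is the subject of this issue, so treat the composition step as an assumption rather than as the library's behaviour.

```python
from typing import Callable, List, Sequence, Tuple

Matrix = Sequence[float]  # PDF affine matrix as [a, b, c, d, e, f]


def to_user_space(cm: Matrix, tm: Matrix) -> Tuple[float, float]:
    """Map the text-matrix origin (tm[4], tm[5]) through the CTM.

    For a point (x, y) and a matrix [a, b, c, d, e, f]:
    x' = a*x + c*y + e and y' = b*x + d*y + f.
    """
    x, y = tm[4], tm[5]
    return (cm[0] * x + cm[2] * y + cm[4], cm[1] * x + cm[3] * y + cm[5])


def make_collector() -> Tuple[Callable, List[Tuple[str, float, float]]]:
    """Return a visitor plus the list it fills with (text, x, y) tuples."""
    hits: List[Tuple[str, float, float]] = []

    def visitor(text, cm, tm, font_dict, font_size):
        # Skip whitespace-only chunks; they are useless as highlight targets.
        if text.strip():
            x, y = to_user_space(cm, tm)
            hits.append((text, x, y))

    return visitor, hits


visitor, hits = make_collector()
# With pypdf this would be wired up as: page.extract_text(visitor_text=visitor)
# Here the visitor is called by hand with made-up matrices instead.
visitor("Hello", [1, 0, 0, 1, 10, 20], [1, 0, 0, 1, 72, 700], None, 12)
visitor("  ", [1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 0, 0], None, 12)
print(hits)
```

The collected `(x, y)` pairs would then serve as the lower-left corner of the rectangle for the highlight annotation; the width and height still have to be estimated from the font size and text length.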

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-13.5.2-x86_64-i386-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.16.1, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.0.1

Code + PDF

This is a minimal, complete example that shows the issue:

Repro repo: https://github.com/vors/pypdf-highlighting-repro

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

You can add them to your tests.

Visuals

pubpub-zz · 2023-09-18T19:18:10Z

This is a regression that seems to be due to #2060.
It is under analysis to find a fix.

MartinThoma · 2023-09-18T20:15:14Z

I think we might need a rendered image for testing.

pubpub-zz · 2023-09-18T20:17:31Z

> I think we might need a rendered image for testing.

@MartinThoma, can you clarify which image you are talking about?

pubpub-zz · 2023-09-20T17:39:38Z

test file:
page_178.pdf

stefan6419846 · 2023-09-21T14:44:42Z

> > I think we might need a rendered image for testing.

> Can you clarify which image you are talking about?

This probably still needs to be generated. Similar to the watermarking tests, render an image from the page with the corresponding highlighting as in the original issue description and check that it matches the expected position.

Nevertheless, I am not sure whether this really is required to detect such issues, as a plain text position test with a set of words/text snippets and their positions should do the same while not requiring any outside rendering.
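A plain position test along those lines might look like the sketch below. It assumes the extracted positions have already been collected into a mapping from snippet to `(x, y)`; the function name `find_mismatches`, the tolerance of one unit, and all coordinates are hypothetical and are not taken from the test file in this thread.

```python
from typing import Dict, List, Tuple

Position = Tuple[float, float]


def find_mismatches(
    found: Dict[str, Position],
    expected: Dict[str, Position],
    tol: float = 1.0,
) -> List[str]:
    """Describe every expected snippet that is missing or too far off."""
    problems: List[str] = []
    for snippet, (ex, ey) in expected.items():
        if snippet not in found:
            problems.append(f"{snippet!r}: not extracted")
            continue
        fx, fy = found[snippet]
        # Compare each axis separately so a pure vertical shift is caught.
        if abs(fx - ex) > tol or abs(fy - ey) > tol:
            problems.append(
                f"{snippet!r}: got ({fx}, {fy}), expected ({ex}, {ey})"
            )
    return problems


# Made-up reference data: "Title" is within tolerance, "Footer" is not.
expected = {"Title": (72.0, 700.0), "Footer": (72.0, 40.0)}
found = {"Title": (72.3, 699.8), "Footer": (72.0, 95.0)}
mismatches = find_mismatches(found, expected)
print(mismatches)
```

In a real test the `expected` mapping would come from coordinates verified independently, and the test would assert that the returned list is empty.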

pubpub-zz · 2023-09-21T16:55:54Z

> Nevertheless, I am not sure whether this really is required to detect such issues, as a plain text position test with a set of words/text snippets and their positions should do the same while not requiring any outside rendering.

I've been able to check the position through PDF inline content analysis and confirmed it with PDF-XChange Editor, which provides the coordinates of the cursor.
Note: this showed that the position reported by PyPDF2 was not very accurate.

MartinThoma · 2023-09-22T19:47:31Z

I meant an image which doesn't exist so far. Something similar to the merge page rendering tests.

pubpub-zz mentioned this issue Sep 19, 2023

BUG: invalid cm/tm in visitor functions #2206

Merged

MartinThoma changed the title ~~text position detection works worse then PyPDF2~~ BUG: text position detection works worse than in PyPDF2 Sep 24, 2023

MartinThoma closed this as completed in #2206 Oct 8, 2023

MartinThoma pushed a commit that referenced this issue Oct 8, 2023

BUG: invalid cm/tm in visitor functions (#2206)

bcd85c4

Reworks and is still valid to close #2059 Closes #2200 Closes #2075
