Fix getTextContent evaluation to only apply TJ horizontal offsets using numeric items/args #7714

Merged
merged 1 commit into from
Oct 13, 2016

cemerick
Contributor

While the array argument to TJ should only contain strings and numbers, PDFs
in the wild sometimes include other items, such as the Tc operators here:

[(Grandes) 0.0 Tc
-250.0 (Client\350les,) 0.0 Tc
-250.0 (Financements) 0.0 Tc
-250.0 (et) 0.0 Tc
-250.0 (March\351s) ] TJ

getOperatorList already properly ignores any non-string, non-numeric values in
TJ arrays; without this patch to getTextContent, returned text items can have
NaN widths due to calculations being applied to those non-numeric values.
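
To illustrate how a stray operator can poison the width calculation, here is a minimal sketch. It is not the pdf.js source: the function names, the plain-object stand-in for the Tc operator, and the font size of 12 are all assumptions made for the example, and glyph widths for the strings are left out entirely.

```javascript
// Illustrative sketch only; not the pdf.js implementation.
// All names below are hypothetical.

// Naive version: treats every non-string item as a number.
function naiveAdvance(items, fontSize) {
  let advance = 0;
  for (const item of items) {
    if (typeof item === "string") {
      continue; // glyph widths would be added here; omitted in this sketch
    }
    // TJ numbers are in thousandths of a text space unit and are subtracted
    // from the current position. A non-numeric item makes this NaN.
    advance += (-item * fontSize) / 1000;
  }
  return advance;
}

// Guarded version: only numeric items contribute an offset.
function guardedAdvance(items, fontSize) {
  let advance = 0;
  for (const item of items) {
    if (typeof item === "number") {
      advance += (-item * fontSize) / 1000;
    }
    // Strings are skipped in this sketch; anything else is ignored.
  }
  return advance;
}

// Assumed parse of the start of the array above, with the stray Tc
// operator represented as a plain object.
const items = ["Grandes", 0.0, { cmd: "Tc" }, -250.0, "Client\u00e8les,"];

console.log(naiveAdvance(items, 12));   // NaN
console.log(guardedAdvance(items, 12)); // 3
```

Once a single NaN enters the running total, every later addition stays NaN, which is why one out-of-spec item is enough to spoil the width of the whole text item.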

@yurydelendik
Contributor

Can you provide some form of test (e.g. ref text)?

@cemerick
Contributor Author

You mean a sample PDF that contains an out-of-spec TJ array? Not at the moment; the example I have is from a customer.

FWIW, the change is just mirroring what getOperatorList does already.

@yurydelendik
Contributor

Hmm, we might have an 'eq' test for it; it's probably a matter of adding a 'text' one.

@cemerick
Contributor Author

Updated with a minimized test case. Hopefully I got the manifest entry right.

FWIW, here's an easy expression to evaluate once the testcase PDF is loaded in the viewer:

PDFViewerApplication.pdfDocument.getPage(1).then((p) => p.getTextContent()).then((tc) => console.log(tc.items[0]));

You'll see the sole text item; its width is NaN on HEAD and ~193 with the patch. It turns out this also affects the text layer (positively): the scaleX CSS property is now calculated properly, so the text run's div fully covers the rendered text.
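
A rough sketch of why a NaN width would break the scaleX value, assuming the text layer stretches each div by the ratio of the item's reported width to the div's measured DOM width. Both helper names and the measured width of 96.5 are hypothetical, chosen only for the example; this is not the viewer's actual code.

```javascript
// Illustrative sketch only; helper names and inputs are hypothetical.

// Unguarded: a NaN item width yields an unusable CSS value.
function naiveScaleX(itemWidth, measuredWidth) {
  return `scaleX(${itemWidth / measuredWidth})`;
}

// Guarded: return null when no meaningful scale can be computed.
function safeScaleX(itemWidth, measuredWidth) {
  if (!Number.isFinite(itemWidth) || !(measuredWidth > 0)) {
    return null;
  }
  return `scaleX(${itemWidth / measuredWidth})`;
}

console.log(naiveScaleX(NaN, 96.5)); // "scaleX(NaN)"
console.log(safeScaleX(NaN, 96.5));  // null
console.log(safeScaleX(193, 96.5));  // "scaleX(2)"
```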

@@ -258,3 +258,5 @@
!annotation-text-widget.pdf
!annotation-choice-widget.pdf
!zero_descent.pdf
!operator-in-TJ-array.pdf

Contributor

Nit: please remove this newline

Contributor Author

Done.

@timvandermeij
Contributor

/botio-linux preview

@pdfjsbot
From: Bot.io (Linux)


Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://107.21.233.14:8877/4aca01e041e53d7/output.txt

@pdfjsbot
From: Bot.io (Linux)


Success

Full output at http://107.21.233.14:8877/4aca01e041e53d7/output.txt

Total script time: 1.08 mins

Published

@timvandermeij
Contributor

/botio test

@pdfjsbot
From: Bot.io (Linux)


Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://107.21.233.14:8877/c795d0674b05ad0/output.txt

@pdfjsbot
From: Bot.io (Windows)


Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://107.22.172.223:8877/a39b4967195cc48/output.txt

@pdfjsbot
From: Bot.io (Windows)


Failed

Full output at http://107.22.172.223:8877/a39b4967195cc48/output.txt

Total script time: 25.27 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://107.22.172.223:8877/a39b4967195cc48/reftest-analyzer.html#web=eq.log

@pdfjsbot
From: Bot.io (Linux)


Failed

Full output at http://107.21.233.14:8877/c795d0674b05ad0/output.txt

Total script time: 29.45 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://107.21.233.14:8877/c795d0674b05ad0/reftest-analyzer.html#web=eq.log

@timvandermeij
Contributor

/botio makeref

@pdfjsbot
From: Bot.io (Windows)


Received

Command cmd_makeref from @timvandermeij received. Current queue size: 0

Live output at: http://107.22.172.223:8877/cd84092ce8d9cfa/output.txt

@pdfjsbot
From: Bot.io (Linux)


Received

Command cmd_makeref from @timvandermeij received. Current queue size: 0

Live output at: http://107.21.233.14:8877/5810f38e5ab44de/output.txt

@pdfjsbot
From: Bot.io (Windows)


Success

Full output at http://107.22.172.223:8877/cd84092ce8d9cfa/output.txt

Total script time: 25.20 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@pdfjsbot
From: Bot.io (Linux)


Success

Full output at http://107.21.233.14:8877/5810f38e5ab44de/output.txt

Total script time: 28.34 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@timvandermeij timvandermeij merged commit c457e60 into mozilla:master Oct 13, 2016
@timvandermeij
Contributor

Thank you for your contribution!

movsb pushed a commit to movsb/pdf.js that referenced this pull request Jul 14, 2018
Fix getTextContent evaluation to only apply TJ horizontal offsets using numeric items/args