process hexadecimal strings containing line breaks, but strip line breaks first (fix #273) #346

Connum · 2020-09-28T08:34:53Z

This is the same as #344, with the mess cleaned up.

Strings decoded as hexadecimal may contain line break characters in the hexadecimal representation of the raw data. These were ignored and the hexadecimal representation was returned unprocessed, resulting in what can be seen in #273: hexadecimal strings in the otherwise decoded text content.

This fix takes line breaks in hexadecimal representation into account, stripping those line breaks before decoding.

Please note that the CI check will expectedly fail for test case testGetDataTmIssue336 while PR #343 is pending.

Edit by @k00ni: fixes #273

…eaks first (fix smalot#273)

k00ni · 2020-09-28T11:51:03Z

Same here, unfortunately. Can't open the diff view for FontTest because Github thinks its a binary file.

Connum · 2020-09-28T12:32:43Z

Same here, unfortunately. Can't open the diff view for FontTest, because Github thinks its a binary file.

Yeah, right, that was an error in reasoning on my side, to think that we would now be able to see a comparison. Of course the initial state of the file is still regarded as a binary file and therefore won't be regarded for diffing. But in the future, once this is pushed, the file will be treated as a text file again.
You could copy the content of the raw file https://raw.githubusercontent.com/smalot/pdfparser/6daa665c2d98726a2dedbadc500e16380e8ebd59/tests/Integration/FontTest.php and use the diff tool of your choice to compare it to the master version locally.

Here's the diff for the change (notice the ? before the hex code):

235c235  
<         $hexa = '<?xml version="1.0"?><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2"  style="font-size:12.0pt;text-align:left;color:?0000;font-weight:normal;font-style:norm\  
---  
>         $hexa = '<?xml version="1.0"?><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2"  style="font-size:12.0pt;text-align:left;color:0000;font-weight:normal;font-style:norm\

And a Text Compare Report created with Beyond Compare (GitHub doesn't allow the upload of the html unfortunately):

I also noticed that the hex color code following the byte char was incomplete, it should be 000000, I fixed that in an additonal commit. My guess is that it was correct at some point and somehow the first to zeroes turned into the 00 byte symbol.

k00ni · 2020-09-28T12:51:04Z

Using the 3. commit itself also reveals your changes (75e23db). Thank you for clearing this, will look into it later on.

k00ni · 2020-09-30T07:44:27Z

I just saw that in your PR #345 the same line in Parser.php was changed:

https://github.com/smalot/pdfparser/pull/345/files#diff-6dd9476cf5bca6913f04bf742aacdce9

Does it mean that this PR was replaced?

Connum · 2020-09-30T08:09:39Z

Oh dear, I might have pushed something to the wrong branch... let me check!

Connum · 2020-09-30T08:32:49Z

Oh dear, I might have pushed something to the wrong branch... let me check!

Yup, that was exactly what I did... Sorry for the mess in a PR that was supposed to clean up the mess! ;-)

Corrected, master merged and code linted.

k00ni

Looks good. Will keep this open for a few days to allow others to comment.

…eaks first (fix smalot#273) (smalot#346) * process hexadecimal strings containing line breaks, but strip line breaks first (fix smalot#273) * remove binary symbold from test data string * code linting

Connum added 2 commits September 25, 2020 22:12

process hexadecimal strings containing line breaks, but strip line br…

4455b0d

…eaks first (fix smalot#273)

remove binary symbold from test data string

20c9f92

k00ni self-assigned this Sep 28, 2020

k00ni added the fix label Sep 28, 2020

k00ni added the needs more info label Sep 30, 2020

Connum force-pushed the fix-273 branch from bfd4730 to 20c9f92 Compare September 30, 2020 08:21

Connum added 2 commits September 30, 2020 10:23

Merge branch 'master' into fix-273-new

1297f61

code linting

8d42c4e

k00ni approved these changes Sep 30, 2020

View reviewed changes

k00ni removed the needs more info label Sep 30, 2020

k00ni merged commit 72a8778 into smalot:master Oct 8, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

process hexadecimal strings containing line breaks, but strip line breaks first (fix #273) #346

process hexadecimal strings containing line breaks, but strip line breaks first (fix #273) #346

Connum commented Sep 28, 2020

k00ni commented Sep 28, 2020 •

edited

Loading

Connum commented Sep 28, 2020

k00ni commented Sep 28, 2020 •

edited

Loading

k00ni commented Sep 30, 2020

Connum commented Sep 30, 2020

Connum commented Sep 30, 2020

k00ni left a comment

process hexadecimal strings containing line breaks, but strip line breaks first (fix #273) #346

process hexadecimal strings containing line breaks, but strip line breaks first (fix #273) #346

Conversation

Connum commented Sep 28, 2020

k00ni commented Sep 28, 2020 • edited Loading

Connum commented Sep 28, 2020

k00ni commented Sep 28, 2020 • edited Loading

k00ni commented Sep 30, 2020

Connum commented Sep 30, 2020

Connum commented Sep 30, 2020

k00ni left a comment

Choose a reason for hiding this comment

k00ni commented Sep 28, 2020 •

edited

Loading

k00ni commented Sep 28, 2020 •

edited

Loading