Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

process hexadecimal strings containing line breaks, but strip line breaks first (fix #273) #346

Merged
merged 4 commits into from
Oct 8, 2020

Conversation

Connum
Copy link
Contributor

@Connum Connum commented Sep 28, 2020

This is the same as #344, with the mess cleaned up.

Strings decoded as hexadecimal may contain line break characters in the hexadecimal representation of the raw data. These were ignored and the hexadecimal representation was returned unprocessed, resulting in what can be seen in #273: hexadecimal strings in the otherwise decoded text content.

This fix takes line breaks in hexadecimal representation into account, stripping those line breaks before decoding.

Please note that the CI check will expectedly fail for test case testGetDataTmIssue336 while PR #343 is pending.

Edit by @k00ni: fixes #273

@k00ni
Copy link
Collaborator

k00ni commented Sep 28, 2020

Same here, unfortunately. Can't open the diff view for FontTest because Github thinks its a binary file.

@k00ni k00ni self-assigned this Sep 28, 2020
@k00ni k00ni added the fix label Sep 28, 2020
@Connum
Copy link
Contributor Author

Connum commented Sep 28, 2020

Same here, unfortunately. Can't open the diff view for FontTest, because Github thinks its a binary file.

Yeah, right, that was an error in reasoning on my side, to think that we would now be able to see a comparison. Of course the initial state of the file is still regarded as a binary file and therefore won't be regarded for diffing. But in the future, once this is pushed, the file will be treated as a text file again.
You could copy the content of the raw file https://raw.githubusercontent.com/smalot/pdfparser/6daa665c2d98726a2dedbadc500e16380e8ebd59/tests/Integration/FontTest.php and use the diff tool of your choice to compare it to the master version locally.

Here's the diff for the change (notice the ? before the hex code):

235c235  
<         $hexa = '<?xml version="1.0"?><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2"  style="font-size:12.0pt;text-align:left;color:?0000;font-weight:normal;font-style:norm\  
---  
>         $hexa = '<?xml version="1.0"?><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2"  style="font-size:12.0pt;text-align:left;color:0000;font-weight:normal;font-style:norm\  

And a Text Compare Report created with Beyond Compare (GitHub doesn't allow the upload of the html unfortunately):
grafik

I also noticed that the hex color code following the byte char was incomplete, it should be 000000, I fixed that in an additonal commit. My guess is that it was correct at some point and somehow the first to zeroes turned into the 00 byte symbol.

@k00ni
Copy link
Collaborator

k00ni commented Sep 28, 2020

Using the 3. commit itself also reveals your changes (75e23db). Thank you for clearing this, will look into it later on.

@k00ni
Copy link
Collaborator

k00ni commented Sep 30, 2020

I just saw that in your PR #345 the same line in Parser.php was changed:

https://github.com/smalot/pdfparser/pull/345/files#diff-6dd9476cf5bca6913f04bf742aacdce9

Does it mean that this PR was replaced?

@Connum
Copy link
Contributor Author

Connum commented Sep 30, 2020

Oh dear, I might have pushed something to the wrong branch... let me check!

@Connum
Copy link
Contributor Author

Connum commented Sep 30, 2020

Oh dear, I might have pushed something to the wrong branch... let me check!

Yup, that was exactly what I did... Sorry for the mess in a PR that was supposed to clean up the mess! ;-)

Corrected, master merged and code linted.

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Will keep this open for a few days to allow others to comment.

@k00ni k00ni merged commit 72a8778 into smalot:master Oct 8, 2020
partulaj pushed a commit to partulaj/pdfparser that referenced this pull request Dec 21, 2020
…eaks first (fix smalot#273) (smalot#346)

* process hexadecimal strings containing line breaks, but strip line breaks first (fix smalot#273)

* remove binary symbold from test data string

* code linting
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Problem in part of the translation
2 participants