process hexadecimal strings containing line breaks, but strip line breaks first (fix #273) #344

Closed
wants to merge 2 commits into from

Conversation

Connum
Contributor

@Connum Connum commented Sep 25, 2020

Strings decoded as hexadecimal may contain line break characters in the hexadecimal representation of the raw data. These line breaks were not accounted for, so the hexadecimal representation was returned unprocessed, resulting in what can be seen in #273: hexadecimal strings in the otherwise decoded text content.

This fix takes line breaks in hexadecimal representation into account, stripping those line breaks before decoding.
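
For readers who want to see the idea in isolation, here is a minimal sketch of stripping line breaks before decoding. It is not the code from this PR: `decodeHexStrings()` is a hypothetical standalone helper, and padding an odd number of digits with a trailing `0` is an assumption based on how the PDF specification treats hexadecimal strings.

```php
<?php
// Hypothetical helper for illustration only; this is NOT the actual
// Font::decodeHexadecimal() implementation from the library.
function decodeHexStrings(string $text): string
{
    // Match <...> groups of hex digits that may be interrupted by line breaks.
    $result = preg_replace_callback(
        '/<([0-9a-fA-F\r\n]+)>/',
        static function (array $m): string {
            // Strip line breaks first so that only hex digits remain.
            $hex = str_replace(["\r", "\n"], '', $m[1]);
            if ($hex === '') {
                // Only line breaks between the brackets: leave the input untouched.
                return $m[0];
            }
            // Assumption: a missing final digit is treated as 0.
            if (strlen($hex) % 2 !== 0) {
                $hex .= '0';
            }
            return hex2bin($hex);
        },
        $text
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $text;
}
```

With this sketch, `bin2hex(decodeHexStrings("<0027\n004c>"))` yields `0027004c`, the same bytes as for the input without the line break.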

Please note that the CI check is expected to fail for test case testGetDataTmIssue336 while PR #343 is pending.

Edit by @k00ni: fixes #273

@Connum Connum mentioned this pull request Sep 25, 2020
Collaborator

@k00ni k00ni left a comment

It's weird, I can't see the file difference of tests/Integration/FontTest.php (reference; it says it's a binary file).

  • @Connum can you please add a link showing the code you changed? (Or paste it here)

@k00ni k00ni added the fix label Sep 28, 2020
@Connum
Contributor Author

Connum commented Sep 28, 2020

That's strange indeed... I had a similar issue in my GUI (SourceTree) recently, I think it might have to do with character encoding... I'll look into it!

@Connum
Contributor Author

Connum commented Sep 28, 2020

So this is the code I added at the end of testDecodeHexadecimal():

// hexadecimal string with a line break should not return the input string
// addressing issue #273: https://github.com/smalot/pdfparser/issues/273
$hexa = "<0027004c0056005300520051004c0045004c004f004c005d0044006f006d0052001d000300560048005b00570044001000490048004c00550044000f0003001400170003004700480003004900480059004800550048004c00550052000300470048000300\n15001300150013>";
$this->assertEquals("\x0\x27\x0\x4c\x0\x56\x0\x53\x0\x52\x0\x51\x0\x4c\x0\x45\x0\x4c\x0\x4f\x0\x4c\x0\x5d\x0\x44\x0\x6f\x0\x6d\x0\x52\x0\x1d\x0\x3\x0\x56\x0\x48\x0\x5b\x0\x57\x0\x44\x0\x10\x0\x49\x0\x48\x0\x4c\x0\x55\x0\x44\x0\xf\x0\x3\x0\x14\x0\x17\x0\x3\x0\x47\x0\x48\x0\x3\x0\x49\x0\x48\x0\x59\x0\x48\x0\x55\x0\x48\x0\x4c\x0\x55\x0\x52\x0\x3\x0\x47\x0\x48\x0\x3\x0\x15\x0\x13\x0\x15\x0\x13", Font::decodeHexadecimal($hexa));

Maybe the mass of encoded chars leads git to believe that this is a binary file? It's very strange... I'm looking to find a way to get around this...

@Connum
Contributor Author

Connum commented Sep 28, 2020

Very interesting... I tried different things: saving the file with another editor (Sublime Text instead of VS Code), and running dos2unix (bundled with Git under Windows), which gave me:

dos2unix: Binary symbol 0x00 found at line 235
dos2unix: Skipping binary file FontTest.php

Looking at that line in Sublime Text, which has a nicer representation for binary symbols, I could see that there's indeed a binary symbol in a test string that shouldn't be there:
(screenshot)

So it turns out that it doesn't have anything to do with the code I added, but I have no idea why it didn't cause issues earlier... I removed that character, as it has nothing to do with the XML test (it would actually render the XML, or at least the inline CSS style, invalid!) and must have gotten there accidentally. The test case still runs fine, as expected.
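
If dos2unix is not at hand, a stray NUL byte like the one reported above can also be located with a few lines of PHP. This is only a sketch: `findNulLines()` is a hypothetical helper name, and it reports 1-based line numbers by splitting on `\n`.

```php
<?php
// Hypothetical helper for illustration: returns the 1-based numbers of all
// lines in $contents that contain a NUL byte (0x00).
function findNulLines(string $contents): array
{
    $hits = [];
    foreach (explode("\n", $contents) as $index => $line) {
        if (strpos($line, "\0") !== false) {
            $hits[] = $index + 1;
        }
    }
    return $hits;
}
```

It could be called as `findNulLines(file_get_contents('tests/Integration/FontTest.php'))`, which returns an empty array for a clean file.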

Successfully merging this pull request may close these issues.

Problem in part of the translation
2 participants