Add font fallback + Support for font IDs containing hyphens #614

Merged 10 commits on Jul 31, 2023

GreyWyvern
Contributor

If part of a text stream is positioned after an incorrect font command, undecoded, jumbled bytes appear in the getText() output. This change checks that output for control characters (\x00-\x1f and \x7f) and, if any appear, loops through all available fonts to find one that decodes the text properly. If none is found, the original string is used. Resolves #586.
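
The check-then-fallback flow described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: `looksUndecoded()`, `decodeWithFallback()`, and the array of per-font decoder callables are hypothetical names standing in for the library's font objects.

```php
<?php
// Hypothetical sketch: none of these names are part of the library's API.

/**
 * True if the text contains a control character (\x00-\x1f or \x7f).
 * The pattern is single-quoted, so PCRE (not PHP) interprets the \x escapes.
 */
function looksUndecoded(string $text): bool
{
    return 1 === preg_match('/[\x00-\x1f\x7f]/', $text);
}

/**
 * Decode $raw with the current font; if the result still looks undecoded,
 * try every other font and keep the first clean result.
 *
 * @param array<string, callable(string): string> $decoders font ID => decoder (assumed shape)
 */
function decodeWithFallback(string $raw, string $currentFontId, array $decoders): string
{
    $original = isset($decoders[$currentFontId])
        ? $decoders[$currentFontId]($raw)
        : $raw;

    if (!looksUndecoded($original)) {
        return $original;
    }

    foreach ($decoders as $fontId => $decode) {
        if ((string) $fontId === $currentFontId) {
            continue; // already tried
        }
        $candidate = $decode($raw);
        if (!looksUndecoded($candidate)) {
            return $candidate;
        }
    }

    // No font produced clean text: keep the original string.
    return $original;
}

// Toy decoders: F1 passes bytes through, F2 shifts each byte by 0x40.
$decoders = [
    'F1' => static fn (string $s): string => $s,
    'F2' => static fn (string $s): string => implode('', array_map(
        static fn (string $c): string => chr(ord($c) + 0x40),
        str_split($s)
    )),
];

// The stream is tagged with F1, but only F2 yields readable text.
$text = decodeWithFallback("\x08\x05\x0c\x0c\x0f", 'F1', $decoders); // "HELLO"
```

Returning the original string when no candidate is clean keeps the fallback from making already-decoded output worse.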

Also add support for font IDs containing hyphens. Previously these were ignored as invalid. Resolves #145.
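
To illustrate the hyphen case, here is a minimal sketch of matching a font-selection command such as `/TT-1 12 Tf`. The function name `parseFontCommand()` and the exact pattern are assumptions made for this example; the parser's real regular expression may differ. The point is only that the character class for the font ID has to admit `-`.

```php
<?php
// Hypothetical sketch: parseFontCommand() is not part of the library.

/**
 * Parse a "/<fontId> <size> Tf" command.
 * Returns ['id' => string, 'size' => float], or null if it does not match.
 */
function parseFontCommand(string $command): ?array
{
    // [\w-]+ accepts letters, digits, underscores and hyphens.
    // A class of \w alone would reject an ID such as "TT-1".
    $pattern = '~^/([\w-]+)\s+(-?\d+(?:\.\d+)?)\s+Tf$~';

    if (1 !== preg_match($pattern, trim($command), $m)) {
        return null;
    }

    return ['id' => $m[1], 'size' => (float) $m[2]];
}

$font = parseFontCommand('/TT-1 12 Tf'); // ['id' => 'TT-1', 'size' => 12.0]
```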

If a text stream is "decoded" and contains UTF-8 control characters, it probably wasn't decoded using the proper font code page. Add a loop that cycles through all the available fonts to see if there's a better decode choice. Resolves Issue 586.

As well, add the ability to parse font IDs containing dashes (-). Resolves Issue 145
Simplify these tests in case future edits change spacing rules.
Collaborator

@k00ni k00ni left a comment

It's good to see you are still with us!

Just a few remarks/questions.

Review threads (outdated, resolved):
src/Smalot/PdfParser/PDFObject.php
src/Smalot/PdfParser/PDFObject.php
samples/FontIDHyphen.pdf
samples/ImproperFontFallback.pdf
Let PCRE handle the conversion rather than PHP. Hopefully fixes PHPStan complaints about null byte.
GreyWyvern and others added 6 commits July 14, 2023 10:38
Remove the Font ID with hyphen test case PDF as we could not contact the submitter to get permission to use it.
Change the unit test to directly test if a Font ID with a hyphen is correctly parsed.
Add one more test for font-fallback. This addition also resolves smalot#495.
Catches situations where a null byte \x00 may not be found by preg_match in a unicode context.
Null bytes in the text string usually mean that a CIDMap encoded string has been passed through as UTF-8 bytes without being translated by any matching CIDMap pairs.
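
A minimal sketch of such a belt-and-braces check follows. The function name `hasUndecodedBytes()` is hypothetical, and the scenario in the comments is an assumption, not the PR's stated reasoning: with the `/u` modifier, `preg_match()` returns `false` when the subject is not valid UTF-8, so a plain byte search is one way to still catch a null byte in that case.

```php
<?php
// Hypothetical sketch: hasUndecodedBytes() is not part of the library.

function hasUndecodedBytes(string $text): bool
{
    // Single-quoted pattern: PCRE, not PHP, interprets the \x escapes.
    // preg_match() yields 1 (match), 0 (no match) or false (error,
    // for example an invalid UTF-8 subject under the /u modifier).
    if (1 === preg_match('/[\x00-\x1f\x7f]/u', $text)) {
        return true;
    }

    // Byte-level search that does not depend on the string being valid UTF-8.
    return false !== strpos($text, "\0");
}

// Valid UTF-8 with interleaved null bytes: caught by the regex.
$a = hasUndecodedBytes("\0A\0B");   // true

// Invalid UTF-8 (\xff) plus a null byte: the regex errors, strpos() catches it.
$b = hasUndecodedBytes("\xff\0A");  // true
```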
@GreyWyvern GreyWyvern mentioned this pull request Jul 21, 2023
@k00ni k00ni linked an issue Jul 23, 2023 that may be closed by this pull request
@k00ni
Collaborator

k00ni commented Jul 23, 2023

Are you done here @GreyWyvern?

@GreyWyvern
Contributor Author

> Are you done here @GreyWyvern?

Yes, sorry. I was on vacation this week. :)

@k00ni
Collaborator

k00ni commented Jul 31, 2023

> > Are you done here @GreyWyvern?
>
> Yes, sorry. I was on vacation this week. :)

All good, hope you had a good one.

@k00ni k00ni merged commit ce434c1 into smalot:master Jul 31, 2023
26 checks passed