Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

str_replace in Font.php now seems to work as expected #597

Merged
merged 4 commits into from
Jul 26, 2023

Conversation

tskodaw
Copy link
Contributor

@tskodaw tskodaw commented Apr 21, 2023

Hi,
I ran into an issue while text has been decoded using /ToUnicode and a CMap

I used a PDF file which was read correctly by Acrobat Reader, and showed the text. It has been created by a third hand with setasign/SetaFPDF.
Because of privacy issues I can't provide it :-( nor can I reproduce such PDF, I don't know what they have done.
nevertheless the thing was: (all takes place in Font.php)

decodeContentByToUnicodeCMapOrDescendantFonts(465) with translateChar(107)
did sometimes not find the character, which should be there and also was obviously in the $this->table as well had some offset-issues.
The CMap mapped like 67 characters ..
A while of digging showed me, that in decimals my 2 bytes encoded string has been in decimals something like this:

00|47|00|50|00|03|00|92|98
the error came with 00|92 -> which obviously has not been in the table
further debugging showed me, that my desired letter has been in 3 bytes there 00|92|98

the original decompressed string had character combinations of '\b' '\r' '\f' in it. So they were not the "special characters" but always backslash and the letter.

which should have been replaced to the one byte long special character with str_replace in line 436.
While debugging I found out that was not the case resulting in 92|98 for \b
the special character for \b would have been decimal 8 which is what was the right letter in the table. I played around with escaping etc. finally I just used chr(92) chr(98) etc. (see commit)
php version has been 8.1.6 for windows
I don't now why it did not replace the strings as intended and expected ...

Perhaps - if you have time - and the issue is known - you could look into the matter and perhaps the chr(92) etc. approach helps.

Best Regards


fixes #160

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tskodaw thank you for your pull request.

My knowledge of this code part is very limited and there are a few questions about your feedback besides the code change. To sum it up: with your change, PDFParser can decode/produce more characters than before. Is that correct?

decodeContentByToUnicodeCMapOrDescendantFonts(465) with translateChar(107)
did sometimes not find the character, which should be there and also was obviously in the $this->table as well had some offset-issues.

Its seems that there are characters missing in https://github.com/smalot/pdfparser/blob/master/src/Smalot/PdfParser/Font.php#L437-L438. Why should they be there in the first place? Can you point to a paragraph in the PDF specification, technical documentation or provide an argument, which underlines that?

The CMap mapped like 67 characters ..
A while of digging showed me, that in decimals my 2 bytes encoded string has been in decimals something like this: 00|47|00|50|00|03|00|92|98
the error came with 00|92 -> which obviously has not been in the table
further debugging showed me, that my desired letter has been in 3 bytes there 00|92|98 the original decompressed string had character combinations of '\b' '\r' '\f' in it. So they were not the "special characters" but always backslash and the letter.

Do you mean, that the replacements were faulty all along?

which should have been replaced to the one byte long special character with str_replace in line 436. While debugging I found out that was not the case resulting in 92|98 for \b
the special character for \b would have been decimal 8 which is what was the right letter in the table. I played around with escaping etc. finally I just used chr(92) chr(98) etc. (see commit)

Can you please elaborate a bit about this? I checked chr documentation and it says:

This can be used to create a one-character string in a single-byte encoding such as ASCII, ISO-8859, or Windows 1252, by passing the position of a desired character in the encoding's mapping table

Please explain, why you wanna change the encoding of the replaced characters from UTF-8 (currently) to ASCII, ISO-8859, or Windows 1252 (with your changes).

@k00ni k00ni added enhancement missing or incomplete functionality For something which is not a bug, but more like an incomplete feature. needs more info tests required labels Apr 24, 2023
@tskodaw
Copy link
Contributor Author

tskodaw commented Apr 28, 2023

Hello @k00ni , thanks for answering.

Actually I hoped that someone who wrote that part perhaps will run into it and knows what to do. So your answer and the questions let me dig deeper. But unluckily I normally have nothing to do with internals of PDFs.

Can you point to a paragraph in the PDF specification

For this specific constellation I did not have found any specific arguments why those characters which will be transformed are escaped in the first place. Perhaps (https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf) page 54 "TABLE 3.2 Escape sequences" in literal strings could be a hint. But there is no explanation if escaping is needed or not, when it comes to string literals in brackets for mappings with CMaps and /ToUnicode.

Do you mean, that the replacements were faulty all along?

No, I guess with all the mappings and replacements in the debugger having those strings represented as decimals|hex and characters before and after the replacement, I became a little confused about what has been replaced and what not. I guess there would have been more issues reported if that would have been the case.

As far as I can see it now, just the replacement for \b is missing which is not been solved with just adding '\b' and "\b" respectively to the arrays. Instead of "\b" the replacement should be "\x08" or chr(8) . I chose chr(8).
But in my case without that replacement the whole text token becomes useless at the beginning of the first '\b's position.
In order to make it more clear which decimal value would be replaced I used the chr([VALUE]) replacement variant. Since the font mapping table ($this->table) uses the decimal representation as index for the coded characters (see Font.php:112). So knowing which decimal value would be there might facilitate look ups in the mapping table while debugging. Instead of having in mind which special character / hex value has which decimal value.

Can you please elaborate a bit about this? I checked chr documentation and it says:

This can be used to create a one-character string in a single-byte encoding such as ASCII, ISO-8859, or Windows 1252, by passing the position of a desired character in the encoding's mapping table
Please explain, why you wanna change the encoding of the replaced characters from UTF-8 (currently) to ASCII, ISO-8859, or Windows 1252 (with your changes).

Using chr([DecimalValue]) is used to have clarity which value will be at that position (see above) the values are the same as ASCII or UTF8 "\n" "\r" etc. as stated in: "The first 128 UTF-8 characters precisely match the first 128 ASCII characters (numbered 0-127), meaning that existing ASCII text is already valid UTF-8. All other characters use two to four bytes." (https://developer.mozilla.org/en-US/docs/Glossary/UTF-8)

conclusion leaving everything as it is, but adding the '\b' in the needle array and "\x08" or chr(8) to the replacement array would imho be already sufficient and would allow extraction of 2byte encoded tokens into unicode via CMap/ToUnicode having '\b' in it. There decimals 92 98 would be replaced with decimal 08 analogue to 92 110 to ("\n") 92 102 ("\f") etc . From 2 bytes to 1 bytes which - in my case helped to have a 3 bytes sequence corrected to its intended 2 bytes sequence as with the other special characters. But perhaps there is someone knowing more about those escaped string literals around? Or are there any Test-PDFs to run automated tests?

thanks a lot for investigating this matter.

@k00ni
Copy link
Collaborator

k00ni commented May 10, 2023

Thank you for your detailed answer, I support your view.

Could you add a basic unit test to check that the new addition works?

@radieter
Copy link

radieter commented May 19, 2023

I just submitted a very similar pull request. I had issues with lost characters. Same like you I cannot provide a demo by now for the same reasony. Using xdebug I discovered that the str_replace removed a "\" too much...

also: why is this replacement hard-coded?

@k00ni
Copy link
Collaborator

k00ni commented May 22, 2023

@tskodaw what is the status here?

@tskodaw
Copy link
Contributor Author

tskodaw commented May 23, 2023

@k00ni oh I just came to realize the notification e-mails, I need more time to look into the projects test setup and into a special test itself. Are there any quick direction hints concerning that matter? thx for your patience

@k00ni
Copy link
Collaborator

k00ni commented May 24, 2023

Here a rough example how to achieve a basic unit test:

One or two tests are enough IMHO.

Thank you for investing the time.

@k00ni
Copy link
Collaborator

k00ni commented Jun 5, 2023

We made some changes to account for latest PHP-CS-Fixer rule changes (#602). Please merge in master branch to avoid failing coding style test pipeline.

@k00ni
Copy link
Collaborator

k00ni commented Jul 4, 2023

@tskodaw whats the status here? need any help?

@radieter
Copy link

radieter commented Jul 5, 2023

I still wonder why there is an escaped blank in the search-array. such thing doesn't exist.

@k00ni k00ni added the stale needs decision label Jul 5, 2023
@k00ni k00ni marked this pull request as draft July 10, 2023 07:02
@k00ni k00ni marked this pull request as ready for review July 19, 2023 08:56
@k00ni
Copy link
Collaborator

k00ni commented Jul 19, 2023

@tskodaw I added a basic test which should capture the \b-part. It is sufficient enough and a PDF is not needed. Please have a look.

@GreyWyvern
Copy link
Contributor

GreyWyvern commented Jul 21, 2023

The PDF referenced by #160 contains issues with \b's that are resolved by this fix.

Interestingly enough, @radieter, it also contains escaped spaces "\ ", but in the same strings of text, actual spaces are encoded as a separate \### octal escape code. This would make you think, "okay, the octal value for the space character must be getting translated to an actual character by the CIDMap" but in the case of the above PDF, they apparently are nothing more than actual spaces, or the character they're getting translated to is some thinspace or zero-width space that PdfParser is translating as a full space.

If I change the replace value for "\ " to an empty string, then in two instances in the PDF linked above, an incorrectly added space is removed.

            // replace escaped chars
            $text = str_replace(
                ['\\\\', '\(', '\)', '\n', '\r', '\t', '\f', '\ ', '\b'],
                [\chr(92), \chr(40), \chr(41), \chr(10), \chr(13), \chr(9), \chr(12), '', \chr(8)],
                $text
            );

Before:

Your Smartphone an d Mobile Apps	
Your ePayslips Porta l	

After:

Your Smartphone and Mobile Apps	
Your ePayslips Portal	

@k00ni k00ni merged commit 5c82748 into smalot:master Jul 26, 2023
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix missing or incomplete functionality For something which is not a bug, but more like an incomplete feature.
Projects
None yet
4 participants