str_replace in Font.php now seems to work as expected #597

tskodaw · 2023-04-21T19:26:29Z

Hi,
I ran into an issue while text has been decoded using /ToUnicode and a CMap

I used a PDF file which was read correctly by Acrobat Reader, and showed the text. It has been created by a third hand with setasign/SetaFPDF.
Because of privacy issues I can't provide it :-( nor can I reproduce such PDF, I don't know what they have done.
nevertheless the thing was: (all takes place in Font.php)

decodeContentByToUnicodeCMapOrDescendantFonts(465) with translateChar(107)
did sometimes not find the character, which should be there and also was obviously in the $this->table as well had some offset-issues.
The CMap mapped like 67 characters ..
A while of digging showed me, that in decimals my 2 bytes encoded string has been in decimals something like this:

00|47|00|50|00|03|00|92|98
the error came with 00|92 -> which obviously has not been in the table
further debugging showed me, that my desired letter has been in 3 bytes there 00|92|98

the original decompressed string had character combinations of '\b' '\r' '\f' in it. So they were not the "special characters" but always backslash and the letter.

which should have been replaced to the one byte long special character with str_replace in line 436.
While debugging I found out that was not the case resulting in 92|98 for \b
the special character for \b would have been decimal 8 which is what was the right letter in the table. I played around with escaping etc. finally I just used chr(92) chr(98) etc. (see commit)
php version has been 8.1.6 for windows
I don't now why it did not replace the strings as intended and expected ...

Perhaps - if you have time - and the issue is known - you could look into the matter and perhaps the chr(92) etc. approach helps.

Best Regards

fixes #160

k00ni

@tskodaw thank you for your pull request.

My knowledge of this code part is very limited and there are a few questions about your feedback besides the code change. To sum it up: with your change, PDFParser can decode/produce more characters than before. Is that correct?

decodeContentByToUnicodeCMapOrDescendantFonts(465) with translateChar(107)
did sometimes not find the character, which should be there and also was obviously in the $this->table as well had some offset-issues.

Its seems that there are characters missing in https://github.com/smalot/pdfparser/blob/master/src/Smalot/PdfParser/Font.php#L437-L438. Why should they be there in the first place? Can you point to a paragraph in the PDF specification, technical documentation or provide an argument, which underlines that?

The CMap mapped like 67 characters ..
A while of digging showed me, that in decimals my 2 bytes encoded string has been in decimals something like this: 00|47|00|50|00|03|00|92|98
the error came with 00|92 -> which obviously has not been in the table
further debugging showed me, that my desired letter has been in 3 bytes there 00|92|98 the original decompressed string had character combinations of '\b' '\r' '\f' in it. So they were not the "special characters" but always backslash and the letter.

Do you mean, that the replacements were faulty all along?

which should have been replaced to the one byte long special character with str_replace in line 436. While debugging I found out that was not the case resulting in 92|98 for \b
the special character for \b would have been decimal 8 which is what was the right letter in the table. I played around with escaping etc. finally I just used chr(92) chr(98) etc. (see commit)

Can you please elaborate a bit about this? I checked chr documentation and it says:

This can be used to create a one-character string in a single-byte encoding such as ASCII, ISO-8859, or Windows 1252, by passing the position of a desired character in the encoding's mapping table

Please explain, why you wanna change the encoding of the replaced characters from UTF-8 (currently) to ASCII, ISO-8859, or Windows 1252 (with your changes).

tskodaw · 2023-04-28T20:04:33Z

Hello @k00ni , thanks for answering.

Actually I hoped that someone who wrote that part perhaps will run into it and knows what to do. So your answer and the questions let me dig deeper. But unluckily I normally have nothing to do with internals of PDFs.

Can you point to a paragraph in the PDF specification

For this specific constellation I did not have found any specific arguments why those characters which will be transformed are escaped in the first place. Perhaps (https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf) page 54 "TABLE 3.2 Escape sequences" in literal strings could be a hint. But there is no explanation if escaping is needed or not, when it comes to string literals in brackets for mappings with CMaps and /ToUnicode.

Do you mean, that the replacements were faulty all along?

No, I guess with all the mappings and replacements in the debugger having those strings represented as decimals|hex and characters before and after the replacement, I became a little confused about what has been replaced and what not. I guess there would have been more issues reported if that would have been the case.

As far as I can see it now, just the replacement for \b is missing which is not been solved with just adding '\b' and "\b" respectively to the arrays. Instead of "\b" the replacement should be "\x08" or chr(8) . I chose chr(8).
But in my case without that replacement the whole text token becomes useless at the beginning of the first '\b's position.
In order to make it more clear which decimal value would be replaced I used the chr([VALUE]) replacement variant. Since the font mapping table ($this->table) uses the decimal representation as index for the coded characters (see Font.php:112). So knowing which decimal value would be there might facilitate look ups in the mapping table while debugging. Instead of having in mind which special character / hex value has which decimal value.

Can you please elaborate a bit about this? I checked chr documentation and it says:

This can be used to create a one-character string in a single-byte encoding such as ASCII, ISO-8859, or Windows 1252, by passing the position of a desired character in the encoding's mapping table
Please explain, why you wanna change the encoding of the replaced characters from UTF-8 (currently) to ASCII, ISO-8859, or Windows 1252 (with your changes).

Using chr([DecimalValue]) is used to have clarity which value will be at that position (see above) the values are the same as ASCII or UTF8 "\n" "\r" etc. as stated in: "The first 128 UTF-8 characters precisely match the first 128 ASCII characters (numbered 0-127), meaning that existing ASCII text is already valid UTF-8. All other characters use two to four bytes." (https://developer.mozilla.org/en-US/docs/Glossary/UTF-8)

conclusion leaving everything as it is, but adding the '\b' in the needle array and "\x08" or chr(8) to the replacement array would imho be already sufficient and would allow extraction of 2byte encoded tokens into unicode via CMap/ToUnicode having '\b' in it. There decimals 92 98 would be replaced with decimal 08 analogue to 92 110 to ("\n") 92 102 ("\f") etc . From 2 bytes to 1 bytes which - in my case helped to have a 3 bytes sequence corrected to its intended 2 bytes sequence as with the other special characters. But perhaps there is someone knowing more about those escaped string literals around? Or are there any Test-PDFs to run automated tests?

thanks a lot for investigating this matter.

k00ni · 2023-05-10T07:46:15Z

Thank you for your detailed answer, I support your view.

Could you add a basic unit test to check that the new addition works?

radieter · 2023-05-19T12:57:59Z

I just submitted a very similar pull request. I had issues with lost characters. Same like you I cannot provide a demo by now for the same reasony. Using xdebug I discovered that the str_replace removed a "\" too much...

also: why is this replacement hard-coded?

k00ni · 2023-05-22T05:53:24Z

@tskodaw what is the status here?

tskodaw · 2023-05-23T12:20:47Z

@k00ni oh I just came to realize the notification e-mails, I need more time to look into the projects test setup and into a special test itself. Are there any quick direction hints concerning that matter? thx for your patience

k00ni · 2023-05-24T05:23:26Z

Here a rough example how to achieve a basic unit test:

create a test class (https://github.com/smalot/pdfparser/blob/master/tests/PHPUnit/Unit/PDFObjectTest.php looks like a good starting point, copy it)
instantiate Font class (with real instances or mocks/stubs)
call decodeText with a parameter, so that your code part is used during computation

One or two tests are enough IMHO.

Thank you for investing the time.

k00ni · 2023-06-05T05:43:01Z

We made some changes to account for latest PHP-CS-Fixer rule changes (#602). Please merge in master branch to avoid failing coding style test pipeline.

k00ni · 2023-07-04T13:47:02Z

@tskodaw whats the status here? need any help?

radieter · 2023-07-05T11:03:26Z

I still wonder why there is an escaped blank in the search-array. such thing doesn't exist.

k00ni · 2023-07-19T08:59:14Z

@tskodaw I added a basic test which should capture the \b-part. It is sufficient enough and a PDF is not needed. Please have a look.

GreyWyvern · 2023-07-21T14:50:47Z

The PDF referenced by #160 contains issues with \b's that are resolved by this fix.

Interestingly enough, @radieter, it also contains escaped spaces "\ ", but in the same strings of text, actual spaces are encoded as a separate \### octal escape code. This would make you think, "okay, the octal value for the space character must be getting translated to an actual character by the CIDMap" but in the case of the above PDF, they apparently are nothing more than actual spaces, or the character they're getting translated to is some thinspace or zero-width space that PdfParser is translating as a full space.

If I change the replace value for "\ " to an empty string, then in two instances in the PDF linked above, an incorrectly added space is removed.

            // replace escaped chars
            $text = str_replace(
                ['\\\\', '\(', '\)', '\n', '\r', '\t', '\f', '\ ', '\b'],
                [\chr(92), \chr(40), \chr(41), \chr(10), \chr(13), \chr(9), \chr(12), '', \chr(8)],
                $text
            );

Before:

Your Smartphone an d Mobile Apps	
Your ePayslips Porta l

After:

Your Smartphone and Mobile Apps	
Your ePayslips Portal

str_replace in Font.php now seems to work as expected

4f71298

k00ni reviewed Apr 24, 2023

View reviewed changes

k00ni added enhancement missing or incomplete functionality For something which is not a bug, but more like an incomplete feature. needs more info tests required labels Apr 24, 2023

fixed coding style issues in Font.php

6f6fc3c

k00ni added fix and removed needs more info enhancement labels May 10, 2023

k00ni mentioned this pull request May 22, 2023

Update Font.php: Remove unnecessary blank replacement #601

Closed

k00ni added the stale needs decision label Jul 5, 2023

k00ni marked this pull request as draft July 10, 2023 07:02

added FontTest.php to check decoding of \b

9ab99c6

k00ni marked this pull request as ready for review July 19, 2023 08:56

fixed coding style issue in FontTest.php

db07926

k00ni removed stale needs decision tests required labels Jul 19, 2023

This was referenced Jul 21, 2023

Parser returns \f at some lines #138

Closed

Some characters are replaced with \f and \b #124

Closed

This was linked to issues Jul 23, 2023

Parser returns \f at some lines #138

Closed

Some characters are replaced with \f and \b #124

Closed

k00ni merged commit 5c82748 into smalot:master Jul 26, 2023
26 checks passed

This was referenced Aug 10, 2023

getText error behind percent sign (%) #110

Closed

Decode to wrong character #607

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

str_replace in Font.php now seems to work as expected #597

str_replace in Font.php now seems to work as expected #597

tskodaw commented Apr 21, 2023 •

edited by k00ni

Loading

k00ni left a comment •

edited

Loading

tskodaw commented Apr 28, 2023 •

edited

Loading

k00ni commented May 10, 2023

radieter commented May 19, 2023 •

edited

Loading

k00ni commented May 22, 2023

tskodaw commented May 23, 2023

k00ni commented May 24, 2023 •

edited

Loading

k00ni commented Jun 5, 2023

k00ni commented Jul 4, 2023

radieter commented Jul 5, 2023

k00ni commented Jul 19, 2023 •

edited

Loading

GreyWyvern commented Jul 21, 2023 •

edited

Loading

str_replace in Font.php now seems to work as expected #597

str_replace in Font.php now seems to work as expected #597

Conversation

tskodaw commented Apr 21, 2023 • edited by k00ni Loading

k00ni left a comment • edited Loading

Choose a reason for hiding this comment

tskodaw commented Apr 28, 2023 • edited Loading

k00ni commented May 10, 2023

radieter commented May 19, 2023 • edited Loading

k00ni commented May 22, 2023

tskodaw commented May 23, 2023

k00ni commented May 24, 2023 • edited Loading

k00ni commented Jun 5, 2023

k00ni commented Jul 4, 2023

radieter commented Jul 5, 2023

k00ni commented Jul 19, 2023 • edited Loading

GreyWyvern commented Jul 21, 2023 • edited Loading

tskodaw commented Apr 21, 2023 •

edited by k00ni

Loading

k00ni left a comment •

edited

Loading

tskodaw commented Apr 28, 2023 •

edited

Loading

radieter commented May 19, 2023 •

edited

Loading

k00ni commented May 24, 2023 •

edited

Loading

k00ni commented Jul 19, 2023 •

edited

Loading

GreyWyvern commented Jul 21, 2023 •

edited

Loading