fix to iconv() illegal character error (issue #549) #580

Stasky745 · 2023-02-23T11:30:41Z

In some PDFs this line crashes with:

iconv(): Detected an illegal character in input string

This can be seen in issue #549

This change fixes it. According to iconv documentation:

"If you append the string //TRANSLIT to out_charset transliteration is activated. This means that when a character can't be represented in the target charset, it can be approximated through one or several similarly looking characters. If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded. Otherwise, str is cut from the first illegal character and an E_NOTICE is generated."

k00ni · 2023-02-23T12:26:57Z

@Stasky745 Thank you for this pull request.

Could you add a test to proof its working?

@LucianoHanna can you check it out if it works for you? (#549)

Stasky745 · 2023-02-23T13:49:36Z

caca.txt
I've got the contents of the file that cause the error.
Doing this crashes:

<?php
$text = file_get_contents("caca.txt");
echo iconv("CP1252", "UTF-8", $text);

While doing it this way returns the text perfectly:

<?php
$text = file_get_contents("caca.txt");
echo iconv("CP1252", "UTF-8//TRANSLIT//IGNORE", $text);

Is this useful for you?

k00ni · 2023-02-24T08:02:25Z

Is this useful for you?

Not really, sorry :) Were your examples meant to be used for a test? I get the point about demonstrating the failing code though.

If you need help with writing a test, let me know.

The failing Windows tests are not because of your change, I will propably fix them in a separate pull request, so ignore them for now.

Stasky745 · 2023-02-24T13:49:15Z

If you need help with writing a test, let me know.

Yeah I don't really know how to write a test. Is there a manual or something?

k00ni · 2023-03-10T10:16:42Z

Yeah I don't really know how to write a test. Is there a manual or something?

No problem. Give us some time and we will bring this forward.

because it fails in versions < 10

k00ni · 2023-03-30T08:02:00Z

Sorry for the late response. Besides obvious changes I also merged in master branch.

To verify your code change I have to first create a mini-demo which triggers the notice. Your linked file + example code didn't produce the notice on my machine (PHP 8.1.16). I then tried to create an artificial example, but for some reason it always uses UTF-8, so no luck either:

pdfparser/tests/PHPUnit/Integration/FontTest.php

Lines 460 to 463 in b1242d4

    
           // these 3 lines should trigger the following PHP notice: iconv(): Detected an illegal character in input string 
        
           $text = 'äüüß'; 
        
           $encoded = mb_convert_encoding($text, 'UTF-8', 'CP1252'); 
        
           iconv("CP1252", "UTF-8", $encoded);

When running this code part (using PHPUnit) locally it should at least mention a notice.

This doesn't mean I don't believe your code works. It is about the fact, that each code change must be accompanied by at least one test, which proofs (to some extent) the code works as expected. Otherwise we would risk a situation in which this library is not reliable anymore.

Could you provide another example or a hint, how to create a string which triggers this notice?

eapacheco · 2023-04-13T20:46:36Z

Hi @k00ni @Stasky745 ! I've just added my issue to #549 and this PR does fix it!

DISCLAIMER: I also mentioned an alternate implementation as I was concerned that //TRANSLIT//IGNORE could compromise the data output, given that transliterating characters can corrupt them (e.g. ß -> ss).
But it turns out that that issue would not affect my current use case, so both approaches are good!

k00ni · 2023-04-14T08:34:12Z

@eapacheco thank you for reporting this.

I would love to merge it in right away, but I can't get my head around the fact, that my test approach doesn't work. Also, changing character-related functionality resulted in issues in the past, therefore I think its important to back such changes with test code.

My spare time is to rare currently to look deeper into it, so if anyone has an idea how to test this fix, I would appreciate it!

fbett · 2023-04-19T13:40:47Z

@k00ni

I tried also to fix the test, it was tricky.

Is it possible, that all the characters used in your test are already included in the encodings CP1252 and UTF8 - so no warning should be thrown and it's not a problem, that there is no error?

So here are two unit tests using the provided "caca.txt" file.

public function testError() {
        $sample = file_get_contents( 'caca.txt');
        $this->expectError();
        iconv('CP1252', 'UTF-8', $sample);
    }

    public function testEncoding() {
        $sample = file_get_contents( 'caca.txt');
        $this->assertEquals(
            '     Parcel·la a l’oest del vial d’accés a la C-66: 17016A003000200000EP, que es correspon amb la parcel·la 20 del polígon 3, ',
            iconv('CP1252', 'UTF-8//TRANSLIT//IGNORE', $sample)
        );
    }

The first unit tests the error. The second the expected result (i'm not speaking Spanish? - i hope i got the expected result)

Using PHPUnit 9.5 with configuration:

convertErrorsToExceptions="true"
convertNoticesToExceptions="true"
convertWarningsToExceptions="true"
stopOnFailure="true"

Tested on PHP 8.0.27 and 8.1.14

I have to say, i have expected a ISO-8859-1 or ISO-8859-15 encoding of the provided file, but using this encoding scheme using iconv('iso-8859-1', 'UTF-8', $sample), i get this (wrong) result:

    � Parcel·la a l�oest del vial d�accés a la C-66: 17016A003000200000EP, que es correspon amb la parcel·la 20 del polígon 3,

-> Also mb_detect_encoding prefers 'iso-8859-1' over 'CP1252' (which is not helping)

Since a while, i'm using the library forceutf8 for problems like this and i gave it a try using \ForceUTF8\Encoding::toUTF8($sample):
It delivers also the same string as expected in testEncoding.

If i'm looking at their code, there seems more things to do, to convert a (faulty?) string to UTF8.
Would it probably better to use a library like this for the utf8 conversion?

…he fix

k00ni

@fbett thank you for the feedback.

If i'm looking at their code, there seems more things to do, to convert a (faulty?) string to UTF8. Would it probably better to use a library like this for the utf8 conversion?

Encoding and character sets is a huge field and many libraries struggle when using encoding, multi byte strings etc. Using a library which does this for us might be a good approach, but my argument against is, that it might apply unexpected string transformations and produces broken output. We should rather believe the data coming from the PDF. If this data is broken in the first place, than there is a mess one has to make the best out of it. One way could be: you can read the raw data and transform it afterwards with a library like forceUTF8. Or use another PDF generator which doesn't generate faulty PDFs.

As far as I know there is no best way here, but I am open for suggestions.

How I fixed my test?

The trick was to find a character in CP1252 which has no valid counterpart in UTF-8. In Google I found: https://stackoverflow.com/a/26330256, which mentioned the following bytes:

 0x81, 0x8D, 0x8F, 0x90, 0x9D

They can't be assigned in UTF-8, which lead to the iconv-warning when used. This is backed by https://www.ibm.com/docs/en/rational-synergy/7.2.1?topic=uc-text-encoding-illegal-character-detection-tool.

Also, I had to use PHPUnit 9 to see the notice, with PHPUnit 10 there wasn't any. It seems they changed the config in some way. I wll leave this open for a few days to see if there is any feedback. Will merge it soon.

Acknowledgement: This revelation was possible because of feedback and the help of @zozlak. Thank you!

warna720 · 2023-04-28T17:22:56Z

Can we get a new release? Thanks!

fix to iconv() illegal character error (issue smalot#549)

6fba87b

k00ni added needs work tests required fix labels Feb 23, 2023

k00ni linked an issue Feb 23, 2023 that may be closed by this pull request

Notice: iconv(): Detected an illegal character in input string #549

Closed

k00ni mentioned this pull request Mar 30, 2023

Notice: iconv(): Detected an illegal character in input string #549

Closed

k00ni added 4 commits March 30, 2023 08:51

Merge remote-tracking branch 'origin' into Stasky745_master

41bd45f

display warnings in PHPUnit; added incomplete test to demonstrate fix

b1242d4

revert --display-notices in Makefile

c9e3aed

because it fails in versions < 10

Fixed coding style problem

86bad65

k00ni self-assigned this Apr 5, 2023

k00ni added the help wanted label Apr 14, 2023

FontTest.php: finalized test which triggers notice when don't using t…

17cd230

…he fix

k00ni approved these changes Apr 20, 2023

View reviewed changes

k00ni removed needs work help wanted tests required labels Apr 20, 2023

k00ni merged commit 9094d77 into smalot:master Apr 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix to iconv() illegal character error (issue #549) #580

fix to iconv() illegal character error (issue #549) #580

Stasky745 commented Feb 23, 2023

k00ni commented Feb 23, 2023

Stasky745 commented Feb 23, 2023 •

edited

Loading

k00ni commented Feb 24, 2023 •

edited

Loading

Stasky745 commented Feb 24, 2023

k00ni commented Mar 10, 2023

k00ni commented Mar 30, 2023 •

edited

Loading

eapacheco commented Apr 13, 2023

k00ni commented Apr 14, 2023 •

edited

Loading

fbett commented Apr 19, 2023 •

edited

Loading

k00ni left a comment •

edited

Loading

warna720 commented Apr 28, 2023

fix to iconv() illegal character error (issue #549) #580

fix to iconv() illegal character error (issue #549) #580

Conversation

Stasky745 commented Feb 23, 2023

k00ni commented Feb 23, 2023

Stasky745 commented Feb 23, 2023 • edited Loading

k00ni commented Feb 24, 2023 • edited Loading

Stasky745 commented Feb 24, 2023

k00ni commented Mar 10, 2023

k00ni commented Mar 30, 2023 • edited Loading

eapacheco commented Apr 13, 2023

k00ni commented Apr 14, 2023 • edited Loading

fbett commented Apr 19, 2023 • edited Loading

k00ni left a comment • edited Loading

Choose a reason for hiding this comment

warna720 commented Apr 28, 2023

Stasky745 commented Feb 23, 2023 •

edited

Loading

k00ni commented Feb 24, 2023 •

edited

Loading

k00ni commented Mar 30, 2023 •

edited

Loading

k00ni commented Apr 14, 2023 •

edited

Loading

fbett commented Apr 19, 2023 •

edited

Loading

k00ni left a comment •

edited

Loading