Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix to iconv() illegal character error (issue #549) #580

Merged
merged 6 commits into from
Apr 24, 2023

Conversation

Stasky745
Copy link
Contributor

In some PDFs this line crashes with:

iconv(): Detected an illegal character in input string

This can be seen in issue #549

This change fixes it. According to iconv documentation:

"If you append the string //TRANSLIT to out_charset transliteration is activated. This means that when a character can't be represented in the target charset, it can be approximated through one or several similarly looking characters. If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded. Otherwise, str is cut from the first illegal character and an E_NOTICE is generated."

@k00ni
Copy link
Collaborator

k00ni commented Feb 23, 2023

@Stasky745 Thank you for this pull request.

Could you add a test to proof its working?

@LucianoHanna can you check it out if it works for you? (#549)

@Stasky745
Copy link
Contributor Author

Stasky745 commented Feb 23, 2023

caca.txt
I've got the contents of the file that cause the error.
Doing this crashes:

<?php
$text = file_get_contents("caca.txt");
echo iconv("CP1252", "UTF-8", $text);

While doing it this way returns the text perfectly:

<?php
$text = file_get_contents("caca.txt");
echo iconv("CP1252", "UTF-8//TRANSLIT//IGNORE", $text);

Is this useful for you?

@k00ni
Copy link
Collaborator

k00ni commented Feb 24, 2023

Is this useful for you?

Not really, sorry :) Were your examples meant to be used for a test? I get the point about demonstrating the failing code though.

If you need help with writing a test, let me know.

The failing Windows tests are not because of your change, I will propably fix them in a separate pull request, so ignore them for now.

@Stasky745
Copy link
Contributor Author

If you need help with writing a test, let me know.

Yeah I don't really know how to write a test. Is there a manual or something?

@k00ni
Copy link
Collaborator

k00ni commented Mar 10, 2023

Yeah I don't really know how to write a test. Is there a manual or something?

No problem. Give us some time and we will bring this forward.

@k00ni
Copy link
Collaborator

k00ni commented Mar 30, 2023

Sorry for the late response. Besides obvious changes I also merged in master branch.

To verify your code change I have to first create a mini-demo which triggers the notice. Your linked file + example code didn't produce the notice on my machine (PHP 8.1.16). I then tried to create an artificial example, but for some reason it always uses UTF-8, so no luck either:

// these 3 lines should trigger the following PHP notice: iconv(): Detected an illegal character in input string
$text = 'äüüß';
$encoded = mb_convert_encoding($text, 'UTF-8', 'CP1252');
iconv("CP1252", "UTF-8", $encoded);

When running this code part (using PHPUnit) locally it should at least mention a notice.

This doesn't mean I don't believe your code works. It is about the fact, that each code change must be accompanied by at least one test, which proofs (to some extent) the code works as expected. Otherwise we would risk a situation in which this library is not reliable anymore.

Could you provide another example or a hint, how to create a string which triggers this notice?

@k00ni k00ni self-assigned this Apr 5, 2023
@eapacheco
Copy link

Hi @k00ni @Stasky745 ! I've just added my issue to #549 and this PR does fix it!

DISCLAIMER: I also mentioned an alternate implementation as I was concerned that //TRANSLIT//IGNORE could compromise the data output, given that transliterating characters can corrupt them (e.g. ß -> ss).
But it turns out that that issue would not affect my current use case, so both approaches are good!

@k00ni
Copy link
Collaborator

k00ni commented Apr 14, 2023

@eapacheco thank you for reporting this.

I would love to merge it in right away, but I can't get my head around the fact, that my test approach doesn't work. Also, changing character-related functionality resulted in issues in the past, therefore I think its important to back such changes with test code.

My spare time is to rare currently to look deeper into it, so if anyone has an idea how to test this fix, I would appreciate it!

@fbett
Copy link

fbett commented Apr 19, 2023

@k00ni

I tried also to fix the test, it was tricky.

Is it possible, that all the characters used in your test are already included in the encodings CP1252 and UTF8 - so no warning should be thrown and it's not a problem, that there is no error?

So here are two unit tests using the provided "caca.txt" file.

public function testError() {
        $sample = file_get_contents( 'caca.txt');
        $this->expectError();
        iconv('CP1252', 'UTF-8', $sample);
    }

    public function testEncoding() {
        $sample = file_get_contents( 'caca.txt');
        $this->assertEquals(
            '     Parcel·la a l’oest del vial d’accés a la C-66: 17016A003000200000EP, que es correspon amb la parcel·la 20 del polígon 3, ',
            iconv('CP1252', 'UTF-8//TRANSLIT//IGNORE', $sample)
        );
    }

The first unit tests the error. The second the expected result (i'm not speaking Spanish? - i hope i got the expected result)

Using PHPUnit 9.5 with configuration:

convertErrorsToExceptions="true"
convertNoticesToExceptions="true"
convertWarningsToExceptions="true"
stopOnFailure="true"

Tested on PHP 8.0.27 and 8.1.14

I have to say, i have expected a ISO-8859-1 or ISO-8859-15 encoding of the provided file, but using this encoding scheme using iconv('iso-8859-1', 'UTF-8', $sample), i get this (wrong) result:

    � Parcel·la a l�oest del vial d�accés a la C-66: 17016A003000200000EP, que es correspon amb la parcel·la 20 del polígon 3, 

-> Also mb_detect_encoding prefers 'iso-8859-1' over 'CP1252' (which is not helping)

Since a while, i'm using the library forceutf8 for problems like this and i gave it a try using \ForceUTF8\Encoding::toUTF8($sample):
It delivers also the same string as expected in testEncoding.

If i'm looking at their code, there seems more things to do, to convert a (faulty?) string to UTF8.
Would it probably better to use a library like this for the utf8 conversion?

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fbett thank you for the feedback.

If i'm looking at their code, there seems more things to do, to convert a (faulty?) string to UTF8. Would it probably better to use a library like this for the utf8 conversion?

Encoding and character sets is a huge field and many libraries struggle when using encoding, multi byte strings etc. Using a library which does this for us might be a good approach, but my argument against is, that it might apply unexpected string transformations and produces broken output. We should rather believe the data coming from the PDF. If this data is broken in the first place, than there is a mess one has to make the best out of it. One way could be: you can read the raw data and transform it afterwards with a library like forceUTF8. Or use another PDF generator which doesn't generate faulty PDFs.

As far as I know there is no best way here, but I am open for suggestions.


How I fixed my test?

The trick was to find a character in CP1252 which has no valid counterpart in UTF-8. In Google I found: https://stackoverflow.com/a/26330256, which mentioned the following bytes:

 0x81, 0x8D, 0x8F, 0x90, 0x9D

They can't be assigned in UTF-8, which lead to the iconv-warning when used. This is backed by https://www.ibm.com/docs/en/rational-synergy/7.2.1?topic=uc-text-encoding-illegal-character-detection-tool.

Also, I had to use PHPUnit 9 to see the notice, with PHPUnit 10 there wasn't any. It seems they changed the config in some way. I wll leave this open for a few days to see if there is any feedback. Will merge it soon.

Acknowledgement: This revelation was possible because of feedback and the help of @zozlak. Thank you!

@warna720
Copy link

Can we get a new release? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Notice: iconv(): Detected an illegal character in input string
5 participants