[DETECTION] Wrong encodings for GBK text #365

Closed
felixonmars opened this issue Oct 17, 2023 · 1 comment · Fixed by #366
Labels
detection Related to the charset detection mechanism, chaos/mess/coherence

Comments

@felixonmars

Notice
I hereby announce that my raw input is not:

  • Too small content (<=32 characters) as I do know that ANY charset detector heavily depends on content
  • Encoded in a deprecated/abandoned encoding that is not even supported by my interpreter

Provide the file
gbk.txt

Verbose output
Using the CLI, run normalizer -v ./my-file.txt and paste the result here.

2023-10-17 09:34:34,788 | Level 5 | override steps (5) and chunk_size (512) as content does not fit (71 byte(s) given) parameters.
2023-10-17 09:34:34,788 | Level 5 | Code page ascii does not fit given bytes sequence at ALL. 'ascii' codec can't decode byte 0xcf in position 0: ordinal not in range(128)
2023-10-17 09:34:34,788 | Level 5 | Code page utf_8 does not fit given bytes sequence at ALL. 'utf-8' codec can't decode byte 0xcf in position 0: invalid continuation byte
2023-10-17 09:34:34,788 | Level 5 | Code page big5 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-17 09:34:34,789 | Level 5 | big5 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 30.300000 %.
2023-10-17 09:34:34,789 | Level 5 | Code page big5hkscs is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-17 09:34:34,789 | Level 5 | big5hkscs was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 30.300000 %.
2023-10-17 09:34:34,790 | Level 5 | cp037 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 119.400000 %.
2023-10-17 09:34:34,790 | Level 5 | cp1006 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 290.900000 %.
2023-10-17 09:34:34,790 | Level 5 | cp1026 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,791 | Level 5 | cp1125 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 236.400000 %.
2023-10-17 09:34:34,791 | Level 5 | cp1140 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,792 | Level 5 | cp1250 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 221.500000 %.
2023-10-17 09:34:34,792 | Level 5 | cp1251 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 60.600000 %.
2023-10-17 09:34:34,793 | Level 5 | cp1252 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 410.300000 %.
2023-10-17 09:34:34,793 | Level 5 | Code page cp1253 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xaa in position 27: character maps to <undefined>
2023-10-17 09:34:34,793 | Level 5 | cp1254 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 410.300000 %.
2023-10-17 09:34:34,794 | Level 5 | Code page cp1255 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xda in position 3: character maps to <undefined>
2023-10-17 09:34:34,794 | Level 5 | cp1256 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 139.400000 %.
2023-10-17 09:34:34,794 | Level 5 | cp1257 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 326.000000 %.
2023-10-17 09:34:34,795 | Level 5 | cp1258 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 372.000000 %.
2023-10-17 09:34:34,795 | Level 5 | cp273 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,795 | Level 5 | Code page cp424 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xcf in position 0: character maps to <undefined>
2023-10-17 09:34:34,795 | Level 5 | cp437 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 242.400000 %.
2023-10-17 09:34:34,795 | Level 5 | cp500 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,796 | Level 5 | cp720 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 236.400000 %.
2023-10-17 09:34:34,796 | Level 5 | cp737 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 236.400000 %.
2023-10-17 09:34:34,796 | Level 5 | cp775 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 187.900000 %.
2023-10-17 09:34:34,797 | Level 5 | cp850 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,797 | Level 5 | cp852 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 347.900000 %.
2023-10-17 09:34:34,797 | Level 5 | cp855 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 206.100000 %.
2023-10-17 09:34:34,798 | Level 5 | Code page cp856 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xd6 in position 1: character maps to <undefined>
2023-10-17 09:34:34,798 | Level 5 | cp857 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 517.600000 %.
2023-10-17 09:34:34,798 | Level 5 | cp858 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,798 | Level 5 | cp860 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,799 | Level 5 | cp861 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,799 | Level 5 | cp862 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,799 | Level 5 | cp863 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,800 | Level 5 | cp864 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 200.000000 %.
2023-10-17 09:34:34,800 | Level 5 | cp865 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,800 | Level 5 | cp866 is deemed too similar to code page cp1125 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,800 | Level 5 | cp869 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 200.000000 %.
2023-10-17 09:34:34,801 | Level 5 | cp874 passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-17 09:34:34,803 | Level 5 | cp874 should target any language(s) of ['Thai']
2023-10-17 09:34:34,803 | Level 5 | We detected language [('Thai', 0.1515)] using cp874
2023-10-17 09:34:34,804 | Level 5 | cp875 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 151.500000 %.
2023-10-17 09:34:34,804 | Level 5 | Code page cp932 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-17 09:34:34,804 | Level 5 | cp932 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 193.900000 %.
2023-10-17 09:34:34,805 | Level 5 | Code page cp949 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-17 09:34:34,805 | Level 5 | cp949 passed initial chaos probing. Mean measured chaos is 11.100000 %
2023-10-17 09:34:34,805 | Level 5 | cp949 should target any language(s) of ['Korean']
2023-10-17 09:34:34,805 | Level 5 | Code page cp950 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-17 09:34:34,805 | Level 5 | cp950 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 30.300000 %.
2023-10-17 09:34:34,806 | Level 5 | Code page euc_jis_2004 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-17 09:34:34,806 | Level 5 | euc_jis_2004 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 24.200000 %.
2023-10-17 09:34:34,806 | Level 5 | Code page euc_jisx0213 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-17 09:34:34,806 | Level 5 | euc_jisx0213 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 24.200000 %.
2023-10-17 09:34:34,806 | Level 5 | Code page euc_jp does not fit given bytes sequence at ALL. 'euc_jp' codec can't decode byte 0xcf in position 0: illegal multibyte sequence
2023-10-17 09:34:34,807 | Level 5 | Code page euc_kr is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-17 09:34:34,807 | Level 5 | euc_kr passed initial chaos probing. Mean measured chaos is 11.100000 %
2023-10-17 09:34:34,807 | Level 5 | euc_kr should target any language(s) of ['Korean']
2023-10-17 09:34:34,807 | Level 5 | Code page gb18030 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-17 09:34:34,807 | Level 5 | gb18030 passed initial chaos probing. Mean measured chaos is 11.100000 %
2023-10-17 09:34:34,807 | Level 5 | gb18030 should target any language(s) of ['Chinese']
2023-10-17 09:34:34,807 | Level 5 | We detected language [('Chinese', 0.1667)] using gb18030
2023-10-17 09:34:34,808 | Level 5 | Code page gb2312 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-17 09:34:34,808 | Level 5 | gb2312 passed initial chaos probing. Mean measured chaos is 11.100000 %
2023-10-17 09:34:34,808 | Level 5 | gb2312 should target any language(s) of ['Chinese']
2023-10-17 09:34:34,808 | Level 5 | We detected language [('Chinese', 0.1667)] using gb2312
2023-10-17 09:34:34,808 | Level 5 | Code page gbk is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-17 09:34:34,808 | Level 5 | gbk passed initial chaos probing. Mean measured chaos is 11.100000 %
2023-10-17 09:34:34,808 | Level 5 | gbk should target any language(s) of ['Chinese']
2023-10-17 09:34:34,808 | Level 5 | We detected language [('Chinese', 0.1667)] using gbk
2023-10-17 09:34:34,808 | Level 5 | hp_roman8 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 197.300000 %.
2023-10-17 09:34:34,809 | Level 5 | Code page hz does not fit given bytes sequence at ALL. 'hz' codec can't decode byte 0xcf in position 0: illegal multibyte sequence
2023-10-17 09:34:34,809 | Level 5 | Code page iso2022_jp does not fit given bytes sequence at ALL. 'iso2022_jp' codec can't decode byte 0xcf in position 0: illegal multibyte sequence
2023-10-17 09:34:34,809 | Level 5 | Code page iso2022_jp_1 does not fit given bytes sequence at ALL. 'iso2022_jp_1' codec can't decode byte 0xcf in position 0: illegal multibyte sequence
2023-10-17 09:34:34,809 | Level 5 | Code page iso2022_jp_2 does not fit given bytes sequence at ALL. 'iso2022_jp_2' codec can't decode byte 0xcf in position 0: illegal multibyte sequence
2023-10-17 09:34:34,809 | Level 5 | Code page iso2022_jp_2004 does not fit given bytes sequence at ALL. 'iso2022_jp_2004' codec can't decode byte 0xcf in position 0: illegal multibyte sequence
2023-10-17 09:34:34,810 | Level 5 | Code page iso2022_jp_3 does not fit given bytes sequence at ALL. 'iso2022_jp_3' codec can't decode byte 0xcf in position 0: illegal multibyte sequence
2023-10-17 09:34:34,810 | Level 5 | Code page iso2022_jp_ext does not fit given bytes sequence at ALL. 'iso2022_jp_ext' codec can't decode byte 0xcf in position 0: illegal multibyte sequence
2023-10-17 09:34:34,810 | Level 5 | Code page iso2022_kr does not fit given bytes sequence at ALL. 'iso2022_kr' codec can't decode byte 0xcf in position 0: illegal multibyte sequence
2023-10-17 09:34:34,810 | Level 5 | iso8859_10 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 237.900000 %.
2023-10-17 09:34:34,810 | Level 5 | iso8859_11 passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-17 09:34:34,811 | Level 5 | iso8859_11 should target any language(s) of ['Thai']
2023-10-17 09:34:34,811 | Level 5 | We detected language [('Thai', 0.1515)] using iso8859_11
2023-10-17 09:34:34,811 | Level 5 | iso8859_13 is deemed too similar to code page cp1257 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,811 | Level 5 | iso8859_14 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,811 | Level 5 | iso8859_15 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,812 | Level 5 | iso8859_16 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 162.100000 %.
2023-10-17 09:34:34,812 | Level 5 | iso8859_2 is deemed too similar to code page cp1250 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,812 | Level 5 | Code page iso8859_3 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xc3 in position 7: character maps to <undefined>
2023-10-17 09:34:34,812 | Level 5 | iso8859_4 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,813 | Level 5 | iso8859_5 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 102.800000 %.
2023-10-17 09:34:34,813 | Level 5 | Code page iso8859_6 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xb3 in position 4: character maps to <undefined>
2023-10-17 09:34:34,813 | Level 5 | Code page iso8859_7 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xae in position 58: character maps to <undefined>
2023-10-17 09:34:34,813 | Level 5 | Code page iso8859_8 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xcf in position 0: character maps to <undefined>
2023-10-17 09:34:34,813 | Level 5 | iso8859_9 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,813 | Level 5 | Code page johab does not fit given bytes sequence at ALL. 'johab' codec can't decode byte 0xcf in position 0: illegal multibyte sequence
2023-10-17 09:34:34,814 | Level 5 | koi8_r was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 139.400000 %.
2023-10-17 09:34:34,814 | Level 5 | Code page koi8_t does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xbc in position 21: character maps to <undefined>
2023-10-17 09:34:34,814 | Level 5 | koi8_u was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 133.300000 %.
2023-10-17 09:34:34,814 | Level 5 | kz1048 is deemed too similar to code page cp1251 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,815 | Level 5 | latin_1 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,815 | Level 5 | mac_cyrillic was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 138.300000 %.
2023-10-17 09:34:34,815 | Level 5 | mac_greek was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 112.500000 %.
2023-10-17 09:34:34,816 | Level 5 | mac_iceland was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 374.400000 %.
2023-10-17 09:34:34,816 | Level 5 | mac_latin2 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 224.600000 %.
2023-10-17 09:34:34,816 | Level 5 | mac_roman is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2023-10-17 09:34:34,816 | Level 5 | mac_turkish is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2023-10-17 09:34:34,816 | Level 5 | ptcp154 is deemed too similar to code page cp1251 and was consider unsuited already. Continuing!
2023-10-17 09:34:34,817 | Level 5 | Code page shift_jis does not fit given bytes sequence at ALL. 'shift_jis' codec can't decode byte 0xf5 in position 44: illegal multibyte sequence
2023-10-17 09:34:34,817 | Level 5 | Code page shift_jis_2004 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-17 09:34:34,817 | Level 5 | shift_jis_2004 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 193.900000 %.
2023-10-17 09:34:34,817 | Level 5 | Code page shift_jisx0213 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-17 09:34:34,817 | Level 5 | shift_jisx0213 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 193.900000 %.
2023-10-17 09:34:34,817 | Level 5 | tis_620 passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-17 09:34:34,818 | Level 5 | tis_620 should target any language(s) of ['Thai']
2023-10-17 09:34:34,818 | Level 5 | We detected language [('Thai', 0.1515)] using tis_620
2023-10-17 09:34:34,818 | Level 5 | Encoding utf_16 won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2023-10-17 09:34:34,818 | Level 5 | Code page utf_16_be does not fit given bytes sequence at ALL. 'utf-16-be' codec can't decode bytes in position 36-37: illegal UTF-16 surrogate
2023-10-17 09:34:34,818 | Level 5 | Code page utf_16_le does not fit given bytes sequence at ALL. 'utf-16-le' codec can't decode bytes in position 2-3: illegal UTF-16 surrogate
2023-10-17 09:34:34,818 | Level 5 | Encoding utf_32 won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2023-10-17 09:34:34,818 | Level 5 | Code page utf_32_be does not fit given bytes sequence at ALL. 'utf-32-be' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2023-10-17 09:34:34,818 | Level 5 | Code page utf_32_le does not fit given bytes sequence at ALL. 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2023-10-17 09:34:34,818 | Level 5 | Encoding utf_7 won't be tested as-is because detection is unreliable without BOM/SIG.
2023-10-17 09:34:34,818 | DEBUG | Encoding detection: Found cp874 as plausible (best-candidate) for content. With 2 alternatives.
{
    "path": "/home/felix/projects/arch/packages/gbk.txt",
    "encoding": "cp874",
    "encoding_aliases": [],
    "alternative_encodings": [
        "iso8859_11",
        "tis_620"
    ],
    "language": "Thai",
    "alphabets": [
        "Basic Latin",
        "Thai"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.0,
    "coherence": 15.15,
    "unicode_path": null,
    "is_preferred": true
}
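
The log above is long, and the useful signal is which candidates survived the initial chaos probing. The sketch below pulls those lines out with a regular expression written against the message format shown in this log; the function name `surviving_candidates` is made up for illustration and is not part of charset_normalizer. Other versions of the tool may word these messages differently.

```python
import re

# Matches lines such as:
#   "... | Level 5 | cp874 passed initial chaos probing. Mean measured chaos is 0.000000 %"
# The pattern is an assumption based on the log format pasted in this issue.
PASSED = re.compile(
    r"\|\s*(?P<codec>\w+) passed initial chaos probing\. "
    r"Mean measured chaos is (?P<chaos>[\d.]+) %"
)


def surviving_candidates(log_text: str) -> dict[str, float]:
    """Return {codec: mean chaos in percent} for candidates that passed probing."""
    survivors = {}
    for line in log_text.splitlines():
        match = PASSED.search(line)
        if match:
            survivors[match["codec"]] = float(match["chaos"])
    return survivors


# Three lines copied from the log in this issue.
sample = """\
2023-10-17 09:34:34,801 | Level 5 | cp874 passed initial chaos probing. Mean measured chaos is 0.000000 %
2023-10-17 09:34:34,804 | Level 5 | cp875 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 151.500000 %.
2023-10-17 09:34:34,808 | Level 5 | gbk passed initial chaos probing. Mean measured chaos is 11.100000 %
"""
print(surviving_candidates(sample))
```

Applied to the full log, this would list cp874, iso8859_11 and tis_620 at 0.0 % and cp949, euc_kr, gb18030, gb2312 and gbk at 11.1 %, which matches the final pick of cp874 over the GB family.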

Expected encoding
GB2312, GBK, or GB18030.

chardet detected it correctly:

In [14]: chardet.detect(open("gbk.txt", "rb").read())
Out[14]: {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
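
For context on why a Thai code page can compete with GBK at all: a strict decode succeeds under both, so decodability alone cannot separate them. The sketch below uses only the standard library and a short stand-in string, not the contents of gbk.txt; the helper name `strictly_decodable` is hypothetical. It is not how charset_normalizer or chardet decide, only an illustration of the ambiguity.

```python
def strictly_decodable(data: bytes, candidates: list[str]) -> list[str]:
    """Return the candidate codecs that decode `data` without raising."""
    accepted = []
    for name in candidates:
        try:
            data.decode(name)  # strict error handling is the default
        except UnicodeDecodeError:
            continue
        accepted.append(name)
    return accepted


# Stand-in sample (an assumption, not the attached file): "你好" encoded as GBK.
sample = "你好".encode("gbk")
accepted = strictly_decodable(sample, ["ascii", "utf_8", "cp874", "gbk", "gb18030"])
print(accepted)
```

Both cp874 and the GB family accept these bytes, while ascii and utf_8 reject them, so a detector has to fall back on heuristics such as the chaos and coherence scores shown in the log.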

Desktop (please complete the following information):

  • OS: Linux
  • Python version 3.11.5
  • Package version 3.3.0
felixonmars added the detection and help wanted labels on Oct 17, 2023
@Ousret
Member

Ousret commented Oct 19, 2023

I can confirm this was reproduced.
It is fixed in #366.

The fix should be generally available soon.

Ousret removed the help wanted label on Oct 19, 2023