
Encodings reported by chardetng-py don't always match up to python's decoding #11

Open
john-parton opened this issue Apr 27, 2023 · 18 comments
Labels: documentation, good first issue


@john-parton
Owner

LookupError: unknown encoding: windows-874

In a previous version, I just passed the entire buffer to encoding_rs and had it handle decoding entirely in rust, but that was a little confusing.

Need more robust aliases

@john-parton
Owner Author

Actually it might be even more complicated than that. https://bugs.python.org/issue25416

I think it's possible that encoding_rs and the Python codec will have different output for legacy single-byte encodings.

@Mr0grog
Contributor

Mr0grog commented Oct 17, 2023

For what it’s worth, I went through and:

  1. Compared the names/aliases. windows-874 is the only one chardetng can produce that does not map correctly in Python, so that’s already fully addressed.

  2. Wrote a quick script to compare the single-byte WHATWG encoding definitions (found at https://encoding.spec.whatwg.org/#legacy-single-byte-encodings or, more conveniently, in git at https://github.com/whatwg/encoding) with the Unicode Consortium definitions (found at https://unicode.org/Public/MAPPINGS/); a condensed sketch of the comparison follows after the results below. They are all basically identical except for a few exceptions in KOI8-U and windows-1255:

    MATCH: IBM866
    MATCH: ISO-8859-2
    MATCH: ISO-8859-4
    MATCH: ISO-8859-5
    MATCH: ISO-8859-6
    MATCH: ISO-8859-7
    MATCH: ISO-8859-8
    MATCH: ISO-8859-13
    Definitions for KOI8-U do not match!
      Mismatch at byte 174 (0xae)! WHATWG = point 1118 (0x045e) | Unicode = point 9565 (0x255d)
      Mismatch at byte 190 (0xbe)! WHATWG = point 1038 (0x040e) | Unicode = point 9580 (0x256c)
    MATCH: windows-874
    MATCH: windows-1250
    MATCH: windows-1251
    MATCH: windows-1252
    MATCH: windows-1253
    MATCH: windows-1254
    Definitions for windows-1255 do not match!
      Mismatch at byte 202 (0xca)! WHATWG = point 1466 (0x05ba) | Unicode = point <UNDEFINED>
    MATCH: windows-1256
    MATCH: windows-1257
    MATCH: windows-1258
    

    (The format here is byte <decimal> (<hex>) is the actual byte being decoded and point <decimal> (<hex>) is the unicode code point it should decode as.)

    Caveat: I allowed points that are control characters to be treated as undefined for this comparison, which I think is probably reasonable. There are a bunch more encodings that don’t match up otherwise (i.e. a byte is defined as some control character in WHATWG’s definition and is entirely undefined/unmapped in the Unicode Consortium’s definition, or vice versa).
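
For reference, here’s a condensed sketch of the comparison script (the helper names are mine; WHATWG index files map a pointer 0–127 to a code point for bytes 0x80–0xFF, while the Unicode Consortium files map byte values directly, and the control-character allowance from the caveat above is omitted for brevity):

    def parse_whatwg_index(path):
        # Lines look like: "<pointer decimal>\t0x<code point>\t# <name>"
        table = {}
        for line in open(path, encoding="utf-8"):
            fields = line.split("#")[0].split()
            if len(fields) >= 2:
                table[int(fields[0]) + 0x80] = int(fields[1], 16)
        return table

    def parse_unicode_mapping(path):
        # Lines look like: "0x<byte>\t0x<code point>\t# <name>"
        table = {}
        for line in open(path, encoding="utf-8"):
            fields = line.split("#")[0].split()
            if len(fields) >= 2:
                table[int(fields[0], 16)] = int(fields[1], 16)
        return table

    def compare(name, whatwg, unicode_map):
        bad = [b for b in range(0x80, 0x100) if whatwg.get(b) != unicode_map.get(b)]
        if not bad:
            print(f"MATCH: {name}")
        else:
            print(f"Definitions for {name} do not match!")
            for b in bad:
                print(f"  Mismatch at byte {b} ({b:#04x})! "
                      f"WHATWG = {whatwg.get(b)} | Unicode = {unicode_map.get(b)}")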

These mismatches are definitely not ideal, but I think are also small enough not to be a problem (at least for the use case here: detecting the encoding of some bytes; not actually decoding/encoding them). None of these bytes show up in chardetng’s frequency data in src/data.rs (assuming I’m reading it correctly), which I think means they aren’t really being considered anyway, so their mapping shouldn’t have a notable impact.

On the other hand, the windows-1255 case does feature a byte that could cause errors when [strictly] decoding in Python, and based on chardetng’s docs, I think it wouldn’t otherwise give you an answer that does that, since it says it discards options with decoding failures/unmapped bytes. That said, I think it’s good advice to avoid ever decoding a sniffed encoding strictly in the first place; IIRC chardet and cChardet both occasionally give answers that are wrong enough that they don’t successfully decode the bytes, and code I had using them seems to always use errors="ignore" or errors="replace" when decoding with their guesses. (This might all be worth mentioning in some addendum to the docs, though… I dunno.)

@Mr0grog
Contributor

Mr0grog commented Oct 17, 2023

Ah, some historical details on the ones that differ between WHATWG and Unicode, if you’re curious:

@john-parton
Owner Author

Thanks for looking into that. So it looks like if you use Python to decode, it might not be correct. I think it's probably best to just document this and move on.

@john-parton added the documentation and good first issue labels on Oct 25, 2023
@john-parton
Owner Author

I'm going to keep this issue open because it's lacking documentation, but it looks like the functional problems are well-addressed.

@Mr0grog
Contributor

Mr0grog commented Oct 28, 2023

Yes, good point! I should not have marked that PR as solving this.

I started looking into the multibyte encodings (I think I covered the single-byte ones well), and the situation seems a bit messier. I’m leaving some notes here so I or someone else can document based on them.

Multibyte codecs that Chardetng detects:

  • GBK
  • Big5
  • Shift_JIS
  • ISO-2022-JP
  • EUC-JP
  • EUC-KR

GBK

See #149 for details. Basically, gb18030 is a superset of gbk, which is a superset of gb2312. Chardetng expects decoding with encoding_rs, which treats all of them the same (as gb18030). They are all different decoders in Python, though, so we should (and now do) transform GBK → gb18030, which will work as a decoder in Python for all [correctly detected] cases.
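
A quick illustration (the bytes are “你好” in GBK, which the gb18030 codec decodes identically):

    data = b"\xc4\xe3\xba\xc3"  # "你好" in GBK
    data.decode("gb18030")      # '你好'; gb18030 covers everything GBK does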

Shift_JIS

Oh dear, this is messy and complicated. Shift_JIS turns out to be a really old and limited spec that has been extended in a multitude of ways that are incompatible with each other.

⚠️ TL;DR: If Chardetng returns Shift_JIS, the right way to decode in Python is probably:

  1. Try decoding as ms932.
  2. If that fails, try decoding as shift_jis_2004 (which is a superset of all the others in this family).
  3. If that fails, do whatever your fallback action is for bad detection.

(Caveat: if you want/need to behave like a web browser, skip step 2).

I think it would be lovely if that behavior was bundled up into the detect() function here (but in a way where you could call it separately if you are using EncodingDetector directly), and maybe also use it to inform the confidence of compat.detect(), but I think you could also make the case that it’s out of scope. It also adds in a fair amount of overhead if you aren’t ultimately planning to decode the bytes you’re dealing with (a good reason to put it in detect() and not EncodingDetector).

⚠️ End TL;DR

The most popular versions of Shift_JIS are:

  • Shift_JIS is the original (although in many languages/frameworks, the thing named "shift_jis" is not actually this).

  • Windows-31J, a.k.a. Windows-932 (in Python: ms932 or cp932), is the most widely used version according to WHATWG and Wikipedia, though I couldn’t find evidence for that cited anywhere (probably just because this was the official/first-party Japanese encoding on Windows). It is a superset of Shift_JIS, with four exceptions (see below).

    The exceptions to being a superset: two notable ones (0x5C and 0x7E are treated differently) and two minor ones (0x815F and 0x8160 get visually similar, but not exactly the same, characters). However, in Python there is a STRICT_BUILD flag that, if false or not set (and it appears to be false on my machine), causes shift_jis to treat 0x5C and 0x7E like Windows-932, and makes Windows-932 handle errors in a slightly different way. I have no idea what environments might set this to true and get “accurate” behavior. (A quick probe sketch follows after this list.)

  • Shift_JISx0213 is a standards-body superset of Shift_JIS that adds a lot of characters. (Note: because of the above strict build stuff, this may or may not be an exact superset of Shift_JIS on any given Python install. 😬)

  • Shift_JIS_2004 is a standards-body update of Shift_JISx0213. It adds 10 characters.

    On Python, it is not quite an exact superset because one character is different in a weird and subtle way that I’m not sure is correct or a bug: the bytes 0xFC5A equate to the same glyph, but a different code point! In Shift_JISx0213 it’s U+9B1D, and in Shift_JIS_2004 it’s U+9B1C. I think this is mainly a non-issue because the same bytes work with both and decode to a perfectly equivalent glyph. The only problem is if you do:

    some_bytes.decode('shift_jisx0213').encode('shift_jis_2004')

    Or vice-versa. That issue is way outside the scope of this package, though.
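
Here’s the probe mentioned above (a sketch; the expected results assume the STRICT_BUILD behavior described earlier):

    # On common non-strict builds, shift_jis behaves like Windows-932 here:
    b"\x5c".decode("shift_jis")  # '\\' (backslash); a strict build gives '¥' (U+00A5)
    b"\x7e".decode("shift_jis")  # '~' (tilde); a strict build gives '‾' (U+203E)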

WHATWG (and Chardetng) treats Windows-932 as Shift_JIS. So technically, if Chardetng detects Shift_JIS, it really means windows-932 (or ms932 in Python terms). BUT the detection is loose enough and the overlap in these encodings is big enough that it will typically detect Shift_JISx0213 or Shift_JIS_2004 or Shift_JIS content as Windows-932 and give Shift_JIS as a result.

So if Chardetng returns Shift_JIS, the right way to decode in Python is probably:

  1. Try decoding as ms932.
  2. If that fails, try decoding as shift_jis_2004 (which is a superset of all the others in this family).
  3. If that fails, do whatever your fallback action is for bad detection.

(Caveat: if you want/need to behave like a web browser, skip step 2).
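
A minimal sketch of that fallback chain (the helper name is hypothetical, not part of this package):

    def decode_sniffed_shift_jis(data: bytes) -> str:
        # chardetng's "Shift_JIS" really means windows-932, so try that first.
        try:
            return data.decode("ms932")
        except UnicodeDecodeError:
            pass
        # shift_jis_2004 is a superset of the rest of the family.
        # (Skip this step if you need to behave like a web browser.)
        return data.decode("shift_jis_2004")  # if this also fails, fall back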

In an ideal world, I think that behavior would be bundled up into the detect() function here (but in a way where you could call it separately if you are using EncodingDetector directly), and maybe also use it to inform the confidence of compat.detect(), but I think you could also make the case that it’s out of scope. It also adds in a fair amount of overhead if you aren’t ultimately planning to decode the bytes you’re dealing with (a good reason to put it in detect() and not EncodingDetector).


I didn’t have a chance to dive into the other multibyte ones yet. Quick notes on EUC-KR and Big5, but these need more investigation:

EUC-KR

WHATWG’s and encoding_rs’s definition of EUC-KR is actually UHC/Windows-949, which is an extension of the original EUC-KR. I’m not sure whether that’s a strict superset (in which case this is straightforward and we should swap the name like we did for GBK/gb18030) or not, or what idiosyncrasies exist in how Python treats the two (they are at least separate codecs in Python).

Big5

WHATWG’s and encoding_rs’s definition of Big5 is actually Big5-HKSCS. I’m not sure how much of a strict superset this is, or anything else here. Needs more research.

ISO-2022-JP

Haven’t looked into this at all yet.

EUC-JP

Haven’t looked into this at all yet.

@john-parton
Owner Author

Hm, I think writing up a condensed version of this--something like "these encodings tend to be difficult in general"--and putting it in the docs might be sufficient.

It's obviously pretty tricky and I appreciate you looking into it.

If someone really wants to do what a browser might do, they should probably take the binary stream and pass it directly to encoding_rs. It's not too bad to add bindings. We could create an encoding_rs_py (lol) binding.

@Mr0grog
Contributor

Mr0grog commented Nov 2, 2023

That’s fair. Given that, I think we should probably also (in addition to the docs) rename Shift_JIS to ms932 (as was done with gbk → gb18030), since that is really the closest thing in Python’s repertoire to what chardetng means.

I’m still hoping to write some docs for this, but want to do the research on the other remaining multibyte codecs here first.

@Mr0grog
Contributor

Mr0grog commented Nov 8, 2023

More research:

ISO-2022-JP

ℹ️ TL;DR: None of Python’s built-in encodings is quite equivalent to the WHATWG version of this, but iso2022_jp_ext is the closest (and fairly narrow) superset. For safe usage, it probably makes sense to map ISO-2022-JP to iso2022_jp_ext. ℹ️

ISO-2022-JP has a similarly complicated backstory to GBK with incompatible families of extensions. ISO-2022 is a structure for making a codec that is stateful and switches between different sub-encodings when it encounters certain escape sequences (e.g. when it encounters the bytes ESC ( J it switches to interpreting everything after as if it were the roman component of JIS X 0201-1976, and when it encounters ESC $ B it switches to interpreting everything after as if it were JIS X 0208-1983).

Note: JIS X NNNN is an encoding standard, and JIS X NNNN-YYYY is the version of that standard published in year YYYY.

  • Baseline ISO-2022-JP can switch between:

    1. ASCII
    2. JIS X 0201-1976 (roman characters)
    3. JIS X 0208-1978, and JIS X 0208-1983

    Both Python and WHATWG treat both the JIS X 0208 components as JIS X 0208-1983 instead of treating them differently. The -1983 version is a strict superset, so this really just means they are slightly less strict than the specification.

  • WHATWG ISO-2022-JP is the same as baseline, but it treats the JIS X 0208-1978 version as if it were JIS X 0208-1983 (a superset) and it adds support for:

    1. JIS X 0201-1976 (kana characters) — this and ISO-2022-JP-EXT are the only versions that support the kana characters from JIS X 0201-1976.
  • ISO-2022-JP-EXT is the same as baseline + support for:

    1. JIS X 0201-1976 (kana characters) (like the WHATWG version above)
    2. JIS X 0212-1990

    In Python, this also treats JIS X 0208-1978 as JIS X 0208-1983, like WHATWG does. It’s basically WHATWG + JIS X 0212-1990.

  • ISO-2022-JP-1 is the same as baseline + adds support for:

    1. JIS X 0212-1990
  • ISO-2022-JP-2 is the same as ISO-2022-JP-1 + adds support for:

    1. gb2312 (Chinese)
    2. KS X 1001-1992 (Korean)
    3. ISO-8859-1 (the “high” or extended latin part — bytes above 0xA0)
    4. ISO-8859-7 (the “high” or basic greek part — bytes above 0xA0)
  • ISO-2022-JP-3 does not follow from -JP-2. It is the same as baseline + adds support for:

    1. JIS X 0201-1976 (kana characters) (like the -ext and WHATWG versions above) BUT NOT IN PYTHON. Or at least, Wikipedia suggests this is supposed to be supported, but digging through the Python source confirms it is not in Python’s implementation. Trying this in actual Python (v3.12.0) also confirms it is not supported.
    2. JIS X 0213-2000

    The basic idea here was to replace all the complicated switches in ISO-2022-JP-2 with JIS X 0213, which pretty much supports it all in a more cohesive way.

  • ISO-2022-JP-2004 is the same as ISO-2022-JP-3 but replaces the JIS X 0213-2000 bits with JIS X 0213-2004 (effectively a superset).

  • There are more variations, but these are the most popular and the only ones relevant to Python or WHATWG/chardetng/encoding_rs.

So in Python, nothing is quite equivalent to the WHATWG version (which is what Chardetng is guessing when it says ISO-2022-JP). Straight-up iso2022_jp will fail to decode some things Chardetng thinks are OK, but iso2022_jp_ext is a strict superset of what Chardetng is looking for (the only superset!), so it should always succeed. A bit of fiddling around with various characters that are only supported in some of the sub-encodings listed here seems to confirm this in practice (using CPython 3.12.0).

If we are going ahead with remapping names for compatibility, it probably makes sense to map ISO-2022-JP → iso2022_jp_ext.
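
A quick way to see the difference (the byte values assume the ESC ( I kana escape described above, where 0x31 maps to the halfwidth katakana “ｱ”):

    kana = b"\x1b(I1\x1b(B"        # ESC ( I ... ESC ( B wrapping one kana byte
    kana.decode("iso2022_jp_ext")  # 'ｱ' (U+FF71): the kana escape is supported
    kana.decode("iso2022_jp")      # raises UnicodeDecodeError: no kana support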

EUC-JP

ℹ️ TL;DR: As far as decoding goes, WHATWG/chardetng/encoding_rs’s concept of EUC-JP is the same as Python’s. We shouldn’t need to treat this one specially. ℹ️

EUC encodings are like ISO-2022 and Shift-JIS in that they contain several sub-encodings. Like ISO-2022, they support more and are a bit more flexible than Shift-JIS, but they use a switching mechanism based on which bytes are allowed where, like Shift-JIS, rather than statefully switching modes like ISO-2022. (If you’re familiar with UTF-8, it’s similar.)

For Japanese, there are three popular versions:

  • EUC-JP combines ASCII, JIS X 0201 kana characters (like ISO-2022-JP-EXT), JIS X 0208, and JIS X 0212. In some versions, the ASCII portion is JIS X 0201 roman characters instead, but both Python and WHATWG/encoding_rs/chardetng use ASCII.

    In WHATWG and encoding_rs, the EUC-JP codec does not support encoding JIS X 0212 characters. But they are supported for decoding, so we don’t really care for our purposes here.

  • EUC-JISx0213 combines ASCII and JIS X 0213-2000 (0213 can handle all the stuff from 0201, 0208, and 0212).

    Python changes behavior slightly here based on the STRICT_BUILD build-time flag — the same as Shift-JIS. The impact of this is pretty minor and probably not worth worrying about here.

  • EUC-JIS-2004 is the same as EUC-JISx0213 but upgrades from JIS X 0213-2000 to JIS X 0213-2004 (just a newer version of the standard with a few additional characters).

Neither the WHATWG standard nor encoding_rs/chardetng uses anything from JIS X 0213, and Python’s euc_jp matches their behavior almost exactly. We don’t need to do anything special for this family of encodings.
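
A trivial spot check (0xA4 0xA2 is “あ”, JIS X 0208 row 4, cell 2, in EUC-JP):

    b"\xa4\xa2".decode("euc_jp")  # 'あ', matching what encoding_rs's EUC-JP gives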


Still remaining:

  • EUC-KR
  • Big5

@john-parton
Owner Author

john-parton commented Nov 8, 2023

That all makes sense to me. I wonder if documenting it is sufficient, or if we should emit a warning of some kind.

By putting the mapping logic directly into the rust code, we have created a minor problem if the user wants to know what chardetng actually outputs. For instance, if they want to pass the output to a binding of encoding_rs.

Additionally, if we want to emit a warning using the python warnings module, I'm not sure we can really do that because the mapping is already done in rust.

I think perhaps deviating very slightly from the rust struct and adding a compat flag to the guess method might make sense.

@Mr0grog
Contributor

Mr0grog commented Nov 8, 2023

By putting the mapping logic directly into the rust code, we have created a minor problem if the user wants to know what chardetng actually outputs. For instance, if they want to pass the output to a binding of encoding_rs.

So my feel on this is:

  • Nothing we’ve put in so far really violates this in a problematic way — we’ve only mapped names to their exact equivalents in Python. That said, the name cp874 is not supported in encoding_rs, but there’s no overlap in aliases for that encoding between Python and encoding_rs, so I think we made the right choice, given that this is principally a Python library. (Note: gb18030 is fine because that name works and has the same meaning in both places.)

  • Mapping names to safe supersets (e.g. ISO-2022-JP → iso2022_jp_ext) is obviously a less clear choice. I definitely hear the argument that this is not quite what chardetng means, and if someone has a WHATWG-compatible codec available, they might want to use that instead, since it’s exactly equivalent. OTOH, the operation this package is fundamentally doing is fuzzily trying to find a workable decoding scheme for some bytes. From that perspective, it might be fair to say that a superset of the detected encoding is almost a better answer. 🤷

  • Regardless of the above, if this package returned the names chardetng uses, I think it should prefix them, e.g. return whatwg-Shift_JIS instead of Shift_JIS. That prevents confusing collisions, makes it extremely clear what you’re talking about, and lets someone register a WHATWG-compatible codec (e.g. from encoding_rs) with that name in Python, so a user can then do some_bytes.decode('whatwg-Shift_JIS') and have it just work; a sketch of that registration follows below. (If there were a hypothetical encoding_rs_py package, I would hope it integrates with Python’s built-in encoding machinery this way.)
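
For what it’s worth, a hypothetical sketch of that registration using Python’s codec machinery (the whatwg- prefix and the mapping are assumptions, not anything this package does today):

    import codecs

    def _whatwg_search(name):
        # codecs.lookup() lowercases the requested name before calling
        # registered search functions, so match on the lowercase form.
        if name == "whatwg-shift_jis":
            return codecs.lookup("ms932")
        return None

    codecs.register(_whatwg_search)

    b"abc".decode("whatwg-Shift_JIS")  # 'abc', decoded via Python's ms932 codec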

(Side note: FWIW, after having read up and thought about it more, I think Shift_JIS → ms932 belongs more to the “exact match” category than the “safe superset” category. Also, Python’s canonical name for that is cp932, but both it and encoding_rs recognize ms932, so that one is probably better to use if we did a mapping.)

Some ideas for splitting the difference here…

  1. Add an argument to the EncodingDetector constructor that tells it whether to return Python or WHATWG names. For example:

    detector = EncodingDetector(use_python_names=True)
    detector.feed(some_ms932_bytes)
    detector.guess(tld=None, allow_utf8=False)
    # "ms932"
    
    detector = EncodingDetector(use_python_names=False)
    detector.feed(some_ms932_bytes)
    detector.guess(tld=None, allow_utf8=False)
    # "whatwg-shift_jis" or "shift_jis" or "Shift_JIS"

    I think .detect() and compat.detect(), which go for simplicity, would set this to use Python names, but the default could be to use WHATWG/encoding_rs names.

    (Edit: I think your suggestion to add this as an argument to .guess()/.guess_assess() is probably better than the constructor idea I proposed here. The only bonus with the constructor is that it leaves the method signatures cleanly matching chardetng’s.)

  2. Always return a WHATWG name, but provide a way to register the equivalent Python decoders for it:

    chardetng_py.detect(some_ms932_bytes)
    # "whatwg-shift_jis"
    
    chardetng_py.register_loose_decoders()
    some_ms932_bytes.decode("whatwg-shift_jis")
    # Works by using Python’s `ms932` codec.
  3. Make detect() and compat.detect() use Python-compatible names and EncodingDetector use WHATWG names. Provide the transform as a Python function someone can use with the result from EncodingDetector if they want:

    chardetng_py.detect(some_ms932_bytes)
    # "ms932"
    
    detector = chardetng_py.EncodingDetector()
    detector.feed(some_ms932_bytes)
    result = detector.guess(tld=None, allow_utf8=False)
    # "whatwg-Shift_JIS"
    
    chardetng_py.compatible_codec_name(result)
    # "ms932"

(Edit: updated with a note on idea 2 and added idea 3, which I originally added as a separate comment: #11 (comment). Also fixed some typos where I wrote ms939 instead of ms932)

@john-parton
Owner Author

Interesting, either way. I'll need to think about it for a while.

@Mr0grog
Contributor

Mr0grog commented Nov 8, 2023

Oops, forgot a third: have detect() and compat.detect() return Python-compatible names, and EncodingDetector return WHATWG names. Provide the transform as a Python function someone can call with the result from EncodingDetector.guess() if they want.

@Mr0grog
Contributor

Mr0grog commented Nov 9, 2023

Alright, wrapping up with the final two:

EUC-KR

ℹ️ TL;DR: As far as decoding goes, WHATWG/chardetng/encoding_rs’s concept of EUC-KR is equivalent to Python’s uhc/ms949/cp949. Python’s euc-kr is a subset (sort of) and not safe to consider equivalent. ℹ️

EUC-KR is structured like EUC-JP, but just uses different sub-encodings. The original version had some major drawbacks, so both Microsoft and Apple developed their own extended versions:

  • EUC-KR is the original, and combines ASCII (or ISO-646:KR, which is almost the same) with KS X 1001 (for all the Korean characters). In Python, this is called euc_kr. It doesn’t exist in WHATWG-land.

  • UHC (Unified Hangul Code), or Windows-949, is Microsoft’s set of extensions on top of EUC-KR. In terms of bytes handled, it is a strict superset of EUC-KR, BUT it handles some byte sequences in a different way. It is by far the most popular version of EUC-KR and is also the most popular Korean-specific encoding. In Python, this is called cp949 or ms949 or uhc. In WHATWG, this is EUC-KR.

    What’s with the note about handling some byte sequences differently?

    In Korean (like Chinese and Japanese), most characters are made by combining several simpler characters. The big problem in EUC-KR was that only some of these more complex characters were directly supported. It handled others by recognizing certain sequences of simplified characters that make them up.

    For example, the character “샾” can be represented as a single 2-byte sequence in UHC, but not in EUC-KR. Instead, EUC-KR represents it with a special sequence start byte pair followed by the 3 characters that make up the character:

    Bytes:              A4 D4 A4 B5 A4 C1 A4 BD
                        ^^^^^ ----- ^^^^^ -----
                          │      │    │     └─ ㅍ
                          │      │    └─ ㅑ
                          │      └─ ㅅ
                          └─ <start of sequence>
    
    Decoded as EUC-KR:  샾
        (Code points):  U+C0FE (HANGUL SYLLABLE SYAP)
    

    Since UHC can encode this character as a single 2-byte pair without the special sequence handling (98 DE), it doesn't decode the sequence as a single character; it just treats it literally (i.e. the idea here is that if you want the combined character in UHC, you should use the literal combined character):

    Bytes:              A4 D4 A4 B5 A4 C1 A4 BD
    
       Decoded as UHC:  ㅅㅑㅍ
        (Code points):  U+3164 ("<invisible>" HANGUL FILLER)
                        U+3145 ("ㅅ" HANGUL LETTER SIOS)
                        U+3151 ("ㅑ" HANGUL LETTER YA)
                        U+314D ("ㅍ" HANGUL LETTER PHIEUPH)
    

    Note that U+3164 isn't a visible or meaningful character; it just indicates that this started a sequence so that encoding can safely do a round-trip. In fact, as far as I understand it, that is the whole reason the sequence decodes this way instead of to a single character in UHC:

    some_euc_kr_bytes == some_euc_kr_bytes.decode('uhc').encode('uhc')
  • Mac OS Korean is Apple’s set of extensions on top of EUC-KR. It’s a little more complicated. More importantly, it is not supported at all in Python or WHATWG, so I didn’t investigate this one in much more depth.

This one is ultimately pretty simple. WHATWG’s EUC-KR is the same as Python’s uhc or ms949. Unfortunately there is no alias that is commonly supported in both Python’s built-ins and encoding_rs (similar to the situation with cp874).
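
The sequence example above, as runnable code (results assume the decoding behavior just described):

    seq = bytes.fromhex("a4d4a4b5a4c1a4bd")
    seq.decode("euc_kr")  # '샾' (U+C0FE), the composed syllable
    seq.decode("uhc")     # 'ㅅㅑㅍ' preceded by the invisible U+3164 HANGUL FILLER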

Big5

ℹ️ TL;DR: As far as decoding goes, WHATWG/chardetng/encoding_rs’s concept of Big5 is a mix of the Big5-HKSCS extension (Python: big5-hkscs) and Windows-950 (Python: cp950). There is no equivalent or superset in Python, although big5-hkscs is closest (note: this label, with the hyphen, works in both encoding_rs and Python), even though it is an older version of HKSCS and is missing a bunch of characters supported in the WHATWG version. ℹ️

The backstory on Big5 is pretty complex! Big5 was created in a somewhat ad-hoc way by various computer manufacturers in Taiwan, and was eventually standardized for interoperability. It was still pretty limited, though, so there is a whole mess of more specialized encodings that extend it. Some popular branches of the tree of extensions:

  • ETen adds a handful of control characters, Chinese characters, hiragana, katakana, and Cyrillic. It’s mostly pretty straightforward, but there is some dispute/confusion in the Unicode Consortium’s official mapping table, owing [I think] to conflicts with the various versions of basic Big5 (see above about ad-hoc standardization).

  • Windows-950 is Microsoft’s version of Big5, which adds some of the ETen extensions and a handful of other random characters, like the Euro sign, British Pound sign, Yen sign, etc. It was pretty widely used.

  • Hong Kong Supplementary Character Set (HKSCS) extends ETen. It was originally created by the Hong Kong government (this original version was called GCCS) and was then further extended and standardized as HKSCS. It’s generally the most popular of many forks and forks-of-forks. It differs from Windows-950 in a few hundred characters. It’s also worth noting that this standard has evolved over time, and Python’s implementation looks to be HKSCS-2004, which has 4,941 characters. There have been two subsequent releases (2008, 2016), which bring the total to 5,009 and 5,033 characters, respectively.

Since both Windows-950 and HKSCS were quite popular, WHATWG wound up standardizing on a combination of the two. It seems like it is basically HKSCS-2016 + any additional byte sequences that don’t work in it but do in Windows-950. This basically works out to HKSCS + 12 characters:

Bytes "0xa1 0xc2" = U+00AF ("¯" MACRON)
Bytes "0xa1 0x45" = U+2027 ("‧" HYPHENATION POINT)
Bytes "0xa3 0xe1" = U+20AC ("€" EURO SIGN)
Bytes "0xa2 0x41" = U+2215 ("∕" DIVISION SLASH)
Bytes "0xa1 0xf2" = U+2295 ("⊕" CIRCLED PLUS)
Bytes "0xa1 0xf3" = U+2299 ("⊙" CIRCLED DOT OPERATOR)
Bytes "0xa1 0x4e" = U+FE51 ("﹑" SMALL IDEOGRAPHIC COMMA)
Bytes "0xa2 0x42" = U+FE68 ("﹨" SMALL REVERSE SOLIDUS)
Bytes "0xa1 0xe3" = U+FF5E ("~" FULLWIDTH TILDE)
Bytes "0xa2 0x46" = U+FFE0 ("¢" FULLWIDTH CENT SIGN)
Bytes "0xa2 0x47" = U+FFE1 ("£" FULLWIDTH POUND SIGN)
Bytes "0xa2 0x44" = U+FFE5 ("¥" FULLWIDTH YEN SIGN)

Unfortunately, there is nothing like this built into Python. First off, the big5-hkscs codec is based on the 2004 standard, while WHATWG is based on the 2016 version (92 more characters). You could probably handle this in Python by decoding as big5hkscs and using a custom error handler that handles the above sequences and the missing characters (a sketch follows below), but that’s not great. The right way to think about this is probably that it means “could be big5hkscs or cp950,” since I think what WHATWG was trying to do here is make a decoder that kinda sorta works for both (even though you get somewhat messy results for a lot of Windows-950 text, it works for most characters).
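
A sketch of that error-handler idea, covering just the 12 Windows-950 sequences listed above (the handler name is mine, the 92 newer HKSCS-2016 characters would need similar entries, and this assumes Python’s big5hkscs raises on these byte pairs):

    import codecs

    WINDOWS_950_EXTRAS = {
        b"\xa1\xc2": "\u00af", b"\xa1\x45": "\u2027", b"\xa3\xe1": "\u20ac",
        b"\xa2\x41": "\u2215", b"\xa1\xf2": "\u2295", b"\xa1\xf3": "\u2299",
        b"\xa1\x4e": "\ufe51", b"\xa2\x42": "\ufe68", b"\xa1\xe3": "\uff5e",
        b"\xa2\x46": "\uffe0", b"\xa2\x47": "\uffe1", b"\xa2\x44": "\uffe5",
    }

    def whatwg_big5_errors(err):
        # Substitute the known Windows-950 pairs; re-raise anything else.
        pair = err.object[err.start:err.start + 2]
        if pair in WINDOWS_950_EXTRAS:
            return WINDOWS_950_EXTRAS[pair], err.start + 2
        raise err

    codecs.register_error("whatwg-big5", whatwg_big5_errors)
    # some_bytes.decode("big5hkscs", errors="whatwg-big5")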

Anyway!

  • We could map this to big5-hkscs since that is the closest match. It’s definitely more likely to work than Big5 (the label that Chardetng actually returns), but it’s not really equivalent or even a safe superset, like we have for all the other encodings.
  • Or, since there’s no clear matching/safe Python codec, we could just document this and leave it alone. 🤷

(Edit on 2023-11-13: Rewrote the section on Big5 when I ran into some edge cases today. It’s now much more accurate and detailed.)

@Mr0grog
Contributor

Mr0grog commented Nov 9, 2023

Overall summary of encodings and their support/similarities/differences between WHATWG/encoding_rs and Python’s built-in codecs:

| Single-byte? | WHATWG Name | Python Builtin | Equivalent? | Notes |
| --- | --- | --- | --- | --- |
| Yes | IBM866 | " | ✅ Yes | |
| Yes | ISO-8859-2 | " | ✅ Yes | |
| Yes | ISO-8859-4 | " | ✅ Yes | |
| Yes | ISO-8859-5 | " | ✅ Yes | |
| Yes | ISO-8859-6 | " | ✅ Yes | |
| Yes | ISO-8859-7 | " | ✅ Yes | |
| Yes | ISO-8859-8 | " | ✅ Yes | |
| Yes | ISO-8859-13 | " | ✅ Yes | |
| Yes | KOI8-U | " | ❌ No | WHATWG’s version is actually KOI8-RU, which Python does not have built-in support for. 2 mismatched bytes: 0xAE is “ў” in WHATWG and “╝” in Python; 0xBE is “Ў” in WHATWG and “╬” in Python. There is no better-matching codec in Python. |
| Yes | windows-874 | cp874 | ✅ Yes | Different name, but exact same results. |
| Yes | windows-1250 | " | ✅ Yes | |
| Yes | windows-1251 | " | ✅ Yes | |
| Yes | windows-1252 | " | ✅ Yes | |
| Yes | windows-1253 | " | ✅ Yes | |
| Yes | windows-1254 | " | ✅ Yes | |
| Yes | windows-1255 | " | ❌ No | 1 mismatched byte: 0xCA is U+05BA (HEBREW POINT HOLAM HASER FOR VAV) in WHATWG and undefined in Python. |
| Yes | windows-1256 | " | ✅ Yes | |
| Yes | windows-1257 | " | ✅ Yes | |
| Yes | windows-1258 | " | ✅ Yes | |
| No | GBK | gb18030 | ✅ Yes | This Python name is an alias that works in both Python and encoding_rs. |
| No | Big5 | big5-hkscs | ❌ No | The closest equivalent in Python, but not a safe superset; missing 104 characters that work in the ultra-weird WHATWG version. |
| No | Shift_JIS | ms932 | ✅ Yes | This Python name is an alias that works in both Python and encoding_rs. Basically the same, although a couple characters decode differently depending on the STRICT_BUILD flag set at build time on Python. |
| No | ISO-2022-JP | iso2022_jp_ext | ❌ No | The Python codec handles a superset of the WHATWG encoding; some bytes that would fail in chardetng/encoding_rs decode OK with it in Python. |
| No | EUC-JP | " | ✅ Yes | |
| No | EUC-KR | uhc | ✅ Yes | Different name, but exact same results. |

Notes:

  1. This treats any control characters as equivalent even if one of the codecs doesn't support them. You may have to decode and ignore/skip errors to get exactly matching output if you are handling bytes that include control characters.
  2. " denotes names that are the same as the matching built-in codec in Python. The canonical names of them in Python are a little different, though, e.g. ISO-8859-2iso8859-2.

(Edit on 2023-11-13: Updated the entry for Big5. I wound up tripping over the edge cases on it today and discovered it’s the weirdest one here; WHATWG just sort of slammed two similar encodings together for it. I’ve updated the earlier, detailed comment for Big5 with more info.)

@john-parton
Owner Author

This is overkill in the best possible terms. I definitely think it makes sense to document this behavior. I would almost immediately accept a PR that updates the docs with some information on the more difficult charsets.

@Mr0grog
Contributor

Mr0grog commented Nov 11, 2023

I’m happy to try and wrap this up into something more condensed and readable, but I think it would be good to figure out the actual strategy for what (if any) names are getting remapped under what circumstances first (don’t need to have implemented it yet).

@Mr0grog
Contributor

Mr0grog commented Nov 13, 2023

Quick update: I ran into some weird behavior with Big5 today and had more energy to dive into the details. It turns out to be kinda weird! I updated my detailed comments above if you want to know more, but the short version is: it’s the only one where there is not a clear equivalent/safe superset in Python’s built-in codecs. big5-hkscs, which I’d recommended as equivalent before, is still close, but less close than I’d thought.
