Codecs revisited #28

harjitmoe · 2022-07-20T19:33:58Z

I haven't done anything on this for several days, so I should probably PR what I have. (x-mac-japanese would be interesting, since it has some of the mapping issues of x-mac-korean plus the fact that it is, on the conservative side, two different encodings in a trenchcoat, so I should probably keep that one on the backburner for now.) This is basically several things I've been wanting to address for some time now. The main ones are:

Improvements to xraydict to accept a filter function as well as the exclusion list. This means that long hardcoded lists of exclusions (for encodings defined by their difference from other similar encodings) are less necessary.
Seven 1978 JIS (JIS C 6226-1978) mappings have been revised. Five of these (蝉蟬, 騨驒, 箪簞, 剥剝, 屏屛) now take into consideration disunifications in 2000 JIS (JIS X 0213-2000) and 2004 JIS, i.e. cases where the 1978 character actually corresponded to a different (usually less simplified) character in the 2004 standard and should be mapped to Unicode as such. Previously they only followed disunifications made in 1990 JIS (JIS X 0208-1990 with JIS X 0212-1990). The other two (昻 vs 昂) are swapped between a standard position and a position in the NEC Selection of IBM Extensions, since this is apparently closer to the 1978 revision, and is indeed one of the swaps between the "old" (partly 1978 based) and "new" (fully 1983+ based) JIS sequences as implemented by IBM. These have minor effects on the jis_encoding codec, and therefore also on the decoding behaviour for the ISO-2022-JP family except for iso-2022-jp itself, but only when the older ESC $ @ rather than ESC $ B appears in input.
Speaking of ISO-2022-JP, I have added a documentation section explaining how the two decoders' response to sequences unlikely to be generated by a single encode operation differs from the UTR#36/WHATWG approach, the Python approach, and the two "end states" of UTC L2/20-202. I have not changed this part of their behaviour, only documented it.
An x-mac-korean codec. Python's “temporary mac CJK aliases, will be replaced by proper codecs in 3.1” never were replaced and still bear that notice, lol; by contrast, this brings the number of them with proper Kuroko support up to three out of four. Of all legacy Macintosh encodings, MacKorean is easily the one with the largest number of characters that don't exist in Unicode (all of them exist in Adobe-Korea, though not in Adobe-KR). I have deliberately deviated from the mappings I have for them (three from Apple and one from Adobe, some partial, some with kludge mappings) to ① take advantage of closer (usually newer) Unicode representations, ② avoid decoding non-ASCII to sequences with non‑alphanumeric ASCII substrings, since they could be syntactically significant, and ③ generally avoid using Apple's Corporate Private Use Area, at the expense of roundtripping.
The johab-ebcdic decoder is likewise changed to avoid using IBM's Corporate Private Use Area, at the expense of roundtripping.
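
To illustrate the xraydict item above: the idea is a mapping defined by its difference from a similar base mapping, where some base entries are hidden either by an explicit exclusion set or by a predicate. The following is only a rough sketch of that idea in plain Python. The class name `OverlayMapping`, its constructor parameters, and the convention that the filter returns true for keys to hide are all assumptions made for illustration, not xraydict's actual interface.

```python
class OverlayMapping:
    """Hypothetical stand-in for the idea behind xraydict (not its real API)."""

    def __init__(self, base, overlay, exclusions=frozenset(), filter_function=None):
        self.base = base
        self.overlay = overlay
        # A set gives constant-time membership tests on average,
        # whereas a list would be searched linearly on every lookup.
        self.exclusions = frozenset(exclusions)
        # Assumed convention: the predicate returns True for base keys to hide.
        self.filter_function = filter_function

    def _hidden(self, key):
        if key in self.exclusions:
            return True
        return self.filter_function is not None and self.filter_function(key)

    def __contains__(self, key):
        if key in self.overlay:
            return True
        return key in self.base and not self._hidden(key)

    def __getitem__(self, key):
        # Overlay entries always win; base entries show through unless hidden.
        if key in self.overlay:
            return self.overlay[key]
        if key in self.base and not self._hidden(key):
            return self.base[key]
        raise KeyError(key)


# Toy data, invented for this example.
base = {0x41: "A", 0x5C: "\\", 0x7E: "~", 0xA1: "x", 0xA2: "y"}
variant = OverlayMapping(
    base,
    overlay={0x5C: "\u00a5"},
    exclusions={0x7E},
    filter_function=lambda code: code >= 0xA0,
)
```

Here one predicate (`code >= 0xA0`) replaces what would otherwise be a hardcoded list of every excluded code, which is the motivation given above.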


Mappings for 25-23 and 90-22 were previously the same as those used for 97JIS; they have been swapped to correspond with how the IBM extension versus the standard code are mapped in the "old sequence" (78JIS-based) as opposed to the "new sequence". Mappings for 32-70, 34-45, 35-29, 39-77 and 54-02 in 78JIS have been changed to reflect disunifications made in 2000-JIS and 2004-JIS, assigning the 1978-edition unsimplified variants of those characters separate coded forms (where previously, only swaps and disunifications in 83JIS and disunifications in 90JIS (including JIS X 0212) had been considered). This only affects the `jis_encoding` codec (including the decoding direction for `iso-2022-jp-2`, `iso-2022-jp-3` and `iso-2022-jp-2004`), and the decoding is only affected when `ESC $ @` (not `ESC $ B`) is used. The `iso-2022-jp` codec is unaffected, and remains similar to (but more consistently pedantic than) the WHATWG specification, thus using the same table for both 78JIS and 97JIS.
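
As a rough sketch of why only `ESC $ @` input is affected: a decoder that honours the distinction keeps one table per designation and switches between them on the escape sequence. The function `decode_sketch` and the one-entry tables below are hypothetical and heavily simplified (no error recovery, no other character sets); they are not the `jis_encoding` implementation. The single entry shown is 32-70, mapped to the unsimplified 蟬 for 78JIS and to 蝉 otherwise, as described above.

```python
ESC = 0x1B

# Illustrative one-entry tables keyed by (row, cell);
# real tables hold thousands of entries.
TABLE_78JIS = {(32, 70): "\u87ec"}  # 蟬 (unsimplified form)
TABLE_97JIS = {(32, 70): "\u8749"}  # 蝉


def decode_sketch(data):
    """Decode bytes, switching tables on ESC $ @ / ESC $ B / ESC ( B."""
    out = []
    table = None  # None means the ASCII set is designated
    i = 0
    while i < len(data):
        if data[i] == ESC:
            seq = bytes(data[i:i + 3])
            if seq == b"\x1b$@":
                table = TABLE_78JIS
            elif seq == b"\x1b$B":
                table = TABLE_97JIS
            elif seq == b"\x1b(B":
                table = None
            else:
                raise ValueError("escape sequence not handled by this sketch")
            i += 3
        elif table is None:
            # Single-byte mode: pass ASCII through, replace anything else.
            out.append(chr(data[i]) if data[i] < 0x80 else "\ufffd")
            i += 1
        else:
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte code")
            row, cell = data[i] - 0x20, data[i + 1] - 0x20
            out.append(table.get((row, cell), "\ufffd"))
            i += 2
    return "".join(out)
```

With these toy tables, the same two bytes `@f` decode differently depending on which designation precedes them, while a decoder that uses one table for both (as `iso-2022-jp` does) would return the same character either way.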


harjitmoe · 2022-07-20T19:48:35Z

Withdrawing review request while I fix something I just noticed.

harjitmoe · 2022-07-20T19:55:07Z

Okay, done.

harjitmoe added 18 commits July 12, 2022 18:48

xraydict functionality and usage improvements

bf2fa84

Add a filter_function to xraydict, allowing fewer big data structures. Make uses of xraydict prefer exclusion sets to exclusion lists, to avoid repeated linear search of a list.

Make big5_coded_forms_from_hkscs a set, remove set trailing commas.

d58d35e

Remove big5_coded_forms_from_hkscs in favour of a filter function.

75102d9

Similarly, use sets for 7-bit exclusion lists except when really short.

fdb96ba

Make johab-ebcdic decoder use many-to-one, not corporate PUA.

84f5d05

Many-to-one decodes are not uncommon in CJK encodings (e.g. Windows-31J), and mapping to the IBM Corporate PUA (code page 1449) would probably make it render as completely the wrong character if at all in practice.
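
A small sketch of what many-to-one decoding costs, using the well-known Windows-31J duplicates of the fullwidth not sign as table data. The dictionaries and the `roundtrips` helper are hypothetical illustrations, not code from this PR, and the rule that the first listed code wins on encode is an assumption of the sketch.

```python
# Decode table excerpt: three codes, one character (many-to-one).
DECODE = {0x81CA: "\uffe2", 0xEEF9: "\uffe2", 0xFA54: "\uffe2"}

# Build the encode table so that the first listed code wins
# (dicts preserve insertion order in Python 3.7+).
ENCODE = {}
for code, char in DECODE.items():
    ENCODE.setdefault(char, code)


def roundtrips(code):
    """True if decoding then re-encoding gives back the same code."""
    return ENCODE[DECODE[code]] == code
```

Only the preferred code survives a decode and re-encode; the duplicates come back as the preferred code. That is the roundtripping given up in exchange for decoding to a character that renders correctly.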

Switch cp950_no_eudc_encoding_map away from a hardcoded exclusion l…

3614846

…ist.

Codec support for x-mac-korean.

e60eab8

Add a test bit for the UTF-8 wrapper.

521c0e6

Document the unique error-condition definition of the ISO-2022-JP codec.

e7e50ab

Update docs now there is an actual implementation for x-mac-korean.

6a1d114

Further explanations of the hazards of jis_encoding.

e5e6fb8

Sanitised → Sanitised or escaped.

5f63171

Further clarify the status with not verifying Shift In.

93bd62f

Corrected description of End State 2.

2d56343

Changes to MacKorean to avoid mapping non-ASCII using ASCII punctuation.

9bf7193

Merge remote-tracking branch 'origin/master' into codecs-revisited

e1e17a9

Extraneous word "still".

4915f52

harjitmoe requested review from klange and removed request for klange July 20, 2022 19:41

Fix omitting MacKorean single-byte codes.

f3457a5

harjitmoe requested a review from klange July 20, 2022 19:55

klange merged commit a580a83 into master Jul 22, 2022

klange deleted the codecs-revisited branch July 22, 2022 23:32
