Codecs revisited #28

Merged
merged 19 commits into from
Jul 22, 2022

harjitmoe
Contributor

@harjitmoe harjitmoe commented Jul 20, 2022

I haven't done anything on this for several days, so I should probably PR what I have.  (x-mac-japanese would be interesting: it has some of the mapping issues of x-mac-korean, and is, at a conservative count, two different encodings in a trenchcoat, so I'm keeping that one on the back burner for now.)  This PR covers several things I've been wanting to address for some time.  The main ones are:

  • Improvements to xraydict to accept a filter function as well as the exclusion list.  This means that long hardcoded lists of exclusions (for encodings defined by their difference from other similar encodings) are less necessary.
  • Seven 1978 JIS (JIS C 6226-1978) mappings have been revised.  Five of these (蝉蟬, 騨驒, 箪簞, 剥剝, 屏屛) now take into account disunifications made in 2000 JIS (JIS X 0213-2000) and 2004 JIS, i.e. cases where the 1978 character actually corresponds to a different (usually less simplified) character in the 2004 standard and should be mapped to Unicode as such.  Previously, they only followed the disunifications made in 1990 JIS (JIS X 0208-1990 with JIS X 0212-1990).  The other two (昻 vs 昂) are swapped between a standard position and a position in the NEC Selection of IBM Extensions, since this is apparently closer to the 1978 edition, and is indeed one of the swaps between the "old" (partly 1978 based) and "new" (fully 1983+ based) JIS sequences as implemented by IBM.  These changes have minor effects on the jis_encoding codec, and therefore also on the decoding behaviour of the ISO-2022-JP family except for iso-2022-jp itself, but only when the older ESC $ @ rather than ESC $ B appears in the input.
  • Speaking of ISO-2022-JP, I have added a documentation section explaining how the two decoders' response to sequences unlikely to be generated by a single encode operation differs from the UTR#36/WHATWG approach, the Python approach, and the two "end states" of UTC L2/20-202.  I have not changed this part of their behaviour, only documented it.
  • An x-mac-korean codec.  Python has four “temporary mac CJK aliases, will be replaced by proper codecs in 3.1” (they never were, and still bear that notice, lol); with this, three of the four now have proper Kuroko codecs by contrast.  Of all legacy Macintosh encodings, MacKorean is easily the one with the largest number of characters that don't exist in Unicode (all of them exist in Adobe-Korea, though not in Adobe-KR).  I have deliberately deviated from the mappings I have for it (three from Apple and one from Adobe, some partial, some with kludge mappings) in order to ① take advantage of closer (usually newer) Unicode representations, ② avoid decoding non-ASCII to sequences with non‑alphanumeric ASCII substrings, since they could be syntactically significant, and ③ generally avoid using Apple's Corporate Private Use Area, at the expense of roundtripping.
  • The johab-ebcdic decoder is likewise changed to avoid using IBM's Corporate Private Use Area, at the expense of roundtripping.
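
To illustrate the first item, here is a minimal sketch of a dictionary view that hides keys through both an exclusion set and a filter function. It is written in Python rather than Kuroko, and `OverlayMapping`, its parameter names and the convention that the filter returns true for keys to keep are all assumptions made for this example; the actual xraydict interface may differ.

```python
class OverlayMapping:
    """Hypothetical stand-in for an xraydict-like view (not the real API)."""

    def __init__(self, base, exclusions=frozenset(), filter_function=None):
        self.base = base
        # A set gives constant-time membership tests, unlike a list.
        self.exclusions = frozenset(exclusions)
        # Assumed convention: the filter returns True for keys to keep.
        self.filter_function = filter_function

    def _visible(self, key):
        if key in self.exclusions:
            return False
        if self.filter_function is not None and not self.filter_function(key):
            return False
        return key in self.base

    def __contains__(self, key):
        return self._visible(key)

    def __getitem__(self, key):
        if not self._visible(key):
            raise KeyError(key)
        return self.base[key]


# Illustrative base table: three ASCII codes and one non-ASCII code.
base = {0x41: "A", 0x5C: "\\", 0x7E: "~", 0xA1: "\uff61"}

# Hide one code explicitly, and everything at or above 0x80 by rule.
view = OverlayMapping(base, exclusions={0x5C}, filter_function=lambda k: k < 0x80)
```

With a rule such as `lambda k: k < 0x80`, a derived encoding no longer needs to enumerate every excluded code of the encoding it is based on.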

harjitmoe added 18 commits July 12, 2022 18:48
Add a filter_function to xraydict, allowing fewer big data structures. Make
uses of xraydict prefer exclusion sets to exclusion lists, to avoid
repeated linear search of a list.

Mappings for 25-23 and 90-22 were previously the same as those used for
97JIS; they have been swapped to correspond with how the IBM extension
versus the standard code are mapped in the "old sequence" (78JIS-based)
as opposed to the "new sequence".

Mappings for 32-70, 34-45, 35-29, 39-77 and 54-02 in 78JIS have been
changed to reflect disunifications made in 2000-JIS and 2004-JIS, assigning
the 1978-edition unsimplified variants of those characters separate coded
forms (where previously, only swaps and disunifications in 83JIS and
disunifications in 90JIS (including JIS X 0212) had been considered).

This only affects the `jis_encoding` codec (including the decoding
direction for `iso-2022-jp-2`, `iso-2022-jp-3` and `iso-2022-jp-2004`),
and the decoding is only affected when `ESC $ @` (not `ESC $ B`) is used.
The `iso-2022-jp` codec is unaffected, and remains similar to (but more
consistently pedantic than) the WHATWG specification, thus using the same
table for both 78JIS and 97JIS.
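
The escape-sequence dependence described above can be sketched as follows. This is a simplified Python illustration, not the Kuroko codec: `decode_sketch` and the two single-pair tables are hypothetical, and the direction of the 昂/昻 swap is this sketch's reading of the description (it assumes 97JIS maps 25-23 to 昂).

```python
ESC_78 = b"\x1b$@"     # ESC $ @ designates the 1978 edition
ESC_83 = b"\x1b$B"     # ESC $ B designates the 1983 and later editions
ESC_ASCII = b"\x1b(B"  # ESC ( B returns to ASCII

# Illustrative tables keyed by (row, cell); real tables have thousands of entries.
TABLE_97JIS = {(25, 23): "\u6602", (90, 22): "\u663b"}  # 昂, 昻 (assumed)
TABLE_78JIS = {(25, 23): "\u663b", (90, 22): "\u6602"}  # swapped pair


def decode_sketch(data):
    out = []
    table = None  # None means single-byte ASCII mode
    i = 0
    while i < len(data):
        if data.startswith(ESC_78, i):
            table = TABLE_78JIS
            i += 3
        elif data.startswith(ESC_83, i):
            table = TABLE_97JIS
            i += 3
        elif data.startswith(ESC_ASCII, i):
            table = None
            i += 3
        elif table is None:
            out.append(chr(data[i]))
            i += 1
        elif i + 1 >= len(data):
            out.append("\ufffd")  # truncated two-byte code
            break
        else:
            # Each 7-bit byte is the row or cell number plus 0x20.
            key = (data[i] - 0x20, data[i + 1] - 0x20)
            out.append(table.get(key, "\ufffd"))
            i += 2
    return "".join(out)
```

Under these assumptions the same two bytes, 0x39 0x37 (row 25, cell 23), decode differently depending on which escape sequence precedes them, which is why input using only `ESC $ B` is unaffected.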

Many-to-one decodes are not uncommon in CJK encodings (e.g. Windows-31J),
and mapping to the IBM Corporate PUA (code page 1449) would in practice
probably make the character render as completely the wrong character, if
it renders at all.
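
The trade-off between a many-to-one decode and a Private Use Area mapping can be sketched like this. The byte codes and characters below are hypothetical and are not taken from any real table; `build_encoder` and `roundtrips` are names invented for this example.

```python
# Hypothetical codes: a standard position and a duplicate of the same character.
STANDARD = b"\x81\x40"
DUPLICATE = b"\xfa\x40"

# Policy 1: both codes decode to the same ordinary character (many-to-one).
DECODE_MANY_TO_ONE = {STANDARD: "\u4e00", DUPLICATE: "\u4e00"}

# Policy 2: the duplicate decodes to a Private Use Area code point instead.
DECODE_WITH_PUA = {STANDARD: "\u4e00", DUPLICATE: "\ue000"}


def build_encoder(decode_table):
    """Invert a decode table; the first code listed for a character wins."""
    encode = {}
    for code, char in decode_table.items():
        encode.setdefault(char, code)
    return encode


def roundtrips(decode_table, code):
    """True if decoding then re-encoding gives back the original code."""
    encode = build_encoder(decode_table)
    return encode[decode_table[code]] == code
```

With the many-to-one table the duplicate code decodes to a character that renders correctly but re-encodes to the standard position; with the PUA table it roundtrips, at the cost of producing a code point that most fonts will not display as intended.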
@harjitmoe harjitmoe requested review from klange and removed request for klange July 20, 2022 19:41
@harjitmoe
Contributor Author

Withdrawing review request while I fix something I just noticed.

@harjitmoe
Contributor Author

Okay, done.

@harjitmoe harjitmoe requested a review from klange July 20, 2022 19:55
@klange klange merged commit a580a83 into master Jul 22, 2022
@klange klange deleted the codecs-revisited branch July 22, 2022 23:32

2 participants