Codecs revisited #28
Merged
Add a filter_function to xraydict, allowing fewer big data structures. Make uses of xraydict prefer exclusion sets to exclusion lists, to avoid repeated linear search of a list.
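As a rough illustration of the idea (a Python sketch, not the actual `xraydict` implementation; the class name `OverlayDict`, its parameter names, and the toy table are all hypothetical), a view over a base table can take an exclusion set for constant-time membership checks, plus a filter function in place of a long hardcoded list:

```python
class OverlayDict:
    """Read-only view of a base mapping with some keys hidden.

    Hypothetical stand-in for the idea behind xraydict; the real
    class and its signature may differ.
    """

    def __init__(self, base, overrides=None, exclusions=(), filter_function=None):
        self.base = base
        self.overrides = dict(overrides or {})
        # A set gives O(1) average membership tests; a list would be O(n).
        self.exclusions = frozenset(exclusions)
        self.filter_function = filter_function

    def _hidden(self, key):
        if key in self.exclusions:
            return True
        # filter_function returns True for keys that should stay visible.
        return self.filter_function is not None and not self.filter_function(key)

    def __contains__(self, key):
        if key in self.overrides:
            return True
        return key in self.base and not self._hidden(key)

    def __getitem__(self, key):
        if key in self.overrides:
            return self.overrides[key]
        if self._hidden(key) or key not in self.base:
            raise KeyError(key)
        return self.base[key]


# Toy table: the keys and values are made up for the example.
base_table = {0x2121: "A", 0x2122: "B", 0x7521: "C"}

# Hide every key from 0x7500 upwards with a rule instead of listing each one.
view = OverlayDict(
    base_table,
    overrides={0x2122: ","},
    exclusions={0x2121},
    filter_function=lambda key: key < 0x7500,
)
print(0x2121 in view, view[0x2122], 0x7521 in view)
```

Here the single excluded key goes in the set, while the whole range above `0x7500` is covered by one predicate, so no large exclusion structure has to be built or searched.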
Mappings for 25-23 and 90-22 were previously the same as those used for 97JIS; they have been swapped to correspond with how the IBM extension versus the standard code are mapped in the "old sequence" (78JIS-based) as opposed to the "new sequence". Mappings for 32-70, 34-45, 35-29, 39-77 and 54-02 in 78JIS have been changed to reflect disunifications made in 2000-JIS and 2004-JIS, assigning the 1978-edition unsimplified variants of those characters separate coded forms (where previously, only swaps and disunifications in 83JIS and disunifications in 90JIS (including JIS X 0212) had been considered). This only affects the `jis_encoding` codec (including the decoding direction for `iso-2022-jp-2`, `iso-2022-jp-3` and `iso-2022-jp-2004`), and the decoding is only affected when `ESC $ @` (not `ESC $ B`) is used. The `iso-2022-jp` codec is unaffected, and remains similar to (but more consistently pedantic than) the WHATWG specification, thus using the same table for both 78JIS and 97JIS.
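To make the `ESC $ @` versus `ESC $ B` distinction concrete, here is a minimal Python sketch (not the codec's actual code; the function name and the table labels are assumptions for illustration) that splits an ISO-2022-JP style byte stream into runs according to the most recent designation, which is the point at which a decoder would pick the 78JIS or the newer table:

```python
# Designation escape sequences; the labels on the right are only names
# used by this sketch, not identifiers from the codec.
DESIGNATIONS = {
    b"\x1b$@": "78JIS",  # ESC $ @ : JIS C 6226-1978
    b"\x1b$B": "97JIS",  # ESC $ B : JIS X 0208-1983 and later editions
    b"\x1b(B": "ASCII",  # ESC ( B : back to ASCII
}


def split_by_designation(data):
    """Yield (table_label, payload_bytes) runs from a byte stream.

    Escape sequences not listed above are not recognised and are passed
    through as payload bytes; a real decoder would handle or reject them.
    """
    current = "ASCII"
    run = bytearray()
    i = 0
    while i < len(data):
        seq = data[i:i + 3]
        if seq in DESIGNATIONS:
            if run:
                yield current, bytes(run)
                run.clear()
            current = DESIGNATIONS[seq]
            i += 3
        else:
            run.append(data[i])
            i += 1
    if run:
        yield current, bytes(run)


# The same two payload bytes appear under both designations.
sample = b"A\x1b$@0!\x1b$B0!\x1b(BZ"
runs = list(split_by_designation(sample))
print(runs)
```

The two middle runs carry identical payload bytes but different labels, so only a decoder that keeps separate tables per designation can decode them differently.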
Many-to-one decodes are not uncommon in CJK encodings (e.g. Windows-31J), and mapping to the IBM Corporate PUA (code page 1449) would probably make the character render as completely the wrong one, if it rendered at all, in practice.
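The roundtripping cost of a many-to-one decode can be shown with a small Python sketch. It uses a two-entry excerpt of Windows-31J, where the standard position `0x81E0` and the NEC row 13 duplicate `0x8790` both decode to U+2252; the dictionaries and the helper name `roundtrips` are assumptions for the example, not part of any codec:

```python
# Two Windows-31J byte sequences that decode to the same character.
DECODE = {
    b"\x81\xe0": "\u2252",  # standard JIS X 0208 position
    b"\x87\x90": "\u2252",  # NEC row 13 duplicate
}

# Deriving the encoder from the decoder keeps one byte sequence per
# character; setdefault keeps the first one seen (the standard position).
ENCODE = {}
for raw, char in DECODE.items():
    ENCODE.setdefault(char, raw)


def roundtrips(raw):
    """Return True if decoding then re-encoding gives back the same bytes."""
    return ENCODE[DECODE[raw]] == raw


print(roundtrips(b"\x81\xe0"), roundtrips(b"\x87\x90"))
```

The duplicate position decodes to a character that renders correctly but re-encodes to the standard position, which is the trade-off accepted here in preference to a private-use mapping.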
Withdrawing review request while I fix something I just noticed.
Okay, done.
Since I haven't done anything on this for several days (`x-mac-japanese` would be interesting, since it has some of the mapping issues of `x-mac-korean` plus the fact that it is (on the conservative side) two different encodings in a trenchcoat, so I should probably keep that one on the backburner for now), I should probably PR what I have. This is basically several things I've been wanting to address for some time now; the main ones are:
- Allowing `xraydict` to accept a filter function as well as the exclusion list. This means that long hardcoded lists of exclusions (for encodings defined by their difference from other similar encodings) are less necessary.
- Changes to the `jis_encoding` codec (and therefore also the decoding behaviour for the ISO-2022-JP family except for `iso-2022-jp` itself, but only when the older `ESC $ @` rather than `ESC $ B` appears in input).
- Adding an `x-mac-korean` codec. Of Python's “temporary mac CJK aliases, will be replaced by proper codecs in 3.1” (which never were replaced and still bear that notice, lol), this brings the number with (by contrast) proper Kuroko support up to three out of four. Of all legacy Macintosh encodings, MacKorean is easily the one with the largest number of characters that don't exist in Unicode (all of them exist in Adobe-Korea, though not in Adobe-KR). I have deliberately deviated from the three Apple mappings and one Adobe mapping I have for them (some partial, some with kludge mappings) in order to ① take advantage of closer (usually newer) Unicode representations, ② avoid decoding non-ASCII to sequences with non-alphanumeric ASCII substrings, since they could be syntactically significant, and ③ generally avoid using Apple's Corporate Private Use Area, at the expense of roundtripping.
- The `johab-ebcdic` decoder is likewise changed to avoid using IBM's Corporate Private Use Area, at the expense of roundtripping.