Half-width Katakana should be representable in ISO-2022-JP #105
Comments
It seems this is a special feature of the encoder only: "ﾐ" (halfwidth, U+FF90) and "ミ" (fullwidth, U+30DF) both encode to 0x25 0x5F. I wonder if all Japanese encoders first convert halfwidth to fullwidth now. |
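The observation above can be reproduced with a rough sketch. It assumes the encoder folds halfwidth to fullwidth in a way that matches NFKC for this character, and it relies on the fact that JIS X 0208 row 5 (lead byte 0x25) lists ァ through ヶ in the same order as U+30A1 to U+30F6. `katakanaToJis0208` and `hex` are hypothetical helpers for illustration, not part of any spec or library.

```javascript
// Hypothetical helper, not the spec's algorithm: fold halfwidth katakana to
// fullwidth with NFKC, then compute the JIS X 0208 row-5 bytes arithmetically.
function katakanaToJis0208(ch) {
  const full = ch.normalize("NFKC");
  const cp = full.codePointAt(0);
  // Only U+30A1 (ァ) through U+30F6 (ヶ) sit in row 5 in Unicode order.
  if (full.length !== 1 || cp < 0x30A1 || cp > 0x30F6) {
    return null;
  }
  return [0x25, 0x21 + (cp - 0x30A1)];
}

// Format a byte array as "0xNN 0xNN".
const hex = (bytes) =>
  bytes.map((b) => "0x" + b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(hex(katakanaToJis0208("\u30DF"))); // fullwidth ミ: 0x25 0x5F
console.log(hex(katakanaToJis0208("\uFF90"))); // halfwidth ﾐ: 0x25 0x5F
```

Both calls print the same two bytes (without the surrounding escape sequences), which is why the halfwidth form does not survive a round trip.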
No, ISO-2022-JP only. |
You'd think someone would have already written and published an algorithm for this conversion. I guess I'll just find the mapping for each code point myself. |
Okay, so I guess what we want to do is to apply Unicode Normalization Form KC to any code point in the range U+FF61 to U+FF9F, inclusive. |
const start = 0xFF61,
      end = 0xFF9F + 1;
for (let i = start; i < end; i++) {
  const cp = String.fromCodePoint(i),
        fullwidthCP = cp.normalize("NFKC");
  // ...
}
If I write those out and use @hsivonen's demo I get the results I was expecting per the above analysis. |
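For reference, here is a self-contained sketch of the same loop that prints the whole mapping. `halfwidthNfkcTable` and `toHex` are hypothetical names introduced for illustration. One caveat worth checking in the output: the halfwidth sound marks U+FF9E and U+FF9F normalize to the combining marks U+3099 and U+309A, so a standalone sound mark may need separate handling if an encoder cannot represent combining marks.

```javascript
// Build the halfwidth -> NFKC mapping for U+FF61..U+FF9F, inclusive.
function halfwidthNfkcTable() {
  const table = new Map();
  for (let cp = 0xFF61; cp <= 0xFF9F; cp++) {
    const half = String.fromCodePoint(cp);
    table.set(half, half.normalize("NFKC"));
  }
  return table;
}

// Render each code point of a string as U+XXXX.
const toHex = (s) =>
  [...s]
    .map((c) => "U+" + c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"))
    .join(" ");

for (const [half, full] of halfwidthNfkcTable()) {
  console.log(toHex(half), "->", toHex(full));
}
```

Note that normalizing a whole string instead of single code points behaves differently for voiced kana: a halfwidth カ followed by a halfwidth voiced sound mark composes into the single code point ガ (U+30AC).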
Correct me if I'm wrong, but wouldn't halfwidth katakana involve a switch to JIS X 0201 (Roman) mode? There's no need to destroy the round-trip by normalizing to fullwidth. |
@aphillips that is not what implementations do. |
@annevk Yes, although that seems like a bug in the coders. I saw this thread this morning and was surprised, since I recall having to implement this when I was writing an ISO-2022 coder about 15 years ago. I can't imagine that the encoding's formal definition has changed, so I'm surprised to see implementations doing this. |
Sure, but after such a long time bugs become features. |
@aphillips @annevk On the Internet/Web, only the original ISO-2022-JP defined in RFC 1468 was "widely" used (relative to subsequent versions); the later versions, ISO-2022-JP-[123], never got much traction. Why would anybody use ISO-2022-JP-* to encode Chinese, Korean, Latin beyond ASCII, and Greek? And JIS X 0212 (supported in ISO-2022-JP-1 or later) is not critical enough to Japanese users (Shift_JIS does not support it, either). The original ISO-2022-JP does not support Halfwidth Katakana. That's why ICU has a fallback encoding for Halfwidth Katakana for the original ISO-2022-JP. It's only ISO-2022-JP-3 that supports Halfwidth Katakana. ICU supports ISO-2022-JP-3 as defined and does not have a fallback encoding for Halfwidth Katakana in ISO-2022-JP-3. |
Note that we do support decoding halfwidth Katakana: https://encoding.spec.whatwg.org/#iso-2022-jp-decoder-katakana. Should we remove that then? |
Why? If we see the byte sequence and it isn't invalid, why not decode it? Note that this encoding is primarily used for email, not web pages. |
Sorry, that suggestion was rather flippant and I should have looked at https://w3c-test.org/encoding/iso-2022-jp-decoder.html in various browsers first, which shows it's supported (though not sure about Edge). It just shows that @jungshik's story above is not really complete as browsers support ISO-2022-JP-3's halfwidth Katakana extension on the decoder side (in what they call ISO-2022-JP). To be 100% clear: suggestion retracted. |
Sorry for the confusion. It turned out that ICU's ISO-2022-JP converter (and other converters used in browsers) supports Halfwidth Katakana ("ESC ( I") in the spirit of 'be lenient in what you accept and be strict in what you emit'. For instance, it's explicitly commented in ICU's ucnv2022.cpp
|
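The lenient "ESC ( I" handling discussed above can be sketched roughly as follows. This is a deliberately reduced illustration, not ICU's code or the full spec decoder: it only knows the ASCII (ESC ( B) and Katakana (ESC ( I) states, and it maps bytes 0x21 to 0x5F in the Katakana state to 0xFF61 - 0x21 + byte, as the linked decoder's Katakana state does. `decodeAsciiAndKatakana` is a hypothetical name.

```javascript
// Simplified sketch: handles only ESC ( B (ASCII) and ESC ( I (JIS X 0201 Katakana).
// A real ISO-2022-JP decoder has more states (Roman, JIS X 0208) and stricter error handling.
function decodeAsciiAndKatakana(bytes) {
  let state = "ascii";
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    const b = bytes[i];
    if (b === 0x1B && bytes[i + 1] === 0x28) {
      const final = bytes[i + 2];
      if (final === 0x42) { state = "ascii"; i += 2; continue; }    // ESC ( B
      if (final === 0x49) { state = "katakana"; i += 2; continue; } // ESC ( I
    }
    if (state === "katakana" && b >= 0x21 && b <= 0x5F) {
      // Halfwidth katakana block starts at U+FF61 for byte 0x21.
      out += String.fromCodePoint(0xFF61 - 0x21 + b);
    } else if (state === "ascii" && b <= 0x7F && b !== 0x1B) {
      out += String.fromCodePoint(b);
    } else {
      out += "\uFFFD"; // anything this sketch does not understand
    }
  }
  return out;
}

// ESC ( I, 0x50, ESC ( B
const sample = [0x1B, 0x28, 0x49, 0x50, 0x1B, 0x28, 0x42];
console.log(decodeAsciiAndKatakana(sample)); // ﾐ (U+FF90)
```

The single byte 0x50 decodes to halfwidth ﾐ here, whereas the encoder behaviour described earlier in the thread would write the same character back out as fullwidth ミ.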
A query string-based (and, therefore, IE/Edge-incompatible) test shows that Gecko, WebKit, Blink and Presto can encode half-width Katakana as ISO-2022-JP without NCRs.
The spec should be amended to match (both encoder and decoder side).