windows-1255 encoding: add mapping for 0xCA #73

Closed
bhaible opened this issue Oct 3, 2016 · 11 comments

@bhaible

bhaible commented Oct 3, 2016

The windows-1255 encoding, as specified in the Encoding Standard, does NOT map the byte 0xCA.

However, the main use of windows-1255 is as a codepage on Windows, and the native Windows converter (function MultiByteToWideChar) has mapped 0xCA to U+05BA since Windows 2000, i.e. for more than 15 years.

On the other hand, the codepage chart at Microsoft https://msdn.microsoft.com/en-us/library/cc195057.aspx marks this position as "not used", and the majority of non-Windows conversion software does not map the byte 0xCA.

For details of these mapping tables, see
http://haible.de/bruno/charsets/conversion-tables/index.html
http://haible.de/bruno/charsets/conversion-tables/CP1255.html

The implementation of the change would be to edit index-windows-1255.txt, adding a line
74 0x05BA (HEBREW POINT HOLAM HASER FOR VAV)
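
As a rough illustration of where the pointer 74 comes from: in a single-byte index, the pointer for a byte at or above 0x80 is the byte value minus 0x80. The Python sketch below computes it and formats an entry like the line above. The helper names `index_pointer` and `index_line` are made up for this example, and the tab-separated layout is an assumption about the index file's format.

```python
import unicodedata


def index_pointer(byte: int) -> int:
    """Pointer into a single-byte index for a byte in the range 0x80..0xFF."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF have an index pointer")
    return byte - 0x80


def index_line(byte: int, code_point: int) -> str:
    """Format one index entry (assumed layout: pointer, code point, name)."""
    name = unicodedata.name(chr(code_point))
    return f"{index_pointer(byte)}\t0x{code_point:04X}\t({name})"


# Byte 0xCA is 0x4A positions past 0x80, i.e. pointer 74.
print(index_line(0xCA, 0x05BA))
```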

@annevk
Member

annevk commented Oct 4, 2016

Per https://www.w3.org/International/tests/repo/results/encoding-sb-dec#windows-1255 it's indeed only Microsoft that has failures here. I can't seem to run the test in Edge, however, and the note indicates it's mostly about PUA code points. @r12a?

(Note that to implement this change we'd update the JSON resource and run tools-index.py, but it's not entirely clear to me that we want to given that the majority of implementations is aligned.)
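
A minimal sketch of what updating the JSON resource could look like, assuming it maps each encoding name to an array of code points with null for unmapped pointers. That layout, the inline sample data, and the helper name `add_mapping` are assumptions for illustration; the real resource and what tools-index.py does with it are not shown in this thread.

```python
import json


def add_mapping(indexes: dict, name: str, pointer: int, code_point: int) -> None:
    """Set one pointer in an index, refusing to overwrite an existing mapping."""
    index = indexes[name]
    if index[pointer] is not None:
        raise ValueError(f"pointer {pointer} is already mapped")
    index[pointer] = code_point


# Hypothetical stand-in for the JSON resource: 128 pointers, all unmapped
# except the two neighbours of pointer 74 (bytes 0xC9 and 0xCB).
indexes = {"windows-1255": [None] * 128}
indexes["windows-1255"][73] = 0x05B9
indexes["windows-1255"][75] = 0x05BB

# Byte 0xCA corresponds to pointer 0xCA - 0x80 = 74.
add_mapping(indexes, "windows-1255", 0xCA - 0x80, 0x05BA)
updated = json.dumps(indexes)
```

Refusing to overwrite an existing entry keeps a change like this limited to filling a gap rather than silently remapping a byte.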

@bhaible
Author

bhaible commented Oct 4, 2016

> it's not entirely clear to me that we want to given that the majority of implementations is aligned

Yes, usually I follow this "majority of implementations" argument. But here, given that the main use of windows-1255 is as "a code page used under Microsoft Windows" [see https://en.wikipedia.org/wiki/Windows-1255], I would follow what the implementation of MultiByteToWideChar under Windows does: it maps 0xCA to U+05BA.

@r12a
Collaborator

r12a commented Oct 4, 2016

@annevk I had no problem running the test. If you continue to have a problem, let me know.

Here's a snap of the results.

[Screenshot: 1255]

0xCA is mapped to U+05BA and called out as an error.
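
To make the discrepancy concrete, here is a small sketch of a single-byte decoder run against two partial tables, one without and one with the 0xCA mapping. The function name, the table names, and the choice to emit U+FFFD for an unmapped byte are assumptions for illustration, and only the pointers around 0xCA are filled in.

```python
REPLACEMENT = "\uFFFD"


def decode_single_byte(data: bytes, index: dict) -> str:
    """Decode bytes with a pointer -> code point table; ASCII passes through."""
    out = []
    for byte in data:
        if byte < 0x80:
            out.append(chr(byte))
            continue
        code_point = index.get(byte - 0x80)
        # An unmapped pointer is a decoding error, surfaced here as U+FFFD.
        out.append(chr(code_point) if code_point is not None else REPLACEMENT)
    return "".join(out)


# Partial tables: bytes 0xC9 and 0xCB map to U+05B9 and U+05BB in both.
without_ca = {73: 0x05B9, 75: 0x05BB}
with_ca = {**without_ca, 74: 0x05BA}

sample = b"\xC9\xCA\xCB"
print([hex(ord(c)) for c in decode_single_byte(sample, without_ca)])
print([hex(ord(c)) for c in decode_single_byte(sample, with_ca)])
```

A test that expects the first result flags a decoder producing the second as failing, even though the only difference is the extra mapping.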

@vyv03354
Collaborator

vyv03354 commented Oct 4, 2016

The "best fit" mappings for windows-1255 (http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1255.txt) have the 0xCA to U+05BA mapping, by the way.

I'm OK with adding the mapping to Gecko's implementation.

@vyv03354
Collaborator

vyv03354 commented Oct 5, 2016

> On the other hand, the codepage chart at Microsoft https://msdn.microsoft.com/en-us/library/cc195057.aspx marks this position as "not used"

This is an archaic archive and should not be considered a reference these days. For example, it does not contain a mapping for the euro sign.

Recently Microsoft removed the former reference site and put a link to the "best fit" mappings on unicode.org. So the "best fit" mappings should be considered as the latest reference now.

@annevk
Member

annevk commented Oct 5, 2016

@jungshik @hsivonen okay with you too?

@bhaible
Author

bhaible commented Oct 5, 2016

> The "best fit" mappings for windows-1255 (http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1255.txt) have the 0xCA to U+05BA mapping, by the way.

Thank you for the pointer to these tables. I've updated the mapping table comparison in http://haible.de/bruno/charsets/conversion-tables/CP1255.html.

FWIW, I made the corresponding change in GNU libiconv: http://git.savannah.gnu.org/gitweb/?p=libiconv.git;a=commitdiff;h=500b967b8f4bcb2bd656c293c5412dc611c5720b

@hsivonen
Member

I'm OK with adding this mapping.

@mathiasbynens
Member

Unless @jungshik objects, it seems this is ready to be merged.

annevk added a commit that referenced this issue Oct 24, 2016
Microsoft Windows has had this mapping for over fifteen years. Despite
it not being universally adopted, it seems best to align with Windows
here.

Fixes #73.
@annevk
Member

annevk commented Oct 24, 2016

I created a PR, let me know if you see any problems. I plan on merging by end-of-day.

@jungshik

I don't have any objection. I'll add that to Blink's mapping.

annevk added a commit that referenced this issue Oct 24, 2016
Microsoft Windows has had this mapping for over fifteen years. Despite
it not being universally adopted, it seems best to align with Windows
here.

Fixes #73.
mathiasbynens added a commit to mathiasbynens/windows-1255 that referenced this issue Oct 24, 2016
Microsoft Windows has had this mapping for over fifteen years. Despite
it not being universally adopted, the Encoding Standard aligned with Windows here.

whatwg/encoding#73
whatwg/encoding#77
jungshik added a commit to jungshik/web-platform-tests that referenced this issue Oct 27, 2016
hsivonen added a commit to hsivonen/encoding_rs that referenced this issue Oct 31, 2016
annevk pushed a commit to web-platform-tests/wpt that referenced this issue Nov 7, 2016
7 participants