Require UTF-8 For Windows? #2220

DennisHeimbigner · 2022-02-07T22:17:43Z

DennisHeimbigner
Feb 7, 2022
Collaborator

This is a followup to Issue #2190

It used to be that Windows, by default, supported its own CP-1252
character set for 8-bit characters. The 1252 character set is similar
to, but not identical to, the ISO-Latin-1 character set.
At the same time, Windows also supported a wide-character
(UTF-16LE) capability, somewhat like Java.
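
To make the CP-1252 versus ISO-Latin-1 difference concrete, here is a small Python sketch using only the standard codecs (it is an illustration, not anything from netCDF-C). The two encodings agree on most bytes but differ in the 0x80-0x9F range, where CP-1252 assigns printable characters and Latin-1 has C1 control codes.

```python
# Compare how the same 8-bit byte decodes under CP-1252 and ISO-8859-1 (Latin-1).
euro_byte = b"\x80"

as_cp1252 = euro_byte.decode("cp1252")   # CP-1252 maps 0x80 to the euro sign
as_latin1 = euro_byte.decode("latin-1")  # Latin-1 maps 0x80 to a C1 control code

# Bytes outside 0x80-0x9F decode identically; 0xE9 is e-acute in both.
same_in_both = b"\xe9".decode("cp1252") == b"\xe9".decode("latin-1")

print(repr(as_cp1252), repr(as_latin1), same_in_both)
```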

As of a couple of years ago, Windows began to support (almost)
the use of the utf8 character set. It now advises new
applications to use utf8 instead of either utf16 or cp-1252.
This is discussed here:

https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

Technically, this is still Beta, but it seems pretty solid at
this point. The way to enable it is as follows:

1. In the "Run" dialog, execute the command "intl.cpl".
2. Move to the Administrative tab.
3. Click "Change system locale".
4. Check the box at the bottom labeled something like
   "Beta: Use Unicode UTF-8 for worldwide language support".

This still causes some problems because utf-16 cannot support
the whole utf character set, while utf8 can support it. So there
are some Asian and other characters that cause failures.

In any case, the question I pose is this:

Should Unidata/NetCDF-C impose a requirement that we will only support
the utf8 character set on Windows?

The important consequence is that the 1252 character set would be deprecated:
users could still use it, but we (NetCDF-C) would make no attempt to support it.

Dave-Allured · 2022-02-09T15:35:39Z

Dave-Allured
Feb 9, 2022

What is the context? My understanding is that netCDF has assumed UTF-8 for file object names (variables, attributes, etc.) since version 3.6.3, at least 14 years ago.

https://docs.unidata.ucar.edu/netcdf-c/current/faq.html#NetCDF-363-permits-UTF-8-encoded-Unicode-names-Wont-this-break-backward-compatibility-with-previous-software-releases-that-didnt-allow-such-names

DennisHeimbigner · 2022-02-09T19:50:45Z

DennisHeimbigner
Feb 9, 2022
Collaborator Author

At the spec level, yes. But in practice, it was not doing that on Windows.

Dave-Allured · 2022-02-10T14:51:12Z

Dave-Allured
Feb 10, 2022

Okay. For file and path names, definitely support UTF-8 by default on Windows. However, if easy, maintain some kind of legacy option to support UTF-16 or CP-1252. This is only a suggestion, I do not personally need this.

For data in character and string data types, please support transparent storage with no encoding restrictions. I think this has been the status quo since the start of netCDF. The actual encoding is decided by application context or external labeling. The modern assumption is UTF-8, but it should not be required.

DennisHeimbigner · 2022-02-10T19:10:53Z

DennisHeimbigner
Feb 10, 2022
Collaborator Author

The issue of what the "char" and "string" types mean has always been ambiguous.
In practice, the netcdf-c library only assumes that a char is an arbitrary 8-bit value
and a string is a sequence of chars with a trailing nul character as end of string.
The place where the encoding matters is in ncdump output or ncgen .cdl files.
If you use something other than UTF-8, then ncdump will just print the characters,
usually in \ddd format. Ncgen will accept \ddd as input, but otherwise it assumes
UTF-8.
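
As a rough illustration of the \ddd behavior described above, here is a minimal Python sketch. It is not ncdump's actual code: `dump_text` is a hypothetical helper, and the exact rule (print as text when the bytes are valid UTF-8, otherwise write non-ASCII bytes as three-digit octal escapes) is an assumption made for the example.

```python
def dump_text(data: bytes) -> str:
    """Hypothetical helper: show bytes as text if valid UTF-8, else escape them."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Keep printable ASCII (except backslash); write every other byte
        # as a backslash followed by three octal digits.
        return "".join(
            chr(b) if 0x20 <= b < 0x7F and b != 0x5C else f"\\{b:03o}"
            for b in data
        )


utf8_name = "café".encode("utf-8")      # valid UTF-8, printed as-is
cp1252_name = "café".encode("cp1252")   # the single byte 0xE9 is not valid UTF-8

print(dump_text(utf8_name))
print(dump_text(cp1252_name))
```

With a CP-1252 input, the byte 0xE9 (decimal 233) comes out as the octal escape \351, while the UTF-8 input is printed unchanged.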

1 reply

Dave-Allured Feb 10, 2022

Yes, that was my understanding for "char" and "string" data types in netCDF. Also my experience with ncdump is that I can get the correct character display by selecting the appropriate locale for a console or text window. I think this system (arbitrary bytes) is best to accommodate the widest range of usages.

DennisHeimbigner · 2022-02-10T20:43:35Z

DennisHeimbigner
Feb 10, 2022
Collaborator Author

I agree, although this has never been formally decided. Some years ago there was an
extensive discussion about providing an "_Encoding" attribute for char and string valued
variables. It may be in the CF conventions for all I know.

dopplershift · 2022-02-11T18:46:33Z

dopplershift
Feb 11, 2022
Maintainer

By requiring UTF-8, aren't you essentially deprecating support for any older version of Windows? If that's the case, it would be really helpful to know how many users this would impact.

DennisHeimbigner · 2022-02-11T20:22:58Z

DennisHeimbigner
Feb 11, 2022
Collaborator Author

Currently, turning on full UTF-8 support in Windows 10 is optional and off by default.
The PR I submitted attempts to check the Windows code page and act accordingly.
Remember that the netCDF spec specifies UTF-8 (at least for names; it is in practice agnostic
with respect to char and string data). So until now, the Windows implementation
has violated this standard.
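
For readers wondering what "check the Windows code page" could look like, here is a minimal Python sketch. It is not the logic of the PR itself: the helper names are hypothetical, and the only real API used is the Win32 `GetACP` function, called through `ctypes`. The value 65001 is the Windows code page identifier for UTF-8.

```python
import sys

CP_UTF8 = 65001  # Windows code page identifier for UTF-8


def active_code_page():
    """Return the active Windows ANSI code page, or None when not on Windows."""
    if sys.platform != "win32":
        return None
    import ctypes  # imported lazily; ctypes.windll exists only on Windows
    return ctypes.windll.kernel32.GetACP()


def is_utf8_code_page(code_page):
    """True when the given code page number is the UTF-8 code page."""
    return code_page == CP_UTF8


cp = active_code_page()
if cp is None:
    print("not on Windows; no code page to check")
elif is_utf8_code_page(cp):
    print("active code page is UTF-8")
else:
    print(f"active code page is {cp}; names may need conversion to UTF-8")
```

On a machine where the "Beta: Use Unicode UTF-8" box is not checked, the last branch would typically report a legacy code page such as 1252.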
