Clarification that text-valued variables and attributes can be Unicode string or UTF-8 char arrays #543

Open
wants to merge 3 commits into
base: main

JonathanGregory
Contributor

See issue #141 for discussion of these changes.

Release checklist

  • [NA] Authors updated in cf-conventions.adoc? Add in two places: on line 3 and under .Additional Authors in About the authors.
  • [NA] Next version in cf-conventions.adoc up to date? Versioning inspired by SemVer.
  • [Y] history.adoc up to date?
  • [NA] Conformance document up to date?

@JonathanGregory JonathanGregory linked an issue Sep 17, 2024 that may be closed by this pull request
@JonathanGregory JonathanGregory added this to the 1.12 milestone Sep 17, 2024
@ChrisBarker-NOAA
Contributor

This is a challenge! I did a bit more (unsatisfying) research into netCDF, strings, and Unicode.

See my comment on #141, but I don't think this is ready to merge :-(

Contributor

@ChrisBarker-NOAA ChrisBarker-NOAA left a comment

The only substantial question I have is whether we can strengthen "may" to "should" or "must".

(For the UTF-8 encoding part)

maybe split it up:

  • A text string in a variable or an attribute may be stored in either a variable-length string or a fixed-length char array.

  • Text (either char arrays or variable-length strings) should be stored as NFC-normalized, UTF-8-encoded Unicode.

For example, a character array variable of strings containing the names of the months would be dimensioned (12,9) in order to accommodate "September", the month with the longest name.
The other strings, such as "May", should be padded with trailing NULL or space characters so that every array element is filled.
If the atomic string option is chosen, each element of the variable can be assigned a string with a different length.
A text string in a variable or an attribute may be represented either as Unicode text in a variable-length **`string`** or encoded as UTF-8 in a fixed-length **`char`** array.
Contributor

There is no such thing as "Unicode text". I suggest:

A text string in a variable or an attribute may be stored as NFC normalized UTF-8 encoded Unicode data [bytes?] in either a variable-length string or a fixed-length char array.
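To illustrate why the suggested wording mentions NFC normalization, here is a minimal sketch using only the Python standard library. The helper name `to_nfc_utf8` is hypothetical and not part of CF, the NUG, or any netCDF library; it only shows that two visually identical strings can encode to different bytes unless they are normalized first.

```python
import unicodedata

# "é" written two ways: one precomposed code point (U+00E9)
# versus "e" followed by a combining acute accent (U+0301).
precomposed = "\u00e9"
decomposed = "e\u0301"

# Without normalization the two spellings encode to different UTF-8 bytes.
raw_precomposed = precomposed.encode("utf-8")
raw_decomposed = decomposed.encode("utf-8")

# NFC normalization maps both spellings to the precomposed form.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)


def to_nfc_utf8(text):
    """Normalize to NFC, then encode as UTF-8 (hypothetical helper)."""
    return unicodedata.normalize("NFC", text).encode("utf-8")
```

With normalization applied before writing, string comparisons on the stored bytes behave the same regardless of how the text was originally typed.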

Contributor

Question -- should that be "may", rather than "must" or "should"?

The other strings, such as "May", should be padded with trailing NULL or space characters so that every array element is filled.
If the atomic string option is chosen, each element of the variable can be assigned a string with a different length.
A text string in a variable or an attribute may be represented either as Unicode text in a variable-length **`string`** or encoded as UTF-8 in a fixed-length **`char`** array.
Note that the ASCII one-byte character codes (hexadecimal `00`-`7F`) are a subset of UTF-8.
Contributor

Note that the ASCII one-byte character codes (decimal 0-127, hexadecimal 00-7F) are a subset of UTF-8.
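The subset relationship in the suggested sentence can be checked directly. This is a small sketch with the Python standard library; the function name `utf8_matches_ascii` is made up for illustration.

```python
def utf8_matches_ascii(text):
    """Return True if text is pure ASCII, in which case its UTF-8
    encoding is byte-for-byte identical to its ASCII encoding."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        # At least one character lies outside decimal 0-127.
        return False


# All 128 ASCII characters (decimal 0-127, hexadecimal 00-7F).
all_ascii = bytes(range(0x80)).decode("ascii")
```

So existing ASCII-only char arrays are already valid UTF-8 and need no conversion.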


Before version 1.12, CF did not require text in **`char`** arrays to be encoded with UTF-8, and did not provide or endorse any convention to record what encoding was used.
If the array is stored in a variable, the encoding might be recorded by the **`_Encoding`** attribute, although this is not a CF or NUG convention.
If the data-user has no information about the encoding, we suggest UTF-8 as a first guess.
Contributor

hmm -- could we say that using anything other than UTF-8 while not specifying the _Encoding is an error? or is that just a free for all anyway :-(

Contributor

"Before version 1.12, CF did not require text in char arrays to be encoded with UTF-8, and did not provide or endorse any convention to record what encoding was used."

was there any requirement for strings? Did the NUG specify UTF-8 for strings from the start (of allowing strings...)?

If so, maybe we can specifically require UTF-8 for strings.

NOTE: I'm pretty sure that some tools, e.g. netCDF4-python, do the _Encoding thing for both char arrays and strings, though it does default to UTF-8, so maybe any distinction is moot.
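The "UTF-8 as a first guess" advice in the quoted text could look roughly like the following on the reading side. This is only a sketch under stated assumptions: `decode_char_data` is a hypothetical helper, the `encoding_attr` argument stands in for whatever a reader found in an `_Encoding` attribute (which, as the quoted text says, is not a CF or NUG convention), and the padding rule assumes trailing NULs and spaces are fill rather than content.

```python
def decode_char_data(raw, encoding_attr=None):
    """Decode one row of a char array to a Python str (hypothetical helper).

    raw           -- the bytes of one fixed-length string
    encoding_attr -- value of an _Encoding attribute if present, else None
    """
    # Assumption: trailing NUL and space bytes are padding, not content.
    stripped = raw.rstrip(b"\x00 ")
    # With no recorded encoding, fall back to UTF-8 as the first guess.
    encoding = encoding_attr or "utf-8"
    try:
        return stripped.decode(encoding)
    except UnicodeDecodeError:
        # Keep going rather than fail: undecodable bytes become U+FFFD.
        return stripped.decode(encoding, errors="replace")
```

For example, Latin-1 bytes read without any `_Encoding` hint come back with a replacement character instead of raising, which makes the wrong guess visible to the data-user.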

If the array is stored in a variable, the encoding might be recorded by the **`_Encoding`** attribute, although this is not a CF or NUG convention.
If the data-user has no information about the encoding, we suggest UTF-8 as a first guess.

An __n__-dimensional array of strings may be implemented as a variable or an attribute of type **`string`** with _n_ dimensions (only _n_=1 is allowed for an attribute) or as a variable of type **`char`** with _n_+1 dimensions, where the most rapidly varying dimension (the last dimension in CDL order) is large enough to contain the longest string in the variable.
Contributor

The trick with UTF-8 is that it's a multi-byte encoding -- the number of bytes required may be more than the number of characters (code points) in the string. Should we mention that? or buyer beware if you are using non-ASCII code points?

(NOTE: this is why I was hoping we could restrict char arrays to ASCII, but that boat has sailed :-( Could we still suggest that?)
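The byte-count versus character-count point can be made concrete with a short sketch. The function name `utf8_width` and the sample month names are illustrative assumptions, not taken from the PR.

```python
def utf8_width(strings):
    """Smallest char-array string length, in bytes, that fits every
    string once it is encoded as UTF-8 (hypothetical helper)."""
    return max(len(s.encode("utf-8")) for s in strings)


# German and French month names with non-ASCII letters, written with
# escapes so the code points are unambiguous.
names = ["May", "M\u00e4rz", "ao\u00fbt"]

# Each of the last two has 4 code points but needs 5 bytes in UTF-8.
code_points = [len(s) for s in names]
byte_counts = [len(s.encode("utf-8")) for s in names]
```

Sizing the last dimension by `len(s)` would truncate the non-ASCII names, so the dimension has to be computed from the encoded bytes.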


An __n__-dimensional array of strings may be implemented as a variable or an attribute of type **`string`** with _n_ dimensions (only _n_=1 is allowed for an attribute) or as a variable of type **`char`** with _n_+1 dimensions, where the most rapidly varying dimension (the last dimension in CDL order) is large enough to contain the longest string in the variable.
For example, a **`char`** variable containing the names of the months would be dimensioned (12,9) in order to accommodate "September", the month with the longest name.
The other strings, such as "May", would be padded with trailing NULL or space characters so that every array element is filled.
Contributor

null is the default fill value for char anyway -- though it doesn't hurt to say this :-)
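The month-name example from the quoted text can be sketched in plain Python, building the rows of a (12,9) char array as NUL-padded byte strings. The helper `pad_to_width` is hypothetical, and no netCDF library is used here; writing the rows to a file is left out.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


def pad_to_width(name, width, fill=b"\x00"):
    """Encode name as UTF-8 and pad with trailing fill bytes
    (NUL by default) up to width bytes (hypothetical helper)."""
    encoded = name.encode("utf-8")
    if len(encoded) > width:
        raise ValueError("string does not fit in %d bytes" % width)
    return encoded.ljust(width, fill)


# The last dimension must hold the longest encoded name ("September").
width = max(len(m.encode("utf-8")) for m in MONTHS)

# Twelve rows of equal length: the content of a (12, width) char array.
rows = [pad_to_width(m, width) for m in MONTHS]
```

Passing `fill=b" "` gives the space-padded variant that the quoted text also allows.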

@@ -7,6 +7,7 @@

=== Working version (most recent first)

* {issues}141[Issue #141]: Clarification that text-valued variables and attributes can be Unicode vlen strings or UTF-8 char arrays.

Successfully merging this pull request may close these issues.

Add support for attributes of type string
2 participants