CLDR-17951 Update tr35.md, hoist Unicode BCP 47 Locale Identifier. (#…
macchiati authored and conradarcturus committed Sep 25, 2024
1 parent 540b1f0 commit 49805dc
Showing 1 changed file with 24 additions and 11 deletions.
35 changes: 24 additions & 11 deletions docs/ldml/tr35.md
@@ -369,15 +369,30 @@ The following are additional well-formedness constraints:
2. [ wfc: The sequence of variant subtags in a tlang must not have any duplicates. ]
3. [ wfc: The private use extension (-x-) must come after all other extensions. ]

For historical reasons, this is called a Unicode locale identifier. However, it really functions (with few exceptions) as a language identifier, and accesses language-based data. Except where it would be unclear, this document uses the term "locale" data loosely to encompass both types of data: for more information, see _[Language and Locale IDs](#Language_and_Locale_IDs)_.
For historical reasons, this is called a Unicode locale identifier. However, it also functions (with few exceptions) as a language identifier, and accesses language-based data. Except where it would be unclear, this document uses the term "locale" data loosely to encompass both types of data: for more information, see _[Language and Locale IDs](#Language_and_Locale_IDs)_.

As of the release of this specification, there were no other_extensions defined. The other_extensions are present in the syntax to allow implementations to preserve that information.

As for terminology, the term _code_ may also be used instead of "subtag", and "territory" instead of "region". The primary language subtag is also called the _base language code_. For example, the base language code for "en-US" (American English) is "en" (English). The _type_ may also be referred to as a _value_ or _key-value_.

All identifier field values are case-insensitive. Although case distinctions do not carry any special meaning, an implementation of LDML should use the casing recommendations in [[BCP47](#BCP47)], especially when a Unicode locale identifier is used for locale data exchange in software protocols.

The identifiers can vary in case and in the separator characters. The "-" and "\_" separators are treated as equivalent, although "-" is preferred.

All identifier field values are case-insensitive. Although case distinctions do not carry any special meaning, an implementation of LDML should use the casing recommendations in [[BCP47](#BCP47)], especially when a Unicode locale identifier is used for locale data exchange in software protocols.
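
A minimal sketch of what case-insensitivity and separator equivalence mean in practice when comparing two identifiers. The helper names `split_subtags` and `same_identifier` are hypothetical and not part of LDML; the sketch only folds case and separators and performs no canonicalization or alias resolution.

```python
def split_subtags(identifier: str) -> list[str]:
    """Split an identifier into lowercase subtags, treating "-" and "_" alike."""
    # "-" is the preferred separator, so fold "_" into it before splitting.
    return identifier.replace("_", "-").lower().split("-")


def same_identifier(a: str, b: str) -> bool:
    """Compare two identifiers, ignoring case and separator choice only."""
    return split_subtags(a) == split_subtags(b)


print(same_identifier("en_US", "EN-us"))
print(same_identifier("en-US", "en-GB"))
```

For output or data exchange, the casing recommendations in BCP 47 still apply; lowercasing here is only a convenient comparison key.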
A _Unicode **BCP 47** locale identifier_ (<a name="unicode_bcp47_locale_id" href="#unicode_bcp47_locale_id">`unicode_bcp47_locale_id`</a>) is a <a href="#unicode_locale_id">`unicode_locale_id`</a> that meets the following additional constraints:
- [ wfc: The EBNF `sep` is restricted to only [-] in <a name="unicode_language_id" href="#unicode_language_id"><code>unicode_language_id</code></a> and <a href="#unicode_locale_id">`unicode_locale_id`</a>.]
- [ wfc: The first subtag must be a <a name="unicode_language_subtag" href="#unicode_language_subtag"><code>unicode_language_subtag</code></a>.] Thus it can be **neither** of the following:
- a <a href="#unicode_script_subtag"><code>unicode_script_subtag</code></a>.
- a "root" subtag (the "und" <a href="#unicode_language_subtag"><code>unicode_language_subtag</code></a> is used instead of "root").

A well-formed _Unicode BCP 47 locale identifier_ is also a well-formed _BCP 47 language tag_. The reverse, however, is not guaranteed; a well-formed _BCP 47 language tag_ might not be a well-formed _Unicode BCP 47 locale identifier_.
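
The two additional constraints can be sketched as a small check. This is an illustration under stated assumptions, not a validator: the function name `meets_bcp47_form_constraints` is hypothetical, the pattern assumes the LDML EBNF shape of `unicode_language_subtag` (2 to 3 or 5 to 8 ASCII letters), and the rest of the identifier is assumed to be a well-formed `unicode_locale_id` already.

```python
import re

# Assumed shape of unicode_language_subtag: alpha{2,3} | alpha{5,8}.
# A 4-letter first subtag (a script subtag, or "root") does not match.
_LANGUAGE_SUBTAG = re.compile(r"[A-Za-z]{2,3}|[A-Za-z]{5,8}")


def meets_bcp47_form_constraints(identifier: str) -> bool:
    """Check only the two extra constraints; full well-formedness is not tested."""
    # Constraint 1: the separator must be "-" only.
    if "_" in identifier:
        return False
    # Constraint 2: the first subtag must be a unicode_language_subtag.
    first = identifier.split("-", 1)[0]
    return _LANGUAGE_SUBTAG.fullmatch(first) is not None


for candidate in ("en-US", "en_US", "Latn-DE", "root", "und"):
    print(candidate, meets_bcp47_form_constraints(candidate))
```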

A _Unicode **CLDR** locale identifier_ (<a name="unicode_cldr_locale_id" href="#unicode_cldr_locale_id">`unicode_cldr_locale_id`</a>) is a <a href="#unicode_locale_id">`unicode_locale_id`</a> that meets the following additional constraints:
- [ wfc: The EBNF `sep` is restricted to only [_] in <a name="unicode_language_id" href="#unicode_language_id"><code>unicode_language_id</code></a> and <a href="#unicode_locale_id">`unicode_locale_id`</a>.]
- [ wfc: The <a href="#unicode_language_id"><code>unicode_language_id</code></a> "und" is replaced by "root".]
- [ wfc: The first subtag cannot be a <a href="#unicode_script_subtag"><code>unicode_script_subtag</code></a>.]

**Note:** The current version of CLDR data uses _Unicode **CLDR** locale identifiers_ for backward compatibility. This might be changed in future CLDR releases.
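
As a rough sketch of how the two forms relate, the functions below switch the separator and swap "und" and "root". The names `to_cldr_form` and `to_bcp47_form` are hypothetical. The sketch assumes the input is already well-formed and starts with a language subtag or "root" (identifiers that begin with a script subtag are not handled), and it leaves casing untouched. Treating the language identifier as bare "und" when the next subtag is absent or a single-character extension singleton is an interpretation of the grammar, not something this text states.

```python
def to_cldr_form(identifier: str) -> str:
    """Rewrite a "-"-separated identifier using "_" and "root" (sketch only)."""
    subtags = identifier.replace("_", "-").split("-")
    # The language identifier is bare "und" only when nothing follows it,
    # or when the next subtag is a singleton that starts an extension.
    bare_und = subtags[0].lower() == "und" and (
        len(subtags) == 1 or len(subtags[1]) == 1
    )
    if bare_und:
        subtags[0] = "root"
    return "_".join(subtags)


def to_bcp47_form(identifier: str) -> str:
    """Rewrite a "_"-separated identifier using "-" and "und" (sketch only)."""
    subtags = identifier.replace("_", "-").split("-")
    if subtags[0].lower() == "root":
        subtags[0] = "und"
    return "-".join(subtags)


print(to_cldr_form("en-US"))
print(to_bcp47_form("root"))
```

Note that "und-Latn" keeps "und" in this sketch, because its language identifier is "und-Latn" rather than bare "und".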

#### <a name="Canonical_Unicode_Locale_Identifiers" href="#Canonical_Unicode_Locale_Identifiers">Canonical Unicode Locale Identifiers</a>

@@ -406,12 +421,6 @@ NOTE: Some people may wonder why CLDR uses alphabetical order for variants, rath
* The ordering in does not result in a deterministic order for canonicalization, since the mechanism for determining “importance” is not specified: ca-valencia-fonipa and ca-fonipa-valencia could both be ‘canonical’ variants of one another.
* Pure alphabetical order is deterministic and simple to implement, while the ordering in is nondeterministic, more complex, and provides no significant benefit in modern applications.

**Note:** The current version of CLDR data uses some non-preferred _syntax_ for backward compatibility. This might be changed in future CLDR releases.

* It uses uppercase letters for variant subtags, while the preferred forms are all lowercase.
* It uses "\_" as the separator, while the preferred form of the separator is "-".
* It uses "root", while the preferred form is "und".

A [`unicode_locale_id`](#unicode_locale_id) is in _canonical form_ when it has canonical syntax and contains no aliased subtags. A [`unicode_locale_id`](#unicode_locale_id) can be transformed into canonical form according to [Annex C. LocaleId Canonicalization](#LocaleId_Canonicalization).

A [`unicode_locale_id`](#unicode_locale_id) is _maximal_ when the [`unicode_language_id`](#unicode_language_id) and tlang (if any) have been transformed by the Add Likely Subtags operation in _[Likely Subtags](#Likely_Subtags)_, excluding "und".
@@ -443,10 +452,14 @@ Unicode language and locale identifiers inherit the design and the repertoire of
* The subtag "root" to indicate the generic locale used as the parent of all languages in the CLDR data model ("und" can be used instead)
* The language tag may begin with a script subtag rather than a language subtag. This is specialized use only, and not required for CLDR conformance.

There are thus two subtypes of Unicode locale identifiers:
There are thus two subtypes of Unicode locale identifiers, as defined above.

* _Unicode **BCP 47** locale identifier_ (<a href="#unicode_bcp47_locale_id">`unicode_bcp47_locale_id`</a>).
- A well-formed _Unicode BCP 47 locale identifier_ is also a well-formed _BCP 47 language tag_.
- A well-formed _BCP 47 language tag_ might not be a well-formed _Unicode BCP 47 locale identifier_.
* _Unicode **CLDR** locale identifier_ (<a href="#unicode_cldr_locale_id">`unicode_cldr_locale_id`</a>)

* the term _Unicode CLDR locale identifier_ applies where the backwards compatibility syntax is used.
* the term _Unicode BCP 47 locale identifier_ applies otherwise. A _Unicode BCP 47 locale identifier_ is also a valid BCP 47 language tag.
These can both be easily converted to and from _BCP 47 language tags_ as described below.

#### <a name="BCP_47_Language_Tag_Conversion" href="#BCP_47_Language_Tag_Conversion">BCP 47 Language Tag Conversion</a>
