CLDR-17951 Update tr35.md, hoist Unicode BCP 47 Locale Identifier. (#4062)
unicode-org · Sep 25, 2024 · 49805dc
1 parent 540b1f0
Showing 1 changed file with 24 additions and 11 deletions.
diff --git a/docs/ldml/tr35.md b/docs/ldml/tr35.md
@@ -369,15 +369,30 @@ The following are additional well-formedness constraints:
   2. [ wfc: The sequence of variant subtags in a tlang must not have any duplicates. ]
   3. [ wfc: The private use extension (-x-) must come after all other extensions. ]
 
-For historical reasons, this is called a Unicode locale identifier. However, it really functions (with few exceptions) as a language identifier, and accesses language-based data. Except where it would be unclear, this document uses the term "locale" data loosely to encompass both types of data: for more information, see _[Language and Locale IDs](#Language_and_Locale_IDs)_.
+For historical reasons, this is called a Unicode locale identifier. However, it also functions (with few exceptions) as a language identifier, and accesses language-based data. Except where it would be unclear, this document uses the term "locale" data loosely to encompass both types of data: for more information, see _[Language and Locale IDs](#Language_and_Locale_IDs)_.
 
 As of the release of this specification, there were no other_extensions defined. The other_extensions are present in the syntax to allow implementations to preserve that information.
 
 As for terminology, the term _code_ may also be used instead of "subtag", and "territory" instead of "region". The primary language subtag is also called the _base language code_. For example, the base language code for "en-US" (American English) is "en" (English). The _type_ may also be referred to as a _value_ or _key-value_.
 
+All identifier field values are case-insensitive. Although case distinctions do not carry any special meaning, an implementation of LDML should use the casing recommendations in [[BCP47](#BCP47)], especially when a Unicode locale identifier is used for locale data exchange in software protocols.
+
 The identifiers can vary in case and in the separator characters. The "-" and "\_" separators are treated as equivalent, although "-" is preferred.
 
-All identifier field values are case-insensitive. Although case distinctions do not carry any special meaning, an implementation of LDML should use the casing recommendations in [[BCP47](#BCP47)], especially when a Unicode locale identifier is used for locale data exchange in software protocols.
+A _Unicode **BCP 47** locale identifier_ (<a name="unicode_bcp47_locale_id" href="#unicode_bcp47_locale_id">`unicode_bcp47_locale_id`</a>) is a <a href="#unicode_locale_id">`unicode_locale_id`</a> that meets the following additional constraints:
+- [ wfc: The EBNF `sep` is restricted to only [-] in <a name="unicode_language_id" href="#unicode_language_id"><code>unicode_language_id</code></a> and <a href="#unicode_locale_id">`unicode_locale_id`</a>.]
+- [ wfc: The first subtag must be a <a name="unicode_language_subtag" href="#unicode_language_subtag"><code>unicode_language_subtag</code></a>.] Thus it can be **neither** of the following:
+    - a <a href="#unicode_script_subtag"><code>unicode_script_subtag</code></a>.
+    - a "root" subtag (the "und" <a href="#unicode_language_subtag"><code>unicode_language_subtag</code></a> is used instead of "root").
+
+A well-formed _Unicode BCP 47 locale identifier_ is also a well-formed _BCP 47 language tag_. The reverse, however, is not guaranteed; a well-formed _BCP 47 language tag_ might not be a well-formed _Unicode BCP 47 locale identifier_.
+
+A _Unicode **CLDR** locale identifier_ (<a name="unicode_cldr_locale_id" href="#unicode_cldr_locale_id">`unicode_cldr_locale_id`</a>) is a <a href="#unicode_locale_id">`unicode_locale_id`</a> that meets the following additional constraints:
+- [ wfc: The EBNF `sep` is restricted to only [_] in <a name="unicode_language_id" href="#unicode_language_id"><code>unicode_language_id</code></a> and <a href="#unicode_locale_id">`unicode_locale_id`</a>.]
+- [ wfc: The <a href="#unicode_language_id"><code>unicode_language_id</code></a> "und" is replaced by "root".]
+- [ wfc: The first subtag cannot be a <a href="#unicode_script_subtag"><code>unicode_script_subtag</code></a>.]
+
+**Note:** The current version of CLDR data uses _Unicode **CLDR** locale identifiers_ for backward compatibility. This might be changed in future CLDR releases.
 
 #### <a name="Canonical_Unicode_Locale_Identifiers" href="#Canonical_Unicode_Locale_Identifiers">Canonical Unicode Locale Identifiers</a>
 
@@ -406,12 +421,6 @@ NOTE: Some people may wonder why CLDR uses alphabetical order for variants, rath
   * The ordering in [[BCP47](#BCP47)] does not result in a deterministic order for canonicalization, since the mechanism for determining “importance” is not specified: ca-valencia-fonipa and ca-fonipa-valencia could both be ‘canonical’ variants of one another.
   * Pure alphabetical order is deterministic and simple to implement, while the ordering in [[BCP47](#BCP47)] is nondeterministic, more complex, and provides no significant benefit in modern applications.
 
-**Note:** The current version of CLDR data uses some non-preferred _syntax_ for backward compatibility. This might be changed in future CLDR releases.
-
-  * It uses uppercase letters for variant subtags, while the preferred forms are all lowercase.
-  * It uses "\_" as the separator, while the preferred form of the separator is "-".
-  * It uses "root", while the preferred form is "und".
-
 A [`unicode_locale_id`](#unicode_locale_id) is in _canonical form_ when it has canonical syntax and contains no aliased subtags. A [`unicode_locale_id`](#unicode_locale_id) can be transformed into canonical form according to [Annex C. LocaleId Canonicalization](#LocaleId_Canonicalization).
 
 A [`unicode_locale_id`](#unicode_locale_id) is _maximal_ when the [`unicode_language_id`](#unicode_language_id) and tlang (if any) have been transformed by the Add Likely Subtags operation in _[Likely Subtags](#Likely_Subtags)_, excluding "und".
@@ -443,10 +452,14 @@ Unicode language and locale identifiers inherit the design and the repertoire of
   * The subtag "root" to indicate the generic locale used as the parent of all languages in the CLDR data model ("und" can be used instead)
   * The language tag may begin with a script subtag rather than a language subtag. This is specialized use only, and not required for CLDR conformance.
 
-There are thus two subtypes of Unicode locale identifiers:
+There are thus two subtypes of Unicode locale identifiers, as defined above.
+
+* _Unicode **BCP 47** locale identifier_ (<a href="#unicode_bcp47_locale_id">`unicode_bcp47_locale_id`</a>).
+    - A well-formed _Unicode BCP 47 locale identifier_ is also a well-formed _BCP 47 language tag_.
+    - A well-formed _BCP 47 language tag_ might not be a well-formed _Unicode BCP 47 locale identifier_.
+* _Unicode **CLDR** locale identifier_ (<a href="#unicode_cldr_locale_id">`unicode_cldr_locale_id`</a>)
 
-* the term _Unicode CLDR locale identifier_ applies where the backwards compatibility syntax is used.
-* the term _Unicode BCP 47 locale identifier_ applies otherwise. A _Unicode BCP 47 locale identifier_ is also a valid BCP 47 language tag.
+These can both be easily converted to and from _BCP 47 language tags_ as described below.
 
 #### <a name="BCP_47_Language_Tag_Conversion" href="#BCP_47_Language_Tag_Conversion">BCP 47 Language Tag Conversion</a>
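
Note on the two subtypes introduced in this change: the constraints added above differ only in the separator ("-" versus "\_") and in how the generic locale is spelled ("und" versus "root"), and both forbid a leading script subtag. The following is a minimal Python sketch of moving between the two forms under those stated constraints only. It is not the full procedure from the BCP 47 Language Tag Conversion section, which this diff does not show. The function names `to_bcp47_form` and `to_cldr_form` are hypothetical, and treating "the `unicode_language_id` is `und`" as "the first subtag is `und` and is followed by nothing or by a singleton" is an assumption made for illustration.

```python
import re

# A unicode_script_subtag is four ASCII letters; "root" is handled separately.
_SCRIPT = re.compile(r"[A-Za-z]{4}")


def _split(locale_id: str) -> list[str]:
    """Split on either separator and reject a leading script subtag."""
    subtags = re.split(r"[-_]", locale_id)
    first = subtags[0]
    if first.lower() != "root" and _SCRIPT.fullmatch(first):
        # Both subtypes in the diff forbid a script subtag in first position.
        raise ValueError(f"first subtag {first!r} is a script subtag")
    return subtags


def _language_id_is_only_first_subtag(subtags: list[str]) -> bool:
    # Assumption: the language id ends where a one-character singleton
    # (extension or private use) begins, or at the end of the identifier.
    return len(subtags) == 1 or len(subtags[1]) == 1


def to_bcp47_form(locale_id: str) -> str:
    """Use "-" as separator and "und" instead of "root"."""
    subtags = _split(locale_id)
    if subtags[0].lower() == "root":
        subtags[0] = "und"
    return "-".join(subtags)


def to_cldr_form(locale_id: str) -> str:
    """Use "_" as separator and "root" when the language id is just "und"."""
    subtags = _split(locale_id)
    if subtags[0].lower() == "und" and _language_id_is_only_first_subtag(subtags):
        subtags[0] = "root"
    return "_".join(subtags)
```

For example, `to_cldr_form("en-US")` yields `en_US` and `to_bcp47_form("root")` yields `und`, while `to_cldr_form("und-Latn")` keeps `und` because the language id there is not `und` alone. Casing is left untouched, since the field values are case-insensitive.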