Compare language tags after normalizing to lower case. #55

gkellogg · 2023-07-27T21:35:29Z

Mixed in with #48, which has since been removed from that PR, is text to compare language tags after normalizing to lower case. This is consistent with the suggestion that language tags can be converted to lower case when language-tagged strings are introduced, but was never part of RDF 1.0 nor RDF 1.1. It arguably intrudes on D-entailment where "foo"@en and "foo"@EN could be considered to have the same value but still be separate terms.

The key commit which reverted the wording is c45d947.

afs · 2023-07-28T20:13:43Z

Reference to RDF Semantics:
https://www.w3.org/TR/rdf12-semantics/#D_interpretations

The issue with c45d947 is that it is a required part of term-equals, whereas earlier it was "MAY" followed by "The value space of language tags is always in lower case."

I think this is the only case where two things would be RDF term-equals without them being character-by-character equals (after escape processing).

Antoine-Zimmermann · 2023-09-12T09:38:30Z

I think the specification is quite clear (quote from RDF 1.1 Concepts):

Literal term equality: Two literals are term-equal (the same RDF literal) if and only if the two lexical forms, the two datatype IRIs, and the two language tags (if any) compare equal, character by character. Thus, two literals can have the same value without being the same RDF term.

The tags en-US and en-us use different characters, therefore are different, regardless of semantics.

Normalising language tags has the same effect as normalising lexical forms: it changes the graph. The fact that the normalised graph means the same does not imply that they are the same.
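To make the distinction concrete, here is a minimal Python sketch of the two comparisons under discussion. The `Literal` type and both function names are hypothetical and are not taken from any RDF library; the second function only illustrates "same value, different term" for language-tagged strings.

```python
from typing import NamedTuple, Optional

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


class Literal(NamedTuple):
    """Hypothetical literal: lexical form, datatype IRI, optional language tag."""
    lexical_form: str
    datatype: str
    language: Optional[str] = None


def term_equal_rdf11(a: Literal, b: Literal) -> bool:
    # RDF 1.1 literal term equality: every component must compare equal,
    # character by character.
    return a == b


def same_up_to_tag_case(a: Literal, b: Literal) -> bool:
    # Same comparison, except that language tags are compared ignoring case.
    def key(lit: Literal):
        tag = lit.language.lower() if lit.language is not None else None
        return (lit.lexical_form, lit.datatype, tag)

    return key(a) == key(b)


a = Literal("foo", RDF_LANGSTRING, "en-US")
b = Literal("foo", RDF_LANGSTRING, "en-us")
print(term_equal_rdf11(a, b), same_up_to_tag_case(a, b))
```

Under the RDF 1.1 wording only the first function defines term equality; the second is what a case-insensitive rule would look like.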

gkellogg · 2023-09-15T21:35:04Z

This was discussed at TPAC

discussion

gkellogg: last issue is about BCP47 case issue; do we want to take this after the break?

addison: this one seems easy

gkellogg: the problem is: are two triples differing only by the language tag case two separate triples or a single triple?
… this raises issues for RDF C14N.

<AndyS> pfps - Issue: w3c/rdf-concepts#9 // PR w3c/rdf-concepts#48

gkellogg: no PR on this, only an issue.

addison: BCP47 is clearly made to be case insensitive
… it is perfectly valid to normalize things or to XXX
… I would not recommend people to only use the lowercase form -- many people want to make the tags pretty.

gkellogg: currently, literal term equality is case sensitive
… we could change that to make the comparison of the language tag case-insensitive
… this has consequences when you insert triples in a store

AndyS: what is the approach in XML?
… I believe there is a SPARQL test related to case-sensitivity with language tags.
… Should we push this to the meaning domain or the value domain?

addison: from what you described earlier, this is probably one triple

AndyS: then we need to decide which normalization to use

<AZ> The fact that the lower case and upper case mean the same does not imply that they are the same tag in the syntax

<ora> Thanks Addison!

ktl: I think we have what we need from i18n, thank you very much.
… we'll continue the discussion between us.

addison: I will share some reference material

My takeaway is that RDF was wrong to interpret "foo"@en-US and "foo"@en-us as different literals. If we updated the language to require the internal representation to be in lower case, then serializations would be free to represent them as originally specified, in lower case, or based on suggested BCP47 formatting, without changing their lexical value.

Antoine-Zimmermann · 2023-09-15T21:54:44Z

@gkellogg Ok, but if this change is made, it would be a backward-incompatible change. If a SPARQL query counts the number of literals in the data, then in SPARQL 1.1, with "foo"@en and "foo"@EN, the answer would be 2, and in SPARQL 1.2 the answer would be 1. Maybe it is not a big deal, but backward compatibility is taken very seriously in W3C standards.
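The counting difference can be sketched in plain Python, as a stand-in for the SPARQL query rather than SPARQL itself. The `count_distinct` helper and its `normalize_case` flag are hypothetical names used only for illustration.

```python
def count_distinct(literals, normalize_case: bool) -> int:
    """Count distinct (lexical form, language tag) pairs.

    With normalize_case=False, tags are compared character by character.
    With normalize_case=True, tags are lower-cased before comparison.
    """
    seen = set()
    for lexical_form, tag in literals:
        seen.add((lexical_form, tag.lower() if normalize_case else tag))
    return len(seen)


data = [("foo", "en"), ("foo", "EN")]
print(count_distinct(data, normalize_case=False))  # prints 2
print(count_distinct(data, normalize_case=True))   # prints 1
```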

gkellogg · 2023-09-15T23:12:03Z

I agree that we need to consider this seriously. But, the tacit advice in RDF concepts that implementations may normalize to lower case gives us cover. AFAIK, many implementations follow this option (my own does).

Needs more discussion.

TallTed · 2023-09-18T15:46:31Z

@gkellogg -- In your #55 (comment), I think you should wrap the "foo"@en-us and "foo"@en-us" (the latter of which should probably be "foo"@en-US, i.e., capital US and no trailing ") in backticks, so the @en-us user is not pinged about this thread, and so your meaning is clearer...

gkellogg · 2023-12-01T23:10:10Z

After the discussion on this week's call, I believe we agreed to separate this into two issues:

- Require some form of normalization in the abstract syntax (implementation dependent, but consistent) so that parsing two literals that differ only in the language-tag case would result in just a single triple. Now this is only permitted by implementations, leaving a gray area that this would close, at the cost of being a breaking change for some implementations.
- To the degree that we suggest how graphs are serialized, provide some guidance on the form of language tags. For N-Quads/Triples canonicalization, if this were lower-case (e.g., "foo"@en-us), it would be consistent with the RDF Dataset Canonicalization Candidate Recommendation (see note in introduction). Alternatively, it could be changed to use the recommended format from BCP47/RFC4646 (e.g., "foo"@en-US), but this would immediately conflict with RDF Canonicalization, even though it is based on RDF 1.1 and not RDF 1.2.

Proposed changes

change to 3.3 Literals
- if and only if the datatype IRI is http://www.w3.org/1999/02/22-rdf-syntax-ns#langString, a non-empty language tag as defined by [BCP47]. The language tag MUST be well-formed according to section 2.2.9 of [BCP47] and MUST be case normalized consistently (e.g., to lower case).
- A literal is a language-tagged string if the third element is present and the fourth element is not present. Lexical representations of language tags MUST be case normalized and MAY be converted to lower case. The value of language tags is always treated as being in lower case.
change to G. Changes between RDF 1.1 and RDF 1.2
- Language tags were previously allowed to be normalized to lower case, which made it ambiguous whether two literals with language tags differing only by case represented the same literal or distinct literals. RDF 1.2 requires that language tags be case normalized, but does not specify exactly how this is to be performed. Implementations can either follow the advice to normalize to lower case, use the recommended BCP47 format, or something else, as long as it is performed consistently.

TallTed · 2023-12-04T15:29:48Z

I'll have some text tweaks... but these proposed changes look like the right direction.

Fixes #55.

afs · 2023-12-09T20:43:04Z

Require some form of normalization in the abstract syntax (implementation dependent, but consistent) so that parsing two literals that differ only in the language-tag case would result in just a single triple. Now this is only permitted by implementations, leaving a gray area that this would close, at the cost of being a breaking change for some implementations.

I believe we agreed that, up to case, parsing two literals that differ only in the language-tag case would result in just a single literal. Consistent formatting is one way of doing this; there are other ways (e.g. dictionaries).
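As an illustration of the dictionary approach, here is a minimal sketch that interns language tags by a case-folded key and keeps whichever spelling was seen first. The class name and the first-seen policy are assumptions made for the example, not a description of any existing store.

```python
class LanguageTagDictionary:
    """Hypothetical interning table: one stored spelling per tag, ignoring case."""

    def __init__(self) -> None:
        self._by_folded: dict[str, str] = {}

    def intern(self, tag: str) -> str:
        # Language tags are ASCII, so str.lower() is sufficient as the key.
        # setdefault stores the tag only if the folded key is not yet present,
        # so the first spelling seen is the one retained.
        return self._by_folded.setdefault(tag.lower(), tag)


tags = LanguageTagDictionary()
print(tags.intern("en-US"))  # prints en-US
print(tags.intern("EN-us"))  # prints en-US
```

With this approach two literals differing only in tag case end up with the same stored tag, so they are a single literal, without the implementation committing to lower case or to BCP47 formatting.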

We have the opportunity to get away from RDF preferring "lower-case" when BCP-47 says something different.

afs · 2023-12-09T20:55:25Z

BTW the BCP47 terminology is "format" (Although in one place later-on about extensions, it slips in "normalize").

2.1.1. Formatting of Language Tags

afs · 2023-12-09T20:59:59Z

As for Dataset canonicalization, it only has to add that language tags are lower-cased during canonicalization.

Systems exist which today do not lower-case ("EN-gb" becomes "en-GB") and have unique language tags - they are not wrong.
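For comparison, the two casings can be sketched side by side. `format_bcp47` follows the casing convention from BCP47 section 2.1.1 (all lower case, except that two-letter subtags are upper-cased and four-letter subtags are title-cased when they are neither first nor after a singleton); `canonicalize_tag` lower-cases, as canonicalization would. Both function names are hypothetical, and neither checks that the tag is well-formed.

```python
def format_bcp47(tag: str) -> str:
    """Apply the BCP47 section 2.1.1 casing convention (casing only)."""
    formatted = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        subtag = subtag.lower()
        if index > 0 and not after_singleton:
            if len(subtag) == 2:
                subtag = subtag.upper()        # region, e.g. "GB"
            elif len(subtag) == 4:
                subtag = subtag.capitalize()   # script, e.g. "Latn"
        if len(subtag) == 1:
            # Everything after a singleton (extension or private use)
            # stays in lower case.
            after_singleton = True
        formatted.append(subtag)
    return "-".join(formatted)


def canonicalize_tag(tag: str) -> str:
    """Lower-case the tag, as a canonicalization step would."""
    return tag.lower()


print(format_bcp47("EN-gb"))      # prints en-GB
print(canonicalize_tag("en-GB"))  # prints en-gb
```

A store can keep the `format_bcp47` form internally and still produce lower-case tags at canonicalization time.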

* Case normalization of language tags. Fixes #55. --------- Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com> Co-authored-by: Andy Seaborne <andy@apache.org>

gkellogg added the spec:enhancement Issue or proposed change to enhance the spec without changing the normative content substantively label Jul 27, 2023

gkellogg added the needs discussion Proposed for discussion in an upcoming meeting label Jul 28, 2023

This was referenced Aug 7, 2023

Note that datatype can't be used when canonicalizing language-tagged strings w3c/rdf-n-quads#46

Closed

Canonical representation of language-tagged string w3c/rdf-n-quads#45

Closed

gkellogg added the discuss-f2f Proposed for discussion during the next face-to-face meeting label Sep 5, 2023

ktk removed the discuss-f2f Proposed for discussion during the next face-to-face meeting label Oct 3, 2023

gkellogg removed the needs discussion Proposed for discussion in an upcoming meeting label Dec 1, 2023

gkellogg added a commit that referenced this issue Dec 5, 2023

Case normalization of language tags.

1f00e72

Fixes #55.

gkellogg mentioned this issue Dec 5, 2023

Case normalization of language tags. #74

Merged

gkellogg closed this as completed in #74 Jan 11, 2024

gkellogg added a commit that referenced this issue Jan 11, 2024

Case normalization of language tags. (#74)

3eeba16
