Compare language tags after normalizing to lower case. #55

Closed
gkellogg opened this issue Jul 27, 2023 · 11 comments · Fixed by #74
Labels
spec:enhancement Issue or proposed change to enhance the spec without changing the normative content substantively

Comments

@gkellogg
Member

gkellogg commented Jul 27, 2023

Mixed in with #48, which has since been removed from that PR, is text to compare language tags after normalizing to lower case. This is consistent with the suggestion that language tags can be converted to lower case when language-tagged strings are introduced, but was never part of RDF 1.0 nor RDF 1.1. It arguably intrudes on D-entailment where "foo"@en and "foo"@EN could be considered to have the same value but still be separate terms.

The key commit which reverted the wording is c45d947.

@afs
Contributor

afs commented Jul 28, 2023

Reference to RDF Semantics:
https://www.w3.org/TR/rdf12-semantics/#D_interpretations

The issue with c45d947 is that it is a required part of term-equals, whereas earlier it was "MAY" followed by "The value space of language tags is always in lower case."

I think this is the only case where two things would be RDF term-equals without them being character-by-character equals (after escape processing).

@gkellogg gkellogg added the needs discussion Proposed for discussion in an upcoming meeting label Jul 28, 2023
@gkellogg gkellogg added the discuss-f2f Proposed for discussion during the next face-to-face meeting label Sep 5, 2023
@Antoine-Zimmermann

I think the specification is quite clear (quote from RDF 1.1 Concepts):

Literal term equality: Two literals are term-equal (the same RDF literal) if and only if the two lexical forms, the two datatype IRIs, and the two language tags (if any) compare equal, character by character. Thus, two literals can have the same value without being the same RDF term.

The tags en-US and en-us use different characters, therefore are different, regardless of semantics.

Normalising language tags has the same effect as normalising lexical forms: it changes the graph. The fact that the normalised graph means the same does not imply that they are the same.
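
To make the distinction concrete, here is a minimal sketch of the two notions of equality under discussion. The `Literal` type and both function names are hypothetical, chosen only for illustration; they are not taken from any RDF library.

```python
from typing import NamedTuple, Optional

LANG_STRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal (illustration only)."""
    lexical_form: str
    datatype: str
    language: Optional[str] = None


def term_equal_rdf11(a: Literal, b: Literal) -> bool:
    # RDF 1.1 reading: every component compares equal, character by character.
    return a == b


def term_equal_case_insensitive(a: Literal, b: Literal) -> bool:
    # Alternative reading: language tags are compared after lower-casing.
    def key(lit: Literal):
        lang = lit.language.lower() if lit.language is not None else None
        return (lit.lexical_form, lit.datatype, lang)

    return key(a) == key(b)
```

Under the first function `"foo"@en-US` and `"foo"@en-us` are different terms; under the second they are the same term.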

@gkellogg
Member Author

gkellogg commented Sep 15, 2023

This was discussed at TPAC

discussion

gkellogg: last issue is about BCP47 case issue; do we want to take this after the break?

addison: this one seems easy

gkellogg: the problem is: are two triples differing only by the language tag case two separate triples or a single triple?
… this raises issues for RDF C14N.

<AndyS> pfps - Issue: w3c/rdf-concepts#9 // PR w3c/rdf-concepts#48

gkellogg: no PR on this, only an issue.

addison: BCP47 is clearly made to be case insensitive
… it is perfectly valid to normalize things or to XXX
… I would not recommend people to only use the lowercase form -- many people want to make the tags pretty.

gkellogg: currently, literal term equality is case sensitive
… we could change that to make the comparison of the language tag case-insensitive
… this has consequences when you insert triples in a store

AndyS: what is the approach in XML?
… I believe there is a SPARQL test related to case-sensitivity with language tags.
… Should we push this to the meaning domain or the value domain?

addison: from what you described earlier, this is probably one triple

AndyS: then we need to decide which normalization to use

<AZ> The fact that the lower case and upper case mean the same does not imply that they are the same tag in the syntax

<ora> Thanks Addison!

ktl: I think we have what we need from i18n, thank you very much.
… we'll continue the discussion between us.

addison: I will share some reference material

My takeaway is that RDF was wrong to interpret "foo"@en-US and "foo"@en-us as different literals. If we updated the language to require the internal representation to be in lower case, then serializations would be free to represent them as originally specified, in lower case, or based on the suggested BCP47 formatting, without changing their lexical value.

@Antoine-Zimmermann

Antoine-Zimmermann commented Sep 15, 2023

@gkellogg Ok, but if this change is made, it would be a backward-incompatible change. If a SPARQL query counts the number of literals in the data, then in SPARQL 1.1, with "foo"@en and "foo"@EN, the answer would be 2, and in SPARQL 1.2 the answer would be 1. Maybe it is not a big deal, but backward compatibility is taken very seriously in W3C standards.
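
The counting difference can be sketched without a SPARQL engine. The snippet below is only an illustration: `count_distinct_literals` is a hypothetical helper that models literals as `(lexical form, language tag)` pairs and counts distinct terms with and without case folding of the tag.

```python
from typing import Iterable, Tuple


def count_distinct_literals(
    literals: Iterable[Tuple[str, str]], fold_case: bool
) -> int:
    """Count distinct (lexical form, language tag) pairs.

    fold_case=False models character-by-character term equality;
    fold_case=True models comparison after lower-casing the tag.
    """
    seen = set()
    for lexical_form, language in literals:
        tag = language.lower() if fold_case else language
        seen.add((lexical_form, tag))
    return len(seen)


data = [("foo", "en"), ("foo", "EN")]
print(count_distinct_literals(data, fold_case=False))  # 2
print(count_distinct_literals(data, fold_case=True))   # 1
```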

@gkellogg
Member Author

I agree that we need to consider this seriously. But, the tacit advice in RDF concepts that implementations may normalize to lower case gives us cover. AFAIK, many implementations follow this option (my own does).

Needs more discussion.

@TallTed
Member

TallTed commented Sep 18, 2023

@gkellogg -- In your #55 (comment), I think you should wrap the "foo"@en-us and "foo"@en-us" (the latter of which should probably be "foo"@en-US, i.e., capital US and no trailing ") in backticks, so the @en-us user is not pinged about this thread, and so your meaning is clearer...

@ktk ktk removed the discuss-f2f Proposed for discussion during the next face-to-face meeting label Oct 3, 2023
@gkellogg gkellogg removed the needs discussion Proposed for discussion in an upcoming meeting label Dec 1, 2023
@gkellogg
Member Author

gkellogg commented Dec 1, 2023

After the discussion on this week's call, I believe we agreed to separate this into two issues:

  1. Require some form of normalization in the abstract syntax (implementation dependent, but consistent) so that parsing two literals that differ only in the language-tag case would result in just a single triple. Now this is only permitted by implementations, leaving a gray area that this would close, at the cost of being a breaking change for some implementations.
  2. To the degree that we suggest how graphs are serialized, provide some guidance on the form of language tags. For N-Quads/Triples canonicalization, if this were lower-case (e.g., "foo"@en-us), it would be consistent with the RDF Dataset Canonicalization Candidate Recommendation (see note in introduction). Alternatively, it could be changed to use the recommended format from BCP47/RFC4646 (e.g., "foo"@en-US), but this would immediately conflict with RDF Canonicalization, even though it is based on RDF 1.1 and not RDF 1.2.

Proposed changes

  • change to 3.3 Literals
    • if and only if the datatype IRI is http://www.w3.org/1999/02/22-rdf-syntax-ns#langString, a non-empty language tag as defined by [BCP47]. The language tag MUST be well-formed according to section 2.2.9 of [BCP47] and MUST be case normalized consistently (e.g., to lower case).

    • A literal is a language-tagged string if the third element is present and the fourth element is not present. Lexical representations of language tags MUST be case normalized and MAY be converted to lower case. The value of language tags is always treated as being in lower case.

  • change to G. Changes between RDF 1.1 and RDF 1.2
    • Language tags were previously allowed to be normalized to lower case, which made it ambiguous if two literals with language tags different only by case represented the same literal, or distinct literals. RDF 1.2 requires that language tags be case normalized, but does not specify exactly how this is to be performed. Implementations can either follow the advice to normalize to lower case, use the recommended BCP47 format, or something else, as long as it is performed consistently.
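
As a rough sketch of what "case normalized consistently" could look like in an implementation, the hypothetical `Graph` class below applies one normalization function to every language tag on insert. The class, its method names, and the default of `str.lower` are assumptions made for illustration, not part of the proposed spec text.

```python
from typing import Callable, Set, Tuple

# (subject, predicate, (lexical form, language tag)); illustration only.
Triple = Tuple[str, str, Tuple[str, str]]


class Graph:
    """Toy triple set that normalizes language tags when triples are added."""

    def __init__(self, normalize_tag: Callable[[str], str] = str.lower) -> None:
        # Any function works, provided the same one is applied to every tag.
        self._normalize_tag = normalize_tag
        self._triples: Set[Triple] = set()

    def add(self, subject: str, predicate: str,
            lexical_form: str, language: str) -> None:
        literal = (lexical_form, self._normalize_tag(language))
        self._triples.add((subject, predicate, literal))

    def __len__(self) -> int:
        return len(self._triples)


g = Graph()
g.add("ex:s", "ex:p", "foo", "en-US")
g.add("ex:s", "ex:p", "foo", "en-us")
print(len(g))  # 1: both inputs collapse to a single triple
```

Passing an identity function as `normalize_tag` reproduces the older behaviour, in which the two inputs remain separate triples.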

@TallTed
Member

TallTed commented Dec 4, 2023

I'll have some text tweaks... but these proposed changes look like the right direction.

@afs
Contributor

afs commented Dec 9, 2023

  1. Require some form of normalization in the abstract syntax (implementation dependent, but consistent) so that parsing two literals that differ only in the language-tag case would result in just a single triple. Now this is only permitted by implementations, leaving a gray area that this would close, at the cost of being a breaking change for some implementations.

I believe we agreed that, to within case-sensitivity, parsing two literals that differ only in the language-tag case would result in just a single literal. Consistent formatting is one way of doing this; there are other ways (e.g. dictionaries).

We have the opportunity to get away from RDF preferring "lower-case" when BCP-47 says something different.
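
One possible reading of the dictionary approach, sketched under the assumption that it means a lookup table keyed on the case-folded tag: the hypothetical `LanguageTagTable` below returns the first spelling seen for every case variant, so tags are unique without any particular format being imposed.

```python
from typing import Dict


class LanguageTagTable:
    """Maps every case variant of a tag to the first spelling encountered."""

    def __init__(self) -> None:
        self._by_folded: Dict[str, str] = {}

    def intern(self, tag: str) -> str:
        # Look up by lower-cased key; store the given spelling if it is new.
        return self._by_folded.setdefault(tag.lower(), tag)


table = LanguageTagTable()
print(table.intern("en-GB"))  # en-GB
print(table.intern("EN-gb"))  # en-GB (first spelling wins)
```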

@afs
Contributor

afs commented Dec 9, 2023

BTW the BCP47 terminology is "format" (Although in one place later-on about extensions, it slips in "normalize").

2.1.1. Formatting of Language Tags

@afs
Contributor

afs commented Dec 9, 2023

As for Dataset canonicalization, it only has to add that language tags are lower-cased during canonicalization.

Systems exist which today do not lower-case ("EN-gb" becomes "en-GB") and have unique language tags - they are not wrong.
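
For reference, a minimal sketch of the registry-free formatting rule from BCP47 section 2.1.1: all subtags are lower case, except that two-letter subtags become upper case and four-letter subtags become title case when they are neither at the start of the tag nor after a singleton. The function name `format_bcp47` is hypothetical, and treating every subtag after the first singleton as "after a singleton" is my reading of that rule.

```python
def format_bcp47(tag: str) -> str:
    """Apply the BCP47 section 2.1.1 case conventions to a language tag."""
    formatted = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        lowered = subtag.lower()
        if index > 0 and not after_singleton and len(lowered) == 2:
            formatted.append(lowered.upper())       # e.g. region: GB
        elif index > 0 and not after_singleton and len(lowered) == 4:
            formatted.append(lowered.capitalize())  # e.g. script: Latn
        else:
            formatted.append(lowered)
        if len(lowered) == 1:
            # Extension and private-use subtags that follow stay lower case.
            after_singleton = True
    return "-".join(formatted)


print(format_bcp47("EN-gb"))           # en-GB
print(format_bcp47("AZ-LATN-x-LATN"))  # az-Latn-x-latn
print("EN-gb".lower())                 # en-gb, the lower-cased form
```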

gkellogg added a commit that referenced this issue Jan 11, 2024
* Case normalization of language tags. Fixes #55.

---------

Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
Co-authored-by: Andy Seaborne <andy@apache.org>