[I18N] References to ISO-639 vs. BCP47 #959

Closed
aphillips opened this issue Jun 24, 2019 · 21 comments
Labels
dcat
dcat-primer: Issues useful for dcat-primer
feedback: Issues stemming from external feedback to the WG
future-work: issue deferred to the next standardization round
i18n-needs-resolution: Issue the Internationalization Group has raised and looks for a response on.

@aphillips

6.4.9 Property: language (and other locations)
https://www.w3.org/TR/2019/WD-vocab-dcat-2-20190528/

dct:LinguisticSystem Resources defined by the Library of Congress (ISO 639-1, ISO 639-2) SHOULD be used. If a ISO 639-1 (two-letter) code is defined for language, then its corresponding IRI SHOULD be used; if no ISO 639-1 code is defined, then IRI corresponding to the ISO 639-2 (three-letter) code SHOULD be used.
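The selection rule quoted above can be sketched in a few lines of Python. This is only an illustration: the IRI patterns follow the Library of Congress vocabulary pages referenced in this thread, while the function name `loc_language_iri` and the tiny code table are assumptions made for the example, not part of DCAT or of any LoC API.

```python
# Minimal sketch of the "prefer ISO 639-1, fall back to ISO 639-2" rule.
# The table is a small illustrative excerpt, not the full ISO 639 registry.
LOC_BASE = "http://id.loc.gov/vocabulary"

# ISO 639-2 code -> ISO 639-1 code (None when no two-letter code exists).
ISO639_2_TO_1 = {
    "eng": "en",
    "ita": "it",
    "ang": None,  # Old English has no ISO 639-1 code
    "haw": None,  # Hawaiian has no ISO 639-1 code
}


def loc_language_iri(iso639_2_code: str) -> str:
    """Return the IRI to use for a language, given its ISO 639-2 code."""
    code = iso639_2_code.lower()
    if code not in ISO639_2_TO_1:
        raise KeyError(f"code not in the illustrative table: {code!r}")
    two_letter = ISO639_2_TO_1[code]
    if two_letter is not None:
        # A two-letter code exists, so its IRI takes precedence.
        return f"{LOC_BASE}/iso639-1/{two_letter}"
    # Otherwise fall back to the three-letter code.
    return f"{LOC_BASE}/iso639-2/{code}"


print(loc_language_iri("eng"))
print(loc_language_iri("ang"))
```

Note that nothing in this rule can express a region, script, or variant, which is the gap the comment is pointing at.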

Dublin Core, which is the source for this reference, is unclear (AFAICT) in its relationship to BCP47 and language tags. The use of ISO 639 parts 1 and 2 (without reference to part 3) and the lack of support for the various kinds of subtags in BCP47 (not to mention the stability guarantees found in the BCP registry) are somewhat outdated. It's also notable that DC's description of "LinguisticSystem" seems closer to what language tags provide than to what ISO 639 provides.

Some systems (e.g. library technologies) might be limited to the 639 codes simply because they are intended for a specific industry segment (which is fine, since BCP47 includes the 639 codes). For general interchange, however, it would be better if DCAT permitted language tags and not just 639 codes, as this is the modern standard and used broadly on the Internet/Web.

[This comment is part of the I18N horizontal review.]

@andrea-perego added the feedback and dcat labels Jun 24, 2019
@andrea-perego added this to the DCAT CR milestone Jun 24, 2019
@makxdekkers
Contributor

To me, this seems to be a comment addressed to DCMI. DCAT just reuses/imports the property dct:language.
The DCMI term is aimed at providing a way to advertise a language used in the described resource by linking to a resource (through its URI) that represents that language. In Linked Data terms, one can resolve that link and find out more about that language. E.g. resolving http://id.loc.gov/vocabulary/iso639-1/en.html or https://www.wikidata.org/wiki/Q1860 gives you information about the language, e.g. its relation to other languages, its name in other languages, places where it is spoken etc.
BCP47 is a list of tags with a powerful mechanism to specify regional variants, different scripts etc. but there are, as far as I know, no URIs for all the potential BCP47 tags, so they cannot be used for dct:language.
However, the full power of BCP47 is available for tagging any text field in metadata and data as specified at https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal.
Maybe there could be a statement in the DCAT specification somewhere that encourages the use of language tags on all pieces of text in metadata and data?
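To make the point about tagging text fields concrete, here is a minimal sketch that serializes a string as a Turtle language-tagged literal. The function name `turtle_literal` is hypothetical, and the check only covers the shape of a tag as the Turtle grammar defines it; it does not validate subtags against the BCP47 registry.

```python
import re

# Shape of LANGTAG in the Turtle grammar: letters, then "-"-separated
# alphanumeric subtags. This checks well-formedness only, not validity.
_LANGTAG = re.compile(r"[a-zA-Z]+(-[a-zA-Z0-9]+)*")


def turtle_literal(text: str, lang: str) -> str:
    """Serialize text as a Turtle language-tagged string literal."""
    if not _LANGTAG.fullmatch(lang):
        raise ValueError(f"not a well-formed language tag: {lang!r}")
    # Escape the characters that cannot appear raw in a short Turtle string.
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"@{lang}'


# A title in Canadian French, tagged with region information that a
# dct:language IRI for "French" alone would not carry.
print(turtle_literal("Catalogue de données", "fr-CA"))
```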

@aisaac
Contributor

aisaac commented Jul 2, 2019

I find this quite puzzling, actually. It went under my radar when I should perhaps have flagged it earlier. Some comments that could have an impact on the discussion here:

  • the reference to LC's URIs doesn't come from DCMI, see http://purl.org/dc/terms/language
  • DCMI is going to soften its stance on ranges, i.e. (among others) dct:language's range will "include" LinguisticSystem but this will not rule out other types of resources, which includes strings (like the regular, old-fashioned tags, possibly with subtags)
  • I am not sure that we should be so focused on LoC's URIs in our recommendations. As @makxdekkers suggests, there are plenty of other resources, like Wikidata, that could provide suitable URIs. Has there been a decision not to recommend them?

@makxdekkers
Contributor

@aisaac If I remember correctly, the GLD WG thought it would be good to recommend one particular set of URIs for dct:language, to help interoperability. After some discussion, the group decided to refer to LoC's URIs. The reasons were (a) that those URIs were the most authoritative, as LoC is the maintenance agency for ISO639-2 and (b) that LoC maintains resolvable URIs for both ISO639-1 and ISO639-2.

@aisaac
Contributor

aisaac commented Jul 3, 2019

@makxdekkers thanks for the clarification. GLD is not DCMI indeed. I can agree with the fact that LoC's URIs are probably the best around for ISO codes, but am still a bit unsure about making a strong recommendation for something that would capture only a part of the multilinguality needs (especially, no subtags). We could still leave the door open to BCP47.

Note that's also a problem we have for Europeana: we would like to normalize languages to URIs, and we were looking especially at the NAL from the European Commission (for example http://publications.europa.eu/resource/authority/language/FRA). But that would have implied losing all the subtags (which is something we've realized only quite recently, in fact, see https://europeana.atlassian.net/browse/RD-64), so for the moment we normalize to ISO639-3 and keep subtags according to BCP47 (for the moment just hoping that the subtags are correct, which is a bit optimistic but we plan to check that some day too).
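One possible reading of "normalize to ISO639-3 and keep subtags" is sketched below. How Europeana actually implements this is not described in the thread, so the function name `normalize_primary_subtag` and the three-entry mapping are assumptions for illustration only. Note also that BCP47 itself prefers the shortest available code, so a result such as `fra-CA` is an internal normal form rather than a canonical BCP47 tag.

```python
# Illustrative excerpt of an ISO 639-1 -> ISO 639-3 mapping (assumption:
# a real system would load the complete table from the ISO 639-3 registry).
ISO639_1_TO_3 = {
    "en": "eng",
    "fr": "fra",
    "de": "deu",
}


def normalize_primary_subtag(tag: str) -> str:
    """Map the primary subtag to ISO 639-3 and keep all other subtags."""
    primary, *rest = tag.split("-")
    primary = primary.lower()
    # Unknown or already three-letter primaries pass through unchanged.
    primary = ISO639_1_TO_3.get(primary, primary)
    # Remaining subtags are kept as given; they are not validated here.
    return "-".join([primary, *rest])


print(normalize_primary_subtag("fr-CA"))
print(normalize_primary_subtag("de-Latn-DE"))
```

As the comment says, this only hopes that the trailing subtags are correct; checking them would need the IANA Language Subtag Registry.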

@makxdekkers
Contributor

@aisaac As far as I understand it, the intention of dct:language, in many of the cases where I have seen it used, is to give users information about what kind of content they can expect to find in a resource. In most cases, knowing that a resource has content in a particular language helps the user decide whether the resource could be useful: that it is English is the main concern, while whether it is US or GB English is often less important.
It is true that using the current dct:language range does not allow the kind of detail on subtags that BCP47 has (primary, secondary, script, region, variant, extension, private, grandfathered) and I think it is unlikely that someone will create (resolving) URIs for all the possible combinations of BCP47. However, consuming applications would need to include code to make sense of the BCP47 code -- and I would think that in most cases applications would really be only interested in the first 2 or 3-letter code. Or would you think that a search interface would allow people to filter on fr-CA rather than just on French?
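The kind of consumer-side code this refers to can be quite small when only the primary language matters. The sketch below is an assumption-laden illustration, not a full BCP47 parser: it handles only two- and three-letter primary subtags, treats private-use (`x-...`) and `i-...` tags as unknown, and the function names are hypothetical.

```python
from typing import Iterable, List, Optional


def primary_language(tag: str) -> Optional[str]:
    """Return the lowercased primary language subtag, or None if the tag
    does not start with a two- or three-letter ASCII language code."""
    first = tag.strip().split("-")[0].lower()
    if len(first) in (2, 3) and first.isascii() and first.isalpha():
        return first
    return None


def filter_by_language(tags: Iterable[str], wanted: str) -> List[str]:
    """Keep the tags whose primary language equals `wanted`."""
    wanted = wanted.lower()
    return [t for t in tags if primary_language(t) == wanted]


# Filtering on "French" keeps fr, fr-CA and fr-Latn-FR alike.
print(filter_by_language(["fr", "fr-CA", "en-GB", "fr-Latn-FR", "x-custom"], "fr"))
```

A search interface that wanted to filter on `fr-CA` specifically would need more than this, for example prefix matching on subtags, which is the extra cost being weighed here.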

@aisaac
Contributor

aisaac commented Jul 9, 2019

@makxdekkers We should forget the limitation of dct:language in the base DCMI namespace. This is going to be lifted very soon. What will remain is a soft recommendation to use URIs for LinguisticSystems, with examples that include ISO 639-2 and 639-5 from the Library of Congress, but also BCP47 including subtags. So it will still be rather open to using subtags for scripts and other variants.

I agree that most cases will focus on simple tags, but there are for sure cases (as in academic research) where finer-grained info is relevant.

I guess I don't mind giving prominent mention to the simple codes (and the URIs that represent them) but it would be good to be not too exclusive - which I guess brings us back to the original comment.

@makxdekkers
Contributor

@aisaac I wasn't talking about the limitation of dct:language. I was arguing, as I often do, that increasing flexibility for data providers comes at a price of requiring data consumers to implement the logic to handle the additional flexibility.

@aisaac
Contributor

aisaac commented Jul 9, 2019

@makxdekkers I think I understand this point. I could point out that some data consumers might be happy to invest in more logic if they can get more value, but I won't push that too much. My main point here was really to make it 100% clear to anyone other than us that the constraint wouldn't come from the dct:language property itself, as the thread got a bit convoluted and may have hinted that it was the case.

@smrgeoinfo
Contributor

Can dct:language be multiple-valued? If so, the spec could recommend (SHOULD) use of LoC 639-1 and 639-2 URIs, but allow (MAY) additional dct:language elements with values from other, more granular schemes like BCP47. This would make client logic simpler, I think.
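A rough sketch of what that simpler client logic might look like: given all dct:language values found for a resource, a consumer separates the recommended LoC IRIs from anything else and can ignore the rest if it does not understand them. The function name `split_language_values` and the sample values are hypothetical.

```python
from typing import Iterable, List, Tuple

# IRI prefixes of the recommended LoC ISO 639-1 and ISO 639-2 vocabularies.
LOC_PREFIXES = (
    "http://id.loc.gov/vocabulary/iso639-1/",
    "http://id.loc.gov/vocabulary/iso639-2/",
)


def split_language_values(values: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split dct:language values into (recommended LoC IRIs, other values)."""
    recommended: List[str] = []
    other: List[str] = []
    for value in values:
        # str.startswith accepts a tuple of prefixes.
        if value.startswith(LOC_PREFIXES):
            recommended.append(value)
        else:
            other.append(value)
    return recommended, other


values = [
    "http://id.loc.gov/vocabulary/iso639-1/fr",
    "fr-CA",  # hypothetical finer-grained value from another scheme
]
recommended, other = split_language_values(values)
print(recommended)
print(other)
```

A basic client reads only the first list; a client that wants region or script detail can also inspect the second.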

@aisaac
Contributor

aisaac commented Jul 9, 2019

@smrgeoinfo Yes, this could be an interesting option. There is nothing against having multiple values for dct:language, if only because a resource could be in several languages.
It may be awkward to have multiple values when they are (slightly) redundant with each other, though, like a URI and a string that represent the same language, the latter with extra script info. But formally there is nothing against it.

@smrgeoinfo
Contributor

following "Recommendation 7. Encoding schemes should be implemented using a 'scheme' attribute of the XML element for the property. The encoding scheme name should be given as the attribute value. For example:" in http://www.dublincore.org/specifications/dublin-core/dc-xml-guidelines/2002-04-14/, (has that been superseded?), one could go as far as something like
<dct:language scheme = 'ISO639-2'> en </dct:language>
but I'm not sure how that would work in an RDF encoding.

@davebrowning
Contributor

Thanks for the useful input, and interesting discussions.

The DCAT editing group have discussed this issue at some length and our consensus view is that now is not the best time to change the DCAT position (as in DCAT 2014) ahead of DCMI softening its stance on ranges. We’d hope that this could be re-visited as part of a future revision that could take account of any such changes, and at that point we'd look at exploiting approaches such as those suggested by @smrgeoinfo as well as any others that are applicable.

The broader point in this thread – that flexibility for data providers often incurs costs for data consumers – is something that we’d like to address (perhaps in a primer document or some further examples) when we have time/people to do this.

To make sure we don't forget this, this issue will remain open, but tagged 'DCAT Future Work' so it can be picked up in future revisions.

@davebrowning added the dcat-primer label Jul 24, 2019
@davebrowning added the future-work label Sep 25, 2019
@aphillips
Author

I'm a little confused by the outcome. I think the situation is necessarily confusing. For example, the purl.org link on the RDF property dct:language in the draft takes you to a page that outright references RFC4646 (a version of BCP47), but the range in the document has normative SHOULD language both for the use of ISO 639-1/2 and (which is actually much weaker than BCP47) for using the -1 code when both a -1 and a -2 code exist.

I realize the desire for URLs, particularly resolvable ones, for valid language tags remains an issue for some users/implementations and that is something that hasn't been solved for BCP47 yet. It's something I will take an action item to pursue separately in the I18N WG.

The key problem here is that BCP47 language tags are widely used in Web-based specifications, making specs that don't support them a potential interoperability risk. Of course, the reverse is also true (that the sudden appearance of language tags could also be a problem). I do think it would be better if DCAT could insert at least some health warning or consider some guidance to at least call out the potential for change in this area so that implementers are not surprised if later the range restrictions are relaxed.

@aphillips
Author

Note: I have opened an action item in I18N to address the lack of resolvable URLs for language tags and have contacted various people at IETF/IANA about potentially hosting it.

@makxdekkers
Contributor

@aphillips Maybe you also want to include the people at http://www.lexvo.org/ in your discussions? They have done quite a lot of work on minting URIs for language-related objects which might be useful.

@andrea-perego
Contributor

Thanks for suggesting the inclusion of a "health warning", @aphillips . This would indeed be important to address the possible confusion caused by the pointer to the inconsistent definition in DCMI you pointed out - where the textual definition says dct:language should be used with ISO language codes, whereas the defined range is not a literal, but class dct:LinguisticSystem.

So, we can deal with this by adding a note, clarifying this point.

@aphillips , if you think this can solve the issue, we'll create a draft PR for you to review.

Just for our records I dug a bit into this.

The inconsistent DCMI definition was actually discussed by the GLD WG while working on the first release of DCAT - see https://www.w3.org/2011/gld/track/issues/26 - and the group ended up deciding to recommend the use of URIs.

Checking the different DCTERMS guidelines, the confusion is not solved. E.g., in the chapter about creating metadata of the Dublin Core User Guide, they keep on saying:

For the identification of languages please follow RFC 4646. Best practice would be to select a value from the three letter language tags of ISO 639 (e.g. http://www.sil.org/iso639-3/codes.asp).

However, the associated links to examples point to the relevant section of the Publishing Metadata chapter, which instead states:

The range of dcterms:language it [sic!] the class dcterms:LinguisticSystem. All values used with dcterms:language have to be instances of this class. Therefore the property may only be used with non-literal values.

ex:myBook dcterms:title "A great deliverance" ;
  dcterms:language [ rdf:value "eng"^^dcterms:RFC4646 ] .

or

ex:myBook dcterms:title "A great deliverance" ;
  dcterms:language <http://lexvo.org/id/iso639-3/eng>
...
ex:mySong dcterms:title "The Power of Orange Knickers"
  dcterms:language _:eng

_:eng rdfs:Label "English"
  ex:639-1 "en"  
  ex:639-2 "eng"

So, irrespective of the inconsistent free-text statements, dct:language seems to be, in the intention of the DCMI editors, an object property (because of its range), and the relevant examples confirm that it is not meant to be used with literals.

If this is the case, language codes are meant to be used only with class dct:LinguisticSystem, for describing a language, and this (i.e., describing a language) is not in scope of DCAT, but rather of reference registers / controlled vocabularies.

Of course, this may change in the future, if DCMI is going to relax its axioms, going maybe so far as to make an object property also a datatype property (as @aisaac noted).

However, as @makxdekkers was arguing, this will lead to a backward compatibility issue (at least when dct:language is used in the scope of DCAT), and it won't help interoperability.

Actually, there may be the option of using the corresponding property from DCMI Elements with literals, namely, dc:language, which is indeed a datatype property. However, considering the current status of the DCAT2 specification, and that no use case was submitted to motivate the support for language tags in DCAT, I don't think that this alternative can be included in DCAT2.

@r12a added the i18n-needs-resolution label Nov 20, 2019
@riccardoAlbertoni
Contributor

We discussed this issue in the last DCAT subgroup meeting.

@aphillips: do you think a "health warning" in the specification can solve the issue? And in such a case, could you provide us with a text to include?

@aphillips
Author

Having re-read this thread this morning and gone back to the editor's draft... the text here says:

Resources defined by the Library of Congress (ISO 639-1, ISO 639-2) SHOULD be used. If a ISO 639-1 (two-letter) code is defined for language, then its corresponding IRI SHOULD be used; if no ISO 639-1 code is defined, then IRI corresponding to the ISO 639-2 (three-letter) code SHOULD be used.

I think this should say something entirely different. It should probably say something like:

Different standards are used to identify languages. Where possible, BCP47 language tags SHOULD be used. Resources defined by ISO 639 MAY also be used. If an ISO 639-1 (two-letter) code is defined for a language, then its corresponding IRI SHOULD be used in preference to the ISO 639-2 or ISO 639-3 code.

Alternatively, if you don't want to change the recommendation, a health warning could be something like:

NOTE: requirements for identification of natural language in linked data specifications are evolving. Many applications use [BCP47] language tags for this purpose. ISO 639 also provides additional codes in ISO 639-3 which might be required for some uses.

I'm happy to discuss the details here. If you need me to attend a future call, please let me know.

(A couple of minor side notes. BCP47 is a better reference than any of its constituent RFCs; the current "core" RFC of BCP47 is 5646 (4646 was an older edition). ISO 639-3 and -2 are linked together (3 is a superset of 2), but the reference for 3, as far as I recall, is not the Library of Congress, hence my omission of LOC in my suggested text)

@riccardoAlbertoni added the due for closing label (Issue that is going to be closed if there are no objection within 6 days) May 30, 2022
@riccardoAlbertoni
Contributor

@aphillips: We have considered your points, and after some discussions, we included the health warning you suggested.
Is it ok if we close this issue?

@riccardoAlbertoni
Contributor

Closing under the "due for closing" policy, and also because we have implemented Addison's suggestion.

@aphillips
Author

In our teleconference today, the I18N WG agreed that we are satisfied by this change. Please remember to add this to DCAT3 also.

@davebrowning removed the due for closing label Feb 13, 2023
8 participants