handling the mul "language" #1475

Open
pfps opened this issue Aug 31, 2024 · 3 comments

@pfps

pfps commented Aug 31, 2024

Wikidata is adding a "mul" language, to be used for labels, in particular, when many languages have the same label for an item.

I think that this means that using @en@rdfs:label will not work well, as many items will not have an en label, but instead a mul label.

Is it possible to string this construct together, so that @en@mul@rdfs:label will get the en label if there is one and the mul label otherwise? Or is there some other construct that would work (aside from the three-line construct that gets both and does a COALESCE)?

@hannahbast
Member

@pfps Can you give an example? What is the semantics of FILTER(LANG(?literal) = "en") when ?literal has the language tag @mul? And what is the motivation for this?

@pfps
Author

pfps commented Sep 3, 2024

Wikidata is trying to cut down on the number of triples in the RDF dump. One thing that contributes to the large number of triples is repeated labels, e.g., for https://www.wikidata.org/wiki/Q892 where you can see the repeated labels. At https://www.wikidata.org/wiki/Q42 you can see the new way, with a mul "language" label (showing up under "default for all languages") and many of the other languages just using that. (The grey ones.)

What this means is that to get the English label for an item one has to do something like
OPTIONAL { ?x rdfs:label ?xLabelm . FILTER (LANG(?xLabelm) = "mul") }
OPTIONAL { ?x rdfs:label ?xLabele . FILTER (LANG(?xLabele) = "en") }
BIND (COALESCE(?xLabele, ?xLabelm) AS ?xLabel)
The issue is that there will often not be an "en" label if there is a "mul" label.
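
To make the intended fallback explicit, here is a minimal Python sketch of what the COALESCE pattern computes for one item. The function name `pick_label` and the dictionary representation (language tag mapped to label text) are assumptions made for illustration; they are not part of QLever or Wikidata.

```python
from typing import Mapping, Optional, Sequence


def pick_label(
    labels: Mapping[str, str],
    preferred: Sequence[str] = ("en", "mul"),
) -> Optional[str]:
    """Return the label for the first language tag in `preferred` that exists.

    Mirrors COALESCE(?xLabele, ?xLabelm): try "en" first, then "mul".
    `labels` maps a language tag to label text (hypothetical representation).
    """
    for lang in preferred:
        if lang in labels:
            return labels[lang]
    return None  # neither label exists, like an unbound ?xLabel


# Illustrative data only: an item with just a "mul" label still gets a label.
only_mul = pick_label({"mul": "Douglas Adams", "ru": "Дуглас Адамс"})
both = pick_label({"en": "English text", "mul": "Default text"})
```

A chained form such as @en@mul@rdfs:label would, under this reading, correspond to calling the function with the preference order ("en", "mul").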

I'm not saying that this is a good thing at all.

@tuukka

tuukka commented Sep 3, 2024

> I'm not saying that this is a good thing at all.

As I understand it, "mul" is being introduced for reasons internal to Wikidata: no other way forward was found for Wikidata's scalability. I don't think anyone wanted to break compatibility with all the existing queries and tools, but this is the current situation. Also, my current impression is that the semantics haven't been fully figured out and will depend on how Wikidata's editors start to use this new feature in the software.

One way to handle this in QLever is to preprocess the dumps by copying "mul" labels to "en" labels where there isn't one already. This would restore compatibility with existing queries. (The weird thing is that you won't know which of "mul" labels will work in English and which won't, but apparently this is as designed. There may be some useful heuristics such as "copy the labels only if the writing system is Latin or the item is an instance of Q5 (human).")
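
A rough sketch of that preprocessing step, in Python. It assumes the labels have already been extracted from the dump as (item, language tag, text) tuples and that all labels of one item arrive consecutively; both are assumptions for illustration, as is the function name `add_en_from_mul`. It copies every "mul" label unconditionally and does not implement the heuristics mentioned above.

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple

Label = Tuple[str, str, str]  # (item, language tag, label text)


def add_en_from_mul(labels: Iterable[Label]) -> Iterator[Label]:
    """Pass all labels through; add an "en" copy of "mul" where "en" is missing.

    Assumes labels of the same item are adjacent in the input, so only one
    item's labels are held in memory at a time.
    """
    for item, group in groupby(labels, key=lambda row: row[0]):
        rows = list(group)
        yield from rows  # keep the original labels unchanged
        by_lang = {lang: text for _, lang, text in rows}
        if "en" not in by_lang and "mul" in by_lang:
            yield (item, "en", by_lang["mul"])  # synthesized label


# Illustrative input, not taken from a real dump.
sample = [
    ("Q42", "mul", "Douglas Adams"),
    ("Q42", "ru", "Дуглас Адамс"),
    ("Q892", "en", "J. R. R. Tolkien"),
    ("Q892", "mul", "J. R. R. Tolkien"),
]
result = list(add_en_from_mul(sample))
```

With this input only the first item gains a synthesized "en" label; the second already has one and is left untouched. Queries that need to tell original "en" labels from copied ones would need the copies marked in some way, which this sketch does not do.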

The other way is to try to do the same as WDQS and keep a representation of Wikidata's internal model. This is useful for "maintenance" queries by people and tools that edit Wikidata ("I want to see all the items where the en label and mul label disagree."). To keep query writing practical, QLever would need some new syntax. It might make sense to support the same label service as WDQS (for compatibility and for query performance).

(Now that I think of it, you could combine these two approaches by providing yet another language code "en without mul" for those queries that want to keep en and mul separate.)
