Proposal for handling localizable texts (writeup of the F2F discussions) #354
Comments
If there is an agreement on this writeup, a separate PR will have to be prepared (probably by me...) |
On the title of metadata, about the necessity of ruby in Japanese. Reason 1: In these cases, HTML tags can not be used for metadata title and author name. The use of HTML tags is permitted only for content descriptions.
Reason 2: |
I'm very uncomfortable with the HTML proposal for reasons that have been pointed out before, as well as new ones based on your writeup:
Overall, I would be much more comfortable reframing the current discussion just in the context of the title. |
I would propose to leave #299 out of the discussion for now. The impression I got from the discussion with @danbri is that schema.org processors may not process language maps anyway, in which case that question may be moot. |
That may depend on areas of the world, but I agree the HTML version will not be used widely. But three-way choice means that, in most of the cases, the HTML version will not be necessary, ie, this is not a real load on users/authors. (I recognize it is a load on UA-s.)
If we really reduce the set to just a few attributes and elements, it may be relatively easy to handle these, including in a checker.
I am not sure what you mean. |
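As an illustration of the "checker" idea above, here is a minimal sketch of how a restricted HTML snippet could be validated. It assumes the tentative subset mentioned in the issue description (the span, ruby, rt, rb, bdi and bdo elements, plus the dir and lang attributes); the names `SubsetChecker` and `check_snippet` are hypothetical, and the sketch does not verify that the markup is well-formed.

```python
from html.parser import HTMLParser

# Tentative subset taken from the issue description; the final list is undecided.
ALLOWED_TAGS = {"span", "ruby", "rt", "rb", "bdi", "bdo"}
ALLOWED_ATTRS = {"dir", "lang"}


class SubsetChecker(HTMLParser):
    """Collects every element or attribute that falls outside the subset."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag not in ALLOWED_TAGS:
            self.problems.append(f"element <{tag}> is not allowed")
        for name, _value in attrs:
            if name not in ALLOWED_ATTRS:
                self.problems.append(f"attribute {name!r} on <{tag}> is not allowed")


def check_snippet(snippet):
    """Return a list of problems; an empty list means the snippet stays in the subset."""
    checker = SubsetChecker()
    checker.feed(snippet)
    checker.close()
    return checker.problems
```

A title using only ruby annotation and a language span passes, while something like `<b onclick="x()">hi</b>` is reported twice: once for the element and once for the attribute.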
To go back to what I mentioned about the title of the publication, currently in our spec:
There's an inconsistency in our spec that we need to solve, but if we decide that using
|
In the RDF vocabulary of the Japanese National Diet Library, they introduce dcndl:transcription everywhere. For example:
|
I found a document (*) that explains very clearly, in the scope of the Arabic language, how Unicode bidi characters can solve the issue of mixing languages in a Unicode string. I'm therefore not convinced that inventing an XML syntax (even if it's a subset of HTML, this is in practice an XML dialect) inside a JSON value would be a better solution for production tools, databases and parsers all together, for a use case which may represent only 1% of the total production of metadata. @iherman, note that in your list, it seems that if the HTML tagging was accepted, there would also be a need to define the (*) in French, sorry; I guess English texts can also be found easily |
@HadrienGardeur (on #354 (comment)) You are right that the title is the most problematic use case and that the But we also have the (That being said, the 'alt text' in HTML for images is also restricted to simple texts...) |
@iherman You'd need a very twisted mind to include bold text, ruby and multiple languages in a company name. For person names, this was handled just fine in EPUB. For |
Actually, @HadrienGardeur, what you wrote in #354 (comment) may not work 100%. Indeed, the definition of the title element defines it to have "text" as a content model which, in my reading, means that it is possible to set the base direction and the language for the title, but no other HTML tag based language trick is possible. It may be "good enough" for us, but we should know that. @r12a ? |
@iherman quite frankly, I'm fine with that as well. It means that we're aligned with the Web, which is one of the purposes of this group in the first place. |
According to the I18N group which met with us at TPAC Unicode isn't always sufficient. See this document for more: https://w3c.github.io/string-meta/#unicode-enough Additionally, there is an ongoing TAG review of the broader issues around multiple languages in JSON formats. There's also some WICG exploration around "purification" of HTML content down to specific tag and attribute sets: https://github.com/WICG/purification Browser defined HTML subsets are in their list of intended interests/use cases, so having a defined "language-focused subset" would have value there also.
I'd prefer we not alias anything until it's official. |
No one is suggesting that Unicode is enough. Right now we have Unicode + BCP 47 language tags with script subtags in our draft, which would correspond to the following note: https://w3c.github.io/string-meta/#script_subtag HTML/XML is also covered in there and they do include some important cons to that approach:
|
@toshiakikoike, @murata2makoto I want to briefly address this comment and Murata-san's reply. I agree that it is not necessary to add a transcription/pronunciation field to every natural language string. However, publication metadata should have a slot for each title and author to include this information. That's because sorting Japanese and Chinese depends on this information, which cannot generally be computed (particularly in Japanese) from the string itself. Without this information, sorting sets of content is generally reduced to radical based sorting, which is much less natural for the user. Overall, this is an aside from this issue's topic, but is still important for users of East Asian ideographic languages. |
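To make the sorting point concrete, here is a small sketch under stated assumptions: each record carries a kana transcription in a field called `reading` (a hypothetical name, not something defined in the draft), and plain code point order stands in for proper collation, which a real implementation would delegate to a collation library.

```python
# Illustrative records: "reading" is a hypothetical field holding the kana
# transcription that the metadata author would have to supply.
titles = [
    {"name": "東京", "reading": "とうきょう"},
    {"name": "大阪", "reading": "おおさか"},
    {"name": "京都", "reading": "きょうと"},
]


def sort_by_reading(records):
    """Sort by the kana reading, using plain code point order as an approximation."""
    return sorted(records, key=lambda record: record["reading"])


def sort_by_name(records):
    """Sort by the ideographic string itself, i.e. without any reading available."""
    return sorted(records, key=lambda record: record["name"])
```

With these three names, sorting by reading gives 大阪, 京都, 東京, whereas sorting on the ideographs alone gives 京都, 大阪, 東京, an order that does not follow pronunciation.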
In today's I18N teleconference I was tasked with replying to Ivan's email as well as this thread. I'm going to do that primarily with this comment. The current set of proposals (and the draft's text) covers language metadata reasonably well. The use of BCP47 language tags (which is also to say Unicode locale identifiers) at the publication and What remains at issue is the provisioning of direction metadata. We continue to feel that the construct One of the sticking points at TPAC was whether the resulting construct was compatible with JSON-LD or with RDF. From the above thread, it seems like this is something that could be worked around? What specific pros/cons need addressing here? Using markup for this, as championed by @BigBlueHat, would do this fine. Our sense is that this is not currently the consensus choice, however. We also note that the current draft contains language describing how to perform "first strong" detection of direction. We have a nit about the current text. It should say that the Unicode Bidirectional Algorithm is used to determine the base direction. In practice this means "usually the first strongly directional character", but there are certainly cases where the first strong character doesn't determine the direction (notably, when that character is surrounded by an "isolating" control character). @HadrienGardeur noted:
I think what is important is understanding why Unicode (by which I mean the use of Unicode bidi control characters) is insufficient. The Unicode embedding bidi controls do not solve all textual problems in a "wrong base direction" context, particularly with neutrals and directionally sensitive paired punctuation (such as parentheses). While LRM/RLM can be used with the controls to fix things up, quite a bit of introspection must be done by the content author to ensure the right result--this is inconvenient (and difficult to do automatically) for plain strings coming from databases or content management systems. Having the base direction for the string is often available to the producer and solves the problem (without having to edit invisible control characters not naturally occurring in the text!). Using "Bidi isolation" could help with this problem to some degree, although knowing the base direction is still necessary. The isolating controls make it much easier to construct, store, present, and exchange mixed direction text dynamically. However, because these controls are new, implementation support at the operating system and user agent level lags. Currently the controls are often just some "invisible junk" that don't have the desired effect. Finally, BCP47 tags are useful for identifying the language and presentation of content, but the artificial introduction of script subtags I think is a Bad Idea. The script is a property of the text itself. The script subtag exists to externally identify language/locale variations between content items. For most languages and most content--including the preponderance of languages using a bidirectional script--the script subtag is not recommended. In order to use a script subtag to set a base direction, content providers would have to determine what direction they wanted and infect the language tag with it. 
This might be fine if we follow common practice (as with CLDR/ICU) and infer suppressed scripts for text that follows its language (e.g. the tag Wouldn't it be better to just insert a metadata field than write code at the serialization layer (where I don't think it belongs) to inspect and insert additional bidi controls (which persist downstream), add script subtags, or wrap things in markup? |
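The "first strong" nit above can be sketched as follows. This is a simplified reading of the paragraph-level rules of the Unicode Bidirectional Algorithm, not a full implementation: it looks for the first character with bidi class L, R or AL and skips anything enclosed in isolate controls. The function name and the `default` parameter are hypothetical.

```python
import unicodedata

# Isolate controls: LRI, RLI and FSI open an isolate, PDI closes it.
ISOLATE_INITIATORS = {"\u2066", "\u2067", "\u2068"}
POP_DIRECTIONAL_ISOLATE = "\u2069"


def first_strong_direction(text, default="ltr"):
    """Return "ltr" or "rtl" from the first strong character outside any isolate."""
    depth = 0
    for ch in text:
        if ch in ISOLATE_INITIATORS:
            depth += 1
        elif ch == POP_DIRECTIONAL_ISOLATE:
            if depth > 0:
                depth -= 1
        elif depth == 0:
            bidi_class = unicodedata.bidirectional(ch)
            if bidi_class == "L":
                return "ltr"
            if bidi_class in ("R", "AL"):
                return "rtl"
    # No strong character found (e.g. digits or punctuation only).
    return default
```

Note that a string whose first visible letter sits inside an isolate is classified by the first strong character after that isolate, which is the case where "first strong character" and "base direction" diverge.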
Unfortunately, it is not possible. The JSON-LD rules are fairly strict, because they reflect the strict rules surrounding RDF Literals. This was discussed in the JSON-LD Working Group, and the Working Group decided that JSON-LD 1.1 would not deviate from RDF either. In other words, something like
is not possible: it would be invalid JSON-LD and would therefore be rejected by JSON-LD processors. As a consequence, I do not see any possible workaround at this moment... Note, however, that there has been some interesting discussion on RDF that may affect this issue. A long and complex discussion has been started recently on the future of RDF; see, e.g., the recently set up GitHub repository which collects lots of issues around RDF, including issues related to literals (see, e.g., w3c/EasierRDF#22, w3c/EasierRDF#21). It is unclear where this discussion will be heading and whether there will be a new version of RDF at some point, but that is certainly the long-term goal. If that happens, it will, eventually, affect the future evolution of JSON-LD as well. I think it would be worthwhile for the i18n community to be involved in that discussion on RDF, and make this recurring directionality issue very clear, hoping that this long-lasting problem may be solved once and for all. |
Would it be possible for you to provide a clear text, either in the form of a PR or simply sending us a replacement text? That would make the changes faster and cleaner. |
I try to summarize my thoughts...
Based on these facts, my personal proposal is:
|
I think we were envisaging something along the lines of: "auto: indicates that the textual values are explicitly directionally set to the direction of the first character with a strong directionality, following the rules of the Unicode Bidirectional Algorithm." (bold added to highlight changes only)
Two points there:
|
|
What i'm still not clear about is this: You added This makes me assume that it is possible for a consumer of the string to figure out that the base direction of a string should be RTL, as long as it knows enough about the structure of the WPub metadata to recognise The thing we appear to be stuck on is the use of a mechanism to indicate the item-specific base direction. And my understanding of the reason is that, when it comes to base direction, JSON-LD doesn't have a construct equivalent to that used for language. So here's why i'm confused. It seems to me that either: (a) we could add an item-specific field for base direction just like the one for language that may not be representable in JSON-LD, but presumably could still pass useful metadata to the consumer in a similar way to the use of the (b) there isn't actually a way of using the information provided by I'm fully prepared to be told that these questions expose large chasms separating my (mis)understanding and the way this all works, so please help me get that straight. |
I agree it is a bit confusing, sorry about that. From a purely JSON-LD (and RDF) point of view However... what does not work is this statement of yours:
the problem being that there is no way to do that per JSON-LD syntax. Something like:
does not work. A (JSON) object using (This was raised as a JSON-LD WG issue (w3c/json-ld-syntax#11) and was closed by the Working Group as a "won't fix".) The introduction of |
Thanks Ivan, but that still doesn't really answer my question.
Yes, i understand that. It has been said many times. But if the So why can't we have an additional item-specific field ( That's what i don't understand. |
My 2c: Any validation or processing of the JSON as JSON-LD would raise an error or throw away that information. Meaning that you've turned what was JSON-LD into something that isn't. It's not just meaningless, it's wrong. You could call it just JSON ... but that would be a shame, instead of gathering together the use cases for this feature as a core part of the RDF data model. |
@azaroth42 , does "it's wrong" apply here to |
For maximum clarity, btw (should have said this earlier), the scenario i'm asking about wrt item-specific data is
|
Apologies, to be clearer by way of example: It is fine to use predicates in the graph to manage this data, by assigning them to a full resource. Any other information could be added as well. This is valid:
It does not work with individual strings, as the extra properties beyond value and language are not allowed due to the RDF data model. This (and your example) is (thus) invalid:
As it would try to generate a literal: And thus does not work with language maps on resources, which rely on the language tag of the string:
The language map is a short cut for the more verbose, less familiar Instead you would need to have multiple title fields, each with exactly one content string, one language and one direction.
It would be a significant improvement to RDF if the direction form were allowed, and by demonstrating the lack here (and in the JSON-LD group, plus elsewhere) I feel that there's a better chance to fix it properly rather than patching over it in a non-standard way, thereby reducing the desire for a prompt solution. |
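The restriction described above can be modelled with a short sketch. It follows the description in this thread (a literal carries a value plus a language or a type, and nothing else) rather than the full JSON-LD grammar, which permits a few more keywords; `check_literal` is a hypothetical helper, not part of any JSON-LD library.

```python
# Keys a literal ("value object") may carry according to the discussion above.
# Simplification: the real JSON-LD grammar allows a few additional keywords.
ALLOWED_KEYS = {"@value", "@language", "@type"}


def check_literal(obj):
    """Return the reasons why obj cannot be read as a plain literal."""
    problems = []
    if "@value" not in obj:
        problems.append("missing '@value'")
    if "@language" in obj and "@type" in obj:
        problems.append("'@language' and '@type' cannot be combined")
    for key in sorted(set(obj) - ALLOWED_KEYS):
        problems.append(f"unexpected key {key!r}")
    return problems
```

Under this model, `{"@value": "some text", "@language": "en"}` is accepted, while adding a `direction` member is flagged, which is the situation the thread is stuck on.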
What you have below as a value of "name" is an object with some properties, but it is *not* a literal in terms of JSON-LD. That distinction means that, e.g., schema.org will not understand it.
If you replace "value" with "@value" and "language" with "@language" then you do get a valid representation of a literal (which is also ok with schema.org)... except that this will not work (per JSON-LD) if one keeps *any* other term, ie, "direction":-(
Ie: what you propose *may* be valid JSON-LD but means something fundamentally different and would not work for us...
(Written on my iPad. Excuses for brevity and misspellings...)
On 12 Apr 2019, at 19:48, r12a wrote:
For maximum clarity, btw (should have said this earlier), the scenario i'm asking about wrt item-specific data is
String with explicit language setting, i.e.,
"name": {
"value": "some text",
"language": "en",
"direction": "rtl"
}
|
I just realized a source of terrible confusion, sorry about that. The previous comment, referring to value and name, was taken in isolation. However, it is a widespread pattern in the JSON-LD world to define aliases for @value to be... value, for @language to be... language. In other words, your example, as well as the examples in the document, are indeed meant to be literals, but then you hit the issue with @value and JSON-LD...
Sorry I did not realize this before.
I.
——
Ivan Herman
|
Thanks to @azaroth42 and a chat with Ivan yesterday, i finally understand that the suggestion in #354 (comment) involves, by convention, a special JSON-LD construct (because it contains a mapping to @value) that represents an RDF string rather than an object, and for this to hold, an object containing an @value cannot contain anything else other than language and type. Adding anything else breaks its meaning for automated tools. I then found myself wondering whether (just for the specific instances where we know that 1st strong heuristics will fail, eg a title like "HTML و CSS: تصميم و إنشاء مواقع الويب") we could use a different approach, such as
for which this spec would provide advice about how applications can convert the relevant parts to RDF (minus the direction info), or otherwise spot that they need to override the 1st strong heuristics. Ivan told me that this would create too much repetition in the manifest, and @azaroth42 appears to be saying that it would create a getout that takes the pressure off the JSON-LD/RDF folks to properly address direction. And i readily admit that it's not an elegant solution. Therefore i conclude that, for the manifests created per this spec, wpub's preferred way of dealing with such a problematic title is to prepend an RLM to the string, like this:
Is that correct? |
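The RLM workaround summarized above could be applied by an authoring tool along these lines. This is only a sketch: `first_strong` is a deliberately simplified detector (it ignores isolate controls), and `force_base_direction` is a hypothetical helper that adds a mark only when the first strong character would otherwise give the wrong base direction.

```python
import unicodedata

RLM = "\u200F"  # RIGHT-TO-LEFT MARK
LRM = "\u200E"  # LEFT-TO-RIGHT MARK


def first_strong(text):
    """Return "ltr", "rtl", or None when the text has no strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


def force_base_direction(text, direction):
    """Prepend RLM or LRM if first-strong detection would not yield `direction`."""
    if first_strong(text) == direction:
        return text
    return (RLM if direction == "rtl" else LRM) + text
```

For the title quoted above, which starts with "HTML", `force_base_direction(title, "rtl")` returns the string with an RLM in front; a title that already starts with an Arabic letter is returned unchanged. The mark becomes part of the stored data, so this changes the string itself rather than describing it.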
@r12a, yes, your summary is correct. The current editors' draft does have a paragraph (the paragraph after the note in section 2.6.4.4.2) that refers to this possibility. We would appreciate it, however, if you could have a look and give us feedback that would make this clearer to the reader. If necessary, we can also add some more examples in the BiDi example table. (It may be a good idea to provide such feedback in a separate issue, though, and close this one which came out of the TPAC F2F discussion.) |
@r12a I kind of disagree that this should be "wpub's preferred way of dealing with this". This is a workaround. For it to be automatic it would require wpub implementations to introspect string values and insert an RLM (or LRM in selected cases) marker (changing the data, which is generally a bad idea). I don't think this should be normative. This is really more "advice to content authors" for dealing with RDF (etc.'s) shortcomings. To that I would add advice to wpub consumers that they can use other mechanisms described in string-meta, such as inferring the direction from the language tag. Effectively, we have no solution to the problem and no short term path to a solution. We (I18N) should engage RDF-NG, JSON-LD, and schema folks about building a long term solution. @iherman I agree with your resolution. Consider referencing the examples in string-meta to save space in your document. @r12a and I will review what you have and raise individual specific new issues. |
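Inferring direction from the language tag, as suggested above for consumers, might look like the following sketch. The two lookup tables are small illustrative subsets chosen for this example (an assumption, not data from string-meta or CLDR), and falling back to "ltr" for anything unlisted is likewise an assumption.

```python
# Illustrative subsets only; a real consumer would rely on a maintained data source.
RTL_SCRIPTS = {"Arab", "Hebr", "Syrc", "Thaa"}
RTL_LANGUAGES = {"ar", "fa", "he", "ur"}


def direction_from_language_tag(tag):
    """Guess "ltr" or "rtl" from a BCP 47 tag, preferring an explicit script subtag."""
    subtags = tag.split("-")
    language = subtags[0].lower()
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            # A singleton starts an extension or private-use sequence; stop looking.
            break
        if len(subtag) == 4 and subtag.isalpha():
            # A four-letter alphabetic subtag is the script subtag.
            return "rtl" if subtag.title() in RTL_SCRIPTS else "ltr"
    return "rtl" if language in RTL_LANGUAGES else "ltr"
```

The script subtag wins when present, so `az-Arab` is treated as right-to-left and `he-Latn` as left-to-right, while a bare `ar` falls back to the language table.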
@aphillips I see your point about the difficulties for implementers. Just a layperson's informative question, though. If (and I agree that is a big 'if') the author/editor of a book produces a proper string using, e.g., I am looking forward to your comments on the text in the current draft; we may then have an editorial run at those paragraphs and/or the examples. I am happy to participate in the work around RDF & directions if RDF-NG work is indeed initiated. |
@iherman If a string includes one of the strongly directional markers and the string is displayed in a One of the reasons to have metadata is so that implementations can supply the (above has |
If @r12a's proposal, the
Earlier, @llemeurfr points out that if we're not doing HTML processing (which would not be the case here because a
@iherman are we planning to require implementing those "magic" entities in WPUB strings? or are they just "stand-ins" for the otherwise invisible Unicode characters that @aphillips mentioned? |
as far as I am concerned, the latter. Ie, we would have
|
@r12a @aphillips can y'all confirm that the above is "sufficient" for what's been discussed here? Also, these approaches don't handle multi-language strings...for which we'd need HTML (and ideally a limited subset). |
@BigBlueHat Including strongly directional characters, such as the RLM mark, is something authors do in their content. You could have an advisory message to do this, but I would oppose normative language requiring wpub applications to provide these characters. No approach handles multi-language plain-text strings. That does require markup or other mechanisms. These are relatively rare. |
Just to be clear, this was not my proposal: it was my understanding of the wpub proposal after talking with Ivan.
For plain text strings i believe you'd be looking at either the invisible formatting character itself, or perhaps, since this is Javascript,
The escape could also be written I wouldn't expect to see |
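To illustrate that the escaped and the literal forms of the mark are the same string once parsed, here is a short sketch using Python's standard json module (the manifest fragment is a made-up example with a shortened title):

```python
import json

RLM = "\u200F"  # RIGHT-TO-LEFT MARK, invisible when displayed
title = RLM + "HTML و CSS"

# Default serialization escapes every non-ASCII character, so the mark is visible
# in the source as a \u escape.
escaped = json.dumps({"name": title})

# With ensure_ascii=False the invisible character itself is written out.
literal = json.dumps({"name": title}, ensure_ascii=False)

# Both serializations parse back to the identical string.
same = json.loads(escaped) == json.loads(literal)
```

The escaped form is easier to spot and review in a manifest; the literal form is byte-for-byte what the consumer ends up with either way.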
#440 is now a separate issue on the editorial aspect of the document. |
This is an attempt at a write-up of the result of the F2F discussion in Lyon on language and direction, i.e., on how to handle localizable texts in the Web Publication Manifest. (Note that this issue is really a JSON-LD one, hence the cc below to people outside of the WG.)
Global setting of language and base directions
The current draft remains unchanged: we use the schema.org inLanguage and our own inDirection term. The only change is that, eventually, the latter may become a schema.org term, and it should be removed, eventually, from the WPM Context
Item specific language (ie, Localizable text)
The mechanism that we agreed upon is as follows. There are three ways of setting a (localizable) literal. These literals are used for the title (ie, name) property, the names of creators, name of publishers, or (alt text) descriptions.
The three possibilities are, in an increasing level of complexity:
"name":"some text" where the value is a text in UTF8.
value is a text in UTF8, and language may use any valid bcp47 tag.
value is an HTML snippet. We will have to define a very minimal subset of HTML that fits the purpose of internationalization, and require that value MUST NOT go beyond that. The exact set of allowed HTML tags and attributes are still to be decided (hopefully in cooperation with other parties), but it will probably mean restricting to the usage of span, ruby, rt, rb, bdi, and bdo elements, and the dir and lang attributes.
Notes:
zh-Hant and zh-Hans for Chinese written in traditional, resp. Simplified, characters.
datatype.
inLanguage value. That makes the mapping to WebIDL's representation in Javascript more uniform (this is already the case in the current draft).
datatype and the language terms within the same object. If HTML is used and the language must be set explicitly, this should be done by enclosing the content into a span like this: <span lang="...">...</span>
datatype is an alias for @type in JSON-LD (to be added to the WPM context). This anticipates JSON-LD 1.1 where @datatype is planned to become a new keyword.
Cc in the group: @BigBlueHat @HadrienGardeur @TzviyaSiegman @GarthConboy @wareid
Cc in I18n: @r12a @aphillips
Cc in JSON-LD: @azaroth42 @gkellogg
Cc in WoT: @mkovatsc
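As a rough sketch of how a consumer might treat the two plain-text forms described above uniformly, the following helper maps both a bare string and a value/language object to the same shape. It assumes that the publication-level inLanguage value serves as the default language; the HTML form is deliberately left out, and `normalize_localizable` is a hypothetical name.

```python
def normalize_localizable(raw, default_language=None):
    """Map a bare string or a {"value", "language"} object to one uniform dict.

    default_language stands for the publication-level inLanguage value
    (an assumption made for this sketch).
    """
    if isinstance(raw, str):
        return {"value": raw, "language": default_language}
    if isinstance(raw, dict) and isinstance(raw.get("value"), str):
        return {
            "value": raw["value"],
            "language": raw.get("language", default_language),
        }
    raise ValueError("not a plain-text localizable value")
```

For example, `normalize_localizable("some text", "en")` and `normalize_localizable({"value": "some text", "language": "en"})` produce the same result, so downstream code only has to deal with one representation.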