Multiple-char operators in the Operator Dictionary #143

fred-wang · 2019-09-21T11:44:27Z

I'm not aware when this was decided, but the operator dictionary contains the following entries with multiple characters:

U00021-00021 !!  {'lspace': 1, 'rspace': 0}
U00021-0003D !=  {'lspace': 4, 'rspace': 4}
U00026-00026 &&  {'lspace': 4, 'rspace': 4}
U0002A-0002A **  {'lspace': 1, 'rspace': 1}
U0002A-0003D *=  {'lspace': 4, 'rspace': 4}
U0002B-0002B ++  {'lspace': 0, 'rspace': 0}
U0002B-0003D +=  {'lspace': 4, 'rspace': 4}
U0002D-0002D --  {'lspace': 0, 'rspace': 0}
U0002D-0003D -=  {'lspace': 4, 'rspace': 4}
U0002D-0003E ->  {'lspace': 5, 'rspace': 5}
U0002E-0002E ..  {'lspace': 0, 'rspace': 0}
U0002E-0002E-0002E ...  {'lspace': 0, 'rspace': 0}
U0002F-0002F //  {'lspace': 1, 'rspace': 1}
U0002F-0003D /=  {'lspace': 4, 'rspace': 4}
U0003A-0003D :=  {'lspace': 4, 'rspace': 4}
U0003C-0003D <=  {'lspace': 5, 'rspace': 5}
U0003C-0003E <>  {'lspace': 1, 'rspace': 1}
U0003D-0003D ==  {'lspace': 4, 'rspace': 4}
U0003E-0003D >=  {'lspace': 5, 'rspace': 5}
U0007C-0007C ||  {'lspace': 2, 'symmetric': True, 'stretchy': True, 'rspace': 2}
U0007C-0007C ||  {'lspace': 0, 'symmetric': True, 'stretchy': True, 'rspace': 0}
U0007C-0007C ||  {'lspace': 0, 'symmetric': True, 'stretchy': True, 'rspace': 0}
U0007C-0007C-0007C |||  {'lspace': 2, 'symmetric': True, 'stretchy': True, 'rspace': 2}
U0007C-0007C-0007C |||  {'lspace': 0, 'symmetric': True, 'stretchy': True, 'rspace': 0}
U0007C-0007C-0007C |||  {'lspace': 0, 'symmetric': True, 'stretchy': True, 'rspace': 0}
U0223D-00331 ∽̱  {'lspace': 3, 'rspace': 3}
U02242-00338 ≂̸  {'lspace': 5, 'rspace': 5}
U0224E-00338 ≎̸  {'lspace': 5, 'rspace': 5}
U0224F-00338 ≏̸  {'lspace': 5, 'rspace': 5}
U02266-00338 ≦̸  {'lspace': 5, 'rspace': 5}
U0226A-00338 ≪̸  {'lspace': 5, 'rspace': 5}
U0226B-00338 ≫̸  {'lspace': 5, 'rspace': 5}
U0227F-00338 ≿̸  {'lspace': 5, 'rspace': 5}
U02282-020D2 ⊂⃒  {'lspace': 5, 'rspace': 5}
U02283-020D2 ⊃⃒  {'lspace': 5, 'rspace': 5}
U0228F-00338 ⊏̸  {'lspace': 5, 'rspace': 5}
U02290-00338 ⊐̸  {'lspace': 5, 'rspace': 5}
U029CF-00338 ⧏̸  {'lspace': 5, 'rspace': 5}
U029D0-00338 ⧐̸  {'lspace': 5, 'rspace': 5}
U02A7D-00338 ⩽̸  {'lspace': 5, 'rspace': 5}
U02A7E-00338 ⩾̸  {'lspace': 5, 'rspace': 5}
U02AA1-00338 ⪡̸  {'lspace': 5, 'rspace': 5}
U02AA2-00338 ⪢̸  {'lspace': 5, 'rspace': 5}
U02AAF-00338 ⪯̸  {'lspace': 5, 'rspace': 5}
U02AB0-00338 ⪰̸  {'lspace': 5, 'rspace': 5}
U02ADD-00338 ⫝̸  {'lspace': 5, 'rspace': 5}
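
For readers who want to work with this list programmatically, here is a minimal sketch of how the keys in the first column could be decoded. It assumes the key format is exactly what the list above shows (a leading `U` followed by hyphen-separated hexadecimal code points); the helper name `decode_key` is hypothetical and not part of any spec tooling.

```python
def decode_key(key: str) -> str:
    """Turn a dictionary key such as 'U0002D-0003E' into the string '->'."""
    # Drop the leading 'U', then read each hyphen-separated field as hexadecimal.
    return "".join(chr(int(field, 16)) for field in key[1:].split("-"))


# Keys copied from the list above.
for key in ("U0002D-0003E", "U0002E-0002E-0002E", "U0226A-00338"):
    text = decode_key(key)
    print(key, repr(text), "code points:", len(text))
```

Note that `len()` on a Python string counts code points, so a base character followed by U+0338 has length 2 even though it renders as one glyph.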

Currently this is not supported in browsers ( https://bugs.webkit.org/show_bug.cgi?id=124828 ).
We have some tests to check that operators render the same with and without the operator properties from the dictionary specified explicitly.

Probably they will need to be handled in a separate table, which will make the code a bit more complex/larger. Multiple vertical bars are even stretchy but OpenType only provides per-glyph stretching, which means we will have to add more spec description/test/implementation if we really want to support multi-char stretching.

So I wonder how important all of these are? It seems many of them are just equivalent to a single unicode character (or are waiting for such a character to be introduced in Unicode). Can't people just use explicit lspace/rspace or the equivalent unicode code point? At least there are already code points for double/triple stretchy vertical bars which are stretchy and supported by OpenType fonts / browsers.

fred-wang · 2019-09-21T11:47:41Z

My proposal would be

remove them from MathML Core (this has never been implemented in browsers so that does not break anything + it was not normative)
make MathML Full refer to the MathML Core dictionary (to avoid duplicating content) and extend it with multiple characters for backward compatibility.

fred-wang · 2019-09-21T11:57:28Z

I stand corrected, multi-char seems to be implemented in Gecko:

https://searchfox.org/mozilla-central/source/layout/mathml/mathfont.properties#72

Spacing seems to work but not stretching. Currently, Gecko uses a hash table of strings (see https://bugzilla.mozilla.org/show_bug.cgi?id=1336437) while WebKit uses a sorted table of Unicode code points. cc @emilio
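
To make the contrast between the two storage strategies concrete, here is a rough sketch of both. It is not the Gecko or WebKit code; the table contents are a small excerpt of the list at the top of the thread, and the names `lookup_hash` and `lookup_sorted` are hypothetical. The sorted variant keys on tuples of code points, which is one possible way (an assumption) to extend a sorted code point table to multi-char operators.

```python
from bisect import bisect_left

# Excerpt of the multi-char entries listed above: operator -> (lspace, rspace).
ENTRIES = {
    "!=": (4, 4),
    "&&": (4, 4),
    "->": (5, 5),
    "<=": (5, 5),
    "\u226a\u0338": (5, 5),  # U+226A followed by U+0338
}


def lookup_hash(op):
    """Strategy 1: hash table keyed by the operator string."""
    return ENTRIES.get(op)


# Strategy 2: a sorted table of code point tuples, searched by bisection.
SORTED_KEYS = sorted(tuple(ord(c) for c in op) for op in ENTRIES)
SORTED_VALUES = [ENTRIES["".join(chr(cp) for cp in key)] for key in SORTED_KEYS]


def lookup_sorted(op):
    key = tuple(ord(c) for c in op)
    i = bisect_left(SORTED_KEYS, key)
    if i < len(SORTED_KEYS) and SORTED_KEYS[i] == key:
        return SORTED_VALUES[i]
    return None


print(lookup_hash("->"), lookup_sorted("->"), lookup_sorted("=>"))
```

Both lookups return the same answers; the difference is in memory layout and in how naturally variable-length keys fit.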

davidcarlisle · 2019-09-21T12:05:29Z

There are really two flavours of these. Some, with duplicated ASCII like ++ and +=, might be thought of as legacy approximations to symbols such as ⧺, but there are not really enough such symbols, and in any case people often prefer the look of the repeated operators when laying out += for C-style assignments.

Other duplicated operators with a combining character such as the combining negation slash or the variant selector are harder to get rid of as Unicode as a rule would be reluctant to add new pre-composed characters that are equivalent to a combination with a negation slash. The ones that do have pre-composed negations are some arbitrary list based on legacy font encodings (mostly).

Stretching of multiple-character operators is likely to be difficult (pretty much impossible in TeX as well), so you could probably say explicitly that it isn't supported in core (and we could make all multiple-character entries have stretchy set to false?).

If supporting multiple-character entries for spacing is likely to be problematic in core, then it would be easy enough (I think) to extract a table for the core spec without them and extract something for the full spec that adds them back. But if spacing is OK and only the stretchy property is difficult, then as I say I think we could just make them all stretchy=false even in full.

davidcarlisle · 2019-09-21T12:24:36Z

note that if you use the entities the multiple character nature is hidden.

if you look at greater than, not greater than, much greater than, not much greater than then

>, &ngt;, &GreaterGreater;, &NotGreaterGreater;

>, ≯, ⪢, ≫̸

look like four similar inputs, but the fact that one negation is pre-composed and one made up of a base and combining character is the sort of low level Unicode details that in an ideal world authors would not need to know about.
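
A quick way to see the difference that the entity names hide is to expand them and count code points. The sketch below uses Python's standard `html.unescape`, which knows the HTML5 named character references; the helper name `code_points` is hypothetical.

```python
import html


def code_points(entity_name: str) -> list:
    """Expand a named character reference and list its code points."""
    text = html.unescape(f"&{entity_name};")
    return [f"U+{ord(c):04X}" for c in text]


# The four entity names from the comment above.
for name in ("gt", "ngt", "GreaterGreater", "NotGreaterGreater"):
    print(f"&{name};", code_points(name))
```

Three of the four names expand to a single code point, while `&NotGreaterGreater;` expands to a base character followed by U+0338, which is exactly the kind of entry the operator dictionary has to store as a multi-char string.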

rwlbuis · 2019-09-23T16:05:41Z

This seems reasonable and people will probably expect the multi char to be supported given the past. I'll soon try to implement this since at least for Full it will be in the specification, so WebKit would need to support it anyway.

NSoiffer · 2019-09-23T17:06:00Z

My 2 cents:

A number of these, such as ++ and += were added for compatibility with programming languages. People sometimes write pseudo code that has subscripts and other math notations but also use programming language symbols. I don't have a clue how often that is done with MathML though.
I believe some multi-char operators were added as approximations prior to Unicode adding the symbol. I could be wrong about that. If anyone cared, you can look at when symbols were added. This class of characters could easily be dropped. || and ||| as prefix/postfix operators fall into that category. As infix operators, at least || falls into the programming language category.
I agree that any remaining multiple-character symbols should not be stretchy by default. Note that the MathML spec says "In practice, typical renderers will only be able to stretch a small set of characters, and quite possibly will only be able to generate a discrete set of character sizes." Hence, stretchiness is dependent on the renderer and the font. For core, we want all renderers to do the same thing, but what they do depends on the font. I'm not sure how we specify that...
In most(?) cases, Unicode will say that a combining character involving a slash (to create a "not" form) is equivalent to a pre-combined character. That has to be supported in core as this comes from Unicode, not MathML.

rwlbuis · 2019-09-26T09:00:17Z

We now support enough of multi-char in chromium that the test passes:
https://w3c-test.org/mathml/presentation-markup/operators/operator-dictionary-001.html

khaledhosny · 2019-10-01T07:43:43Z

Unicode as a rule would be reluctant to add new pre-composed characters that are equivalent to a combination with a negation slash

AFAIK while Unicode has a policy against encoding new pre-composed characters, combining marks that over strike their bases are exempted from this (but they will not be made canonically equivalent to the decomposed form).
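
The question of which negated forms are canonically equivalent to a precomposed character can be checked with Unicode normalization. This is only an illustrative sketch using Python's standard `unicodedata` module with two sample operators; it says nothing about what a MathML implementation does internally, and the helper name `nfc_length` is hypothetical.

```python
import unicodedata

SLASH = "\u0338"  # COMBINING LONG SOLIDUS OVERLAY


def nfc_length(base: str) -> int:
    """Length in code points of base + U+0338 after NFC normalization."""
    return len(unicodedata.normalize("NFC", base + SLASH))


# '>' + U+0338 composes to the precomposed U+226F (NOT GREATER-THAN).
print(nfc_length(">"))
# U+226B (MUCH GREATER-THAN) + U+0338 has no precomposed form and stays as two code points.
print(nfc_length("\u226b"))
```

So normalizing the input before the dictionary lookup would fold some negated operators into single characters, but entries such as U+226B U+0338 would still need multi-char handling.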

fred-wang · 2019-10-01T16:19:21Z

So I didn't comment here, but two weeks ago we agreed to keep multi-char support. Rob already fixed our Chromium branch and there is https://bugs.webkit.org/show_bug.cgi?id=124828 in WebKit.

fred-wang · 2019-10-04T06:12:26Z

Consensus from previous meetings:

keep multi-char operators
remove "stretchy" property for them (since stretching is not supported in that case anyway)

fred-wang · 2020-04-11T02:09:37Z

@NSoiffer @davidcarlisle I still see a lot of multiple-char entries with symmetric/stretchy (and fence). Can we remove these properties?

davidcarlisle · 2020-04-11T10:15:53Z

Yes, I agree we shouldn't imply these stretch. Neil, do you have pending changes, or should I do that?

NSoiffer · 2020-04-11T19:50:53Z

Yes, I have some changes pending. I'll remove any stretchy properties from them. Since symmetric only applies to stretchy chars, I'll make sure those go too. Removing "fence" though doesn't make sense unless you are saying you want to remove that property from MathML ("separator" would then go also). That would be something to raise in its own issue and something to discuss on a call.
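
The clean-up described here can be sketched as a small filter over the dictionary data. This is not the actual script used to generate the dictionary; the data layout (a list of operator/property pairs) and the function name `strip_stretch_properties` are assumptions made for illustration, and the single-character "|" entry is hypothetical.

```python
STRETCH_PROPERTIES = ("stretchy", "symmetric")


def strip_stretch_properties(entries):
    """Return a copy of entries where multi-char operators no longer claim to stretch."""
    cleaned = []
    for operator, properties in entries:
        if len(operator) > 1:  # more than one code point
            properties = {
                key: value
                for key, value in properties.items()
                if key not in STRETCH_PROPERTIES
            }
        cleaned.append((operator, dict(properties)))
    return cleaned


entries = [
    ("||", {"lspace": 2, "symmetric": True, "stretchy": True, "rspace": 2}),
    ("|||", {"lspace": 0, "symmetric": True, "stretchy": True, "rspace": 0}),
    # Hypothetical single-character entry, included only to show it is left alone.
    ("|", {"lspace": 0, "symmetric": True, "stretchy": True, "rspace": 0}),
]

for operator, properties in strip_stretch_properties(entries):
    print(operator, properties)
```

Other properties such as fence or separator are deliberately left untouched, in line with the comment above.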

fred-wang · 2020-04-15T11:21:52Z

Yes, I have some changes pending. I'll remove any stretchy properties from them. Since symmetric only applies to stretchy chars, I'll make sure those go too.

Thanks.

Removing "fence" though doesn't make sense unless you are saying you want to remove that property from MathML ("separator" would then go also). That would be something to raise in its own issue and something to discuss on a call.

Yes, that's why I put that one in parentheses. fence/separator don't have any use for layout so implementers can just ignore them for now anyway, which is probably what we will do in Chromium for now. The question of whether this will be used for browsers' accessibility tree is still open but I'm not aware of any use or plan to use it (they are exposed by WebKit on iOS/macOS but I'm not sure if VoiceOver handles them). It seems there are not many operators with these properties so there is also the option of handling them separately if they turn out to be necessary.

fred-wang · 2020-05-06T14:12:19Z

I opened #209 for the separate fence/separator discussion.

The following entries seem still weird to me, can the spacing be tweaked so that they can be moved to another pre-existing category?

** infix: {'form': 'infix', 'lspace': 1, 'rspace': 1}
// infix: {'form': 'infix', 'lspace': 1, 'rspace': 1}
<> infix: {'form': 'infix', 'lspace': 1, 'rspace': 1}

I still see a lot of repeated ASCII characters and I'm not sure how relevant these entries are. I would rather see them in preformatted text, not math layout...

fred-wang · 2020-05-07T07:26:41Z

Multichar entries are now handled by https://mathml-refresh.github.io/mathml-core/#operator-dictionary-compact

For the record, the current estimated size is 770*2 = 1540 bytes. The cost of supporting multi-char entries is quite significant: (154+49)*2 = 406 bytes, so 26% of the dictionary size.
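
The arithmetic behind that estimate can be reproduced directly. The numbers are the ones quoted in the comment; the variable names are hypothetical, and what the 154 and 49 individually count is not spelled out in the thread, so they are treated as opaque here.

```python
BYTES_PER_UNIT = 2

total_bytes = 770 * BYTES_PER_UNIT
multi_char_bytes = (154 + 49) * BYTES_PER_UNIT
share = multi_char_bytes / total_bytes

print(total_bytes, multi_char_bytes, f"{share:.1%}")
```

The share works out to a little over 26%, matching the figure quoted above.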

If some entries are not essential, it would be very good to try and simplify things. For example, restricting to 2-char strings would avoid the extra character necessary for null-terminated strings. And it looks like ASCII forms are not important at all; they should be replaced with the proper Unicode code point (or people should use preformatted text rather than math formulas). I wonder whether we could just restrict to negated XXXX-00338 entries?

* "|||" does not seem to be used as a programming language operator.
* For (stretchy) fences, U+2980 is more appropriate than "|||".
w3c/mathml#143 w3c/mathml#176

It seems to be used as a punctuation sign rather than an operator. The ellipsis character … U+2026 seems more appropriate for that purpose. w3c/mathml#143 w3c/mathml#176

fred-wang · 2020-05-07T11:57:28Z

If some entries are not essential, it would be very good to try and simplify things. For example, restricting to 2-char strings would avoid the extra character necessary for null-terminated strings.

For this point, I opened
mathml-refresh/xml-entities#25
mathml-refresh/xml-entities#26

fred-wang · 2020-05-12T06:51:32Z

This is now a table of 2-char ASCII operators (38 bytes): Operators_2_ascii_chars
https://mathml-refresh.github.io/mathml-core/#operator-dictionary-compact

The text has been changed to handle the case of a 2-char operator whose second character is either U+338 COMBINING LONG SOLIDUS OVERLAY or U+20D2 COMBINING LONG VERTICAL LINE OVERLAY. I'm not sure if there is an easy way in browsers to check for combining characters, and only these two seemed important per yesterday's discussion. But we can change that later if more single char + combining are needed.
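
As a rough illustration of the resulting split, the sketch below sorts an operator string into the cases mentioned here. It is not the algorithm from the spec; the function name `classify`, the category labels, and the contents of `TWO_ASCII` are assumptions. The set simply collects the 2-char ASCII operators from the list at the top of the thread (19 entries, which at 2 bytes each would be consistent with the 38 bytes quoted above), and the final table may differ.

```python
COMBINING_OVERLAYS = {0x0338, 0x20D2}

# 2-char ASCII operators taken from the list at the top of the thread (assumed contents).
TWO_ASCII = {
    "!!", "!=", "&&", "**", "*=", "++", "+=", "--", "-=", "->",
    "..", "//", "/=", ":=", "<=", "<>", "==", ">=", "||",
}


def classify(op: str) -> str:
    """Sort an operator string into the lookup path it would take in this sketch."""
    code_points = [ord(c) for c in op]
    if len(code_points) == 1:
        return "single code point"
    if len(code_points) == 2:
        if code_points[1] in COMBINING_OVERLAYS:
            return "base + combining overlay"
        if op in TWO_ASCII:
            return "2-char ASCII"
    return "not in dictionary"


for op in ("+", "->", "\u226a\u0338", "\u2282\u20d2", "|||"):
    print(repr(op), classify(op))
```

Python strings are sequences of code points, so surrogate pairs do not show up here; an implementation working on UTF-16 code units would need to handle them separately.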

The two surrogate pairs for Arabic operators are also handled specially.

I'm closing this as the tests are already written, they just need to be regenerated.

fred-wang mentioned this issue Sep 21, 2019

Agenda for Core Meeting #8

Closed

fred-wang mentioned this issue Nov 18, 2019

Operator dictionary: Provide a compact form in MathML Core #176

Closed

fred-wang mentioned this issue May 7, 2020

Remove operators "|||" from the dictionary mathml-refresh/xml-entities#25

Merged

fred-wang mentioned this issue May 7, 2020

Replace 3-char operator "..." with "U+2026 …" instead. mathml-refresh/xml-entities#26

Merged

fred-wang closed this as completed May 12, 2020

fred-wang removed the labels "need resolution" (Issues needing resolution at MathML Refresh CG meeting), "need specification update" (Issues requiring specification changes), "need tests" (Issues related to writing WPT tests) and "need polyfill" (Issues requiring implementation changes) May 12, 2020
