Complex text layout, in particular with TeX input [was: MathJax does not support Complex text layout.] #474

hartman · 2013-05-19T19:10:02Z

Because MathJax looks at individual code points, it has trouble with scripts that require bidirectionality, contextual shaping, etc. This is visible whenever you try to use Hebrew or Arabic, for instance.

It would be good if MathJax could identify these ranges and keep them together as blocks instead of dividing them into individual characters, at the very least in \text mode.

http://en.wikipedia.org/wiki/Complex_text_layout

dpvc · 2013-05-19T21:13:41Z

Note that if you set mtextFontInherit to true in the HTML-CSS and SVG sections of your configuration, then MathJax will process \text{} as a single <span>, and so that should do as you request. You are right that MathJax could do better when mtextFontInherit is false. It should group "unknown" characters into a single collection, rather than putting each into a separate <span>.
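For reference, a minimal sketch of the configuration described above, assuming a MathJax v2 page. The variable name mathJaxConfig is only illustrative; the option name and the two section names are the ones mentioned in this comment.

```javascript
// Sketch of a MathJax v2 configuration that makes \text{} content inherit
// the surrounding page font, so the browser lays it out as a single run.
// "mathJaxConfig" is a hypothetical name chosen for this example.
const mathJaxConfig = {
  "HTML-CSS": { mtextFontInherit: true },
  SVG: { mtextFontInherit: true },
};

// Only apply the configuration when MathJax v2 is actually loaded on the page.
if (typeof MathJax !== "undefined" && MathJax.Hub) {
  MathJax.Hub.Config(mathJaxConfig);
}
```

With this in place, text such as \text{...} containing Hebrew or Arabic is handed to the browser as one unit rather than character by character.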

dpvc · 2013-05-19T21:30:17Z

PS, I saw the report on the Wikimedia bugzilla and was planning to add it to the list of things to fix. Thanks for starting the issue here to track that.

hartman · 2013-05-22T20:26:03Z

Thanks for the mtextFontInherit tip. I was going to enable that anyways, but this is one more reason to do that.

dpvc · 2014-03-21T21:11:45Z

Some support for RTL was added in v2.3, but the issue of multiple-character sequences being treated as a unit remains. For \text{}, these characters should already be grouped into a single <span>, so that would be one way to handle it, though not very convenient.

Ideally, MathJax would put each sequence that forms one group into a single <mi> or <mo>, just as it does for single Latin letters now. I've looked into this to some degree, and there are some difficulties handling it. It is possible to have combining characters grouped with their preceding characters, but it is not clear to me how some characters work. For example, it seems that the virama (U+0D4D) combines not just with the character on its left, but also with the one on its right, though I might be misunderstanding it. It also seems that some of these groupings are handled by ligatures within the fonts, not by combining characters. Unfortunately, MathJax does not have access to ligature information from the fonts. While it would be possible to add ligature data to MathJax's font tables, this could be a significant amount of data, very little of which would be used by any one page.

I'm really not familiar enough with the languages that use these features to know if what I'm trying out would be sufficient or not. I'm wondering if it is possible to get some examples from a variety of languages that show the range of situations that need to be accommodated.

One approach might be to put the data needed for each language's script into an individual extension that gets loaded for those pages that need it (either explicitly in the MathJax configuration, or via \require{} within the math on the page). Do you think that would be acceptable?

hartman · 2014-03-22T13:32:10Z

Perhaps @amire80 of our WMF language engineering is able to help out a bit here...

pkra · 2015-02-26T09:16:44Z

@hartman do you think you could poke @amire80 some time? We'd love to improve this, especially if Wikipedia wants to roll out the SVG output more widely.

amire80 · 2015-02-26T09:24:18Z

I'm right here :)

How can I help?

Testing? - Gladly, just tell me what to test exactly.

Examples of how non-Latin scripts work in formulas? - It's not used in Hebrew textbooks, but it is used in textbooks in Arabic and Persian. Maybe @ebraminio can chime in here.

Anything else?

pkra · 2015-02-26T09:35:05Z

Thanks for stopping by @amire80 :-)

> How can I help?

I'm hoping we can improve handling of combined characters in non-Latin scripts. This has come up on WMF bugzilla/phabricator repeatedly. To quote Davide from #474 (comment):

> Ideally, MathJax would put each sequence that forms one group into a single <mi> or <mo>, just as it does for single Latin letters now. I've looked into this to some degree, and there are some difficulties handling it. It is possible to have combining characters grouped with their preceding characters, but it is not clear to me how some characters work. For example, it seems that the virama (U+0D4D) combines not just with the character on its left, but also with the one on its right, though I might be misunderstanding it. It also seems that some of these groupings are handled by ligatures within the fonts, not by combining characters. Unfortunately, MathJax does not have access to ligature information from the fonts. While it would be possible to add ligature data to MathJax's font tables, this could be a significant amount of data, very little of which would be used by any one page.
>
> I'm really not familiar enough with the languages that use these features to know if what I'm trying out would be sufficient or not. I'm wondering if it is possible to get some examples from a variety of languages that show the range of situations that need to be accommodated.

So our question would be: does anyone have expertise they can share with us? @hartman was kind enough to point to you ;-)

(Perhaps we should split this out into a separate issue.)

amire80 · 2015-02-26T09:53:20Z

The (very) basic idea of virama is that the sequence of consonant + virama + consonant has three Unicode characters, which appear as occupying the space of one glyph (but it can get far more complicated).
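To make the consonant + virama + consonant point concrete, here is a small sketch that lists the code points of the Malayalam sequence used as an example later in this thread (U+0D24, U+0D4D, U+0D30). The variable names are only illustrative.

```javascript
// The conjunct is written with escapes so the exact code points are visible.
const conjunct = "\u0D24\u0D4D\u0D30"; // consonant + virama + consonant

// Array.from iterates by code point, not by UTF-16 code unit.
const codePoints = Array.from(conjunct, (ch) =>
  "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);

console.log(codePoints.join(" ")); // U+0D24 U+0D4D U+0D30
console.log(codePoints.length);    // 3
```

A font with the right shaping rules draws these three code points as one visual unit, which is why splitting them into separate elements breaks the rendering.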

More generally, I'd love to understand MathJax's current situation. What should I do to test the current rendering? Install my own instance? Or is there an online instance where a current version can be tested?

pkra · 2015-02-26T10:30:00Z

> consonant + virama + consonant has three Unicode characters, which appear as occupying the space of one glyph

Right. Combined characters are common enough in mathematical layout so we understand the situation in general.

> (but it can get far more complicated).

That's our problem. We lack the specifics for most natural language, non-Latin scripts.

> Or is there an online instance where a current version can be tested?

You can do this on MediaWiki (using the MathML/SVG mode of the math extension), in the browser (this sample or this codepen) or use a local copy of MathJax -- whichever you like.

A basic example: ത്ര (U+0D24 U+0D4D U+0D30) will be split into its individual code points, and since we don't have any routines to identify these kinds of combined characters, the TeX input converts this internally to MathML as

<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mrow class="MJX-TeXAtom-ORD">
    <mo>&#xD24;</mo>
  </mrow>
  <mrow class="MJX-TeXAtom-ORD">
    <mo>&#xD4D;</mo>
  </mrow>
  <mrow class="MJX-TeXAtom-ORD">
    <mo>&#xD30;</mo>
  </mrow>
</math>

The MathJax output will in turn split this across three spans (in the HTML outputs) or three g elements (in the SVG output) -- and of course this breaks the rendering of the combined character.

(I just noticed that Firefox sometimes combines the spans in the HTML outputs, e.g., ത്ര, but not the subscript in കു_ശ. Chrome is more "consistent" in that nothing is combined.)

So for us the problem is: is there a concise set of data (or some efficient heuristic) that we could use to identify all relevant situations where we need to re-combine into one mi/mo element in the MathML? Once we have that, the rendering will work as well.

davidcarlisle · 2015-03-02T16:07:02Z

> So for us the problem is: is there a concise set of data (or some efficient heuristic) that we could use to identify all relevant situations where we need to re-combine into one mi/mo element in the MathML?

Sorry for the long comment, bringing a bit of off site discussion back to the issue tracker.

How feasible/expensive would it be to make the Unicode UCD database combining class available to MathJax for each character? Basically (or at least as a good first approximation) any character with a non-zero combining class (field 4 in UnicodeData.txt) needs to stay with the preceding one, and in addition, if it's class 9 (virama), the following character needs to be kept together as well.
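A rough sketch of that first approximation follows, with two stated assumptions. JavaScript does not expose the combining class directly, so the regular expression \p{M} (any Mark) stands in for "non-zero combining class", which is somewhat broader. The virama set is a small hand-picked list of class 9 characters, not a complete table. The names groupByCombiningHeuristic and VIRAMAS are hypothetical.

```javascript
// Assumption: \p{M} approximates "non-zero combining class". It also matches
// marks whose combining class is 0, so it groups slightly more than requested.
const MARK = /^\p{M}$/u;

// Assumption: a partial, hand-picked list of class 9 (virama) code points.
const VIRAMAS = new Set([
  0x094d, // Devanagari
  0x09cd, // Bengali
  0x0a4d, // Gurmukhi
  0x0acd, // Gujarati
  0x0b4d, // Oriya
  0x0bcd, // Tamil
  0x0c4d, // Telugu
  0x0ccd, // Kannada
  0x0d4d, // Malayalam
]);

// Group code points so that marks stay with the preceding character and the
// character after a virama is pulled into the same group.
function groupByCombiningHeuristic(text) {
  const groups = [];
  let joinNext = false;
  for (const ch of text) { // iterates by code point
    if (groups.length > 0 && (joinNext || MARK.test(ch))) {
      groups[groups.length - 1] += ch;
    } else {
      groups.push(ch);
    }
    joinNext = VIRAMAS.has(ch.codePointAt(0));
  }
  return groups;
}

console.log(groupByCombiningHeuristic("abc").length);                // 3
console.log(groupByCombiningHeuristic("c\u0327").length);            // 1
console.log(groupByCombiningHeuristic("\u0D24\u0D4D\u0D30").length); // 1
```

Each resulting group could then be placed in a single mi/mo element instead of one element per code point. Joining after every virama is only a first approximation and will over-group in some scripts.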

It's probably also worth noting that TeX, even a Unicode TeX like XeTeX or LuaTeX, is almost certainly not going to get this right without markup. That is, you will need \text{abc} or \mathit{abc} or some other such command to force a string of characters to be typeset as text with a single font, rather than TeX's normal habit of splitting things up character by character, even if the construct looks like a single character to the author.

In classic TeX it is not an issue, as fonts can only have 256 characters. While composed characters can be supported with various macro remapping tricks, combining characters that follow the base are basically not supportable, even for simple combining accents like acute.

Support in Unicode TeX variants such as XeTeX and LuaTeX seems a bit variable. In text, XeTeX hands things over to the HarfBuzz library, so it does pretty well. LuaTeX handles it internally and currently does less well with the virama. In math, both require a font with an OpenType MATH table to do anything very useful, and I couldn't find such a font that had a virama.

The following LaTeX document uses Kartika in text and Latin Modern Math in math. You will note that even European accents typically fail in math, but even the virama example works if you add some markup: \mbox here, or equivalently mi or mtext in MathML.

The image shows xetex at the top and luatex at the bottom.

So while not requiring something like \text{..} or \mbox{...} around such character strings would be desirable, it would put your Unicode support a long way ahead of what TeX can currently achieve. So it depends a bit on what the specification of the "TeX-like syntax" is: how far beyond what TeX can do is it reasonable to push it?

\documentclass{article}

\usepackage{fontspec}
\usepackage{unicode-math}
\setmainfont{kartika.ttf}


\begin{document}

U+0d24 U+0d4d U+0d30 outputs e.g., ത്ര but 

abc $abc \mbox{ത്ര} $  U+0063

abç $abç \mbox{ത്ര} $ U+00e7

abç $abç \mbox{ത്ര} $  U+0063 U+0327

\end{document}

khaledhosny · 2015-03-03T12:05:37Z

I'm not really sure if I understand what the discussion is about, but if the idea is to identify which sequences of characters constitute a single unit, then Unicode grapheme clustering should provide the needed information.

amire80 · 2015-03-03T12:13:46Z

Yes - what @khaledhosny says sounds like the right thing to me, although I'm not very experienced with it. Maybe @santhoshtr can contribute more details.

Santhosh, I think that what @pkra wrote three comments above explains the problem best.

davidcarlisle · 2015-03-03T12:25:02Z

On 3 March 2015 at 12:05, Khaled Hosny notifications@github.com wrote:

> I'm not really sure if I understand what the discussion is about, but if the idea is to identify what sequence of characters constitute a single unit, then Unicode Grapheme clustering (http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) should provide the needed information..

Yes, but I suppose the question is how far it makes sense for a JavaScript library to do that by hand if the underlying platform doesn't make the Unicode properties available. And if it's emulating TeX syntax, how far would TeX go? You know as much about the TeX support as anyone. How far would it be reasonable in XeTeX to have such a cluster do anything sensible in math without escaping to text with \text{..} or some such command, given that you can't assign a \mathclass to such a cluster?

hartman · 2015-03-04T07:30:19Z

I found a CoffeeScript implementation for graphemes.
https://github.com/devongovett/grapheme-breaker

Might be useful.

pkra · 2015-03-04T10:50:03Z

Thanks for all the useful comments. To summarize,

- xetex/luatex do not handle input the way requested in this issue, i.e., without extra markup such as \text
- it's not clear (to me at least) if there are plans to handle it this way
- a solution could start with the simple approach David C outlined or potentially build on grapheme-breaker (thanks @hartman!)

To add to that,

- On the other hand, a quick test with LaTeXML and pandoc indicates that they do handle such characters as requested here, i.e., not like xetex/luatex.

So it seems to me that a solution can't be in the core TeX input but needs to be an extension. That's not a problem, of course, since it probably would have ended up an extension anyway.

It would be good to hear from the MediaWiki/WMF communities whether they actually want to deviate from the TeX engines here.

pkra · 2015-03-10T10:54:16Z

Again it would be good to get more feedback.

@ TeX folks: is handling such characters in math mode without extra markup the future direction of xetex/luatex/etc.?
@ MediaWiki / WMF folks: is non-standard TeX behavior actually desired by the relevant communities?

Without more feedback, I think we should punt on this / move it out of the 2.6 milestone.

khaledhosny · 2015-03-11T09:41:29Z

Let me make sure I understand the issue here: people want to do things like $x+y=<complex character>$ where <complex character> is possibly a multi-code point grapheme, and have <complex character> treated as a math identifier, right? If so, then I think that is a reasonable expectation, and if current Unicode TeX engines do not handle it correctly (they probably don’t), it is likely a bug or a missing feature, not something by design.

Or is it that people want to do things like $<complex text string>$ , where <complex text string> is a multi-character text string that possibly needs complex text layout, and get proper text layout (bidi, shaping etc.)? I don't think that is a reasonable expectation and some kind of markup is needed here to indicate that this is a regular text string that needs to be treated as such.

pkra · 2015-03-11T09:52:05Z

Thanks, @khaledhosny!

> [...] people want to do things like $x+y=<complex character>$ where <complex character> is possibly a multi-code point grapheme, and have <complex character> treated as a math identifier, right?

Yes, that's how I understand it as well. (It's a bit difficult to say since this is originally a request from the Wikipedia end).

> I think that is a reasonable expectation

Thanks!

> if current Unicode TeX engines do not handle it correctly (they probably don’t) it is likely a bug or a missing feature, not something by design.

Thanks for that, too. The "they probably don’t" part worries me slightly but if you and @davidcarlisle agree that it's the desired behavior in Unicode TeX engines, then that's enough for us, I think.

Still hoping the MediaWiki/WMF/Wikipedia side will chime in.

pkra · 2015-08-04T18:59:05Z

As per F2F, we're removing this from the v2.6 Milestone (i.e., the upcoming release).

It's not clear what the right approach is, in particular, in terms of compatibility with TeX/LaTeX (or rather XeTeX/LuaTeX). It's also not clear what the WMF and the Wikipedia community really want here.

To be clear, we're not closing this issue and we are still interested in figuring out how complex layout might work in the TeX input.

pkra · 2018-10-25T19:14:25Z

Blast from the future: there's a TC39 proposal, "Unicode segmentation", that allows (among other things) splitting strings by grapheme: https://github.com/tc39/proposal-intl-segmenter. The repository includes a link to a polyfill (and there's also a non-standard Chrome feature, apparently).

dpvc · 2018-10-25T19:30:00Z

Cool. Thanks, @pkra.

pkra · 2018-10-25T20:07:33Z

No problem. The polyfill is unfortunately useless -- it only covers English. But for those who want to try it out, the Chrome built-in might be useful.

pkra · 2022-03-09T09:20:06Z

Another blast from the future: Chrome and Safari have supported Intl.Segmenter for a while now https://caniuse.com/?search=Intl.Segmenter.

dpvc · 2022-03-09T13:59:26Z

The segmenter looks interesting, and something to keep in mind, though until it is available in Firefox, it probably won't be able to be used in MathJax.

One of the PRs for the font update includes changes (8eeeaa2e) to have multiple unknown characters placed in a single container rather than each one in a separate one, so that will help with this, though currently it still requires \text{} to group them into one MathML item (but you don't need to use mtextInheritFont any more).

Of course, this is really an input issue: multiple combining characters should be treated as a single unit by the TeX input jax, so that they are all placed in the same MathML token element initially, which is where the Segmenter would come in handy. It is on the list of things to do, and I haven't forgotten about it. Perhaps the font update is a good place for that.
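As a hedged illustration of how the Segmenter could feed token elements, the sketch below splits a string into grapheme clusters with Intl.Segmenter and wraps each cluster in one element. This is not MathJax's TeX input jax; graphemeClusters and toTokenElements are hypothetical helpers, and the choice of mi for every cluster is a simplification.

```javascript
// Split a string into grapheme clusters using the built-in segmenter.
function graphemeClusters(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// Hypothetical helper: one MathML token element per cluster.
// A real implementation would also escape "<" and "&" and choose between
// mi, mo, and mn; this sketch does neither.
function toTokenElements(text) {
  return graphemeClusters(text)
    .map((cluster) => `<mi>${cluster}</mi>`)
    .join("");
}

console.log(graphemeClusters("abc").length);     // 3
console.log(graphemeClusters("c\u0327").length); // 1 (base + combining cedilla)

// The cluster count for a virama sequence depends on the Unicode version
// implemented by the runtime, so it is printed here rather than assumed.
console.log(graphemeClusters("\u0D24\u0D4D\u0D30").length);
```

Because the virama behaviour depends on the runtime's Unicode data, results for Indic conjuncts may differ between browsers and Node versions.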

pkra · 2022-03-09T14:42:21Z

> The segmenter looks interesting, and something to keep in mind, though until it is available in Firefox, it probably won't be able to be used in MathJax.

It could be an (opt-in) progressive enhancement. For NodeJS it would already be useful now (not that I need it, I just wanted to point that out).
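A minimal sketch of what such an opt-in progressive enhancement could look like, assuming a hypothetical option name useSegmenter and a hypothetical function splitForTokens. When the option is off or Intl.Segmenter is missing, it falls back to the per-code-point split.

```javascript
// Hypothetical splitter: grapheme clusters when requested and available,
// otherwise one entry per code point.
function splitForTokens(text, { useSegmenter = false } = {}) {
  const available =
    typeof Intl !== "undefined" && typeof Intl.Segmenter === "function";
  if (useSegmenter && available) {
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    return Array.from(segmenter.segment(text), (s) => s.segment);
  }
  return Array.from(text); // fallback: split by code point
}

// Default behaviour is unchanged: base letter and combining mark are separate.
console.log(splitForTokens("c\u0327").length); // 2

// Opting in groups them where the runtime supports Intl.Segmenter.
console.log(splitForTokens("c\u0327", { useSegmenter: true }).length);
```

Keeping the default off means pages on runtimes without the API behave exactly as before.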

pkra · 2024-05-02T10:42:04Z

Note from the future: Intl.Segmenter is now widely supported: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter/Segmenter

fred-wang mentioned this issue Oct 17, 2013

The MathJax class forces ltr direction on messages #627

Closed

dpvc modified the milestones: A future release, Bugfix Version Apr 10, 2014

dpvc mentioned this issue Nov 12, 2014

Unicode fallback and combined characters #952

Closed

pkra modified the milestones: A future release, The next release Feb 26, 2015

pkra changed the title ~~MathJax does not support Complex text layout.~~ TeX input and complex text layout [was: MathJax does not support Complex text layout.] Mar 4, 2015

pkra mentioned this issue Mar 4, 2015

Top of Non-English Characters are clipped #168

Closed

pkra added this to the A future release milestone Aug 4, 2015

pkra removed this from the MathJax v2.6 milestone Aug 4, 2015

pkra mentioned this issue Aug 21, 2015

mediawiki texvc: Commands printed with backslash #1236

Closed

pkra mentioned this issue Nov 23, 2015

Arabic letters doesn't appear connected when using whole Arabic words #1307

Closed

pkra changed the title ~~TeX input and complex text layout [was: MathJax does not support Complex text layout.]~~ Complex text layout, in particular with TeX input [was: MathJax does not support Complex text layout.] Nov 23, 2015

pkra mentioned this issue Nov 23, 2015

[Meta] RTL / BiDi support #1311

Closed

5 tasks

dpvc removed the Ready for Development label Apr 11, 2016

pkra mentioned this issue Jul 11, 2016

SVG output and vnsub #1304

Closed

rtibbles mentioned this issue Jun 29, 2017

The letters run over the images of the question in KA Hi Channel. learningequality/kolibri#1767

Closed
