Complex text layout, in particular with TeX input [was: MathJax does not support Complex text layout.] #474
Comments
Note that if you set |
PS, I saw the report on the Wikimedia bugzilla and was planning to add it to the list of things to fix. Thanks for starting the issue here to track that. |
Thanks for the mtextFontInherit tip. I was going to enable that anyways, but this is one more reason to do that. |
Some support for RTL was added in v2.3, but the issue of multiple-character sequences being treated as a unit remains. For Ideally, MathJax would put each sequence that forms one group into a single I'm really not familiar enough with the languages that use these features to know if what I'm trying out would be sufficient or not. I'm wondering if it is possible to get some examples from a variety of languages that show the range of situations that need to be accommodated. One approach might be to put the data needed for each language's script into an individual extension that gets loaded for those pages that need it (either explicitly in the MathJax configuration, or via |
Perhaps @amire80 of our WMF language engineering is able to help out a bit here... |
I'm right here :) How can I help? Testing? - Gladly, just tell me what to test exactly. Examples of how non-Latin scripts work in formulas? - It's not used in Hebrew textbooks, but it is used in textbooks in Arabic and Persian. Maybe @ebraminio can chime in here. Anything else? |
Thanks for stopping by @amire80 :-)
I'm hoping we can improve handling of combined characters in non-Latin scripts. This has come up on WMF bugzilla/phabricator repeatedly. To quote Davide from #474 (comment) :
So our question would be: does anyone have expertise they can share with us? @hartman was kind enough to point to you ;-) (Perhaps we should split this out into a separate issue.) |
The (very) basic idea of virama is that the sequence of consonant + virama + consonant has three Unicode characters, which appear as occupying the space of one glyph (but it can get far more complicated). More generally, I'd love to understand MathJax's current situation. What should I do to test the current rendering? Install my own instance? Or is there an online instance where a current version can be tested? |
Right. Combined characters are common enough in mathematical layout so we understand the situation in general.
That's our problem. We lack the specifics for most natural language, non-Latin scripts.
You can do this on MediaWiki (using the MathML/SVG mode of the math extension), in the browser (this sample or this codepen) or use a local copy of MathJax -- whichever you like. A basic example: <math xmlns="http://www.w3.org/1998/Math/MathML">
<mrow class="MJX-TeXAtom-ORD">
<mo>ത</mo>
</mrow>
<mrow class="MJX-TeXAtom-ORD">
<mo>്</mo>
</mrow>
<mrow class="MJX-TeXAtom-ORD">
<mo>ര</mo>
</mrow>
</math> Which the MathJax output will in turn split across three span's (in the HTML outputs) or three g's (in the SVG output) -- and of course this breaks the rendering of the combined character. (I just noticed that Firefox sometimes combines the spans in the HTML outputs e.g., So for us the problem is: is there a concise set of data (or some efficient heuristic) that we could use to identify all relevant situations where we need to re-combine into one mi/mo element in the MathML? Once we have that, the rendering will work as well. |
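To make the splitting concrete, here is a minimal JavaScript sketch of the behavior described above. It is not MathJax's actual code: `tokensPerCodePoint` is a hypothetical helper that wraps every code point in its own `<mo>`, and the sample string assumes the consonant + virama + consonant sequence U+0D24, U+0D4D, U+0D30.

```javascript
// Hypothetical helper, for illustration only: one <mo> per Unicode code point.
function tokensPerCodePoint(text) {
  // Array.from iterates by code point, so surrogate pairs stay intact,
  // but combining sequences are still split apart.
  return Array.from(text).map((ch) => `<mo>${ch}</mo>`);
}

// Assumed sample: Malayalam TA (U+0D24) + virama (U+0D4D) + RA (U+0D30).
const sample = "\u0D24\u0D4D\u0D30";
const tokens = tokensPerCodePoint(sample);

console.log(tokens.length); // 3
console.log(tokens.join(""));
```

Each of the three tokens ends up in its own output element, which is why the combined form is lost in rendering.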
Sorry for the long comment, bringing a bit of off site discussion back to the issue tracker. How feasible/expensive would it be to make the Unicode UCD database It's probably also worth noting that tex, even unicode tex like xetex In classic tex it is not an issue as fonts can only have 256 characters Support in unicode tex variants such as xetex and luatex seems a bit variable. In text, xetex The following latex document is using kartika in text and latin modern math in math, you will note that The image shows xetex at the top and luatex at the bottom. So while not requiring something like \text{..} or \mbox{...} around such character strings would be desirable, it would put your unicode support a long way ahead of what TeX can currently achieve
|
I'm not really sure if I understand what the discussion is about, but if the idea is to identify what sequence of characters constitute a single unit, then Unicode grapheme clustering should provide the needed information. |
Yes - what @khaledhosny says sounds like the right thing to me, although I'm not very experienced with it. Maybe @santhoshtr can contribute more details. Santhosh, I think that what @pkra wrote three comments above explains the problem best. |
On 3 March 2015 at 12:05, Khaled Hosny notifications@github.com wrote:
Yes but I suppose the question is how far it makes sense for a javascript |
I found a CoffeeScript implementation for graphemes. Might be useful. |
Thanks for all the useful comments. To summarize,
To add to that,
So it seems to me that a solution can't be in the core TeX input but needs to be an extension. That's not a problem, of course, since it probably would have ended up an extension anyway. It would be good to hear from MediaWiki/WMF communities if they actually want to delineate from the TeX-engines here. |
Again it would be good to get more feedback.
Without more feedback, I think we should punt on this / move it out of the 2.6 milestone. |
Let me understand the issue here, people want to do things like Or is it that people want to do things like |
Thanks, @khaledhosny!
Yes, that's how I understand it as well. (It's a bit difficult to say since this is originally a request from the Wikipedia end).
Thanks!
Thanks for that, too. The "they probably don’t" part worries me slightly but if you and @davidcarlisle agree that it's the desired behavior in Unicode TeX engines, then that's enough for us, I think. Still hoping the MediaWiki/WMF/Wikipedia side will chime in. |
As per F2F, we're removing this from the v2.6 Milestone (i.e., the upcoming release). It's not clear what the right approach is, in particular, in terms of compatibility with TeX/LaTeX (or rather XeTeX/LuaTeX). It's also not clear what the WMF and the Wikipedia community really want here. To be clear, we're not closing this issue and we are still interested in figuring out how complex layout might work in the TeX input. |
Blast from the future: there's a TC39 proposal, "Unicode segmentation", that would allow (among other things) splitting strings by grapheme: https://github.com/tc39/proposal-intl-segmenter. The repository includes a link to a polyfill (and there's also a non-standard Chrome feature, apparently). |
Cool. Thanks, @pkra. |
No problem. The polyfill is unfortunately useless -- it only covers English. But for those who want to try it out, the Chrome built-in might be useful. |
Another blast from the future: Chrome and Safari have supported Intl.Segmenter for a while now https://caniuse.com/?search=Intl.Segmenter. |
The segmenter looks interesting, and is something to keep in mind, though until it is available in Firefox, it probably can't be used in MathJax. One of the PRs for the font update includes changes (8eeeaa2e) to have multiple unknown characters placed in a single container rather than each one in a separate one, so that will help with this, though currently it still requires Of course, this is really an input issue: multiple combining characters should be treated as a single unit by the TeX input jax so that they are all placed in the same MathML token element initially, which is where the Segmenter would come in handy. It is on the list of things to do, and I haven't forgotten about it. Perhaps the font update is a good place for that. |
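As a rough illustration of where `Intl.Segmenter` could fit, the sketch below groups a string into grapheme clusters and emits one token element per cluster. It is only a sketch under assumptions: `graphemeClusters` and `clusterTokens` are hypothetical names, and nothing here reflects how the TeX input jax is actually structured.

```javascript
// Sketch only: group text into grapheme clusters with Intl.Segmenter.
function graphemeClusters(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  // Each item yielded by segment() has a `segment` string property.
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// Hypothetical helper: one MathML token element per cluster.
function clusterTokens(text, tag = "mi") {
  return graphemeClusters(text).map((c) => `<${tag}>${c}</${tag}>`);
}

// "e" + U+0301 (combining acute accent) + "x": three code points, two clusters.
const clusters = graphemeClusters("e\u0301x");
console.log(clusters.length); // 2
console.log(clusterTokens("e\u0301x").join(""));
```

Whether a consonant + virama + consonant sequence comes out as a single cluster depends on the Unicode version the JavaScript engine implements, so results for Indic scripts should be checked against real examples rather than assumed.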
It could be an (opt-in) progressive enhancement. For NodeJS it would already be useful now (not that I need it, I just wanted to point that out). |
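One way to read "opt-in progressive enhancement" is a feature check with a fallback, sketched below. The function name `splitForTokens` and its `useSegmenter` flag are hypothetical and not an existing MathJax option. The fallback splits per code point, matching the behavior described at the start of this issue.

```javascript
// Sketch of an opt-in split with fallback; names are hypothetical.
function splitForTokens(text, useSegmenter = true) {
  const hasSegmenter =
    typeof Intl !== "undefined" && typeof Intl.Segmenter === "function";
  if (useSegmenter && hasSegmenter) {
    const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    return Array.from(seg.segment(text), (s) => s.segment);
  }
  // Fallback: one entry per code point when the API is missing or not opted in.
  return Array.from(text);
}

// Opted out: "e" and U+0301 are returned as separate entries.
console.log(splitForTokens("e\u0301", false).length); // 2
```

With this shape, environments without `Intl.Segmenter` keep the existing behavior, while NodeJS and browsers that support it get clustered output.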
Note from the future: |
Because MathJax looks at individual code points, it has trouble with scripts that require bidirectionality, contextual shaping, etc. This is visible whenever you try to use Hebrew or Arabic, for instance.
It would be good if MathJax could identify these ranges and keep them as blocks instead of dividing them into individual characters, at the very least in \text mode.
http://en.wikipedia.org/wiki/Complex_text_layout