Add support for rendering CJK in a vertical writing mode along line-placed features #3438

lucaswoj · 2016-10-21T19:55:35Z

replaces #3402 (HUGE ❤️ to @nickidlugash & @xrwang)
fixes #1246
see also mapbox/mapbox-gl-native#1682

cc @nickidlugash @xrwang @ansis @1ec5 @willwhite @jakepruitt @jfirebaugh @mourner

Requirements

GL JS must render CJK text in a vertical writing mode along vertical lines, as is conventional in cartographic design

Specifications

We will enable or disable the vertical writing mode by script detection.
We will use character ranges for script detection (data).
If a single label has mixed scripts, we will use the vertical writing mode.

Open Questions

How can we test our script detection code?
How can we make our script detection code more readable?
Can we improve upon our character centerline alignment?
Can we add proper vertical advance to our glyph metadata in the near future?
Can we pack the symbol structs more efficiently?

Launch Checklist

lucaswoj · 2016-10-21T20:28:19Z

js/data/bucket/symbol_bucket.js

+                const requiresVerticalWritingMode = (
+                    // eslint-disable-next-line
+                    feature.text.match(/[\u1100-\u11FF\uAC00-\uD7A3\u3131-\u318E\u4E00-\u9FCC\u3400-\u4DB5\u3000-\u303F\uFF01-\uFFEE\u3041-\u309F\u30A0-\u30FF\u31F0-\u31FF\uA000-\uA4C6\u1800-\u18AA]/) ||
+                    feature.text.match(/(\uD840\uDC00-\uFFFF)|(\uD841-\uD872)|(\uD873\u0000-\uDEAF)/)


@1ec5 Any ideas on how we can construct an exhaustive test suite for this regex?

We could have one test case for each start or end character in each range, plus one for the character before the start character and one for the character after the end character.
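A minimal sketch of that boundary-testing idea, assuming a reduced set of three ranges taken from the regex above. The `ranges` table, the stand-in `verticalRegExp`, and the `boundaryCases` helper are illustrative names, not code from this PR. Expectations come from the range table rather than from the regex, so a typo in the regex shows up as a failure.

```javascript
'use strict';

// Hypothetical subset of the ranges under review, as [start, end] code points.
const ranges = [
    [0x4E00, 0x9FCC], // CJK Unified Ideographs (as listed in the regex above)
    [0x3041, 0x309F], // Hiragana
    [0x30A0, 0x30FF]  // Katakana
];

// Stand-in for the regex under test.
const verticalRegExp = /[\u4E00-\u9FCC\u3041-\u309F\u30A0-\u30FF]/;

// For each range, emit the start and end characters plus the neighbours just
// outside. A neighbour is still expected to match if another range covers it
// (for example U+30A0 follows the Hiragana range but starts Katakana).
function boundaryCases(ranges) {
    const inAnyRange = (cp) => ranges.some(([start, end]) => cp >= start && cp <= end);
    const cases = [];
    for (const [start, end] of ranges) {
        for (const cp of [start - 1, start, end, end + 1]) {
            cases.push({ char: String.fromCodePoint(cp), codePoint: cp, expected: inAnyRange(cp) });
        }
    }
    return cases;
}

const failures = boundaryCases(ranges).filter(
    ({ char, expected }) => (char.search(verticalRegExp) !== -1) !== expected
);
console.log(failures.length === 0 ? 'all boundary cases pass' : failures);
```

The same table could be extended to the supplementary-plane range, since `String.fromCodePoint` produces the surrogate pair for code points above U+FFFF.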

1ec5 · 2016-10-21T20:47:53Z

js/symbol/shaping.js

@@ -1,41 +1,48 @@
 'use strict';

+const WritingMode = {
+    horizantal: 1,


s/horizantal/horizontal/

1ec5 · 2016-10-21T20:50:05Z

js/data/bucket/symbol_bucket.js

-                        lineHeight, horizontalAlign, verticalAlign, justify, spacing, textOffset);
+                const requiresVerticalWritingMode = (
+                    feature.text.match(/[\u1100-\u11FF\uAC00-\uD7A3\u3131-\u318E\u4E00-\u9FCC\u3400-\u4DB5\u3000-\u303F\uFF01-\uFFEE\u3041-\u309F\u30A0-\u30FF\u31F0-\u31FF\uA000-\uA4C6\u1800-\u18AA]/) ||
+                    feature.text.match(/(\uD840[\uDC00-\uFFFF])|[\uD841-\uD872]|(\uD873[\u0000-\uDEAF])/)


For possibly better performance, combine these two regular expressions with |. Also, use search() instead of match(), since you don’t really care about all the individual CJK characters. Finally, please indicate the unescaped characters (perhaps as a table mapping Unicode codepoints to literals) in a comment so we don’t have to look up each individual character when making changes.

Also, I would make them constants outside of the function scope/loops, because constructing a regex takes a bit of time.

Done combining the regexes and switching to search. Going to work on readability & perf now.

@mourner, per https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions:

Regular expression literals provide compilation of the regular expression when the script is loaded. When the regular expression will remain constant, use this for better performance.

So even if we were to keep this regular expression literal here, it would only be compiled once.

And of course I only now remember that match doesn’t look for all matches unless you give the regex a g flag.
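To make the outcome of this thread concrete, here is a sketch with the two expressions combined into one module-level literal and tested with `search()`. The character class is abbreviated to three BMP ranges for readability, and the function name `requiresVerticalWritingMode` is borrowed from the variable in the diff; treat the whole block as an illustration, not the code that landed.

```javascript
'use strict';

// Compiled once when the module loads. BMP ranges abbreviated; the three
// surrogate alternatives are the ones from the diff above.
const verticalRegExp = /[\u4E00-\u9FCC\u3041-\u309F\u30A0-\u30FF]|\uD840[\uDC00-\uFFFF]|[\uD841-\uD872]|\uD873[\u0000-\uDEAF]/;

function requiresVerticalWritingMode(text) {
    // search() returns the index of the first match, or -1 if there is none.
    return text.search(verticalRegExp) !== -1;
}

console.log(requiresVerticalWritingMode('Main Street'));   // false
console.log(requiresVerticalWritingMode('中山路'));          // true
console.log(requiresVerticalWritingMode('Route 1 中山路'));  // true: mixed scripts still go vertical

// Without the g flag, match() also stops at the first match.
console.log('中山路'.match(verticalRegExp).length);                             // 1
console.log('中山路'.match(new RegExp(verticalRegExp.source, 'g')).length);     // 3
```

Either method answers the yes/no question at similar cost here; `search()` mainly states the intent more directly because it returns an index rather than a match array.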

1ec5 · 2016-10-21T21:44:03Z

js/util/script_detection.js

+    '[一-鿌]',
+    '[㐀-䶵]',
+    // eslint-disable-next-line no-irregular-whitespace
+    '[　-〿]',


This matches everything from U+0020 (space) to U+303F, including the entire ASCII alphabet. I made a typo in #3402 (comment): the space should’ve been the ideographic space \u3000 instead.

@1ec5 Are you sure? I'm seeing an ideographic space here. Would you prefer the \u version in this case?

Ah, you’re right. I was seeing a quirk in Gecko’s text selection behavior that causes the Character Identifier extension to conflate various spaces. The escaped sequence would be less confusing in this case.
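The confusion above is easy to reproduce, because the two spaces look alike in most editors. A short sketch of why the escaped form is safer: the `tooBroad` pattern shows what the range would mean if it started at an ASCII space, and `intended` is the escaped equivalent of the range in the diff. Both variable names are made up for this illustration.

```javascript
'use strict';

// What the range would mean if it started with an ASCII space (U+0020).
const tooBroad = /[\u0020-\u303F]/;
// The intended range: ideographic space (U+3000) through U+303F.
const intended = /[\u3000-\u303F]/;

console.log(tooBroad.test('Main Street'));  // true: every ASCII letter falls inside the range
console.log(intended.test('Main Street'));  // false
console.log(intended.test('\u3000'));       // true: the ideographic space itself

// When in doubt about a literal, inspect its code point rather than its appearance.
console.log('\u3000'.codePointAt(0).toString(16)); // "3000"
console.log(' '.codePointAt(0).toString(16));      // "20"
```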

1ec5 · 2016-10-21T21:48:58Z

js/util/script_detection.js

+    // eslint-disable-next-line no-irregular-whitespace
+    '[　-〿]',
+    '\uD840[\uDC00-\uFFFF]|[\uD841-\uD872]|\uD873[\u0000-\uDEAF]', // '[𠀀-𬺯]'
+    '[！-￮]',


Per #3402 (comment), we should restrict this to fullwidth forms only ([！-｠￠-￦]) since halfwidth forms would look quite odd unless doubled up.

/cc @friedbunny
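For reference, a sketch of what the narrower class changes in practice. It assumes the literal `[！-｠￠-￦]` corresponds to U+FF01 to U+FF60 plus U+FFE0 to U+FFE6, which is my reading of the suggestion; the variable names are illustrative.

```javascript
'use strict';

// The range currently in the diff: U+FF01 through U+FFEE, which includes halfwidth forms.
const wholeRange = /[\uFF01-\uFFEE]/;
// The suggested restriction: fullwidth ASCII variants and fullwidth symbols only.
const fullwidthOnly = /[\uFF01-\uFF60\uFFE0-\uFFE6]/;

const samples = {
    fullwidthLatinA: '\uFF21',   // fullwidth "A"
    halfwidthKatakanaA: '\uFF71', // halfwidth katakana "a"
    fullwidthYen: '\uFFE5'        // fullwidth yen sign
};

for (const [name, char] of Object.entries(samples)) {
    console.log(name, wholeRange.test(char), fullwidthOnly.test(char));
}
// fullwidthLatinA true true
// halfwidthKatakanaA true false
// fullwidthYen true true
```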

1ec5 · 2016-10-21T22:53:19Z

js/util/script_detection.js

+    '[ぁ-ゟ]',
+    '[゠-ヿ]',
+    '[ㇰ-ㇿ]',
+    '[ꀀ-꓆]',


I was mistaken in including Yi script here. According to this Wikipedia article, the Classical Yi syllabary was written vertically, but the Modern Yi syllabary has always been written horizontally. So we should remove this line.

1ec5 · 2016-10-22T18:40:23Z

js/util/script_detection.js

@@ -0,0 +1,22 @@
+'use strict';
+
+const verticalRegExp = new RegExp([


If #3420 lands first, verticalRegExp and ideographicBreakingRegExp will have some overlap, with these differences:

Hangul and Mongolian are vertical but don’t get ideographic breaking.

Halfwidth forms and Yi are horizontal but do get ideographic breaking.
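One way to keep the two expressions from drifting apart would be to build both from a single table of named ranges. This is only a sketch: the `scripts` table and `charClass` helper are hypothetical, the exact contents of `ideographicBreakingRegExp` in #3420 are assumed rather than known, and the script membership reflects this comment as written (that is, before Mongolian and Yi are reconsidered elsewhere in this review).

```javascript
'use strict';

// Named ranges, taken from the regex in this PR (supplementary planes omitted).
const scripts = {
    hangul: '\u1100-\u11FF\u3131-\u318E\uAC00-\uD7A3',
    mongolian: '\u1800-\u18AA',
    cjk: '\u3400-\u4DB5\u4E00-\u9FCC',
    kana: '\u3041-\u309F\u30A0-\u30FF\u31F0-\u31FF',
    fullwidthForms: '\uFF01-\uFF60\uFFE0-\uFFE6',
    halfwidthForms: '\uFF61-\uFFDC\uFFE8-\uFFEE',
    yi: '\uA000-\uA4C6'
};

// Build a single character class from the named ranges.
const charClass = (...names) => new RegExp('[' + names.map((name) => scripts[name]).join('') + ']');

// Assumption: these scripts are common to both expressions.
const shared = ['cjk', 'kana', 'fullwidthForms'];

const verticalRegExp = charClass(...shared, 'hangul', 'mongolian');
const ideographicBreakingRegExp = charClass(...shared, 'halfwidthForms', 'yi');

console.log(verticalRegExp.test('\uD55C'), ideographicBreakingRegExp.test('\uD55C')); // Hangul: true false
console.log(verticalRegExp.test('\uA000'), ideographicBreakingRegExp.test('\uA000')); // Yi: false true
console.log(verticalRegExp.test('\u4E2D'), ideographicBreakingRegExp.test('\u4E2D')); // CJK: true true
```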

1ec5 · 2016-10-22T20:51:11Z

I’m not sure how prevalent punctuation is in road names, but various Chinese and Japanese dashes and brackets are rotated 90° or 180° when laid out vertically. Would it be possible to replace the following characters for vertical text?

Horizontal	Vertical
。	︒
—	︱
–	︲
_	︳
（	︵
）	︶
｛	︷
｝	︸
〔	︹
〕	︺
〘	︹ (?)
〙	︺ (?)
【	︻
】	︼
《	︽
》	︾
〈	︿
〉	﹀
「	﹁
」	﹂
『	﹃
』	﹄
｢	﹁
｣	﹂
［	﹇
］	﹈
“	﹁
”	﹂
‘	﹃
’	﹄
…	⋮
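If this substitution were done in JavaScript before shaping, it could be as small as a lookup table. The sketch below uses a subset of the rows above (the rows marked with a question mark are left out), and both `verticalPunctuation` and `toVerticalPunctuation` are hypothetical names. It also assumes the font actually contains glyphs for the vertical presentation forms, which this thread has not established.

```javascript
'use strict';

// Horizontal form -> vertical presentation form, copied from the table above.
const verticalPunctuation = new Map([
    ['。', '︒'],
    ['（', '︵'],
    ['）', '︶'],
    ['「', '﹁'],
    ['」', '﹂'],
    ['《', '︽'],
    ['》', '︾']
]);

// Replace each character that has a vertical form; leave everything else untouched.
// Array.from iterates by code point, so surrogate pairs are not split.
function toVerticalPunctuation(text) {
    return Array.from(text, (char) => verticalPunctuation.get(char) || char).join('');
}

console.log(toVerticalPunctuation('中山路（东段）')); // 中山路︵东段︶
```

The substitution would only apply to labels already chosen for the vertical writing mode, since the replacement characters are meant for vertical layout.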

1ec5 · 2016-10-22T21:37:45Z

js/util/script_detection.js

+    '[゠-ヿ]',
+    '[ㇰ-ㇿ]',
+    '[ꀀ-꓆]',
+    '[᠀-ᢪ]'


We should also remove Mongolian from this regular expression. Unlike the other scripts, Mongolian must be written vertically (never horizontally) and its glyphs are always rotated when set vertically. So apart from the fact that we don’t support complex text shaping yet (mapbox/DEPRECATED-mapbox-gl#4), Mongolian is already laid out correctly along lines from 3:00 to 6:00 and from 9:00 to 12:00. With this PR as is, Mongolian would only be laid out correctly from 3:00 to 4:30 and from 9:00 to 10:30.

If we want to address Mongolian, I think we should do that as tail work. But for that matter, I’m having a difficult time finding any area in OpenStreetMap or the Mapbox Streets source where roads are labeled in Mongolian. So best to leave Mongolian alone.

@1ec5 makes sense. Only Inner Mongolia uses the traditional (vertical) Mongolian script, and all of those road names in Inner Mongolia are transliterated into Chinese characters or are Chinese names. Mongolia the country uses Cyrillic (consistently LTR and horizontal).

Right, in OSM, some places in Inner Mongolia are tagged bilingually in Mongolian (with Mongolian script) in the name tag; however, Mapbox Streets seems to exclude the Mongolian text from its name field, and I’m not sure why. Regardless, I can’t find any streets tagged that way, so it’s low-priority for the time being.

kkaefer · 2016-10-24T20:08:24Z

wow

per #3438 (comment)

lucaswoj · 2016-10-31T21:01:05Z

benchmark	master `8127e17`	vertical `893d46c`
map-load	221 ms	135 ms
style-load	151 ms	159 ms
buffer	1,326 ms	1,122 ms
fps	60 fps	60 fps
frame-duration	4.4 ms, 0% > 16ms	4.9 ms, 0% > 16ms
query-point	1.07 ms	1.09 ms
query-box	65.20 ms	68.88 ms
geojson-setdata-small	10 ms	8 ms
geojson-setdata-large	308 ms	261 ms

lucaswoj · 2016-10-31T21:11:24Z

This is ready for 👀

1ec5 · 2016-10-31T21:17:05Z