-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for rendering CJK in a vertical writing mode along line-placed features #3438
Conversation
const requiresVerticalWritingMode = ( | ||
// eslint-disable-next-line | ||
feature.text.match(/[\u1100-\u11FF\uAC00-\uD7A3\u3131-\u318E\u4E00-\u9FCC\u3400-\u4DB5\u3000-\u303F\uFF01-\uFFEE\u3041-\u309F\u30A0-\u30FF\u31F0-\u31FF\uA000-\uA4C6\u1800-\u18AA]/) || | ||
feature.text.match(/(\uD840\uDC00-\uFFFF)|(\uD841-\uD872)|(\uD873\u0000-\uDEAF)/) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@1ec5 Any ideas on how we can construct an exhaustive test suite for this regex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could have one test case for each start or end character in each range, plus one for the character before the start character and one for the character after the end character.
@@ -1,41 +1,48 @@ | |||
'use strict'; | |||
|
|||
const WritingMode = { | |||
horizantal: 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/horizantal/horizontal/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
lineHeight, horizontalAlign, verticalAlign, justify, spacing, textOffset); | ||
const requiresVerticalWritingMode = ( | ||
feature.text.match(/[\u1100-\u11FF\uAC00-\uD7A3\u3131-\u318E\u4E00-\u9FCC\u3400-\u4DB5\u3000-\u303F\uFF01-\uFFEE\u3041-\u309F\u30A0-\u30FF\u31F0-\u31FF\uA000-\uA4C6\u1800-\u18AA]/) || | ||
feature.text.match(/(\uD840[\uDC00-\uFFFF])|[\uD841-\uD872]|(\uD873[\u0000-\uDEAF])/) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For possibly better performance, combine these two regular expressions with |
. Also, use search()
instead of match()
, since you don’t really care about all the individual CJK characters. Finally, please indicate the unescaped characters (perhaps as a table mapping Unicode codepoints to literals) in a comment so we don’t have to look up each individual character when making changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I would make them constants outside of the function scope/loops, because constructing a regex takes a bit of time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done combining the regexes and switching to search
. Going to work on readability & perf now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mourner, per https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions:
Regular expression literals provide compilation of the regular expression when the script is loaded. When the regular expression will remain constant, use this for better performance.
So even if we were to keep this regular expression literal here, it would only be compiled once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And of course I only now remember that match
doesn’t look for all matches unless you give the regex a g
flag.
'[一-鿌]', | ||
'[㐀-䶵]', | ||
// eslint-disable-next-line no-irregular-whitespace | ||
'[ -〿]', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This matches everything from U+0020 (space) to U+303F, including the entire ASCII alphabet. I made a typo in #3402 (comment): the space should’ve been the ideographic space \u3000
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@1ec5 Are you sure? I'm seeing an ideographic space here. Would you prefer the \u
version in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you’re right. I was seeing a quirk in Gecko’s text selection behavior that causes the Character Identifier extension to conflate various spaces. The escaped sequence would be less confusing in this case.
// eslint-disable-next-line no-irregular-whitespace | ||
'[ -〿]', | ||
'\uD840[\uDC00-\uFFFF]|[\uD841-\uD872]|\uD873[\u0000-\uDEAF]', // '[𠀀-]' | ||
'[!-○]', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per #3402 (comment), we should restrict this to fullwidth forms only ([!-⦆¢-₩]
) since halfwidth forms would look quite odd unless doubled up.
/cc @friedbunny
'[ぁ-ゟ]', | ||
'[゠-ヿ]', | ||
'[ㇰ-ㇿ]', | ||
'[ꀀ-꓆]', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was mistaken in including Yi script here. According to this Wikipedia article, the Classical Yi syllabary was written vertically, but the Modern Yi syllabary has always been written horizontally. So we should remove this line.
@@ -0,0 +1,22 @@ | |||
'use strict'; | |||
|
|||
const verticalRegExp = new RegExp([ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If #3420 lands first, verticalRegExp
and ideographicBreakingRegExp
will have some overlap. To wit:
- Hangul and Mongolian are vertical but don’t get ideographic breaking.
- Halfwidth forms and Yi are horizontal but do get ideographic breaking.
I’m not sure how prevalent punctuation is in road names, but various Chinese and Japanese dashes and brackets are rotated 90° or 180° when laid out vertically. Would it be possible to replace the following characters for vertical text?
|
'[゠-ヿ]', | ||
'[ㇰ-ㇿ]', | ||
'[ꀀ-꓆]', | ||
'[᠀-ᢪ]' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also remove Mongolian from this regular expression. Unlike the other scripts, Mongolian must be written vertically (never horizontally) and its glyphs are always rotated when set vertically. So apart from the fact that we don’t support complex text shaping yet (mapbox/DEPRECATED-mapbox-gl#4), Mongolian is already laid out correctly along lines from 3:00 to 6:00 and from 9:00 to 12:00. With this PR as is, Mongolian would only be laid out correctly from 3:00 to 4:30 and from 9:00 to 10:30.
If we want to address Mongolian, I think we should do that as tail work. But for that matter, I’m having a difficult time finding any area in OpenStreetMap or the Mapbox Streets source where roads are labeled in Mongolian. So best to leave Mongolian alone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@1ec5 makes sense. Only Inner Mongolia uses the traditional (vertical) Mongolian script, and all of those road names in Inner Mongolia are transliterated into Chinese characters or are Chinese names. Mongolia the country uses Cyrillic (consistently LTR and horizontal).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, in OSM, some places in Inner Mongolia are tagged bilingually in Mongolian (with Mongolian script) in the name
tag; however, Mapbox Streets seems to exclude the Mongolian text from its name
field, and I’m not sure why. Regardless, I can’t find any streets tagged that way, so it’s low-priority for the time being.
wow |
42163bb
to
6b86db2
Compare
|
This is ready for 👀 |
// 'Buhid': (char) => char >= 0x1740 && char <= 0x175F, | ||
// 'Tagbanwa': (char) => char >= 0x1760 && char <= 0x177F, | ||
// 'Khmer': (char) => char >= 0x1780 && char <= 0x17FF, | ||
'Mongolian': (char) => char >= 0x1800 && char <= 0x18AF, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per #3438 (comment), we don’t want to rotate Mongolian vertically or give it ideographic breaking.
'Katakana': (char) => char >= 0x30A0 && char <= 0x30FF, | ||
// 'Bopomofo': (char) => char >= 0x3100 && char <= 0x312F, | ||
'Hangul Compatibility Jamo': (char) => char >= 0x3130 && char <= 0x318F, | ||
// 'Kanbun': (char) => char >= 0x3190 && char <= 0x319F, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I guess it wouldn’t hurt to make this vertical too.
// 'Cherokee Supplement': (char) => char >= 0xAB70 && char <= 0xABBF, | ||
// 'Meetei Mayek': (char) => char >= 0xABC0 && char <= 0xABFF, | ||
'Hangul Syllables': (char) => char >= 0xAC00 && char <= 0xD7AF, | ||
// 'Hangul Jamo Extended-B': (char) => char >= 0xD7B0 && char <= 0xD7FF, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might as well make this vertical too.
// 'Devanagari Extended': (char) => char >= 0xA8E0 && char <= 0xA8FF, | ||
// 'Kayah Li': (char) => char >= 0xA900 && char <= 0xA92F, | ||
// 'Rejang': (char) => char >= 0xA930 && char <= 0xA95F, | ||
// 'Hangul Jamo Extended-A': (char) => char >= 0xA960 && char <= 0xA97F, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might as well make this vertical too.
// 'Alphabetic Presentation Forms': (char) => char >= 0xFB00 && char <= 0xFB4F, | ||
// 'Arabic Presentation Forms-A': (char) => char >= 0xFB50 && char <= 0xFDFF, | ||
// 'Variation Selectors': (char) => char >= 0xFE00 && char <= 0xFE0F, | ||
// 'Vertical Forms': (char) => char >= 0xFE10 && char <= 0xFE1F, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this block vertical and ideographic. This block will also come in handy once we implement #3438 (comment).
'CJK Symbols and Punctuation': (char) => char >= 0x3000 && char <= 0x303F, | ||
'Hiragana': (char) => char >= 0x3040 && char <= 0x309F, | ||
'Katakana': (char) => char >= 0x30A0 && char <= 0x30FF, | ||
// 'Bopomofo': (char) => char >= 0x3100 && char <= 0x312F, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block doesn’t appear in OSM, but it could appear in custom data. It should be vertical and get ideographic breaking.
// 'Bopomofo': (char) => char >= 0x3100 && char <= 0x312F, | ||
'Hangul Compatibility Jamo': (char) => char >= 0x3130 && char <= 0x318F, | ||
// 'Kanbun': (char) => char >= 0x3190 && char <= 0x319F, | ||
// 'Bopomofo Extended': (char) => char >= 0x31A0 && char <= 0x31BF, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block doesn’t appear in OSM, but it could appear in custom data. It should be vertical and get ideographic breaking.
All requested changes implemented in 85abbdc 🚀 |
if (char >= 0x4E00 && char <= 0x9FFF) return true; | ||
function charAllowsVerticalWritingMode(char) { | ||
// Return early for characters outside all ideographic ranges. | ||
if (char < 0x2E80) return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will exclude the Hangul Jamo block. The cutoff should instead be U+1100.
(To be pedantic, not all of the vertical blocks are ideographic.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙇 fixed in 97ef972
→ #3505 |
Thanks for the 🍏 @1ec5! I'm going to hold off merging until @nickidlugash gets a chance to look at this branch from a carto/desgin perspective. |
@lucaswoj @1ec5 I'm still in the middle of reviewing this! Just wanted to note one thing now though: Vertical labels seem off-centered towards the left of the line. Will it be possible to have them be centered? I haven't had a chance to dig into the code to see where the issue is. Side note relevant to that screenshot: "/" character collides with all characters above it. This is a common character in one of our relevant data sources. I've seen conflicting information as to whether this is the correct character to use for Chinese in general and specifically for vertical orientation – solidus (U+002F), versus fullwidth solidus (U+FF0F). Either way, I think perhaps without a true vertical advance value this is not getting enough space. The design of this character is one that signficantly overshoots the cap height of the font. |
This is a good argument for making #3505 a release blocker. |
I can't think of a reason why this problem would only affect vertical labels. Would you mind double checking that this behaviour only affects vertical labels? If so, we should endeavour to fix.
I'll make #3505 a release blocker. It should be easy to implement & help mitigate the problem. |
d560a43
to
32413f3
Compare
I'm rather stuck on these remaining punctuation glitches. Any comments on the correctness / incorrectness on what's rendered or solutions @1ec5 @xrwang @nickidlugash @ansis? Any bugs are most likely in this code within For reference, here is our overall punctuation test suite at this time. Things are looking 🚀 overall. |
I think the underscore would be exceedingly rare in CJK line-placed labels.
According to this Wikipedia article, the current horizontally centered behavior of Does |
'‘': '﹃', | ||
'’': '﹄', | ||
'.': '・' | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add all the remaining fullwidth punctuation so that characters like ?
and !
don’t get rotated 90° when they’re sitting next to CJK. @nickidlugash, can you think of any exceptions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I can't think of any. They should probably all not get rotated. Are we using the fullwidth numerals and latin characters for single characters sitting next to CJK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'”': '﹂', | ||
'‘': '﹃', | ||
'’': '﹄', | ||
'.': '・' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@1ec5 I don't have a good sense of which punctuation should go into this table and which should not. Can you provide a table or algorithm? It may be easiest to just add a commit to the PR with the intended changes.
@nickidlugash We do not use fullwidth numerals and characters yet. Detecting when it is / isn't ok to use fullwidth characters is tricky. This can be done in a follow-up PR.
As best I can tell, this behaviour has not changed between |
This PR is rebased, squashed, and 🍏. I hope we can merge this very soon! Maintaining a large & long-running feature branch is a huge amount of work. |
I’ve added all the punctuation marks that belong in this PR. In addition to the punctuation marks in #3505, I mapped various ASCII characters to their fullwidth equivalents. However, I mapped the ASCII brackets and their fullwidth equivalents to their vertical analogues in the CJK Compatibility Forms, and I left out Currently, we replace a punctuation character with its fullwidth analogue if both adjacent characters are compatible with vertical writing. This heuristic is conservative enough that we could implement tate-chu-yoko (see #3506) simply by mapping all the ASCII numbers and letters to their fullwidth analogues in Since both those improvements would require changes to the test suite, I’ll take that as a sign that we should go ahead and merge and follow up with more heuristics in a separate PR. |
→ #3588
→ #3587 |
replaces #3402 (HUGE ❤️ to @nickidlugash & @xrwang)
fixes #1246
see also mapbox/mapbox-gl-native#1682
cc @nickidlugash @xrwang @ansis @1ec5 @willwhite @jakepruitt @jfirebaugh @mourner
Requirements
Specifications
Open Questions
Launch Checklist
_
characters in vertical writing mode