Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for rendering CJK in a vertical writing mode along line-placed features #3438

Merged
merged 4 commits into from
Nov 10, 2016

Conversation

lucaswoj
Copy link
Contributor

@lucaswoj lucaswoj commented Oct 21, 2016

replaces #3402 (HUGE ❤️ to @nickidlugash & @xrwang)
fixes #1246
see also mapbox/mapbox-gl-native#1682

actual

expected

cc @nickidlugash @xrwang @ansis @1ec5 @willwhite @jakepruitt @jfirebaugh @mourner

Requirements

  • GL JS must render CJK text in a vertical writing mode along vertical lines, as is conventional in cartographic design

vertical-label-rotation

Specifications

  • we will enable / disable the vertical writing mode by script detection
  • we will use character ranges for script detection (data)
  • if a single label has mixed scripts, we will use vertical writing mode

Open Questions

  • How can we test our script detection code?
  • How can we make our script detection code more readable?
  • Can we improve upon our character centerline alignment?
  • Can we add proper vertical advance to our glyph metadata in the near future?
  • Can we pack the symbol structs more efficiently?

Launch Checklist

const requiresVerticalWritingMode = (
// eslint-disable-next-line
feature.text.match(/[\u1100-\u11FF\uAC00-\uD7A3\u3131-\u318E\u4E00-\u9FCC\u3400-\u4DB5\u3000-\u303F\uFF01-\uFFEE\u3041-\u309F\u30A0-\u30FF\u31F0-\u31FF\uA000-\uA4C6\u1800-\u18AA]/) ||
feature.text.match(/(\uD840\uDC00-\uFFFF)|(\uD841-\uD872)|(\uD873\u0000-\uDEAF)/)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@1ec5 Any ideas on how we can construct an exhaustive test suite for this regex?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have one test case for each start or end character in each range, plus one for the character before the start character and one for the character after the end character.

@@ -1,41 +1,48 @@
'use strict';

const WritingMode = {
horizantal: 1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/horizantal/horizontal/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

lineHeight, horizontalAlign, verticalAlign, justify, spacing, textOffset);
const requiresVerticalWritingMode = (
feature.text.match(/[\u1100-\u11FF\uAC00-\uD7A3\u3131-\u318E\u4E00-\u9FCC\u3400-\u4DB5\u3000-\u303F\uFF01-\uFFEE\u3041-\u309F\u30A0-\u30FF\u31F0-\u31FF\uA000-\uA4C6\u1800-\u18AA]/) ||
feature.text.match(/(\uD840[\uDC00-\uFFFF])|[\uD841-\uD872]|(\uD873[\u0000-\uDEAF])/)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For possibly better performance, combine these two regular expressions with |. Also, use search() instead of match(), since you don’t really care about all the individual CJK characters. Finally, please indicate the unescaped characters (perhaps as a table mapping Unicode codepoints to literals) in a comment so we don’t have to look up each individual character when making changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I would make them constants outside of the function scope/loops, because constructing a regex takes a bit of time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done combining the regexes and switching to search. Going to work on readability & perf now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mourner, per https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions:

Regular expression literals provide compilation of the regular expression when the script is loaded. When the regular expression will remain constant, use this for better performance.

So even if we were to keep this regular expression literal here, it would only be compiled once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And of course I only now remember that match doesn’t look for all matches unless you give the regex a g flag.

'[一-鿌]',
'[㐀-䶵]',
// eslint-disable-next-line no-irregular-whitespace
'[ -〿]',
Copy link
Contributor

@1ec5 1ec5 Oct 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This matches everything from U+0020 (space) to U+303F, including the entire ASCII alphabet. I made a typo in #3402 (comment): the space should’ve been the ideographic space \u3000 instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@1ec5 Are you sure? I'm seeing an ideographic space here. Would you prefer the \u version in this case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you’re right. I was seeing a quirk in Gecko’s text selection behavior that causes the Character Identifier extension to conflate various spaces. The escaped sequence would be less confusing in this case.

// eslint-disable-next-line no-irregular-whitespace
'[ -〿]',
'\uD840[\uDC00-\uFFFF]|[\uD841-\uD872]|\uD873[\u0000-\uDEAF]', // '[𠀀-𬺯]'
'[!-○]',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per #3402 (comment), we should restrict this to fullwidth forms only ([!-⦆¢-₩]) since halfwidth forms would look quite odd unless doubled up.

/cc @friedbunny

'[ぁ-ゟ]',
'[゠-ヿ]',
'[ㇰ-ㇿ]',
'[ꀀ-꓆]',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was mistaken in including Yi script here. According to this Wikipedia article, the Classical Yi syllabary was written vertically, but the Modern Yi syllabary has always been written horizontally. So we should remove this line.

@1ec5 1ec5 changed the title Add support for rendering CJK in a vertical writing mode Add support for rendering CJK in a vertical writing mode along line-placed features Oct 22, 2016
@@ -0,0 +1,22 @@
'use strict';

const verticalRegExp = new RegExp([
Copy link
Contributor

@1ec5 1ec5 Oct 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If #3420 lands first, verticalRegExp and ideographicBreakingRegExp will have some overlap. To wit:

  • Hangul and Mongolian are vertical but don’t get ideographic breaking.
  • Halfwidth forms and Yi are horizontal but do get ideographic breaking.

@1ec5
Copy link
Contributor

1ec5 commented Oct 22, 2016

I’m not sure how prevalent punctuation is in road names, but various Chinese and Japanese dashes and brackets are rotated 90° or 180° when laid out vertically. Would it be possible to replace the following characters for vertical text?

Horizontal Vertical
_
︹ (?)
︺ (?)
︿

'[゠-ヿ]',
'[ㇰ-ㇿ]',
'[ꀀ-꓆]',
'[᠀-ᢪ]'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also remove Mongolian from this regular expression. Unlike the other scripts, Mongolian must be written vertically (never horizontally) and its glyphs are always rotated when set vertically. So apart from the fact that we don’t support complex text shaping yet (mapbox/DEPRECATED-mapbox-gl#4), Mongolian is already laid out correctly along lines from 3:00 to 6:00 and from 9:00 to 12:00. With this PR as is, Mongolian would only be laid out correctly from 3:00 to 4:30 and from 9:00 to 10:30.

If we want to address Mongolian, I think we should do that as tail work. But for that matter, I’m having a difficult time finding any area in OpenStreetMap or the Mapbox Streets source where roads are labeled in Mongolian. So best to leave Mongolian alone.

Copy link
Contributor

@xrwang xrwang Oct 24, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@1ec5 makes sense. Only Inner Mongolia uses the traditional (vertical) Mongolian script, and all of those road names in Inner Mongolia are transliterated into Chinese characters or are Chinese names. Mongolia the country uses Cyrillic (consistently LTR and horizontal).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, in OSM, some places in Inner Mongolia are tagged bilingually in Mongolian (with Mongolian script) in the name tag; however, Mapbox Streets seems to exclude the Mongolian text from its name field, and I’m not sure why. Regardless, I can’t find any streets tagged that way, so it’s low-priority for the time being.

@kkaefer
Copy link
Member

kkaefer commented Oct 24, 2016

wow

@lucaswoj lucaswoj force-pushed the vertical branch 2 times, most recently from 42163bb to 6b86db2 Compare October 31, 2016 20:41
lucaswoj pushed a commit that referenced this pull request Oct 31, 2016
@lucaswoj
Copy link
Contributor Author

benchmark master 8127e17 vertical 893d46c
map-load 221 ms 135 ms
style-load 151 ms 159 ms
buffer 1,326 ms 1,122 ms
fps 60 fps 60 fps
frame-duration 4.4 ms, 0% > 16ms 4.9 ms, 0% > 16ms
query-point 1.07 ms 1.09 ms
query-box 65.20 ms 68.88 ms
geojson-setdata-small 10 ms 8 ms
geojson-setdata-large 308 ms 261 ms

@lucaswoj
Copy link
Contributor Author

This is ready for 👀

// 'Buhid': (char) => char >= 0x1740 && char <= 0x175F,
// 'Tagbanwa': (char) => char >= 0x1760 && char <= 0x177F,
// 'Khmer': (char) => char >= 0x1780 && char <= 0x17FF,
'Mongolian': (char) => char >= 0x1800 && char <= 0x18AF,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per #3438 (comment), we don’t want to rotate Mongolian vertically or give it ideographic breaking.

'Katakana': (char) => char >= 0x30A0 && char <= 0x30FF,
// 'Bopomofo': (char) => char >= 0x3100 && char <= 0x312F,
'Hangul Compatibility Jamo': (char) => char >= 0x3130 && char <= 0x318F,
// 'Kanbun': (char) => char >= 0x3190 && char <= 0x319F,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I guess it wouldn’t hurt to make this vertical too.

// 'Cherokee Supplement': (char) => char >= 0xAB70 && char <= 0xABBF,
// 'Meetei Mayek': (char) => char >= 0xABC0 && char <= 0xABFF,
'Hangul Syllables': (char) => char >= 0xAC00 && char <= 0xD7AF,
// 'Hangul Jamo Extended-B': (char) => char >= 0xD7B0 && char <= 0xD7FF,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well make this vertical too.

// 'Devanagari Extended': (char) => char >= 0xA8E0 && char <= 0xA8FF,
// 'Kayah Li': (char) => char >= 0xA900 && char <= 0xA92F,
// 'Rejang': (char) => char >= 0xA930 && char <= 0xA95F,
// 'Hangul Jamo Extended-A': (char) => char >= 0xA960 && char <= 0xA97F,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well make this vertical too.

// 'Alphabetic Presentation Forms': (char) => char >= 0xFB00 && char <= 0xFB4F,
// 'Arabic Presentation Forms-A': (char) => char >= 0xFB50 && char <= 0xFDFF,
// 'Variation Selectors': (char) => char >= 0xFE00 && char <= 0xFE0F,
// 'Vertical Forms': (char) => char >= 0xFE10 && char <= 0xFE1F,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this block vertical and ideographic. This block will also come in handy once we implement #3438 (comment).

'CJK Symbols and Punctuation': (char) => char >= 0x3000 && char <= 0x303F,
'Hiragana': (char) => char >= 0x3040 && char <= 0x309F,
'Katakana': (char) => char >= 0x30A0 && char <= 0x30FF,
// 'Bopomofo': (char) => char >= 0x3100 && char <= 0x312F,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block doesn’t appear in OSM, but it could appear in custom data. It should be vertical and get ideographic breaking.

// 'Bopomofo': (char) => char >= 0x3100 && char <= 0x312F,
'Hangul Compatibility Jamo': (char) => char >= 0x3130 && char <= 0x318F,
// 'Kanbun': (char) => char >= 0x3190 && char <= 0x319F,
// 'Bopomofo Extended': (char) => char >= 0x31A0 && char <= 0x31BF,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block doesn’t appear in OSM, but it could appear in custom data. It should be vertical and get ideographic breaking.

@lucaswoj
Copy link
Contributor Author

lucaswoj commented Nov 1, 2016

All requested changes implemented in 85abbdc 🚀

if (char >= 0x4E00 && char <= 0x9FFF) return true;
function charAllowsVerticalWritingMode(char) {
// Return early for characters outside all ideographic ranges.
if (char < 0x2E80) return false;
Copy link
Contributor

@1ec5 1ec5 Nov 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will exclude the Hangul Jamo block. The cutoff should instead be U+1100.

(To be pedantic, not all of the vertical blocks are ideographic.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙇 fixed in 97ef972

@1ec5
Copy link
Contributor

1ec5 commented Nov 1, 2016

Would it be possible to replace the following characters for vertical text?

#3505

@lucaswoj
Copy link
Contributor Author

lucaswoj commented Nov 1, 2016

Thanks for the 🍏 @1ec5! I'm going to hold off merging until @nickidlugash gets a chance to look at this branch from a carto/desgin perspective.

@nickidlugash
Copy link

@lucaswoj @1ec5 I'm still in the middle of reviewing this! Just wanted to note one thing now though:

Vertical labels seem off-centered towards the left of the line. Will it be possible to have them be centered? I haven't had a chance to dig into the code to see where the issue is.

screen shot 2016-11-01 at 7 32 25 pm

Side note relevant to that screenshot: "/" character collides with all characters above it. This is a common character in one of our relevant data sources. I've seen conflicting information as to whether this is the correct character to use for Chinese in general and specifically for vertical orientation – solidus (U+002F), versus fullwidth solidus (U+FF0F). Either way, I think perhaps without a true vertical advance value this is not getting enough space. The design of this character is one that signficantly overshoots the cap height of the font.

@1ec5
Copy link
Contributor

1ec5 commented Nov 2, 2016

"/" character collides with all characters above it

This is a good argument for making #3505 a release blocker.

@lucaswoj
Copy link
Contributor Author

lucaswoj commented Nov 2, 2016

Vertical labels seem off-centered towards the left of the line.

I can't think of a reason why this problem would only affect vertical labels. Would you mind double checking that this behaviour only affects vertical labels? If so, we should endeavour to fix.


"/" character collides with all characters above it

This is a good argument for making #3505 a release blocker.

I'll make #3505 a release blocker. It should be easy to implement & help mitigate the problem.

@lucaswoj lucaswoj force-pushed the vertical branch 3 times, most recently from d560a43 to 32413f3 Compare November 9, 2016 01:24
@lucaswoj
Copy link
Contributor Author

lucaswoj commented Nov 9, 2016

I'm rather stuck on these remaining punctuation glitches. Any comments on the correctness / incorrectness on what's rendered or solutions @1ec5 @xrwang @nickidlugash @ansis?

screen shot 2016-11-09 at 10 39 17 am

Any bugs are most likely in this code within glyphs.js (this algo doesn't make much intuitive sense, this code has evolved based on emperical success) or this code in shaping.js.


For reference, here is our overall punctuation test suite at this time. Things are looking 🚀 overall.

actual

@1ec5
Copy link
Contributor

1ec5 commented Nov 9, 2016

This should be an underscore

I think the underscore would be exceedingly rare in CJK line-placed labels.

Should this be more to the right?

According to this Wikipedia article, the current horizontally centered behavior of is correct for Traditional Chinese, but ideally it would be offset to the right for Simplified Chinese.

Does get centered vertically only when adjacent to CJK, or always? It shouldn’t be vertically centered for Latin text.

'‘': '﹃',
'’': '﹄',
'.': '・'
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add all the remaining fullwidth punctuation so that characters like and don’t get rotated 90° when they’re sitting next to CJK. @nickidlugash, can you think of any exceptions?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I can't think of any. They should probably all not get rotated. Are we using the fullwidth numerals and latin characters for single characters sitting next to CJK?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'”': '﹂',
'‘': '﹃',
'’': '﹄',
'.': '・'
Copy link
Contributor Author

@lucaswoj lucaswoj Nov 10, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment was lost in the rebase cc @1ec5 @nickidlugash

#3438 (comment)

screen shot 2016-11-09 at 4 50 04 pm

Copy link
Contributor Author

@lucaswoj lucaswoj Nov 10, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@1ec5 I don't have a good sense of which punctuation should go into this table and which should not. Can you provide a table or algorithm? It may be easiest to just add a commit to the PR with the intended changes.

@nickidlugash We do not use fullwidth numerals and characters yet. Detecting when it is / isn't ok to use fullwidth characters is tricky. This can be done in a follow-up PR.

@lucaswoj
Copy link
Contributor Author

Vertical labels seem off-centered towards the left of the line. Will it be possible to have them be centered? I haven't had a chance to dig into the code to see where the issue is.

As best I can tell, this behaviour has not changed between master and this branch. It may have been fixed between when this was reported and now.

@lucaswoj
Copy link
Contributor Author

This PR is rebased, squashed, and 🍏. I hope we can merge this very soon! Maintaining a large & long-running feature branch is a huge amount of work.

@1ec5
Copy link
Contributor

1ec5 commented Nov 10, 2016

I’ve added all the punctuation marks that belong in this PR. In addition to the punctuation marks in #3505, I mapped various ASCII characters to their fullwidth equivalents. However, I mapped the ASCII brackets and their fullwidth equivalents to their vertical analogues in the CJK Compatibility Forms, and I left out " and ' because their vertical equivalents and are probably only appropriate for horizontal text.

Currently, we replace a punctuation character with its fullwidth analogue if both adjacent characters are compatible with vertical writing. This heuristic is conservative enough that we could implement tate-chu-yoko (see #3506) simply by mapping all the ASCII numbers and letters to their fullwidth analogues in lookup (or by adding 0xfee0 to the character code of any ASCII number or letter). However, I think ultimately we’ll want to split lookup into three lookup tables (leftBinding, leftOrRightBinding, and rightBinding) and consider only the side(s) to which the character is bound.

Since both those improvements would require changes to the test suite, I’ll take that as a sign that we should go ahead and merge and follow up with more heuristics in a separate PR.

@1ec5
Copy link
Contributor

1ec5 commented Nov 10, 2016

Currently, we replace a punctuation character with its fullwidth analogue if both adjacent characters are compatible with vertical writing. This heuristic is conservative enough that we could implement tate-chu-yoko (see #3506) simply by mapping all the ASCII numbers and letters to their fullwidth analogues in lookup (or by adding 0xfee0 to the character code of any ASCII number or letter).

#3588

However, I think ultimately we’ll want to split lookup into three lookup tables (leftBinding, leftOrRightBinding, and rightBinding) and consider only the side(s) to which the character is bound.

#3587

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Orient CJK glyphs vertically along vertically oriented lines
6 participants