Add support for ideographic text breaking #3420

xrwang · 2016-10-20T06:26:18Z

Launch Checklist

Splitting out from #3402 for clearer tracking of two separate issues.

This PR adds support for:

CJK line breaking

Requirements

GL JS must line break for point labels in CJK even if there are no spaces in the label

Specifications

we will use naïve "balanced" breaking
we will enable / disable naive breaking based on language detection
we will use character ranges for language detection (data)

Launch Checklist

fix failing unit tests
write tests
remove whitespace & debug page changes from diff where appropriate
manually test the debug page
post benchmark scores
merge Add test for ideographic text breaking mapbox-gl-test-suite#153

mourner · 2016-10-20T10:58:51Z

js/symbol/shaping.js

@@ -123,7 +150,14 @@ function linewrap(shaping, glyphs, lineHeight, maxWidth, horizontalAlign, vertic
                line++;
            }

-            if (breakable[positionedGlyph.codePoint]) {
+            if (positionedGlyph.codePoint > 19968) {


Magic number alert. We should add a comment about this and/or make it a constant.

Also, when it comes to Unicode codepoints, always use hexadecimal literals: 0x4e00.

I reckon we will replace this with a modified version of the regex in #3438
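A minimal sketch of the suggestion above, assuming the threshold is meant to be the first code point of the CJK Unified Ideographs block. The constant and function names are hypothetical and do not come from the PR. Note that 19968 equals 0x4e00 and that the strict `>` in the diff excludes U+4E00 itself; the sketch uses `>=` on that assumption.

```javascript
// Hypothetical name. 0x4e00 (19968 in decimal) is the first code point
// of the CJK Unified Ideographs block.
const CJK_UNIFIED_IDEOGRAPHS_START = 0x4e00;

// Illustrative helper: true for code points at or above the start of the block.
function isAtOrAboveCJKStart(codePoint) {
    return codePoint >= CJK_UNIFIED_IDEOGRAPHS_START;
}
```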

mourner · 2016-10-20T10:59:19Z

js/symbol/shaping.js

+                    lastSafeBreak = i - 1;
+                }
+                if (!(breakable[positionedGlyph.codePoint])) {
+                    lastSafeBreak = Math.round(wordLength / 3);


Here too. Why is it divided by 3?

mourner · 2016-10-20T10:59:38Z

debug/chinese.html

@@ -37,7 +37,6 @@
    hash: true
 });

-map.addControl(new mapboxgl.Navigation());


Not intentional?

1ec5 · 2016-10-21T20:58:35Z

js/symbol/shaping.js

-    0x2f:   true, // solidus
-    0xad:   true, // soft hyphen
-    0xb7:   true, // middle dot
+    0x0020: true, // space


Would it be more performant to use a regular expression instead of iterating over this set? See #3438 (comment) for ideas on keeping the regular expression maintainable.

Fortunately we never need to iterate over this object. We just do lookups into it. That should be the fastest solution 👍
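To illustrate the point about lookups, here is a small sketch of a code point table queried by key rather than iterated. The entries are a subset of those visible in the diff; the `isBreakable` helper is hypothetical.

```javascript
// Illustrative subset of a code point lookup table.
const breakable = {
    0x0020: true, // space
    0x002f: true, // solidus
    0x00ad: true, // soft hyphen
    0x00b7: true  // middle dot
};

// Hypothetical helper: one property lookup per glyph, no iteration over the table.
function isBreakable(codePoint) {
    return breakable[codePoint] === true;
}
```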

1ec5 · 2016-10-21T21:55:35Z

See https://github.com/mapbox/mapbox-gl-js/pull/3438/files#diff-0612e4d64682542fe1be64751bc484a3 for a work-in-progress regular expression that covers all the traditionally top-to-bottom scripts. This regular expression largely overlaps with the scripts that also need “balanced” ideographic breaking, but Hangul ('[ᄀ-ᇿ]', '[가-힣]', '[ㄱ-ㆎ]') and Mongolian ([᠀-ᢪ]) need to be removed.

lucaswoj · 2016-10-21T23:45:49Z

js/symbol/shaping.js

+            const lastPositionedGlyph = positionedGlyphs[positionedGlyphs.length - 1];
+            const estimatedLineCount = Math.max(1, Math.ceil(lastPositionedGlyph.x / maxWidth));
+            maxWidth = lastPositionedGlyph.x / estimatedLineCount;
+        }


I don't know if this is a hack or a ⚠️ HACK ⚠️.
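For readers following the hunk above, this is a standalone sketch of the same arithmetic: estimate how many lines the label needs at the configured width, then shrink the width so those lines come out roughly even. The function name and the `totalWidth` parameter are hypothetical; in the diff the total comes from `lastPositionedGlyph.x`.

```javascript
// Sketch of the "balanced" width computation, under the assumption that
// totalWidth is the x position of the last glyph on an unwrapped line.
function balancedMaxWidth(totalWidth, maxWidth) {
    // At least one line, otherwise as many as the configured width requires.
    const estimatedLineCount = Math.max(1, Math.ceil(totalWidth / maxWidth));
    // Spread the total width evenly across the estimated lines.
    return totalWidth / estimatedLineCount;
}
```

With a total width of 250 and a configured maximum of 100, a greedy wrap would give lines of about 100, 100, and 50, whereas the adjusted limit of 250 / 3 aims for three lines of similar length.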

lucaswoj · 2016-10-21T23:46:23Z

js/symbol/shaping.js

+    0xff5b: true, // fullwidth left curly bracket
+    0xff5e: true, // fullwidth tilde
+    0xffe1: true, // fullwidth pound sign
+    0xffe5: true  // fullwidth yen sign


This lookup table isn't strictly necessary anymore, though it seems like an improvement

I think the changes to this table should be reverted. Now it’s possible for Latin text to break on either side of a dollar sign, for instance:

Ye Old $ 99 Store

Furthermore, in ideographic scripts, characters like U+3016 left white lenticular bracket (〖) visually combine a space and a punctuation mark, so it can only break on one side (in this case, on the left but not the right).

As things stand in this PR, the CJK punctuation in this table has no effect on ideographic breaking, since breaking can occur on either side of any character. Indeed, we will eventually want to avoid orphaning CJK punctuation, but I don’t think this PR accomplishes that yet.

@1ec5, 👌 makes sense, will revert the additions.

Reverted in 877c895 (I kept a couple new characters 😈 )

What's the rationale for breaking at quotation marks or apostrophes? This would result in labels like "Linens 'n Things" turning into:

Linens ' n Things

Good point. This is my mistake. Reverting to the original set now.

1ec5 · 2016-10-22T18:43:58Z

js/util/script_detection.js

@@ -0,0 +1,18 @@
+'use strict';
+
+const ideographicBreakingRegExp = new RegExp([


If #3438 lands first, verticalRegExp and ideographicBreakingRegExp will have some overlap. To wit:

Hangul and Mongolian are vertical but don’t get ideographic breaking.

Halfwidth forms and Yi are horizontal but do get ideographic breaking.

Would it make sense to break this regex up into a set of shared strings? For example:

    const hangulRegExpString = '...';
    const halfwidthFormsRegExpString = '...';
    ...
    const ideographicBreakingRegExp = new RegExp([
        hangulRegExpString,
        halfwidthFormsRegExpString,
        ...
    ].join('|'));

I'd be happy to do so. Would you mind pointing out which character ranges belong to which sets?

I plan to merge this PR first. Let's think about the merge between the two regexps in #3438
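A rough sketch of how the shared-strings idea could look, assuming three Unicode blocks stand in for the full sets. The variable names and the grouping are illustrative only; which ranges belong to which regular expression is exactly the open question in this thread.

```javascript
// Hypothetical shared fragments, each covering one Unicode block.
const cjkUnifiedRegExpString = '[\\u4E00-\\u9FFF]';      // CJK Unified Ideographs
const hangulSyllablesRegExpString = '[\\uAC00-\\uD7A3]'; // Hangul Syllables
const yiSyllablesRegExpString = '[\\uA000-\\uA48F]';     // Yi Syllables

// Vertical scripts: includes Hangul, excludes Yi (per the comment above).
const verticalRegExp = new RegExp([
    cjkUnifiedRegExpString,
    hangulSyllablesRegExpString
].join('|'));

// Ideographic breaking: includes Yi, excludes Hangul (per the comment above).
const ideographicBreakingRegExp = new RegExp([
    cjkUnifiedRegExpString,
    yiSyllablesRegExpString
].join('|'));
```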

lucaswoj · 2016-10-25T18:09:37Z

@1ec5 I think I've addressed all your feedback and my todos. Anything else you'd like to see before this 🚢s?

@jfirebaugh @mourner @ansis Would you like to 👀 this PR?

jfirebaugh · 2016-10-25T18:13:27Z

package.json

@@ -56,7 +56,7 @@
    "highlight.js": "9.3.0",
    "jsdom": "^9.4.2",
    "lodash.template": "^4.4.0",
-    "mapbox-gl-test-suite": "mapbox/mapbox-gl-test-suite#28c76c64e8cfcee8764c6c0f6d4fcc2d15a8d1e1",
+    "mapbox-gl-test-suite": "mapbox/mapbox-gl-test-suite#77b281c9e6225471505f7daefa76806a8fbf22e2",


Need a rebase and merge of the mapbox-gl-test-suite branch.

✅ I'll do that with the "squash & merge" button on mapbox/mapbox-gl-test-suite#153 once this PR gets the green light.

lucaswoj · 2016-10-25T18:16:16Z

benchmark	master `e03a7b1`	cjk-break2 `7462ab1`
map-load	171 ms	179 ms
style-load	197 ms	147 ms
buffer	1,068 ms	1,200 ms
fps	60 fps	60 fps
frame-duration	6.4 ms, 1% > 16ms	6.8 ms, 1% > 16ms
query-point	1.25 ms	1.16 ms
query-box	82.52 ms	89.32 ms
geojson-setdata-small	10 ms	7 ms
geojson-setdata-large	316 ms	346 ms

ansis · 2016-10-25T18:22:52Z

How does this handle cases where the text contains both ideographic glyphs and non-ideographic ones? allowsIdeographicBreaking looks like it returns true in these cases. Does this mean that linebreaks can occur anywhere in these cases?

1ec5 · 2016-10-25T18:28:13Z

How does this handle cases where the text contains both ideographic glyphs and non-ideographic ones? allowsIdeographicBreaking looks like it returns true in these cases. Does this mean that linebreaks can occur anywhere in these cases?

Yes, and that goes for #3438 as well. That’s why I originally proposed a negation in #3402 (comment) that would only affect labels that consisted entirely of ideographic characters.

lucaswoj · 2016-10-25T19:10:25Z

Yes, and that goes for #3438 as well. That’s why I originally proposed a negation in #3402 (comment) that would only affect labels that consisted entirely of ideographic characters.

Ah. I see. I did not understand the intent behind that code. I will adjust this PR to accommodate.

1ec5 · 2016-10-25T19:50:02Z

js/util/script_detection.js

+].join('|')})+$`);
+
+module.exports.allowsIdeographicBreaking = function(input) {
+    return input.search(ideographicBreakingRegExp) !== -1;


Now that a match for ideographicBreakingRegExp has ^ and $ anchors, spanning the entire input string, this should be a call to match(). But I wonder if it’s even necessary to match the entire string. Searching for [^一-鿌…] instead would be more performant for strings that contain no CJK. The tricky part is the surrogate pairs, but it should definitely be possible to negate that part of the regular expression too.
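The two approaches discussed above can be compared in a small sketch. It assumes a single BMP range (CJK Unified Ideographs) in place of the PR's full list, sidesteps surrogate pairs entirely, and uses `RegExp.prototype.test` as a simple boolean check; the function names are hypothetical.

```javascript
// Anchored form: the whole string must consist of ideographic characters.
const allIdeographicRegExp = /^[\u4E00-\u9FFF]+$/;
// Negated form: succeeds as soon as one non-ideographic character is found.
const nonIdeographicRegExp = /[^\u4E00-\u9FFF]/;

function allowsIdeographicBreakingAnchored(input) {
    return allIdeographicRegExp.test(input);
}

function allowsIdeographicBreakingNegated(input) {
    // The length check keeps the empty string from counting as ideographic.
    return input.length > 0 && !nonIdeographicRegExp.test(input);
}
```

For a Latin-only label the negated form can stop at the first character, which is the performance argument made in the comment.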

1ec5 · 2016-10-25T22:22:42Z

Concretely, the problem is bilingual labels or those that include romanizations, like this POI, named 曾大屋新村 Tsang Tai Uk New Village. We wouldn’t want it to become:

曾大屋新村
Tsang
Tai Uk
New Vi
llage

On the other hand, there are legitimate reasons for a CJK label to contain non-CJK characters, like (hypothetically) a subway station named 施氏食狮史（ 1A口）.

What this PR calls “ideographic breaking” is a combination of two features: breaking on any character, and balancing lines when breaking. The two naturally go hand in hand for purely CJK labels, but bilingual labels force us to distinguish between them.

If we’re conservative and err on the side of applying word breaking throughout, then the worst that could happen is a label that collides other labels out. If we’re too aggressive and err on the side of applying CJK breaking throughout, then we risk breaking in the middle of non-CJK words, which looks amateurish.

Ideally, in the presence of non-CJK characters, we’d continue to break on any CJK character but go back to word breaking for non-CJK words. @lucaswoj has pushed an implementation that seems to do this quite well.
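The "ideal" behavior described above can be sketched as a function that lists permitted break positions: next to any ideographic character, or after a space for everything else. This is an illustration under simplifying assumptions (one BMP range, space as the only word separator) and is not the implementation pushed to the branch; all names are hypothetical.

```javascript
// Simplified check: CJK Unified Ideographs block only.
function isIdeographic(codeUnit) {
    return codeUnit >= 0x4E00 && codeUnit <= 0x9FFF;
}

// Returns the indices i at which a new line may start (a break before text[i]).
function breakOpportunities(text) {
    const breaks = [];
    for (let i = 1; i < text.length; i++) {
        const prev = text.charCodeAt(i - 1);
        const next = text.charCodeAt(i);
        if (next === 0x20) continue; // never start a line with a space
        // Break after a space, or on either side of an ideographic character.
        if (prev === 0x20 || isIdeographic(prev) || isIdeographic(next)) {
            breaks.push(i);
        }
    }
    return breaks;
}
```

Under these rules "New Village" can only break before "Village", while every character boundary inside 曾大屋新村 remains a candidate.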

mourner · 2016-10-25T22:24:04Z

buffer 1,068 ms 1,200 ms

Is this due to a flaky benchmark, or a real regression? I wouldn't expect a perf regression in a Latin-character area, which is what we use in the bench. We can still merge but should follow up if it's a real regression.

lucaswoj · 2016-10-25T22:39:10Z

We might expect some slowdown in Latin areas because we need to check each label for ideographic characters. In the case of Latin-only labels, e1eee1b improves this from an O(number of characters) check to an O(number of labels) check.

Here are a couple of benchmark runs. We sorely need #3237.

benchmark	master `baa2b8d`	cjk-break2 `0f3ed22`
buffer	889 ms	901 ms

benchmark	master `baa2b8d`	cjk-break2 `0f3ed22`
buffer	901 ms	878 ms

benchmark	master `baa2b8d`	cjk-break2 `0f3ed22`
buffer	879 ms	893 ms

lucaswoj · 2016-10-25T22:39:51Z

@1ec5 are you ready to bless this with a ✅?

mourner · 2016-10-25T22:45:32Z

We might expect some slowdown in Latin areas because we need to check each label for ideographic characters.

@lucaswoj The bench results look great now! O(number of labels) is very fast. If we want to improve the check further, maybe we could bail out early on Latin characters (e.g. if (char <= ...) return false; on the first line).
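A sketch of the early bailout idea, assuming the lowest range of interest starts at U+2E80 (CJK Radicals Supplement). The comment above leaves the threshold elided, so that cutoff, the list of ranges, and the function name are all assumptions made for illustration.

```javascript
// Hypothetical per-character check with a fast path for Latin text.
function isIdeographicCodeUnit(char) {
    // Assumed cutoff: nothing below U+2E80 is treated as ideographic here,
    // so Latin, Cyrillic, Greek, etc. return after a single comparison.
    if (char < 0x2E80) return false;

    if (char >= 0x2E80 && char <= 0x2FDF) return true; // CJK and Kangxi radicals
    if (char >= 0x3000 && char <= 0x30FF) return true; // CJK punctuation, Hiragana, Katakana
    if (char >= 0x3400 && char <= 0x4DBF) return true; // CJK Unified Ideographs Extension A
    if (char >= 0x4E00 && char <= 0x9FFF) return true; // CJK Unified Ideographs
    return false;
}
```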

lucaswoj · 2016-10-25T23:44:18Z

js/util/script_detection.js

+    // "𠀀" to "𬺯"
+    if (char === 0xD840 && nextChar >= 0xDC00) return true;
+    if (char >= 0xD841 && char <= 0xD872) return true;
+    if (char === 0xD873 && nextChar <= 0xDEAF) return true;


I've realized that there are a few bugs in our handling of surrogate pairs. Because we don't actually have the ability to render surrogate pairs as glyphs, I can't make a test case. I'm thinking we should remove this range altogether.
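For context on the hunk above: "𠀀" is U+20000, stored as the code units 0xD840 0xDC00, and "𬺯" is U+2CEAF, stored as 0xD873 0xDEAF. One alternative to comparing code units by hand, shown here purely as a sketch and not as what the PR did, is to decode the pair with `String.prototype.codePointAt` and compare whole code points. The function name is hypothetical.

```javascript
// Sketch: test whether the character starting at `index` lies in the
// supplementary range U+20000 to U+2CEAF by decoding the surrogate pair first.
function isSupplementaryIdeograph(text, index) {
    const codePoint = text.codePointAt(index);
    return codePoint >= 0x20000 && codePoint <= 0x2CEAF;
}
```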

1ec5 · 2016-10-26T10:57:10Z

A C++ port is ready in mapbox/mapbox-gl-native#6828. In that PR, I took another look at the exact character ranges we treat as ideographic. I decided to add some additional Unicode code blocks from this Wikipedia table – just the BMP blocks, since we don’t support the SIP for glyphs yet (mapbox/DEPRECATED-mapbox-gl#29). Additionally, I expanded each range to cover the entire block, not just the currently assigned code points, to future-proof the code a bit. (Future versions of Unicode may assign more characters within these blocks and also add new blocks. However, new CJK blocks would likely fall within the SIP.)

Feel free to synchronize this PR with that one.

lucaswoj · 2016-10-31T13:49:45Z

@1ec5 Done in b2cd6ca. I'm going to rebase this branch and then 🚢.

xrwang added the not ready for review label Oct 20, 2016

mourner reviewed Oct 20, 2016

1ec5 reviewed Oct 21, 2016

lucaswoj mentioned this pull request Oct 21, 2016

Add test for ideographic text breaking mapbox/mapbox-gl-test-suite#153

Merged

lucaswoj reviewed Oct 21, 2016

lucaswoj removed the not ready for review label Oct 21, 2016

1ec5 mentioned this pull request Oct 22, 2016

Add support for rendering CJK in a vertical writing mode along line-placed features #3438

Merged

16 tasks

1ec5 suggested changes Oct 22, 2016

lucaswoj changed the title ~~Add support for breaking CJK lines evenly~~ Add support for ideographic text breaking Oct 24, 2016

lucaswoj self-assigned this Oct 24, 2016

lucaswoj force-pushed the cjk-break2 branch from 3242ac2 to 7462ab1 Compare October 25, 2016 18:08

jfirebaugh approved these changes Oct 25, 2016

1ec5 suggested changes Oct 25, 2016

lucaswoj force-pushed the cjk-break2 branch from 83a7aef to 0f3ed22 Compare October 25, 2016 22:20

lucaswoj force-pushed the cjk-break2 branch from 0f3ed22 to 9fcbbda Compare October 25, 2016 22:40

1ec5 approved these changes Oct 25, 2016

lucaswoj reviewed Oct 25, 2016

1ec5 mentioned this pull request Oct 26, 2016

Line-break ideographic text by character mapbox/mapbox-gl-native#6828

Merged

2 tasks

1ec5 mentioned this pull request Oct 26, 2016

Text wrap without spaces mapbox/mapbox-gl-style-spec#420

Closed

xrwang and others added 15 commits October 31, 2016 10:02

update linewrap to check for first character of chinese

bdaae68

Tweak algorithm, add test

49081aa

Rename "balanced" breaking to "ideographic" breaking

b54930f

Improve balanced breaking

9040762

Fix eslint

d44e239

Remove CJK characters from "breakable" table

e31de52

Revert changes to Chinese debug pages

60dc351

Revert all "breakable" changes

56fa921

Refactored so that mixed scripts do not use ideographic line breaking

7452092

Bump test-suite version

cab9759

Switch from regex to nested "if" statements

5257a32

Allow breaking on all ideographic characters

ef298f4

Remove surrogate pairs from script detection

70b889d

Sync with native PR

0e7f186

Bump test-suite version

83e2923

lucaswoj force-pushed the cjk-break2 branch from b2cd6ca to 83e2923 Compare October 31, 2016 14:04

lucaswoj merged commit 2a706ab into master Oct 31, 2016

lucaswoj deleted the cjk-break2 branch October 31, 2016 14:15

1ec5 mentioned this pull request Nov 18, 2016

Halfwidth punctuation prevents ideographic line breaking #3658

Closed
