Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for ideographic text breaking #3420

Merged
merged 15 commits into from
Oct 31, 2016
Merged

Add support for ideographic text breaking #3420

merged 15 commits into from
Oct 31, 2016

Conversation

xrwang
Copy link
Contributor

@xrwang xrwang commented Oct 20, 2016

Launch Checklist

Splitting out from #3402 for clearer tracking of two separate issues.

This PR adds support for:

  • CJK line breaking

Requirements

  • GL JS must line break for point labels in CJK even if there are no spaces in the label

Specifications

  • we will use naïve "balanced" breaking
  • we will enable / disable naive breaking based on language detection
  • we will use character ranges for language detection (data)

Launch Checklist

@@ -123,7 +150,14 @@ function linewrap(shaping, glyphs, lineHeight, maxWidth, horizontalAlign, vertic
line++;
}

if (breakable[positionedGlyph.codePoint]) {
if (positionedGlyph.codePoint > 19968) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Magic number alert. We should add a comment about this and/or make it a constant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, when it comes to Unicode codepoints, always use hexadecimal literals: 0x4e00.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reckon we will replace this with a modified version of the regex in #3438

lastSafeBreak = i - 1;
}
if (!(breakable[positionedGlyph.codePoint])) {
lastSafeBreak = Math.round(wordLength / 3);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too. Why is it divided by 3?

@@ -37,7 +37,6 @@
hash: true
});

map.addControl(new mapboxgl.Navigation());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not intentional?

0x2f: true, // solidus
0xad: true, // soft hyphen
0xb7: true, // middle dot
0x0020: true, // space
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be more performant to use a regular expression instead of iterating over this set? See #3438 (comment) for ideas on keeping the regular expression maintainable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fortunately we never need to iterate over this object. We just do lookups into it. That should be the fastest solution 👍

@1ec5
Copy link
Contributor

1ec5 commented Oct 21, 2016

See https://github.com/mapbox/mapbox-gl-js/pull/3438/files#diff-0612e4d64682542fe1be64751bc484a3 for a work-in-progress regular expression that covers all the traditionally top-to-bottom scripts. This regular expression largely overlaps with the scripts that also need “balanced” ideographic breaking, but Hangul ('[ᄀ-ᇿ]', '[가-힣]', '[ㄱ-ㆎ]') and Mongolian ([᠀-ᢪ]) need to be removed.

const lastPositionedGlyph = positionedGlyphs[positionedGlyphs.length - 1];
const estimatedLineCount = Math.max(1, Math.ceil(lastPositionedGlyph.x / maxWidth));
maxWidth = lastPositionedGlyph.x / estimatedLineCount;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this is a hack or a ⚠️ HACK ⚠️.

0xff5b: true, // fullwidth left curly bracket
0xff5e: true, // fullwidth tilde
0xffe1: true, // fullwidth pound sign
0xffe5: true // fullwidth yen sign
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lookup table isn't strictly necessary anymore, though it seems like an improvement

Copy link
Contributor

@1ec5 1ec5 Oct 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the changes to this table should be reverted. Now it’s possible for Latin text to break on either side of a dollar sign, for instance:

Ye Old $
99 Store

Furthermore, in ideographic scripts, characters like U+3016 left white lenticular bracket (〖) visually combine a space and a punctuation mark, so it can only break on one side (in this case, on the left but not the right).

As things stand in this PR, the CJK punctuation in this table has no effect on ideographic breaking, since breaking can occur on either side of any character. Indeed, we will eventually want to avoid orphaning CJK punctuation, but I don’t think this PR accomplishes that yet.

Copy link
Contributor Author

@xrwang xrwang Oct 24, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@1ec5, 👌 makes sense, will revert the additions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted in 877c895 (I kept a couple new characters 😈 )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the rationale for breaking at quotation marks or apostrophes? This would result in labels like "Linens 'n Things" turning into:

Linens '
n Things

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. This is my mistake. Reverting to the original set now.

@@ -0,0 +1,18 @@
'use strict';

const ideographicBreakingRegExp = new RegExp([
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If #3438 lands first, verticalRegExp and ideographicBreakingRegExp will have some overlap. To wit:

  • Hangul and Mongolian are vertical but don’t get ideographic breaking.
  • Halfwidth forms and Yi are horizontal but do get ideographic breaking.

Copy link
Contributor

@lucaswoj lucaswoj Oct 24, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to break this regex up into a set of shared strings? For example:

const hangulRegExpString = '...';
const halfwidthFormsRegExpString = '...';
...

const ideographicBreakingRegExp = new RegExp([
    hangulRegExpString,
    halfwidthFormsRegExpString,
    ...
].join('|'));

I'd be happy to do so. Would you mind pointing out which character ranges belong to which sets?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan to merge this PR first. Let's think about the merge between the two regexps in #3438

0xff5b: true, // fullwidth left curly bracket
0xff5e: true, // fullwidth tilde
0xffe1: true, // fullwidth pound sign
0xffe5: true // fullwidth yen sign
Copy link
Contributor

@1ec5 1ec5 Oct 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the changes to this table should be reverted. Now it’s possible for Latin text to break on either side of a dollar sign, for instance:

Ye Old $
99 Store

Furthermore, in ideographic scripts, characters like U+3016 left white lenticular bracket (〖) visually combine a space and a punctuation mark, so it can only break on one side (in this case, on the left but not the right).

As things stand in this PR, the CJK punctuation in this table has no effect on ideographic breaking, since breaking can occur on either side of any character. Indeed, we will eventually want to avoid orphaning CJK punctuation, but I don’t think this PR accomplishes that yet.

@lucaswoj lucaswoj changed the title Add support for breaking CJK lines evenly Add support for ideographic text breaking Oct 24, 2016
@lucaswoj lucaswoj self-assigned this Oct 24, 2016
@lucaswoj
Copy link
Contributor

@1ec5 I think I've addressed all your feedback and my todos. Anything else you'd like to see before this 🚢s?

@jfirebaugh @mourner @ansis Would you like to 👀 this PR?

@@ -56,7 +56,7 @@
"highlight.js": "9.3.0",
"jsdom": "^9.4.2",
"lodash.template": "^4.4.0",
"mapbox-gl-test-suite": "mapbox/mapbox-gl-test-suite#28c76c64e8cfcee8764c6c0f6d4fcc2d15a8d1e1",
"mapbox-gl-test-suite": "mapbox/mapbox-gl-test-suite#77b281c9e6225471505f7daefa76806a8fbf22e2",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a rebase and merge of the mapbox-gl-test-suite branch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ I'll do that with the "squash & merge" button on mapbox/mapbox-gl-test-suite#153 once this PR gets the green light.

@lucaswoj
Copy link
Contributor

benchmark master e03a7b1 cjk-break2 7462ab1
map-load 171 ms 179 ms
style-load 197 ms 147 ms
buffer 1,068 ms 1,200 ms
fps 60 fps 60 fps
frame-duration 6.4 ms, 1% > 16ms 6.8 ms, 1% > 16ms
query-point 1.25 ms 1.16 ms
query-box 82.52 ms 89.32 ms
geojson-setdata-small 10 ms 7 ms
geojson-setdata-large 316 ms 346 ms

@ansis
Copy link
Contributor

ansis commented Oct 25, 2016

How does this handle cases where the text contains both ideographic glyphs and non-ideographic ones? allowsIdeographicBreaking looks like it returns true in these cases. Does this mean that linebreaks can occur anywhere in these cases?

@1ec5
Copy link
Contributor

1ec5 commented Oct 25, 2016

How does this handle cases where the text contains both ideographic glyphs and non-ideographic ones? allowsIdeographicBreaking looks like it returns true in these cases. Does this mean that linebreaks can occur anywhere in these cases?

Yes, and that goes for #3438 as well. That’s why I originally proposed a negation in #3402 (comment) that would only affect labels that consisted entirely of ideographic characters.

@lucaswoj
Copy link
Contributor

Yes, and that goes for #3438 as well. That’s why I originally proposed a negation in #3402 (comment) that would only affect labels that consisted entirely of ideographic characters.

Ah. I see. I did not understand the intent behind that code. I will adjust this PR to accommodate.

].join('|')})+$`);

module.exports.allowsIdeographicBreaking = function(input) {
return input.search(ideographicBreakingRegExp) !== -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that a match for ideographicBreakingRegExp has ^ and $ anchors, spanning the entire input string, this should be a call to match(). But I wonder if it’s even necessary to match the entire string. Searching for [^一-鿌…] instead would be more performant for strings that contain no CJK. The tricky part is the surrogate pairs, but it should definitely be possible to negate that part of the regular expression too.

@1ec5
Copy link
Contributor

1ec5 commented Oct 25, 2016

Concretely, the problem is bilingual labels or those that include romanizations, like this POI, named 曾大屋新村 Tsang Tai Uk New Village. We wouldn’t want it become:

曾大屋新村
Tsang
Tai Uk
New Vi
llage

On the other hand, there are legitimate reasons for a CJK label to contain non-CJK characters, like (hypothetically) a subway station named 施氏食狮史( 1A口).

What this PR calls “ideographic breaking” is a combination of two features: breaking on any character, and balancing lines when breaking. The two naturally go hand in hand for purely CJK labels, but bilingual labels force us to distinguish between them.

If we’re conservative and err on the side of applying word breaking throughout, then the worst that could happen is a label that collides other labels out. If we’re too aggressive and err on the side of applying CJK breaking throughout, then we risk breaking in the middle of non-CJK words, which looks amateurish.

Ideally, in the presence of non-CJK characters, we’d continue to break on any CJK character but go back to word breaking for non-CJK words. @lucaswoj has pushed an implementation that seems to do this quite well.

@mourner
Copy link
Member

mourner commented Oct 25, 2016

buffer 1,068 ms 1,200 ms

Is this due to flaky benchmark, or a real regression? I wouldn't expect a perf regression on an latin-character area which we use in the bench. We can still merge but should follow-up if it's a real regression.

@lucaswoj
Copy link
Contributor

We might expect some slowdown in Latin areas because we need to check each label for ideographic characters. In the case of Latin-only labels, e1eee1b improves this from a O(number of characters) to O(number of labels) check.

Here are a couple of benchmark runs. We sorely need #3237.

benchmark master baa2b8d cjk-break2 0f3ed22
buffer 889 ms 901 ms
benchmark master baa2b8d cjk-break2 0f3ed22
buffer 901 ms 878 ms
benchmark master baa2b8d cjk-break2 0f3ed22
buffer 879 ms 893 ms

@lucaswoj
Copy link
Contributor

@1ec5 are you ready to bless this with a ✅?

@mourner
Copy link
Member

mourner commented Oct 25, 2016

We might expect some slowdown in Latin areas because we need to check each label for ideographic characters.

@lucaswoj The bench results look great now! O(number of labels) is very fast. If we want to improve the check further, maybe we could bail out early on latin characters (e.g. if (char <= ...) return false; on the first line).

// "𠀀" to "𬺯"
if (char === 0xD840 && nextChar >= 0xDC00) return true;
if (char >= 0xD841 && char <= 0xD872) return true;
if (char === 0xD873 && nextChar <= 0xDEAF) return true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've realized that there are a few bugs in our handling of surrogate pairs. Because we don't actually have the ability to render surrogate pairs as glyphs, I can't make a test case. I'm thinking we should remove this range altogether.

@1ec5
Copy link
Contributor

1ec5 commented Oct 26, 2016

A C++ port is ready in mapbox/mapbox-gl-native#6828. In that PR, I took another look at the exact character ranges we treat as ideographic. I decided to add some additional Unicode code blocks from this Wikipedia table – just the BMP blocks, since we don’t support the SIP for glyphs yet (mapbox/DEPRECATED-mapbox-gl#29). Additionally, I expanded each range to cover the entire block, not just the currently assigned code points, to future-proof the code a bit. (Future versions of Unicode may assign more characters within these blocks and also add new blocks. However, new CJK blocks would likely fall within the SIP.)

Feel free to synchronize this PR with that one.

@lucaswoj
Copy link
Contributor

@1ec5 Done in b2cd6ca. I'm going to rebase this branch and then 🚢.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants