-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for ideographic text breaking #3420
Conversation
@@ -123,7 +150,14 @@ function linewrap(shaping, glyphs, lineHeight, maxWidth, horizontalAlign, vertic | |||
line++; | |||
} | |||
|
|||
if (breakable[positionedGlyph.codePoint]) { | |||
if (positionedGlyph.codePoint > 19968) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Magic number alert. We should add a comment about this and/or make it a constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, when it comes to Unicode codepoints, always use hexadecimal literals: 0x4e00
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reckon we will replace this with a modified version of the regex in #3438
lastSafeBreak = i - 1; | ||
} | ||
if (!(breakable[positionedGlyph.codePoint])) { | ||
lastSafeBreak = Math.round(wordLength / 3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here too. Why is it divided by 3?
@@ -37,7 +37,6 @@ | |||
hash: true | |||
}); | |||
|
|||
map.addControl(new mapboxgl.Navigation()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not intentional?
0x2f: true, // solidus | ||
0xad: true, // soft hyphen | ||
0xb7: true, // middle dot | ||
0x0020: true, // space |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be more performant to use a regular expression instead of iterating over this set? See #3438 (comment) for ideas on keeping the regular expression maintainable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fortunately we never need to iterate over this object. We just do lookups into it. That should be the fastest solution 👍
See https://github.com/mapbox/mapbox-gl-js/pull/3438/files#diff-0612e4d64682542fe1be64751bc484a3 for a work-in-progress regular expression that covers all the traditionally top-to-bottom scripts. This regular expression largely overlaps with the scripts that also need “balanced” ideographic breaking, but Hangul ( |
const lastPositionedGlyph = positionedGlyphs[positionedGlyphs.length - 1]; | ||
const estimatedLineCount = Math.max(1, Math.ceil(lastPositionedGlyph.x / maxWidth)); | ||
maxWidth = lastPositionedGlyph.x / estimatedLineCount; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if this is a hack or a
0xff5b: true, // fullwidth left curly bracket | ||
0xff5e: true, // fullwidth tilde | ||
0xffe1: true, // fullwidth pound sign | ||
0xffe5: true // fullwidth yen sign |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This lookup table isn't strictly necessary anymore, though it seems like an improvement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the changes to this table should be reverted. Now it’s possible for Latin text to break on either side of a dollar sign, for instance:
Ye Old $
99 Store
Furthermore, in ideographic scripts, characters like U+3016 left white lenticular bracket (〖) visually combine a space and a punctuation mark, so it can only break on one side (in this case, on the left but not the right).
As things stand in this PR, the CJK punctuation in this table has no effect on ideographic breaking, since breaking can occur on either side of any character. Indeed, we will eventually want to avoid orphaning CJK punctuation, but I don’t think this PR accomplishes that yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@1ec5, 👌 makes sense, will revert the additions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted in 877c895 (I kept a couple new characters 😈 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the rationale for breaking at quotation marks or apostrophes? This would result in labels like "Linens 'n Things" turning into:
Linens '
n Things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. This is my mistake. Reverting to the original set now.
@@ -0,0 +1,18 @@ | |||
'use strict'; | |||
|
|||
const ideographicBreakingRegExp = new RegExp([ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If #3438 lands first, verticalRegExp
and ideographicBreakingRegExp
will have some overlap. To wit:
- Hangul and Mongolian are vertical but don’t get ideographic breaking.
- Halfwidth forms and Yi are horizontal but do get ideographic breaking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to break this regex up into a set of shared strings? For example:
const hangulRegExpString = '...';
const halfwidthFormsRegExpString = '...';
...
const ideographicBreakingRegExp = new RegExp([
hangulRegExpString,
halfwidthFormsRegExpString,
...
].join('|'));
I'd be happy to do so. Would you mind pointing out which character ranges belong to which sets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plan to merge this PR first. Let's think about the merge between the two regexps in #3438
0xff5b: true, // fullwidth left curly bracket | ||
0xff5e: true, // fullwidth tilde | ||
0xffe1: true, // fullwidth pound sign | ||
0xffe5: true // fullwidth yen sign |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the changes to this table should be reverted. Now it’s possible for Latin text to break on either side of a dollar sign, for instance:
Ye Old $
99 Store
Furthermore, in ideographic scripts, characters like U+3016 left white lenticular bracket (〖) visually combine a space and a punctuation mark, so it can only break on one side (in this case, on the left but not the right).
As things stand in this PR, the CJK punctuation in this table has no effect on ideographic breaking, since breaking can occur on either side of any character. Indeed, we will eventually want to avoid orphaning CJK punctuation, but I don’t think this PR accomplishes that yet.
@1ec5 I think I've addressed all your feedback and my todos. Anything else you'd like to see before this 🚢s? @jfirebaugh @mourner @ansis Would you like to 👀 this PR? |
@@ -56,7 +56,7 @@ | |||
"highlight.js": "9.3.0", | |||
"jsdom": "^9.4.2", | |||
"lodash.template": "^4.4.0", | |||
"mapbox-gl-test-suite": "mapbox/mapbox-gl-test-suite#28c76c64e8cfcee8764c6c0f6d4fcc2d15a8d1e1", | |||
"mapbox-gl-test-suite": "mapbox/mapbox-gl-test-suite#77b281c9e6225471505f7daefa76806a8fbf22e2", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need a rebase and merge of the mapbox-gl-test-suite branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅ I'll do that with the "squash & merge" button on mapbox/mapbox-gl-test-suite#153 once this PR gets the green light.
|
How does this handle cases where the text contains both ideographic glyphs and non-ideographic ones? |
Yes, and that goes for #3438 as well. That’s why I originally proposed a negation in #3402 (comment) that would only affect labels that consisted entirely of ideographic characters. |
].join('|')})+$`); | ||
|
||
module.exports.allowsIdeographicBreaking = function(input) { | ||
return input.search(ideographicBreakingRegExp) !== -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that a match for ideographicBreakingRegExp
has ^
and $
anchors, spanning the entire input string, this should be a call to match()
. But I wonder if it’s even necessary to match the entire string. Searching for [^一-鿌…]
instead would be more performant for strings that contain no CJK. The tricky part is the surrogate pairs, but it should definitely be possible to negate that part of the regular expression too.
Concretely, the problem is bilingual labels or those that include romanizations, like this POI, named
On the other hand, there are legitimate reasons for a CJK label to contain non-CJK characters, like (hypothetically) a subway station named What this PR calls “ideographic breaking” is a combination of two features: breaking on any character, and balancing lines when breaking. The two naturally go hand in hand for purely CJK labels, but bilingual labels force us to distinguish between them. If we’re conservative and err on the side of applying word breaking throughout, then the worst that could happen is a label that collides other labels out. If we’re too aggressive and err on the side of applying CJK breaking throughout, then we risk breaking in the middle of non-CJK words, which looks amateurish. Ideally, in the presence of non-CJK characters, we’d continue to break on any CJK character but go back to word breaking for non-CJK words. @lucaswoj has pushed an implementation that seems to do this quite well. |
Is this due to flaky benchmark, or a real regression? I wouldn't expect a perf regression on an latin-character area which we use in the bench. We can still merge but should follow-up if it's a real regression. |
We might expect some slowdown in Latin areas because we need to check each label for ideographic characters. In the case of Latin-only labels, e1eee1b improves this from a Here are a couple of benchmark runs. We sorely need #3237.
|
@1ec5 are you ready to bless this with a ✅? |
@lucaswoj The bench results look great now! O(number of labels) is very fast. If we want to improve the check further, maybe we could bail out early on latin characters (e.g. |
// "𠀀" to "" | ||
if (char === 0xD840 && nextChar >= 0xDC00) return true; | ||
if (char >= 0xD841 && char <= 0xD872) return true; | ||
if (char === 0xD873 && nextChar <= 0xDEAF) return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've realized that there are a few bugs in our handling of surrogate pairs. Because we don't actually have the ability to render surrogate pairs as glyphs, I can't make a test case. I'm thinking we should remove this range altogether.
A C++ port is ready in mapbox/mapbox-gl-native#6828. In that PR, I took another look at the exact character ranges we treat as ideographic. I decided to add some additional Unicode code blocks from this Wikipedia table – just the BMP blocks, since we don’t support the SIP for glyphs yet (mapbox/DEPRECATED-mapbox-gl#29). Additionally, I expanded each range to cover the entire block, not just the currently assigned code points, to future-proof the code a bit. (Future versions of Unicode may assign more characters within these blocks and also add new blocks. However, new CJK blocks would likely fall within the SIP.) Feel free to synchronize this PR with that one. |
Launch Checklist
Splitting out from #3402 for clearer tracking of two separate issues.
This PR adds support for:
Requirements
Specifications
Launch Checklist