-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Try to fix unicode layout issue #3578
Try to fix unicode layout issue #3578
Conversation
I think we need to do something at cell-allocation level, when we write a string into the buffer, we need to :
And, when performing text layout, group adhensive clusters together and shape them together, then perform cluster-to-column alignment according to the cluster category. Category flags: enum ClusterCategory {
ScriptCategoryMask = 0xFF,
// Non-adhensive scripts
ScriptUnknown = 0x00,
ScriptIdeographs = 0x01,
ScriptEmoji = 0x02,
ScriptGeometric = 0x03,
ScriptNexusGeometric = 0x04, // Box drawing, Block Element, Powerline, etc.
// Add more items when necessary
Adhensive = 0x80,
// Adhensive scripts
ScriptWestern = 0x80,
// Add more items when necessary
// Alignment
AlignCenter = 0,
AlignScaleDown = 0x200,
AlignLeft = 0x400,
AlignRight = 0x600,
AlignmentMask = 0x600,
// Width
DoubleWidth = 0x400
} |
@reli-msft Yeah I think you are right. Without cluster category, some code just doesn't really make sense. For example, in order to handle zero-width VS16 the text cell iterator is moving like this: it += columnCount > 0 ? columnCount : 2; If To thoroughly handle all kinds of Unicode thing, cluster category classification is a must. |
@@ -75,7 +75,7 @@ namespace | |||
UnicodeRange{ 0x2d8, 0x2db, CodepointWidth::Ambiguous }, | |||
UnicodeRange{ 0x2dd, 0x2dd, CodepointWidth::Ambiguous }, | |||
UnicodeRange{ 0x2df, 0x2df, CodepointWidth::Ambiguous }, | |||
UnicodeRange{ 0x300, 0x36f, CodepointWidth::Ambiguous }, | |||
UnicodeRange{ 0x300, 0x36f, CodepointWidth::Combining }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is definitely wrong since it's generated. Just showing the basic idea.
@skyline75489 I plan to have a meeting with Console people to find out a proper solution for this. |
@reli-msft You made my day, bro. Go get them. |
Well, now, hold on. This is a good sentiment and all that, but still need to figure out how to deal with applications that read back the contents of the screen adn ahev their understanding of their codepoint to cell mapping violated. This isn't just a problem somebody can come in and solve overnight: Terminal still needs to act compatibly for legacy win32 console applications. There's already #1472 tracking lighting up ZWJ/ZWNJ and any rendering and buffer concerns where Let it not be understated that we want to do the right thing here. It's just that we have to do the right thing in the context of 35 years worth of application support code and the fact that conhost (who also consumes this codepoint width detector, text buffer and text renderer) is the API server for all text-mode applications on every SKU of Windows everywhere. |
To wit: WT can break UI paradigm compatibility and some level of application compatibility, but the code it shares with conhost must uphold the same compatibility guarantees as conhost does. |
@DHowett-MSFT Easy. If I keep doing open source thing, the company I'm working at will soon be bankrupted and then I can work for MSFT 😎 . Everyone is happy. Glad to know there's #1472. Gonna subscribe it right now. |
LOL. 😁 |
<!-- Enter a brief description/summary of your PR here. What does it fix/what does it change/how was it tested (even manually, if necessary)? --> ## Summary of the Pull Request This change tries to fix column size calculation when shaping return glyphs that represents multiple characters (e.g. ligature). <!-- Other than the issue solved, is this relevant to any other issues/existing PRs? --> ## References This should fix #696. <!-- Please review the items on the PR checklist before submitting--> ## PR Checklist * [ ] Closes #xxx * [X] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/Terminal) and sign the CLA * [ ] Tests added/passed * [ ] Requires documentation to be updated * [ ] I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx <!-- Provide a more detailed description of the PR, other things fixed or any additional comments/features here --> ## Detailed Description of the Pull Request / Additional comments Currently, it seems like CustomTextLayout::_CorrectGlyphRun generally assumes that glyphs and characters have a 1:1 mapping relationship - which holds true for most trivial scenarios with basic western scripts, and also many, but unfortunately not all, monospace "programming" fonts with programming ligatures. This change makes terminal correctly processes glyphs that represents multiple characters, by properly accumulating the column counts of all these characters together (which I believe is more close to what this code originally intended to do). There are still many issues existing in both CustomTextLayout as well as the TextBuffer, and the correct solution to them will likely demand large-scale changes, at least at the scale of #3578. I wish small changes like this can serve as a stop gap solution while we take our time to work on the long-term right thing. <!-- Describe how you validated the behavior. Add automated tests wherever possible, but list manual validation steps taken as well --> ## Validation Steps Performed Builds and runs. Manual testing confirmed that it solves #696 with both LigConsalata and Fixedsys Excelsior.
<!-- Enter a brief description/summary of your PR here. What does it fix/what does it change/how was it tested (even manually, if necessary)? --> ## Summary of the Pull Request This change tries to fix column size calculation when shaping return glyphs that represents multiple characters (e.g. ligature). <!-- Other than the issue solved, is this relevant to any other issues/existing PRs? --> ## References This should fix #696. <!-- Please review the items on the PR checklist before submitting--> ## PR Checklist * [ ] Closes #xxx * [X] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/Terminal) and sign the CLA * [ ] Tests added/passed * [ ] Requires documentation to be updated * [ ] I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx <!-- Provide a more detailed description of the PR, other things fixed or any additional comments/features here --> ## Detailed Description of the Pull Request / Additional comments Currently, it seems like CustomTextLayout::_CorrectGlyphRun generally assumes that glyphs and characters have a 1:1 mapping relationship - which holds true for most trivial scenarios with basic western scripts, and also many, but unfortunately not all, monospace "programming" fonts with programming ligatures. This change makes terminal correctly processes glyphs that represents multiple characters, by properly accumulating the column counts of all these characters together (which I believe is more close to what this code originally intended to do). There are still many issues existing in both CustomTextLayout as well as the TextBuffer, and the correct solution to them will likely demand large-scale changes, at least at the scale of #3578. I wish small changes like this can serve as a stop gap solution while we take our time to work on the long-term right thing. <!-- Describe how you validated the behavior. Add automated tests wherever possible, but list manual validation steps taken as well --> ## Validation Steps Performed Builds and runs. Manual testing confirmed that it solves #696 with both LigConsalata and Fixedsys Excelsior. (cherry picked from commit 027f122)
This is a subset of #3578 which I think is harmless and the first step towards making things right. References #3546 #3578 ## Detailed Description of the Pull Request / Additional comments For more robust Unicode support, `CodepointWidthDetector` should provide concrete width information rather than a simple boolean of `IsWide`. Currently only `IsWide` is widely used and optimized using quick lookup table and fallback cache. This PR moves those optimization into `GetWidth`. ## Validation Steps Performed API remains unchanged. Things are not broken.
This is a subset of #3578 which I think is harmless and the first step towards making things right. References #3546 #3578 ## Detailed Description of the Pull Request / Additional comments For more robust Unicode support, `CodepointWidthDetector` should provide concrete width information rather than a simple boolean of `IsWide`. Currently only `IsWide` is widely used and optimized using quick lookup table and fallback cache. This PR moves those optimization into `GetWidth`. ## Validation Steps Performed API remains unchanged. Things are not broken.
Summary of the Pull Request
This could be the first of a series PRs intended to fix unicode layout issue.
References
#3546
PR Checklist
Detailed Description of the Pull Request / Additional comments
Contrary to what @reli-msft emphasize in #3546, I found that zero length Unicode codepoint handling is the key here.
I searched a little and found https://gitlab.freedesktop.org/terminal-wg/specifications/issues/9 which refers https://github.com/ridiculousfish/widecharwidth. Then I was like, oh so there's bunch of unicode codepoint that's not included in WT currently.
Handling zero-width Unicode characters makes a lot of things weird. But here's what I got so far:
Would love to hear your guys' opinion on this.
Validation Steps Performed