[3.x] Fix Chinese&Japanese erroneous newline #45290

Closed
wants to merge 8 commits into from

erbing315

Problem

Chinese and Japanese text wraps in the wrong place. For example:
parse_bbcode("向最坏处着想,向最好处努力")
[image]
Here the line breaks correctly. But when I add a BBCode tag:
parse_bbcode("向[b]最坏处着想,向最好处努力")
[image]
The line now breaks in the wrong place: at the BBCode tag instead of at the correct position.
parse_bbcode("向[b]最坏[/b]处着想,向最好处努力")
[image]

Reason

Chinese and Japanese do not use spaces or any other character to separate words,
so a whole Chinese or Japanese sentence is treated as a single word.
A BBCode tag splits the text into separate words, and RichTextLabel tries to keep each word on one line,
so the line ends up breaking at the tag, which is the wrong place.

Solution

For Chinese and Japanese characters (code points from 0x3040 up to, but not including, 0xfaff), add them to the tag stack one character at a time.

[image]

My code causes other languages and punctuation to "stick" to the preceding Chinese or Japanese character.
- For punctuation, this conforms to standard Chinese typography.
- For other languages, it can be worked around by adding a space between the other-language text and the Chinese or Japanese text.
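
The splitting and the "sticking" effect described above can be illustrated with a minimal standalone sketch. This is not the code from the PR: `in_pr_range` and `split_segments` are hypothetical names, and the sketch assumes the text is available as full code points (`char32_t`) rather than Godot's `String`.

```cpp
#include <string>
#include <vector>

// Range test mirroring the condition used in this PR (0x3040 <= c < 0xfaff).
// Assumption: characters are full code points (char32_t).
static bool in_pr_range(char32_t c) {
	return c >= 0x3040 && c < 0xfaff;
}

// Hypothetical helper: every character inside the range starts a new segment;
// any other character (punctuation, Latin letters, spaces) is appended to the
// current segment, so it "sticks" to the preceding character.
std::vector<std::u32string> split_segments(const std::u32string &text) {
	std::vector<std::u32string> segments;
	for (char32_t c : text) {
		if (segments.empty() || in_pr_range(c)) {
			segments.push_back(std::u32string(1, c));
		} else {
			segments.back().push_back(c);
		}
	}
	return segments;
}
```

With this sketch, a fullwidth comma (U+FF0C) or Latin text following an ideograph ends up in the same segment as that ideograph, which matches the two cases listed above.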

@Calinou Calinou added this to the 3.2 milestone Jan 18, 2021
@akien-mga akien-mga requested review from bruvzg and a team January 18, 2021 17:15
//append item condition
int lipos = 0;
while (lipos < line.length()) {
if (line[lipos] >= 0x3040 && line[lipos] < 0xfaff) {
Member

This range includes multiple non-CJK blocks. It should probably be limited to 3400-4DBF, 4E00-9FFF, F900-FAFF, 20000-2A6DF, and 2F800-2FA1F (the last two won't work on Windows).
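
As a rough sketch of what such a narrower check could look like (the function name `is_cjk_ideograph` is hypothetical, and the sketch assumes code points are held in a 32-bit `char32_t`; with a 16-bit character type the last two ranges can never match):

```cpp
// Ranges taken from the review comment above.
// Assumption: c is a full Unicode code point.
inline bool is_cjk_ideograph(char32_t c) {
	return (c >= 0x3400 && c <= 0x4DBF) || // CJK Unified Ideographs Extension A
			(c >= 0x4E00 && c <= 0x9FFF) || // CJK Unified Ideographs
			(c >= 0xF900 && c <= 0xFAFF) || // CJK Compatibility Ideographs
			(c >= 0x20000 && c <= 0x2A6DF) || // CJK Unified Ideographs Extension B
			(c >= 0x2F800 && c <= 0x2FA1F); // CJK Compatibility Ideographs Supplement
}
```

Note that hiragana and katakana (3040-30FF) fall outside these ranges, so Japanese text would need them handled separately.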

Member

For reference, the master branch uses the ICU break iterator with the following rule set: line_normal_cj.txt, and a 4 MB dictionary: cjdict.txt.

If I understand correctly, this approach should work for pure ideographs, but not for mixed syllabary + ideographs (Okurigana). But since ICU-based breaking won't be backported to 3.2, it's probably better than nothing.

Also, I'm not sure it's good for performance to add a new ItemText for each word. It might be better to do this in _process_line instead.

@fire
Member

fire commented Jan 18, 2021

The existence of https://w3c.github.io/i18n-tests/results/line-breaks-jazh means we don't have to redo the work. Since the requirements work has been done, we should at least see if it's possible to do a better job on it.

@TokageItLab
Member

TokageItLab commented Jan 18, 2021

The problem with this is that the behavior of line breaks is different between strings without tags and with tags, which seems to be a bit of a hacky implementation.

Also, you have to consider that Japanese and Chinese text sometimes has English mixed in. I tried your PR here and got the following.
[screenshot, 2021-01-19]
People usually put half-width (1-byte) spaces around English words, but it depends on the person.
(And this one is broken even without the BBCode tag.)

I think there is a fundamental problem with line breaks in Godot, so I'll see if I can find a solution here as well.

@TokageItLab
Member

The cause of the line break problem is probably that the character data and tag data are in the same array. First of all, Godot doesn't support line breaking for Japanese or Chinese at all, and usually tries to write everything out on a single line. Any space or non-character data is treated as a break opportunity in RichTextLabel::_process_line(), at line 402.

while (c[end] != 0 && !(end && c[end - 1] == ' ' && c[end] != ' ')) {

	int cw = font->get_char_size(c[end], c[end + 1]).width;
	if (c[end] == '\t') {
		cw = tab_size * font->get_char_size(' ').width;
	}

	if (end > 0 && w + cw + begin > p_width) {
		break; //don't allow lines longer than assigned width
	}

	w += cw;
	fw += cw;

	end++;
}
CHECK_HEIGHT(fh);
ENSURE_WIDTH(w);

When I rewrote the condition as a test, Japanese and Chinese lines broke correctly, but English lines broke incorrectly instead.

[screenshot, 2021-01-19]

I recommend that you implement the correct line break definition here.
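
To illustrate the idea, here is a minimal standalone sketch of a break condition that keeps the space-based rule for English and adds a rule for CJK characters. It is not Godot code: `is_cjk`, `is_ascii_alnum`, `can_break_before` and `wrap` are hypothetical names, every character is assumed to be one unit wide, and punctuation and small-kana rules are deliberately ignored.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumption: kana and the main ideograph block count as "breakable" characters.
static bool is_cjk(char32_t c) {
	return (c >= 0x3040 && c <= 0x30FF) || // Hiragana and Katakana
			(c >= 0x4E00 && c <= 0x9FFF); // CJK Unified Ideographs
}

static bool is_ascii_alnum(char32_t c) {
	return (c >= U'0' && c <= U'9') || (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z');
}

// Hypothetical predicate: may a line break be placed just before s[i]?
static bool can_break_before(const std::u32string &s, std::size_t i) {
	if (i == 0 || i >= s.size()) {
		return false;
	}
	// Space-based rule, as in the quoted loop: a non-space following a space.
	if (s[i - 1] == U' ' && s[i] != U' ') {
		return true;
	}
	// Added rule: break before any CJK character, or between a CJK character
	// and a following ASCII letter or digit. Punctuation stays attached.
	return is_cjk(s[i]) || (is_cjk(s[i - 1]) && is_ascii_alnum(s[i]));
}

// Greedy wrap: each line ends at the last break opportunity that still fits.
std::vector<std::u32string> wrap(const std::u32string &s, std::size_t max_width) {
	std::vector<std::u32string> lines;
	if (max_width == 0) {
		lines.push_back(s);
		return lines;
	}
	std::size_t start = 0;
	while (s.size() - start > max_width) {
		std::size_t pos = 0; // 0 means "no break opportunity found"
		for (std::size_t i = start + 1; i <= start + max_width; i++) {
			if (can_break_before(s, i)) {
				pos = i;
			}
		}
		if (pos == 0) {
			pos = start + max_width; // don't allow lines longer than the width
		}
		lines.push_back(s.substr(start, pos - start));
		start = pos;
	}
	lines.push_back(s.substr(start));
	return lines;
}
```

Because the decision is made per character on the plain text, English words stay whole while CJK runs can wrap at any character.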

@TokageItLab
Member

@erbing315 Rather than splitting all the words and increasing the number of items, it would be better to change the determination of line breaks according to the character encoding of the chars: *c in _process_line(). Good luck.

Author

@erbing315 erbing315 left a comment

> @erbing315 Rather than splitting all the words and increasing the number of items, it would be better to change the determination of line breaks according to the character encoding of the chars: *c in _process_line(). Good luck.

Thank you, but I have resolved that problem, and it works in my IDE. However, the checks are not passing.

@akien-mga
Member

For reference, note that this issue should already be fixed in the master branch by @bruvzg's work on Complex Text Layouts, and especially by the integration of ICU dictionary data to do proper word and line wrapping.

Any intermediate solution for 3.2 should thus try not to be too disruptive as the main fix for this issue has already happened for 4.0 and later.

@erbing315
Author

@TokageItLab Does Japanese avoid line breaks before small kana? For example:
楽しい時間はあ
っという間に

アマチ
ュア
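
If breaks like these need to be avoided, a check along the following lines could veto them. This is only a sketch: `is_small_kana` is a hypothetical name, the list covers common small hiragana and katakana and is not exhaustive, and code points are assumed to be `char32_t`.

```cpp
// Hypothetical helper: returns true for small kana, which should not be
// placed at the start of a line. Illustrative list, not exhaustive.
inline bool is_small_kana(char32_t c) {
	switch (c) {
		case 0x3041: case 0x3043: case 0x3045: case 0x3047: case 0x3049: // small hiragana a, i, u, e, o
		case 0x3063: // small hiragana tsu
		case 0x3083: case 0x3085: case 0x3087: // small hiragana ya, yu, yo
		case 0x308E: // small hiragana wa
		case 0x30A1: case 0x30A3: case 0x30A5: case 0x30A7: case 0x30A9: // small katakana a, i, u, e, o
		case 0x30C3: // small katakana tsu
		case 0x30E3: case 0x30E5: case 0x30E7: // small katakana ya, yu, yo
		case 0x30EE: // small katakana wa
			return true;
		default:
			return false;
	}
}
```

In the two examples above, the second line starts with U+3063 and U+30E5 respectively, so a line breaker that refuses to break before any character for which this check returns true would keep those characters on the previous line.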

@TokageItLab
Member

@erbing315 Yes, it's true that Japanese doesn't break lines like that. But the line break rules are expected to be handled in 4.0, as @bruvzg and @akien-mga said. I think the question here is whether the line is broken correctly when text is enclosed in tags.

For example,

[b]最高[/b]っぽい

Then,

最高
っぽい

It is a mistake to treat text enclosed in tags as a separate word like this. This may already be fixed in #43691.
@bruvzg Excuse me, is that right?

@erbing315 erbing315 changed the title fix Chinese&Japanese erroneous newline [3.2]fix Chinese&Japanese erroneous newline Jan 23, 2021
@erbing315
Author

@TokageItLab No, BBCode tags still break words in master:
can bbc[b]ode bre[/b]ak word?
[image]

I will try to fix it, but it's difficult because it depends on the tag stack's data structure.
Maybe I will fix it with a plugin.

@KoBeWi
Member

KoBeWi commented Jan 23, 2021

Line breaking by tags is another issue (#41963) that should be solved in a new PR.

Base automatically changed from 3.2 to 3.x March 16, 2021 11:11
@akien-mga akien-mga modified the milestones: 3.2, 3.3 Mar 17, 2021
@akien-mga akien-mga modified the milestones: 3.3, 3.4 Mar 26, 2021
@akien-mga akien-mga changed the title [3.2]fix Chinese&Japanese erroneous newline [3.x] Fix Chinese&Japanese erroneous newline Mar 26, 2021
@akien-mga
Member

Superseded by #49280. Thanks for the contribution anyway!
