Wrong display of cyrillic symbols in UTF-8 file #19743

sIspravnikov · 2022-05-18T10:51:46Z

Description

I have a file in UTF-8 encoding with cyrillic comment.
When i open this file in gitea web view, display of cyrillic symbols seems broken.
But in gitea file editor, diff page and others, this symbols displays correct.

Gitea Version

1.16.7

Can you reproduce the bug on the Gitea demo site?

Yes

Log Gist

No response

Screenshots

View file:

Edit file:

Diff changes in file:

reproduced it also on demo site
https://try.gitea.io/sIspravnikov/test/src/branch/main/build.gradle.kts

Git Version

2.30.3

Operating System

ubuntu 20.04

How are you running Gitea?

official docker-container

Database

PostgreSQL

silverwind · 2022-05-19T16:27:16Z

File displays fine in raw view for me, so this is definitely a bug. Also it triggers the "hidden unicode" incorrectly. Maybe something for @zeripath to check out.

ashimokawa · 2022-05-20T07:23:21Z

This seems to be a bug of the code formatter.

Can reproduce on codeberg.org also:

https://codeberg.org/test/test/src/commit/d94df248a7937afc16bb6d3c98cf17a8b7862ff0/build.gradle.kts

but when I delete everything except the function with the cyrillic characters they suddenly look correct.

https://codeberg.org/test/test/src/branch/main/build.gradle.kts

sIspravnikov · 2022-05-20T12:42:15Z

In issue #14434 was an idea, about chardet buffer size 1024 bytes, so i made a branch on demosite, where i placed cyrillic comment in beginning of the file:
https://try.gitea.io/sIspravnikov/test/commit/0682cf1ba2862f1a39e734ecb6fa6a7bcdaa57fb
and view page now display everything correct:
https://try.gitea.io/sIspravnikov/test/src/branch/test/build.gradle.kts
And @ashimokawa have done simillar, when everything, except the function with the cyrillic characters, was deleted, cyrillic symbols now catched in first 1024 bytes buffer.

So, i think, @lunny idea about chardet looks correct.

zeripath · 2022-05-21T08:26:19Z

Hi I'm just looking at this.

This looks like a double encoding utf8 problem.

The problem is not the escapecontrolreader - there are test cases to ensure that it's doing what it should do on utf8 code.

The problem will be earlier that that.

zeripath · 2022-05-21T08:36:46Z

The issue is that the file is being detected as ISO-8859-1.

In fact if you check the debug logs you will see:

2022/05/21 09:36:27 ...s/charset/charset.go:190:DetectEncoding() [D] [6288a48b] Detected encoding: ISO-8859-1

zeripath · 2022-05-21T08:54:16Z

I bet the reason why the detection is failing is that the 2048th byte is within a utf8 character - and... therefore a slight change in the file would cause the correct rendering.

Is this is a somewhat carefully calculated failing example? That kind of information would have been helpful information to provide - because it would have helped us to immediately understand where the problem was.

wxiaoguang · 2022-05-21T11:01:33Z

This example is the 2048 problem.

func renderFile(ctx *context.Context, entry *git.TreeEntry, treeLink, rawLink string) {
...
	buf := make([]byte, 1024)
	n, _ := util.ReadAtMost(dataRc, buf)
	buf = buf[:n]
...
		rd := charset.ToUTF8WithFallbackReader(io.MultiReader(bytes.NewReader(buf), dataRc))

The ToUTF8WithFallbackReader only reads first 2048 to detect, which may break UTF-8 runes.

Then there is a weighted algorithm to decide (guess) which encoding should be used.

Usually there should be no problem. But, with this sample:

The top confidence is not UTF-8 here.

The confidence of the result seems related to some statistics-based models (just a guess, haven't look into it), so a strings.Repeat("a" ...) won't trigger the bug, the repeated "a"'s top confidence result is still UTF-8.

wxiaoguang · 2022-05-21T11:07:31Z

Here is a designed test case how to trigger the bug:

func TestDetectEncodingEx(t *testing.T) {
	for testLen := 0; testLen < 2048; testLen++ {
		pattern := "    test { () }\n"
		input := ""
		for len(input) < testLen {
			input += pattern
		}
		input = input[:testLen]
		input += "// Выключаем"
		rd := ToUTF8WithFallbackReader(bytes.NewReader([]byte(input)))
		r, _ := io.ReadAll(rd)
		assert.EqualValuesf(t, input, string(r), "testing string len=%d", testLen)
	}
}

It will always fail:

        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -127,2 +127,2 @@
        	            	     test { () }
        	            	-    te// Выключаем
        	            	+    te// Ð’Ñ‹ÐºÐ»ÑŽÑ‡Ð°ÐµÐ¼
        	Test:       	TestDetectEncodingEx
        	Messages:   	testing string len=2038

…esenting utf-8 Our character detection algorithm can potentially incorrectly detect utf-8 as iso-8859-x if there is a truncated character at the end of the partially read file. This PR changes the detection algorithm to truncated utf8 characters at the end of the buffer. Fix go-gitea#19743 Signed-off-by: Andrew Thornton <art27@cantab.net>

…esenting utf-8 (#19773) Our character detection algorithm can potentially incorrectly detect utf-8 as iso-8859-x if there is a truncated character at the end of the partially read file. This PR changes the detection algorithm to truncated utf8 characters at the end of the buffer. Fix #19743 Signed-off-by: Andrew Thornton <art27@cantab.net>

…esenting utf-8 (go-gitea#19773) Backport go-gitea#19773 Our character detection algorithm can potentially incorrectly detect utf-8 as iso-8859-x if there is a truncated character at the end of the partially read file. This PR changes the detection algorithm to truncated utf8 characters at the end of the buffer. Fix go-gitea#19743 Signed-off-by: Andrew Thornton <art27@cantab.net>

…esenting utf-8 (#19773) (#19774) Backport #19773 Our character detection algorithm can potentially incorrectly detect utf-8 as iso-8859-x if there is a truncated character at the end of the partially read file. This PR changes the detection algorithm to truncated utf8 characters at the end of the buffer. Fix #19743 Signed-off-by: Andrew Thornton <art27@cantab.net>

…esenting utf-8 (go-gitea#19773) Our character detection algorithm can potentially incorrectly detect utf-8 as iso-8859-x if there is a truncated character at the end of the partially read file. This PR changes the detection algorithm to truncated utf8 characters at the end of the buffer. Fix go-gitea#19743 Signed-off-by: Andrew Thornton <art27@cantab.net>

sIspravnikov added the type/bug label May 18, 2022

This comment was marked as off-topic.

Sign in to view

zeripath mentioned this issue May 21, 2022

Detect truncated utf-8 characters at the end of content as still representing utf-8 #19773

Merged

zeripath closed this as completed in #19773 May 21, 2022

zeripath mentioned this issue May 21, 2022

Detect truncated utf-8 characters at the end of content as still representing utf-8 (#19773) #19774

Merged

wxiaoguang mentioned this issue Sep 24, 2022

file encoding detection bug #14434

Closed

1 task

go-gitea locked and limited conversation to collaborators May 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wrong display of cyrillic symbols in UTF-8 file #19743

Wrong display of cyrillic symbols in UTF-8 file #19743

sIspravnikov commented May 18, 2022 •

edited

Loading

silverwind commented May 19, 2022

ashimokawa commented May 20, 2022

sIspravnikov commented May 20, 2022

This comment was marked as off-topic.

This comment was marked as off-topic.

zeripath commented May 21, 2022

zeripath commented May 21, 2022

zeripath commented May 21, 2022

wxiaoguang commented May 21, 2022 •

edited

Loading

wxiaoguang commented May 21, 2022

Wrong display of cyrillic symbols in UTF-8 file #19743

Wrong display of cyrillic symbols in UTF-8 file #19743

Comments

sIspravnikov commented May 18, 2022 • edited Loading

Description

Gitea Version

Can you reproduce the bug on the Gitea demo site?

Log Gist

Screenshots

Git Version

Operating System

How are you running Gitea?

Database

silverwind commented May 19, 2022

ashimokawa commented May 20, 2022

sIspravnikov commented May 20, 2022

This comment was marked as off-topic.

This comment was marked as off-topic.

zeripath commented May 21, 2022

zeripath commented May 21, 2022

zeripath commented May 21, 2022

wxiaoguang commented May 21, 2022 • edited Loading

wxiaoguang commented May 21, 2022

sIspravnikov commented May 18, 2022 •

edited

Loading

wxiaoguang commented May 21, 2022 •

edited

Loading