Wrong display of cyrillic symbols in UTF-8 file #19743
Comments
The file displays fine in raw view for me, so this is definitely a bug. It also triggers the "hidden unicode" warning incorrectly. Maybe something for @zeripath to check out.
This seems to be a bug in the code formatter. I can also reproduce it on codeberg.org: https://codeberg.org/test/test/src/commit/d94df248a7937afc16bb6d3c98cf17a8b7862ff0/build.gradle.kts but when I delete everything except the function with the Cyrillic characters, they suddenly look correct: https://codeberg.org/test/test/src/branch/main/build.gradle.kts
Issue #14434 had an idea about the chardet buffer size (1024 bytes), so I made a branch on the demo site where I placed the Cyrillic comment at the beginning of the file:
So I think @lunny's idea about chardet looks correct.
This comment was marked as off-topic.
Hi, I'm just looking at this. This looks like a UTF-8 double-encoding problem. The problem is not the escapecontrolreader: there are test cases to ensure that it does what it should on UTF-8 content. The problem will be earlier than that.
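To illustrate what "double encoding" means here, the following is a minimal sketch, not Gitea's code. The function name `latin1ToUTF8` and the sample string are made up for this example. It treats bytes that are already UTF-8 as if they were ISO-8859-1 and encodes them to UTF-8 a second time, which is roughly how the garbled Cyrillic text in the file view would arise.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// latin1ToUTF8 reinterprets every byte of b as an ISO-8859-1 code point and
// re-encodes it as UTF-8. Applied to bytes that already are UTF-8, it yields
// the "double encoded" text discussed above.
// (Hypothetical helper, not part of Gitea.)
func latin1ToUTF8(b []byte) string {
	out := make([]rune, 0, len(b))
	for _, c := range b {
		out = append(out, rune(c)) // ISO-8859-1 maps byte 0xNN to U+00NN
	}
	return string(out)
}

func main() {
	original := "Привет" // hypothetical Cyrillic comment text
	garbled := latin1ToUTF8([]byte(original))

	fmt.Printf("%q -> %q\n", original, garbled)
	// Each Cyrillic letter is two bytes in UTF-8, so the garbled string has
	// twice as many characters as the original.
	fmt.Println(utf8.RuneCountInString(original), utf8.RuneCountInString(garbled))
}
```

The raw bytes in the repository are untouched in this scenario; only the decoding step in the viewer is wrong, which would be consistent with the raw view and the editor showing the text correctly.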
The issue is that the file is being detected as ISO-8859-1. In fact if you check the debug logs you will see:
I bet the detection fails because the 2048th byte falls within a UTF-8 character, and therefore a slight change in the file would cause correct rendering. Is this a somewhat carefully calculated failing example? That kind of information would have been helpful to provide, because it would have helped us understand immediately where the problem was.
This example is the 2048 problem.
Then there is a weighted algorithm to decide (guess) which encoding should be used. Usually there should be no problem. But with this sample, the top confidence is not UTF-8.
Here is a test case designed to trigger the bug:
It will always fail:
…esenting utf-8 Our character detection algorithm can potentially incorrectly detect utf-8 as iso-8859-x if there is a truncated character at the end of the partially read file. This PR changes the detection algorithm to account for truncated utf8 characters at the end of the buffer. Fix go-gitea#19743 Signed-off-by: Andrew Thornton <art27@cantab.net>
…esenting utf-8 (#19773) Our character detection algorithm can potentially incorrectly detect utf-8 as iso-8859-x if there is a truncated character at the end of the partially read file. This PR changes the detection algorithm to account for truncated utf8 characters at the end of the buffer. Fix #19743 Signed-off-by: Andrew Thornton <art27@cantab.net>
…esenting utf-8 (#19773) (#19774) Backport #19773 Our character detection algorithm can potentially incorrectly detect utf-8 as iso-8859-x if there is a truncated character at the end of the partially read file. This PR changes the detection algorithm to account for truncated utf8 characters at the end of the buffer. Fix #19743 Signed-off-by: Andrew Thornton <art27@cantab.net>
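The idea described in the commit message can be sketched roughly as follows. This is not the code from the PR. The names `trimTruncatedTail` and `looksLikeUTF8` are hypothetical, and the approach shown (drop an incomplete trailing sequence, then validate the rest) is one plausible reading of "truncated characters at the end of the buffer".

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// trimTruncatedTail drops an incomplete multi-byte sequence from the end of b.
// It returns b unchanged when the final character is complete or when the
// tail is invalid for some other reason. (Hypothetical helper.)
func trimTruncatedTail(b []byte) []byte {
	// A UTF-8 character is at most 4 bytes long, so the lead byte of an
	// incomplete one must be among the last 3 bytes of the buffer.
	for i := len(b) - 1; i >= 0 && i >= len(b)-3; i-- {
		if utf8.RuneStart(b[i]) {
			if !utf8.FullRune(b[i:]) {
				return b[:i] // lead byte without all of its continuation bytes
			}
			break
		}
	}
	return b
}

// looksLikeUTF8 reports whether b is valid UTF-8 once a possibly truncated
// final character has been ignored. (Hypothetical helper.)
func looksLikeUTF8(b []byte) bool {
	return utf8.Valid(trimTruncatedTail(b))
}

func main() {
	cut := []byte("при")[:5]      // third letter loses its second byte
	legacy := []byte("caf\xe9s") // ISO-8859-1 bytes, not valid UTF-8

	fmt.Println(utf8.Valid(cut), looksLikeUTF8(cut))       // false true
	fmt.Println(utf8.Valid(legacy), looksLikeUTF8(legacy)) // false false
}
```

One trade-off of this sketch: a legacy-encoded buffer whose only high byte is its very last byte would also pass the check, because that byte is indistinguishable from a truncated UTF-8 lead byte.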
Description
I have a UTF-8 encoded file with a Cyrillic comment.
When I open this file in the Gitea web view, the Cyrillic symbols are displayed incorrectly.
But in the Gitea file editor, the diff page, and elsewhere, these symbols are displayed correctly.
Gitea Version
1.16.7
Can you reproduce the bug on the Gitea demo site?
Yes
Log Gist
No response
Screenshots
View file:
Edit file:
Diff changes in file:
Also reproduced on the demo site:
https://try.gitea.io/sIspravnikov/test/src/branch/main/build.gradle.kts
Git Version
2.30.3
Operating System
Ubuntu 20.04
How are you running Gitea?
official docker-container
Database
PostgreSQL