charset: incorrect encoding for latin1 character set #18955

Closed
bb7133 opened this issue Aug 3, 2020 · 4 comments
Labels
component/charset severity/major sig/sql-infra SIG: SQL Infra type/bug The issue is confirmed as a bug. type/compatibility wontfix This issue will not be fixed.

Comments

@bb7133
Member

bb7133 commented Aug 3, 2020

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

tidb> create table t(a varchar(10));
Query OK, 0 rows affected (0.16 sec)

case 1:

tidb> insert into t values ('¥');
Query OK, 1 row affected (0.04 sec)
tidb> select hex(a) from t;

case 2:

tidb> insert into t values ('中');

2. What did you expect to see? (Required)

case 1:

tidb> select hex(a) from t;
+--------+
| hex(a) |
+--------+
| A5     |
+--------+
1 row in set (0.00 sec)

case 2:

mysql> insert into t values ('中');
ERROR 1366 (HY000): Incorrect string value: '\xE4\xB8\xAD' for column 'a' at row 1

3. What did you see instead (Required)

case 1:

tidb> select hex(a) from t;
+--------+
| hex(a) |
+--------+
| C2A5   |
+--------+
1 row in set (0.00 sec)

The encoding of ¥ in latin1 should be A5.

case 2:

tidb> insert into t values ('中');
Query OK, 1 row affected (0.01 sec)

4. Affected version (Required)

All versions of TiDB

5. Root Cause Analysis

In TiDB, we treat latin1 as a subset of utf8/utf8mb4 and encode its characters as UTF-8, just as we do for ascii.

But it is NOT: latin1 is a single-byte character set:

  1. It supports at most 256 characters.
  2. For characters with code points in 128-255, the encoding differs from UTF-8.

More details can be found here: https://en.wikipedia.org/wiki/ISO/IEC_8859-1
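
To make the byte-level difference concrete, here is a minimal sketch in Python (not TiDB code). It assumes Python's built-in "latin-1" codec (ISO/IEC 8859-1) as a stand-in for the latin1 character set, and `hex_of` is a helper name made up for this illustration.

```python
# Compare how the two characters from the report encode under each charset.
# Assumption: Python's "latin-1" codec (ISO/IEC 8859-1) stands in for latin1.

def hex_of(text: str, encoding: str) -> str:
    """Return the uppercase hex of `text` encoded with `encoding`."""
    return text.encode(encoding).hex().upper()

YEN = "\u00a5"   # the yen sign used in case 1
CJK = "\u4e2d"   # the Chinese character used in case 2

yen_latin1 = hex_of(YEN, "latin-1")  # single-byte encoding
yen_utf8 = hex_of(YEN, "utf-8")      # multi-byte encoding
print("latin1:", yen_latin1, "utf8:", yen_utf8)

# The CJK character has no single-byte representation, so a strict
# latin1 encoder has to reject it instead of storing it.
try:
    CJK.encode("latin-1")
    cjk_rejected = False
except UnicodeEncodeError:
    cjk_rejected = True
print("CJK rejected by latin1:", cjk_rejected)
```

The two hex values correspond to the expected and the observed output of case 1, and the rejected encode corresponds to the error MySQL reports in case 2.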

@bb7133
Member Author

bb7133 commented Aug 6, 2020

I think we have 3 options for this issue:

  1. Keep latin1 in TiDB as it is. If it has worked fine for most scenarios so far, it may continue to work well in the future.

  2. Add a configuration like 'treat-old-version-latin1-as-utf8mb4', set its default to true for legacy clusters and to false for new clusters. As suggested by @nullnotnil, when it is false, we treat latin1 as ascii and report to the users that "To store latin1 characters that are not in ASCII, it is recommended to use the utf8 character set instead". This partially fixes the issue at little cost.

  3. Add a configuration like 'treat-old-version-latin1-as-utf8mb4'. This is basically the same as option 2, but when it is false, we encode latin1 characters with the standard ('correct') encoding. This solves the latin1 encoding issue completely, but a lot of work may be needed to add a new encoding to TiDB.
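
The three options can be compared with a rough Python sketch. This is only an illustration of the intended behaviors, not TiDB code: `Latin1Mode` and `store_latin1` are hypothetical names, and using the "latin-1" codec for option 3 is an assumption about what the standard encoding would be.

```python
from enum import Enum


class Latin1Mode(Enum):
    """Hypothetical switch standing in for the three options above."""
    LEGACY_UTF8 = 1   # option 1: keep storing latin1 columns as UTF-8 bytes
    ASCII_ONLY = 2    # option 2: accept ASCII only and reject everything else
    TRUE_LATIN1 = 3   # option 3: store real single-byte latin1


def store_latin1(value: str, mode: Latin1Mode) -> bytes:
    """Return the bytes that would be stored for a latin1 column."""
    if mode is Latin1Mode.LEGACY_UTF8:
        return value.encode("utf-8")
    if mode is Latin1Mode.ASCII_ONLY:
        try:
            return value.encode("ascii")
        except UnicodeEncodeError:
            raise ValueError(
                "To store latin1 characters that are not in ASCII, it is "
                "recommended to use the utf8 character set instead"
            ) from None
    # Option 3: raises UnicodeEncodeError for characters outside latin1.
    return value.encode("latin-1")


for mode in Latin1Mode:
    try:
        print(mode.name, store_latin1("\u00a5", mode).hex().upper())
    except ValueError as err:
        print(mode.name, "rejected:", err)
```

Only option 3 produces the single-byte value from the bug report. Option 2 avoids storing wrong bytes by refusing the value, which is why it is the cheaper partial fix.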

@bb7133 bb7133 changed the title from "charset: incorrect encoding for latin character set" to "charset: incorrect encoding for latin1 character set" on Aug 6, 2020
@jebter jebter added the sig/sql-infra SIG: SQL Infra label Nov 16, 2020
@bb7133
Member Author

bb7133 commented Nov 16, 2020

We decided to go with option 1 and close this issue for now, since no serious problem has been reported yet.

A warning about it has been added to the official documentation: https://docs.pingcap.com/tidb/stable/character-set-and-collation#character-sets-and-collations-supported-by-tidb

@bb7133 bb7133 closed this as completed Nov 16, 2020
@ti-srebot
Contributor

ti-srebot commented Nov 16, 2020

Please edit this comment or add a new comment to complete the following information

Bug

Note: Make Sure that 'component', and 'severity' labels are added
Example for how to fill out the template: #20100

1. Root Cause Analysis (RCA) (optional)

As stated in the documentation: https://docs.pingcap.com/tidb/stable/character-set-and-collation#character-sets-and-collations-supported-by-tidb

4. Workaround (optional)

Use utf8/utf8mb4 instead.

5. Affected versions

All existing versions.

6. Fixed versions

NA (we will not fix it for now).

@dveeden
Contributor

dveeden commented Sep 12, 2022

Please note that latin1 in a MySQL context is not ISO-8859-1 (which is commonly known as "Latin 1"), it is CP1252 (a.k.a. Windows-1252) instead. This is a small difference, but it can be important in many cases. See https://dev.mysql.com/doc/refman/8.0/en/charset-we-sets.html for details.
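
As a quick illustration of that difference, here is a sketch using Python's built-in "cp1252" and "latin-1" codecs. They are stand-ins only, and MySQL's own handling may not match Python's codec in every detail, so check the linked MySQL page for the authoritative mapping.

```python
# Assumption: Python's "cp1252" and "latin-1" (ISO-8859-1) codecs are used
# as stand-ins; MySQL's latin1 may treat some bytes differently.

EURO = "\u20ac"  # the euro sign exists in CP1252 but not in ISO-8859-1

euro_cp1252 = EURO.encode("cp1252")
try:
    EURO.encode("latin-1")
    euro_in_iso = True
except UnicodeEncodeError:
    euro_in_iso = False
print("euro in cp1252:", euro_cp1252.hex().upper(), "in ISO-8859-1:", euro_in_iso)

# Walk the 0x80-0x9F range, where the two character sets diverge.
differs = []              # bytes that decode to different characters
undefined_in_cp1252 = []  # bytes Python's cp1252 codec leaves unmapped
for b in range(0x80, 0xA0):
    raw = bytes([b])
    try:
        if raw.decode("cp1252") != raw.decode("latin-1"):
            differs.append(b)
    except UnicodeDecodeError:
        undefined_in_cp1252.append(b)

print("differing bytes:", [f"{b:02X}" for b in differs])
print("unmapped in cp1252:", [f"{b:02X}" for b in undefined_in_cp1252])
```

The yen sign from this report sits outside that range, so it encodes to the same byte in both character sets.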
