docs/design: update collation compatibility issues in charsets doc #30806

Merged
merged 7 commits into from
Dec 19, 2021
Changes from 2 commits
41 changes: 9 additions & 32 deletions docs/design/2021-08-18-charsets.md
@@ -98,8 +98,10 @@ After receiving the non-utf-8 character set request, this solution will convert
### Collation

Add the gbk_chinese_ci and gbk_bin collations. In addition, for performance reasons, we can add a utf8mb4-based collation (gbk_utf8mb4_bin).
- Supporting the gbk_chinese_ci and gbk_bin collations requires turning on the `new_collations_enabled_on_first_bootstrap` switch.
- If `new_collations_enabled_on_first_bootstrap` is off, only gbk_utf8mb4_bin is supported; it does not need the data to be converted to the gbk charset before processing.
- Implement the Collator and WildcardPattern interface functions for each collation.
- gbk_chinese_ci and gbk_bin need to convert utf-8 to gbk encoding and then generate a sort key. gbk_utf8mb4_bin does not need to be converted to gbk code for processing.
- gbk_chinese_ci and gbk_bin need to convert utf-8 to gbk encoding and then generate a sort key.
- Implement the corresponding functions in the Coprocessor.
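
To illustrate why the two kinds of collation can order the same strings differently, here is a minimal Python sketch. It is not TiDB's Go implementation: the helper names `gbk_bin_sort_key` and `utf8mb4_bin_sort_key` are hypothetical, Python's built-in `gbk` codec is assumed to be close enough to the gbk charset for this example, and padding rules are ignored.

```python
def gbk_bin_sort_key(s: str) -> bytes:
    # Hypothetical helper: transcode the utf-8 text to gbk first,
    # then use the raw gbk bytes as the binary sort key.
    return s.encode("gbk")


def utf8mb4_bin_sort_key(s: str) -> bytes:
    # Hypothetical helper: no transcoding, compare the utf-8 bytes directly.
    return s.encode("utf-8")


words = ["一", "啊"]

# gbk bytes: 啊 = 0xB0A1, 一 = 0xD2BB, so 啊 sorts first.
by_gbk = sorted(words, key=gbk_bin_sort_key)

# utf-8 bytes: 一 = 0xE4B880, 啊 = 0xE5958A, so 一 sorts first.
by_utf8 = sorted(words, key=utf8mb4_bin_sort_key)

print(by_gbk, by_utf8)
```

The sketch only shows that skipping the gbk conversion saves work at the cost of a different (utf-8 byte) ordering; gbk_chinese_ci would additionally need case-insensitive weights on top of the gbk encoding.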

### DDL
@@ -119,43 +121,18 @@ Other behaviors that need to be dealt with:
#### Compatibility between TiDB versions

- Upgrade compatibility:
- Upgrades from versions below 4.0 do not support gbk or any character sets other than the original five (binary, ascii, latin1, utf8, utf8mb4).
- Upgrade from version 4.0 or higher
- There may be compatibility issues when performing non-utf-8-related operations during the rolling upgrade.
- The new version of the cluster is expected to have no compatibility issues when reading old data.
- There may be compatibility issues when performing operations during the rolling upgrade.
- The new version of the cluster is expected to have no compatibility issues when reading old data.
- Downgrade compatibility:
- Downgrade is not compatible. For tables whose index keys use gbk_bin/gbk_chinese_ci, lower versions of TiDB will have problems decoding the keys, so the data needs to be transcoded before downgrading.

#### Compatibility with MySQL

Illegal character related issue:
- Illegal character related issue:
- Because non-utf-8 encodings are converted to utf8 internally for processing, TiDB is not fully compatible with MySQL in some cases of illegal character handling. TiDB controls its behavior through sql_mode.

```sql
create table t3(a char(10) charset gbk);
insert into t3 values ('a');

-- 0xcee5 is a valid gbk hex literal but an invalid utf8mb4 hex literal.
select hex(concat(a, 0xcee5)) from t3;
-- mysql 61cee5

-- 0xe4b880 is an invalid gbk hex literal but a valid utf8mb4 hex literal.
select hex(concat(a, 0xe4b880)) from t3;
-- mysql 61e4b880 (test on mysql 5.7 and 8.0.22)
-- mysql returns "Cannot convert string '\x80' from binary to gbk" (test on mysql 8.0.25 and 8.0.26). TiDB will be compatible with this behavior.

-- 0x80 is a hex literal that is invalid for both gbk and utf8mb4.
select hex(concat(a, 0x80)) from t3;
-- mysql 6180 (test on mysql 5.7 and 8.0.22)
-- mysql returns "Cannot convert string '\x80' from binary to gbk" (test on mysql 8.0.25 and 8.0.26). TiDB will be compatible with this behavior.

set @@sql_mode = '';
insert into t3 values (0x80);
-- mysql raises a warning and inserts a null value (warning: "Incorrect string value: '\x80' for column 'a' at row 1")

set @@sql_mode = 'STRICT_TRANS_TABLES';
insert into t3 values (0x80);
-- mysql returns "Incorrect string value: '\x80' for column 'a' at row 1"
```
- Collation
- gbk_bin/gbk_chinese_ci are supported when the configuration item `new_collations_enabled_on_first_bootstrap` is enabled. When it is not enabled, only binary sorting of the data after it has been converted from the non-utf-8 encoding to utf8 is supported.
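
The illegal-character cases in this section depend on which byte sequences are valid in gbk versus utf8mb4. As a rough check, the sketch below classifies the three hex literals 0xcee5, 0xe4b880 and 0x80 using Python's built-in codecs. The helper `is_valid` is hypothetical, and Python's `gbk` codec is only assumed to approximate MySQL's and TiDB's gbk handling; it says nothing about how either server reacts under a given sql_mode.

```python
def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly under the given Python codec."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# The three byte sequences discussed for illegal-character handling.
samples = {
    "0xcee5": bytes.fromhex("cee5"),
    "0xe4b880": bytes.fromhex("e4b880"),
    "0x80": bytes.fromhex("80"),
}

for name, raw in samples.items():
    print(name, "gbk:", is_valid(raw, "gbk"), "utf-8:", is_valid(raw, "utf-8"))
```

Under these assumptions, 0xcee5 is accepted only as gbk, 0xe4b880 only as utf-8, and 0x80 by neither, which is why converting to utf8 internally can surface errors at different points than MySQL does.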

#### Compatibility with other components
