Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

table: improve error message for incorrect utf8 value #25087

Merged
merged 1 commit into from
Jun 21, 2021

Conversation

dveeden
Copy link
Contributor

@dveeden dveeden commented Jun 3, 2021

What problem does this PR solve?

While working on #24991 I hit a
incorrect utf8 value error. As the HEX and string representation of
the string are concaternated it looked like a single value while it was
not. This more clearly splits and labels the two.

[2021/06/02 17:04:33.488 +02:00] [ERROR] [column.go:154] ["incorrect UTF-8 value"] [conn=0] [error="[table:1366]incorrect utf8 value 24412430303524215011f542e7a2aca97fd5ab71aae07dd06f6157662e6a522f72506e5862584e4f475057774f392e38506e73776d2e6d6f5450454a46686a6d334758686d31($A$005$!P\u0011\ufffdB碬\ufffdիq\ufffd\ufffd}\ufffdoaWf.jR/rPnXbXNOGPWwO9.8Pnswm.moTPEJFhjm3GXhm1) for column authentication_string"]
[2021/06/02 17:04:33.488 +02:00] [ERROR] [column.go:154] ["incorrect UTF-8 value"] [conn=0] [error="[table:1366]incorrect utf8 value hex=24412430303524215011f542e7a2aca97fd5ab71aae07dd06f6157662e6a522f72506e5862584e4f475057774f392e38506e73776d2e6d6f5450454a46686a6d334758686d31 string=$A$005$!P\u0011\ufffdB碬\ufffdիq\ufffd\ufffd}\ufffdoaWf.jR/rPnXbXNOGPWwO9.8Pnswm.moTPEJFhjm3GXhm1 for column authentication_string"]

Check List

Tests

  • Manual test (add detailed scripts or steps below)

Release note

  • No release note

Some thoughts/questions related to this:

  • The error has incorrect utf8 value event if the column is utf8mb4 intead of utf8. Should we use col.Charset here? Or change utf8 to UTF-8? I think the latter would be the best option.
  • Should the error message include the charset and collation?

@ti-chi-bot ti-chi-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jun 3, 2021
@dveeden
Copy link
Contributor Author

dveeden commented Jun 3, 2021

We could change this to something like:

diff --git a/table/column.go b/table/column.go
index 6bdcceaa1..583740000 100644
--- a/table/column.go
+++ b/table/column.go
@@ -139,19 +139,10 @@ func truncateTrailingSpaces(v *types.Datum) {
        v.SetString(str, v.Collation())
 }
 
-func handleWrongASCIIValue(ctx sessionctx.Context, col *model.ColumnInfo, casted *types.Datum, str string, i int) (types.Datum, error) {
+func handleWrongCharsetValue(ctx sessionctx.Context, col *model.ColumnInfo, casted *types.Datum, str string, i int) (types.Datum, error) {
        sc := ctx.GetSessionVars().StmtCtx
-       err := ErrTruncatedWrongValueForField.FastGen("incorrect ascii value hex=%x string=%s for column %s", casted.GetBytes(), str, col.Name)
-       logutil.BgLogger().Error("incorrect ASCII value", zap.Uint64("conn", ctx.GetSessionVars().ConnectionID), zap.Error(err))
-       truncateVal := types.NewStringDatum(str[:i])
-       err = sc.HandleTruncate(err)
-       return truncateVal, err
-}
-
-func handleWrongUtf8Value(ctx sessionctx.Context, col *model.ColumnInfo, casted *types.Datum, str string, i int) (types.Datum, error) {
-       sc := ctx.GetSessionVars().StmtCtx
-       err := ErrTruncatedWrongValueForField.FastGen("incorrect utf8 value hex=%x string=%s for column %s", casted.GetBytes(), str, col.Name)
-       logutil.BgLogger().Error("incorrect UTF-8 value", zap.Uint64("conn", ctx.GetSessionVars().ConnectionID), zap.Error(err))
+       err := ErrTruncatedWrongValueForField.FastGen("incorrect %s/%s value hex=%x string=%s for column %s", col.Charset, col.Collate, casted.GetBytes(), str, col.Name)
+       logutil.BgLogger().Error("incorrect value for charset", zap.Uint64("conn", ctx.GetSessionVars().ConnectionID), zap.Error(err))
        // Truncate to valid utf8 string.
        truncateVal := types.NewStringDatum(str[:i])
        err = sc.HandleTruncate(err)
@@ -280,7 +271,7 @@ func CastValue(ctx sessionctx.Context, val types.Datum, col *model.ColumnInfo, r
                str := casted.GetString()
                for i := 0; i < len(str); i++ {
                        if str[i] > unicode.MaxASCII {
-                               casted, err = handleWrongASCIIValue(ctx, col, &casted, str, i)
+                               casted, err = handleWrongCharsetValue(ctx, col, &casted, str, i)
                                break
                        }
                }
@@ -307,11 +298,11 @@ func CastValue(ctx sessionctx.Context, val types.Datum, col *model.ColumnInfo, r
                                w = width
                                continue
                        }
-                       casted, err = handleWrongUtf8Value(ctx, col, &casted, str, i)
+                       casted, err = handleWrongCharsetValue(ctx, col, &casted, str, i)
                        break
                } else if width > 3 && doMB4CharCheck {
                        // Handle non-BMP characters.
-                       casted, err = handleWrongUtf8Value(ctx, col, &casted, str, i)
+                       casted, err = handleWrongCharsetValue(ctx, col, &casted, str, i)
                        break
                }
                w = width

But maybe the simple change as proposed in this PR is sufficient.

@dveeden
Copy link
Contributor Author

dveeden commented Jun 3, 2021

What MySQL does:

mysql 8.0.22 > create table t1(id int auto_increment primary key, c1 varchar(255) character set ascii);
Query OK, 0 rows affected (0.17 sec)

mysql 8.0.22 > insert into t1(c1) values (0x24412430303524031A69251C34295C4B35167C7F1E5A7B63091349503974624D34504B5A424679354856336868686F52485A736E4A733368786E427575516C73446469496537);
Query OK, 1 row affected (0.02 sec)

mysql 8.0.22 > select * from t1;
+----+------------------------------------------------------------------------+
| id | c1                                                                     |
+----+------------------------------------------------------------------------+
|  1 | $A$005$�i%4)\K5|Z{c	IP9tbM4PKZBFy5HV3hhhoRHZsnJs3hxnBuuQlsDdiIe7 |
+----+------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql 8.0.22 > show create table t1\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `id` int NOT NULL AUTO_INCREMENT,
  `c1` varchar(255) CHARACTER SET ascii COLLATE ascii_general_ci DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.01 sec)

mysql 8.0.22 > insert into t1(c1) values ('$A$005$�i%4)\K5|Z{cIP9tbM4PKZBFy5HV3hhhoRHZsnJs3hxnBuuQlsDdiIe7');
ERROR 1366 (HY000): Incorrect string value: '\xEF\xBF\xBDi%4...' for column 'c1' at row 1

What TiDB does (with my patch):

5.7.25-TiDB-v5.1.0-alpha-68-g5dfb4888d-dirty 127.0.0.1:4000  test  SQL  create table t1(id int auto_increment primary key, c1 varchar(255) character set ascii);
Query OK, 0 rows affected (0.0116 sec)
5.7.25-TiDB-v5.1.0-alpha-68-g5dfb4888d-dirty 127.0.0.1:4000  test  SQL  insert into t1(c1) values (0x24412430303524031A69251C34295C4B35167C7F1E5A7B63091349503974624D34504B5A424679354856336868686F52485A736E4A733368786E427575516C73446469496537);
Query OK, 1 row affected (0.0060 sec)
5.7.25-TiDB-v5.1.0-alpha-68-g5dfb4888d-dirty 127.0.0.1:4000  test  SQL  select * from t1;
+----+------------------------------------------------------------------------+
| id | c1                                                                     |
+----+------------------------------------------------------------------------+
|  1 | $A$005$�i%4)\K5|Z{c	IP9tbM4PKZBFy5HV3hhhoRHZsnJs3hxnBuuQlsDdiIe7 |
+----+------------------------------------------------------------------------+
1 row in set (0.0007 sec)
5.7.25-TiDB-v5.1.0-alpha-68-g5dfb4888d-dirty 127.0.0.1:4000  test  SQL  show create table t1\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `c1` varchar(255) CHARACTER SET ascii COLLATE ascii_bin DEFAULT NULL,
  PRIMARY KEY (`id`) /*T![clustered_index] CLUSTERED */
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin AUTO_INCREMENT=39122
1 row in set (0.0012 sec)
5.7.25-TiDB-v5.1.0-alpha-68-g5dfb4888d-dirty 127.0.0.1:4000  test  SQL  insert into t1(c1) values ('$A$005$�i%4)\K5|Z{cIP9tbM4PKZBFy5HV3hhhoRHZsnJs3hxnBuuQlsDdiIe7');
ERROR: 1366 (HY000): incorrect ascii/ascii_bin value hex=24412430303524efbfbd692534294b357c5a7b6349503974624d34504b5a424679354856336868686f52485a736e4a733368786e427575516c73446469496537 string=$A$005$�i%4)K5|Z{cIP9tbM4PKZBFy5HV3hhhoRHZsnJs3hxnBuuQlsDdiIe7 for column c1

I don't think we should try to mimic the exact error that MySQL is giving as our message seems to be more informative.

@dveeden
Copy link
Contributor Author

dveeden commented Jun 3, 2021

  • We could also put the i in the message
  • MySQL does report the row number, that might be useful.

@dveeden
Copy link
Contributor Author

dveeden commented Jun 3, 2021

With the i:

5.7.25-TiDB-v5.1.0-alpha-68-g5dfb4888d-dirty 127.0.0.1:4000  test  SQL  insert into t1(c1) values ('$A$005$�i%4)\K5|Z{cIP9tbM4PKZBFy5HV3hhhoRHZsnJs3hxnBuuQlsDdiIe7');
ERROR: 1366 (HY000): incorrect ascii/ascii_bin value hex=24412430303524efbfbd692534294b357c5a7b6349503974624d34504b5a424679354856336868686f52485a736e4a733368786e427575516c73446469496537 string=$A$005$�i%4)K5|Z{cIP9tbM4PKZBFy5HV3hhhoRHZsnJs3hxnBuuQlsDdiIe7 at character 7 for column c1

@wjhuang2016
Copy link
Member

  • We could also put the i in the message
  • MySQL does report the row number, that might be useful.

Would you like to do it in this PR, or in the next PR?

@dveeden
Copy link
Contributor Author

dveeden commented Jun 3, 2021

  • We could also put the i in the message
  • MySQL does report the row number, that might be useful.

Would you like to do it in this PR, or in the next PR?

I could include it in this PR if others think this is useful. If that isn't the case we can just merge this PR as-is.

@dveeden
Copy link
Contributor Author

dveeden commented Jun 3, 2021

@morgo what do you think about this? minimal change as proposed, or should we add more details? Or should this be identical to MySQL?

@morgo
Copy link
Contributor

morgo commented Jun 3, 2021

@morgo what do you think about this? minimal change as proposed, or should we add more details? Or should this be identical to MySQL?

I do prefer identical to MySQL, but it is not a strong preference.

@bb7133
Copy link
Member

bb7133 commented Jun 3, 2021

We could also put the i in the message
MySQL does report the row number, that might be useful.

I prefer the MySQL message because the row number is reported(and useful), however, it is not easy to archive since the row number is lost in the call stack. So I think we can keep the current patch and put the i.

(#25087 (comment)) We could change this to something like:

IMO it's a good idea!

The error has incorrect utf8 value event if the column is utf8mb4 intead of utf8. Should we use col.Charset here? Or change utf8 to UTF-8? I think the latter would be the best option.
Should the error message include the charset and collation?

I think we can include the charset but not collation(not related to the charset encoding). While if the explicit charset name is given, 'changing utf8 to UTF-8' is not needed.

@ti-chi-bot ti-chi-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 8, 2021
@dveeden
Copy link
Contributor Author

dveeden commented Jun 8, 2021

I've tried to make this more similar to what MySQL does, but without the rownumber.

@dveeden
Copy link
Contributor Author

dveeden commented Jun 8, 2021

[dvaneeden@dve-carbon tidb]$ mysqlsh --quiet-start=2 mysql://root@127.0.0.1:4000/test
5.7.25-TiDB-v5.2.0-alpha-8-g8f3f2158f-dirty 127.0.0.1:4000  test  SQL insert into t1(c1) values ('$A$005$�i%4)\K5|Z{cIP9tbM4PKZBFy5HV3hhhoRHZsnJs3hxnBuuQlsDdiIe7');
ERROR: 1366 (HY000): Incorrect string value '\xEF\xBF\xBDi%4...' for column 'c1'

@morgo
Copy link
Contributor

morgo commented Jun 9, 2021

@dveeden It looks good to me, but can you write an integration test (expression/integration_test.go) confirming that this improved error message is returned?

@dveeden dveeden requested a review from a team as a code owner June 9, 2021 06:45
@dveeden dveeden requested review from wshwsh12 and removed request for a team June 9, 2021 06:45
@dveeden
Copy link
Contributor Author

dveeden commented Jun 9, 2021

@dveeden It looks good to me, but can you write an integration test (expression/integration_test.go) confirming that this improved error message is returned?

done

@dveeden
Copy link
Contributor Author

dveeden commented Jun 10, 2021

/cc @morgo @bb7133

@ti-chi-bot ti-chi-bot requested review from bb7133 and morgo June 10, 2021 09:46
@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Jun 11, 2021
@dveeden
Copy link
Contributor Author

dveeden commented Jun 14, 2021

@bb7133 @wshwsh12 PTAL

@ti-chi-bot
Copy link
Member

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • morgo
  • wjhuang2016

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Jun 15, 2021
table/column.go Outdated Show resolved Hide resolved
table/column.go Outdated Show resolved Hide resolved
@dveeden dveeden force-pushed the utf8_column_invalid branch 2 times, most recently from 17496b3 to 428808b Compare June 18, 2021 07:46
@dveeden dveeden requested a review from wshwsh12 June 18, 2021 08:19
Copy link
Contributor

@wshwsh12 wshwsh12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wshwsh12
Copy link
Contributor

/merge

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Jun 18, 2021
@dveeden
Copy link
Contributor Author

dveeden commented Jun 18, 2021

/run-check_dev_2
/run-unit-test

@ti-chi-bot
Copy link
Member

@dveeden: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@wshwsh12
Copy link
Contributor

/merge cancel

@ti-chi-bot ti-chi-bot removed the status/can-merge Indicates a PR has been approved by a committer. label Jun 20, 2021
@wshwsh12
Copy link
Contributor

wshwsh12 commented Jun 20, 2021

It seems the test testIntegrationSuite.TestCharsetErr failed, please fix it.

[2021-06-19T14:01:28.127Z] FAIL: integration_test.go:9744: testIntegrationSuite.TestCharsetErr
[2021-06-19T14:01:28.127Z] 
[2021-06-19T14:01:28.127Z] integration_test.go:9749:
[2021-06-19T14:01:28.127Z]     c.Assert(err.Error(), Equals, "[table:1366]Incorrect string value '\\xEF\\xBF\\xBDbbb...' for column 'c1'")
[2021-06-19T14:01:28.127Z] ... obtained string = "[table:1366]Incorrect string value '\\xEF\\xBF\\xBDbbb' for column 'c1'"
[2021-06-19T14:01:28.127Z] ... expected string = "[table:1366]Incorrect string value '\\xEF\\xBF\\xBDbbb...' for column 'c1'"

This now more closely matches the output from MySQL 8.0
@wshwsh12
Copy link
Contributor

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: 321842f

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/expression size/M Denotes a PR that changes 30-99 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants