Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PS-9237 feature: Include support for utf8mb4_0900_ai_ci in MySQL 5.7 #5364

Draft
wants to merge 8 commits into
base: release-8.0.37-29
Choose a base branch
from

Conversation

percona-ysorokin
Copy link
Collaborator

No description provided.

…(connection)

https://perconadev.atlassian.net/browse/PS-9237

Extended the behavior of the 'default_collation_for_utf8mb4' system variable.
Now, when 'default_collation_for_utf8mb4' is set to 'utf8mb4_general_ci', its
value also affects the value of the 'character_set_client',
'collation_connection' and 'character_set_results' variables.

This helps in cases when a 8.0 mysql client (for instance, 'mysql' command
line utility or 'mysqldump' utility) that is used with the
'--default-character-set' command line option explicitly set to 'utf8mb4'
or not set at all is trying to connect to a 8.0 server with 'default_collation_for_utf8mb4' set to 'utf8mb4_general_ci'.
In both cases 'MYSQL_SET_CHARSET_NAME' API option will be set to 255
and sent to the server as part of the protocol,
and without the fix, the value of the 'character_set_client' variable will be
'utf8mb4_0900_ai_ci' (255).
This behavior is not desired as it has a lot of negative side effects, especially
for the binary logging intended to be compatible with 5.7 (where there is no
'utf8mb4_0900_ai_ci' collation).
…(set_var)

https://perconadev.atlassian.net/browse/PS-9237

Extended the behavior of the 'default_collation_for_utf8mb4' system variable.
Now, when user sets one of the charset-related variables (for instance,
'character_set_client') to 'utf8mb4', the value written to the binary log for the
subsequent QUERY event will have 'SET @@session.character_set_client=...'
set to a value based on the value of 'default_collation_for_utf8mb4' (45 for
'utf8mb4_general_ci') instead of pre-defined 255 ('utf8mb4_0900_ai_ci').

This helps in scenarios when an sql file generated by the 'mysqldump' utility
has explicit 'SET character_set_client = utf8mb4;' statements and is executed
on a server running with 'default_collation_for_utf8mb4' set to
'utf8mb4_general_ci'. With the fix this the default collation for the 'utf8mb4'
will be identified as 'utf8mb4_general_ci' (45) instead of
`utf8mb4_0900_ai_ci` (255) and will help the generated binary log to be
backward compatible with 5.7.
@percona-ysorokin
Copy link
Collaborator Author

This needs to be rebased on top of release-8.0.39-30

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

sql/sys_vars.cc Outdated
get_charset_by_csname("utf8mb4", MY_CS_PRIMARY, MYF(0));
if (thd->variables.default_collation_for_utf8mb4 !=
primary_utf8mb4_collation) {
if (var->save_result.ptr == primary_utf8mb4_collation) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ cppcoreguidelines-pro-type-union-access ⚠️
do not access members of unions; use (boost::)variant instead

if (thd->variables.default_collation_for_utf8mb4 !=
primary_utf8mb4_collation) {
if (var->save_result.ptr == primary_utf8mb4_collation) {
var->save_result.ptr = thd->variables.default_collation_for_utf8mb4;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ cppcoreguidelines-pro-type-union-access ⚠️
do not access members of unions; use (boost::)variant instead

…(mtr)

https://perconadev.atlassian.net/browse/PS-9237

Added new 'main.percona_default_collation_for_utf8mb4_extensions' MTR test
case which checks the following:
* if setting 'default_collation_for_utf8mb4' affects collation variables for connections
   established from an external mysql client ('mysql' command line utility)
* if setting 'default_collation_for_utf8mb4' affects the behavior of
   'SET character_set_client = utf8mb4;'
* if collation 'utf8mb4_0900_ai_ci' (255) never appears in the binary log when
   'default_collation_for_utf8mb4' is set to 'utf8mb4_general_ci'
* if default 'mysqldump ... > mysql ...' backup-restore procedure does not lead to
   changing collations back to 'utf8mb4_0900_ai_ci' when
   'default_collation_for_utf8mb4' is set to 'utf8mb4_general_ci'
Copy link
Contributor

@dlenev dlenev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello Yura!

I would like to suggest a few changes to this patch:

// we need to fix 'cs_number' here by setting it to the corresponding number
// of 'default_collation_for_utf8mb4' (currently only 'utf8mb4_general_ci',
// number 45, is supported)
const auto *primary_utf8mb4_collation =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see in other places where we check if default_collation_for_utf8mb4 needs to kick in (e.g. see sql_lex.cc) we directly use my_charset_utf8mb4_0900_ai_ci instead of getting access to it though get_charset_by_csname().

IMO it makes sense to be consistent and do the same here... Especially since this code seems to be called for each connect so saving even a few CPU cycles would be nice.

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I think that instead of two ifs you can simply do:

if (cs_number == my_charset_utf8mb4_0900_ai_ci.number)
  cs_numer = thd->variables.default_collation_for_utf8mb4->number;

What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, @dlenev I had exactly the same doubts and was going back and forth with this. From one side I did not want to hardcode primary_utf8mb4_collation to be my_charset_utf8mb4_0900_ai_ci (because who knows may be in the next version the default collation will change again). On the other hand, I totally agree that establishing the connection is a critical path and we should not add any unnecessary cycles here.
Anyway, if this caught your attention as well, then it is probably more serious than I thought.
Let's wait for the final feedback from the customer and I will add the changes you suggested into the final patch.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dlenev I reworked the critical paths code with simplified versions

  if (thd->variables.default_collation_for_utf8mb4 !=
      &my_charset_utf8mb4_0900_ai_ci) {
    if (client_cs == &my_charset_utf8mb4_0900_ai_ci) {
      client_cs = thd->variables.default_collation_for_utf8mb4;
    }
  }

that does not involve charset by name lookup.

sql/sys_vars.cc Outdated
// 'utf8mb4', we need to fix the returned value depending on the value of
// 'default_collation_for_utf8mb4' (currently, only 'utf8mb4_general_ci'
// is possible)
const auto *primary_utf8mb4_collation =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same comment as for previous patch:

Perhaps it is better simply do:

if (var->save_result.ptr == &my_charset_utf8mb4_0900_ai_ci)
  var->save_result.ptr = thd->variables.default_collation_for_utf8mb4;

instead ?

What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworked the same way.

Copy link
Contributor

@dlenev dlenev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit more comments about second and third parts of your patch.

Otherwise it looks OK to me.

@@ -1869,6 +1869,21 @@ static bool check_charset(sys_var *, THD *thd, set_var *var) {
my_error(ER_UNKNOWN_CHARACTER_SET, MYF(0), err.ptr());
return true;
}
// if 'default_collation_for_utf8mb4' is set to something other than
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about similar code in check_collation_not_null() ?
In theory one can do SET @@global.collation_connection = utf8mb4.
Perhaps we should handle this case as well?
Or at least add comment why we don't think it is necessary/don't want to do it (indeed it might be non-trivial to distinguish cases when one uses charset name only - utf8mb4 vs case when one specifies collation explicitly -utf8mb4_0900_ai_ci.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a bit more pondering about it, I think it is a good idea to have test coverage for cases when user can use both charset name and collation name. Like SET NAMES utf8mb4; vs SET NAMES utf8mb4 COLLATE utf8mb4_0900_ai_ci and SET @@global.collation_connection = utf8mb4 vs SET @@global.collation_connection = utf8mb4_0900_ai_ci.

Do you agree?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dlenev I extended coverage with the following statements

SET character_set_client = utf8mb4;
SET NAMES DEFAULT;
SET NAMES utf8mb4;
SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
SET CHARACTER SET DEFAULT;
SET CHARACTER SET utf8mb4;
SET collation_connection = utf8mb4_general_ci;

Note that 'SET NAMES utf8mb4 COLLATE utf8mb4_0900_ai_ci' will generate
character_set_client=255 in the binary log - I believe this is correct behavior as uses pecifies the collation explicitly.

Likewise, 'SET collation_connection = utf8mb4_0900_ai_ci' will generate
character_set_client=255 in the binary log.

As for 'SET collation_connection = utf8mb4' (when we try to assign a character set name to a collation variable), there is now problem here as this statatement is considered
syntactically incorrect In other words, no need to change check_collation_not_null().

…(connection postfix)

https://perconadev.atlassian.net/browse/PS-9237

Eliminated unnecessary call to 'get_charset_by_csname()' inside 'thd_init_client_charset()'
as this function is in critical path.
…(set_var postfix)

https://perconadev.atlassian.net/browse/PS-9237

Eliminated unnecessary call to 'get_charset_by_csname()' inside 'check_charset()'.
…(triggers)

https://perconadev.atlassian.net/browse/PS-9237

Fixed the value of the 'character_set_client' field for the binary log events generated
implicitly from triggers.

Updated 'main.percona_default_collation_for_utf8mb4_extensions' MTR test
case with 'INSERT' statements executed from triggers and stored procedures.
…(SET CHARACTER SET)

https://perconadev.atlassian.net/browse/PS-9237

'SET CHARACTER SET' statement now also takes into account the value of the
'default_collation_for_utf8mb4'  system variable.

'main.percona_default_collation_for_utf8mb4_extensions' MTR test case extended
with additional checks for 'SET NAMES' and `SET CHARACTER SET`.
…(SET collation_connection mtr)

https://perconadev.atlassian.net/browse/PS-9237

'main.percona_default_collation_for_utf8mb4_extensions' MTR test case
extended with additional checks for 'SET collation_connection'.
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

// 'utf8mb4_general_ci' is possible)
if (thd->variables.default_collation_for_utf8mb4 !=
&my_charset_utf8mb4_0900_ai_ci) {
if (var->save_result.ptr == &my_charset_utf8mb4_0900_ai_ci) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ cppcoreguidelines-pro-type-union-access ⚠️
do not access members of unions; use (boost::)variant instead

@percona-ysorokin percona-ysorokin force-pushed the dev/PS-9237-default_collation_for_utf8mb4 branch from a35c35d to bfc10f2 Compare October 22, 2024 00:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants