Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

group_by: collation-aware grouping #1476

Merged
merged 1 commit into from
Dec 19, 2022
Merged

Conversation

vmg
Copy link
Contributor

@vmg vmg commented Dec 14, 2022

Hi doltfriends! Here's another compatibility fix. This one I'm more confident about: GROUP BY is not collation aware when grouping!

The included test reproduces the issue trivially, but for completeness: when using a GROUP BY statement that includes a textual column, the column's collation needs to be taken into account when creating the aggregation groups.

E.g., for the following schema and query

CREATE TABLE test (id bigint, title varchar(255) collate utf8mb4_0900_ai_ci, PRIMARY KEY(id));
INSERT INTO `test` (title) VALUES ('val1'), ('VAL1'), ('val2'), ('VAL2')
SELECT COUNT(*) FROM test GROUP BY title

We should expect the result of the SELECT to contain two rows, as there are only two groups when grouping by title. Even though we've inserted 4 unique values, the utf8mb4_0900_ai_ci with which the column is collated is case insensitive, so in practice, 'val1' = 'VAL1' and 'val2' = 'VAL2' when grouping. This is not the current behavior.

The fix is straightforward: use a weight string when hashing to generate the grouping key instead of using the expression's literal value. I did a small refactoring in collations.go so that we can write the weight string directly into the grouping key's hasher and continue using the same code for the existing HashUint API.

cc @dbussink

@timsehn
Copy link
Contributor

timsehn commented Dec 14, 2022

Thanks again! Please keep em coming.

Are you all using this at Planetscale or are these hobby fixes?

We love what you all have done with Vitess and are doing with Planetscale core. I've mentioned you all a number of times in some of our more consistently popular blogs. For instance: https://www.dolthub.com/blog/2021-09-17-database-version-control :-)

@fulghum
Copy link
Contributor

fulghum commented Dec 14, 2022

Oh wow! This is another great find. Thank you @vmg!! 🙏

@vmg
Copy link
Contributor Author

vmg commented Dec 15, 2022

Are you all using this at Planetscale or are these hobby fixes?

We're evaluating using go-mysql-server in the OSS Vitess project to mock MySQL instances in our end-to-end tests (which I guess would become integration tests once we remove live MySQL). It would be a game changer for our CI and allow us to test hundreds more permutations of tables and databases much more efficiently!

Before we get there, we need go-mysql-server to mimic MySQL semantics as accurately as possible. These are some of the incompatibilities I've found so far.

My assumption is that you are also interested on behaving as close as MySQL as possible, is that correct? For things like this specific bug, go-mysql-server was clearly doing the wrong thing, so that's why I've opened this PR, but I've found some other cases where you're returning sensible results that are not 100% the same as MySQL.

I'd like to discuss those with you, and whether you'd like those fixes too. Should I open an issue for MySQL compatibility quirks? What works best for you?

I've mentioned you all a number of times in some of our more consistently popular blogs.

I know! I read those and found them very interesting! :)

@vmg
Copy link
Contributor Author

vmg commented Dec 15, 2022

(build should be green now; I introduced a typo in my refactoring)

Copy link
Contributor

@Hydrocharged Hydrocharged left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Just a few more test additions and this is good to merge!

{float64(8888)},
}

require.Equal(expected, rows)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a few additional tests for ENUM and SET as well. I'd expect them to work as, although they're collation-aware, they handle their values as integers, so I'd expect it to just work. But we should verify the behavior anyway.

@timsehn
Copy link
Contributor

timsehn commented Dec 15, 2022

We're evaluating using go-mysql-server in the OSS Vitess project to mock MySQL instances in our end-to-end tests (which I guess would become integration tests once we remove live MySQL). It would be a game changer for our CI and allow us to test hundreds more permutations of tables and databases much more efficiently!

Great use case. We have a number of GMS customers using it this way.

My assumption is that you are also interested on behaving as close as MySQL as possible, is that correct? For things like this specific bug, go-mysql-server was clearly doing the wrong thing, so that's why I've opened this PR, but I've found some other cases where you're returning sensible results that are not 100% the same as MySQL.

We're about 99% certain that we want to mimic MySQL exactly. These should at least be issues that folks can refer to so if you find them, we can start there. If we should fix them, we'll fix them. Otherwise, we'll just leave the issue open with an explanation on why we're leaving the divergent behavior.

@vmg
Copy link
Contributor Author

vmg commented Dec 16, 2022

Tests are updated!

@Hydrocharged
Copy link
Contributor

@vmg It looks like the tests are failing.

@vmg
Copy link
Contributor Author

vmg commented Dec 16, 2022

@Hydrocharged: I'm afraid this is not related to my PR. Somebody broke main. :)

@fulghum
Copy link
Contributor

fulghum commented Dec 16, 2022

@jcor11599 – can you take a look at the CI test failure above and see if it's related to your bindings change for prepared statements? I didn't immediately recognize the error, but happy to dig in and help figure it out if needed.

@jycor
Copy link
Contributor

jycor commented Dec 16, 2022

@vmg T'was me, sorry! Fix for it is here: #1483

@fulghum
Copy link
Contributor

fulghum commented Dec 16, 2022

Thanks for the quick fix @jcor11599!

@vmg – once you pull James' latest change from main, it should get your branch fixed up. Thanks again for working with us on this great contribution! 🙏

Signed-off-by: Vicent Marti <vmg@strn.cat>
@fulghum
Copy link
Contributor

fulghum commented Dec 19, 2022

@Hydrocharged – heads up that the CI checks are all passing now the latest fixes from main have been pulled in. Anything else you wanna see on this PR?

@Hydrocharged
Copy link
Contributor

@Hydrocharged – heads up that the CI checks are all passing now the latest fixes from main have been pulled in. Anything else you wanna see on this PR?

Nope, we’ll merge it!

@Hydrocharged Hydrocharged merged commit bfd338f into dolthub:main Dec 19, 2022
@fulghum
Copy link
Contributor

fulghum commented Dec 19, 2022

Awesome! 🙌 Thanks again for the great contribution @vmg! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants