Optimize queries with LIMIT/LIMIT BY/ORDER BY for distributed with GROUP BY sharding_key #10373
So to summarize: this should solve some issues that were introduced by #10341, and adds support for LIMIT/LIMIT BY/ORDER BY.
@akuzm can you please take a look: is it a false positive or not? (Or maybe I should just restart the build, e.g. by rebasing against upstream/master?)
Never mind, the problem is in the protocol ABI breakage (queries like …).
@4ertus2 this is ready, can you take a look?
The 00956_sensitive_data_masking test is flaky (#13867).
@4ertus2 @alexey-milovidov friendly ping, can someone take a look?
So here we already got the result, but the timeout triggers (…).
@alexey-milovidov any thoughts on this?
Process the query up to the stage where the aggregate functions have been calculated and finalized. It will be used for optimize_distributed_group_by_sharding_key.

v2: fix aliases
v3: fix protocol ABI breakage due to WithMergeableStateAfterAggregation (the >= conditions on QueryProcessingStage::Enum have been verified, and they are OK, in InterpreterSelectQuery)
…OUP BY sharding_key

The previous set of QueryProcessingStage values did not allow this. But after WithMergeableStateAfterAggregation was introduced, the following queries can also be optimized under optimize_distributed_group_by_sharding_key:
- GROUP BY sharding_key LIMIT
- GROUP BY sharding_key LIMIT BY
- GROUP BY sharding_key ORDER BY

The following are still not supported:
- WITH TOTALS (looks like it can be supported)
- WITH ROLLUP (looks like it can be supported)
- WITH CUBE
- SETTINGS extremes=1 (looks like it can be supported)

These will be implemented separately.

vX: fixes
v2: fix WITH *
v3: fix extremes
v4: fix LIMIT OFFSET (and make it a little bit cleaner)
v5: fix HAVING
v6: fix ORDER BY
v7: rebase against 20.7
v8: move out WithMergeableStateAfterAggregation
v9: add optimize_distributed_group_by_sharding_key into test names
…merge (convert to UInt64)

Possible values:
- 1 - Do not merge aggregation states from different servers for distributed query processing, in case it is known for certain that there are different keys on different shards.
- 2 - Same as 1, but also apply the ORDER BY and LIMIT stages.
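As a rough sketch of what value 2 enables (table and column names here are hypothetical, not from the patch): when the GROUP BY key is the sharding key, each shard already produces final, disjoint groups, so ORDER BY and LIMIT can also be pushed down to the shards.

```sql
-- Hypothetical Distributed table `dist_table` sharded by user_id.
-- With distributed_group_by_no_merge = 2 the initiator does not merge
-- aggregation states; it only concatenates the shards' final rows and
-- re-applies ORDER BY / LIMIT on top of them.
SELECT user_id, count()
FROM dist_table
GROUP BY user_id
ORDER BY user_id
LIMIT 10
SETTINGS distributed_group_by_no_merge = 2;
```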
rebased in an attempt to fix it
drop table if exists dist_01247;
drop table if exists dist_layer_01247;

select 'Distributed(rand)-over-Distributed(rand)';
create table dist_layer_01247 as data_01247 engine=Distributed(test_cluster_two_shards, currentDatabase(), data_01247, rand());
create table dist_01247 as data_01247 engine=Distributed(test_cluster_two_shards, currentDatabase(), dist_layer_01247, number);
select count(), * from dist_01247 group by number;
drop table dist_01247;
Don't understand this change, will return DROP queries...
When a test DROPs its tables at the end, you cannot attach with the client and reproduce the failure; that's why I'm trying to keep them. That said, this is just a cosmetic thing.
@alexey-milovidov Is there some unspoken rule that says tables should be removed after the test?
It's just for convenience when you run .sql files manually and don't want leftovers.
LGTM.
I understand the logic. It looks a bit fragile, because StorageDistributed should maintain its awareness of all possible pipeline steps after GROUP BY. It is OK, but we need to write randomized combinatorial tests for the correctness of Distributed tables on various types of clusters.
@@ -1,2 +1,42 @@
SELECT count(), uniq(dummy) FROM remote('127.0.0.{2,3}', system.one) SETTINGS distributed_group_by_no_merge = 1;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather add a new test instead of modifying an old, simple test case.
FWIW here is the previous test report (before the rebase against upstream/master): https://clickhouse-test-reports.s3.yandex.net/10373/9e4aa5954e45865d7756e1a6a174d7c1cf0ccfa0/performance_comparison/report.html#fail1
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Optimize queries with LIMIT/LIMIT BY/ORDER BY for distributed with GROUP BY sharding_key (under optimize_skip_unused_shards and optimize_distributed_group_by_sharding_key)
Detailed description / Documentation draft:
The previous set of QueryProcessingStage values did not allow this.
So this patch set introduces a new WithMergeableStateAfterAggregation stage and uses
it to optimize queries with GROUP BY sharding_key and:
It still does not support:
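A hypothetical query shape this optimization targets (the table, column, and cluster names below are illustrative and not part of the patch; the sharding key is assumed to be user_id):

```sql
-- `dist` is a Distributed table whose sharding expression is user_id.
-- Because GROUP BY matches the sharding key, groups from different shards
-- are disjoint, so under optimize_skip_unused_shards and
-- optimize_distributed_group_by_sharding_key the ORDER BY / LIMIT stages
-- can run on the shards, with only a final re-limit on the initiator.
SELECT user_id, sum(bytes)
FROM dist
GROUP BY user_id
ORDER BY user_id
LIMIT 5
SETTINGS optimize_skip_unused_shards = 1,
         optimize_distributed_group_by_sharding_key = 1;
```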
HEAD:
Continuation of: