executor: add new agg function APPROX_COUNT_DISTINCT (#17175) #18120

Merged
merged 4 commits into from
Jun 19, 2020

ti-srebot
Contributor

cherry-pick #17175 to release-4.0


Signed-off-by: Tong Zhigao <tongzhigao@pingcap.com>

What problem does this PR solve?

Issue Number: close #14632

Problem Summary:

  • Distinct count is very slow and can consume a large amount of memory.
  • If some relative error is acceptable, a sampling algorithm can be used to compute an approximate result.

What is changed and how it works?

  • Add a new aggregate function, APPROX_COUNT_DISTINCT.

  • Use the BJKST algorithm to compute an approximate distinct count.

  • The calculation state is a sample of element hash values with a size of up to 2^16. Compared with the widely known HyperLogLog algorithm, this algorithm is less effective in terms of accuracy and memory consumption (even up to proportionality), but it is adaptive: with fairly high accuracy, it consumes less memory when the cardinality of a large number of data sets is computed simultaneously and those cardinalities follow a power-law distribution (i.e. most of the data sets are small). The algorithm is also very accurate for data sets with small cardinality and very efficient on CPU.

  • For TiFlash, TiDB can push down the cop request and merge all partial results. For other engines, TiDB needs to collect all the original data and compute the result by itself.
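
To make the idea in the list above more concrete, here is a minimal Python sketch of an adaptive hash-sampling distinct counter with mergeable partial states. It is not the code added by this PR (which is written in Go); the class name `AdaptiveDistinctSketch`, the `level` field, and the use of BLAKE2b as the hash function are all assumptions chosen for illustration. Only the 2^16 cap on the sample size comes from the description.

```python
import hashlib


class AdaptiveDistinctSketch:
    """Hypothetical adaptive-sampling distinct counter (illustration only)."""

    def __init__(self, max_size=1 << 16):
        self.max_size = max_size  # cap on the number of stored hash values
        self.level = 0            # keep only hashes whose low `level` bits are zero
        self.sample = set()

    @staticmethod
    def _hash(value):
        # Assumption: any well-mixed 64-bit hash works; BLAKE2b is used here
        # only because it ships with the standard library.
        digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def _keeps(self, h):
        # A hash survives at the current level if its low `level` bits are zero.
        return (h & ((1 << self.level) - 1)) == 0

    def _shrink(self):
        # When the sample overflows, raise the level, which discards roughly
        # half of the stored hashes each time.
        while len(self.sample) > self.max_size:
            self.level += 1
            self.sample = {h for h in self.sample if self._keeps(h)}

    def add(self, value):
        h = self._hash(value)
        if self._keeps(h):
            self.sample.add(h)
            self._shrink()

    def merge(self, other):
        # Partial states combine by moving both to the higher level,
        # taking the union, and shrinking again if needed.
        self.level = max(self.level, other.level)
        self.sample = {h for h in self.sample | other.sample if self._keeps(h)}
        self._shrink()

    def estimate(self):
        # Each stored hash stands for about 2**level distinct values.
        return len(self.sample) << self.level


if __name__ == "__main__":
    left, right = AdaptiveDistinctSketch(), AdaptiveDistinctSketch()
    for i in range(100_000):
        (left if i % 2 == 0 else right).add(i)
    left.merge(right)
    print("approximate distinct count:", left.estimate())
```

While the number of distinct values stays below `max_size`, the level stays at 0 and the estimate is exact (barring hash collisions), which is the sense in which the state adapts to small data sets. The `merge` method mirrors the pushdown case, where each partial state could be built independently and combined afterwards.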

Tests

  • Unit test

Release note

  • Add a new aggregate function, APPROX_COUNT_DISTINCT, to support approximate distinct counts.

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
@ti-srebot
Contributor Author

/run-all-tests

@ti-srebot
Contributor Author

@solotzg please accept the invitation then you can push to the cherry-pick pull requests.
https://github.com/ti-srebot/tidb/invitations

Signed-off-by: Tong Zhigao <tongzhigao@pingcap.com>
Signed-off-by: Tong Zhigao <tongzhigao@pingcap.com>
@solotzg
Contributor

solotzg commented Jun 18, 2020

/run-all-tests

@solotzg
Contributor

solotzg commented Jun 18, 2020

/run-all-tests

@solotzg
Contributor

solotzg commented Jun 18, 2020

/run-all-tests

Contributor

@lzmhhh123 lzmhhh123 left a comment

LGTM

@lzmhhh123 lzmhhh123 added the status/LGT1 Indicates that a PR has LGTM 1. label Jun 19, 2020
@solotzg
Contributor

solotzg commented Jun 19, 2020

/run-all-tests

Contributor

@XuHuaiyu XuHuaiyu left a comment

LGTM

@XuHuaiyu XuHuaiyu added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Jun 19, 2020
@XuHuaiyu
Contributor

/merge

@ti-srebot
Contributor Author

Sorry @XuHuaiyu, you don't have permission to trigger auto merge event on this branch.

@XuHuaiyu
Contributor

/run-all-tests

@solotzg
Contributor

solotzg commented Jun 19, 2020

/run-all-tests

@solotzg
Contributor

solotzg commented Jun 19, 2020

/run-all-tests

@XuHuaiyu XuHuaiyu merged commit 6c2a572 into pingcap:release-4.0 Jun 19, 2020
@XuHuaiyu XuHuaiyu deleted the release-4.0-978370f7cbd3 branch June 19, 2020 03:57