-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
statistics: remove statistics.Column.Count #43033
statistics: remove statistics.Column.Count #43033
Conversation
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review by filling The full list of commands accepted by this bot can be found here. Reviewer can indicate their review by submitting an approval review. |
if analyzeCount > 0 { | ||
factor := float64(ds.statisticTable.RealtimeCount) / hist.TotalRowCount() | ||
ndv *= factor | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here the way to calculate column NDV is changed. We assume that the change is ok for most cases.
@@ -3205,8 +3205,8 @@ func TestIssue32632(t *testing.T) { | |||
"`S_ACCTBAL` decimal(15,2) NOT NULL," + | |||
"`S_COMMENT` varchar(101) NOT NULL," + | |||
"PRIMARY KEY (`S_SUPPKEY`) /*T![clustered_index] CLUSTERED */)") | |||
tk.MustExec("analyze table partsupp;") | |||
tk.MustExec("analyze table supplier;") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I don't modify the test, I would get the error as follows:
Diff:
--- Expected
+++ Actual
@@ -4,4 +4,4 @@
[ └─HashAgg 1.00 mpp[tiflash] funcs:sum(test.partsupp.ps_supplycost)->Column#15]
-[ └─Projection 12500.00 mpp[tiflash] test.partsupp.ps_supplycost]
-[ └─HashJoin 12500.00 mpp[tiflash] inner join, equal:[eq(test.supplier.s_suppkey, test.partsupp.ps_suppkey)]]
+[ └─Projection 8000000000.00 mpp[tiflash] test.partsupp.ps_supplycost]
+[ └─HashJoin 8000000000.00 mpp[tiflash] inner join, equal:[eq(test.supplier.s_suppkey, test.partsupp.ps_suppkey)]]
[ ├─ExchangeReceiver(Build) 10000.00 mpp[tiflash] ]
Before the PR, in getColumnNDV
, since hist.Count
is 0, we use float64(ds.statisticTable.RealtimeCount) * distinctFactor
to calculate NDV. After the PR, if we still analyze partsupp
and supplier
when they don't have any data, since hist.IsStatsInitialized()
is true, we use float64(hist.Histogram.NDV)
to calculate NDV. In another word, before the PR we use hist.Count > 0
to decide whether the table is analyzed while after the PR we use hist.IsStatsInitialized()
to decide that. I think the latter is more reasonable. I remove the two analyze commands to keep cardinality estimation the same as before. Note that it is rare in the reality that RealtimeCount
reaches 800000/10000 but auto analyze is not triggered.
/merge |
This pull request has been accepted and is ready to merge. Commit hash: c57ea0d
|
/test unit-test |
/test all |
/test unit-test |
@xuyifangreeneyes: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/cherry-pick release-6.5 |
@xuyifangreeneyes: new pull request created to branch In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
What problem does this PR solve?
Issue Number: ref #42160 close #44404
Problem Summary:
What is changed and how it works?
Remove
statistics.Column.Count
.Before the PR, in order to maintain
statistics.Column.Count
, we need to readmysql.stats_top_n
andmysql.stats_buckets
for each column when init stats, which is time-consuming. On the other hand, the usage ofstatistics.Column.Count
is limited. We modify(*DataSource).getColumnNDV
to get rid ofstatistics.Column.Count
.Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.