Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize count(*) with table statistics #620

Merged
merged 12 commits into from
Jun 28, 2021

Conversation

Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Jun 25, 2021

Which issue does this PR close?

Closes #618

Rationale for this change

Speeding up queries (or part of a query) that can utilize the availability of the number of rows in the table statistics. Often used to inspect tables.
Not having to scan the data can save considerable time / cost.

On the TPC-H lineitem table we can see a 80x speed up:

> CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '../benchmarks/parquet/lineitem/';
> select count(*) from X;
+------------------+
| COUNT(Uint8(1)) |
+------------------+
| 6001214          |
+------------------+
1 row in set. Query took 0.001 seconds.

Master:

> CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '../benchmarks/parquet/lineitem/';
> select count(*) from X;
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 6001214         |
+-----------------+
1 row in set. Query took 0.081 seconds.

What changes are included in this PR?

Are there any user-facing changes?

@Dandandan Dandandan marked this pull request as draft June 25, 2021 22:41
@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Jun 25, 2021
@Dandandan Dandandan changed the title [WIP] Optimize count(*) with table statistics Optimize count(*) with table statistics Jun 26, 2021
@Dandandan Dandandan marked this pull request as ready for review June 26, 2021 13:24
alamb
alamb previously approved these changes Jun 27, 2021
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great -- thank you @Dandandan.

I have filed https://github.com/influxdata/influxdb_iox/issues/1815 to get IOx using this upstream. Great stuff.

@@ -108,6 +108,11 @@ pub trait TableProvider: Sync + Send {
/// Statistics should be optional because not all data sources can provide statistics.
fn statistics(&self) -> Statistics;

/// Returns whether statistics provided are exact values or estimates
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nice thing about adding a has_exact_statistics is that it is a backwards compatible API

An alternate might be to encapsulate the "Exact statistics or not" into a field on Statistics itself, which feels to me like it keeps related things together more, but has the downside of changing Statistics / APIs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a similar thought process I had.
Maybe at some point it would also be nice to tell what parts of the statistics are exact (e.g. number of rows) and what estimated (such as distinct count).

datafusion/src/optimizer/aggregate_statistics.rs Outdated Show resolved Hide resolved
datafusion/src/optimizer/aggregate_statistics.rs Outdated Show resolved Hide resolved
.unwrap();
let expected = "\
Projection: #COUNT(UInt8(1))\
\n Projection: UInt64(100) AS COUNT(Uint8(1))\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@alamb alamb dismissed their stale review June 27, 2021 11:18

I think we need to check for empty gby as well

@Dandandan
Copy link
Contributor Author

Dandandan commented Jun 27, 2021

After this PR I also plan some follow up issues:

  • also supporting min/max statistics

Copy link
Member

@houqp houqp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff 👍 I recommend using logical plan builder to construct the new plans to make things more robust.

.unwrap();

let plan = ctx
.create_logical_plan("select sum(a)/count(*) from test")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a cool optimization 👍

@alamb
Copy link
Contributor

alamb commented Jun 28, 2021

Good stuff 👍 I recommend using logical plan builder to construct the new plans to make things more robust.

Perhaps we can do this as a follow on PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate performance Make DataFusion faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use Parquet statistics for count(*)
3 participants