Add small column on empty projection #7833

Merged
merged 7 commits into apache:main on Oct 18, 2023

Conversation

ch-sc
Contributor

@ch-sc ch-sc commented Oct 16, 2023

Which issue does this PR close?

Improves #3214.

Rationale for this change

If a projection is empty, we add the first column of the input schema, because some parts of DataFusion still rely on having at least one column. Instead of always selecting the first column, these changes select a column with a smaller memory size. The memory size is estimated from the column's data type.
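
A minimal sketch of the selection idea, not the code in this PR. The `ColType` enum, `size_bits`, and `smallest_column` are hypothetical names introduced for illustration, and the size assigned to `Utf8` is an assumed placeholder for a variable-sized type.

```rust
// Hypothetical, simplified stand-in for a column data type
// (not Arrow's `DataType`).
#[derive(Debug, Clone, PartialEq)]
enum ColType {
    Boolean,
    Int32,
    Int64,
    Utf8,
}

// Assumed size estimate in bits. The `Utf8` value is a placeholder
// heuristic, since the real size depends on the data.
fn size_bits(t: &ColType) -> usize {
    match t {
        ColType::Boolean => 1,
        ColType::Int32 => 32,
        ColType::Int64 => 64,
        ColType::Utf8 => 128,
    }
}

// Return the index of the column with the smallest estimated size.
// `min_by_key` keeps the first element on ties, so an all-equal schema
// falls back to the first column.
fn smallest_column(schema: &[(&str, ColType)]) -> Option<usize> {
    schema
        .iter()
        .enumerate()
        .min_by_key(|(_, (_, t))| size_bits(t))
        .map(|(i, _)| i)
}
```

With a schema of `name: Utf8`, `id: Int64`, `flag: Boolean`, this sketch picks `flag` rather than the first column.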

What changes are included in this PR?

Are these changes tested?

Basic unit tests for new logic are included. All tests that involve query planning and empty projections execute this code.

Are there any user-facing changes?

@github-actions bot added the optimizer (Optimizer rules), core (Core DataFusion crate), and sqllogictest (SQL Logic Tests (.slt)) labels on Oct 16, 2023
// Get the projection exprs from columns in the order of the schema
/// Accumulate the memory size of a data type measured in bits.
///
/// Nested types are traversed and increment `nesting` on every level.
Contributor

Can we add a comment saying that variable-sized types are estimated using some heuristics?

Contributor Author

Makes sense. Added a comment about variable-sized types. Feel free to rephrase if you think something is missing.
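
To illustrate what "estimated using some heuristics" could mean for a recursive size function, here is a rough sketch. It uses a hypothetical `Ty` enum rather than Arrow's `DataType`, and the fixed value for `Utf8` is an assumption chosen only so that variable-sized types rank above the primitive types shown.

```rust
// Hypothetical, simplified data type; not the Arrow `DataType` enum.
enum Ty {
    Boolean,
    Int32,
    Utf8,
    List(Box<Ty>),
    Struct(Vec<Ty>),
}

// Accumulate an estimated size in bits and count nesting levels.
fn nested_size(ty: &Ty, nesting: &mut usize) -> usize {
    match ty {
        Ty::Boolean => 1,
        Ty::Int32 => 32,
        // Variable-sized type: the real size is unknown at planning time,
        // so a fixed placeholder is used (assumed value).
        Ty::Utf8 => 128,
        // Nested types bump `nesting` once per level, then recurse.
        Ty::List(inner) => {
            *nesting += 1;
            nested_size(inner, nesting)
        }
        Ty::Struct(fields) => {
            *nesting += 1;
            fields.iter().map(|f| nested_size(f, nesting)).sum()
        }
    }
}
```

The `nesting` counter gives a caller a second criterion besides the bit estimate, for example to prefer a flat column over a nested one of similar size.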

LargeList(f) => nested_size(f.data_type(), nesting),
Struct(fields) => fields
    .iter()
    .map(|f| nested_size(f.data_type(), nesting))
Contributor

In principle we could project a sub-field from a struct instead of the entire struct (all columns).

Contributor Author

@ch-sc ch-sc Oct 18, 2023

Good idea, I will play around with it. Though it sounds like a rare edge case to me where no other "smaller" type would be present in the schema!?

Contributor

Yeah indeed :)
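
For reference, the sub-field idea from this thread could look roughly like the sketch below. It is not part of this PR. The `Ty` enum and `smallest_leaf` are hypothetical, and the bit sizes are the same kind of assumed estimates as elsewhere in this discussion.

```rust
// Hypothetical, simplified data type with named struct fields.
enum Ty {
    Boolean,
    Int64,
    Utf8,
    Struct(Vec<(String, Ty)>),
}

// Return the estimated size in bits of the smallest leaf field and the
// path of field names leading to it, starting at `name`.
fn smallest_leaf(name: &str, ty: &Ty) -> (usize, Vec<String>) {
    match ty {
        Ty::Boolean => (1, vec![name.to_string()]),
        Ty::Int64 => (64, vec![name.to_string()]),
        // Assumed placeholder size for a variable-sized type.
        Ty::Utf8 => (128, vec![name.to_string()]),
        Ty::Struct(fields) => fields
            .iter()
            .map(|(n, t)| smallest_leaf(n, t))
            .min_by_key(|(bits, _)| *bits)
            .map(|(bits, mut path)| {
                // Prefix the child's path with this struct's name.
                path.insert(0, name.to_string());
                (bits, path)
            })
            // An empty struct has no leaves; treat it as size 0.
            .unwrap_or((0, vec![name.to_string()])),
    }
}
```

Instead of projecting the whole struct, a planner could then project only the returned path, for example `s.inner.ok`.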

Contributor

@Dandandan Dandandan left a comment

awesome @ch-sc ! I left a few comments.

This will yield some nice performance improvements for SELECT COUNT(*) FROM [source] queries, even without solving #3214.

@Dandandan
Contributor

The change seems uncontroversial and has some good tests, so merging seems fine.

Thank you @ch-sc 😊

@Dandandan Dandandan merged commit 7acd883 into apache:main Oct 18, 2023
22 checks passed