[EPIC] Improving Performance #566

Open
andygrove opened this issue Jun 13, 2024 · 0 comments
Labels: enhancement, performance

andygrove commented Jun 13, 2024

What is the problem the feature request solves?

This epic is a place to track various ideas around improving query performance.

Some of these ideas apply to upstream DataFusion rather than being Comet-specific.

Planned Work

Ideas to Research

These are longer term ideas to explore.

  • Avoid transitioning to Comet in cases where the overhead of row-to-columnar (R2C) and columnar-to-row (C2R) conversion outweighs the benefit #571
  • Use selection vectors instead of copying batches during filter operations (see paper at https://www.pdl.cmu.edu/ftp/Database/ngom-damon2021.pdf)
  • Implement StringView / BinaryView
    • Can particularly help in the Parquet scan/filter case to avoid string copies. See Andrew Lamb's talk from the June 24 Bay Area DataFusion Meetup for more info.
  • Should we implement native versions of RowToColumnar / ColumnarToRow?
  • Check that we are removing FilterExec when all filter conditions are successfully pushed down to Parquet, to avoid evaluating the filter predicates twice
  • Implement tooling for saving the output of a query stage to disk so that we can benchmark individual query stages in Rust, outside of Spark
  • Spark will remove dictionary encoding after a cast (and maybe after other expressions?), but it could be advantageous to retain the dictionary encoding so that upstream native operators can take advantage of it
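
To make the selection-vector idea above concrete, here is a minimal sketch in plain Rust. It does not use Comet, DataFusion, or arrow-rs types: `Batch`, `filter`, `materialize`, and `filter_copy` are hypothetical names invented for illustration, and a real implementation would operate on Arrow arrays. The point it tries to show is that a filter can narrow a list of surviving row indices and leave the column data untouched, deferring any copy until an operator actually needs a dense batch.

```rust
/// Hypothetical column batch: the values plus an optional selection vector
/// holding the indices of rows that are still "alive".
pub struct Batch {
    pub values: Vec<i64>,
    pub selection: Option<Vec<u32>>,
}

impl Batch {
    pub fn new(values: Vec<i64>) -> Self {
        Batch { values, selection: None }
    }

    /// Filter without copying `values`: only the selection vector shrinks.
    pub fn filter(&mut self, pred: impl Fn(i64) -> bool) {
        let values = &self.values;
        let n = values.len() as u32;
        let keep = |i: &u32| pred(values[*i as usize]);
        let selected: Vec<u32> = match self.selection.take() {
            // Already filtered once: test only the rows that survived.
            Some(sel) => sel.into_iter().filter(keep).collect(),
            // First filter: test every row.
            None => (0..n).filter(keep).collect(),
        };
        self.selection = Some(selected);
    }

    /// Produce a dense copy of the surviving rows, only when required.
    pub fn materialize(&self) -> Vec<i64> {
        match &self.selection {
            Some(sel) => sel.iter().map(|&i| self.values[i as usize]).collect(),
            None => self.values.clone(),
        }
    }
}

/// For comparison: the copying approach allocates a new array per filter.
pub fn filter_copy(values: &[i64], pred: impl Fn(i64) -> bool) -> Vec<i64> {
    values.iter().copied().filter(|&v| pred(v)).collect()
}
```

Chaining two `filter` calls on a `Batch` touches the values buffer only for reads, whereas chaining two `filter_copy` calls allocates two output arrays. Whether that trade-off wins in practice depends on selectivity and on how well downstream operators handle indirect access, which is what the linked paper examines.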

Ideas no longer being pursued

  • Use JIT for evaluating nested expressions to avoid intermediate arrays (there was a datafusion-jit module, but it was abandoned, so we need to see why)
    • Update: previous experiments in Arrow/DataFusion did not show a speedup from this approach, and it added a lot of extra code and complexity
  • Use mutable vectors during expression evaluation to avoid intermediate arrays (in-place updates are available in arrow-rs)
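
For context on what "avoiding intermediate arrays" means in the two ideas above, the following is a small sketch in plain Rust, assuming the expression `(a + b) * c` over `f64` columns. It uses `Vec<f64>` rather than arrow-rs arrays, and the function names are hypothetical. The first function allocates a temporary for `a + b`; the second reuses the buffer of `a` for every step.

```rust
/// Operator-at-a-time evaluation: `a + b` is written to a temporary array
/// before the multiplication allocates the final result.
pub fn eval_with_intermediates(a: &[f64], b: &[f64], c: &[f64]) -> Vec<f64> {
    let sum: Vec<f64> = a.iter().zip(b).map(|(x, y)| x + y).collect(); // intermediate
    sum.iter().zip(c).map(|(s, z)| s * z).collect()
}

/// In-place evaluation: take ownership of `a` and overwrite its buffer,
/// so no additional arrays are allocated.
pub fn eval_in_place(mut a: Vec<f64>, b: &[f64], c: &[f64]) -> Vec<f64> {
    for (x, y) in a.iter_mut().zip(b) {
        *x += y;
    }
    for (x, z) in a.iter_mut().zip(c) {
        *x *= z;
    }
    a
}
```

The in-place version is only valid when the caller holds the sole reference to the input buffer, which is the main practical constraint on applying this to shared columnar data.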

Related DataFusion issues

Related Comet issues:

Describe the potential solution

No response

Additional context

No response

@andygrove andygrove added enhancement New feature or request performance labels Jun 13, 2024
@andygrove andygrove added this to the 0.2.0 milestone Jun 14, 2024
@andygrove andygrove removed this from the 0.2.0 milestone Aug 16, 2024