[EPIC] Improving Performance #566

andygrove · 2024-06-13T17:54:15Z

This epic is a place to track various ideas around improving query performance.

Some of these ideas apply to upstream DataFusion rather than being Comet-specific.

Planned Work

These are longer term ideas to explore.

Avoid transition to Comet in some cases if the overhead of R2C and C2R outweighs the benefit #571
Use selection vectors instead of copying batches during filter operations (see paper at https://www.pdl.cmu.edu/ftp/Database/ngom-damon2021.pdf)
Implement StringView / BinaryView
- Can particularly help in the Parquet scan/filter case to avoid string copies. See Andrew Lamb's talk from the June 24 Bay Area DataFusion Meetup for more info.
Should we implement native versions of RowToColumnar / ColumnarToRow?
Check that we are removing FilterExec when all filter conditions are successfully pushed down to Parquet, to avoid evaluating the filter predicates twice
Implement tooling for saving the output of a query stage to disk so that we can benchmark individual query stages in Rust, outside of Spark
Spark will remove dictionary-encoding after a cast (and maybe after other expressions?), but it could be advantageous to retain the dictionary encoding so that upstream native operators can take advantage of it
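
To make the selection-vector idea from the list above a bit more concrete, here is a rough sketch in plain Rust (standard library only). It is not Comet or DataFusion code: `filter_copy`, `filter_select`, and `sum_selected` are hypothetical names, and a plain `&[i64]` stands in for an Arrow array. The point is only to show that a filter can emit row indices and leave the column buffer untouched, and that chained filters and downstream operators can read through those indices.

```rust
/// Copying filter: materializes a new column that holds only the matching values.
fn filter_copy(values: &[i64], pred: impl Fn(i64) -> bool) -> Vec<i64> {
    values.iter().copied().filter(|v| pred(*v)).collect()
}

/// Selection-vector filter: leaves `values` untouched and returns the indices of
/// the rows that pass. `input_sel` is the selection produced by an earlier
/// filter, if any, so that filters can be chained without compacting the data.
fn filter_select(
    values: &[i64],
    input_sel: Option<&[u32]>,
    pred: impl Fn(i64) -> bool,
) -> Vec<u32> {
    match input_sel {
        // Only re-check rows that survived the previous filter.
        Some(sel) => sel
            .iter()
            .copied()
            .filter(|&i| pred(values[i as usize]))
            .collect(),
        // No previous selection: every row is a candidate.
        None => (0..values.len() as u32)
            .filter(|&i| pred(values[i as usize]))
            .collect(),
    }
}

/// A downstream operator that reads through the selection vector instead of
/// requiring a compacted copy of the column.
fn sum_selected(values: &[i64], sel: &[u32]) -> i64 {
    sel.iter().map(|&i| values[i as usize]).sum()
}
```

With `values = [5, 12, 7, 30, 18]`, calling `filter_select` with `v > 6` and then again with `v % 2 == 0` (passing the first result as `input_sel`) selects the same rows that `filter_copy` would copy out, but only the index vectors are allocated. Whether this wins in practice depends on selectivity, which is what the linked paper examines.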

Use JIT for evaluating nested expressions to avoid intermediate arrays (there was a datafusion-jit module, but it was abandoned, so we need to see why)
- Update: previous experiments in Arrow/DataFusion did not show a speedup from this approach and it added a lot of extra code and complexity
Use mutable vectors during expression evaluation to avoid intermediate arrays (in-place updates are available in arrow-rs)
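
As a rough illustration of the mutable-vector idea, the sketch below evaluates `(x + 1) * 2` two ways in plain Rust. It deliberately does not use the arrow-rs API: `eval_allocating` and `eval_in_place` are hypothetical names and `Vec<i64>` stands in for an Arrow buffer. The assumption is that the expression evaluator owns the input buffer exclusively, which is what makes reusing it safe.

```rust
/// Straightforward evaluation: each sub-expression allocates a new buffer,
/// so `x + 1` produces an intermediate array that is dropped right away.
fn eval_allocating(x: &[i64]) -> Vec<i64> {
    let plus_one: Vec<i64> = x.iter().map(|v| v + 1).collect(); // intermediate array
    plus_one.iter().map(|v| v * 2).collect()
}

/// In-place evaluation: takes ownership of the input and overwrites it at each
/// step, so no intermediate buffer is allocated.
fn eval_in_place(mut x: Vec<i64>) -> Vec<i64> {
    for v in x.iter_mut() {
        *v += 1; // x + 1, written back into the same buffer
    }
    for v in x.iter_mut() {
        *v *= 2; // (x + 1) * 2, still the same buffer
    }
    x
}
```

Both functions return the same values; the difference is that `eval_in_place` hands back the allocation it was given. In a real engine the in-place path would only be available when the buffer is not shared with another array, so a fallback to the allocating path would still be needed.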

Related Comet issues:

No response


andygrove added the enhancement (New feature or request) and performance labels Jun 13, 2024

andygrove added this to the 0.2.0 milestone Jun 14, 2024

andygrove removed this from the 0.2.0 milestone Aug 16, 2024