
Improve parquet ListingTable speed with parquet metadata (short clickbench queries) #11719

Open
alamb opened this issue Jul 30, 2024 · 6 comments
Labels: enhancement (New feature or request)

alamb (Contributor) commented Jul 30, 2024

Is your feature request related to a problem or challenge?

I spent some time looking at the ClickBench results with DataFusion 40.0.0
#11567 (comment) (thanks @pmcgleenon 🙏 )

Specifically, I looked into how we could make some of the already fast queries on the partitioned dataset even faster. Unsurprisingly, for the really fast queries the query time is actually dominated by parquet metadata analysis and DataFusion statistics creation.

For example

ClickBench Q0

SELECT COUNT(*) FROM hits;

To reproduce, run:

cd datafusion
cargo run --release --bin dfbench -- clickbench --iterations 100 --path benchmarks/data/hits_partitioned  --query 0

I profiled this using Instruments. Here are some annotated screenshots:

[Two annotated Instruments profile screenshots]

Some of my takeaways are:

  1. A substantial amount of time is spent reading the parquet metadata twice
  2. A substantial amount of time is spent managing the ScalarValues in statistics

Describe the solution you'd like

It would be cool to make these queries faster by reducing the per-file metadata handling overhead (e.g. don't read the metadata more than once, and figure out some way to make statistics handling more efficient).
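
For illustration only, a minimal sketch of what "don't read the metadata more than once" could look like, assuming decoded footers are keyed by object store path (ParquetMetadataCache and get_or_load are hypothetical names, not an existing DataFusion API):

use std::collections::HashMap;
use std::sync::{Arc, Mutex};

use object_store::path::Path;
use parquet::file::metadata::ParquetMetaData;

/// Hypothetical cache: decode each file's footer once, then share it
/// between statistics collection and scan planning.
#[derive(Default)]
pub struct ParquetMetadataCache {
    entries: Mutex<HashMap<Path, Arc<ParquetMetaData>>>,
}

impl ParquetMetadataCache {
    /// Return the cached metadata for `path`, decoding it with `load`
    /// only on the first request.
    pub fn get_or_load(
        &self,
        path: &Path,
        load: impl FnOnce() -> ParquetMetaData,
    ) -> Arc<ParquetMetaData> {
        let mut entries = self.entries.lock().unwrap();
        Arc::clone(
            entries
                .entry(path.clone())
                .or_insert_with(|| Arc::new(load())),
        )
    }
}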

Describe alternatives you've considered

Note this project isn't broken down into tasks yet.

I think @Ted-Jiang did some work way back to cache parquet metadata

Additional context

No response

Rachelint (Contributor) commented Jul 30, 2024

Some ideas about solving:

> A substantial amount of time is spent managing the ScalarValues in statistics

I plan to try this today.

One simple thing I see is to refactor Statistics to:

pub struct StatisticsInner {
    /// The number of table rows.
    pub num_rows: Precision<usize>,
    /// Total bytes of the table rows.
    pub total_byte_size: Precision<usize>,
    /// Statistics on a column level. It contains a [`ColumnStatistics`] for
    /// each field in the schema of the table to which the [`Statistics`] refer.
    pub column_statistics: Vec<ColumnStatistics>,
}

pub struct Statistics {
   inner: Arc<StatisticsInner>,
}

Cloning the Arc is then trivial (just a reference count bump), no matter how large the column statistics are.
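
To make that concrete, here is a self-contained toy version (fields simplified, not the real DataFusion definitions) showing why the clone becomes cheap:

use std::sync::Arc;

// Simplified stand-in for the proposed layout; the real struct holds
// num_rows, total_byte_size, and column_statistics.
struct StatisticsInner {
    num_rows: usize,
}

#[derive(Clone)]
struct Statistics {
    inner: Arc<StatisticsInner>,
}

fn main() {
    let stats = Statistics {
        inner: Arc::new(StatisticsInner { num_rows: 1_000 }),
    };
    // Cloning only bumps the Arc's reference count; the data behind it
    // is never deep-copied, however many columns it describes.
    let per_file = stats.clone();
    assert_eq!(per_file.inner.num_rows, 1_000);
}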

Rachelint (Contributor) commented
take

alamb (Contributor, Author) commented Jul 30, 2024

That would be a very interesting experiment to try

Rachelint (Contributor) commented Aug 3, 2024

Based on the detailed profile of Q0 in ClickBench below, maybe the optimization work can be divided into three parts:

  • Reduce the cost of cloning and dropping Statistics
  • Maybe optimize the impl of get_statistics_with_limit (it seems temporary vectors may exist, but I'm not sure)
  • Cache the result of the object store list operation (a sketch follows after the profile)

Trying the first possible optimization now.

[dperf flamegraph of the ClickBench Q0 profile]
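
For the third item, a minimal sketch of what a list cache could look like, assuming results are keyed by table prefix (ListCache is a hypothetical name; a real version would also need invalidation when files change):

use std::collections::HashMap;
use std::sync::Mutex;

use object_store::path::Path;
use object_store::ObjectMeta;

/// Hypothetical cache for object store list results, so repeated
/// queries over the same table prefix skip the listing round trip.
#[derive(Default)]
pub struct ListCache {
    entries: Mutex<HashMap<Path, Vec<ObjectMeta>>>,
}

impl ListCache {
    /// Return the cached file listing for `prefix`, if any.
    pub fn get(&self, prefix: &Path) -> Option<Vec<ObjectMeta>> {
        self.entries.lock().unwrap().get(prefix).cloned()
    }

    /// Remember the listing for `prefix` for later queries.
    pub fn insert(&self, prefix: Path, files: Vec<ObjectMeta>) {
        self.entries.lock().unwrap().insert(prefix, files);
    }
}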

alamb (Contributor, Author) commented Aug 5, 2024

#11802 is very nice 👌 It would be fascinating to know what the flamegraph looks like after that PR (i.e. what the next highest bottleneck is)

Rachelint (Contributor) commented

> #11802 is very nice 👌 It would be fascinating to know what the flamegraph looks like after that PR (i.e. what the next highest bottleneck is)

😄 I guess it will be plan creation and the object store list.
