As I was working on z-order support in the RAPIDS Accelerator, I spent some time looking at how it could be applied to NDS runs. In my analysis it turns out that most of the large tables had only a couple of columns that were used as part of predicate push-down. I was able to get a decent performance improvement locally by doing the following, and I think it would be good for us to test this at scale when we do the conversion from CSV to Parquet/etc. before running the queries. These changes would replace the TABLE_PARTITIONING that is currently done as part of nds_transcode.py. It might also be nice to have a way to configure the parallelism of the shuffle based on the table and the scale factor; ideally we would target file sizes of around 1GiB. A rough PySpark sketch of these writes follows the list below.
Sort catalog_returns by the cr_return_amount column. (This should help query 49.)
Sort catalog_sales by the cs_ship_addr_sk column. (This should help queries 16 and 76.)
Sort inventory by the inv_quantity_on_hand column. (This should help queries 82, 37, and 72.)
Sort store_returns by the sr_return_amt column. (This should help query 49.)
Partition store_sales by ss_quantity and sort it by ss_wholesale_cost. You can do this with an orderBy("ss_quantity", "ss_wholesale_cost") and then including a partitionBy("ss_quantity") when writing the data. (This should help queries 49, 28, and 9.)
Sort web_returns by wr_return_amt. (This should help query 49.)
Sort web_sales by ws_net_profit. (This should help queries 49 and 85.)
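Here is a minimal PySpark sketch of what these writes could look like. The paths, the per-table shuffle partition counts, and the session setup are placeholders I made up for illustration; none of this exists in nds_transcode.py today, and the real change would go into its write path.

```python
# Hypothetical sketch only: table paths, partition counts, and the read side are
# assumptions; the real change would live in nds_transcode.py's write path.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nds-layout-experiment").getOrCreate()

# (table, sort columns, shuffle partitions). The partition count is the knob for
# targeting ~1GiB output files and would be derived from the scale factor;
# the numbers here are placeholders.
SORTED_TABLES = [
    ("catalog_returns", ["cr_return_amount"], 200),
    ("catalog_sales", ["cs_ship_addr_sk"], 400),
    ("inventory", ["inv_quantity_on_hand"], 100),
    ("store_returns", ["sr_return_amt"], 200),
    ("web_returns", ["wr_return_amt"], 100),
    ("web_sales", ["ws_net_profit"], 400),
]

for table, sort_cols, num_parts in SORTED_TABLES:
    # orderBy shuffles through a range partitioner, so the shuffle partition
    # count controls how many (and roughly how large) the output files are.
    spark.conf.set("spark.sql.shuffle.partitions", str(num_parts))
    (spark.read.parquet(f"/data/raw/{table}")  # stand-in for the existing CSV read
        .orderBy(*sort_cols)
        .write.mode("overwrite")
        .parquet(f"/data/sorted/{table}"))

# store_sales: global sort on (ss_quantity, ss_wholesale_cost), then a partitioned
# write so each ss_quantity partition stays sorted by ss_wholesale_cost.
(spark.read.parquet("/data/raw/store_sales")
    .orderBy("ss_quantity", "ss_wholesale_cost")
    .write.mode("overwrite")
    .partitionBy("ss_quantity")
    .parquet("/data/sorted/store_sales"))
```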
This is not perfect. There are a lot more columns that could help with other queries, but they would require z-ordering the data, which I will file a follow-on issue to look at.
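As a reference for that follow-on: if the data ends up in Delta Lake, a z-order pass could look roughly like the sketch below. The path and the extra columns are just examples, it reuses the spark session from the sketch above, and it assumes a delta-spark version (2.0+) that exposes the optimize()/executeZOrderBy Python API.

```python
# Hypothetical follow-on sketch: z-order store_sales on additional predicate
# columns with Delta Lake's OPTIMIZE ... ZORDER BY. Path and columns are examples.
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/data/delta/store_sales")
delta_table.optimize().executeZOrderBy("ss_list_price", "ss_net_profit")
```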
Please make sure that as we do this experiment we keep track of the change in ingestion/transcode time, so we can weigh it against the time saved in the queries. We also need to run these tests on both the CPU and the GPU.
It would be really great if we could test out each change in isolation to see how it impacts things, but that feels like a lot of work, and I am not sure it is really worth it.
Another thing to consider is what platform we use when doing this. Depending on how the tables are modified in later stages, the ordering improvements might be lost. We should try the tests with and without the maintenance section and see how Delta and Iceberg deal with this.