As I was working on z-order support in the RAPIDS Accelerator, I spent some time looking at how it could be applied to NDS runs. In my analysis it turns out that most of the large tables had only a couple of columns that were used as part of predicate push-down. I was able to get a decent performance improvement locally by doing the following, and I think it would be good for us to test this at scale when we do the conversion from CSV to Parquet/etc. before running the queries. These changes would replace the TABLE_PARTITIONING that is currently done as part of nds_transcode.py. It might also be nice to have a way to configure the parallelism of the shuffle based on the table and the scale factor; ideally we would target file sizes of around 1GiB. A rough PySpark sketch of these writes follows the list below.
Sort catalog_returns by the cr_return_amount column. (This should help query 49.)
Sort catalog_sales by the cs_ship_addr_sk column. (This should help queries 16 and 76.)
Sort inventory by the inv_quantity_on_hand column. (This should help queries 82, 37, and 72.)
Sort store_returns by the sr_return_amt column. (This should help query 49.)
Partition store_sales by ss_quantity and sort it by ss_wholesale_cost. You can do this with an orderBy("ss_quantity", "ss_wholesale_cost") and then including a partitionBy("ss_quantity") when writing the data. (This should help queries 49, 28, and 9.)
Sort web_returns by wr_return_amt. (This should help query 49.)
Sort web_sales by ws_net_profit. (This should help queries 49 and 85.)
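Here is a minimal PySpark sketch of what these writes could look like. The paths, the per-table shuffle partition counts, and the session setup are placeholders I made up for illustration; none of this exists in nds_transcode.py today, and the real change would go into its write path.

```python
# Hypothetical sketch only: table paths, partition counts, and the read side are
# assumptions; the real change would live in nds_transcode.py's write path.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nds-layout-experiment").getOrCreate()

# (table, sort columns, shuffle partitions). The partition count is the knob for
# targeting ~1GiB output files and would be derived from the scale factor;
# the numbers here are placeholders.
SORTED_TABLES = [
    ("catalog_returns", ["cr_return_amount"], 200),
    ("catalog_sales", ["cs_ship_addr_sk"], 400),
    ("inventory", ["inv_quantity_on_hand"], 100),
    ("store_returns", ["sr_return_amt"], 200),
    ("web_returns", ["wr_return_amt"], 100),
    ("web_sales", ["ws_net_profit"], 400),
]

for table, sort_cols, num_parts in SORTED_TABLES:
    # orderBy shuffles through a range partitioner, so the shuffle partition
    # count controls how many (and roughly how large) the output files are.
    spark.conf.set("spark.sql.shuffle.partitions", str(num_parts))
    (spark.read.parquet(f"/data/raw/{table}")  # stand-in for the existing CSV read
        .orderBy(*sort_cols)
        .write.mode("overwrite")
        .parquet(f"/data/sorted/{table}"))

# store_sales: global sort on (ss_quantity, ss_wholesale_cost), then a partitioned
# write so each ss_quantity partition stays sorted by ss_wholesale_cost.
(spark.read.parquet("/data/raw/store_sales")
    .orderBy("ss_quantity", "ss_wholesale_cost")
    .write.mode("overwrite")
    .partitionBy("ss_quantity")
    .parquet("/data/sorted/store_sales"))
```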
This is not perfect. There are a lot more columns that could help with other queries, but they would require z-ordering the data, which I will file a follow-on issue to look at.
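As a reference for that follow-on: if the data ends up in Delta Lake, a z-order pass could look roughly like the sketch below. The path and the extra columns are just examples, it reuses the spark session from the sketch above, and it assumes a delta-spark version (2.0+) that exposes the optimize()/executeZOrderBy Python API.

```python
# Hypothetical follow-on sketch: z-order store_sales on additional predicate
# columns with Delta Lake's OPTIMIZE ... ZORDER BY. Path and columns are examples.
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/data/delta/store_sales")
delta_table.optimize().executeZOrderBy("ss_list_price", "ss_net_profit")
```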
Please make sure that as we do this experiment we keep track of the change in ingestion/transcode time, so we can weigh it against the time saved in the queries. We also need to run these tests on both the CPU and the GPU.
It would be really great if we could test out each change in isolation to see how it impacts things, but that feels like a lot of work, and I am not sure it is really worth it.
Another thing to consider is what platform we use when doing this. Depending on how the tables are modified in later stages, the ordering improvements might be lost. We should try the tests with and without the maintenance section and see how Delta and Iceberg deal with this.