
Explore some specific settings to better optimize the data layout #130

Open

revans2 opened this issue Oct 21, 2022 · 0 comments

Labels: performance (Related to plugin performance improvements)
revans2 commented Oct 21, 2022

As I was working on zorder support in the rapids accelerator, I spent some time seeing how it could be applied to NDS runs. In my analysis, it turned out that most of the large tables had only a couple of columns that were used as part of predicate pushdown. I was able to get a decent performance improvement locally by doing the following, and I think it would be good for us to test this at scale when we do the conversion from CSV to parquet/etc. before running the queries. These would replace the TABLE_PARTITIONING that is currently done as part of nds_transcode.py. It might also be nice to have a way to configure the parallelism of the shuffle based on the table and the scale factor. Ideally we would target file sizes of around 1 GiB or so.
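A minimal sketch of what that sizing logic might look like. The 1 GiB target is from above; the per-table byte estimates and the helper name are made up for illustration, not measured values:

```python
# Rough sketch: pick a shuffle partition count per table so each output
# file lands near the ~1 GiB target.

TARGET_FILE_BYTES = 1 << 30  # ~1 GiB per output file

# Hypothetical uncompressed sizes at scale factor 1; real numbers would
# need to be measured from the generated data.
BASE_TABLE_BYTES = {
    "store_sales": 380 * 1024 * 1024,
    "inventory": 220 * 1024 * 1024,
}

def shuffle_partitions(table: str, scale_factor: int) -> int:
    """Partition count so each written file is roughly TARGET_FILE_BYTES."""
    estimated = BASE_TABLE_BYTES[table] * scale_factor
    return max(1, estimated // TARGET_FILE_BYTES)

# This could then be applied before the sorted write, e.g.:
# spark.conf.set("spark.sql.shuffle.partitions",
#                str(shuffle_partitions("store_sales", scale_factor)))
```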

  1. Sort catalog_returns by the cr_return_amount column. (This should help query 49.)
  2. Sort catalog_sales by the cs_ship_addr_sk column. (This should help queries 16 and 76.)
  3. Sort inventory by the inv_quantity_on_hand column. (This should help queries 82, 37, and 72.)
  4. Sort store_returns by the sr_return_amt column. (This should help query 49.)
  5. Partition store_sales by ss_quantity and sort it by ss_wholesale_cost. You can do this with an orderBy("ss_quantity", "ss_wholesale_cost") followed by a partitionBy("ss_quantity") when writing the data; see the sketch after this list. (This should help queries 49, 28, and 9.)
  6. Sort web_returns by the wr_return_amt column. (This should help query 49.)
  7. Sort web_sales by the ws_net_profit column. (This should help queries 49 and 85.)
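
As a rough illustration of the whole set, here is a hedged PySpark sketch. Reading the tables via spark.table, the /output paths, and the overwrite mode are all assumptions about the surrounding transcode setup, not part of the actual nds_transcode.py code:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nds-layout-experiment").getOrCreate()

# Tables that only need a global sort on a single predicate column
# (items 1-4, 6, and 7 above).
SORT_COLUMNS = {
    "catalog_returns": "cr_return_amount",
    "catalog_sales": "cs_ship_addr_sk",
    "inventory": "inv_quantity_on_hand",
    "store_returns": "sr_return_amt",
    "web_returns": "wr_return_amt",
    "web_sales": "ws_net_profit",
}

for table, column in SORT_COLUMNS.items():
    # Assumes the raw tables are already registered as views/tables.
    df = spark.table(table)
    df.orderBy(column).write.mode("overwrite").parquet(f"/output/{table}")

# store_sales (item 5): sort by both columns, then partition the output
# files by ss_quantity on the write.
(spark.table("store_sales")
    .orderBy("ss_quantity", "ss_wholesale_cost")
    .write.mode("overwrite")
    .partitionBy("ss_quantity")
    .parquet("/output/store_sales"))
```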

This is not perfect. There are a lot more columns that could help with other queries, but they would require Z-ordering the data; I will file a follow-on issue to look at that.

Please make sure that as we do this experiment we keep track of the change in ingestion/transcode time, so we can weigh it against the time saved in the queries. We also need to run these tests on both the CPU and the GPU.

It would be really great if we could test out each change in isolation to see how it impacts things, but that feels like a lot of work, and I am not sure it is really worth it.

Another thing to consider is what platform we use when doing this. Depending on how the tables are modified in later stages, the ordering improvements might be lost. We should try the tests with and without the maintenance section and see how Delta and Iceberg deal with these changes.

mattahrens added the performance (Related to plugin performance improvements) label Oct 25, 2022