ARROW2: Performance benchmark #1652
Here are some of the TPC-H results I got from running our tpch benchmark suite on an 8-core x86-64 Linux machine. The base commit I used as the arrow-rs baseline is 2008b1d; the commit for arrow2 is c0c9c72.

Default single-partition CSV files generated from our tpch gen script (--batch-size 4096):

CSV tables partitioned into 16 files and processed with 8 DataFusion partitions (--batch-size 4096 --partitions 8):

Data generated with:
Parquet tables partitioned into 8 files and processed with 8 DataFusion partitions (--batch-size 4096 --partitions 8):

Note that query 7 could not complete with arrow2 due to OOM; it is a known issue that the arrow2 parquet reader currently consumes almost double the memory of arrow-rs. Related upstream issue: jorgecarleitao/arrow2#768. Q1 is significantly slower in arrow2 compared to the other queries (perhaps related to predicate pushdown?). I think both of these regressions require a deep dive before we merge arrow2 into master. Overall, arrow2 is around 5% faster for CSV tables and 10-20% faster for Parquet tables across the board.
Big 👍 to this, getting some concrete numbers would be really nice. FWIW some ideas for whoever picks this up that I at least would be very interested in:
A number of the performance improvements will be in arrow-rs 8.0.0, though some, such as apache/arrow-rs#1180, will not be released until arrow-rs 9.0.0.
Bummer that the dictionary encoding optimization will need to wait for the 9.0 release, but we can test it with a 9.0 datafusion branch if needed.
I think a big part of the perf improvement comes from Parquet predicate pushdown (e.g., stats, dictionary, bloom filter and column index). Does either arrow-rs or arrow2 currently implement (some of) these?
DataFusion does implement row group pruning based on the statistics that arrow-rs creates.

It will (in the version after 8.0.0) preserve dictionary encoding (meaning the output of a dictionary-encoded column will also be dictionary encoded), which should be a large win for low-cardinality columns.

It does not currently create or use column indexes or bloom filters.

I believe that @tustvold has a plan for implementing more sophisticated predicate pushdown (i.e., a filter on one column could be used to avoid decoding swaths of others), but I am not sure what the timeline on that is -- apache/arrow-rs#1191
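For readers unfamiliar with statistics-based row group pruning, here is a minimal std-only Rust sketch of the idea. All the names here (`RowGroupStats`, `can_skip_eq`) are invented for illustration; DataFusion's actual implementation lives in its pruning-predicate machinery and operates on arrow arrays of statistics, not plain structs.

```rust
/// Min/max statistics for one column in one row group,
/// as recorded in the Parquet footer (simplified).
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// For a predicate `col = value`, a row group can be skipped when the
/// value falls outside the [min, max] range in its statistics.
fn can_skip_eq(stats: &RowGroupStats, value: i64) -> bool {
    value < stats.min || value > stats.max
}

fn main() {
    let groups = vec![
        RowGroupStats { min: 0, max: 99 },
        RowGroupStats { min: 100, max: 199 },
    ];
    // Predicate: col = 150 -- only the second row group can match,
    // so only it needs to be decoded.
    let to_read: Vec<usize> = groups
        .iter()
        .enumerate()
        .filter(|(_, s)| !can_skip_eq(s, 150))
        .map(|(i, _)| i)
        .collect();
    println!("{:?}", to_read); // prints [1]
}
```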
This is nice 👍. By dictionary-based row group filtering I mean we can first decode only the dictionary pages for the column chunks that the pushed-down predicates (e.g., an IN predicate) are associated with, and skip the whole row group if no keys in the dictionary match.
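The dictionary-based filtering described above can be sketched in std-only Rust as follows. The function name and string-typed dictionary are invented for illustration; a real implementation would work on Parquet dictionary pages and would also have to handle chunks that fall back out of dictionary encoding, as noted in the next comment.

```rust
use std::collections::HashSet;

/// Sketch of dictionary-based row group filtering for an IN predicate:
/// decode only the dictionary page of the column chunk, and report
/// whether any dictionary value appears in the IN list. If not, the
/// whole row group can be skipped without decoding its data pages.
fn row_group_can_match(dictionary: &[&str], in_list: &HashSet<&str>) -> bool {
    dictionary.iter().any(|v| in_list.contains(v))
}

fn main() {
    // Predicate: country IN ('DE', 'FR')
    let in_list: HashSet<&str> = ["DE", "FR"].into_iter().collect();

    // A row group whose dictionary holds only other values: skippable.
    assert!(!row_group_can_match(&["US", "CA", "MX"], &in_list));

    // A row group whose dictionary contains a matching key: must be read.
    assert!(row_group_can_match(&["US", "FR"], &in_list));

    println!("ok");
}
```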
this would definitely be a neat optimization
You'd need to check that the column chunk hadn't spilled out of dictionary encoding, but yes, that is broadly speaking the access pattern I wish to support with the work in apache/arrow-rs#1191. Specifically: exploit apache/arrow-rs#1180 to cheaply get dictionary data, evaluate predicates on it, and use this to construct a refined row selection to fetch. As @alamb alludes to, the timelines for this effort are not clear, and there may be other more pressing things for me to spend time on, but it is definitely something that I hope to deliver in the next month or so.
AFAIK both support reading and writing dictionary-encoded arrays to dictionary-encoded column chunks (but neither supports pushdown based on dictionary values atm).

TBH, IMO the biggest limiting factor in implementing anything in parquet is its lack of documentation - it is just a lot of work to decipher what is being requested, and the situation is not improving. For example, I spent a lot of time understanding the encodings, have a PR to try to help future folks implementing them, and it has been lingering for ~9 months now. I wish parquet was a bit more inspired by e.g. Arrow or ORC in this respect.

Note that datafusion's benchmarks only use "required" / non-nullable data, so most optimizations on null values are not seen from datafusion's perspective. Last time I benched, arrow2/parquet2 was much faster on nullable data; I am a bit surprised to see so many differences in non-nullable data.
arrow-rs 8 has some improvements for nullable columns as well now, courtesy of @tustvold: apache/arrow-rs#1054
Checking in on this - has anyone rerun TPC-H on the master and arrow2 branches since the arrow-rs 9 release?
I added some further benchmarks in #1738 which I would also be interested in the numbers for.
More than half a year has passed, and I believe arrow-rs has made great progress. Does anyone have the latest benchmark data?
I have not really been following arrow2 very closely -- I am not sure when the last time anyone got datafusion to work enough with arrow2 to run the TPCH benchmarks 🤷 |
I will bump the arrow branch and make a comparison.
My 2 cents is that, as mentioned by @sunchao above, parquet predicate pushdown makes a drastic difference to performance. Whilst there are still some kinks to iron out in the DataFusion integration, arrow-rs now supports the full suite, including late materialisation. Given the two were previously roughly in the same ballpark, and significant optimisations have landed in arrow-rs since then, with predicate pushdown I think the performance story for arrow-rs is pretty compelling, at least w.r.t. parquet.

Arrow2 only supports row-group-level pruning, as page pruning requires skippable decode and more sophisticated handling of repetition levels, so it is a fair way behind on this. This has been borne out by some benchmarks I did, but the TPCH benchmarks would be very interesting to see as well.

I mention this mainly because previous attempts to update the arrow2 branch have gotten stuck in merge conflict hell, and perhaps I can save you some time, and sanity 😅
For TPC-H it might be that the pruning wouldn't help a lot, as most of the data is AFAIK very evenly distributed, so it is quite hard to prune based on min/max values without sorting the data first - at least at the row group level this was the case. But maybe page pruning works in some cases; that would be great.
FWIW we support late materialization now, in addition to row group and page pruning, and it doesn't rely on aggregate statistics. It instead evaluates predicates on the subset of the columns needed by the predicate, and then uses the result to conditionally materialize the other projected columns. This isn't always advantageous, e.g. for predicates that don't eliminate large numbers of rows, and we need to develop better heuristics / bailing logic, but where it works the benefits can be huge.
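The late materialization described above can be illustrated with a minimal std-only Rust sketch. The function names and plain-slice columns are invented for this sketch; the real arrow-rs implementation works on arrow arrays and row selections, and can avoid decoding the skipped rows entirely rather than filtering them after the fact.

```rust
/// Step 1: evaluate the predicate on just the column(s) it references,
/// producing a boolean row selection. (Hypothetical predicate: v > threshold.)
fn build_selection(pred_col: &[i64], threshold: i64) -> Vec<bool> {
    pred_col.iter().map(|v| *v > threshold).collect()
}

/// Step 2: materialize only the selected rows from another projected
/// column. In a real reader this selection would drive which pages and
/// rows are decoded at all.
fn materialize<T: Copy>(col: &[T], selection: &[bool]) -> Vec<T> {
    col.iter()
        .zip(selection)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .collect()
}

fn main() {
    let qty = [10_i64, 60, 5, 70]; // predicate column
    let price = [1.0_f64, 2.0, 3.0, 4.0]; // other projected column

    let sel = build_selection(&qty, 50);
    // Only rows passing the predicate are materialized from `price`.
    assert_eq!(materialize(&price, &sel), vec![2.0, 4.0]);
    println!("ok");
}
```

When the predicate is selective, the work saved by not decoding the unselected rows of the remaining columns dominates; when it selects most rows, the extra pass can cost more than it saves, which is the heuristics problem mentioned above.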
Yes, I realized that, but also, having seen some of the TPC-H generated data and filters, I would be (pleasantly) surprised if it could prune out many pages.
I have it on my list to propose enabling the pushdown filtering by default (see #3828) -- I plan to run the TPCH benchmarks as part of that exercise to see how much / if at all they help |
From benchmarks, I think pushdown filtering is very helpful.
I tried modifying the filter condition in TPC-H q1, and with the page index it really helped 😂. It can give a big boost sometimes in the real world.
Given jorgecarleitao/arrow2#1429 I don't think this issue is relevant any more -- please let me know if you feel otherwise. |
Perform thorough testing to make sure the arrow2 branch will indeed result in better performance across the board.
Please give https://github.com/apache/arrow-datafusion/tree/arrow2 a try for your workloads and report any regression you find here.