Datafusion v19.rc1 scan parquet 20x slower than DuckDB v0.6.1 on 15GB ClickBench data #5404
Comments
Additional context
DuckDB Plan
looks like Datafusion does no push down
update: seems duplicate with
TopK is a partial factor but it is not the best.
@sundy-li , you are right, select one
@sundy-li thanks for the info. Do you happen to know the code/blog link to "reused the original buffer"?
@sundy-li TIL, thank you
@jychen7 I checked on my 16-core Linux machine with an SSD, and duckdb still reads parquet faster. duckdb v0.6.0
datafusion:
I used to look at this code, duckdb's memory model for Not sure, but it may be related to: https://github.com/duckdb/duckdb/blob/master/src/common/types/vector.cpp#L1428-L1436 If the original parquet column data is
Support for this was added to parquet by @thinkharderdev not too long ago, it may just be a case of hooking it up - apache/arrow-rs#3633
Edit: perhaps you may be able to try with #5416 and the various predicate pushdown features enabled on ParquetOptions
@sundy-li my Mac only has 6 cores, but inspired by your message, I found out that the GitHub build is faster than the brew one (brew is a macOS package manager). More details here: duckdb/duckdb#6495
@tustvold thank you, nice work! let me try main (as of now, c676d10) which includes #5416
update: yes, 2x faster
I locally turn on plan result
@tustvold based on the I understand that your PR #5416 support Do you think this is something to improve ❓ I haven't checked the related code in the logical plan, so I will try to take a look tomorrow.
It may be that #4028 is required for this
I ran this again today

Datafusion:
datafusion-cli -c "SELECT * FROM 'hits.parquet' WHERE \"URL\" LIKE '%google%' LIMIT 1;"
...
real 0m3.734s
user 0m49.389s
sys 0m2.288s

DuckDB:
time duckdb -c "SELECT * FROM read_parquet('hits.parquet') WHERE URL LIKE '%google%' LIMIT 1;"
...
real 0m0.273s
user 0m0.863s
sys 0m0.884s

I will attempt to get #4028 done soon
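For readers comparing the two runs above, the gap can be worked out directly from the `time` output. The following is only a small sketch: the helper names `parse_time` and `slowdown` are made up for illustration, and the numbers are copied from the comment above rather than re-measured.

```python
import re

def parse_time(value: str) -> float:
    """Convert a shell `time` value such as '0m3.734s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

# Timings copied from the comment above (not re-measured here).
datafusion = {"real": "0m3.734s", "user": "0m49.389s", "sys": "0m2.288s"}
duckdb = {"real": "0m0.273s", "user": "0m0.863s", "sys": "0m0.884s"}

def slowdown(metric: str) -> float:
    """Ratio of the DataFusion time to the DuckDB time for one metric."""
    return parse_time(datafusion[metric]) / parse_time(duckdb[metric])

for metric in ("real", "user", "sys"):
    print(f"{metric}: DataFusion / DuckDB = {slowdown(metric):.1f}x")
```

On these numbers the wall-clock (`real`) gap is roughly 13.7x, while the `user` CPU time gap is much larger, which suggests DataFusion is spending many cores' worth of work on a query that only needs one matching row.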
Describe the problem
This is NOT a bug, but a potential improvement goal
DataFusion v19.rc1 turns on repartition_file_scans by default in #5295. On my local MacBook Pro (2.6 GHz 6-Core Intel Core i7, 32 GB 2667 MHz DDR4), for the following query on the ClickBench 14GB hits.parquet: real 0.566 user 1.876031 sys 0.357483
To Reproduce
Download data file
Prepare SQL
Create a file called create.sql
Create a file called q23_no_order_limit_1.sql
Datafusion
DuckDB
Expected behavior