Merge string-view2 branch: reading from parquet up to 2x faster for some ClickBench queries (not on by default) #11667
* add a knob to force string view in benchmark * fix sql logic test * update doc * fix ci * fix ci only test * Update benchmarks/src/util/options.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion/common/src/config.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * update tests --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* add functions * add tests for hash util
* Update `string-view` branch to arrow-rs main (#10966) * Pin to arrow main * Fix clippy with latest arrow * Uncomment test that needs new arrow-rs to work * Update datafusion-cli Cargo.lock * Update Cargo.lock * tapelo * merge * update cast * consistent dep * fix ci * add more tests * make doc happy * update new implementation * fix bug * avoid unused dep * update dep * update * fix cargo check * update doc * pick up the comments change again --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…11519) * add functions * Update `string-view` branch to arrow-rs main (#10966) * Pin to arrow main * Fix clippy with latest arrow * Uncomment test that needs new arrow-rs to work * Update datafusion-cli Cargo.lock * Update Cargo.lock * tapelo * merge * update cast * consistent dep * fix ci * avoid unused dep * update dep * update * fix cargo check * better group value view aggregation * update --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* initial support for string view regex * update tests
* Add StringView support for date_part and make_date funcs * run cargo update in datafusion-cli * cargo fmt --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* gc string view when appropriate * make clippy happy * address comments * make doc happy * update style * Add comments and tests for gc_string_view_batch * better heuristic * update test * Update datafusion/physical-plan/src/coalesce_batches.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
CI appears to be failing due to #11671
* fix bug in return type inference * update doc * add tests --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* better default block size * fix related test
* use inferenced schema, don't load schema again * move config to parquet-only * update * update * better format * format * update
Update here is that we are on track to release arrow. Then we'll need a PR to update datafusion to arrow 52.2.0 (which should be straightforward). Then I will change this PR to ready to review and we should be able to merge it into DataFusion main 🤞
* native support for character length * Update datafusion/functions/src/unicode/character_length.rs --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
This PR is now ready for review (mostly I am hoping another committer will approve/merge it as I can't approve my own PR)
🚀
Thanks @Dandandan -- so exciting! I plan to work with @XiangpengHao and figure out what we need to do to get this feature on by default
🚀
Draft until arrow 52.2.0 is released to crates.io (expected Sat July 27)
Which issue does this PR close?
Part of #10918
Closes #10921
… (schema_force_string_view) by default #11682
Rationale for this change
We have been integrating a set of StringView changes on the string-view2 branch as they rely on unreleased code in arrow-rs. Once those changes are released and DataFusion uses them, we can bring this code directly to main.
What changes are included in this PR?
* … StringViewArray (#11556) - @XiangpengHao
* … StringViewArray in CoalesceBatchesStream (#11587) - @XiangpengHao
* … utf8_to_int_type (#11662)
* … --string-view to only apply to parquet formats (#11663) - @XiangpengHao
Are these changes tested?
CI
Are there any user-facing changes?
When StringView is enabled (it is not on by default), reading from parquet is up to 2x faster for some ClickBench queries.
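One of the changes in this PR garbage-collects string view batches in coalesce_batches.rs ("gc string view when appropriate"). As a rough illustration of why that matters, here is a minimal, dependency-free sketch. It is not the DataFusion or arrow-rs implementation: `View`, `ViewColumn`, and the "buffer more than 2x the referenced bytes" threshold in `maybe_gc` are all hypothetical names and values chosen for this example, and real Arrow views are packed 16-byte values that also inline short strings.

```rust
/// Simplified stand-in for a string view: a (offset, len) window into a shared buffer.
/// Hypothetical type for illustration; not the Arrow layout.
#[derive(Debug, Clone, Copy, PartialEq)]
struct View {
    offset: usize,
    len: usize,
}

/// A column of views plus the byte buffer they point into.
#[derive(Debug, Clone)]
struct ViewColumn {
    buffer: Vec<u8>,
    views: Vec<View>,
}

impl ViewColumn {
    /// Build a column by appending every value to one buffer.
    fn from_strs(values: &[&str]) -> Self {
        let mut buffer = Vec::new();
        let mut views = Vec::with_capacity(values.len());
        for v in values {
            views.push(View { offset: buffer.len(), len: v.len() });
            buffer.extend_from_slice(v.as_bytes());
        }
        Self { buffer, views }
    }

    /// Filtering only drops views; the whole buffer is still carried along.
    /// (Cloned here for simplicity; a real implementation would share it.)
    fn filter(&self, keep: &[bool]) -> Self {
        let views = self
            .views
            .iter()
            .zip(keep)
            .filter(|(_, k)| **k)
            .map(|(v, _)| *v)
            .collect();
        Self { buffer: self.buffer.clone(), views }
    }

    /// Read row `i` back as a string slice.
    fn get(&self, i: usize) -> &str {
        let v = self.views[i];
        std::str::from_utf8(&self.buffer[v.offset..v.offset + v.len])
            .expect("buffer was built from valid UTF-8")
    }

    /// Number of buffer bytes the remaining views actually use.
    fn referenced_bytes(&self) -> usize {
        self.views.iter().map(|v| v.len).sum()
    }

    /// Copy only the referenced bytes into a fresh, compact buffer.
    fn gc(&self) -> Self {
        let mut buffer = Vec::with_capacity(self.referenced_bytes());
        let mut views = Vec::with_capacity(self.views.len());
        for v in &self.views {
            views.push(View { offset: buffer.len(), len: v.len });
            buffer.extend_from_slice(&self.buffer[v.offset..v.offset + v.len]);
        }
        Self { buffer, views }
    }

    /// Hypothetical heuristic: compact only when the buffer is mostly dead bytes.
    fn maybe_gc(self) -> Self {
        if self.buffer.len() > 2 * self.referenced_bytes() {
            self.gc()
        } else {
            self
        }
    }
}
```

In this model, a selective filter leaves a column whose `buffer` is as large as before even though `referenced_bytes()` is small. Calling `maybe_gc()` before batches are concatenated trades one copy of the live bytes for not holding on to the dead ones.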