-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Officially maintained Arrow2 branch #1556
Conversation
Co-authored-by: Yijie Shen <henry.yijieshen@gmail.com>
* wip * more * Make scalar.rs compile * Fix various compilation error due to API difference * Make datafusion core compile * fmt * wip * wip: compile ballista * Pass all datafusion tests * Compile ballista
Co-authored-by: Yijie Shen <henry.yijieshen@gmail.com>
…ts (#10) * Fix tests * Ignore last test, fix clippy, fmt and enable integration * more clippy fix
Co-authored-by: Yijie Shen <henry.yijieshen@gmail.com>
Thank you everyone for all the reviews and comments so far. @Igosuki and I have addressed most of them. Here are the two remaining todo items:
I will keep working on this tomorrow. In the mean time, feel free to send PRs to my fork if you are interested in helping. After these two items are fixed, I will run another round of benchmark to double check the performance fix. It's quite interesting that I got the opposite performance test result initial even without that file buf fix :P I will dig into what's causing that as well. |
Yes sorry about that, these were simply comments to Indicate that these
particular feature tests were not passing.
Le dim. 16 janv. 2022 à 09:52, QP Hou ***@***.***> a écrit :
… Thank you everyone for all the reviews and comments so far. @Igosuki
<https://github.com/Igosuki> and I have addressed most of them. Here are
the two remaining todo items:
- Get the parquet row group filter test to pass
- Restore sql integration test migration. All those sql tests were
migrated and passing previously, but those changes got lost when we merged
the sql test refactoring from master.
I will keep working on this tomorrow. In the mean time, feel free to send
PRs to my fork if you are interested in helping. After these two items are
fixed, I will run another round of benchmark to double check the
performance fix. It's quite interesting that I got the opposite performance
test result initial even without that file buf fix :P I will dig into
what's causing that as well.
—
Reply to this email directly, view it on GitHub
<#1556 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADDFBSPYHZOIG4O2XOGWZDUWKBLXANCNFSM5L5P5AVQ>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
The parquet row group test failure turned out to be a red herring. The asserted expected result is actually not correct. I have filed a follow up issue at #1591. I changed the expected result in this branch to fix the test failure for now. What the predicate pruning logic returns in this branch is more correct than what we have in master, but still wrong. The proper fix is out of scope of arrow2 migration and tracked in #1591. We are now passing all 856 unit tests. 2 more integration tests to fix, which are caused by difference in how arrow2 formats binary array. |
I also noticed my benchmarks were ran with data generated from |
5ad816a
to
61ace8f
Compare
update: all datafusion unit and integration tests are passing now, down to a single test failure in datafusion-cli related to json display format. |
I think we should merge it into the arrow2 branch and keep iterating from there. I suspect the next big chunk of work is the |
oops, looks like the arrow2 branch got updated with latest commits from master, anyone mind if I revert it back to 2008b1d and handle master catch up in a follow up PR? |
All integration and unit tests are passing now, the MIRI check is failing due to an upstream tokio issue I believe. I will file some follow up issues tomorrow to track the remaining work needed for us to make the final call on master merge. |
Thank you @jorgecarleitao @yjshen and @Igosuki for your hard work on the migration thus far :) |
Wow! Milestone reached! Thanks for driving on this and making it happen @houqp 👍 |
Great work @houqp
Le jeu. 20 janv. 2022 à 08:27, Yijie Shen ***@***.***> a
écrit :
… Wow! Milestone reached! Thanks for driving on this and making it happen
@houqp <https://github.com/houqp> 👍
—
Reply to this email directly, view it on GitHub
<#1556 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADDFBWXQK6FFRP4FXB26JDUW62PLANCNFSM5L5P5AVQ>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Thanks @houqp , epic work ! |
@andygrove What tool did you use to get such a smooth CPU chart ? |
Quick update on this, I have cleaned up the issues in the arrow2 milestone: https://github.com/apache/arrow-datafusion/milestone/3. The main remaining items are:
I will keep work on issues in the arrow2 milestone whenever I have capacity. If anyone of you are interested in helping, please feel free to comment on those issues or send PRs to the official arrow2 branch. |
Which issue does this PR close?
Close arrow2 milestone https://github.com/apache/arrow-datafusion/milestone/3
Rationale for this change
Provide a complete arrow2 based datafusion implementation for full evaluation of the migration. This should give us a good feeling of the arrow2 API UX as well as a starting point for performance benchmarks within datafusion and downstream projects.
The goal is to merge this code into an official arrow2 branch in the short run, until we are comfortable doing the switch in master.
What changes are included in this PR?
Here is a TPCH benchmark I ran on my Linux laptop (baseline 2008b1d):
On avg, we are getting around 5% speed up across the board, with q5 at 11% and q12 at only 1% as the two outliners. If this performance gain can also be replicated in downstream projects, then I think it would be a strong case for us to do the arrow2 swtich. On top of this, we end up with a nice 1000+ lines of code reduction ;)
Are there any user-facing changes?
Yes, downstream consumer of datafusion will need to switch to arrow2 as well.