Officially maintained Arrow2 branch #1556

houqp · 2022-01-14T01:54:54Z

Which issue does this PR close?

Close arrow2 milestone https://github.com/apache/arrow-datafusion/milestone/3

Rationale for this change

Provide a complete arrow2-based datafusion implementation for full evaluation of the migration. This should give us a good feel for the arrow2 API UX, as well as a starting point for performance benchmarks within datafusion and downstream projects.

The goal is to merge this code into an official arrow2 branch in the short run, until we are comfortable doing the switch in master.

What changes are included in this PR?

Switched to arrow2
Enabled miri test (the current miri failure is caused by rust-lang/miri#602, "Cannot run even basic Tokio programs")

Here is a TPCH benchmark I ran on my Linux laptop (baseline 2008b1d):

On average, we are getting around a 5% speedup across the board, with q5 at 11% and q12 at only 1% as the two outliers. If this performance gain can also be replicated in downstream projects, then I think it would be a strong case for us to do the arrow2 switch. On top of this, we end up with a nice 1000+ lines of code reduction ;)
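
For anyone wanting to reproduce this kind of comparison on their own runs, here is a minimal sketch of how per-query and average speedup percentages could be computed from baseline and candidate timings. The function names are hypothetical, and the formula (relative reduction in wall-clock time, averaged arithmetically) is an assumption; the numbers quoted above may have been derived differently.

```rust
/// Percentage reduction in wall-clock time relative to the baseline.
/// A positive value means the candidate (e.g. the arrow2 branch) is faster.
/// Hypothetical helper, not part of datafusion or its benchmark tooling.
pub fn speedup_pct(baseline_ms: f64, candidate_ms: f64) -> f64 {
    (baseline_ms - candidate_ms) / baseline_ms * 100.0
}

/// Arithmetic mean of the per-query speedups.
/// Each tuple is (baseline_ms, candidate_ms) for one query.
/// Returns None when no timings are given.
pub fn mean_speedup_pct(timings: &[(f64, f64)]) -> Option<f64> {
    if timings.is_empty() {
        return None;
    }
    let total: f64 = timings.iter().map(|&(b, c)| speedup_pct(b, c)).sum();
    Some(total / timings.len() as f64)
}
```

Feeding in one `(baseline_ms, candidate_ms)` pair per TPCH query gives a single average figure; looking at `speedup_pct` per query is what surfaces outliers such as q5 and q12.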

Are there any user-facing changes?

Yes, downstream consumers of datafusion will need to switch to arrow2 as well.

houqp · 2022-01-16T08:52:01Z

Thank you everyone for all the reviews and comments so far. @Igosuki and I have addressed most of them. Here are the two remaining todo items:

Get the parquet row group filter test to pass
Restore sql integration test migration. All those sql tests were migrated and passing previously, but those changes got lost when we merged the sql test refactoring from master.

I will keep working on this tomorrow. In the meantime, feel free to send PRs to my fork if you are interested in helping. After these two items are fixed, I will run another round of benchmarks to double-check the performance fix. It's quite interesting that I initially got the opposite performance test result even without that file buf fix :P I will dig into what's causing that as well.

Igosuki · 2022-01-16T15:39:41Z

Yes, sorry about that; these were simply comments to indicate that these particular feature tests were not passing.

houqp · 2022-01-17T05:48:34Z

The parquet row group test failure turned out to be a red herring. The asserted expected result is actually not correct. I have filed a follow up issue at #1591. I changed the expected result in this branch to fix the test failure for now. What the predicate pruning logic returns in this branch is more correct than what we have in master, but still wrong. The proper fix is out of scope of arrow2 migration and tracked in #1591.

We are now passing all 856 unit tests. There are 2 more integration tests to fix; both failures are caused by differences in how arrow2 formats binary arrays.
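
For readers unfamiliar with the predicate pruning logic mentioned above, here is a minimal, self-contained sketch of the general idea behind row group pruning with min/max statistics. All types and function names are hypothetical; this is not the datafusion, parquet, or arrow2 API, and it does not reproduce the behavior tracked in #1591. The key property is that pruning must be conservative: a row group is skipped only when its statistics prove that no row can match.

```rust
/// Min/max statistics for one column in one row group.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ColumnStats {
    pub min: i64,
    pub max: i64,
}

/// A simple comparison of the column against a literal value.
#[derive(Debug, Clone, Copy)]
pub enum Predicate {
    Gt(i64),
    Lt(i64),
    Eq(i64),
}

/// Returns true if the row group might contain a matching row.
pub fn may_match(stats: &ColumnStats, pred: Predicate) -> bool {
    match pred {
        // Some value exceeds v only if the maximum does.
        Predicate::Gt(v) => stats.max > v,
        // Some value is below v only if the minimum is.
        Predicate::Lt(v) => stats.min < v,
        // v can be present only if it lies within [min, max].
        Predicate::Eq(v) => stats.min <= v && v <= stats.max,
    }
}

/// Returns the indices of the row groups that still have to be read.
/// Row groups without statistics (None) are always kept.
pub fn prune_row_groups(groups: &[Option<ColumnStats>], pred: Predicate) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter_map(|(i, stats)| match stats {
            // Statistics prove that no row matches: skip this row group.
            Some(s) if !may_match(s, pred) => None,
            // Might match, or statistics are missing: keep it.
            _ => Some(i),
        })
        .collect()
}
```

A test for logic like this asserts on which row groups survive pruning, so an incorrect expected result in the test can look like a regression in the pruning code itself.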

houqp · 2022-01-17T05:53:21Z

I also noticed my benchmarks were run with data generated from tpch-gen.sh, which only produces single-partition CSV files. @andygrove could you share with me how you generated your sf100 dataset?

yjshen · 2022-01-17T06:22:55Z

The parquet row group test failure turned out to be a red herring. The asserted expected result is actually not correct. I have filed a follow up issue at #1591.

I shared the same observation in houqp#16, but ignored the test at the time.

houqp · 2022-01-18T07:10:47Z

update: all datafusion unit and integration tests are passing now, down to a single test failure in datafusion-cli related to json display format.

alamb · 2022-01-18T15:40:49Z

I think we should merge it into the arrow2 branch and keep iterating from there. I suspect the next big chunk of work is the RecordBatch removal / adaptation in jorgecarleitao/arrow2#717

houqp · 2022-01-19T05:05:48Z

oops, looks like the arrow2 branch got updated with latest commits from master, anyone mind if I revert it back to 2008b1d and handle master catch up in a follow up PR?

houqp · 2022-01-20T07:02:33Z

All integration and unit tests are passing now. The MIRI check is failing due to an upstream tokio issue, I believe. I will file some follow-up issues tomorrow to track the remaining work needed for us to make the final call on the master merge.

houqp · 2022-01-20T07:04:12Z

Thank you @jorgecarleitao @yjshen and @Igosuki for your hard work on the migration thus far :)

yjshen · 2022-01-20T07:27:37Z

Wow! Milestone reached! Thanks for driving on this and making it happen @houqp 👍

Igosuki · 2022-01-20T12:42:21Z

Great work @houqp

xudong963 · 2022-01-20T14:07:18Z

Thanks @houqp, epic work!

Igosuki · 2022-01-23T11:21:38Z

@andygrove What tool did you use to get such a smooth CPU chart?

houqp · 2022-01-24T05:11:39Z

Quick update on this, I have cleaned up the issues in the arrow2 milestone: https://github.com/apache/arrow-datafusion/milestone/3. The main remaining items are:

I will keep working on issues in the arrow2 milestone whenever I have capacity. If any of you are interested in helping, please feel free to comment on those issues or send PRs to the official arrow2 branch.

jorgecarleitao and others added 30 commits July 5, 2021 17:05

Wip.

099398e

resolve merge conflicts and bump to latest arrow2

a5b2557

use lexicographical_partition_ranges from arrow2

a0c9669

Merge remote-tracking branch 'upstream/master' into arrow22

3218759

Fix build errors

a035200

Co-authored-by: Yijie Shen <henry.yijieshen@gmail.com>

Fix DataFusion test and try to make ballista compile (#4)

843fbe6

* wip * more * Make scalar.rs compile * Fix various compilation error due to API difference * Make datafusion core compile * fmt * wip * wip: compile ballista * Pass all datafusion tests * Compile ballista

pin arrow-flight to 0.1 in arrow2 repo

fccbddb

turn on io_parquet_compression feature for arrow2

77c69cf

estimate array memory usage with estimated_bytes_size

2d2e379

Merge remote-tracking branch 'upstream/master' into arrow2-merge

cb187a6

fix compile and tests

25363d2

Co-authored-by: Yijie Shen <henry.yijieshen@gmail.com>

Make ballista compile (#6)

7a5294b

Make cargo test compile (#7)

4030615

* WIP: on making cargo test compile * make cargo test compile * fix

fix str to timestamp scalarvalue casting

fde82cf

fixing datafusion tests (#8)

b585f3b

fix crypto expression tests

99907fd

fix floating point precision

b2f709d

fix list scalar to_arry method for timestamps

ed5281c

Fix tests (#9)

f9504e7

Ignore last test, fix cargo clippy, format and pass integration tes…

33b6931

…ts (#10) * Fix tests * Ignore last test, fix clippy, fmt and enable integration * more clippy fix

bump to latest arrow2, remove ord for interval type

ca53b64

add back case insenstive regex support

8702e12

support type cast failure message

41153dc

bump to arrow2 and parquet2 0.7, replace arrow-flight with arrow-format

ba57aa8

chore: arrow2 to 0.8, parquet to 0.8, prost to 0.9, tonic to 0.6

387fdf6

Merge remote-tracking branch 'upstream/master' into arrow22

0d504e6

Fix build and tests

ea6d7fa

Co-authored-by: Yijie Shen <henry.yijieshen@gmail.com>

Merge remote-tracking branch 'origin/master' into arrow2_merge

44db376

merge latest datafusion

ca9b485

start migrating avro to arrow2

b9125bc

jimexist mentioned this pull request Jan 16, 2022

add from_slice trait to ease arrow2 migration #1588

Merged

fix sql tests

a27de10

jimexist mentioned this pull request Jan 17, 2022

support from_slice for binary, string, and boolean array types #1589

Merged

fix parquet row group filter test

7e8b8d9

remove empty python/src/dataframe.rs file

8a6fb2c

implement bit_length function

60e869e

houqp force-pushed the arrow2_merge branch 2 times, most recently from 5ad816a to 61ace8f Compare January 18, 2022 07:00

fix binary array print formatting

1e352c3

houqp force-pushed the arrow2_merge branch from 61ace8f to 1e352c3 Compare January 18, 2022 07:07

houqp force-pushed the arrow2_merge branch from e24d775 to aaaed9b Compare January 20, 2022 06:38

fix cli json print and avro example

2698383

houqp force-pushed the arrow2_merge branch from aaaed9b to 2698383 Compare January 20, 2022 06:40

houqp merged commit c0c9c72 into apache:arrow2 Jan 20, 2022

houqp mentioned this pull request Jan 24, 2022

ARROW2: Optimize parquet read memory usage #1657

Closed
