[VL] add support for reading ORC #1513

Merged
merged 6 commits into from
May 5, 2023

Contributor

@zuochunwei zuochunwei commented Apr 26, 2023

What changes were proposed in this pull request?

Add support for reading ORC files encoded with RLE v1/v2.

The oap-velox ORC support PR should be merged before this PR.
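
For readers unfamiliar with the encoding, here is a minimal sketch of how ORC integer RLE v1 can be decoded for unsigned values. It follows the layout described in the ORC specification (a control byte, then either a run with a delta and a base value, or a list of literals). It is not the Velox reader this PR depends on, and the names `OrcRleV1Sketch`, `readVarint`, and `decodeUnsigned` are hypothetical. Signed (zigzag) values and RLE v2 are not covered.

```scala
object OrcRleV1Sketch {

  /** Reads one unsigned base-128 varint starting at `pos`.
    * Returns the decoded value and the position of the next unread byte. */
  private def readVarint(buf: Array[Byte], pos: Int): (Long, Int) = {
    var result = 0L
    var shift = 0
    var i = pos
    var more = true
    while (more) {
      val b = buf(i) & 0xff
      result |= (b & 0x7f).toLong << shift // low 7 bits carry the payload
      shift += 7
      i += 1
      more = (b & 0x80) != 0 // high bit set means another byte follows
    }
    (result, i)
  }

  /** Decodes a whole buffer of unsigned integer RLE v1 data. */
  def decodeUnsigned(buf: Array[Byte]): Vector[Long] = {
    val out = Vector.newBuilder[Long]
    var pos = 0
    while (pos < buf.length) {
      val control = buf(pos).toInt // interpreted as a signed byte
      pos += 1
      if (control >= 0) {
        // Run: (control + 3) values, each differing from the previous by `delta`.
        val runLength = control + 3
        val delta = buf(pos).toInt
        pos += 1
        val (base, next) = readVarint(buf, pos)
        pos = next
        var k = 0
        while (k < runLength) {
          out += base + k.toLong * delta
          k += 1
        }
      } else {
        // Literals: (-control) values follow, each stored as a varint.
        var remaining = -control
        while (remaining > 0) {
          val (value, next) = readVarint(buf, pos)
          pos = next
          out += value
          remaining -= 1
        }
      }
    }
    out.result()
  }
}
```

With this sketch, the bytes 0x61 0x00 0x07 decode to one hundred 7s, and 0xfb 0x02 0x03 0x06 0x07 0x0b decode to the five literals 2, 3, 6, 7, 11.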


@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}


@zuochunwei
Contributor Author

@JkSelf

@JkSelf
Contributor

JkSelf commented Apr 26, 2023

@zuochunwei Can you also add an ORC test to the Gluten Scala unit tests to check that the scan is offloaded to native and that the results are correct?

@zuochunwei
Contributor Author

> @zuochunwei Can you also add an ORC test to the Gluten Scala unit tests to check that the scan is offloaded to native and that the results are correct?

I have added a query_benchmark test for ORC RLE v1 and v2, and it works. A colleague has also helped me test reading ORC from Hive/HDFS.

Which Scala test would you like me to provide?

@zuochunwei changed the title from "[VL] add support for reading ORC RleV2" to "[VL] add support for reading ORC of RleV2" on Apr 26, 2023
@JkSelf
Contributor

JkSelf commented Apr 27, 2023

We can read an ORC file to check whether the scan operator is offloaded, and then verify correctness by comparing the result with vanilla Spark. We can also add ORC file format tests to TPC-H/DS later.

@JkSelf
Contributor

JkSelf commented Apr 27, 2023

@zuochunwei You may need to change the velox repo and branch here if you changed the velox code.

@zuochunwei
Contributor Author

> @zuochunwei You may need to change the velox repo and branch here if you changed the velox code.

I have opened a corresponding PR against oap-project/velox; please refer to that PR.

@zuochunwei
Contributor Author

> We can read an ORC file to check whether the scan operator is offloaded, and then verify correctness by comparing the result with vanilla Spark. We can also add ORC file format tests to TPC-H/DS later.

We have tested reading an ORC file through the TableScan operator, and it works.

@zhejiangxiaomai
Contributor

Here is an example of how to use your local branch to test the Gluten CI. @zuochunwei
https://github.com/oap-project/gluten/pull/1503/files

@zuochunwei
Contributor Author

> Here is an example of how to use your local branch to test the Gluten CI. @zuochunwei https://github.com/oap-project/gluten/pull/1503/files

OK, thank you!

@zuochunwei changed the title from "[VL] add support for reading ORC of RleV2" to "[VL] add support for reading ORC" on Apr 27, 2023
@JkSelf
Contributor

JkSelf commented Apr 27, 2023

@zuochunwei I tested the ORC scan in Gluten with the following unit test, and it works. Can you add this unit test to TestOperator.scala in this PR? We can add ORC file format tests for TPC-H/DS in follow-up PRs.

test("orc scan") {
  // Load the benchmark ORC copy of lineitem and expose it as a temp view.
  val df = spark.read.format("orc").load("./cpp/velox/benchmarks/data/bm_lineitem/orc/lineitem.orc")
  df.createOrReplaceTempView("lineitem_orc")
  runQueryAndCompare("select l_orderkey from lineitem_orc") { df =>
    // Expect exactly one BatchScanExecTransformer node in the executed plan.
    assert(getExecutedPlan(df).count(_.isInstanceOf[BatchScanExecTransformer]) == 1)
  }
}
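
The offload check in the test above comes down to counting plan nodes of one type. A minimal, self-contained sketch of that idea follows. It assumes the executed plan can be treated as a tree that is flattened before counting. `PlanNode`, `NativeScan`, `VanillaScan`, `Project`, and `nativeScanCount` are hypothetical stand-ins for illustration only, not Spark or Gluten classes.

```scala
object PlanCheckSketch {

  // Hypothetical stand-ins for physical plan nodes.
  sealed trait PlanNode { def children: Seq[PlanNode] }

  final case class NativeScan(path: String) extends PlanNode {
    val children: Seq[PlanNode] = Nil
  }

  final case class VanillaScan(path: String) extends PlanNode {
    val children: Seq[PlanNode] = Nil
  }

  final case class Project(columns: Seq[String], child: PlanNode) extends PlanNode {
    val children: Seq[PlanNode] = Seq(child)
  }

  /** Flattens a plan tree into a pre-order sequence of nodes. */
  def flatten(plan: PlanNode): Seq[PlanNode] =
    plan +: plan.children.flatMap(flatten)

  /** Counts the scans that were offloaded, in the same spirit as the
    * count(...) == 1 assertion in the unit test. */
  def nativeScanCount(plan: PlanNode): Int =
    flatten(plan).count(_.isInstanceOf[NativeScan])
}
```

In this toy model, a projection over a `NativeScan` yields a count of 1, while the same projection over a `VanillaScan` yields 0, which is the situation the assertion is meant to catch.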

@zuochunwei
Contributor Author

> @zuochunwei I tested the ORC scan in Gluten with the following unit test, and it works. Can you add this unit test to TestOperator.scala in this PR? We can add ORC file format tests for TPC-H/DS in follow-up PRs.
>
>     test("orc scan") {
>       val df = spark.read.format("orc").load("./cpp/velox/benchmarks/data/bm_lineitem/orc/lineitem.orc")
>       df.createOrReplaceTempView("lineitem_orc")
>       runQueryAndCompare("select l_orderkey from lineitem_orc") { df =>
>         assert(getExecutedPlan(df).count(_.isInstanceOf[BatchScanExecTransformer]) == 1)
>       }
>     }

done

@github-actions

Run Gluten Clickhouse CI


Co-authored-by: yangyimin <yangyimin@meituan.com>
@github-actions

github-actions bot commented May 4, 2023

Run Gluten Clickhouse CI

@zhejiangxiaomai
Contributor

zhejiangxiaomai commented May 4, 2023

Run mvn clean install -Pspark-3.2 -PskipTests -Pbackends-velox to see the code style error.


Co-authored-by: yangyimin <yangyimin@meituan.com>

Co-authored-by: yangyimin <yangyimin@meituan.com>

Co-authored-by: yangyimin <yangyimin@meituan.com>

Contributor

@zhejiangxiaomai zhejiangxiaomai left a comment

Thanks.

@zhejiangxiaomai zhejiangxiaomai merged commit cdf845b into apache:main May 5, 2023
@zuochunwei zuochunwei deleted the orcSupport branch June 19, 2023 08:28

4 participants