Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VL] Fix ORC related failed UT #3437

Closed
wants to merge 182 commits into from

Conversation

chenxu14
Copy link
Contributor

No description provided.

lgbo-ustc and others added 4 commits October 18, 2023 13:37
…#3418

What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)

Fixes: apache#3417

How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

unit tests

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
This operator should be removed out of WholeStageTransform

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
@github-actions
Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

@github-actions
Copy link

Run Gluten Clickhouse CI

1 similar comment
@github-actions
Copy link

Run Gluten Clickhouse CI

marin-ma and others added 2 commits October 18, 2023 18:28
Make the needed changes to integrate with upstream Velox, and start to depend on a new Velox branch `update`, which shall be rebased frequently to keep updated with Velox upstream.

Changes mainly including:
Change to use Velox provided Arrow 13.0.
Integrate with Velox decimal type and date type refactor.
Integrate with Velox memory module updates.
Integrate with Velox parquet writer updates.
Function integration and fixes.
Solve Velox library linking issues, and fix Gluten CI issues.

Lacks:
ORC support
Performance gap on TPC-H/DS
GHA docker image update

---------

Co-authored-by: JiaKe <ke.a.jia@intel.com>
Co-authored-by: Rong Ma <rong.ma@intel.com>
Co-authored-by: zhao, zhenhui <zhenhui.zhao@intel.com>
Co-authored-by: Hongze Zhang <hongze.zhang@intel.com>
Co-authored-by: Joey <joey.ljy@alibaba-inc.com>
Co-authored-by: PHILO-HE <feilong.he@intel.com>
Copy link
Contributor

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just merged upstream_velox branch into main. Could you change this PR to target on the main branch?

@@ -403,13 +403,11 @@ class VeloxTestSettings extends BackendTestSettings {
enableSuite[GlutenOrcPartitionDiscoverySuite]
.exclude("read partitioned table - normal case")
.exclude("read partitioned table - with nulls")
.disableByReason("Blocked by ORC Velox upstream not ready")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some ORC tests are disabled with the message ReaderFactory is not registered for format orc like VeloxTestSettings, could you also enable them in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will try it latter

@chenxu14 chenxu14 requested a review from rui-mo October 19, 2023 03:37
@github-actions
Copy link

Run Gluten Clickhouse CI

@rui-mo
Copy link
Contributor

rui-mo commented Oct 19, 2023

Please change this PR to target on the main branch, thanks.

@@ -358,8 +356,6 @@ class VeloxTestSettings extends BackendTestSettings {
.exclude("Return correct results when data columns overlap with partition " +
"columns (nested data)")
.exclude("SPARK-31116: Select nested schema with case insensitive mode")
// ReaderFactory is not registered for format orc.
.exclude("SPARK-15474 Write and read back non-empty schema with empty dataframe - orc")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhouyuan zhouyuan changed the title Fix ORC related failed UT [VL] Fix ORC related failed UT Oct 19, 2023
To avoid partial aggregation flushing on a low cardinality case.
@rui-mo
Copy link
Contributor

rui-mo commented Oct 19, 2023

The failed tests nested column: Count(nested sub-field) not push down seem not relevant to ORC. We can disable them in this PR, and I will take a look further.

* Refine the document about parquet write
@github-actions
Copy link

Run Gluten Clickhouse CI

@chenxu14
Copy link
Contributor Author

The failed tests nested column: Count(nested sub-field) not push down seem not relevant to ORC. We can disable them in this PR, and I will take a look further.

disable GlutenOrcV2AggregatePushDownSuite, Velox don’t support ORC v2 yet.

@github-actions
Copy link

Run Gluten Clickhouse CI

@github-actions
Copy link

Run Gluten Clickhouse CI

@@ -239,6 +239,7 @@ void VeloxBackend::init(const std::unordered_map<std::string, std::string>& conf
registerConnector(hiveConnector);
velox::parquet::registerParquetReaderFactory();
velox::dwrf::registerDwrfReaderFactory();
velox::dwrf::registerOrcReaderFactory();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #3445 is going to remove the registration for Parquet and DWRF reader. I'm wondering if the registration for ORC reader can also be done in Velox, then this change can also be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, Adjusted this in PR 417
After that merge, I can submit a new PR base on main branch

FelixYBW and others added 18 commits November 16, 2023 19:55
rebase velox to 2023-11-15

arrow version changed to 14.1.0
* Avoid unnecessary filter binding for subfield
…pache#3753)

By default, file handle cache is disabled in gluten for performance consideration.
What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)

(Fixes: apache#3668)

How was this patch tested?
TEST BY UT

性能测试数据

3000W 行正常数据
测试SQL: select count(1) from $test_tbl where to_date($col) > '1990-01-01'
PR改动前耗时: 2.983s, 2.686s, 2.804s
PR改动后耗时: 2.94s,2.861s,2.842s;

3000W行数据 (其中2500W行是NULL,500W是正常数据)
测试SQL: select count(1) from $test_tbl where to_date($col) > '1990-01-01'
PR改动前耗时:0.621s, 0.614s, 0.677s
PR改动后耗时:0.631s,0.641s,0.692s;

3000W行数据 (其中2500W数据是不符合日期格式的随机字符串,500W行是正常数据)
测试SQL: select count(1) from $test_tbl where to_date($col) > '1990-01-01'
PR改动前耗时:6.148s,6.018s,5.845s
PR改动后耗时:3.188s,3.055s,3.08s

对比发现,正常数据测试情况下性能接近,在某些异常场景下性能有所提升
)

* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20231117)

* fix build due to ClickHouse/ClickHouse#56664

---------

Co-authored-by: kyligence-git <gluten@kyligence.io>
Co-authored-by: Chang Chen <baibaichen@gmail.com>
remove redundant velox build

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
Co-authored-by: Hongze Zhang <mailtozhz@126.com>
* Support get native plan tree string

* fix test

* fix

* address comments

---------

Co-authored-by: Kent Yao <yao@apache.org>
apache#3781)

When there are more than one input row in the final stage with the bloom_filter_agg, it will be core dump for the CH backend.

The RC is: when merging values in the final stage, the input data maybe a non-init AggregateFunctionGroupBloomFilterData, it will use the wrong filter size and filter_hashes values to init the first AggregateFunctionGroupBloomFilterData, which leads to set the wrong filter size when merging values.

Close apache#3779.
This PR add support for Google Cloud Storage using the Velox GCS Filesystem.
update pacakge.sh to cover spark34

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
@JkSelf
Copy link
Contributor

JkSelf commented Nov 21, 2023

@chenxu14 We have enabled spark 3.4 in Gluten. Can you help to also enable the orc unit test in Spark 3.4? Thanks.

ulysses-you and others added 6 commits November 21, 2023 18:47
Co-authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
Support inject native plan string to Spark explain. A new param `spark.gluten.sql.injectNativePlanStringToExplain` is introduced to enable this feature.
Copy link

Run Gluten Clickhouse CI

@chenxu14 chenxu14 closed this Nov 22, 2023
@chenxu14
Copy link
Contributor Author

move work to #3805

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.