[VL] Fix ORC related failed UT #3437

chenxu14 · 2023-10-18T09:25:39Z

No description provided.

…#3418 What changes were proposed in this pull request? Fixes: apache#3417. How was this patch tested? Unit tests.

github-actions · 2023-10-18T09:26:00Z

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Other pull requests

github-actions · 2023-10-18T09:26:15Z

Run Gluten Clickhouse CI

github-actions · 2023-10-18T09:42:35Z

Run Gluten Clickhouse CI

…che#3433)

Make the needed changes to integrate with upstream Velox, and start to depend on a new Velox branch `update`, which shall be rebased frequently to keep updated with Velox upstream.

Changes mainly including:
* Change to use Velox provided Arrow 13.0.
* Integrate with Velox decimal type and date type refactor.
* Integrate with Velox memory module updates.
* Integrate with Velox parquet writer updates.
* Function integration and fixes.
* Solve Velox library linking issues, and fix Gluten CI issues.

Lacks:
* ORC support
* Performance gap on TPC-H/DS
* GHA docker image update

Co-authored-by: JiaKe <ke.a.jia@intel.com>
Co-authored-by: Rong Ma <rong.ma@intel.com>
Co-authored-by: zhao, zhenhui <zhenhui.zhao@intel.com>
Co-authored-by: Hongze Zhang <hongze.zhang@intel.com>
Co-authored-by: Joey <joey.ljy@alibaba-inc.com>
Co-authored-by: PHILO-HE <feilong.he@intel.com>

rui-mo

We just merged the upstream_velox branch into main. Could you change this PR to target the main branch?

rui-mo · 2023-10-19T00:41:46Z

gluten-ut/spark32/src/test/scala/io/glutenproject/utils/velox/VeloxTestSettings.scala

@@ -403,13 +403,11 @@ class VeloxTestSettings extends BackendTestSettings {
  enableSuite[GlutenOrcPartitionDiscoverySuite]
    .exclude("read partitioned table - normal case")
    .exclude("read partitioned table - with nulls")
-    .disableByReason("Blocked by ORC Velox upstream not ready")


Some ORC tests are disabled with the message `ReaderFactory is not registered for format orc`, for example in VeloxTestSettings. Could you also enable them in this PR?

OK, will try it later.

github-actions · 2023-10-19T03:37:08Z

Run Gluten Clickhouse CI

rui-mo · 2023-10-19T03:40:23Z

Please change this PR to target the main branch, thanks.

rui-mo · 2023-10-19T03:43:10Z

gluten-ut/spark32/src/test/scala/io/glutenproject/utils/velox/VeloxTestSettings.scala

@@ -358,8 +356,6 @@ class VeloxTestSettings extends BackendTestSettings {
    .exclude("Return correct results when data columns overlap with partition " +
      "columns (nested data)")
    .exclude("SPARK-31116: Select nested schema with case insensitive mode")
-    // ReaderFactory is not registered for format orc.
-    .exclude("SPARK-15474 Write and read back non-empty schema with empty dataframe - orc")


The tests below are all related to the ORC issue, thanks.
https://github.com/oap-project/gluten/blob/main/gluten-ut/spark32/src/test/scala/io/glutenproject/utils/velox/VeloxTestSettings.scala#L361-L368

rui-mo · 2023-10-19T06:10:22Z

The failed tests `nested column: Count(nested sub-field) not push down` do not seem related to ORC. We can disable them in this PR, and I will take a further look.

github-actions · 2023-10-20T02:43:40Z

Run Gluten Clickhouse CI

chenxu14 · 2023-10-20T02:43:57Z

The failed tests `nested column: Count(nested sub-field) not push down` do not seem related to ORC. We can disable them in this PR, and I will take a further look.

Disabled GlutenOrcV2AggregatePushDownSuite, since Velox doesn't support ORC v2 yet.

github-actions · 2023-10-20T03:40:10Z

Run Gluten Clickhouse CI

github-actions · 2023-10-20T03:50:31Z

Run Gluten Clickhouse CI

rui-mo · 2023-10-20T05:48:13Z

cpp/velox/compute/VeloxBackend.cc

@@ -239,6 +239,7 @@ void VeloxBackend::init(const std::unordered_map<std::string, std::string>& conf
  registerConnector(hiveConnector);
  velox::parquet::registerParquetReaderFactory();
  velox::dwrf::registerDwrfReaderFactory();
+  velox::dwrf::registerOrcReaderFactory();


PR #3445 is going to remove the registration of the Parquet and DWRF readers. I'm wondering whether the ORC reader registration can also be done in Velox, so that this change can be removed as well.

OK, I adjusted this in oap-project/velox#417.
After that is merged, I can submit a new PR based on the main branch.
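To illustrate why the ORC tests failed with `ReaderFactory is not registered for format orc` until a registration call was added, here is a minimal sketch of a format-keyed factory registry. The `ReaderFactory` and `ReaderFactoryRegistry` types below are hypothetical stand-ins written for this note; they are not Velox's actual classes or signatures.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a file-format reader factory (not a Velox type).
class ReaderFactory {
 public:
  explicit ReaderFactory(std::string format) : format_(std::move(format)) {}
  const std::string& format() const { return format_; }

 private:
  std::string format_;
};

// Hypothetical registry keyed by format name.
class ReaderFactoryRegistry {
 public:
  // Returns false if a factory for the same format is already registered.
  bool registerFactory(std::shared_ptr<ReaderFactory> factory) {
    assert(factory != nullptr);
    // Copy the key first so it is not read after the pointer is moved.
    const std::string key = factory->format();
    return factories_.emplace(key, std::move(factory)).second;
  }

  // Throws when no factory was registered for the requested format,
  // mirroring the error message quoted in the review comments.
  std::shared_ptr<ReaderFactory> get(const std::string& format) const {
    auto it = factories_.find(format);
    if (it == factories_.end()) {
      throw std::runtime_error(
          "ReaderFactory is not registered for format " + format);
    }
    return it->second;
  }

 private:
  std::unordered_map<std::string, std::shared_ptr<ReaderFactory>> factories_;
};
```

In this sketch, a registry that only holds `parquet` and `dwrf` factories throws on `get("orc")`. Once an `orc` factory is registered, the lookup succeeds, regardless of whether the backend or the library performs the registration.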

What changes were proposed in this pull request? (Fixes: apache#3668) How was this patch tested? Tested by UT.

Performance test data:
* 30 million rows of normal data. Test SQL: select count(1) from $test_tbl where to_date($col) > '1990-01-01'. Time before the PR: 2.983s, 2.686s, 2.804s. Time after the PR: 2.94s, 2.861s, 2.842s.
* 30 million rows, of which 25 million are NULL and 5 million are normal data. Test SQL: select count(1) from $test_tbl where to_date($col) > '1990-01-01'. Time before the PR: 0.621s, 0.614s, 0.677s. Time after the PR: 0.631s, 0.641s, 0.692s.
* 30 million rows, of which 25 million are random strings that do not match the date format and 5 million are normal data. Test SQL: select count(1) from $test_tbl where to_date($col) > '1990-01-01'. Time before the PR: 6.148s, 6.018s, 5.845s. Time after the PR: 3.188s, 3.055s, 3.08s.

The comparison shows that performance is close on normal data and improves in some abnormal-data scenarios.
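The three timing series quoted above can be compared with a small helper that averages the runs and divides the means. This is only a sketch; `meanSeconds` and `speedup` are hypothetical names introduced here, not part of Gluten.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of a non-empty list of timings, in seconds.
double meanSeconds(const std::vector<double>& runs) {
  assert(!runs.empty());
  return std::accumulate(runs.begin(), runs.end(), 0.0) /
      static_cast<double>(runs.size());
}

// Speedup factor: values above 1.0 mean the "after" runs were faster.
double speedup(const std::vector<double>& before,
               const std::vector<double>& after) {
  return meanSeconds(before) / meanSeconds(after);
}
```

Applied to the reported numbers, the malformed-string case comes out at about 1.9x faster, while the normal-data and NULL-heavy cases stay within about 3% of the earlier timings.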

) * [GLUTEN-1632][CH]Daily Update Clickhouse Version (20231117) * fix build due to ClickHouse/ClickHouse#56664 --------- Co-authored-by: kyligence-git <gluten@kyligence.io> Co-authored-by: Chang Chen <baibaichen@gmail.com>

apache#3781) When the final stage with bloom_filter_agg receives more than one input row, the CH backend core dumps. The root cause: when merging values in the final stage, the input data may be an uninitialized AggregateFunctionGroupBloomFilterData. Its wrong filter size and filter_hashes values are then used to initialize the first AggregateFunctionGroupBloomFilterData, which sets the wrong filter size when merging values. Close apache#3779.

JkSelf · 2023-11-21T08:41:18Z

@chenxu14 We have enabled Spark 3.4 in Gluten. Could you also help enable the ORC unit tests for Spark 3.4? Thanks.

github-actions · 2023-11-22T04:22:21Z

Run Gluten Clickhouse CI

chenxu14 · 2023-11-22T04:28:36Z

Moved this work to #3805.

lgbo-ustc and others added 4 commits October 18, 2023 13:37

[VL] Remove ColumnarToRow for Gluten columnar table cache (apache#3430)

7fbd3fc

[VL] Copy compress small partition buffer (apache#3420)

1ab906a

[CORE] fix CoalesceExec (apache#3372)

353c45a

This operator should be removed out of WholeStageTransform Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

marin-ma and others added 2 commits October 18, 2023 18:28

Revert "[VL] Copy compress small partition buffer (apache#3420)" (apa…

4dbb181

…che#3433)

rui-mo reviewed Oct 19, 2023

[VL] Copy-compress small partition buffer (apache#3434)

048d981

chenxu14 requested a review from rui-mo October 19, 2023 03:37

rui-mo reviewed Oct 19, 2023

JkSelf and others added 2 commits October 19, 2023 13:11

[GLUTEN-3425] Create not existing HDFS folder when writing HDFS file (a…

a713d5e

…pache#3428)

[CORE] Code refactor: simplify transformer classes (apache#3426)

6090cea

zhouyuan changed the title ~~Fix ORC related failed UT~~ [VL] Fix ORC related failed UT Oct 19, 2023

[VL] Increase kAbandonPartialAggregationMinRows (apache#3439)

86abf0a

To avoid partial aggregation flushing on a low cardinality case.

[VL][DOC] Refine the document about parquet write (apache#3441)

a9324e6

* Refine the document about parquet write

chenxu14 force-pushed the chenxu14_dev branch from 2306a87 to 8afa641 Compare October 20, 2023 02:43

chenxu14 force-pushed the chenxu14_dev branch from 8afa641 to cbacfc5 Compare October 20, 2023 03:39

chenxu14 force-pushed the chenxu14_dev branch from cbacfc5 to 37eb835 Compare October 20, 2023 03:50

[GLUTEN-2961][VL][FOLLOWUP] Fix issue on macOS (apache#3455)

60e92e4

rui-mo reviewed Oct 20, 2023

FelixYBW and others added 18 commits November 16, 2023 19:55

[VL] rebase to velox 2023-11-14 (apache#3747)

cb4e7b7

rebase velox to 2023-11-15 arrow version changed to 14.1.0

[VL]Avoid unnecessary filter binding for subfield (apache#3300)

a662f3f

* Avoid unnecessary filter binding for subfield

[GLUTEN-3378][CORE] Move getLocalFilesNode logic to scan transformer (a…

49b8e06

…pache#3650) * move getLocalFilesNode logic to scan transformer

[GLUTEN-3739][VL] Add a config to control velox's file handle cache (a…

23c2e4c

…pache#3753) By default, file handle cache is disabled in gluten for performance consideration.

[VL] Enable spill-to-disk for partial aggregation (apache#3697)

5f5d18a

[GLUTEN-3749][VL] fix redundant Velox build (apache#3759)

48496c0

remove redundant velox build Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

[VL] Activate random kill tasks GHA CI job (apache#3761)

dc75cca

[CELEBORN] Fix push small data (apache#3766)

cb18fc1

[VL] [Minor] Fix compile error in debug mode (apache#3765)

7553cd1

Co-authored-by: Hongze Zhang <mailtozhz@126.com>

[CORE] Support get native plan tree string (apache#3729)

6983898

* Support get native plan tree string
* fix test
* fix
* address comments

Co-authored-by: Kent Yao <yao@apache.org>

Add config to specify the window type in velox backend (apache#3703)

92356bc

[GLUTEN-3715] [VL] Add GCS support in velox backend (apache#2634)

1b489d1

This PR adds support for Google Cloud Storage using the Velox GCS Filesystem.

[VL] update pacakge.sh for spark34 (apache#3786)

b22c862

update pacakge.sh to cover spark34 Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

[VL] Respect spark.gluten.sql.debug in native side (apache#3748)

018da4c

[GLUTEN-3719][VL] Introduce VeloxIntermediateData to adjust agg func …

f29077e

…intermediate types (apache#3721)

zhouyuan mentioned this pull request Nov 21, 2023

Fix ORC related failed UT oap-project/velox#417

Merged

ulysses-you and others added 6 commits November 21, 2023 18:47

Make debug behavior clear (apache#3793)

7c640bd

Co-authored-by: Kent Yao <yao@apache.org>

[VL] Fix parquet writer passing wrong param (apache#3790)

0d4e43c

[VL] Ban flaky unit tests (apache#3798)

30865e5

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

[CORE][VL] Add naitve plan string and plan with stats (apache#3787)

b40a5f0

Support injecting the native plan string into Spark explain. A new param `spark.gluten.sql.injectNativePlanStringToExplain` is introduced to enable this feature.

[VL] Add configuration for generating 4k window size gzip parquet file (

5966579

apache#3792)

[VL] Fix ORC related failed UT

fc04d0c

chenxu14 force-pushed the chenxu14_dev branch from c1e8875 to fc04d0c Compare November 22, 2023 04:21

chenxu14 closed this Nov 22, 2023
