
[GLUTEN-3425] Create non-existent HDFS folder on the Gluten side when writing an HDFS file #3428

Merged 1 commit on Oct 19, 2023

Conversation

JkSelf
Contributor

@JkSelf JkSelf commented Oct 18, 2023

What changes were proposed in this pull request?

When Gluten calls the Velox Parquet writer to write a Parquet file, the temporary path obtained from Spark may not have been created yet. While writing a local file, Velox automatically creates the necessary file path if it does not exist. However, this is not the case for HDFS paths. We attempted to create the HDFS path on the Velox side, but the community was not receptive to this approach. As a result, we decided to create the HDFS path on the Gluten side.
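The fix described above can be sketched as follows: before handing the write path to the native writer, check whether the target directory exists and create it (including intermediate directories) if it does not. This is a minimal, hypothetical illustration using `java.nio` against a local path; the actual Gluten change would go through Hadoop's `FileSystem` API (e.g. `FileSystem.mkdirs`) so that HDFS paths are handled, and the helper name here is invented for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class EnsureWriteDir {
    // Hypothetical helper: create the directory that will hold the output
    // file if it does not exist yet. In Gluten this would use
    // org.apache.hadoop.fs.FileSystem#mkdirs so HDFS paths work too.
    static void ensureDirExists(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            // createDirectories also creates missing intermediate
            // directories, mirroring HDFS mkdirs semantics.
            Files.createDirectories(dir);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("gluten-write-test");
        // Simulate the task-attempt directory Spark hands to the writer,
        // which may not exist yet when the native writer is invoked.
        Path attemptDir = tmp.resolve("_temporary/0/_temporary/attempt_0");
        ensureDirExists(attemptDir);
        System.out.println(Files.isDirectory(attemptDir)); // true
    }
}
```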

How was this patch tested?

Manual local test.

@github-actions

#3425

@github-actions

Run Gluten Clickhouse CI

@JkSelf JkSelf force-pushed the create_write_path branch from 9113611 to 268e26e on October 18, 2023 02:12
@JkSelf
Contributor Author

JkSelf commented Oct 18, 2023

@rui-mo Can you help review? Thanks.

Contributor

@rui-mo rui-mo left a comment


Can we add a test to ensure the functionality?

rui-mo previously approved these changes Oct 18, 2023
Contributor

@PHILO-HE PHILO-HE left a comment


LGTM! Just a few trivial comments.

@JkSelf
Contributor Author

JkSelf commented Oct 19, 2023

Can we add a test to ensure the functionality?

Discussed offline with @rui-mo. It is hard to cover this with a unit test.

@JkSelf JkSelf force-pushed the create_write_path branch 2 times, most recently from cbd2765 to 2fc6c37, on October 19, 2023 01:38
Contributor

@PHILO-HE PHILO-HE left a comment


LGTM! Thanks!

@PHILO-HE PHILO-HE merged commit a713d5e into apache:main Oct 19, 2023
13 checks passed
@GlutenPerfBot
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

query log/native_3428_time.csv log/native_master_10_18_2023_4dbb181cc_time.csv difference percentage
q1 40.34 43.58 3.246 108.05%
q2 24.28 24.34 0.061 100.25%
q3 41.09 35.79 -5.304 87.09%
q4 35.09 41.52 6.425 118.31%
q5 70.73 70.78 0.058 100.08%
q6 8.65 5.74 -2.908 66.36%
q7 103.49 86.23 -17.264 83.32%
q8 105.17 80.16 -25.014 76.22%
q9 152.15 117.35 -34.803 77.13%
q10 62.77 47.73 -15.040 76.04%
q11 23.21 20.15 -3.058 86.82%
q12 39.87 25.94 -13.932 65.05%
q13 70.83 49.65 -21.174 70.11%
q14 26.54 15.32 -11.223 57.72%
q15 43.86 27.13 -16.730 61.85%
q16 20.97 16.07 -4.907 76.61%
q17 109.72 122.53 12.801 111.67%
q18 218.42 163.25 -55.175 74.74%
q19 28.76 13.15 -15.604 45.74%
q20 40.53 25.17 -15.365 62.09%
q21 325.61 235.27 -90.344 72.25%
q22 16.80 15.49 -1.303 92.24%
total 1608.87 1282.32 -326.557 79.70%

@zhouyuan
Contributor

@JkSelf
The parquet write job failed on Jenkins with recent Velox changes:

Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 2000) (sr270 executor 9): java.lang.RuntimeException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Failed to open hdfs file: /user/sparkuser/ETL/newparquet_zstd/_temporary/0/_temporary/attempt_202310190834151922475248029547116_0002_m_000000_2000/part-00000-e52c3743-6c6b-459e-b6c6-a2bc676fa493-c000.zstd.parquet, with error: /user/sparkuser/ETL/newparquet_zstd/_temporary/0/_temporary/attempt_202310190834151922475248029547116_0002_m_000000_2000/part-00000-e52c3743-6c6b-459e-b6c6-a2bc676fa493-c000.zstd.parquet already exists as a directory

@PHILO-HE
Contributor

@JkSelf The parquet write job failed on Jenkins with recent Velox changes (quoting the "already exists as a directory" error from @zhouyuan above).

Looks like we should create the parent directory extracted from filePath?
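The distinction raised here can be sketched locally: running mkdirs on the full file path creates the file name itself as a directory, which reproduces the "already exists as a directory" failure when the writer later tries to open it; creating only the parent directory avoids this. The helper below is a hypothetical illustration using `java.nio` on a local path, not the actual Gluten code (which would use Hadoop's `FileSystem` against HDFS).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ParentDirDemo {
    // Hypothetical helper: prepare a write target the way the fix should,
    // by running mkdirs on the PARENT of filePath, never on filePath
    // itself. Files.createDirectories(filePath) would create the file
    // name as a directory, and a later open of that path would fail with
    // "... already exists as a directory".
    static Path prepareWriteTarget(Path filePath) throws IOException {
        Files.createDirectories(filePath.getParent());
        return filePath;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("gluten-parent-demo");
        Path file = prepareWriteTarget(
            tmp.resolve("attempt_0/part-00000.parquet"));
        Files.createFile(file); // the writer itself creates the file
        System.out.println(Files.isRegularFile(file)); // true
    }
}
```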

@JkSelf
Contributor Author

JkSelf commented Oct 20, 2023

(Quoting @PHILO-HE above: the parquet write job failed on Jenkins; should we create the parent directory extracted from filePath?)

My mistake: I used the wrong test scripts, so this error was not caught earlier. Yes, it should be the parent directory. I tried using the parent directory and got the following exception in Velox. It seems we can't create the path in Gluten. I need some time to investigate the root cause and will revert this PR first.


2023-10-20 07:31:10,288 ERROR util.TaskResources: Task 25 failed by error:
java.lang.RuntimeException: IOError: Couldn't serialize thrift: Insufficient space in external MemoryBuffer

        at io.glutenproject.datasource.DatasourceJniWrapper.close(Native Method)
        at org.apache.spark.sql.execution.datasources.velox.VeloxFormatWriterInjects$$anon$1.close(VeloxFormatWriterInjects.scala:117)
        at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseCurrentWriter(FileFormatDataWriter.scala:75)
        at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:86)
        at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:116)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:386)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1525)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:394)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$17(FileFormatWriter.scala:303)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2023-10-20 07:31:10,293 ERROR datasources.FileFormatWriter: Job job_202310200731034252066363461590636_0025 aborted.
I1020 07:31:10.298817 796153 FileSink.cpp:123] closing file: hdfs://sr246:9000/user/sparkuser/parquet-write-1g/_temporary/0/_temporary/attempt_202310200731034252066363461590636_0025_m_000000_25/part-00000-289d2c68-b0ca-4e8d-8877-59250b390ced-c000.zstd.parquet,  total size: 137.07MB
E1020 07:31:10.462782 796153 Exceptions.h:69] Line: ../../velox/connectors/hive/storage_adapters/hdfs/HdfsWriteFile.cpp:57, Function:close, Expression: success == 0 (-1 vs. 0) Failed to close hdfs file: File does not exist: /user/sparkuser/parquet-write-1g/_temporary/0/_temporary/attempt_202310200731034252066363461590636_0025_m_000000_25/part-00000-289d2c68-b0ca-4e8d-8877-59250b390ced-c000.zstd.parquet (inode 0) Holder libhdfs3_client_rand_0.124599_count_1_pid_795698_tid_140455000884992 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2880)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.fsync(FSNamesystem.java:3352)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.fsync(NameNodeRpcServer.java:1438)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.fsync(ClientNamenodeProtocolServerSideTranslatorPB.java:1049)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
, Source: RUNTIME, ErrorCode: INVALID_STATE

JkSelf added a commit to JkSelf/gluten that referenced this pull request Oct 20, 2023