
[GLUTEN-3425] Create non-existent HDFS folder on the Gluten side when writing an HDFS file #3428

Merged 1 commit on Oct 19, 2023

Conversation

JkSelf
Contributor

@JkSelf JkSelf commented Oct 18, 2023

What changes were proposed in this pull request?

When Gluten calls the Velox Parquet writer to write a Parquet file, the temporary path obtained from Spark may not have been created yet. While writing a local file, Velox automatically creates the necessary file path if it does not exist. However, this is not the case for HDFS paths. We attempted to create the HDFS path on the Velox side, but the community was not receptive to this approach. As a result, we decided to create the HDFS path on the Gluten side.
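The fix described above can be sketched as follows: before handing the write path to the native writer, check whether the target directory exists and create it (including intermediate directories) if it does not. This is a minimal, hypothetical illustration using `java.nio` against a local path; the actual Gluten change would go through Hadoop's `FileSystem` API (e.g. `FileSystem.mkdirs`) so that HDFS paths are handled, and the helper name here is invented for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class EnsureWriteDir {
    // Hypothetical helper: create the directory that will hold the output
    // file if it does not exist yet. In Gluten this would use
    // org.apache.hadoop.fs.FileSystem#mkdirs so HDFS paths work too.
    static void ensureDirExists(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            // createDirectories also creates missing intermediate
            // directories, mirroring HDFS mkdirs semantics.
            Files.createDirectories(dir);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("gluten-write-test");
        // Simulate the task-attempt directory Spark hands to the writer,
        // which may not exist yet when the native writer is invoked.
        Path attemptDir = tmp.resolve("_temporary/0/_temporary/attempt_0");
        ensureDirExists(attemptDir);
        System.out.println(Files.isDirectory(attemptDir)); // true
    }
}
```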

How was this patch tested?

Manual local test.

@github-actions

#3425

@github-actions

Run Gluten Clickhouse CI

@JkSelf JkSelf force-pushed the create_write_path branch from 9113611 to 268e26e on October 18, 2023 02:12
@JkSelf
Contributor Author

JkSelf commented Oct 18, 2023

@rui-mo Can you help review? Thanks.

Contributor

@rui-mo rui-mo left a comment


Can we add a test to ensure the functionality?

rui-mo previously approved these changes Oct 18, 2023
Contributor

@PHILO-HE PHILO-HE left a comment


LGTM! Just a few trivial comments.

@JkSelf
Contributor Author

JkSelf commented Oct 19, 2023

Can we add a test to ensure the functionality?

Discussed offline with @rui-mo. It is hard to cover this with a unit test.

@JkSelf JkSelf force-pushed the create_write_path branch 2 times, most recently from cbd2765 to 2fc6c37, on October 19, 2023 01:38
Contributor

@PHILO-HE PHILO-HE left a comment


LGTM! Thanks!

@PHILO-HE PHILO-HE merged commit a713d5e into apache:main Oct 19, 2023
13 checks passed
@GlutenPerfBot
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

query log/native_3428_time.csv log/native_master_10_18_2023_4dbb181cc_time.csv difference percentage
q1 40.34 43.58 3.246 108.05%
q2 24.28 24.34 0.061 100.25%
q3 41.09 35.79 -5.304 87.09%
q4 35.09 41.52 6.425 118.31%
q5 70.73 70.78 0.058 100.08%
q6 8.65 5.74 -2.908 66.36%
q7 103.49 86.23 -17.264 83.32%
q8 105.17 80.16 -25.014 76.22%
q9 152.15 117.35 -34.803 77.13%
q10 62.77 47.73 -15.040 76.04%
q11 23.21 20.15 -3.058 86.82%
q12 39.87 25.94 -13.932 65.05%
q13 70.83 49.65 -21.174 70.11%
q14 26.54 15.32 -11.223 57.72%
q15 43.86 27.13 -16.730 61.85%
q16 20.97 16.07 -4.907 76.61%
q17 109.72 122.53 12.801 111.67%
q18 218.42 163.25 -55.175 74.74%
q19 28.76 13.15 -15.604 45.74%
q20 40.53 25.17 -15.365 62.09%
q21 325.61 235.27 -90.344 72.25%
q22 16.80 15.49 -1.303 92.24%
total 1608.87 1282.32 -326.557 79.70%

@zhouyuan
Contributor

@JkSelf
The parquet write job failed on Jenkins with recent Velox changes:

Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 2000) (sr270 executor 9): java.lang.RuntimeException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Failed to open hdfs file: /user/sparkuser/ETL/newparquet_zstd/_temporary/0/_temporary/attempt_202310190834151922475248029547116_0002_m_000000_2000/part-00000-e52c3743-6c6b-459e-b6c6-a2bc676fa493-c000.zstd.parquet, with error: /user/sparkuser/ETL/newparquet_zstd/_temporary/0/_temporary/attempt_202310190834151922475248029547116_0002_m_000000_2000/part-00000-e52c3743-6c6b-459e-b6c6-a2bc676fa493-c000.zstd.parquet already exists as a directory

@PHILO-HE
Contributor

@JkSelf The parquet write job failed on Jenkins with recent Velox changes (quoting the "already exists as a directory" error from @zhouyuan above).

Looks like we should create the parent directory extracted from filePath?
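The distinction raised here can be sketched locally: running mkdirs on the full file path creates the file name itself as a directory, which reproduces the "already exists as a directory" failure when the writer later tries to open it; creating only the parent directory avoids this. The helper below is a hypothetical illustration using `java.nio` on a local path, not the actual Gluten code (which would use Hadoop's `FileSystem` against HDFS).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ParentDirDemo {
    // Hypothetical helper: prepare a write target the way the fix should,
    // by running mkdirs on the PARENT of filePath, never on filePath
    // itself. Files.createDirectories(filePath) would create the file
    // name as a directory, and a later open of that path would fail with
    // "... already exists as a directory".
    static Path prepareWriteTarget(Path filePath) throws IOException {
        Files.createDirectories(filePath.getParent());
        return filePath;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("gluten-parent-demo");
        Path file = prepareWriteTarget(
            tmp.resolve("attempt_0/part-00000.parquet"));
        Files.createFile(file); // the writer itself creates the file
        System.out.println(Files.isRegularFile(file)); // true
    }
}
```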

@JkSelf
Contributor Author

JkSelf commented Oct 20, 2023

(Quoting @PHILO-HE above: the parquet write job failed on Jenkins; should we create the parent directory extracted from filePath?)

My mistake: I used the wrong test scripts, so this error was not caught earlier. Yes, it should be the parent directory. I tried using the parent directory and got the following exception in Velox. It seems we can't create the path in Gluten. I need some time to investigate the root cause and will revert this PR first.


2023-10-20 07:31:10,288 ERROR util.TaskResources: Task 25 failed by error:
java.lang.RuntimeException: IOError: Couldn't serialize thrift: Insufficient space in external MemoryBuffer

        at io.glutenproject.datasource.DatasourceJniWrapper.close(Native Method)
        at org.apache.spark.sql.execution.datasources.velox.VeloxFormatWriterInjects$$anon$1.close(VeloxFormatWriterInjects.scala:117)
        at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseCurrentWriter(FileFormatDataWriter.scala:75)
        at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:86)
        at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:116)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:386)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1525)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:394)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$17(FileFormatWriter.scala:303)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2023-10-20 07:31:10,293 ERROR datasources.FileFormatWriter: Job job_202310200731034252066363461590636_0025 aborted.
I1020 07:31:10.298817 796153 FileSink.cpp:123] closing file: hdfs://sr246:9000/user/sparkuser/parquet-write-1g/_temporary/0/_temporary/attempt_202310200731034252066363461590636_0025_m_000000_25/part-00000-289d2c68-b0ca-4e8d-8877-59250b390ced-c000.zstd.parquet,  total size: 137.07MB
E1020 07:31:10.462782 796153 Exceptions.h:69] Line: ../../velox/connectors/hive/storage_adapters/hdfs/HdfsWriteFile.cpp:57, Function:close, Expression: success == 0 (-1 vs. 0) Failed to close hdfs file: File does not exist: /user/sparkuser/parquet-write-1g/_temporary/0/_temporary/attempt_202310200731034252066363461590636_0025_m_000000_25/part-00000-289d2c68-b0ca-4e8d-8877-59250b390ced-c000.zstd.parquet (inode 0) Holder libhdfs3_client_rand_0.124599_count_1_pid_795698_tid_140455000884992 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2880)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.fsync(FSNamesystem.java:3352)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.fsync(NameNodeRpcServer.java:1438)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.fsync(ClientNamenodeProtocolServerSideTranslatorPB.java:1049)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
, Source: RUNTIME, ErrorCode: INVALID_STATE

JkSelf added a commit to JkSelf/gluten that referenced this pull request Oct 20, 2023