
Debug utility method to dump a table or columnar batch to Parquet #3646

Merged 6 commits into NVIDIA:branch-21.10 on Sep 29, 2021

Conversation

res-life (Collaborator)

This fixes #3115

Signed-off-by: Chong Gao <res_life@163.com>
res-life (Collaborator, Author)

@jlowe @firestarman This is a draft PR, please help review.
It only implements dumping a ColumnarBatch; dumping a cudf Table is not implemented yet.
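For context, a minimal sketch of how this draft's entry point might be invoked. The object name DumpUtils, the public method name, and the path are assumptions for illustration; the Option return mirrors the Some(dumpToParquetFileImpl(...)) shape visible in the diff excerpt below.

import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical call site: dump a batch that is about to be processed so it
// can be inspected offline with any Parquet reader.
def inspect(batch: ColumnarBatch): Unit = {
  val written: Option[String] =
    DumpUtils.dumpToParquetFile(batch, "/tmp/debug/batch-") // assumed names
  written.foreach(path => println(s"Dumped batch to $path"))
}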

revans2 (Collaborator) left a comment:

I am a little conflicted here. On one hand I would like to be able to write out a Table, not just a ColumnarBatch (because I think it is more likely that we are debugging some intermediate computation that will be in Table form rather than ColumnarBatch form). But on the other hand I really like that you were able to reuse so much existing code, so there is less of a maintenance issue.

    Some(dumpToParquetFileImpl(columnarBatch, filePrefix))
  }
} catch {
  case e: Exception =>
Member:

Why is this suppressing exceptions? Do we think it's common that we wouldn't mind if the data fails to write? I'd rather have the caller suppress exceptions when desired if that's not the common case.

res-life (Collaborator, Author)

Done, it will now throw if an exception occurs.
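A minimal sketch of the behavior after this fix, assuming the method shape implied by the diff excerpt above; the numCols guard is an assumption.

import org.apache.spark.sql.vectorized.ColumnarBatch

def dumpToParquetFile(columnarBatch: ColumnarBatch, filePrefix: String): Option[String] = {
  if (columnarBatch.numCols() == 0) {
    None // nothing to write
  } else {
    // No try/catch around the write anymore: a failure propagates to the
    // caller, which can suppress it itself if best-effort dumping is desired.
    Some(dumpToParquetFileImpl(columnarBatch, filePrefix))
  }
}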

ParquetWriteSupport.setSchema(schema, hadoopConf)
hadoopConf.setBoolean(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key, false)
hadoopConf.set(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key,
  ParquetOutputTimestampType.INT96.toString)
Member:

Timestamps should be written with TIMESTAMP_MICROS, which means no conversion is necessary from the internal format. INT96 requires conversions, and we only use it because Spark defaults to it. This is a debug tool, so ideally we want no conversions applied to the data.

res-life (Collaborator, Author)

Done, the dump now avoids any conversion.
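A minimal sketch of the configuration change being discussed, reusing the hadoopConf setup from the diff excerpt above; TIMESTAMP_MICROS writes Spark's internal microsecond timestamp values directly.

import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.internal.SQLConf.ParquetOutputTimestampType

// Write timestamps as TIMESTAMP_MICROS instead of INT96 so the internal
// microsecond representation is written unchanged, with no conversion.
hadoopConf.set(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key,
  ParquetOutputTimestampType.TIMESTAMP_MICROS.toString)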

sameerz added the "task" label (Work required that improves the product but is not user facing) on Sep 25, 2021
@@ -0,0 +1,120 @@
/*
* Copyright (c) 2020-2021, NVIDIA CORPORATION.
Collaborator:

Suggested change
- * Copyright (c) 2020-2021, NVIDIA CORPORATION.
+ * Copyright (c) 2021, NVIDIA CORPORATION.

res-life (Collaborator, Author)

Done.

Chong Gao added 4 commits September 27, 2021 21:02
Signed-off-by: Chong Gao <res_life@163.com>
Signed-off-by: Chong Gao <res_life@163.com>
Signed-off-by: Chong Gao <res_life@163.com>
res-life (Collaborator, Author)

build

res-life marked this pull request as ready for review on September 27, 2021 13:13
res-life (Collaborator, Author)

@revans2 I added dump code for a cudf Table, please help review.
@jlowe please help review.
@firestarman please help review.

revans2 (Collaborator) left a comment:

This feels really overly complicated. Would it be better to just normalize everything to a prefix, schema, and table? Then if it is a ColumnarBatch we have the schema and can translate it to a table without much effort, and for a Table we have a helper method, similar to parquetWriterOptionsFromTable, that will generate a schema from the table.
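A hedged sketch of the normalization being proposed. DumpUtils, dumpToParquetImpl, and schemaFromTable are hypothetical names, and GpuColumnVector.from is assumed to build a cudf Table from a batch of GPU-backed columns; this is an illustration of the shape, not the merged implementation.

import ai.rapids.cudf.Table
import com.nvidia.spark.rapids.GpuColumnVector // assumed conversion helper
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.ColumnarBatch

object DumpUtils {
  // ColumnarBatch entry point: the Spark schema is already known, so convert
  // the batch to a cudf Table and fall through to the common path.
  def dumpToParquet(batch: ColumnarBatch, schema: StructType, prefix: String): String = {
    val table = GpuColumnVector.from(batch) // assumed batch-to-Table helper
    try {
      dumpToParquetImpl(prefix, schema, table)
    } finally {
      table.close()
    }
  }

  // Table entry point: derive a schema from the table's column types, in the
  // same spirit as parquetWriterOptionsFromTable.
  def dumpToParquet(table: Table, prefix: String): String =
    dumpToParquetImpl(prefix, schemaFromTable(table), table)

  // Everything is normalized to (prefix, schema, table) before writing.
  private def dumpToParquetImpl(prefix: String, schema: StructType, table: Table): String = ???
  private def schemaFromTable(table: Table): StructType = ???
}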

Signed-off-by: Chong Gao <res_life@163.com>
res-life (Collaborator, Author)

build

res-life (Collaborator, Author)

@revans2 I addressed all of your comments, please review again.

res-life merged commit df44623 into NVIDIA:branch-21.10 on Sep 29, 2021
res-life deleted the dump-utils branch on September 29, 2021 01:13
tgravescs (Collaborator)

How about adding a note to the dev docs, or sending out a note, that this now exists?
