Debug utility method to dump a table or columnar batch to Parquet #3646
Conversation
Signed-off-by: Chong Gao <res_life@163.com>
@jlowe @firestarman It's a draft PR, please help to review.
I am a little conflicted here. On one hand I would like to be able to write out a table, not just a ColumnarBatch (because I think it is more likely that we are debugging some intermediate computation that will be in table form rather than ColumnarBatch form). But on the other hand I really like that you were able to reuse so much existing code, so there is less of a maintenance issue.
    Some(dumpToParquetFileImpl(columnarBatch, filePrefix))
  }
} catch {
  case e: Exception =>
Why is this suppressing exceptions? Do we think it's common that we wouldn't mind if the data fails to write? I'd rather have the caller suppress exceptions when desired if that's not the common case.
Done, it will now throw if an exception occurs.
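For context, a minimal sketch of what the revised wrapper might look like; the surrounding method signature is an assumption based on the snippet above, not the exact PR code:

```scala
// Sketch only: signature assumed from the snippet above, not the exact PR code.
// The try/catch is gone, so a write failure now propagates to the caller.
def dumpToParquetFile(columnarBatch: ColumnarBatch, filePrefix: String): Option[String] = {
  if (columnarBatch.numCols() == 0) {
    None // nothing to write
  } else {
    Some(dumpToParquetFileImpl(columnarBatch, filePrefix)) // throws on failure
  }
}
```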
ParquetWriteSupport.setSchema(schema, hadoopConf)
hadoopConf.setBoolean(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key, false)
hadoopConf.set(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key,
  ParquetOutputTimestampType.INT96.toString)
Timestamps should be written with TIMESTAMP_MICROS, which means no conversions are necessary from the internal format. INT96 requires conversions, and we only use it because Spark defaults to it. This is a debug tool, so ideally we want no conversions applied to the data.
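For illustration, a sketch of the suggested configuration change, assuming the same hadoopConf setup shown above (ParquetOutputTimestampType.TIMESTAMP_MICROS is Spark's enum value for microsecond timestamps):

```scala
// Write timestamps as TIMESTAMP_MICROS so the internal microsecond values
// go to Parquet without conversion (sketch of the suggested change).
hadoopConf.set(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key,
  ParquetOutputTimestampType.TIMESTAMP_MICROS.toString)
```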
Done, the dump now avoids any conversion.
@@ -0,0 +1,120 @@
/*
 * Copyright (c) 2020-2021, NVIDIA CORPORATION.
Suggested change:
- * Copyright (c) 2020-2021, NVIDIA CORPORATION.
+ * Copyright (c) 2021, NVIDIA CORPORATION.
done
build
@revans2 Added dump code for the cuDF Table, please help to review.
This feels really overly complicated. Would it be better to just normalize everything to a prefix, schema, and table? Then if it is a ColumnarBatch we have the schema and can translate it to a table without much effort, and for a table we could have a helper method, similar to parquetWriterOptionsFromTable, that will generate a schema from the table.
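A minimal sketch of that normalization, with hypothetical names (GpuColumnVector.from is the plugin's existing batch-to-Table conversion; DumpSketch and schemaFromBatch are assumptions for illustration, analogous to parquetWriterOptionsFromTable):

```scala
import ai.rapids.cudf.Table
import com.nvidia.spark.rapids.GpuColumnVector
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.ColumnarBatch

object DumpSketch {
  // Everything funnels through one (prefix, schema, table) entry point.
  def dump(filePrefix: String, schema: StructType, table: Table): Unit = {
    // ... write `table` out as Parquet using `schema` ...
  }

  def dump(filePrefix: String, batch: ColumnarBatch): Unit = {
    // GpuColumnVector.from builds a cudf Table from the batch's GPU columns
    val table = GpuColumnVector.from(batch)
    try {
      dump(filePrefix, schemaFromBatch(batch), table)
    } finally {
      table.close()
    }
  }

  // Hypothetical helper that derives a schema from the batch,
  // analogous to parquetWriterOptionsFromTable.
  private def schemaFromBatch(batch: ColumnarBatch): StructType = ???
}
```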
build
@revans2 Addressed all your comments, please review again.
How about adding a note to the dev docs, or sending out a note that this now exists?
This fixes #3115
Signed-off-by: Chong Gao <res_life@163.com>