
Debug utility method to dump a table or columnar batch to Parquet #3646

Merged 6 commits into NVIDIA:branch-21.10 on Sep 29, 2021

Conversation

res-life (Collaborator)

This fixes #3115

Signed-off-by: Chong Gao <res_life@163.com>
res-life (Collaborator, Author)

@jlowe @firestarman This is a draft PR, please help review.
It only implements dumping a ColumnarBatch; dumping a cudf Table is not implemented yet.
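For context, a minimal sketch of how this draft's entry point might be invoked. The object name DumpUtils, the public method name, and the path are assumptions for illustration; the Option return mirrors the Some(dumpToParquetFileImpl(...)) shape visible in the diff excerpt below.

import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical call site: dump a batch that is about to be processed so it
// can be inspected offline with any Parquet reader.
def inspect(batch: ColumnarBatch): Unit = {
  val written: Option[String] =
    DumpUtils.dumpToParquetFile(batch, "/tmp/debug/batch-") // assumed names
  written.foreach(path => println(s"Dumped batch to $path"))
}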

revans2 (Collaborator) left a comment:

I am a little conflicted here. On one hand I would like to be able to write out a Table, not just a ColumnarBatch (because I think it is more likely that we are debugging some intermediate computation that will be in Table form rather than ColumnarBatch form). But on the other hand I really like that you were able to reuse so much existing code, so there is less of a maintenance issue.

    Some(dumpToParquetFileImpl(columnarBatch, filePrefix))
  }
} catch {
  case e: Exception =>
Member:

Why is this suppressing exceptions? Do we think it's common that we wouldn't mind if the data fails to write? I'd rather have the caller suppress exceptions when desired if that's not the common case.

res-life (Collaborator, Author)

Done, it will now throw if an exception occurs.
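A minimal sketch of the behavior after this fix, assuming the method shape implied by the diff excerpt above; the numCols guard is an assumption.

import org.apache.spark.sql.vectorized.ColumnarBatch

def dumpToParquetFile(columnarBatch: ColumnarBatch, filePrefix: String): Option[String] = {
  if (columnarBatch.numCols() == 0) {
    None // nothing to write
  } else {
    // No try/catch around the write anymore: a failure propagates to the
    // caller, which can suppress it itself if best-effort dumping is desired.
    Some(dumpToParquetFileImpl(columnarBatch, filePrefix))
  }
}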

ParquetWriteSupport.setSchema(schema, hadoopConf)
hadoopConf.setBoolean(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key, false)
hadoopConf.set(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key,
  ParquetOutputTimestampType.INT96.toString)
Member:

Timestamps should be written with TIMESTAMP_MICROS, which means no conversion is necessary from the internal format. INT96 requires conversions, and we only use it because Spark defaults to it. This is a debug tool, so ideally we want no conversions applied to the data.

res-life (Collaborator, Author)

Done, the dump now avoids any conversion.
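A minimal sketch of the configuration change being discussed, reusing the hadoopConf setup from the diff excerpt above; TIMESTAMP_MICROS writes Spark's internal microsecond timestamp values directly.

import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.internal.SQLConf.ParquetOutputTimestampType

// Write timestamps as TIMESTAMP_MICROS instead of INT96 so the internal
// microsecond representation is written unchanged, with no conversion.
hadoopConf.set(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key,
  ParquetOutputTimestampType.TIMESTAMP_MICROS.toString)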

sameerz added the "task" label (Work required that improves the product but is not user facing) on Sep 25, 2021
@@ -0,0 +1,120 @@
/*
* Copyright (c) 2020-2021, NVIDIA CORPORATION.
Collaborator:

Suggested change
- * Copyright (c) 2020-2021, NVIDIA CORPORATION.
+ * Copyright (c) 2021, NVIDIA CORPORATION.

res-life (Collaborator, Author)

Done.

Chong Gao added 4 commits September 27, 2021 21:02
Signed-off-by: Chong Gao <res_life@163.com>
Signed-off-by: Chong Gao <res_life@163.com>
Signed-off-by: Chong Gao <res_life@163.com>
res-life (Collaborator, Author)

build

res-life marked this pull request as ready for review on September 27, 2021 13:13
res-life (Collaborator, Author)

@revans2 I added dump code for a cudf Table, please help review.
@jlowe please help review.
@firestarman please help review.

revans2 (Collaborator) left a comment:

This feels really overly complicated. Would it be better to just normalize everything to a prefix, schema, and table? Then if it is a ColumnarBatch we have the schema and can translate it to a table without much effort, and for a Table we have a helper method, similar to parquetWriterOptionsFromTable, that will generate a schema from the table.
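A hedged sketch of the normalization being proposed. DumpUtils, dumpToParquetImpl, and schemaFromTable are hypothetical names, and GpuColumnVector.from is assumed to build a cudf Table from a batch of GPU-backed columns; this is an illustration of the shape, not the merged implementation.

import ai.rapids.cudf.Table
import com.nvidia.spark.rapids.GpuColumnVector // assumed conversion helper
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.ColumnarBatch

object DumpUtils {
  // ColumnarBatch entry point: the Spark schema is already known, so convert
  // the batch to a cudf Table and fall through to the common path.
  def dumpToParquet(batch: ColumnarBatch, schema: StructType, prefix: String): String = {
    val table = GpuColumnVector.from(batch) // assumed batch-to-Table helper
    try {
      dumpToParquetImpl(prefix, schema, table)
    } finally {
      table.close()
    }
  }

  // Table entry point: derive a schema from the table's column types, in the
  // same spirit as parquetWriterOptionsFromTable.
  def dumpToParquet(table: Table, prefix: String): String =
    dumpToParquetImpl(prefix, schemaFromTable(table), table)

  // Everything is normalized to (prefix, schema, table) before writing.
  private def dumpToParquetImpl(prefix: String, schema: StructType, table: Table): String = ???
  private def schemaFromTable(table: Table): StructType = ???
}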

Signed-off-by: Chong Gao <res_life@163.com>
res-life (Collaborator, Author)

build

res-life (Collaborator, Author)

@revans2 I addressed all of your comments, please review again.

res-life merged commit df44623 into NVIDIA:branch-21.10 on Sep 29, 2021
res-life deleted the dump-utils branch on September 29, 2021 01:13
tgravescs (Collaborator)

How about adding a note to the dev docs, or sending out a note, that this now exists?
