Parquet support for MapType #3234
Conversation
build
Several review threads on sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetFileFormat.scala were resolved (some marked outdated).
build
a.elementType,
name,
writeInt96,
true).build())
Why is this true when the others are nullable? A comment would be good.
Let me know if the comment for passing the nullable param to the listBuilder and the others is enough. We already documented why we are hard-coding nullable to true.
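For readers following the thread, here is a rough, self-contained sketch of the pattern under discussion. The names `writerOptionsFromSparkType` and `ToyColumnOptions` are hypothetical stand-ins (the plugin actually builds ai.rapids.cudf writer options); the point is only to show where each nullability flag comes from and where the hard-coded `true` for list elements sits.

```scala
import org.apache.spark.sql.types._

// Hypothetical stand-in for a writer-options node; the real plugin builds
// ai.rapids.cudf ParquetWriterOptions instead of this toy class.
final case class ToyColumnOptions(
    name: String,
    nullable: Boolean,
    children: Seq[ToyColumnOptions] = Nil)

// Recursively translate a Spark type into writer options, mirroring the shape of
// the code in the diff: most fields forward their own nullability, but the child
// of an ArrayType is built with nullable = true (the hard-coded value asked about),
// and map keys are always non-nullable per the Parquet spec.
def writerOptionsFromSparkType(
    dt: DataType,
    name: String,
    nullable: Boolean): ToyColumnOptions = dt match {
  case a: ArrayType =>
    ToyColumnOptions(name, nullable,
      Seq(writerOptionsFromSparkType(a.elementType, "element", nullable = true)))
  case m: MapType =>
    ToyColumnOptions(name, nullable, Seq(
      writerOptionsFromSparkType(m.keyType, "key", nullable = false),
      writerOptionsFromSparkType(m.valueType, "value", m.valueContainsNull)))
  case s: StructType =>
    ToyColumnOptions(name, nullable,
      s.fields.toSeq.map(f => writerOptionsFromSparkType(f.dataType, f.name, f.nullable)))
  case _ =>
    ToyColumnOptions(name, nullable)
}
```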
I have pushed a PR on the cudf side to remove the null_mask when a table has no nulls, here. I am not exactly sure how to test this out, as I haven't yet been able to create a column from the Python tests that carries a null_mask even though it has no nulls. We might need a Scala test for it.
The simplest way I know of is to create a column with nulls in it and then filter out all of the nulls. copy_if_else is not smart enough to know that you are filtering on nulls, so it will always still allocate a validity buffer.
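A sketch of that approach as a Scala test, assuming the ai.rapids.cudf Java bindings are on the classpath (`ColumnVector.fromBoxedInts`, `isNotNull`, and `Table.filter` are the calls assumed here); asserting on whether the validity buffer itself survived would need lower-level buffer accessors, so that step is only noted in a comment:

```scala
import ai.rapids.cudf.{ColumnVector, Table}

// Build a column that is guaranteed to carry a validity buffer by including nulls,
// then filter the nulls out. The filtered column reports nullCount == 0, which is
// exactly the "no nulls but mask still allocated" situation described above.
val col = ColumnVector.fromBoxedInts(1, null, 3, null, 5)
val mask = col.isNotNull
val input = new Table(col)
val filtered = input.filter(mask)
val result = filtered.getColumn(0)
assert(result.getNullCount == 0)
// A real test would additionally assert whether the validity buffer is still
// present (or dropped, once the cudf change lands); that check is omitted here.
Seq[AutoCloseable](filtered, input, mask, col).foreach(_.close())
```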
Maybe my comment here is too late since that PR has been merged: why do we need to remove the null mask? Why don't we just check the null count and ignore the null mask if there aren't any nulls? Per the Arrow spec, the null mask is optional when all values are valid.
I filed an issue against cudf for that, but for performance reasons they didn't want to actually check the null count, just the presence of the validity buffer. This is here as a workaround for that.
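To illustrate the distinction being made, here is a toy model of the workaround's logic. This is not the cudf API, just a sketch: downstream code keys off "is a validity mask present" rather than "is the null count zero", so the fix strips the mask up front when the column genuinely has no nulls.

```scala
// A toy model of a column: data plus an optional validity mask. This is NOT the
// cudf API; it only illustrates the shape of the workaround discussed above.
final case class ToyColumn(data: Array[Int], validity: Option[Array[Boolean]]) {
  def nullCount: Int = validity.map(_.count(v => !v)).getOrElse(0)
}

// Drop the mask only when the column truly has no nulls, so code that checks for
// the presence of a validity mask (the cheap check cudf prefers) sees an
// all-valid column.
def dropMaskIfAllValid(col: ToyColumn): ToyColumn =
  if (col.validity.isDefined && col.nullCount == 0) col.copy(validity = None)
  else col
```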
Force-pushed from 7b04994 to 569edb4.
I apologize for force-pushing, I had to because git was confused about the number of commits after I rebased
build
build
> I apologize for force-pushing, I had to because git was confused about the number of commits after I rebased
Why was the PR being rebased? That's typically going to require a force-push on its own. Once a PR is posted for review, normally one would only add merge commits to bring a PR up to date rather than rebase the branch.
This PR adds support for Maps and nested Maps for Parquet.
Signed-off-by: Raza Jafri rjafri@nvidia.com
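As a usage illustration of what the PR enables, a small job that writes a map column and a nested map column to Parquet. The output path is hypothetical, and `spark.plugins=com.nvidia.spark.SQLPlugin` is assumed as the usual way the RAPIDS Accelerator is enabled; this is a sketch, not part of the PR itself.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.map

val spark = SparkSession.builder()
  .appName("maptype-parquet-write")
  // Assumed config: enables the RAPIDS Accelerator so the GPU Parquet writer is used.
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .getOrCreate()
import spark.implicits._

// A MapType column and a nested map (map of maps), the shapes this PR adds support for.
val df = Seq((1, "a"), (2, "b")).toDF("k", "v")
  .select(
    map($"k", $"v").as("simple_map"),
    map($"k", map($"k", $"v")).as("nested_map"))

// Hypothetical output path.
df.write.mode("overwrite").parquet("/tmp/maptype_parquet_demo")
```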