Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet support for MapType #3234

Merged
merged 7 commits into from
Aug 25, 2021

Conversation

razajafri
Copy link
Collaborator

This PR adds support for Maps and nested Maps for parquet.

Signed-off-by: Raza Jafri rjafri@nvidia.com

@razajafri
Copy link
Collaborator Author

build

@sameerz sameerz added the feature request New feature or request label Aug 15, 2021
@sameerz sameerz added this to the Aug 16 - Aug 27 milestone Aug 15, 2021
@razajafri
Copy link
Collaborator Author

build

@razajafri razajafri requested review from jlowe and revans2 August 17, 2021 02:11
a.elementType,
name,
writeInt96,
true).build())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this true when the others are nullable? A comment would be good.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me know if the comment for passing nullable param to the listBuilder and others is enough. We already documented why we are hard-coding nullable to true

@razajafri
Copy link
Collaborator Author

I have pushed a PR for the cudf side for removing null_mask if there are no nulls in a table here.

I am not exactly sure how to test this out as I haven't yet been successful in creating a column from python tests that can create a null_mask even though the column doesn't have nulls. We might need a scala test for it

@revans2
Copy link
Collaborator

revans2 commented Aug 18, 2021

I have pushed a PR for the cudf side for removing null_mask if there are no nulls in a table here.

I am not exactly sure how to test this out as I haven't yet been successful in creating a column from python tests that can create a null_mask even though the column doesn't have nulls. We might need a scala test for it

The simplest way I know of is to create a column with nulls in it and then filter out all of the nulls. copy_if_else is not smart enough to know that you are filtering on nulls so it will always still allocate a validity buffer.

@ttnghia
Copy link
Collaborator

ttnghia commented Aug 23, 2021

I have pushed a PR for the cudf side for removing null_mask if there are no nulls in a table [here] rapidsai/cudf#9061).

Maybe my comment here is too late since that PR has been merged: Why do we need to remove the null mask? Why don't we just check for null count and ignore the null mask if there is not any null? As per Arrow specs, the null mask can be optional with all-valid bit.

@revans2
Copy link
Collaborator

revans2 commented Aug 23, 2021

Maybe my comment here is too late since that PR has been merged: Why do we need to remove the null mask? Why don't we just check for null count and ignore the null mask if there is not any null? As per Arrow specs, the null mask can be optional with all-valid bit.

I filed an issue against cudf for that, but because of performance reasons they didn't want to actually check for a null count, but instead the presence of the validity buffer. This is here as a work around for that.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Copy link
Collaborator Author

I apologize for force-pushing, I had to because git was confused about the number of commits after I rebased

@razajafri razajafri requested review from revans2 and jlowe August 23, 2021 22:12
@razajafri
Copy link
Collaborator Author

build

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Copy link
Collaborator Author

build

Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apologize for force-pushing, I had to because git was confused about the number of commits after I rebased

Why was the PR being rebased? That's typically going to require a force-push on its own. Once a PR is posted for review, normally one would only add merge commits to bring a PR up to date rather than rebase the branch.

@razajafri razajafri merged commit 5a9fecc into NVIDIA:branch-21.10 Aug 25, 2021
@razajafri razajafri deleted the parquet_writer_map_support branch August 25, 2021 00:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants