Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved handling of writing NULLs in parquet #4191

Merged
merged 41 commits into from
Jul 26, 2023

Conversation

malhotrashivam
Copy link
Contributor

@malhotrashivam malhotrashivam commented Jul 14, 2023

Bug fix, closes #1548

Original State:
Before this change, nulls were handled incorrectly for primitive types as well as for arrays and vectors, such that null values were written as Deephaven null representations in the parquet file. For example, a null in an integer column would be written as QueryConstants.NULL_INT in the parquet file. Therefore, any reader other than deephaven's parquet reader would not be able to process it correctly.

Change description:
This PR fixes multiple small bugs in both java and python code for processing all nulls in a standard manner such that instead of writing different Deephaven null representations to the parquet file, we encode them by storing offsets of null values. #4186 was also discovered and fixed along with this PR for proper representation of floats in the parquet file.

Along with that, we have done refactoring in the ParquetFileWriter class to make the code more readable and add more documentation.

@malhotrashivam malhotrashivam self-assigned this Jul 14, 2023
@malhotrashivam malhotrashivam added parquet Related to the Parquet integration DocumentationNeeded labels Jul 14, 2023
@malhotrashivam malhotrashivam added this to the July 2023 milestone Jul 14, 2023
@malhotrashivam malhotrashivam added the NoReleaseNotesNeeded No release notes are needed. label Jul 14, 2023
@malhotrashivam malhotrashivam requested a review from lbooker42 July 19, 2023 22:15
jmao-denver
jmao-denver previously approved these changes Jul 20, 2023
Copy link
Contributor

@jmao-denver jmao-denver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Python changes look good.

Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change makes sense to me, good find. I suggested a bit of a refactoring for how we implement it, and I think we need to extend it to vectors and arrays.

jmao-denver
jmao-denver previously approved these changes Jul 26, 2023
Copy link
Contributor

@jmao-denver jmao-denver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Python changes LGTM

Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slight suggestions

@rcaudy rcaudy enabled auto-merge (squash) July 26, 2023 15:05
@rcaudy rcaudy merged commit acc5474 into deephaven:main Jul 26, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Jul 26, 2023
@malhotrashivam malhotrashivam added the bug Something isn't working label Jul 31, 2023
@malhotrashivam malhotrashivam changed the title [WIP] Improved handling of writing NULLs in parquet Improved handling of writing NULLs in parquet Aug 7, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working NoDocumentationNeeded NoReleaseNotesNeeded No release notes are needed. parquet Related to the Parquet integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet: Null-encoding for primitives is broken/disabled
4 participants