
Support decimal type in orc reader #3239

Merged
merged 6 commits into NVIDIA:branch-21.10 from decimal-orc on Aug 20, 2021

Conversation

firestarman
Collaborator

This fixes #3177

Signed-off-by: Firestarman <firestarmanllc@gmail.com>

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

@jlowe jlowe added this to the Aug 16 - Aug 27 milestone Aug 17, 2021
// We should honor the actual decimal type used for the cudf column first, then the precision,
// instead of always inferring a decimal type from the Spark precision, because both
// DECIMAL32 and DECIMAL64 can support precisions <= MAX_INT_DIGITS and cudf may use either one
// in that case. For example, the ORC reader always reads decimals as DECIMAL64, even when the
// precision would fit in DECIMAL32.
Member

Should the plugin's ORC reader be downcasting columns to DECIMAL32 in the case of excessively-precise decimals? That could save space on operations downstream from the query.

Collaborator Author
@firestarman firestarman Aug 18, 2021

Yes, it would be better to cast columns to DECIMAL32 when the precision allows. However, I think this is an improvement rather than something required for the functionality.

I thought about this for a while. It seems we need to iterate over the columns (including nested columns) to find the ones matching this case and cast them, and compound columns then need to be rebuilt with the newly cast children. The whole process looks like a rule for space optimization, which could be used in more places than just the ORC reader.

Instead of integrating the process into the ORC reader, how about implementing it as a utility method and applying it to the ORC reader in another PR? Or do you have any other suggestions?
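
A rough sketch of that idea in Scala, assuming the cudf Java API (the helper name and the precisions parameter are hypothetical, not the plugin's actual utility, and nested/compound columns are glossed over rather than rebuilt as described above):

import ai.rapids.cudf.{ColumnVector, DType, Table}

object DecimalDowncast {
  // Largest precision that fits in an Int (Spark's MAX_INT_DIGITS); kept local to stay self-contained.
  private val MaxIntDigits = 9

  // Cast top-level DECIMAL64 columns whose Spark precision fits in an int to DECIMAL32.
  // Struct/list columns would additionally need their children cast and the parent rebuilt,
  // which is omitted here for brevity.
  def downcastDecimal64(table: Table, precisions: Array[Int]): Table = {
    val cols = (0 until table.getNumberOfColumns).map { i =>
      val col = table.getColumn(i)
      val dt = col.getType
      if (dt.getTypeId == DType.DTypeEnum.DECIMAL64 && precisions(i) <= MaxIntDigits) {
        // Same scale, smaller storage: DECIMAL64 -> DECIMAL32.
        col.castTo(DType.create(DType.DTypeEnum.DECIMAL32, dt.getScale))
      } else {
        col.incRefCount() // keep ownership uniform so the caller can close everything
      }
    }
    try {
      new Table(cols: _*) // the Table takes its own references to the columns
    } finally {
      cols.foreach(_.close())
    }
  }
}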

Member

I think we should be consistent with how the decimals are loaded. The Parquet reader is already converting DECIMAL64 columns into DECIMAL32 columns on read when appropriate, so I think we should do the same here. The code to convert is already written and can be refactored from the Parquet code as a utility method.

Member

The Parquet code for this is part of ParquetPartitionReaderBase#evolveSchemaIfNeededAndClose

Collaborator Author

Got it. Updated.

…or.java


Address the comment

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
@sameerz sameerz added the feature request New feature or request label Aug 17, 2021
@firestarman
Collaborator Author

build

@sperlingxx
Collaborator

LGTM

@firestarman
Collaborator Author

build

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Collaborator
@revans2 revans2 left a comment

Just a nit for me

}

/**
* Cast columns with precision that can be stored in an int to DECIMAL32, to save space.
Collaborator

nit: It might be good to expand on this a bit. Our plugin makes the assumption that if the precision is small enough to fit in a DECIMAL32, then CUDF has it stored as a DECIMAL32. Getting this wrong can lead to a number of problems later on.
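
Roughly, the expanded comment might read something like this (illustrative wording only, based on the suggestion above):

/**
 * Cast columns with a precision small enough to fit in an int to DECIMAL32, to save space.
 *
 * The plugin assumes that any decimal column whose precision fits in a DECIMAL32 is stored
 * by cudf as DECIMAL32. A DECIMAL64 column that slips through here breaks that assumption
 * and can lead to a number of problems in downstream operations.
 */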

Collaborator Author

Will add it in a follow-up PR, to avoid re-running premerge and collecting new approvals just for a doc update.

@firestarman firestarman merged commit d3ae0d0 into NVIDIA:branch-21.10 Aug 20, 2021
@firestarman firestarman deleted the decimal-orc branch August 20, 2021 01:59
razajafri pushed a commit to razajafri/spark-rapids that referenced this pull request Aug 23, 2021
* Support decimal type in orc reader

Signed-off-by: Firestarman <firestarmanllc@gmail.com>

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Labels
feature request New feature or request
Development

Successfully merging this pull request may close these issues.

[FEA] Support decimal type in ORC reader
5 participants