Support DurationType in cudf parquet reader via arrow:schema #15617

Merged
Changes from 55 commits
61 commits
053f7da
Read duration type in cudf parquet via arrow:schema
mhaseeb123 Apr 30, 2024
aa4e9bb
reverting an inadvertently removed code line.
mhaseeb123 Apr 30, 2024
6c67c28
clang-format changes
mhaseeb123 Apr 30, 2024
0e6fc4a
Update cpp/src/io/parquet/reader_impl_helpers.cpp
mhaseeb123 Apr 30, 2024
a6eca13
Co-walk arrow and parquet schema
mhaseeb123 May 1, 2024
ced5dd9
fixing copyrights
mhaseeb123 May 1, 2024
b192352
fix the hardcoded if conditions for duration type
mhaseeb123 May 1, 2024
18d5e6c
add boolean check for arrow type columns
mhaseeb123 May 1, 2024
8f55983
add basic testing for duration type
mhaseeb123 May 1, 2024
6883c7e
revert clangd induced formatting
mhaseeb123 May 1, 2024
ab5cacd
more reverting clangd
mhaseeb123 May 1, 2024
649148c
remove raw for loops, verify equal fields at each schema level
mhaseeb123 May 2, 2024
416dbbd
Remove flatbuffer files. Add flatbuffers via CMake
mhaseeb123 May 2, 2024
c5a7b0e
Make arrow schema use in PQ reader optional. Add tests.
mhaseeb123 May 2, 2024
6f18766
minor updates for better readability
mhaseeb123 May 2, 2024
e4b9e74
Merge branch 'branch-24.06' into arrow-schema-support-pq-reader
mhaseeb123 May 2, 2024
dc7564a
fix arrow schema walk to handle list type columns. Add more pytests
mhaseeb123 May 3, 2024
0c4e7c4
add comments for the dummy node hack
mhaseeb123 May 3, 2024
0514b5c
Adding `map` type to parquet testing.
mhaseeb123 May 3, 2024
a1f8fe7
relocate files, fix copyirghts and ruff checks
mhaseeb123 May 6, 2024
a36c1c6
minor fix for verify copyright hook
mhaseeb123 May 6, 2024
59d84f4
Merge branch 'branch-24.06' into arrow-schema-support-pq-reader
mhaseeb123 May 6, 2024
6b9bde5
update copyright messages
mhaseeb123 May 6, 2024
041ff76
Merge branch 'arrow-schema-support-pq-reader' of https://github.com/m…
mhaseeb123 May 6, 2024
cb691dd
segfault-proof the `validate_schemas` method
mhaseeb123 May 6, 2024
59610cd
Merge branch 'branch-24.06' into arrow-schema-support-pq-reader
mhaseeb123 May 6, 2024
ed83908
C++ friendly base64 encoder/decoder implementations
mhaseeb123 May 7, 2024
fbd3356
minor updates
mhaseeb123 May 7, 2024
b93c2c0
fix the erroneous inequality check to equality
mhaseeb123 May 7, 2024
d01f94c
use string find instead of custom function for better speed
mhaseeb123 May 7, 2024
b8c338b
optimize base64 encode
mhaseeb123 May 7, 2024
e47bbfb
fix minor signed comparison error
mhaseeb123 May 7, 2024
0b5ec61
speed optimization for decoder
mhaseeb123 May 7, 2024
83a13a7
Apply suggestions from code review
mhaseeb123 May 8, 2024
69be7db
applying suggestions from reviewers
mhaseeb123 May 8, 2024
0d41d99
minor updates from reviewer suggestions
mhaseeb123 May 8, 2024
56bbc15
add ctests for base64 encoder and decoder
mhaseeb123 May 8, 2024
bd54430
minor comments update
mhaseeb123 May 9, 2024
e954b45
Apply styling suggestions from code review
mhaseeb123 May 9, 2024
b870359
minor updates and better styling
mhaseeb123 May 9, 2024
c34c248
adding const to decode_ipc_message fn
mhaseeb123 May 9, 2024
dda87d1
avoid returning raw pointer in decode_ipc_message
mhaseeb123 May 9, 2024
e9f441d
move base64 definitions to a source file and add it to cmake
mhaseeb123 May 10, 2024
ac85ecc
apply suggestions from the reviews
mhaseeb123 May 10, 2024
45261f1
Apply suggestions from code review
mhaseeb123 May 10, 2024
f92fcc8
improve round trip tests for thorough arrow schema testing plus minor…
mhaseeb123 May 10, 2024
1c36d36
Update cpp/src/io/parquet/reader_impl_helpers.cpp
mhaseeb123 May 10, 2024
336574a
minor syntactical updates to tests
mhaseeb123 May 10, 2024
b0289b8
Apply suggestions from code review
mhaseeb123 May 13, 2024
3a602cc
small improvements and using zip iterator instead of counting iterato…
mhaseeb123 May 13, 2024
63b4df3
Merge branch 'branch-24.06' into arrow-schema-support-pq-reader
vuule May 13, 2024
7fbbea0
Remove explicit check for dtypes as already being done
mhaseeb123 May 13, 2024
6ab3b17
move `use_arrow_schema` to the end of parameters
mhaseeb123 May 14, 2024
4d74b24
Update tests to construct `expected` and use `assert_eq` for dtypes
mhaseeb123 May 14, 2024
a80f562
Remove `use_arrow_schema` from public Python APIs.
mhaseeb123 May 14, 2024
4e368d8
Remove `use_arrow_schema` from Cython API args as well
mhaseeb123 May 14, 2024
93ec789
Throw some Nulls in python tests
mhaseeb123 May 14, 2024
09eadcf
Merge branch 'branch-24.06' into arrow-schema-support-pq-reader
galipremsagar May 14, 2024
1d94cc8
Merge remote-tracking branch 'upstream/branch-24.06' into arrow-schem…
mhaseeb123 May 14, 2024
50d0b77
Update .pre-commit-config.yaml
galipremsagar May 14, 2024
56b2edc
Merge branch 'branch-24.06' into arrow-schema-support-pq-reader
mhaseeb123 May 15, 2024
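Several commits above add base64 encode/decode helpers and a `decode_ipc_message` function. This fits the way Arrow writers typically store the serialized schema: as a base64-encoded IPC message under the `ARROW:schema` key in the Parquet key-value metadata. The following is a minimal sketch of a standard (RFC 4648) base64 decoder, assuming padded input. The name `base64_decode_sketch` is hypothetical, and the code is not the implementation added in this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper for illustration; not the cudf implementation.
// Decodes padded RFC 4648 base64 and returns std::nullopt on malformed input.
std::optional<std::string> base64_decode_sketch(std::string_view encoded)
{
  constexpr std::string_view alphabet =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

  // Padded base64 always comes in groups of four characters.
  if (encoded.size() % 4 != 0) { return std::nullopt; }

  std::string decoded;
  decoded.reserve(encoded.size() / 4 * 3);

  for (std::size_t i = 0; i < encoded.size(); i += 4) {
    std::uint32_t chunk = 0;  // accumulates 4 x 6 bits = 24 bits
    int padding         = 0;
    for (std::size_t j = 0; j < 4; ++j) {
      char const c = encoded[i + j];
      chunk <<= 6;
      if (c == '=') {
        // Padding is only legal in the last two slots of the final group.
        if (i + 4 != encoded.size() || j < 2) { return std::nullopt; }
        ++padding;
      } else {
        auto const pos = alphabet.find(c);
        // Reject characters outside the alphabet and data after padding.
        if (pos == std::string_view::npos || padding > 0) { return std::nullopt; }
        chunk |= static_cast<std::uint32_t>(pos);
      }
    }
    // Emit 3 bytes per group, minus one byte per padding character.
    decoded.push_back(static_cast<char>((chunk >> 16) & 0xFF));
    if (padding < 2) { decoded.push_back(static_cast<char>((chunk >> 8) & 0xFF)); }
    if (padding < 1) { decoded.push_back(static_cast<char>(chunk & 0xFF)); }
  }
  return decoded;
}
```

The alphabet lookup uses `std::string_view::find` rather than a hand-written search, in the spirit of the "use string find instead of custom function" commit. A lookup table would likely be faster for large inputs.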
8 changes: 5 additions & 3 deletions .pre-commit-config.yaml
@@ -143,9 +143,11 @@ repos:
 hooks:
 - id: verify-copyright
 exclude: |
-(?x)
-cpp/include/cudf_test/cxxopts[.]hpp$
-
+(?x)^(
+cpp/include/cudf_test/cxxopts[.]hpp|
+cpp/src/io/parquet/ipc/Message_generated[.]h|
+cpp/src/io/parquet/ipc/Schema_generated[.]h
+)$

 default_language_version:
 python: python3
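The new `exclude` value is a verbose-mode (`(?x)`) regular expression, so the line breaks and indentation inside it are ignored and the alternatives collapse into one anchored pattern. As a rough sketch of what it matches, the same pattern can be written without verbose mode, which `std::regex` does not support. The function name below is hypothetical, and the sketch assumes the hook applies the pattern as a search against each repository-relative path.

```cpp
#include <regex>
#include <string>

// Hypothetical helper for illustration: the exclude pattern from the hunk above,
// with verbose-mode whitespace removed by hand.
bool is_excluded_from_copyright_check(std::string const& path)
{
  static std::regex const pattern{
    "^("
    "cpp/include/cudf_test/cxxopts[.]hpp|"
    "cpp/src/io/parquet/ipc/Message_generated[.]h|"
    "cpp/src/io/parquet/ipc/Schema_generated[.]h"
    ")$"};
  // The ^ and $ anchors make the search behave like a full match on the path.
  return std::regex_search(path, pattern);
}
```

With the anchors in place, only the three listed files are skipped; a path such as `cpp/src/io/parquet/ipc/Schema_generated.hpp` would still be checked.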
4 changes: 4 additions & 0 deletions cpp/CMakeLists.txt
@@ -192,6 +192,8 @@ include(cmake/thirdparty/get_cccl.cmake)
include(cmake/thirdparty/get_rmm.cmake)
# find arrow
include(cmake/thirdparty/get_arrow.cmake)
# find flatbuffers
include(cmake/thirdparty/get_flatbuffers.cmake)
# find dlpack
include(cmake/thirdparty/get_dlpack.cmake)
# find cuCollections, should come after including CCCL
@@ -429,6 +431,7 @@ add_library(
src/io/text/bgzip_utils.cpp
src/io/text/multibyte_split.cu
src/io/utilities/arrow_io_source.cpp
src/io/utilities/base64_utilities.cpp
src/io/utilities/column_buffer.cpp
src/io/utilities/column_buffer_strings.cu
src/io/utilities/config_utils.cpp
@@ -742,6 +745,7 @@ target_include_directories(
"$<BUILD_INTERFACE:${CUDF_GENERATED_INCLUDE_DIR}/include>"
PRIVATE "$<BUILD_INTERFACE:${CUDF_SOURCE_DIR}/src>"
"$<BUILD_INTERFACE:${nanoarrow_SOURCE_DIR}/src>"
"$<BUILD_INTERFACE:${FlatBuffers_SOURCE_DIR}/include>"
INTERFACE "$<INSTALL_INTERFACE:include>"
)

33 changes: 33 additions & 0 deletions cpp/cmake/thirdparty/get_flatbuffers.cmake
@@ -0,0 +1,33 @@
# =============================================================================
# Copyright (c) 2024, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
# in compliance with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied. See the License for the specific language governing permissions and limitations under
# the License.
# =============================================================================

# Use CPM to find or clone flatbuffers
function(find_and_configure_flatbuffers VERSION)

rapids_cpm_find(
flatbuffers ${VERSION}
GLOBAL_TARGETS flatbuffers
CPM_ARGS
GIT_REPOSITORY https://github.com/google/flatbuffers.git
GIT_TAG v${VERSION}
GIT_SHALLOW TRUE
)

rapids_export_find_package_root(
BUILD flatbuffers "${flatbuffers_BINARY_DIR}" EXPORT_SET cudf-exports
)

endfunction()

find_and_configure_flatbuffers(24.3.25)
28 changes: 28 additions & 0 deletions cpp/include/cudf/io/parquet.hpp
@@ -71,6 +71,8 @@ class parquet_reader_options {
bool _convert_strings_to_categories = false;
// Whether to use PANDAS metadata to load columns
bool _use_pandas_metadata = true;
// Whether to read and use ARROW schema
bool _use_arrow_schema = true;
// Cast timestamp columns to a specific type
data_type _timestamp_type{type_id::EMPTY};

@@ -126,6 +128,13 @@ class parquet_reader_options {
*/
[[nodiscard]] bool is_enabled_use_pandas_metadata() const { return _use_pandas_metadata; }

/**
* @brief Returns true/false depending whether to use arrow schema while reading.
*
* @return `true` if arrow schema is used while reading
*/
[[nodiscard]] bool is_enabled_use_arrow_schema() const { return _use_arrow_schema; }

/**
* @brief Returns optional tree of metadata.
*
@@ -214,6 +223,13 @@ class parquet_reader_options {
*/
void enable_use_pandas_metadata(bool val) { _use_pandas_metadata = val; }

/**
* @brief Sets to enable/disable use of arrow schema to read.
*
* @param val Boolean value whether to use arrow schema
*/
void enable_use_arrow_schema(bool val) { _use_arrow_schema = val; }

/**
* @brief Sets reader column schema.
*
@@ -328,6 +344,18 @@ class parquet_reader_options_builder {
return *this;
}

/**
* @brief Sets to enable/disable use of arrow schema to read.
*
* @param val Boolean value whether to use arrow schema
* @return this for chaining
*/
parquet_reader_options_builder& use_arrow_schema(bool val)
{
options._use_arrow_schema = val;
return *this;
}

/**
* @brief Sets reader metadata.
*
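The hunks above add the same flag in three places: a private member defaulting to `true`, a getter/setter pair on the options class, and a chaining method on the builder. The sketch below reproduces that pattern with simplified stand-in classes so it compiles without libcudf. The `_sketch` class names and `build()` are assumptions for illustration; only the member and method names ending in `use_arrow_schema` mirror the diff.

```cpp
// Simplified stand-ins for illustration only; these are NOT the cudf classes.
class reader_options_sketch {
  bool _use_pandas_metadata = true;
  // Mirrors the diff: the arrow schema is used by default.
  bool _use_arrow_schema = true;

  friend class reader_options_builder_sketch;

 public:
  [[nodiscard]] bool is_enabled_use_pandas_metadata() const { return _use_pandas_metadata; }
  [[nodiscard]] bool is_enabled_use_arrow_schema() const { return _use_arrow_schema; }

  void enable_use_pandas_metadata(bool val) { _use_pandas_metadata = val; }
  void enable_use_arrow_schema(bool val) { _use_arrow_schema = val; }
};

class reader_options_builder_sketch {
  reader_options_sketch options;

 public:
  // Each setter returns *this so that calls can be chained.
  reader_options_builder_sketch& use_pandas_metadata(bool val)
  {
    options._use_pandas_metadata = val;
    return *this;
  }

  reader_options_builder_sketch& use_arrow_schema(bool val)
  {
    options._use_arrow_schema = val;
    return *this;
  }

  // Hypothetical finalizer that hands back a copy of the configured options.
  [[nodiscard]] reader_options_sketch build() const { return options; }
};
```

A caller who wants the pre-PR behaviour could chain `use_arrow_schema(false)` on the builder, or call `enable_use_arrow_schema(false)` on an existing options object. Because the default is `true`, files carrying an arrow schema are read with it unless the caller opts out.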