Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Orc read map #3394

Merged
merged 6 commits into from
Sep 9, 2021
Merged

Orc read map #3394

merged 6 commits into from
Sep 9, 2021

Conversation

res-life
Copy link
Collaborator

@res-life res-life commented Sep 6, 2021

orc read map
This fixes #3297
Signed-off-by: Chong Gao res_life@163.com

res-life and others added 3 commits September 6, 2021 11:05
Signed-off-by: Chong Gao <chongg@nvidia.com>
Signed-off-by: Chong Gao <chongg@nvidia.com>
Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Copy link
Collaborator Author

res-life commented Sep 6, 2021

Note: please do not merge, is waiting CUDF to merge: rapidsai/cudf#9132

@@ -82,9 +82,16 @@ def test_basic_read(std_input_path, name, read_func, v1_enabled_list, orc_impl,
StructGen([['child0', byte_gen], ['child1', orc_basic_struct_gen]]),
StructGen([['child0', ArrayGen(short_gen)], ['child1', double_gen]])]

all_basic_map_gens = [MapGen(f(nullable=False), f()) for f in [BooleanGen, ByteGen, ShortGen, IntegerGen, LongGen, FloatGen, DoubleGen, TimestampGen]] + [simple_string_to_string_map_gen]
all_basic_map_gens += [DateGen(start=date(1590, 1, 1))]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems this is not a Map type?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same question. DateGen is obviously not a MapGen. You may refer to the MapGen for Parquet reader tests here https://github.com/NVIDIA/spark-rapids/blob/branch-21.10/integration_tests/src/main/python/parquet_test.py#L45

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, referred parquet map gen

all_basic_map_gens += [DateGen(start=date(1590, 1, 1))]
orc_map_gens_sample = all_basic_map_gens + [MapGen(StringGen(pattern='key_[0-9]', nullable=False), ArrayGen(string_gen), max_length=10),
MapGen(RepeatSeqGen(IntegerGen(nullable=False), 10), long_gen, max_length=10),
MapGen(StringGen(pattern='key_[0-9]', nullable=False), simple_string_to_string_map_gen)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does your PR support Map(xx, Struct)? If yes, better add this for supporting.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added map(struct, struct)

@firestarman firestarman added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Sep 6, 2021
@firestarman firestarman marked this pull request as draft September 7, 2021 01:15
@firestarman
Copy link
Collaborator

Move to draft because it depends on the open PR rapidsai/cudf#9132.

@@ -82,9 +82,16 @@ def test_basic_read(std_input_path, name, read_func, v1_enabled_list, orc_impl,
StructGen([['child0', byte_gen], ['child1', orc_basic_struct_gen]]),
StructGen([['child0', ArrayGen(short_gen)], ['child1', double_gen]])]

all_basic_map_gens = [MapGen(f(nullable=False), f()) for f in [BooleanGen, ByteGen, ShortGen, IntegerGen, LongGen, FloatGen, DoubleGen, TimestampGen]] + [simple_string_to_string_map_gen]
all_basic_map_gens += [DateGen(start=date(1590, 1, 1))]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same question. DateGen is obviously not a MapGen. You may refer to the MapGen for Parquet reader tests here https://github.com/NVIDIA/spark-rapids/blob/branch-21.10/integration_tests/src/main/python/parquet_test.py#L45

Signed-off-by: Chong Gao <res_life@163.com>
@sameerz sameerz added the feature request New feature or request label Sep 7, 2021
Signed-off-by: Chong Gao <res_life@163.com>
@firestarman
Copy link
Collaborator

This is likely to conflict with #3393, so this PR will need to be updated if 3393 is merged before this one.

@firestarman firestarman marked this pull request as ready for review September 9, 2021 05:55
@firestarman
Copy link
Collaborator

build

@firestarman
Copy link
Collaborator

LGTM

@res-life
Copy link
Collaborator Author

res-life commented Sep 9, 2021

build

@res-life res-life merged commit fe243c2 into NVIDIA:branch-21.10 Sep 9, 2021
@res-life res-life deleted the orc-read-map branch March 13, 2022 05:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] ORC reader supports reading Map columns.
4 participants