[FEA] [JSON reader] to support column prune #9990

Closed
wbo4958 opened this issue Jan 7, 2022 · 10 comments
Labels
0 - Backlog: In queue waiting for assignment; cuIO: cuIO issue; feature request: New feature or request; libcudf: Affects libcudf (C++/CUDA) code; Spark: Functionality that helps Spark RAPIDS

Comments

@wbo4958
Contributor

wbo4958 commented Jan 7, 2022

This is part of FEA of NVIDIA/spark-rapids#9

We have a JSON file with the lines below:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

When only the column to read is specified (name, string type), the JSON reader throws an exception:
ai.rapids.cudf.CudfException: cuDF failure at: /home/bobwang/work.d/nvspark/cudf/cpp/src/io/json/reader_impl.cu:421: Must specify types for all columns

It looks like the JSON reader requires either types for all columns, or schema inference with no column names given at all.

We would like the JSON reader to read only the columns the user specifies, without requiring every column name to be listed.
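
For illustration only, the requested behavior can be sketched in plain Python with the standard `json` module. This is not cuDF code; the function name `read_json_lines_pruned` and its `columns` parameter are hypothetical and exist only to show the semantics: requested columns are returned, unlisted keys are dropped, and missing values become nulls.

```python
import json

def read_json_lines_pruned(text, columns):
    """Parse JSON Lines text, keeping only the requested columns.

    `columns` maps a column name to a Python type used to cast non-null
    values. Keys missing from a record become None; keys that were not
    requested are ignored. (Hypothetical helper, not a cuDF API.)
    """
    result = {name: [] for name in columns}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for name, cast in columns.items():
            value = record.get(name)
            result[name].append(None if value is None else cast(value))
    return result

data = (
    '{"name":"Michael"}\n'
    '{"name":"Andy", "age":30}\n'
    '{"name":"Justin", "age":19}'
)

# Only "name" is requested; "age" is never mentioned and is simply dropped.
names_only = read_json_lines_pruned(data, {"name": str})
print(names_only)
```

With the three-line file from this issue, requesting only `name` returns one column of three strings and no error about the unlisted `age` column.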

@wbo4958 wbo4958 added feature request New feature or request Needs Triage Need team to review and classify labels Jan 7, 2022
@revans2
Contributor

revans2 commented Jan 7, 2022

This is a blocker for Spark to be able to use the JSON reader. Because we do not know all of the columns, the user just gives the ones that they want to read.

@wbo4958 wbo4958 changed the title from "[FEA] to support column prune for json reader" to "[FEA] [JSON reader] to support column prune" Jan 10, 2022
@vuule
Contributor

vuule commented Jan 12, 2022

> This is a blocker for Spark to be able to use the JSON reader. Because we do not know all of the columns, the user just gives the ones that they want to read.

Would it be viable to read all columns and then select the ones of interest?

@revans2
Contributor

revans2 commented Jan 12, 2022

Not totally. In general we rely on Spark to tell us the schema of the data we want to read, and then we pass it on to cuDF to select the correct columns and return them to us in the format we want. The Java API does not even have a way to tell us which columns were returned. We definitely need to fix that anyway.

But even if it did tell us the names of all of the columns, we would have to ask cuDF to resolve the schema for us each time. Once that is done, we would throw away the columns we didn't want and cast all of the columns we did find into the schema that Spark requested. This could work, but it is not an ideal long-term solution, especially because Spark parses a lot of values very differently from how cuDF does. Part of the plan was to ask cuDF to return everything as strings so we could use our customized code to parse them into values in a way that is much closer to how Spark does it. I don't see how we can ask for everything to be strings without knowing up front which columns we want.

This does not have to be done for 22.02. It would be great if it is done in time, but we have already decided that JSON parsing will be off by default in Spark for 22.02, as just an experimental feature that someone could try.
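
As a rough sketch of the "return everything as strings, then parse downstream" plan described above, the plain-Python example below keeps the requested columns as raw text and applies a separate parser afterwards. Everything here is an assumption for illustration: `read_columns_as_strings` and `lenient_int` are hypothetical names, and `lenient_int` does not reproduce Spark's actual parsing rules. It only shows why the set of wanted columns has to be known before reading.

```python
import json

def read_columns_as_strings(text, wanted):
    """Return only the `wanted` columns, with values kept as strings.

    Hypothetical helper, not a cuDF or Spark API.
    """
    out = {name: [] for name in wanted}
    for line in text.splitlines():
        if not line.strip():
            continue
        # parse_int/parse_float=str keep number tokens as their source text
        # instead of converting them to Python numbers.
        record = json.loads(line, parse_int=str, parse_float=str)
        for name in wanted:
            value = record.get(name)
            out[name].append(None if value is None else str(value))
    return out

def lenient_int(text):
    """Illustrative downstream parser: invalid integers become None.

    This is NOT Spark's real behavior, only a stand-in for custom parsing.
    """
    if text is None:
        return None
    try:
        return int(text)
    except ValueError:
        return None

data = (
    '{"name":"Andy", "age":30}\n'
    '{"name":"Justin", "age":"19"}\n'
    '{"name":"Sam", "age":1.5}'
)

raw = read_columns_as_strings(data, ["age"])  # columns must be named up front
ages = [lenient_int(v) for v in raw["age"]]
print(raw, ages)
```

The reader never needs to infer a type for `age`; it only needs to know that `age` is wanted, which is exactly the information Spark has available.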

@vuule
Contributor

vuule commented Jan 12, 2022

I'm asking as a short term solution, because column pruning would need to be reworked when we add nested type support.

Does this feature request include pruning of nested columns (same as #8848)?

@revans2
Contributor

revans2 commented Jan 13, 2022

Long term, yes, we would want to be able to prune child columns as well. Unless the change is simple in the short term, I would rather have us concentrate on getting a long-term solution.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@sameerz sameerz added the Spark Functionality that helps Spark RAPIDS label Mar 23, 2022
@vuule vuule added the cuIO cuIO issue label Jun 8, 2022
@GregoryKimball GregoryKimball removed the Needs Triage Need team to review and classify label Jun 24, 2022
@GregoryKimball GregoryKimball added this to the Nested JSON reader milestone Jul 1, 2022
@elstehle
Contributor

elstehle commented Sep 6, 2022

I'm currently trying to figure out what the interface for this would look like for the new nested JSON reader.

Would it be sufficient to take a nested schema of the columns that are to be selected and that any [child] column that is not explicitly selected in that schema would not appear in the nested data being returned?

JSON lines input:
{"a":0.0,"b":{"x":0.10, "y":0.11}}
{"a":1.0,"b":{"x":1.10, "y":1.11}}

Schema:
├─ a/
├─ b/
│  ├─ b.x
│  ├─ b.y
-- EX 1 --
Select schema:
[a, b:[x,y]]

Schema returned:
├─ a/
├─ b/
│  ├─ b.x
│  ├─ b.y
-- EX 2 --
Select schema:
[a, b:[y]]

Schema returned:
├─ a/
├─ b/
│  ├─ b.y
-- EX 3 --
Select schema:
[b]

Schema returned:
├─ b/
...which would just be a struct column with validity and no child columns (as no child columns were _selected_)
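
The selection semantics in EX 1 to EX 3 can be sketched on plain Python dictionaries. This is only an illustration of the proposed behavior, not the nested JSON reader's interface: the `prune` function and the dict-based encoding of the select schema (field name mapped to a nested dict for structs, or `None` for leaves) are assumptions made for this example.

```python
import json

def prune(record, select):
    """Keep only the fields named in `select`.

    `select` maps a field name to a nested select dict (for struct
    children) or None. A struct selected without any children keeps
    none of them, mirroring EX 3. (Hypothetical encoding.)
    """
    pruned = {}
    for name, children in select.items():
        if name not in record:
            continue
        value = record[name]
        if isinstance(value, dict):
            # Recurse into the struct; no listed children means none are kept.
            pruned[name] = prune(value, children or {})
        else:
            pruned[name] = value
    return pruned

lines = [
    '{"a":0.0,"b":{"x":0.10, "y":0.11}}',
    '{"a":1.0,"b":{"x":1.10, "y":1.11}}',
]
rows = [json.loads(line) for line in lines]

ex1 = [prune(r, {"a": None, "b": {"x": None, "y": None}}) for r in rows]
ex2 = [prune(r, {"a": None, "b": {"y": None}}) for r in rows]
ex3 = [prune(r, {"b": None}) for r in rows]
print(ex2)
print(ex3)
```

In this sketch EX 3 yields an empty struct per row, which corresponds to a struct column that has validity but no children.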

@revans2
Contributor

revans2 commented Sep 8, 2022

> Would it be sufficient to take a nested schema of the columns that are to be selected and that any [child] column that is not explicitly selected in that schema would not appear in the nested data being returned?

Yes that would work for us. We have the full list of what we want to read.

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment and removed inactive-30d labels Oct 26, 2022
@GregoryKimball
Contributor

After doing some testing on the 23.02 branch, the nested JSON reader no longer throws when dtype is specified for a subset of columns:
df = cudf.read_json('{"a": 1}\n{"b":1}', lines=True, dtype={'a':'int'}, engine='cudf_experimental')

      a     b
0     1  <NA>
1  <NA>   1.0

Reading and inferring types for unspecified columns seems like the desired behavior. If we wanted to drop unspecified columns as a performance improvement, I expect the results would be underwhelming due to all the parsing work we would still have to do.

Please let me know if this issue is still needed.

@GregoryKimball GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Apr 2, 2023
@GregoryKimball
Contributor

I believe we can close this in favor of #13473
