[FEA] [JSON reader] to support column prune #9990

wbo4958 · 2022-01-07T03:40:59Z

This is part of the feature request NVIDIA/spark-rapids#9.

We have a JSON file with the following lines:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

When only the column name (string type) is specified for reading, the JSON reader throws an exception:
ai.rapids.cudf.CudfException: cuDF failure at: /home/bobwang/work.d/nvspark/cudf/cpp/src/io/json/reader_impl.cu:421: Must specify types for all columns

It looks like the JSON reader requires either types for all columns, or no column names at all so that it infers the whole schema.

We would like the JSON reader to read only the columns the user specifies, instead of requiring every column to be listed.
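
To make the requested behavior concrete, here is a minimal sketch in plain Python (standard library only, not cuDF). The function name `read_json_lines` and its `columns` parameter are hypothetical and only illustrate the semantics being asked for: requested columns are returned, unlisted columns are dropped, and missing values become nulls.

```python
import io
import json

def read_json_lines(source, columns):
    """Read JSON Lines text, keeping only the requested top-level columns.

    Hypothetical helper for illustration; this is not the cuDF API.
    Keys missing from a record become None; unrequested keys are dropped.
    """
    table = {name: [] for name in columns}
    for line in source:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for name in columns:
            table[name].append(record.get(name))
    return table

data = io.StringIO(
    '{"name":"Michael"}\n'
    '{"name":"Andy", "age":30}\n'
    '{"name":"Justin", "age":19}\n'
)
pruned = read_json_lines(data, ["name"])
print(pruned)
```

With the three records above, asking for `name` alone should succeed and return only that column, without the caller having to know that `age` exists.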


revans2 · 2022-01-07T15:02:20Z

This is a blocker for Spark to be able to use the JSON reader. Because we do not know all of the columns, the user just gives the ones that they want to read.

vuule · 2022-01-12T21:29:21Z

> This is a blocker for Spark to be able to use the JSON reader. Because we do not know all of the columns, the user just gives the ones that they want to read.

Would it be viable to read all columns and then select the ones of interest?

revans2 · 2022-01-12T21:41:43Z

Not totally. In general, we rely on Spark to tell us the schema of the data we want to read, and we pass that on to CUDF so it can select the correct columns and return them in the format we want. The Java API does not even have a way to tell us which columns were returned. We definitely need to fix that anyway. But even if it did tell us the names of all of the columns, we would have to ask CUDF to resolve the schema for us each time. Once that was done, we would throw away the columns we didn't want and cast the columns we did find to the schema that Spark requested.

This could work, but it is not an ideal long-term solution, especially because Spark parses a lot of values very differently from how CUDF does. Part of the plan was to ask CUDF to return everything as strings so that we could use our own customized code to parse them into values in a way that is much closer to how Spark does it. I don't see how we can ask for everything to be strings without knowing up front which columns we want.

This does not have to be done for 22.02. It would be great if it were done in time, but we have already decided that JSON parsing will be off by default in Spark for 22.02, as an experimental feature that someone could try.
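
A rough sketch of the read-everything-then-select workaround discussed above, in plain Python rather than the cuDF Java API. The helpers `read_all_as_strings` and `select_and_cast` are hypothetical stand-ins: the first plays the role of a reader that returns every column as strings, the second is the caller-side step that drops unwanted columns and parses the rest.

```python
import json

def to_text(value):
    """Render a parsed JSON value back to text; nulls stay None."""
    if value is None:
        return None
    return value if isinstance(value, str) else json.dumps(value)

def read_all_as_strings(lines):
    """Hypothetical stand-in for a reader that returns every column as strings."""
    rows = [json.loads(line) for line in lines if line.strip()]
    names = []
    for row in rows:
        for key in row:
            if key not in names:
                names.append(key)  # keep first-seen column order
    table = {name: [to_text(row.get(name)) for row in rows] for name in names}
    return table, len(rows)

def select_and_cast(table, num_rows, schema):
    """Keep only the columns in `schema`, parsing each with its own function."""
    result = {}
    for name, parse in schema.items():
        # A requested column that never appears in the input is all nulls.
        column = table.get(name, [None] * num_rows)
        result[name] = [None if v is None else parse(v) for v in column]
    return result

lines = [
    '{"name":"Michael"}',
    '{"name":"Andy", "age":30}',
    '{"name":"Justin", "age":19}',
]
table, num_rows = read_all_as_strings(lines)
wanted = select_and_cast(table, num_rows, {"age": int})
print(wanted)
```

Note that the first stage still has to discover and materialize every column before the second stage can discard most of them, which is the extra work this comment is objecting to.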

vuule · 2022-01-12T21:52:52Z

I'm asking about this as a short-term solution, because column pruning would need to be reworked when we add nested type support.

Does this feature request include pruning of nested columns (same as #8848)?

revans2 · 2022-01-13T12:52:26Z

Long term, yes, we would want to be able to prune child columns as well. Unless the change is simple in the short term, I would rather have us concentrate on getting a long-term solution.


elstehle · 2022-09-06T13:23:38Z

I'm currently trying to figure out what the interface for this would look like for the new nested JSON reader.

Would it be sufficient to take a nested schema of the columns that are to be selected and that any [child] column that is not explicitly selected in that schema would not appear in the nested data being returned?

JSON lines input:
{"a":0.0,"b":{"x":0.10, "y":0.11}}
{"a":1.0,"b":{"x":1.10, "y":1.11}}

Schema:
├─ a/
├─ b/
│  ├─ b.x
│  ├─ b.y

-- EX 1 --
Select schema:
[a, b:[x,y]]

Schema returned:
├─ a/
├─ b/
│  ├─ b.x
│  ├─ b.y

-- EX 2 --
Select schema:
[a, b:[y]]

Schema returned:
├─ a/
├─ b/
│  ├─ b.y

-- EX 3 --
Select schema:
[b]

Schema returned:
├─ b/
...which would just be a struct column with validity and no child columns (as no child columns were _selected_).
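
As a rough illustration of the selection semantics in the three examples above, here is a sketch in plain Python over parsed records. It is not the libcudf interface. The nested-dict encoding of the select schema (each selected name maps to a dict of its selected children, with `{}` meaning "no children selected") and the function name `prune` are assumptions made for this sketch, and list columns are not handled.

```python
import json

def prune(value, selection):
    """Keep only the fields named in `selection`, recursing into structs.

    `selection` maps a field name to the selection for its children.
    A struct selected with an empty selection keeps no children (EX 3).
    Non-struct values (leaves and nulls) are returned unchanged.
    """
    if isinstance(value, dict):
        # A selected field missing from this record becomes None (null).
        return {name: prune(value.get(name), sub) for name, sub in selection.items()}
    return value

rows = [
    json.loads('{"a":0.0,"b":{"x":0.10, "y":0.11}}'),
    json.loads('{"a":1.0,"b":{"x":1.10, "y":1.11}}'),
]

ex1 = [prune(r, {"a": {}, "b": {"x": {}, "y": {}}}) for r in rows]  # [a, b:[x,y]]
ex2 = [prune(r, {"a": {}, "b": {"y": {}}}) for r in rows]           # [a, b:[y]]
ex3 = [prune(r, {"b": {}}) for r in rows]                           # [b]

print(ex2[0])
print(ex3[0])
```

In this encoding, EX 2 keeps `a` and `b.y` only, and EX 3 yields a `b` struct with no children in every row.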

revans2 · 2022-09-08T15:08:14Z

> Would it be sufficient to take a nested schema of the columns that are to be selected and that any [child] column that is not explicitly selected in that schema would not appear in the nested data being returned?

Yes that would work for us. We have the full list of what we want to read.

GregoryKimball · 2022-12-01T18:53:27Z

After doing some testing on the 23.02 branch, the nested JSON reader no longer throws when dtype is specified for a subset of columns:
df = cudf.read_json('{"a": 1}\n{"b":1}', lines=True, dtype={'a':'int'}, engine='cudf_experimental')

      a     b
0     1  <NA>
1  <NA>   1.0

Reading and inferring types for unspecified columns seems like the desired behavior. If we wanted to drop unspecified columns as a performance improvement, I expect the results would be underwhelming because of all the parsing work we would still have to do.

Please let me know if this issue is still needed.

GregoryKimball · 2023-06-07T16:24:33Z

I believe we can close this in favor of #13473

wbo4958 added the feature request (New feature or request) and Needs Triage (Need team to review and classify) labels Jan 7, 2022

nartal1 mentioned this issue Jan 7, 2022

[FEA] JSON: Basic Reader for String type NVIDIA/spark-rapids#4135

Closed

wbo4958 changed the title from "[FEA] to support column prune for json reader" to "[FEA] [JSON reader] to support column prune" Jan 10, 2022

github-actions bot added the inactive-30d label Feb 12, 2022

sameerz added the Spark (Functionality that helps Spark RAPIDS) label Mar 23, 2022

vuule added the cuIO (cuIO issue) label Jun 8, 2022

GregoryKimball removed the Needs Triage (Need team to review and classify) label Jun 24, 2022

GregoryKimball added this to the Nested JSON reader milestone Jul 1, 2022

GregoryKimball added the 0 - Backlog (In queue waiting for assignment) label and removed the inactive-30d label Oct 26, 2022

GregoryKimball added the libcudf (Affects libcudf (C++/CUDA) code.) label Apr 2, 2023

GregoryKimball closed this as completed Jun 7, 2023
