[FEA] [JSON reader] to support column prune #9990
Comments
This is a blocker for Spark being able to use the JSON reader. We do not know all of the columns up front; the user just gives us the ones that they want to read.
Would it be viable to read all columns and then select the ones of interest?
Not entirely. In general we rely on Spark to tell us the schema of the data we want to read, and we pass that on to cuDF to select the correct columns and return them to us in the format we want. The Java API does not even have a way to tell us which columns were returned; we definitely need to fix that anyway. But even if it did tell us the names of all of the columns, we would have to ask cuDF to resolve the schema for us each time. Then, once that is done, we would throw away the columns we didn't want and cast all of the columns we did find into the schema that Spark requested.

This could work, but it is not an ideal long-term solution, especially because Spark parses a lot of values very differently from how cuDF does. Part of the plan was to ask cuDF to return everything as strings so that we could use our customized code to parse them into values in a way that is much closer to how Spark does it. I don't see how we can ask for everything to be strings without knowing up front which columns we want.

This does not have to be done for 22.02. It would be great if it is done in time, but we have already decided that JSON parsing will be off by default in Spark for 22.02, as just an experimental feature that someone could try.
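To make the "everything as strings" idea concrete, here is a minimal sketch of how the plugin would like to call the reader once pruning exists, written against the cudf Java bindings. The Schema.builder()/Table.readJSON/JSONOptions usage below is an assumption about what the bindings expose, not a confirmed interface, and the file name and column names are made up.

    import ai.rapids.cudf.DType;
    import ai.rapids.cudf.JSONOptions;
    import ai.rapids.cudf.Schema;
    import ai.rapids.cudf.Table;

    import java.io.File;

    public class PrunedJsonReadSketch {
      public static void main(String[] args) {
        // Sketch only: assumes Schema.builder() and
        // Table.readJSON(Schema, JSONOptions, File) behave as shown here.
        // Spark tells us exactly which columns it needs; every one of them is
        // requested as STRING so the plugin can run its own Spark-compatible
        // parsing afterwards. Columns not listed here should not come back.
        Schema requested = Schema.builder()
            .column(DType.STRING, "name")
            .column(DType.STRING, "age")
            .build();

        try (Table table = Table.readJSON(requested, JSONOptions.builder().build(),
                                          new File("people.json"))) {
          System.out.println("columns returned: " + table.getNumberOfColumns());
        }
      }
    }

If the reader pruned to exactly that schema, the "throw away unwanted columns and cast the rest" step described above would not be needed.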
I'm asking about this as a short-term solution, because column pruning would need to be reworked when we add nested type support. Does this feature request include pruning of nested columns (the same as #8848)?
Long term, yes, we would want to be able to prune child columns as well. Unless the change is simple in the short term, I would rather have us concentrate on getting a long-term solution.
I'm currently trying to figure out what the interface for this would look like for the new nested JSON reader. Would it be sufficient to take a nested schema of the columns to be selected, such that any [child] column that is not explicitly selected in that schema would not appear in the nested data being returned?
Yes, that would work for us. We have the full list of what we want to read.
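As a purely illustrative example of that contract (the record, column names, and types below are made up), given an input row

    {"a": {"b": 1, "c": 2}, "d": true}

and a nested selection schema that names only a.b, the reader would return a single struct column a containing just the child b; the unselected child a.c and the unselected top-level column d would not appear in the returned data at all.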
After doing some testing on the 23.02 branch, the nested JSON reader no longer throws when only a subset of the columns is specified. Reading and inferring types for the unspecified columns seems like the desired behavior. If we wanted to drop unspecified columns as a performance improvement, I expect the results would be underwhelming due to all the parsing work we would still have to do. Please let me know if this issue is still needed.
I believe we can close this in favor of #13473.
This is part of the FEA in NVIDIA/spark-rapids#9.
We have a JSON file with lines like the following.
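The records below are illustrative only (the field names and values are made up); the essential shape is a string name field alongside at least one other field.

    {"name": "Michael", "age": 31}
    {"name": "Andy", "age": 42}
    {"name": "Justin", "age": 19}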
When specifying only the name column (string type) to read, the JSON reader throws an exception:
ai.rapids.cudf.CudfException: cuDF failure at: /home/bobwang/work.d/nvspark/cudf/cpp/src/io/json/reader_impl.cu:421: Must specify types for all columns
It looks like the JSON reader requires types to be specified for all columns, or no columns at all so that it infers the schema itself. So we hope the JSON reader can read just the columns that the user specified, instead of requiring every column to be listed.
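For completeness, a sketch of the kind of call that produces the exception above, again written against the cudf Java bindings with the caveat that the exact readJSON/Schema signatures are assumed rather than confirmed, and the file name is made up.

    import ai.rapids.cudf.DType;
    import ai.rapids.cudf.JSONOptions;
    import ai.rapids.cudf.Schema;
    import ai.rapids.cudf.Table;

    import java.io.File;

    public class PartialSchemaRepro {
      public static void main(String[] args) {
        // Sketch only: just the "name" column is specified; everything else
        // is left out. As reported above, this fails with "Must specify types
        // for all columns"; the request is for the reader to prune the
        // unspecified columns instead.
        Schema onlyName = Schema.builder()
            .column(DType.STRING, "name")
            .build();

        try (Table table = Table.readJSON(onlyName, JSONOptions.builder().build(),
                                          new File("people.json"))) {
          System.out.println("columns returned: " + table.getNumberOfColumns());
        }
      }
    }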