[FEA] Add option to read JSON field as unparsed string #14239
Comments
We have some code that @ttnghia wrote. It will convert a range of tokens to a normalized string that matches what Spark wants. We did this for some Spark specific functionality with JSON parsing related to returning a Map instead of a Struct. I am not sure if this is really something that CUDF wants, but it is at least a starting point.
Here are some examples, showing input and expected output.
There is a separate use case for arrays where the array element type differs between records. Spark infers the type as This is not necessarily a high priority and could be split out into a separate issue, but I'd like to point it out here for visibility.
Addresses #14239

This PR adds an option to read mixed types as string columns. It also adds related functional changes to the nested JSON reader (libcudf, cuDF-python, Java).

Details:
- Added new option `mixed_types_as_string` bool in json_reader_options.
- This feature requires 2 things: finding the end of struct/list nodes, and parsing struct/list types as strings.
- For Struct and List, node_range_end was node_range_begin+1 earlier (since it was not used anywhere). Now it is calculated properly by copying only struct and list tokens and calculating their node_range_end. (Since the end token is a child of the begin token, scattering the end token's index to the parent token's corresponding node's node_range_end gives the node_range_end of List and Struct nodes.)
- In `reduce_to_column_tree()` (which infers the schema), the list and struct node_range_end are changed to node_begin+1 so that it does not copy entire list/struct strings to host for column names.
- `reinitialize_as_string` reinitializes an initialized column as string.
- Mixed type columns are parsed as strings since their column category is changed to `NC_STR`.
- Added tests.

Authors:
- Karthikeyan (https://github.com/karthikeyann)
- Andy Grove (https://github.com/andygrove)

Approvers:
- Andy Grove (https://github.com/andygrove)
- Jason Lowe (https://github.com/jlowe)
- Elias Stehle (https://github.com/elstehle)
- Bradley Dice (https://github.com/bdice)
- Shruti Shivakumar (https://github.com/shrshi)

URL: #14572
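To make the node-range step in the PR details more concrete, here is a minimal CPU-only sketch in Python. It is not the libcudf implementation, which works on GPU token streams. The function name `container_ranges` is hypothetical. The sketch only illustrates the idea that each struct/list node starts with a placeholder end of begin+1 and gets its real end written when the matching closing token is seen.

```python
def container_ranges(text):
    """Return (begin, end) offsets of every struct/list in `text`.

    Hypothetical helper for illustration only. Ranges are half-open
    and listed in the order of their opening tokens.
    """
    ranges = []      # one [begin, end] entry per struct/list node
    open_nodes = []  # stack of indices into `ranges` for unclosed nodes
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            # Brackets inside string literals must not count as tokens.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            open_nodes.append(len(ranges))
            ranges.append([i, i + 1])  # placeholder end: begin + 1
        elif ch in "}]":
            if not open_nodes:
                raise ValueError("unbalanced closing token at offset %d" % i)
            # Write the end token's position into the node of its begin token.
            ranges[open_nodes.pop()][1] = i + 1
    if open_nodes:
        raise ValueError("unclosed struct or list")
    return [tuple(r) for r in ranges]
```

With the ranges known, the unparsed text of any struct or list is just the slice `text[begin:end]`, which is what a string column would need to hold for a mixed-type field.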
We made significant progress on this issue with #14572, and I believe we will be able to close it after #14936. @andygrove would you please let us know if there are other cases to consider?
For all the examples in #14239 (comment), I see the correct results with #14936. For the mixed array example in #14239 (comment) I still do not see the correct results, so I filed a separate issue for this one (#15120).
Is your feature request related to a problem? Please describe.
When reading JSON in Spark, if a field has mixed types, Spark will infer the type as String to avoid data loss due to the uncertainty of the actual data type.
For example, given this input file, Spark will read column `bar` as a numeric type and column `foo` as a string type.
Here is the Spark code that demonstrates this:
Currently, Spark RAPIDS fails for this example because cuDF does not support mixed types in a column:
Describe the solution you'd like
I would like the ability to specify to read certain columns as unparsed strings.
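As a rough illustration of the requested behavior, the following CPU-only Python sketch reads one JSON record and returns the original, unparsed text for selected fields while parsing the rest normally. The names `read_record` and `unparsed_columns` are hypothetical and are not cuDF or Spark APIs. The sample records are made up for this sketch, and it assumes each record is a single top-level JSON object.

```python
import json

_decoder = json.JSONDecoder()


def _skip_ws(text, pos):
    """Advance `pos` past JSON whitespace."""
    while pos < len(text) and text[pos] in " \t\r\n":
        pos += 1
    return pos


def read_record(line, unparsed_columns=frozenset()):
    """Parse one JSON object, keeping raw text for `unparsed_columns`.

    Hypothetical helper for illustration only.
    """
    pos = _skip_ws(line, 0)
    if line[pos:pos + 1] != "{":
        raise ValueError("expected a JSON object")
    pos = _skip_ws(line, pos + 1)
    record = {}
    if line[pos:pos + 1] == "}":
        return record
    while True:
        # raw_decode returns the parsed value and the offset where it ended.
        key, pos = _decoder.raw_decode(line, pos)
        if not isinstance(key, str):
            raise ValueError("object keys must be strings")
        pos = _skip_ws(line, pos)
        if line[pos:pos + 1] != ":":
            raise ValueError("expected ':' after key")
        start = _skip_ws(line, pos + 1)
        value, end = _decoder.raw_decode(line, start)
        # Keep the exact source slice for columns requested as unparsed.
        record[key] = line[start:end] if key in unparsed_columns else value
        pos = _skip_ws(line, end)
        if line[pos:pos + 1] == ",":
            pos = _skip_ws(line, pos + 1)
        elif line[pos:pos + 1] == "}":
            return record
        else:
            raise ValueError("expected ',' or '}'")


lines = ['{"foo": [1, 2, 3], "bar": 123}', '{"foo": {"a": 1}, "bar": 456}']
rows = [read_record(line, {"foo"}) for line in lines]
```

Here `foo` holds a list in one record and an object in the other, so it is returned as the source text of each value, while `bar` is still parsed as a number. Note that this sketch returns the text exactly as written in the input. It does not normalize it the way Spark does.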
Describe alternatives you've considered
I am also exploring some workarounds in the Spark RAPIDS plugin.
Additional context