[FEA] Add option to read JSON field as unparsed string #14239

Open
Tracked by #9458 ...
andygrove opened this issue Sep 29, 2023 · 5 comments
Labels: 2 - In Progress, cuIO, feature request, libcudf, Spark

Comments

@andygrove
Contributor

Is your feature request related to a problem? Please describe.

When reading JSON in Spark, if a field has mixed types, Spark will infer the type as String to avoid data loss due to the uncertainty of the actual data type.

For example, given this input file, Spark will read column bar as a numeric type and column foo as a string type.

$ cat test.json
{ "foo": [1,2,3], "bar": 123 }
{ "foo": { "a": 1 }, "bar": 456 }

Here is the Spark code that demonstrates this:

scala> val df = spark.read.json("test.json")
df: org.apache.spark.sql.DataFrame = [bar: bigint, foo: string]                 

scala> df.show
+---+-------+
|bar|    foo|
+---+-------+
|123|[1,2,3]|
|456|{"a":1}|
+---+-------+
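
The fallback described above can be illustrated with a small standard-library Python sketch. It is deliberately simplified and is not Spark's actual inference code: the helper names `kind` and `infer_schema` are hypothetical, and real Spark inference also widens numeric types and merges nested structs, which this sketch does not attempt.

```python
import json

def kind(value):
    """Map a parsed JSON value to a coarse, Spark-like type name."""
    if isinstance(value, bool):   # bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "struct"
    return "null"

def infer_schema(lines):
    """Infer one type per top-level field; conflicting kinds fall back to string."""
    schema = {}
    for line in lines:
        for name, value in json.loads(line).items():
            seen = kind(value)
            if seen == "null":
                schema.setdefault(name, "null")
            elif schema.get(name, "null") == "null":
                schema[name] = seen          # first non-null observation
            elif schema[name] != seen:
                schema[name] = "string"      # mixed kinds: keep the data as text
    return schema

lines = [
    '{ "foo": [1,2,3], "bar": 123 }',
    '{ "foo": { "a": 1 }, "bar": 456 }',
]
print(infer_schema(lines))
```

For the two records in `test.json`, `foo` is seen first as an array and then as a struct, so it falls back to `string`, while `bar` stays `bigint`.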

Currently, Spark RAPIDS fails for this example because cuDF does not support mixed types in a column:

Caused by: ai.rapids.cudf.CudfException: CUDF failure at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-181-cuda11/thirdparty/cudf/cpp/src/io/json/json_column.cu:577: A mix of lists and structs within the same column is not supported
  at ai.rapids.cudf.Table.readJSON(Native Method)

Describe the solution you'd like
I would like an option to specify that certain columns be read as unparsed strings.
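
To make the request concrete, here is a rough CPU-side sketch of what "read a column as an unparsed string" could mean, written with Python's standard `json` module. The function `read_record` and its `raw_columns` parameter are hypothetical names for illustration only and are not part of cuDF or Spark. The sketch returns the original source text of the selected fields, whitespace included, so it does not normalize the text the way Spark does.

```python
import json

_WS = " \t\r\n"
_DECODER = json.JSONDecoder()

def _skip_ws(text, pos):
    """Advance pos past JSON whitespace."""
    while pos < len(text) and text[pos] in _WS:
        pos += 1
    return pos

def read_record(text, raw_columns):
    """Parse one JSON object; fields named in raw_columns keep their source text."""
    pos = _skip_ws(text, 0)
    if text[pos:pos + 1] != "{":
        raise ValueError("expected a JSON object")
    pos = _skip_ws(text, pos + 1)
    record = {}
    if text[pos:pos + 1] == "}":
        return record
    while True:
        key, pos = _DECODER.raw_decode(text, pos)      # field name
        if not isinstance(key, str):
            raise ValueError("field names must be strings")
        pos = _skip_ws(text, pos)
        if text[pos:pos + 1] != ":":
            raise ValueError("expected ':' after field name")
        pos = _skip_ws(text, pos + 1)
        value, end = _DECODER.raw_decode(text, pos)    # field value and where it ends
        record[key] = text[pos:end] if key in raw_columns else value
        pos = _skip_ws(text, end)
        if text[pos:pos + 1] == "}":
            return record
        if text[pos:pos + 1] != ",":
            raise ValueError("expected ',' or '}'")
        pos = _skip_ws(text, pos + 1)

print(read_record('{ "foo": { "a": 1 }, "bar": 456 }', {"foo"}))
```

`raw_decode` reports where each value ends, so the unparsed text is simply the slice between the start and end offsets of the value.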

Describe alternatives you've considered
I am also exploring some workarounds in the Spark RAPIDS plugin.

Additional context

andygrove added the feature request, Needs Triage, and Spark labels Sep 29, 2023
@revans2
Contributor

revans2 commented Oct 2, 2023

We have some code that @ttnghia wrote. It converts a range of tokens to a normalized string that matches what Spark expects. We did this for some Spark-specific JSON parsing functionality that returns a Map instead of a Struct.

https://github.com/NVIDIA/spark-rapids-jni/blob/54ef9991f46fa873d580315212aeae345da7152a/src/main/cpp/src/map_utils.cu#L63-L112

I am not sure if this is really something that cuDF wants, but it is at least a starting point.

@andygrove
Contributor Author

Here are some examples, showing input and expected output.

# Example 1: Mixed primitive types in struct

INPUT:

{ "a": "123" }
{ "a": 123 }

EXPECTED:

+-----------+
|    my_json|
+-----------+
|{"a":"123"}|
|{"a":"123"}|
+-----------+

# Example 2: Mixed structs and lists in struct

INPUT:

{ "a": [1,2,3] }
{ "a": { "b": 1 } }

EXPECTED:

+-----------------+
|          my_json|
+-----------------+
|  {"a":"[1,2,3]"}|
|{"a":"{\"b\":1}"}|
+-----------------+

# Example 3: Mixed structs and primitives in struct

INPUT:

{ "a": "fox" }
{ "a": { "b": 1 } }

EXPECTED:

+-----------------+
|my_json          |
+-----------------+
|{"a":"fox"}      |
|{"a":"{\"b\":1}"}|
+-----------------+

# Example 4: Mixed lists and primitives in struct

INPUT:

{ "a": [1,2,3] }
{ "a": "fox" }

EXPECTED:

+---------------+
|my_json        |
+---------------+
|{"a":"[1,2,3]"}|
|{"a":"fox"}    |
+---------------+
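
The four examples above follow one rule: a value that is already a string passes through unchanged, and any other value is replaced by its compact JSON text. The following Python sketch reproduces the expected outputs under that reading. The helper names `to_compact` and `mixed_as_string` are hypothetical, and the sketch assumes the caller already knows which column is mixed; it is not the cuDF or Spark implementation.

```python
import json

def to_compact(value):
    """Serialize a value as JSON without any optional whitespace."""
    return json.dumps(value, separators=(",", ":"))

def mixed_as_string(lines, column):
    """Rewrite `column` in each record as a string and return compact JSON rows."""
    rows = []
    for line in lines:
        record = json.loads(line)
        value = record[column]
        # Strings are kept as-is; numbers, lists and structs become their JSON text.
        record[column] = value if isinstance(value, str) else to_compact(value)
        rows.append(to_compact(record))
    return rows

for row in mixed_as_string(['{ "a": [1,2,3] }', '{ "a": { "b": 1 } }'], "a"):
    print(row)
```

Serializing the record a second time is what produces the escaped quotes seen in `{"a":"{\"b\":1}"}`.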

@andygrove
Contributor Author

There is a separate use case for arrays where the array element type differs between records. Spark infers the type as Array<String> in this case.

This is not necessarily a high priority and could be split out into a separate issue, but I'd like to point it out here for visibility.

# Example: Mixed primitive arrays in struct

INPUT:

{ "a": [1,2,3] }
{ "a": [true,false,true] }
{ "a": ["a", "b", "c"] }

EXPECTED:

+-----------------------------+
|my_json                      |
+-----------------------------+
|{"a":["1","2","3"]}          |
|{"a":["true","false","true"]}|
|{"a":["a","b","c"]}          |
+-----------------------------+
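
For the array case the conversion applies per element rather than to the whole field. A minimal Python sketch of that behavior, with hypothetical helper names and assuming the column always holds a list, might look like this:

```python
import json

def element_as_string(value):
    """Keep strings unchanged; render any other element as compact JSON text."""
    if isinstance(value, str):
        return value
    return json.dumps(value, separators=(",", ":"))

def array_elements_as_strings(lines, column):
    """Convert every element of the list in `column` to a string, row by row."""
    rows = []
    for line in lines:
        record = json.loads(line)
        record[column] = [element_as_string(v) for v in record[column]]
        rows.append(json.dumps(record, separators=(",", ":")))
    return rows

lines = [
    '{ "a": [1,2,3] }',
    '{ "a": [true,false,true] }',
    '{ "a": ["a", "b", "c"] }',
]
for row in array_elements_as_strings(lines, "a"):
    print(row)
```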

rapids-bot bot pushed a commit that referenced this issue Jan 22, 2024
Addresses #14239

This PR adds an option to read mixed types as string columns.
It also adds related functional changes to nested JSON reader (libcudf, cuDF-python, Java).

Details:
- Added a new boolean option, `mixed_types_as_string`, to `json_reader_options`.
- This feature requires two things: finding the end of struct/list nodes, and parsing struct/list types as strings.
- For Struct and List nodes, `node_range_end` was previously set to `node_range_begin+1` because it was not used anywhere. It is now calculated properly: only the struct and list tokens are copied, and because each end token is a child of its begin token, scattering the end token's index to the node of its parent token yields the `node_range_end` of the List or Struct node.
- In `reduce_to_column_tree()` (which infers the schema), the list and struct `node_range_end` values are changed to node_begin+1 so that entire list/struct strings are not copied to the host for column names.
- `reinitialize_as_string` reinitializes an already initialized column as a string column.
- Mixed-type columns are parsed as strings because their column category is changed to `NC_STR`.
- Added tests.
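
The range-end idea in the list above can be sketched sequentially in Python. This is only an illustration of the technique, not the libcudf code: the real reader works on GPU token streams with parallel primitives, and the function name `struct_list_ranges` is hypothetical. The sketch assumes well-formed input with balanced brackets.

```python
def struct_list_ranges(text):
    """Return (begin, end) character ranges for every struct/list in text."""
    # Pass 1: emit begin/end tokens, ignoring brackets that occur inside strings.
    tokens = []                      # (kind, offset), kind is "begin" or "end"
    in_string = False
    escaped = False
    for offset, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            tokens.append(("begin", offset))
        elif ch in "}]":
            tokens.append(("end", offset))

    # Pass 2: each end token's parent is the begin token it closes.
    parent = [-1] * len(tokens)
    stack = []
    for i, (kind, _) in enumerate(tokens):
        if kind == "begin":
            parent[i] = stack[-1] if stack else -1
            stack.append(i)
        else:
            parent[i] = stack.pop()

    # Pass 3: scatter every end token's offset into its parent's range end.
    range_end = [None] * len(tokens)
    for i, (kind, offset) in enumerate(tokens):
        if kind == "end":
            range_end[parent[i]] = offset + 1    # exclusive end
    return [(off, range_end[i])
            for i, (kind, off) in enumerate(tokens) if kind == "begin"]

text = '{ "a": { "b": [1,2] }, "c": "}" }'
for begin, end in struct_list_ranges(text):
    print(text[begin:end])
```

Once every struct or list node knows where it ends, slicing the input between its begin and end offsets gives the text that a string column would hold.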

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Andy Grove (https://github.com/andygrove)

Approvers:
  - Andy Grove (https://github.com/andygrove)
  - Jason Lowe (https://github.com/jlowe)
  - Elias Stehle (https://github.com/elstehle)
  - Bradley Dice (https://github.com/bdice)
  - Shruti Shivakumar (https://github.com/shrshi)

URL: #14572
@GregoryKimball
Contributor

We made significant progress on this issue with #14572, and I believe we will be able to close it after #14936. @andygrove would you please let us know if there are other cases to consider?

@andygrove
Contributor Author

andygrove commented Feb 22, 2024

For all the examples in #14239 (comment), I see the correct results with #14936.

For the mixed array example in #14239 (comment) I still do not see the correct results, so I filed a separate issue for this one (#15120).
