Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] from_json generated inconsistent result comparing with CPU for input column with nested json strings #8558

Closed
cindyyuanjiang opened this issue Jun 12, 2023 · 11 comments
Labels
bug Something isn't working cudf_dependency An issue or PR with this label depends on a new feature in cudf

Comments

@cindyyuanjiang
Copy link
Collaborator

Describe the bug
from_json generated inconsistent result comparing with CPU for input column with nested json strings.

Steps/Code to reproduce bug
Here is a repro case:

scala> spark.conf.set("spark.rapids.sql.expression.JsonToStructs", "true")

scala> import org.apache.spark.sql.types.{IntegerType,StringType,StructType,StructField}

scala> import org.apache.spark.sql.Row

scala> val data = Seq(Row(1, """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}""", """{"time":"26/08/2015"}""", """{"Zipcode":[706,123,4646],"ZipCodeType":"SPECIAL","City":"San Jose","State":[{"Zipcode":706,"ZipCodeType":"SPECIAL","City":"San Jose","State":"CA"}]}"""), Row(2, """{"Zipcode":706,"ZipCodeType":"SPECIAL","City":"San Jose","State":"CA"}""", """{"time":"01/08/2015"}""", ""), Row(3, "", "", ""), Row(4, null, null, null))

scala> val schema = StructType(Array(StructField("id",IntegerType,true), StructField("json",StringType,true), StructField("time_strF",StringType,true), StructField("nest_jsonF",StringType,true)))

scala> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

scala> val out_schema = StructType(Array(StructField("Zipcode",StringType,true), StructField("ZipCodeType",StringType,true), StructField("City",StringType,true), StructField("State",StringType,true)))

scala> val out_struct = df.withColumn("output", from_json(col("nest_jsonF"), out_schema))

scala> out_struct.select("output").show(false)

For GPU, the output is:

+---------------------------------------------------------------------+
|output                                                               |
+---------------------------------------------------------------------+
|{[706, 123, 4646], SPECIAL, San Jose, [{706, SPECIAL, San Jose, CA}]}|
|null                                                                 |
|null                                                                 |
|null                                                                 |
+---------------------------------------------------------------------+

For CPU, the output is:

+-------------------------------------------------------------------------------------------------------------+
|output                                                                                                       |
+-------------------------------------------------------------------------------------------------------------+
|{[706,123,4646], SPECIAL, San Jose, [{"Zipcode":706,"ZipCodeType":"SPECIAL","City":"San Jose","State":"CA"}]}|
|null                                                                                                         |
|null                                                                                                         |
|null                                                                                                         |
+-------------------------------------------------------------------------------------------------------------+

Expected behavior
Same as above CPU output.

Environment details (please complete the following information)

  • Environment location: Standalone
  • Spark 3.3.1
@cindyyuanjiang cindyyuanjiang added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jun 12, 2023
@cindyyuanjiang cindyyuanjiang self-assigned this Jun 12, 2023
@cindyyuanjiang cindyyuanjiang removed the ? - Needs Triage Need team to review and classify label Jun 12, 2023
@cindyyuanjiang cindyyuanjiang added the ? - Needs Triage Need team to review and classify label Jun 14, 2023
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Jun 20, 2023
@andygrove
Copy link
Contributor

I just picked this issue up and ran the test in the issue description. The behavior has changed since this issue was filed. The query falls back to CPU due to:

cannot run on GPU because expression JsonToStructs from_json(StructField(Zipcode,StringType,true), StructField(ZipCodeType,StringType,true), StructField(City,StringType,true), StructField(State,StringType,true), nest_jsonF#7, Some(UTC)) 
produces an unsupported type StructType(StructField(Zipcode,StringType,true),StructField(ZipCodeType,StringType,true),StructField(City,StringType,true),StructField(State,StringType,true)); 
from_json on GPU only supports MapType<StringType, StringType> input schema

@andygrove
Copy link
Contributor

andygrove commented Oct 3, 2023

Changing the output schema to Map<String,String> fails.

val out_struct = df.withColumn("output", from_json(col("nest_jsonF"), DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType)))

out_struct.select("output").show(false)
Caused by: ai.rapids.cudf.CudfException: CUDF failure at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-184-cuda11/src/main/cpp/src/map_utils.cu:94: Incorrect output size computation.

@andygrove
Copy link
Contributor

The first issue here is that GpuOverrides checks are incorrect and disable the JSON to struct functionality. We should allow StructType here. This change was introduced in 2b2835e

      (a, conf, p, r) => new UnaryExprMeta[JsonToStructs](a, conf, p, r) {
        override def tagExprForGpu(): Unit =
          a.schema match {
            case MapType(_: StringType, _: StringType, _) => ()
            case _ =>
              willNotWorkOnGpu("from_json on GPU only supports MapType<StringType, StringType> " +
                               "input schema")
          }

@andygrove
Copy link
Contributor

Ok, I now realize I have gone full circle on this and now understand why the tests were xfailed and why the feature was disabled.

To support parsing JSON to struct and to support reading some parts of the JSON as string (per the example here), we will need something like rapidsai/cudf#14239

@andygrove
Copy link
Contributor

we ask to read the state column as a string, but cuDF returns List<Struct> and we do not have access to the JSON key names, just the values.

COLUMN 3 - LIST
OFFSETS
0 NULL
1 [0 - 1)
2 NULL
COLUMN 3:DATA - STRUCT
COLUMN 3:DATA:CHILD_0 - INT64
0 706
COLUMN 3:DATA:CHILD_1 - STRING
0 "SPECIAL" 5350454349414c
COLUMN 3:DATA:CHILD_2 - STRING
0 "San Jose" 53616e204a6f7365
COLUMN 3:DATA:CHILD_3 - STRING
0 "CA" 4341

We either need cuDF to return this as unparsed string or we need to implement our own parsing, using cuDF for the JSON tokenizer.

@revans2
Copy link
Collaborator

revans2 commented Oct 4, 2023

The C++ for the JSON parser returns a table_with_metadata. https://github.com/rapidsai/cudf/blob/29556a2514f4d274164a27a80539410da7e132d6/cpp/include/cudf/io/types.hpp#L231

We strip off much of the metadata to try and make the API consistent with the other reader APIs that just return the data in the same layout as the schema we passed in. You could use that, but then what happens if you have mixed data types? Like if one of the rows it happens to be a string and the others are structs? I think the only long term solution is to have separate processing for JSOn after the tokenization similar to what we do for map. The special map processing code already does this. But in speaking with some people in CUDF they are going to investigate if this is something that they want to support themselves or if we just need to write our own parser after the CUDF tokenization. There are enough differences already that I am leaning towards our own custom parsing.

@andygrove
Copy link
Contributor

I started on a prototype for this issue in #10326 and this needs updating now that rapidsai/cudf#14954 has been merged

@revans2
Copy link
Collaborator

revans2 commented Mar 14, 2024

With the most recent changes (including #10575) in we are now getting an exception instead of the wrong data.

With spark.rapids.sql.json.read.mixedTypesAsString.enabled set to true or false we get back

Caused by: java.lang.IllegalStateException: Don't know how to transform ColumnVector{rows=1, type=LIST, nullCount=Optional.empty, offHeap=(ID: 98 7f7af001a3a0)} to StringType for JSON
  at org.apache.spark.sql.rapids.GpuJsonReadCommon$.throwMismatchException(GpuJsonReadCommon.scala:273)
  at org.apache.spark.sql.rapids.GpuJsonReadCommon$.nestedColumnViewMismatchTransform(GpuJsonReadCommon.scala:294)
  at org.apache.spark.sql.rapids.GpuJsonReadCommon$.$anonfun$convertToDesiredType$1(GpuJsonReadCommon.scala:317)
  at com.nvidia.spark.rapids.ColumnCastUtil$.$anonfun$deepTransformView$9(ColumnCastUtil.scala:135)
  at scala.Option.map(Option.scala:230)
  at com.nvidia.spark.rapids.ColumnCastUtil$.$anonfun$deepTransformView$1(ColumnCastUtil.scala:134)

rapidsai/cudf#15278 is the issue that was filed to fix it in CUDF.

@revans2 revans2 added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Mar 14, 2024
@revans2
Copy link
Collaborator

revans2 commented Mar 14, 2024

@andygrove do you still plan on trying to fix this?

@andygrove andygrove removed their assignment Mar 15, 2024
@andygrove
Copy link
Contributor

@andygrove do you still plan on trying to fix this?

I am not actively working on this, so have unassigned myself.

@ttnghia
Copy link
Collaborator

ttnghia commented Sep 24, 2024

Now the issue seems already being fixed somehow:

24/09/24 03:57:25 WARN GpuOverrides: 
*Exec <ProjectExec> will run on GPU
  *Expression <Alias> cast(from_json(StructField(Zipcode,StringType,true), StructField(ZipCodeType,StringType,true), StructField(City,StringType,true), StructField(State,StringType,true), nest_jsonF#7, Some(UTC)) as string) AS output#56 will run on GPU
    *Expression <Cast> cast(from_json(StructField(Zipcode,StringType,true), StructField(ZipCodeType,StringType,true), StructField(City,StringType,true), StructField(State,StringType,true), nest_jsonF#7, Some(UTC)) as string) will run on GPU
      *Expression <JsonToStructs> from_json(StructField(Zipcode,StringType,true), StructField(ZipCodeType,StringType,true), StructField(City,StringType,true), StructField(State,StringType,true), nest_jsonF#7, Some(UTC)) will run on GPU

+-------------------------------------------------------------------------------------------------------------+
|output                                                                                                       |
+-------------------------------------------------------------------------------------------------------------+
|{[706,123,4646], SPECIAL, San Jose, [{"Zipcode":706,"ZipCodeType":"SPECIAL","City":"San Jose","State":"CA"}]}|
|null                                                                                                         |
|null                                                                                                         |
|null                                                                                                         |
+-------------------------------------------------------------------------------------------------------------+

Close as fixed.

@ttnghia ttnghia closed this as completed Sep 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cudf_dependency An issue or PR with this label depends on a new feature in cudf
Projects
None yet
Development

No branches or pull requests

5 participants