Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] GetJsonObject does not validate the input is JSON in the same way as Spark #10194

Closed
revans2 opened this issue Jan 12, 2024 · 2 comments · Fixed by #10581
Closed

[BUG] GetJsonObject does not validate the input is JSON in the same way as Spark #10194

revans2 opened this issue Jan 12, 2024 · 2 comments · Fixed by #10581
Assignees
Labels
bug Something isn't working cudf_dependency An issue or PR with this label depends on a new feature in cudf

Comments

@revans2
Copy link
Collaborator

revans2 commented Jan 12, 2024

Describe the bug
The current GPU implementation of GetJsonObject does not check if the JSON data is valid. The CPU version uses a JSON parser that allows single quotes and unescaped control characters.

https://github.com/apache/spark/blob/a3266b411723310ec10fc1843ddababc15249db0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L108-L114

If there are any errors when parsing the data then the result is converted to a null.

https://github.com/apache/spark/blob/a3266b411723310ec10fc1843ddababc15249db0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L267

We probably need to do at least some validation that the data is correct.

Steps/Code to reproduce bug

scala> val df = Seq("""{"url":"http://test.com"}""","{'url':'http://test.com'}","{'url':'http://test.com\t'}", """{"url":"http://"http://test.com"","something":1234"}""","""{"url":"http://test.com","something":1234, ""","""{"url":"http://test.com",}""","""{"url":"http://test.com",,}""","""[{"url": "http://test.com"}]""").toDF("jsonstr")
df: org.apache.spark.sql.DataFrame = [jsonstr: string]

scala> df.repartition(1).selectExpr("get_json_object(jsonstr, '$.url') as url").show(false)

+---------------+
|url            |
+---------------+
|http://test.com|
|null           |
|null           |
|http://        |
|http://test.com|
|http://test.com|
|http://test.com|
|null           |
+---------------+


scala> spark.conf.set("spark.rapids.sql.enabled", false)

scala> df.repartition(1).selectExpr("get_json_object(jsonstr, '$.url') as url").show(false)
+-----------------+
|url              |
+-----------------+
|http://test.com  |
|http://test.com  |
|http://test.com\t|
|null             |
|null             |
|null             |
|null             |
|null             |
+-----------------+

Expected behavior
We produce the same results as Spark on the CPU.

@revans2 revans2 added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jan 12, 2024
@revans2
Copy link
Collaborator Author

revans2 commented Jan 12, 2024

Be aware that this is a little odd with how in handles escape sequence validation too. The only characters that I have found that can handle an escape are ",', /; \, b, f, n, r, t all other characters appear to cause the validation to fail.

@sameerz sameerz added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Jan 23, 2024
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Jan 30, 2024
@res-life res-life self-assigned this Feb 6, 2024
@sameerz sameerz changed the title [BUG] GetJsonObject does not validate the input is JSON in the same was as Spark [BUG] GetJsonObject does not validate the input is JSON in the same way as Spark Feb 11, 2024
@res-life
Copy link
Collaborator

PR: NVIDIA/spark-rapids-jni#1868 will fix this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cudf_dependency An issue or PR with this label depends on a new feature in cudf
Projects
None yet
4 participants