[FEA] JSON input support #9
We have had a customer ask about this, so we might bump up the priority on this. I have been looking at the JSON parsing and how it relates to the existing CSV parsing. Just like the CSV parsing, it is not great, but I think we could do with JSON what we want to do with CSV: parse all of the atomic types as Strings and then handle casting/parsing them ourselves. This would make the code a lot more robust in terms of Spark compatibility. But there are still a number of issues that we have to look into and probably address.
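A minimal sketch of the "parse atomic types as Strings" idea, using Python's stdlib `json` hooks purely as an illustration (the plugin itself would do this on the GPU via CUDF, not with this code; the function name here is hypothetical):

```python
import json

# Illustration only: keep numeric literals as their raw text at parse
# time, so Spark-compatible casting rules can be applied afterwards.
def parse_atomics_as_strings(line: str):
    return json.loads(
        line,
        parse_float=str,    # keep floats as their literal text
        parse_int=str,      # keep ints as their literal text
        parse_constant=str, # keep NaN/Infinity literals as text
    )

record = parse_atomics_as_strings('{"a": 1, "b": 2.5, "c": "x"}')
print(record)  # {'a': '1', 'b': '2.5', 'c': 'x'}
```

With everything surfaced as strings, the casting step can then decide exactly how lenient or strict to be, matching Spark's behavior instead of inheriting the parser's.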
Oh, also a single line can contain multiple entries if the top level is an array.
produces
But if there is extra JSON-like stuff at the end of the line, it is ignored.
produces the following with no errors, which feels really odd to me.
There may be a lot of other odd cases that we need to look into.
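The two behaviors described above can be illustrated with Python's stdlib `json` (a hypothetical stand-in for the Spark/CUDF parsers, whose exact behavior is what's being investigated here):

```python
import json

# 1. A single line whose top level is an array holds multiple entries.
line = '[{"a": 1}, {"a": 2}]'
entries = json.loads(line)
print(len(entries))  # 2

# 2. raw_decode() parses the leading JSON value and reports where it
# stopped; everything after that point is silently ignored, much like
# the "extra JSON-like stuff at the end of the line" case above.
decoder = json.JSONDecoder()
value, end = decoder.raw_decode('{"a": 1} trailing garbage')
print(value, end)  # {'a': 1} 8
```

Whether trailing garbage should be an error, a null row, or ignored is exactly the kind of Spark-compatibility question that needs answering per case.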
We might want to look at some of the Spark JSON tests, but they are not that complete.
Oh, that is interesting. The parsing of the JSON keys is case sensitive, but auto-detection of the schema is not entirely, so you can get errors if you let Spark detect the schema and there are keys with different cases, i.e. `A` vs `a`. So we should test whether we can select the keys in a case-sensitive way. Also, what happens if there are multiple keys with the same name in a record? For Spark it looks like the last one wins. Not sure what CUDF does in those cases.
produces
with no errors
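For reference, Python's stdlib `json` shows the same "last duplicate key wins" behavior attributed to Spark above, and treats differently-cased keys as distinct fields (an illustration only; CUDF's behavior is the open question):

```python
import json

# Duplicate key "a": the last occurrence wins.
# "A" and "a" are distinct keys because lookup is case sensitive.
doc = '{"a": 1, "a": 2, "A": 3}'
parsed = json.loads(doc)
print(parsed)  # {'a': 2, 'A': 3}
```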
Added Needs Triage back on so we can look at this again because I think most of the analysis of this is done.
**Is your feature request related to a problem? Please describe.**
## High Priority

### `GetJsonObject` / `get_json_object`

`get_json_object` is being tracked by a separate epic. One of them is in the spark-rapids repo and another is in spark-rapids-jni. The JNI version is trying to write a new parser from scratch to match what Spark is doing.
### `JsonTuple` / `json_tuple`

For now it is implemented in terms of multiple calls to `get_json_object`. This is likely to change in the future, but for now it should be tracked with `get_json_object`.
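A toy sketch of "json_tuple implemented as multiple get_json_object calls", with a minimal `get_json_object` stand-in that handles only top-level `$.field` paths (both functions here are hypothetical illustrations, not the plugin's actual implementation):

```python
import json

def get_json_object(doc: str, path: str):
    """Toy stand-in: supports only simple top-level $.field paths."""
    assert path.startswith("$."), "toy version: only $.field paths"
    try:
        return json.loads(doc).get(path[2:])
    except (ValueError, AttributeError):
        return None  # malformed JSON or non-object top level

def json_tuple(doc: str, *fields: str):
    # One get_json_object call per requested field, as described above.
    return tuple(get_json_object(doc, "$." + f) for f in fields)

print(json_tuple('{"a": 1, "b": "x"}', "a", "b", "c"))  # (1, 'x', None)
```

The downside this pattern hints at is why it "is likely to change": the document is re-parsed once per requested field rather than once total.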
### `JsonToStructs` and `ScanJson`

`JsonToStructs` and `ScanJson` share a common backend, and most bugs in one are reflected in the other, with the exception of a few issues that are specific to how `JsonToStructs` prepares its input so that the CUDF JSON parser can handle it.

- `from_json` generated inconsistent result comparing with CPU for input column with nested json strings #8558

## Medium Priority

### `JsonToStructs` and `ScanJson`

- `America/Los_Angeles` when all we support is UTC. #10488
- `dropFieldIfAllNull` option #4718
- `from_json` #9774
- `JsonToStruct` and `JsonScan` and consolidate some testing and implementation #9750
## Low Priority

### `JsonToStructs` and `ScanJson`

- `lineSep` configuration option
- `multiLine` configuration option
- `primitivesAsString` configuration option
- `prefersDecimal` configuration option
- `from_json` #9664
- `from_json` #9723
- `from_json` (Spark 3.1.x only) #9724

## Testing

### `JsonToStructs` and `ScanJson`