[followup] Refactor JSON function and add TO_JSON_STRING, ARRAY_LENGTH functions #870
Conversation
…T functions Signed-off-by: Lantao Jin <ltjin@amazon.com>
fetched rows / total rows = 1/1
+----------------------------------------+
| json_array_object                      |
+----------------------------------------+
| {"array":[1.0,2.0,0.0,-1.0,1.1,-0.11]} |
+----------------------------------------+
### `TO_JSON_STRING`
About the naming: `to_json` is a good name for Spark since Spark has no JSON object data type, so `to_json` in Spark means converting a StructType to a JSON string. But `to_json` might not be a good name for PPL, since it is easily misunderstood as converting something to a JSON object. `to_json_string` is more straightforward.
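The direction the name should convey can be illustrated with plain Python's `json` module (an analogy for readers, not the PPL implementation): `json.dumps` goes from a structured value to a JSON string, which is what `to_json_string` names explicitly, while `json.loads` goes the other way, which is what a bare `to_json` could be misread as.

```python
import json

# A structured value (analogous to Spark's StructType/ArrayType).
obj = {"array": [1.0, 2.0, 0.0]}

# "to JSON string": structure -> string. TO_JSON_STRING names this direction.
s = json.dumps(obj)
assert isinstance(s, str)

# The opposite direction a bare "to_json" could suggest: string -> object.
parsed = json.loads(s)
assert parsed == obj
```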
yes sounds like a better distinction ...
@@ -102,7 +104,9 @@ public interface BuiltinFunctionTransformer {
        .put(COALESCE, "coalesce")
        .put(LENGTH, "length")
        .put(TRIM, "trim")
        .put(ARRAY_LENGTH, "array_size")
We chose `array_length` instead of `array_size` since we already had a function `json_array_length`. It maps to the Spark builtin function `array_size`.
@@ -126,26 +130,12 @@ public interface BuiltinFunctionTransformer {
        .put(
            JSON_ARRAY_LENGTH,
            args -> {
              // Check if the input is an array (from json_array()) or a JSON string
              if (args.get(0) instanceof UnresolvedFunction) {
The current logic in this `if-else` could cause bugs. Both

`| eval b = json_array_length(json_array(1,2,3))`

and

`| eval b = json_array_length("[1,2,3]")`

work fine. But imagine the following two cases:

`| eval a = json_array(1,2,3), b = json_array_length(a)`

`| eval a = "[1,2,3]", b = json_array_length(a)`

The argument of `json_array_length` is `UnresolvedAttribute("a")` in both, so both go into the `else` branch. But `a` in the first query will be resolved to `UnresolvedFunction("array", ...)`, while `a` in the second case will be resolved to a `StringLiteral`.
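The failure mode can be sketched in a few lines of Python (a hypothetical model of the transformer, with stand-in node classes, not the actual plan nodes): dispatching on the *syntactic* type of the unresolved argument cannot distinguish the two queries, because both present the same attribute node before resolution.

```python
# Hypothetical stand-ins for the unresolved plan nodes.
class UnresolvedFunction:
    def __init__(self, name, args):
        self.name, self.args = name, args

class UnresolvedAttribute:
    def __init__(self, name):
        self.name = name

class StringLiteral:
    def __init__(self, value):
        self.value = value

def transform_json_array_length(arg):
    # Mirrors the if/else under discussion: pick the target function
    # by inspecting the syntactic type of the unresolved argument.
    if isinstance(arg, UnresolvedFunction):
        return "array_size"          # treated as an array expression
    return "json_array_length"       # assumed to be a JSON string

# Direct calls are distinguishable:
assert transform_json_array_length(UnresolvedFunction("array", [1, 2, 3])) == "array_size"
assert transform_json_array_length(StringLiteral("[1,2,3]")) == "json_array_length"

# But through an attribute reference, both queries look identical before
# resolution, so the array-valued case is wrongly routed to the string branch:
a = UnresolvedAttribute("a")  # could later resolve to an array OR a string
assert transform_json_array_length(a) == "json_array_length"
```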
@LantaoJin I approved but plz make sure we cover the next issue
@LantaoJin did we cover the next issue?
With #780, we can access the first element in an array by

`| eval first_elem = json_extract('{"items": ["a", "b", "c"]}', '$.items[0]') // json object string`

or

`| eval first_elem = json_extract('["a", "b", "c"]', '$.[0]') // json array string`

But this requires the input to be a JSON string. If the input is a JSON array (ArrayType), I think we need a new function such as `array_get`:

`| eval first_elem = array_get(json_array("a", "b", "c"), 0)`

As a workaround, we could rewrite it as

`| eval first_elem = json_extract(to_json_string(json_array("a", "b", "c")), '$.[0]')`

So keep #675 open until we have `array_get`.
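The workaround's semantics can be mimicked with stdlib Python (an analogy with hypothetical helper names, not the PPL functions themselves): the proposed `array_get` would index the array directly, while the workaround serializes the array to a JSON string first and then extracts by path from the string form.

```python
import json

def array_get(arr, i):
    # Analogy for the proposed array_get: direct index into the array value.
    return arr[i]

def workaround_first_elem(arr):
    # Analogy for json_extract(to_json_string(...), '$.[0]'):
    # json.dumps plays to_json_string (array -> JSON string), and
    # indexing after json.loads plays the '$.[0]' path extraction.
    s = json.dumps(arr)
    return json.loads(s)[0]

arr = ["a", "b", "c"]
assert array_get(arr, 0) == "a"
assert workaround_first_elem(arr) == "a"
```

Both routes yield the same element; the workaround just pays an extra serialize/parse round trip.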
Em, I think I can add `array_get` in this PR too, ASAP.
Let's merge this first. From the description of #675,
`json_extract('{"items": ["a", "b", "c"]}', '$.items[0]')`
seems the right solution.
…T functions (opensearch-project#870) Signed-off-by: Lantao Jin <ltjin@amazon.com>
Description

This is a followup of PR #780.

1. Refactor `JSON`: Currently `JSON` can evaluate STRING/JSON_ARRAY/JSON_OBJECT, which causes much confusion and several bugs. In Splunk or OpenSearch the data can be stored as a JSON object, so the `JSON` function evaluates a JSON object and returns its value, or null if invalid. But Spark has no JSON object type; it has STRING, StructType and ArrayType. So to evaluate a JSON string, the `JSON` function should accept STRING type only.
2. Add `ARRAY_LENGTH`: For the same reason, the original `JSON_ARRAY_LENGTH` accepted both STRING and JSON_ARRAY (ArrayType). We separate `JSON_ARRAY_LENGTH` into `JSON_ARRAY_LENGTH` and `ARRAY_LENGTH`. `JSON_ARRAY_LENGTH` accepts STRING type only; `ARRAY_LENGTH` accepts ArrayType only.
3. Add `TO_JSON_STRING`: After the refactor in (1), we still need a way to convert JSON_ARRAY/JSON_OBJECT to a valid JSON STRING. `TO_JSON_STRING` accepts both StructType and ArrayType as input and returns a JSON formatted string.

Related Issues
Resolves #869
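The split in point (2) above can be modeled with a short stdlib-Python analogy (hypothetical helpers, not the actual PPL functions): after the refactor, one length function accepts only a JSON string, and the other accepts only a materialized array, instead of one function silently accepting both.

```python
import json

def json_array_length(s):
    # Analogy for JSON_ARRAY_LENGTH after the refactor: STRING input only.
    if not isinstance(s, str):
        raise TypeError("json_array_length expects a JSON string")
    v = json.loads(s)
    if not isinstance(v, list):
        raise ValueError("input is not a JSON array")
    return len(v)

def array_length(arr):
    # Analogy for ARRAY_LENGTH: array (ArrayType; here a list) input only.
    if not isinstance(arr, list):
        raise TypeError("array_length expects an array")
    return len(arr)

assert json_array_length("[1,2,3]") == 3
assert array_length([1, 2, 3]) == 3
```

Each function now rejects the other's input type outright, which removes the ambiguous dual-type dispatch the old `JSON_ARRAY_LENGTH` needed.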
Check List
--signoff
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.