Support PPL JSON functions: construction and extraction #780

Merged
merged 13 commits into from
Oct 25, 2024

@LantaoJin LantaoJin commented Oct 15, 2024

Description

This PR is the first part of the work on the JSON functions requirement. It adds the JSON construction and extraction functions:

  • json_object()
  • json()
  • json_array()
  • json_array_length()
  • json_extract()
  • json_keys()
  • json_valid()

The second part (not in this PR) will cover the conversion and truncation functions:

  • json_append
  • json_delete
  • json_extend
  • json_set
  • json_array_all_match
  • json_array_any_match
  • json_array_filter
  • json_array_map
  • json_array_reduce

PPL JSON Functions

JSON

Description

json(value) Evaluates whether a value can be parsed as JSON. Returns the json string if valid, null otherwise.

Argument type: STRING/JSON_ARRAY/JSON_OBJECT

Return type: STRING

A STRING expression of a valid JSON object format.

Example:

os> source=people | eval `valid_json` = json('[1,2,3,{"f1":1,"f2":[5,6]},4]') | fields valid_json
fetched rows / total rows = 1/1
+---------------------------------+
| valid_json                      |
+---------------------------------+
| [1,2,3,{"f1":1,"f2":[5,6]},4]   |
+---------------------------------+

os> source=people | eval `invalid_json` = json('{"invalid": "json"') | fields invalid_json
fetched rows / total rows = 1/1
+----------------+
| invalid_json   |
+----------------+
| null           |
+----------------+
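
To make the "string if valid, null otherwise" contract concrete, here is a minimal Python sketch of the behavior described above. It is not the PR's implementation: `ppl_json` is a hypothetical name, it handles only STRING input, and Python's `json` module is slightly more lenient than a strict parser (it accepts `NaN`, for example).

```python
import json


def ppl_json(value):
    """Hypothetical stand-in for json(value): echo the input if it parses, else None."""
    if value is None:
        return None
    try:
        json.loads(value)  # parse only to validate; the original text is returned
    except (TypeError, ValueError):
        return None
    return value


print(ppl_json('[1,2,3,{"f1":1,"f2":[5,6]},4]'))
print(ppl_json('{"invalid": "json"'))
```

The input string is returned unchanged rather than re-serialized, which matches the first example, where the output is identical to the argument.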

JSON_OBJECT

Description

json_object(<key>, <value>[, <key>, <value>]...) returns a JSON object from members of key-value pairs.

Argument type:

  • A <key> must be STRING.
  • A <value> can be of any data type.

Return type: JSON_OBJECT (Spark StructType)

A StructType expression of a valid JSON object.

Example:

os> source=people | eval result = json(json_object('key', 123.45)) | fields result
fetched rows / total rows = 1/1
+------------------+
| result           |
+------------------+
| {"key":123.45}   |
+------------------+

os> source=people | eval result = json(json_object('outer', json_object('inner', 123.45))) | fields result
fetched rows / total rows = 1/1
+------------------------------+
| result                       |
+------------------------------+
| {"outer":{"inner":123.45}}   |
+------------------------------+
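
The two examples above pair a constructor with json() to get a string back. A rough Python analogue, assuming a plain dict stands in for the Spark StructType and a hypothetical `to_json` helper plays the role of the outer json() call:

```python
import json


def json_object(*args):
    """Hypothetical analogue of json_object(): build an object from key, value, key, value..."""
    if len(args) % 2 != 0:
        raise ValueError("json_object expects an even number of arguments")
    obj = {}
    for key, value in zip(args[0::2], args[1::2]):
        if not isinstance(key, str):
            raise TypeError("keys must be strings")
        obj[key] = value
    return obj


def to_json(value):
    """Serialize compactly, without spaces after separators (an assumption about formatting)."""
    return json.dumps(value, separators=(",", ":"))


print(to_json(json_object("key", 123.45)))
print(to_json(json_object("outer", json_object("inner", 123.45))))
```

Nesting works because the inner call returns a structure, not a string, so the outer object embeds it directly instead of quoting it.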

JSON_ARRAY

Description

json_array(<value>...) Creates a JSON ARRAY using a list of values.

Argument type:

  • A <value> can be any kind of value such as string, number, or boolean.

Return type: ARRAY (Spark ArrayType)

An array of any supported data type for a valid JSON array.

Example:

os> source=people | eval `json_array` = json_array(1, 2, 0, -1, 1.1, -0.11)
fetched rows / total rows = 1/1
+----------------------------+
| json_array                 |
+----------------------------+
| 1.0,2.0,0.0,-1.0,1.1,-0.11 |
+----------------------------+

os> source=people | eval `json_array_object` = json(json_object("array", json_array(1, 2, 0, -1, 1.1, -0.11)))
fetched rows / total rows = 1/1
+----------------------------------------+
| json_array_object                      |
+----------------------------------------+
| {"array":[1.0,2.0,0.0,-1.0,1.1,-0.11]} |
+----------------------------------------+
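
In both examples the integer arguments come back as 1.0, 2.0, and so on, which suggests that mixed numeric elements are widened to a common type (the array is a Spark ArrayType, which holds a single element type). The Python sketch below imitates only that numeric widening. The rule is an assumption inferred from the output above, not taken from the PR's code.

```python
import json


def json_array(*values):
    """Hypothetical analogue of json_array(): widen ints to floats when any element is a float."""
    def is_number(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    if values and all(is_number(v) for v in values) and any(isinstance(v, float) for v in values):
        return [float(v) for v in values]
    return list(values)


arr = json_array(1, 2, 0, -1, 1.1, -0.11)
print(json.dumps({"array": arr}, separators=(",", ":")))
```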

JSON_ARRAY_LENGTH

Description

json_array_length(jsonArray) Returns the number of elements in the outermost JSON array.

Argument type: STRING/JSON_ARRAY

A STRING expression of a valid JSON array format, or JSON_ARRAY object.

Return type: INTEGER

NULL is returned if the input is NULL, is invalid JSON, or is a valid JSON string that is not an array.

Example:

os> source=people | eval `length1` = json_array_length('[1,2,3,4]'), `length2` = json_array_length('[1,2,3,{"f1":1,"f2":[5,6]},4]'), `not_array` = json_array_length('{"key": 1}')
fetched rows / total rows = 1/1
+-----------+-----------+-------------+
| length1   | length2   | not_array   |
+-----------+-----------+-------------+
| 4         | 5         | null        |
+-----------+-----------+-------------+

os> source=people | eval `json_array` = json_array_length(json_array(1,2,3,4)), `empty_array` = json_array_length(json_array())
fetched rows / total rows = 1/1
+--------------+---------------+
| json_array   | empty_array   |
+--------------+---------------+
| 4            | 0             |
+--------------+---------------+
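
The NULL rules for json_array_length() are easy to misread, so here is a small Python sketch of the documented behavior. The function is a hypothetical analogue: a Python list stands in for a JSON_ARRAY value, and `None` stands in for NULL.

```python
import json


def json_array_length(value):
    """Count elements of the outermost array; None for NULL, invalid JSON, or non-arrays."""
    if value is None:
        return None
    if isinstance(value, list):  # already an array value, as returned by json_array()
        return len(value)
    try:
        parsed = json.loads(value)
    except (TypeError, ValueError):
        return None
    return len(parsed) if isinstance(parsed, list) else None


print(json_array_length('[1,2,3,{"f1":1,"f2":[5,6]},4]'))
print(json_array_length('{"key": 1}'))
```

Only the outermost array is counted: the nested object and its inner array contribute a single element.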

JSON_EXTRACT

Description

json_extract(jsonStr, path) Extracts the JSON value at the specified JSON path from a JSON string. Returns null if the input JSON string is invalid.

Argument type: STRING, STRING

Return type: STRING

A STRING expression of a valid JSON object format.

NULL is returned in case of an invalid JSON.

Example:

os> source=people | eval `json_extract('{"a":"b"}', '$.a')` = json_extract('{"a":"b"}', '$.a')
fetched rows / total rows = 1/1
+------------------------------------+
| json_extract('{"a":"b"}', '$.a')   |
+------------------------------------+
| b                                  |
+------------------------------------+

os> source=people | eval `json_extract('{"a":[{"b":1},{"b":2}]}', '$.a[1].b')` = json_extract('{"a":[{"b":1},{"b":2}]}', '$.a[1].b')
fetched rows / total rows = 1/1
+-----------------------------------------------------------+
| json_extract('{"a":[{"b":1.0},{"b":2.0}]}', '$.a[1].b')   |
+-----------------------------------------------------------+
| 2.0                                                       |
+-----------------------------------------------------------+

os> source=people | eval `json_extract('{"a":[{"b":1},{"b":2}]}', '$.a[*].b')` = json_extract('{"a":[{"b":1},{"b":2}]}', '$.a[*].b')
fetched rows / total rows = 1/1
+-----------------------------------------------------------+
| json_extract('{"a":[{"b":1.0},{"b":2.0}]}', '$.a[*].b')   |
+-----------------------------------------------------------+
| [1.0,2.0]                                                 |
+-----------------------------------------------------------+

os> source=people | eval `invalid_json` = json_extract('{"invalid": "json"')
fetched rows / total rows = 1/1
+----------------+
| invalid_json   |
+----------------+
| null           |
+----------------+
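
The path forms used in the examples above are `$.key`, `[n]`, and `[*]`. The Python sketch below evaluates just that subset, to show how a path walks the document. It is an illustration only: the helper is hypothetical, it does not cover full JSONPath, and its number formatting follows Python's `json` module (so it prints `2` where the output above shows `2.0`).

```python
import json
import re

# One path step: .name, [index], or [*]
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]|\[(\*)\]")


def json_extract(json_str, path):
    """Evaluate a small JSONPath subset; None on invalid JSON, bad path, or no match."""
    try:
        root = json.loads(json_str)
    except (TypeError, ValueError):
        return None
    if not path.startswith("$"):
        return None
    nodes, pos, used_wildcard = [root], 1, False
    while pos < len(path):
        step = _STEP.match(path, pos)
        if step is None:  # e.g. '$a' is rejected: the dot is required
            return None
        key, index, star = step.groups()
        matched = []
        for node in nodes:
            if key is not None and isinstance(node, dict) and key in node:
                matched.append(node[key])
            elif index is not None and isinstance(node, list) and int(index) < len(node):
                matched.append(node[int(index)])
            elif star is not None and isinstance(node, list):
                matched.extend(node)
        used_wildcard = used_wildcard or star is not None
        nodes, pos = matched, step.end()
    if not nodes:
        return None
    result = nodes if used_wildcard else nodes[0]
    # Strings come back unquoted; everything else is serialized as compact JSON.
    return result if isinstance(result, str) else json.dumps(result, separators=(",", ":"))


doc = '{"a":[{"b":1},{"b":2}]}'
print(json_extract('{"a":"b"}', "$.a"))
print(json_extract(doc, "$.a[1].b"))
print(json_extract(doc, "$.a[*].b"))
```

Array indexes are zero-based, so `$.a[1].b` selects the second element, and `[*]` collects one result per element into an array.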

JSON_KEYS

Description

json_keys(jsonStr) Returns all the keys of the outermost JSON object as an array.

Argument type: STRING

A STRING expression of a valid JSON object format.

Return type: ARRAY[STRING]

NULL is returned if the input is an empty string, is invalid JSON, or is a valid JSON string that is not an object.

Example:

os> source=people | eval `keys` = json_keys('{"f1":"abc","f2":{"f3":"a","f4":"b"}}')
fetched rows / total rows = 1/1
+------------+
| keys       |
+------------+
| [f1, f2]   |
+------------+

os> source=people | eval `keys` = json_keys('[1,2,3,{"f1":1,"f2":[5,6]},4]')
fetched rows / total rows = 1/1
+--------+
| keys   |
+--------+
| null   |
+--------+
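
A short Python sketch of the json_keys() rules described above, using a hypothetical helper of the same name. It returns only the top-level keys and yields `None` for arrays, empty strings, and invalid input.

```python
import json


def json_keys(json_str):
    """Return the outermost object's keys as a list, or None if the input is not an object."""
    try:
        parsed = json.loads(json_str)
    except (TypeError, ValueError):  # covers None, empty string, and malformed JSON
        return None
    return list(parsed) if isinstance(parsed, dict) else None


print(json_keys('{"f1":"abc","f2":{"f3":"a","f4":"b"}}'))
print(json_keys('[1,2,3,{"f1":1,"f2":[5,6]},4]'))
```

The nested keys `f3` and `f4` are not reported, because only the outermost object is inspected.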

JSON_VALID

Description

json_valid(jsonStr) Evaluates whether a JSON string uses valid JSON syntax and returns TRUE or FALSE.

Argument type: STRING

Return type: BOOLEAN

Example:

os> source=people | eval `valid_json` = json_valid('[1,2,3,4]'), `invalid_json` = json_valid('{"invalid": "json"') | fields `valid_json`, `invalid_json`
fetched rows / total rows = 1/1
+--------------+----------------+
| valid_json   | invalid_json   |
+--------------+----------------+
| True         | False          |
+--------------+----------------+

os> source=accounts | where json_valid('[1,2,3,4]') and isnull(email) | fields account_number, email
fetched rows / total rows = 1/1
+------------------+---------+
| account_number   | email   |
+------------------+---------+
| 13               | null    |
+------------------+---------+

Issues Resolved

Partially resolves #667

Check List

  • Updated documentation (ppl-spark-integration/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Lantao Jin <ltjin@amazon.com>
@LantaoJin LantaoJin added Lang:PPL Pipe Processing Language support 0.6 labels Oct 17, 2024
@LantaoJin LantaoJin marked this pull request as ready for review October 18, 2024 11:07
@LantaoJin LantaoJin changed the title Support PPL JSON functions Support PPL JSON functions (part1) Oct 18, 2024
@LantaoJin LantaoJin changed the title Support PPL JSON functions (part1) Support PPL JSON functions: construction and extraction Oct 18, 2024
LantaoJin commented Oct 18, 2024

@YANG-DB @penghuo, I'd like to split this requirement into multiple PRs because the functions differ in complexity.
Another point worth highlighting: in Spark, MySQL, and Snowflake alike, the result of any JSON construction function is a JSON string. There is no native JSON data type in the database; JSON is a string format. For example, json_object() returns a string expression in valid JSON object format, and json_array() returns a string expression in valid JSON array format.

Updated Oct 21: To align with the requirement, json_object() now returns a Spark StructType (equivalent to named_struct()), and json_array() returns a Spark ArrayType (equivalent to array()).

Signed-off-by: Lantao Jin <ltjin@amazon.com>
@LantaoJin

Hi @YANG-DB @dai-chen @penghuo , could u review this PR?

@LantaoJin LantaoJin enabled auto-merge (squash) October 24, 2024 06:29
@seankao-az seankao-az disabled auto-merge October 25, 2024 17:24
@seankao-az seankao-az merged commit ee75048 into opensearch-project:main Oct 25, 2024
4 checks passed
@A-Gray-Cat

Hello @YANG-DB @LantaoJin

Thanks for adding this feature quickly. One quick question: what's the syntax to access values within an array?

kenrickyap pushed a commit to Bit-Quill/opensearch-spark that referenced this pull request Dec 11, 2024
…-project#780)

* first commit

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* add docs and fix IT

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* add examples for json_extract()

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* fix missing import and doc link

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* minor

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* add UT and optimize the doc

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* typo

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* fix the issue when merge conflicts

Signed-off-by: Lantao Jin <ltjin@amazon.com>

---------

Signed-off-by: Lantao Jin <ltjin@amazon.com>
Labels
0.6 Lang:PPL Pipe Processing Language support
Development

Successfully merging this pull request may close these issues.

[FEATURE]Add PPL JSON extended functions support
4 participants