Add Spark application #1723

Merged: 11 commits, Jun 23, 2023

rupal-bq (Contributor) commented Jun 8, 2023

Description

  • Writes the result of a SQL query to an OpenSearch index

Schema of final DataFrame

root
 |-- result: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- schema: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- stepId: string (nullable = false)

Example (query: select * from my_table)

+---------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------+---------------+
|result                                                                           |schema                                                                                         |stepId         |
+---------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------+---------------+
|[{'Letter':'A','Number':1}, {'Letter':'B','Number':2}, {'Letter':'C','Number':3}]|[{'column_name':'Letter','data_type':'string'}, {'column_name':'Number','data_type':'integer'}]|s-3FL9QYLC51W6U|
+---------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------+---------------+
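
The string layout in the example above can be sketched in plain Scala, without Spark: each row becomes one single-quoted JSON-like string, and each column becomes one schema string. The names `ResultShapeSketch`, `Column`, `rowToString`, and `schemaToStrings` are hypothetical and do not come from this PR, whose own code works on Spark DataFrames (for example `getJson(df: DataFrame)`). This is only an illustration of the output shape, not the PR's implementation.

```scala
// Hypothetical sketch: mimics the string layout shown in the example table.
object ResultShapeSketch {
  // A column name paired with its data type name, e.g. ("Letter", "string").
  final case class Column(name: String, dataType: String)

  // Strings are wrapped in single quotes; other values use their toString form.
  private def renderValue(value: Any): String = value match {
    case null      => "null"
    case s: String => s"'$s'"
    case other     => other.toString
  }

  // One row becomes one string such as {'Letter':'A','Number':1}.
  def rowToString(columns: Seq[Column], row: Seq[Any]): String =
    columns.zip(row)
      .map { case (column, value) => s"'${column.name}':${renderValue(value)}" }
      .mkString("{", ",", "}")

  // One column becomes one string such as {'column_name':'Letter','data_type':'string'}.
  def schemaToStrings(columns: Seq[Column]): Seq[String] =
    columns.map(c => s"{'column_name':'${c.name}','data_type':'${c.dataType}'}")
}
```

Calling `rowToString` once per row and `schemaToStrings` once per query would give the two string arrays that sit next to `stepId` in the final DataFrame.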

Data written to OpenSearch index

POST /.query_execution_result/_search
{
  "query": {
    "match_all": {}
  }
}

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : ".query_execution_result",
        "_id" : "BGWzsYgBMUoqCqlDGXol",
        "_score" : 1.0,
        "_source" : {
          "result" : [
            "{'Letter':'A','Number':1}",
            "{'Letter':'B','Number':2}",
            "{'Letter':'C','Number':3}"
          ],
          "schema" : [
            "{'column_name':'Letter','data_type':'string'}",
            "{'column_name':'Number','data_type':'integer'}"
          ],
          "stepId" : "s-3FL9QYLC51W6U"
        }
      }
    ]
  }
}

Issues Resolved

#1722

Check List

  • New functionality includes testing.
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

rupal-bq added 3 commits June 8, 2023 06:11
Signed-off-by: Rupal Mahajan <maharup@amazon.com>
Signed-off-by: Rupal Mahajan <maharup@amazon.com>
Signed-off-by: Rupal Mahajan <maharup@amazon.com>
@rupal-bq rupal-bq self-assigned this Jun 8, 2023

codecov bot commented Jun 8, 2023

Codecov Report

Merging #1723 (65e0b1c) into main (c7dfdb3) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##               main    #1723   +/-   ##
=========================================
  Coverage     97.29%   97.29%           
  Complexity     4408     4408           
=========================================
  Files           388      388           
  Lines         10944    10944           
  Branches        774      774           
=========================================
  Hits          10648    10648           
  Misses          289      289           
  Partials          7        7           
Flag Coverage Δ
sql-engine 97.29% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

rupal-bq added 2 commits June 8, 2023 13:38
Signed-off-by: Rupal Mahajan <maharup@amazon.com>
Signed-off-by: Rupal Mahajan <maharup@amazon.com>
Signed-off-by: Rupal Mahajan <maharup@amazon.com>
penghuo (Collaborator) commented Jun 12, 2023

  1. What is the mapping of query_execution_result? Could you also add the result specification to the doc?

spark-sql-application/project/plugins.sbt (outdated, resolved)
}
}

def getJson(df: DataFrame): DataFrame = {
Collaborator

What is the limit on the result size? Is it 100 MB, because OpenSearch limits the maximum size of an HTTP request to 100 MB?

Contributor Author

Yes, I think so, but I haven't tested with a large dataset yet.
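
A rough pre-write check along the lines of this exchange could look like the sketch below. It assumes the 100 MB figure from the question, read as 100 * 1024 * 1024 bytes, and only estimates a lower bound on the payload: it ignores JSON escaping, field names, and any bulk-request overhead. `PayloadSizeSketch` and its members are hypothetical names, not part of this PR, and the limit was not tested here.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical sketch: estimates whether the result and schema arrays fit under an assumed request limit.
object PayloadSizeSketch {
  // Assumption: the 100 MB limit mentioned in the review question.
  val AssumedMaxRequestBytes: Long = 100L * 1024 * 1024

  // UTF-8 size of each string plus 3 bytes for its two quotes and one separating comma.
  def estimatedBytes(values: Seq[String]): Long =
    values.map(v => v.getBytes(StandardCharsets.UTF_8).length.toLong + 3L).sum

  // True when the combined estimate stays below the assumed limit.
  def fitsInOneRequest(result: Seq[String], schema: Seq[String]): Boolean =
    estimatedBytes(result) + estimatedBytes(schema) < AssumedMaxRequestBytes
}
```

Because the estimate is a lower bound, a `true` answer does not guarantee that the real request is accepted; a `false` answer is the more useful signal.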

vmmusings (Member) commented Jun 20, 2023

Can we exclude the field in which we are writing the result from the mapping?
We can keep the data in _source, and the field shouldn't be analyzed or indexed. I am not sure whether we are specifying the mapping of the index we are writing to.
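
One way to read this suggestion is an explicit mapping in which the large fields are kept in `_source` but not indexed. The sketch below holds such a mapping body as a Scala string; it assumes the `index` mapping parameter set to `false` on the `result` and `schema` fields, and a `keyword` type for `stepId`. The object name and the choice of field types are assumptions for illustration; this PR does not define an explicit mapping, and the body has not been tested against the index used here.

```scala
// Hypothetical sketch: a mapping body that keeps result and schema in _source without indexing them.
object ResultIndexMappingSketch {
  // Assumption: "index": false makes the field unsearchable while its value remains in _source.
  val mappingBody: String =
    """{
      |  "mappings": {
      |    "properties": {
      |      "result": { "type": "text", "index": false },
      |      "schema": { "type": "text", "index": false },
      |      "stepId": { "type": "keyword" }
      |    }
      |  }
      |}""".stripMargin
}
```

A body like this would be supplied when the index is created, before the first write, since dynamic mapping otherwise picks the field types on first use.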

val expectedSchema = StructType(Seq(
  StructField(name, ArrayType(StringType, containsNull = true), nullable = true)
))
val expectedRows = Seq(
Collaborator

The expected value here differs from the description, which includes triple-quoted strings:

          "result" : [
            """{"name":"Tina","age":29,"city":"Bellevue"}""",
            """{"name":"Jane","age":25,"city":"London"}""",
            """{"name":"Mike","age":35,"city":"Paris"}"""
          ],

Are triple-quoted strings valid in JSON?

Contributor Author

When I print the DataFrame I don't see triple quotes, which is why I didn't add them in the test. Also, IIRC, triple-quoted strings are not valid in JSON.

I assumed they are added while writing to the OpenSearch index because the strings contain quotes. Will check more on this.

Contributor Author

After replacing \" with ' in all JSON strings, the triple quotes are gone from the index and the JSON is valid. Updated the doc and description accordingly.
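
The quote replacement described in this comment can be sketched as a plain string substitution. The sketch assumes that every double-quote character is swapped for a single quote; the PR's actual replacement may differ (for example, it may target only escaped quotes). `QuoteSwapSketch` and both method names are hypothetical.

```scala
// Hypothetical sketch of the quote replacement described in the comment above.
object QuoteSwapSketch {
  // Replace every double quote with a single quote before the string is written to the index.
  def toSingleQuoted(json: String): String = json.replace("\"", "'")

  // Hypothetical inverse for readers of the index; it cannot tell an apostrophe
  // inside a value from a swapped quote, so the round trip is lossy for such values.
  def toDoubleQuoted(stored: String): String = stored.replace("'", "\"")
}
```

One consequence of this design is that values containing an apostrophe, such as a name like O'Brien, no longer round-trip cleanly, so a consumer that converts the stored strings back to JSON would need to handle that case separately.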

Signed-off-by: Rupal Mahajan <maharup@amazon.com>
Signed-off-by: Rupal Mahajan <maharup@amazon.com>
rupal-bq (Contributor Author) commented Jun 12, 2023

  1. What is the mapping of query_execution_result? Could you also add the result specification to the doc?

Updated the readme. The mapping currently looks like this:

{
  ".query_execution_result" : {
    "mappings" : {
      "properties" : {
        "result" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "schema" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "stepId" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

Signed-off-by: Rupal Mahajan <maharup@amazon.com>
Signed-off-by: Rupal Mahajan <maharup@amazon.com>
Signed-off-by: Rupal Mahajan <maharup@amazon.com>
@rupal-bq rupal-bq merged commit 6c3744e into opensearch-project:main Jun 23, 2023
Labels: enhancement (New feature or request), Flint

4 participants