Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework on OpenSearchDataType: parse, store and use mapping information #1314

Conversation

Yury-Fridlyand
Copy link
Collaborator

Description

Rework on OpenSearchDataType according to Bit-Quill#169 (comment)
OpenSearchDataType is not a enum anymore, it is a class which stores all required mapping information.

TODOs

Fixes could be done in scope of this PR or in follow-up fixes:

  1. Types of columns added by aggregation are based on ExprCoreType. Type resolution is hardcoded and doesn't consider real column types: https://github.com/Bit-Quill/opensearch-project-sql/blob/e4d89de579e8902a85438d56d0a4645c985f6fa6/core/src/main/java/org/opensearch/sql/data/type/ExprCoreType.java#L86
  2. Get rid of traverseAndFlatten - do recursive traverse only
  3. Support more types in ARRAY https://github.com/Bit-Quill/opensearch-project-sql/blob/16d7ce07d6211fe4978c0b6cf698502a8446df47/opensearch/src/main/java/org/opensearch/sql/opensearch/data/value/OpenSearchExprValueFactory.java#L244

Further future improvements

Features that should be done in scope of other PRs:

  1. text is not completely supported as a separate type yet.
  2. Date formats are ignored (draft in Merge develop into main  #142).
  3. Some types are not supported, e.g. geo_shape, ip_range.
  4. alias is not resolved.
  5. Function Map<String, ExprType> getFieldTypes() in OpenSearchIndex should be replaced by Map<String, OpenSearchDataType> getFieldOpenSearchTypes(). This requires base interface change.
  6. Probably, completely remove ExprCoreType.

Issues Resolved

Assuming that these changes would fix #794, #1038, #1112 and #1113. Relevant for #1111 too.

Notes

  1. Dummy enum value https://github.com/Bit-Quill/opensearch-project-sql/blob/ae11c22176b4d9357931070ccc8523c1b22db1a7/opensearch/src/main/java/org/opensearch/sql/opensearch/data/type/OpenSearchDataType.java#L28
    added to bypass Ignore switch case default cases which can't be covered jacoco/jacoco#1211. See also https://www.javaspecialists.eu/archive/Issue272-Hacking-Enums-Revisited.html

See team review and discussion in Bit-Quill#180.

Check List

  • New functionality includes testing.
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…ion (#180)

Rework on `OpenSearchDataType`:
* Add data types for those classes are not defined in `ExprCoreType`.
* Address #180 (comment)
* Remove `TextKeywordValue`.
* Add changes according to the PR review. #180
* Update `IndexMapping::parseMapping` function.
* Add `OpenSearchDataType::resolve` function.
* Add new constructor for `OpenSearchTextType`.
* Make `fields` and `properties` in `OpenSearchDataType` readonly. Update tests and mapping parser.
* Move `getFields` from `OpenSearchDataType` to `OpenSearchTextType`. Update tests.
* Rewrite `traverseAndFlatten` according to #180 (comment)
* Minor comment fix.
* A fix to avoid breaking changes.
* `typeName` and `legacyTypeName` to return different type names.
* Change `typeof` function and corresponding tests.
* Move `convertTextToKeyword` from `ScriptUtils` to `OpenSearchTextType`. Update tests.
* Update UT for `typeof` function.
* Make all instances of `OpenSearchDataType` and of derived types singletones as much as possible.
* Make string representations of all `ExprType`s uppercase.
* Remove functions from `IndexMapping` used in tests only.

Signed-off-by: Yury-Fridlyand <yuryf@bitquilltech.com>
Signed-off-by: Yury-Fridlyand <yury.fridlyand@improving.com>
…have the common style.

Signed-off-by: Yury-Fridlyand <yury.fridlyand@improving.com>
@Yury-Fridlyand Yury-Fridlyand requested a review from a team as a code owner February 2, 2023 00:05
@Yury-Fridlyand Yury-Fridlyand marked this pull request as draft February 2, 2023 17:49
…mapping-use

Signed-off-by: Yury-Fridlyand <yury.fridlyand@improving.com>
@Yury-Fridlyand Yury-Fridlyand marked this pull request as ready for review February 4, 2023 02:20
@codecov-commenter
Copy link

codecov-commenter commented Feb 4, 2023

Codecov Report

Merging #1314 (be7e9c7) into main (79c7922) will increase coverage by 0.03%.
The diff coverage is 100.00%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@             Coverage Diff              @@
##               main    #1314      +/-   ##
============================================
+ Coverage     98.36%   98.40%   +0.03%     
- Complexity     3640     3733      +93     
============================================
  Files           343      345       +2     
  Lines          9016     9210     +194     
  Branches        585      597      +12     
============================================
+ Hits           8869     9063     +194     
  Misses          142      142              
  Partials          5        5              
Flag Coverage Δ
sql-engine 98.40% <100.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...arch/storage/script/filter/lucene/LuceneQuery.java 100.00% <ø> (ø)
...g/opensearch/sql/data/model/AbstractExprValue.java 100.00% <100.00%> (ø)
...ava/org/opensearch/sql/data/type/ExprCoreType.java 100.00% <100.00%> (ø)
...nsearch/sql/expression/system/SystemFunctions.java 100.00% <100.00%> (ø)
...sql/opensearch/data/type/OpenSearchBinaryType.java 100.00% <100.00%> (ø)
...h/sql/opensearch/data/type/OpenSearchDataType.java 100.00% <100.00%> (ø)
...l/opensearch/data/type/OpenSearchGeoPointType.java 100.00% <100.00%> (ø)
...rch/sql/opensearch/data/type/OpenSearchIpType.java 100.00% <100.00%> (ø)
...h/sql/opensearch/data/type/OpenSearchTextType.java 100.00% <100.00%> (ø)
...ensearch/data/value/OpenSearchExprBinaryValue.java 100.00% <100.00%> (ø)
... and 22 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link
Collaborator

@MaxKsyunz MaxKsyunz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a mistake to rename string to keyword in PPL.

Both SQL and PPL user documentation list string as OpenSearch SQL type for OpenSearch's keyword.

Renaming keyword to string in SQL would make sense, mapping text to text instead of string is also consistent.

if (cachedFieldTypes == null) {
cachedFieldTypes = new OpenSearchDescribeIndexRequest(client, indexName).getFieldTypes();
}
return OpenSearchDataType.traverseAndFlatten(cachedFieldTypes).entrySet().stream()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result of traverseAndFlatten can be cached.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs updated in a6cf875 and caching in d47c624.

@Yury-Fridlyand
Copy link
Collaborator Author

Seems a mistake to rename string to keyword in PPL.

Both SQL and PPL user documentation list string as OpenSearch SQL type for OpenSearch's keyword.

Renaming keyword to string in SQL would make sense, mapping text to text instead of string is also consistent.

Renaming keyword to string in SQL would be a breaking change. All SQL queries respond text for text fields and keyword for everything else, even for select 'abc'.
There is a mistake in docs (probably caused by copy-paste from PPL) which should be updated. Thank you for pointing this out.

@dai-chen dai-chen added maintenance Improves code quality, but not the product enhancement New feature or request labels Feb 8, 2023
@MaxKsyunz MaxKsyunz self-requested a review February 8, 2023 19:27
MaxKsyunz
MaxKsyunz previously approved these changes Feb 8, 2023
Copy link
Collaborator

@MaxKsyunz MaxKsyunz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this issue opensearch-project/observability#1392 need to be updated?

After this PR they will start getting text in some cases instead of string, right?

@Yury-Fridlyand
Copy link
Collaborator Author

Does this issue opensearch-project/observability#1392 need to be updated?

After this PR they will start getting text in some cases instead of string, right?

No, we won't. There are no breaking changes.
I had it in #180, but it was fixed in Bit-Quill@733a262 in scope of PR180.

…t and keyword.

Signed-off-by: Yury-Fridlyand <yury.fridlyand@improving.com>
Signed-off-by: Yury-Fridlyand <yury.fridlyand@improving.com>
acarbonetto
acarbonetto previously approved these changes Feb 10, 2023
MaxKsyunz
MaxKsyunz previously approved these changes Feb 10, 2023
Signed-off-by: Yury-Fridlyand <yury.fridlyand@improving.com>
@Yury-Fridlyand Yury-Fridlyand dismissed stale reviews from MaxKsyunz and acarbonetto via b968514 February 14, 2023 17:56
MaxKsyunz
MaxKsyunz previously approved these changes Feb 14, 2023
Comment on lines +100 to +103
if (cachedFieldOpenSearchTypes == null) {
cachedFieldOpenSearchTypes = new OpenSearchDescribeIndexRequest(client, indexName)
.getFieldTypes();
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicate with getFieldOpenSearchTypes()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost yes. getFieldOpenSearchTypes returns full type info. It is used to create OpenSearchExprValueFactory - a class responsible to proper parsing values received from the cluster.
getFieldTypes used to create TypeEnvironment in visitRelation. TypeEnvironment keeps simplified type info as ExprCoreType. It is used later for function resolution.

Text("text", ExprCoreType.UNKNOWN),
Keyword("keyword", ExprCoreType.STRING),
Ip("ip", ExprCoreType.UNKNOWN),
GeoPoint("geo_point", ExprCoreType.UNKNOWN),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question? does geo_point will in response schema? if yes, does it means?

  1. geo_point type should be supported in core.
  2. jdbc driver should support geo_point

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Query select * from datatypes (a table from IT) response in legacy:

{
  "schema": [
    {
      "name": "boolean_value",
      "type": "boolean"
    },
    {
      "name": "double_range_value",
      "type": "double_range"
    },
    {
      "name": "ip_value",
      "type": "ip"
    },
    {
      "name": "long_range_value",
      "type": "long_range"
    },
    {
      "name": "keyword_value",
      "type": "keyword"
    },
    {
      "name": "date_range_value",
      "type": "date_range"
    },
    {
      "name": "binary_value",
      "type": "binary"
    },
    {
      "name": "date_value",
      "type": "date"
    },
    {
      "name": "text_value",
      "type": "text"
    },
    {
      "name": "nested_value",
      "alias": "",
      "type": "text"
    },
    {
      "name": "object_value",
      "alias": "",
      "type": "text"
    }
  ],
  "total": 1,
  "datarows": [[
    true,
    {
      "lte": 4,
      "gte": 1
    },
    "127.0.0.1",
    {
      "lte": 100500,
      "gte": 0
    },
    "keyword",
    {
      "lte": "2019-05-15",
      "gte": "2019-05-01"
    },
    "U29tZSBiaW5hcnkgYmxvYg==",
    "2020-10-13 13:00:00.000",
    "text",
    [
      {
        "last": "Smith",
        "first": "John"
      },
      {
        "last": "White",
        "first": "Alice"
      }
    ],
    {
      "last": "Dale",
      "first": "Dale"
    }
  ]],
  "size": 1,
  "status": 200
}

From V2 without these changes:

{
  "schema": [
    {
      "name": "date_value",
      "type": "timestamp"
    },
    {
      "name": "boolean_value",
      "type": "boolean"
    },
    {
      "name": "ip_value",
      "type": "ip"
    },
    {
      "name": "text_value",
      "type": "text"
    },
    {
      "name": "nested_value",
      "type": "nested"
    },
    {
      "name": "object_value",
      "type": "object"
    },
    {
      "name": "keyword_value",
      "type": "keyword"
    },
    {
      "name": "binary_value",
      "type": "binary"
    },
    {
      "name": "geo_value",
      "type": "geo_point"
    }
  ],
  "datarows": [
    [
      "2020-10-13 13:00:00",
      true,
      "127.0.0.1",
      "text",
      [
        {
          "last": "Smith",
          "first": "John"
        },
        {
          "last": "White",
          "first": "Alice"
        }
      ],
      {
        "last": "Dale",
        "first": "Dale"
      },
      "keyword",
      "U29tZSBiaW5hcnkgYmxvYg==",
      {
        "lat": 40.71,
        "lon": 74.0
      }
    ]
  ],
  "total": 1,
  "size": 1,
  "status": 200
}

And with these changes:

{
  "schema": [
    {
      "name": "text_value",
      "type": "text"
    },
    {
      "name": "date_value",
      "type": "timestamp"
    },
    {
      "name": "boolean_value",
      "type": "boolean"
    },
    {
      "name": "ip_value",
      "type": "ip"
    },
    {
      "name": "nested_value",
      "type": "nested"
    },
    {
      "name": "object_value",
      "type": "object"
    },
    {
      "name": "keyword_value",
      "type": "keyword"
    },
    {
      "name": "binary_value",
      "type": "binary"
    },
    {
      "name": "geo_value",
      "type": "geo_point"
    }
  ],
  "datarows": [
    [
      "text",
      "2020-10-13 13:00:00",
      true,
      "127.0.0.1",
      [
        {
          "last": "Smith",
          "first": "John"
        },
        {
          "last": "White",
          "first": "Alice"
        }
      ],
      {
        "last": "Dale",
        "first": "Dale"
      },
      "keyword",
      "U29tZSBiaW5hcnkgYmxvYg==",
      {
        "lat": 40.71,
        "lon": 74.0
      }
    ]
  ],
  "total": 1,
  "size": 1,
  "status": 200
}

And JDBC doesn't support few types including geo_point:
image

}

public static OpenSearchBinaryType of() {
return OpenSearchBinaryType.instance;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return instrance?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A singleton

}
res = new OpenSearchDataType(mappingType);
res.exprCoreType = exprCoreType;
instances.put(mappingType.toString(), res);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not init the instances statically?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the idea! Fixed in be7e9c7.

public String legacyTypeName() {
return LEGACY_TYPE_NAME_MAPPING.getOrDefault(this, typeName());
if (mappingType == null) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in which case mappingType is null?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When object created from aggregation, see changes in AggregationQueryBuilder. Aggregation adds a new column which type has no index mapping; the column type should be passed into OpenSearchExprValueFactory as OpenSearchDataType to provide proper data parsing.


/**
* Text field doesn't have doc value (exception thrown even when you call "get")
* Limitation: assume inner field name is always "keyword".
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the issue #1113 not solved, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, not yet. It will be next step and it will be a breaking change.
Current PR is a basement for #1113 and #794 without introducing any changes for the end user.

@dai-chen
Copy link
Collaborator

Just to confirm, will the changes in this PR fix the linked issue #794 ?

@Yury-Fridlyand
Copy link
Collaborator Author

Just to confirm, will the changes in this PR fix the linked issue #794 ?

No, not yet. It will in the upcoming PR.

Signed-off-by: Yury-Fridlyand <yury.fridlyand@improving.com>
@penghuo
Copy link
Collaborator

penghuo commented Mar 20, 2023

my understand is

  1. we define the core type, geo_point / point.
  2. opensearch geo_point has special format / configuration, which map to geo_point in core.
  3. postgresql point map to geo_point.
  4. if type is not supported in core, map to string, data is json formatted.

@Yury-Fridlyand Yury-Fridlyand merged commit d44cd39 into opensearch-project:main Mar 21, 2023
@Yury-Fridlyand Yury-Fridlyand deleted the integ-spike-rework-mapping-use branch March 21, 2023 02:21
opensearch-trigger-bot bot pushed a commit that referenced this pull request Mar 21, 2023
…ion (#1314)

* Rework on `OpenSearchDataType`: parse, store and use mapping information (#180)

Rework on `OpenSearchDataType`:
* Add data types for those classes are not defined in `ExprCoreType`.
* Address Bit-Quill#180 (comment)
* Remove `TextKeywordValue`.
* Add changes according to the PR review. Bit-Quill#180
* Update `IndexMapping::parseMapping` function.
* Add `OpenSearchDataType::resolve` function.
* Add new constructor for `OpenSearchTextType`.
* Make `fields` and `properties` in `OpenSearchDataType` readonly. Update tests and mapping parser.
* Move `getFields` from `OpenSearchDataType` to `OpenSearchTextType`. Update tests.
* Rewrite `traverseAndFlatten` according to Bit-Quill#180 (comment)
* Minor comment fix.
* A fix to avoid breaking changes.
* `typeName` and `legacyTypeName` to return different type names.
* Change `typeof` function and corresponding tests.
* Move `convertTextToKeyword` from `ScriptUtils` to `OpenSearchTextType`. Update tests.
* Update UT for `typeof` function.
* Make all instances of `OpenSearchDataType` and of derived types singletones as much as possible.
* Make string representations of all `ExprType`s uppercase.
* Remove functions from `IndexMapping` used in tests only.

Signed-off-by: Yury-Fridlyand <yuryf@bitquilltech.com>
Signed-off-by: Yury-Fridlyand <yury.fridlyand@improving.com>
(cherry picked from commit d44cd39)
Yury-Fridlyand added a commit that referenced this pull request Mar 21, 2023
…ion (#1314) (#1455)

* Rework on `OpenSearchDataType`: parse, store and use mapping information (#180)

Rework on `OpenSearchDataType`:
* Add data types for those classes are not defined in `ExprCoreType`.
* Address Bit-Quill#180 (comment)
* Remove `TextKeywordValue`.
* Add changes according to the PR review. Bit-Quill#180
* Update `IndexMapping::parseMapping` function.
* Add `OpenSearchDataType::resolve` function.
* Add new constructor for `OpenSearchTextType`.
* Make `fields` and `properties` in `OpenSearchDataType` readonly. Update tests and mapping parser.
* Move `getFields` from `OpenSearchDataType` to `OpenSearchTextType`. Update tests.
* Rewrite `traverseAndFlatten` according to Bit-Quill#180 (comment)
* Minor comment fix.
* A fix to avoid breaking changes.
* `typeName` and `legacyTypeName` to return different type names.
* Change `typeof` function and corresponding tests.
* Move `convertTextToKeyword` from `ScriptUtils` to `OpenSearchTextType`. Update tests.
* Update UT for `typeof` function.
* Make all instances of `OpenSearchDataType` and of derived types singletones as much as possible.
* Make string representations of all `ExprType`s uppercase.
* Remove functions from `IndexMapping` used in tests only.

Signed-off-by: Yury-Fridlyand <yuryf@bitquilltech.com>
Signed-off-by: Yury-Fridlyand <yury.fridlyand@improving.com>
(cherry picked from commit d44cd39)

Co-authored-by: Yury-Fridlyand <yury.fridlyand@improving.com>
@Yury-Fridlyand
Copy link
Collaborator Author

my understand is

  1. we define the core type, geo_point / point.
  2. opensearch geo_point has special format / configuration, which map to geo_point in core.
  3. postgresql point map to geo_point.
  4. if type is not supported in core, map to string, data is json formatted.

Currently, V2 engine follows 4. We have #1432 to track geo_point update.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x enhancement New feature or request maintenance Improves code quality, but not the product
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] SQL query doesn't honor date format in OpenSearch index mapping
6 participants