[RFC] Schema on Reads #1133

imRishN · 2021-08-22T19:17:50Z

Problem Statement

By default, OpenSearch supports ‘schema on write’, i.e. the structure is defined at ingest time so that the data is available for query immediately. However, as use cases for OpenSearch have evolved, a need for greater flexibility has emerged. End users may not know the data structure up front, or may want additional attributes to query on after ingest. This is where ‘schema on read’ is useful. With ‘schema on read’, a field in the query result can be defined at query time. This can also improve ingest rate, because fields that are not going to be queried right away do not have to be indexed.

Requirements

Ability to define fields that are evaluated at query time.
No changes should be made to the underlying schema. This avoids the need to re-index existing data.
These user defined fields should support all operations of a regular field in the query.

Existing Solution

Scripting

Scripting is supported in various constructs of the _search request body. In each of these constructs, the underlying mechanism is the same: the script is evaluated at query time, derives one or more values from the indexed fields, and acts on the derived values.

In query and filter context, the derived value can be used to filter out documents.
In aggregations, results can be aggregated on the derived value.
The derived values can be exposed as custom fields by including them in script_fields.
Results can also be sorted on the derived value.
Using script_score, the derived value can be used to score the filtered documents.

Shortcomings of existing solution

Scripting satisfies most of the requirements listed above, but adding scripts to the request makes it bulky, hard to read and difficult to manage. Although scripts can be stored and referenced in the query, this does not help readability.

The following example illustrates this:

GET index_1/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": """
            return ChronoUnit.YEARS.between(doc['dob'].value, doc['create_time'].value) > 18;
          """
        }
      }
    }
  },
  "aggs": {
    "day-aggregations": {
      "histogram": {
        "interval": 10,
        "script": {
          "source": "ChronoUnit.YEARS.between(doc['dob'].value, doc['create_time'].value);"
        }
      }
    }
  },
  "sort": {
    "_script": {
      "type": "number",
      "script": {
        "source": "ChronoUnit.DAYS.between(doc['dob'].value, doc['create_time'].value);"
      },
      "order": "desc"
    }
  },
  "_source": true,
  "script_fields": {
    "age": {
      "script": "ChronoUnit.YEARS.between(doc['dob'].value, doc['create_time'].value);"
    }
  },
  "size": 10
}
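
For comparison, here is a minimal sketch of the stored-script variant mentioned above. The script ID `age_in_years` is a hypothetical name chosen for illustration. The expression is stored once and referenced by ID, but every construct still needs its own `script` wrapper, which is why storing the script does little for readability.

```json
POST _scripts/age_in_years
{
  "script": {
    "lang": "painless",
    "source": "ChronoUnit.YEARS.between(doc['dob'].value, doc['create_time'].value)"
  }
}

GET index_1/_search
{
  "sort": {
    "_script": {
      "type": "number",
      "script": { "id": "age_in_years" },
      "order": "desc"
    }
  },
  "script_fields": {
    "age": {
      "script": { "id": "age_in_years" }
    }
  },
  "size": 10
}
```

Note that the filter in the request above would need a separate stored script, since a script filter must return a boolean rather than a number.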

Proposed Solution

Regular OpenSearch queries revolve around fields in the schema. With scripting, the query syntax changes substantially.
The proposed solution aims to make schema on read easy to use while keeping all the benefits of scripting.
The proposal is to define fields in the mapping that are evaluated at query time and behave like regular fields.
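
To illustrate the intent, here is a sketch of how the earlier request might read if `age` were exposed as a field evaluated at query time. This RFC text does not specify the mapping syntax, so the definition of `age` is assumed to exist and only the query side is shown. The aggregation name `age-histogram` is illustrative, and the sort is on years here rather than days as in the scripted version.

```json
GET index_1/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": { "age": { "gt": 18 } }
      }
    }
  },
  "aggs": {
    "age-histogram": {
      "histogram": { "field": "age", "interval": 10 }
    }
  },
  "sort": [
    { "age": { "order": "desc" } }
  ],
  "_source": true,
  "fields": ["age"],
  "size": 10
}
```

Every clause uses the ordinary field-based syntax, so the expression that computes `age` no longer has to be repeated in each construct.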

dblock · 2021-11-23T21:08:43Z

@rramachand21 are you working on this?

anasalkouz · 2021-11-24T19:27:54Z

Thanks for putting up this proposal. A downside of this feature is that clients will start taking the easy path and use schema on read even for fields that are queried frequently. We should think about field usage and guardrails to prevent abuse of the feature.
Could you explain more about how those fields can be used? For example, can I use them in aggregations? Are they searchable? Or will they only be used for data retrieval?

lrynek · 2022-01-25T17:05:30Z

@imRishN what about the existing runtime fields feature? It looks almost the same as what you are proposing here:
👉 https://www.elastic.co/guide/en/elasticsearch/reference/current/runtime.html
Could you explain how your solution will differ from runtime fields? (Just asking out of curiosity, to better understand the proposal at hand 😉)

reta · 2022-01-25T19:58:14Z

@lrynek You are quite right, runtime fields serve the same purpose (schema on read), but this is a proprietary Elasticsearch feature / implementation. The goal of this RFC is to provide similar functionality on the OpenSearch side (but obviously it cannot be copied as is).

lrynek · 2022-01-26T08:07:08Z

@imRishN Oh, I didn't know that, thanks for the explanation! 👍 // I assumed that, since OpenSearch is a fork of Elasticsearch 7, it would have all the features available in that version too. Is there any reference listing such discrepancies between the two projects? It would be awesome...😎

dblock · 2022-02-02T16:59:01Z

> @imRishN Oh, I didn't know that, thanks for the explanation! 👍 // I assumed that, since OpenSearch is a fork of Elasticsearch 7, it would have all the features available in that version too. Is there any reference listing such discrepancies between the two projects? It would be awesome...😎

OpenSearch forked at 7.10.2, so anything added in OpenSearch or ES since then is likely different.

lrynek · 2022-02-02T18:07:34Z

@dblock Thanks for explaining! 👍

marekm-gain · 2022-09-20T15:18:12Z

+1 for having this

ryn9 · 2022-10-13T14:20:52Z

@imRishN @rramachand21 @elfisher
Is this advancing? It looks like this keeps slipping?

elfisher · 2022-10-27T16:20:48Z

@imRishN is this being worked on?

grahamplace · 2022-10-27T23:58:14Z

Came from: https://forum.opensearch.org/t/runtime-fields-on-opensearch/9837

I'm bummed that opensearch doesn't support runtime fields — they seemed like the solution I needed for my project (was reading about ES, obviously), so I'm disappointed that I'm left without the feature having chosen OS over ES 😞

rramachand21 · 2022-11-05T06:06:58Z

We will be looking into this and will update this issue with a more accurate target version in which it will be available. As usual, open source contributions are welcome :) If there is interest in contributing to this, please do reach out.

svdasein · 2023-04-05T16:25:35Z

@rramachand21 has this made it to the roadmap yet? Can you comment on status?

khmelevskii · 2023-05-11T23:39:05Z

@rramachand21 do you have a plan to deliver it?

hrbu · 2023-06-07T20:09:18Z

Voting for this feature too. This would massively simplify our task of building an integrated view on distributed data. Currently we manage this with a preprocessing service that resolves references before indexing.

AhmedAbdoOrtiga · 2023-06-14T05:34:53Z

It's crazy that such a feature has been sitting in the backlog since 2021!

dblock · 2023-06-16T16:50:39Z

Please contribute!

khmelevskii · 2023-10-01T18:45:59Z

Please contribute or let us know your plan about this feature.

reta · 2024-01-22T03:27:43Z

@rishabhmaurya thanks for picking it up. I have nothing against runtime_fields, but it looks like Elasticsearch invented them in the first place; maybe we should look for another name for it?

rishabhmaurya · 2024-01-22T18:24:14Z

@reta that's a good point. Should we call them prototype fields? They are meant for prototyping purposes and not for permanent use.

smacrakis · 2024-01-22T19:43:36Z

I don't think we should name them for what we think they are best used for. I'd hope that they would have essentially zero runtime cost and so not be suitable only for prototypes.

There are lots of good names to choose from. I think in DBMSs they are called generated, virtual, computed, calculated, derived, etc.

Or is there some critical difference between the DBMS concept and the OpenSearch concept which needs to be emphasized?

reta · 2024-01-22T20:00:59Z

I like computed_fields or derived_fields, I think they fit the purpose very well, thank you @smacrakis

rishabhmaurya · 2024-01-22T20:09:55Z

+1 for derived_fields. I hope we don't already use that name for any other purpose in OpenSearch today.

khmelevskii · 2024-01-23T02:03:04Z

It would be great to have the same name and interface as ElasticSearch for it

reta · 2024-01-23T02:51:22Z

> It would be great to have the same name and interface as ElasticSearch for it

In theory - yes, in practice - we are asking for problems: Elasticsearch is not OSS

rishabhmaurya · 2024-04-11T18:05:52Z

Good news - we are done with most of the implementation (#12281) and here is some initial documentation (opensearch-project/documentation-website#6943). I encourage folks waiting for it to give it a shot using a snapshot build and see if it meets their needs. Let us know if you have any feedback or suggestions; we are happy to incorporate them, possibly before the next release.
Thanks @qreshi and @msfroh for your contributions.

rishabhmaurya · 2024-06-26T17:09:32Z

This feature is released in 2.15 (https://opensearch.org/docs/latest/field-types/supported-field-types/derived/) with certain limitations, which we will be working on next. Current limitations and future work are tracked as part of #12281.
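
As a rough sketch of what a derived field definition might look like, here is the `age` example from the original RFC expressed under a `derived` section of the mapping. This is based on a reading of the documentation linked above and is not confirmed anywhere in this thread: the `derived` key, the `emit(...)` call, and the `long` type should all be checked against that page. The index and field names are carried over from the RFC example.

```json
PUT index_1
{
  "mappings": {
    "properties": {
      "dob": { "type": "date" },
      "create_time": { "type": "date" }
    },
    "derived": {
      "age": {
        "type": "long",
        "script": {
          "source": "emit(ChronoUnit.YEARS.between(doc['dob'].value, doc['create_time'].value))"
        }
      }
    }
  }
}

GET index_1/_search
{
  "query": {
    "range": { "age": { "gt": 18 } }
  },
  "fields": ["age"]
}
```

Which query types, aggregations, and sorts work on a derived field depends on the limitations tracked in #12281, so the search request above is an assumption rather than a guarantee.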

imRishN added the enhancement Enhancement or improvement to existing feature or request label Aug 22, 2021

imRishN changed the title ~~Schema on Reads~~ [RFC] Schema on Reads Oct 1, 2021

anasalkouz added the Indexing & Search label Nov 17, 2021

Bukhtawar mentioned this issue Jan 21, 2022

[Idea] Introduce new compute role / phase (aka: dynamic fields) #1912

Open

CEHENKLE mentioned this issue Feb 11, 2022

OpenSearch Engine 2022 Themes #2095

Open

reta mentioned this issue Feb 17, 2022

'runtime_mappings' not supported on query #2144

Closed

ryn9 mentioned this issue Oct 13, 2022

TSVB - "group on" over multiple fields / dimensions opensearch-project/OpenSearch-Dashboards#2565

Open

RaulSokolova mentioned this issue Mar 1, 2023

Observations/Indicators filtering by Creator OpenCTI-Platform/opencti#2936

Closed

dagneyb added the v2.13.0 Issues and PRs related to version 2.13.0 label Jan 22, 2024

qreshi mentioned this issue Feb 9, 2024

[META] Derived Fields #12281

Open

6 tasks

getsaurabh02 added v2.14.0 and removed v2.13.0 Issues and PRs related to version 2.13.0 labels Apr 8, 2024

rishabhmaurya mentioned this issue Apr 11, 2024

[DOC] Document the Derived Field feature opensearch-project/documentation-website#6943

Closed

4 tasks

sohami added the RFC Issues requesting major changes label May 14, 2024

github-project-automation bot added this to Test roadmap format May 14, 2024

github-project-automation bot moved this to Planned work items in Test roadmap format May 14, 2024

sohami added the Roadmap:Search Project-wide roadmap label label May 14, 2024

kkhatua added the v2.15.0 Issues and PRs related to version 2.15.0 label May 20, 2024

getsaurabh02 removed the v2.14.0 label May 20, 2024

rishabhmaurya mentioned this issue May 21, 2024

[Derived Fields] Derived fields integration with Dashboards opensearch-project/OpenSearch-Dashboards#6817

Open

getsaurabh02 added this to OpenSearch Roadmap May 31, 2024

github-project-automation bot moved this to Planned work items in OpenSearch Roadmap May 31, 2024

rishabhmaurya closed this as completed Jun 26, 2024

github-project-automation bot moved this from Todo to Done in Performance Roadmap Jun 26, 2024

github-project-automation bot moved this from 🆕 New to ✅ Done in Search Project Board Jun 26, 2024

dblock mentioned this issue Oct 22, 2024

Remove unsupported runtime fields types opensearch-project/opensearch-api-specification#634

Merged

Xtansia mentioned this issue Oct 22, 2024

[FEATURE] Add specifications for Derived Fields opensearch-project/opensearch-api-specification#636

Open
