Add flattened field type #1018

dblock · 2021-07-28T13:41:42Z

(updated from #1018 (comment) below, @macrakis)

[Design Proposal] The flat data type in OpenSearch

Summary

We propose a new field type in OpenSearch for JSON objects whose components are not indexed; we call them flat objects, and their mapping type is flattened. Subfields within the JSON are accessible using standard dot path notation in DSL, SQL, and Painless, but are not indexed for fast lookup.

Flat subfields support exact match queries and textual sorting.

Flat fields and subfields do not support type-specific parsing, numerical operations such as numerical comparison and sorting, text analyzers, or highlighting.

Motivation

Flat fields are useful when a field and its subfields will mostly be read, and not be used as search criteria.
Flat fields do not create a large number of fields, one per unique key. The “mapping explosion” caused by mapping many individual fields can lead to heavy RAM use and thrashing. (Memory efficiency)
Flat fields do not have inverted indexes, which take space. (Space efficiency)
Migration from systems supporting similar flat types is easy. (Compatibility)

Demand

OpenDistro: Provide alternative to datatype flattened #523 (Dec 21, 2020)

Prevent mapping explosion.

OpenDistro: Migration via Snapshot restore. Mapping roadblocks (Dec 2021)

Migrating corpus with existing flattened mappings.

Specification

Mapping and ingestion

flattened is a new mapping type
Fields declared flattened are ingested as structured, nested objects.
Neither the field as a whole nor its subfields are indexed.
The nested fields are uninterpreted keywords, and sort like strings. Nested fields cannot be used as dates or numbers.
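
As a small illustration of what "sort like strings" means in practice, the Python sketch below compares lexicographic and numeric ordering of the same values. The sample values are made up for illustration, and nothing here calls OpenSearch.

```python
# Hypothetical subfield values, held as uninterpreted keywords (plain strings).
prices = ["9", "10", "100", "25"]

# Keyword-style ordering compares character by character.
as_strings = sorted(prices)

# Numeric ordering requires parsing, which the proposal says a flattened field does not do.
as_numbers = sorted(prices, key=int)

print(as_strings)  # ['10', '100', '25', '9']
print(as_numbers)  # ['9', '10', '25', '100']
```

Anyone who needs the second ordering would have to map that subfield as a regular numeric field instead.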

Searching and retrieving

Supports fetching subfields with the usual dotted notation.
Supports aggregations of subfields with the usual dotted notation.
Filtering by subfields is supported, but may be inefficient (full scan).

Example

This declares catalog as being of type flattened:

{ "mappings" :
  { "properties" :
     { "ISBN13"  : { "type" : "keyword" },
       "catalog" : { "type" : "flattened" } }
}}

Consider the ingestion of the following document:

{ "ISBN13" : "V9781933988177",
  "catalog" :
     { "title" : "Lucene in Action",
       "author1" :
          { "surname" : "McCandless",
            "given"   : "Mike" }
}}

Upon ingestion, this will create two indexes: one for ISBN13, and one for catalog. The surname field, for example, may be accessed as catalog.author1.surname. But the title, author1, surname, and given fields (and the words within them) will not be indexed.
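
To make the dotted access concrete, here is a minimal Python sketch that resolves a path such as catalog.author1.surname against the example document. The helper name get_by_path is hypothetical and only models the lookup; it says nothing about how OpenSearch would store or resolve the field internally.

```python
from typing import Any

# The example document from above, as a Python dict.
doc = {
    "ISBN13": "V9781933988177",
    "catalog": {
        "title": "Lucene in Action",
        "author1": {"surname": "McCandless", "given": "Mike"},
    },
}


def get_by_path(source: dict, path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if a step is missing."""
    current: Any = source
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


print(get_by_path(doc, "catalog.author1.surname"))  # McCandless
print(get_by_path(doc, "catalog.author2.surname"))  # None
```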

Performance

Performance should be similar to a keyword field.
Fetching the value of a nested field using dot paths in a given document should be efficient.
Finding a document with a specific value of a nested field (e.g., given = ‘Mike’) is not efficient: it may require a full scan of the index, an expensive operation.

Limitations

Flattened fields and subfields do not support type-specific operations, including parsing, numerical operations (including numerical ranges), text analyzers, or highlighting.

Possible implementation

These are some possible implementations. The implementation may use one of these or something else that satisfies the requirements.

Flattened fields could be stored as Lucene blobs whose internal structure is defined by OpenSearch.
The internal structure should be such that dot path references (such as .catalog.author1.given) are cheap. Perhaps binary JSON.

Security

Flattened fields are treated as atomic for the purposes of security. Individual subfields cannot have different security properties than the field as a whole.

Possible enhancements

The current specification is minimal. It intentionally does not include many options offered by other vendors.

Depending on the user feedback we receive after the initial release, various enhancements are possible:

Querying the field as a whole searches all leaf values. For example, catalog=Mike would match the document above.
Fine-tune efficiency with various options controlling query interpretation, etc.
Provide a concatenated index. In that index, the entry for the given field above would be something like “catalog|author1|given=Mike”. This would provide efficient searching by field (assuming that indexes support prefix compression).
Allow specifying that certain subfields should be indexed separately. (This could also be provided in Logstash.)
Support wildcards in field names.
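
The concatenated index idea above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the "|" and "=" separators follow the example entry in the list, while the function names and the sorted-list-plus-bisect lookup are hypothetical stand-ins for whatever a real term index would do.

```python
from bisect import bisect_left


def concatenated_entries(obj: dict, prefix: str):
    """Yield 'path|to|leaf=value' strings for every leaf under obj."""
    for key, value in obj.items():
        path = f"{prefix}|{key}"
        if isinstance(value, dict):
            yield from concatenated_entries(value, path)
        else:
            yield f"{path}={value}"


def entries_with_prefix(sorted_index: list, prefix: str) -> list:
    """Return all entries starting with prefix, using binary search to find the start."""
    i = bisect_left(sorted_index, prefix)
    found = []
    while i < len(sorted_index) and sorted_index[i].startswith(prefix):
        found.append(sorted_index[i])
        i += 1
    return found


catalog = {
    "title": "Lucene in Action",
    "author1": {"surname": "McCandless", "given": "Mike"},
}

# A sorted list stands in for the term dictionary of a single indexed field.
index = sorted(concatenated_entries(catalog, "catalog"))

print(entries_with_prefix(index, "catalog|author1|given=Mike"))
print(entries_with_prefix(index, "catalog|author1|"))
```

Because entries that share a path are adjacent in sorted order, an exact lookup and a "everything under this path" lookup both reduce to a prefix scan.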

mrkamel · 2021-10-07T18:26:57Z

I'd like to underline the need for an official, feature-complete, and performant alternative to the flattened datatype. It has also been requested a lot on several sites already: opendistro-for-elasticsearch/opendistro-build#523, https://discuss.opendistrocommunity.dev/t/flattened-type-with-opendistro/5014/4, https://forums.aws.amazon.com/thread.jspa?threadID=32970, ...

elfisher · 2021-11-18T20:55:26Z

@dblock and @nknize is this in Lucene already? I couldn't tell from the Lucene docs. If it isn't should it be contributed there and then pulled into OpenSearch? Seems like it could add value in Lucene too.

chipzzz · 2021-12-15T21:00:02Z

@aparo this link is no longer there: https://github.com/aparo/opensearch-flattened-mapper-plugin :(

abhishek-v · 2021-12-17T06:13:38Z

Is there any plan to support this functionality in the near future?

dblock · 2021-12-18T15:54:06Z

> Is there any plan to support this functionality in the near future?

I don't think anyone is working on it, cc: @anasalkouz?

reta · 2022-01-14T20:44:37Z

I was recently involved in a discussion on the subject [1]. It would be really beneficial to have the flattened type, but I believe [2] was the subject of IP / copyright claims (@aparo, it would be great to hear the reasons, many people are asking, thank you 🙏). To keep it short: we could probably add something similar to the flattened type but with a different name (and obviously a different implementation), but migrating existing Elasticsearch indices using snapshot / restore would be problematic (unless we internally supported type aliases etc.) since the type won't match.

[1] https://discuss.opendistrocommunity.dev/t/migration-via-snapshot-restore-mapping-roadblocks/7906/7
[2] https://github.com/aparo/opensearch-flattened-mapper-plugin

andreaAlkalay · 2022-02-01T08:19:11Z

Hi,
Do you know when we can expect the new flattened type to be implemented?
It is crucial for our business scenario.
Thanks,
Andrea.

dblock · 2022-02-02T18:54:57Z

@andreaAlkalay @abhishek-v it's not on any roadmap today, but we would gladly accept a PR

tristandostaler · 2022-02-24T19:25:47Z

+1!

macrakis · 2022-03-14T16:58:25Z

@andreaAlkalay Can you tell us more about your business scenario and why the flattened type is important to it? Thanks.

amalgamm · 2022-04-04T11:39:15Z

> @andreaAlkalay Can you tell us more about your business scenario and why the flattened type is important to it? Thanks.

Let me jump in :)
We have a case where a Kubernetes pod has built-in labels like app=foo, and we also have a bunch of services which use labels like app.kubernetes.io/managed-by: Helm. In the first case, the field app is just a string. In the other case, it is a nested object. When the log shipper sends such an entry to OpenSearch, it throws back a mapper_parsing_exception and drops the documents.

The flattened type solves such problems for fields that you don't want to use as nested.
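
A toy Python model of the conflict described above may help. It assumes that dotted label keys are read as object paths, so app.kubernetes.io/managed-by makes app an object. ToyMapping and shape_of are hypothetical names, and the ValueError merely stands in for the mapper_parsing_exception; this is not how OpenSearch implements mapping.

```python
def shape_of(labels: dict) -> dict:
    """Map each field path to 'object' or 'value', reading dots as object paths."""
    shapes = {}
    for dotted_key in labels:
        parts = dotted_key.split(".")
        # Every proper prefix of the path must be an object.
        for depth in range(1, len(parts)):
            shapes[".".join(parts[:depth])] = "object"
        shapes[dotted_key] = "value"
    return shapes


class ToyMapping:
    """Remembers the first shape seen for each field and rejects later conflicts."""

    def __init__(self) -> None:
        self.shapes: dict = {}

    def index(self, labels: dict) -> None:
        for path, shape in shape_of(labels).items():
            known = self.shapes.setdefault(path, shape)
            if known != shape:
                raise ValueError(f"field '{path}' is mapped as {known}, got {shape}")


mapping = ToyMapping()
mapping.index({"app": "foo"})  # 'app' is now a plain value
try:
    mapping.index({"app.kubernetes.io/managed-by": "Helm"})  # needs 'app' as an object
except ValueError as err:
    print("rejected:", err)
```

With a flattened parent field, the labels would be kept as opaque content rather than expanded into mapped subfields, which is why the conflict would not arise.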

CEHENKLE · 2022-06-14T08:59:34Z

@anasalkouz heya Anas -- what's the latest on this?

CEHENKLE · 2022-06-14T09:07:32Z

(question is also for @macrakis) :)

macrakis · 2022-07-18T22:02:05Z

[Design Proposal] The flat data type in OpenSearch

elfisher · 2022-07-19T12:38:31Z

@dblock I think we should move the design proposal @macrakis pasted here into the issue summary as it is a pretty comprehensive proposal for the feature. Do you have any issue with that? I can make the change.

dblock · 2022-07-21T16:42:43Z

No issues with that! Sounds great.

CEHENKLE · 2022-07-21T19:18:02Z

@dblock @macrakis @elfisher Done.

elfisher · 2022-07-21T19:23:27Z

Thanks @CEHENKLE!

dblock · 2022-07-21T19:59:12Z

I think for posterity we might want to do this not by replacing the original content, but adding Update ... and linking and copy-pasting something from below, just so it doesn't look like I did all the work on that proposal (@macrakis did ;)).

CEHENKLE · 2022-07-25T17:27:48Z

Good point, @dblock . Will do it that way going forward.

@aabukhalil Can you pick this up?

aabukhalil · 2022-07-25T17:29:24Z

@CEHENKLE yes I will be working on this

aabukhalil · 2022-08-04T01:19:43Z

Open questions:

Do we have any legal concerns about using flattened or flat as the field type name? We need to close on what name to use.
- Since one of the motivations and demands for this feature is ease of migration (Compatibility), what should we do if we agree not to use flattened as the name? Not using a matching name will make migration harder. Should we introduce field type aliasing?

Checklist of things to do:

How indexing, document writing, document mapping, and field mapping are done.
How to store the data on top of Lucene to support a doc_value-like access pattern for the subfields without causing mapping explosion (the mapping overhead of the flat data type should be O(1)). In other words, how to overload a single Lucene field to hold an object while allowing efficient dotted access. This is needed to support accessing subfields in the flat object for retrieval and aggregation.
How snapshots work and how to support the new field type. Also, how can we restore a snapshot created by Elasticsearch with a “flattened” field type into OpenSearch?
After confirming implementation details and how data will be stored, think about forward compatibility when adding more features to this new field type without causing migration, if possible at all.

reta · 2022-08-04T19:38:51Z

@aabukhalil I agree, going with flattened would be ideal and would significantly ease the migration, but it does indeed pose legal risks. We actually had a very similar discussion regarding the dense_vector field type [1]; maybe a sign-off from one of the PMs (@CEHENKLE ?) would help here.

[1] #3545 (comment)

aabukhalil · 2022-08-04T19:49:03Z

@reta yes I'm asking for help regarding legal implications.

mingshl · 2023-02-15T22:33:43Z

@josefschiefer27 There are pros and cons to approach 1 and approach 3, and I think they are both good but different implementations that cannot be combined.

You are right that approach 3's biggest pro is that it can support multiple types, for example numerics and dates. But its biggest con is that it creates sub-docs. If the JSON object is very complicated, the number of sub-docs can grow exponentially in the worst case (if each node has n children, summing n^k for k from 0 to n gives roughly n^n sub-docs). That would be a lot of sub-docs to consider in the worst case.

Approach 1 treats everything as a string (for example, a number is stored as a string). A very complicated JSON object can produce a long string field to parse in the mapping, but it doesn't do anything in the docs. It works the same way as uninterpreted keywords, which sort like strings; nested fields cannot be used as dates or numbers. It is efficient in storage and mapping.

Think about which use case the flat-object should fit. It is the so-called lazy man's approach: we define a field as flat-object to store it as a string and avoid mapping explosion, and it can be retrieved by exact match at the global field level and with dot path notation. If someone wants to use numeric data in a fairly simple JSON object, they can always go with dynamic mapping to give each subfield its specific field type.

Can you please give a sample use case for approach 3? We would like to see how it fits in different ways.

josefschiefer27 · 2023-02-16T00:19:04Z

Approach 1 does work well with most search operations (e.g. range queries with numbers/dates get tricky), but does fall short with most aggregations (e.g. aggregations for numeric fields).

In the current proposal for Approach 3, I agree there would be lots of sub-docs. However, do we really have to create nested sub-docs? Couldn't we just map the fields as proposed without nested docs? Query capabilities would still be better and more flexible than with approach 1.

Maybe I am missing something - let me give an example. Let's say we have this JSON (note that I added numeric fields 'reviews' and 'price'):

{ "ISBN13" : "V9781933988177",
  "catalog" :
     { "title" : "Lucene in Action",
       "reviews": 1, "price": 11.5,
       "authors" : [
          { "surname" : "McCandless",
            "given"   : "Mike" },
          { "surname" : "Hatcher",
            "given"   : "Erik" }
         ...
       ...]
}}

If we want to flatten the field 'catalog', we could map it similarly to what is suggested in approach 3, with typed values, as follows:

[
  {
    "key": "catalog.title",
    "value_string": "Lucene in Action"
  },
  {
    "key": "catalog.authors.surname",
    "value_string": ["McCandless", "Hatcher"]
  },
  {
    "key": "catalog.authors.given",
    "value_string": ["Mike","Erik"]
  },
  {
    "key": "catalog.reviews",
    "value_num": 1
  },
  {
    "key": "catalog.price",
    "value_num": 11.5
  }
]

With this index structure for flattened fields we used 3 fields in total ('key', 'value_string', 'value_num'), and writing search and aggregation queries is fairly simple.
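
For illustration, the key/value mapping shown above could be produced by a small recursive walk like the following Python sketch. The function names are hypothetical, and the rules (array items share the key of their parent path, numbers go to value_num, everything else to value_string) are inferred from the example rather than taken from any implementation.

```python
from collections import defaultdict


def collect(node, path: str, out: dict) -> None:
    """Group leaf values by (dotted key, typed value field)."""
    if isinstance(node, dict):
        for key, value in node.items():
            collect(value, f"{path}.{key}", out)
    elif isinstance(node, list):
        for item in node:
            collect(item, path, out)  # array items keep the parent's key
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        out[(path, "value_num")].append(node)
    else:
        out[(path, "value_string")].append(str(node))


def to_entries(field_name: str, value) -> list:
    """Return a list of {'key': ..., 'value_*': ...} entries for one flattened field."""
    grouped = defaultdict(list)
    collect(value, field_name, grouped)
    return [
        {"key": key, kind: vals[0] if len(vals) == 1 else vals}
        for (key, kind), vals in grouped.items()
    ]


catalog = {
    "title": "Lucene in Action",
    "reviews": 1,
    "price": 11.5,
    "authors": [
        {"surname": "McCandless", "given": "Mike"},
        {"surname": "Hatcher", "given": "Erik"},
    ],
}

for entry in to_entries("catalog", catalog):
    print(entry)
```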

Let's say we want to get the average price across all JSON documents; we could write the query as follows:

{
  "query": {
    "term": {
      "key": {
        "value": "catalog.price"
      }
    }
  }, 
  "aggs": {
    "average_price": {
      "avg": {
        "field": "value_num"
      }
    }
  }
}

Note that queries and aggs do have to be rewritten to work with this index structure. However, most search and aggregation functions would work with this index structure as intended. There would be fewer limitations than with approach 1.

reta · 2023-02-16T00:43:09Z

@josefschiefer27 trying to understand what would be the catalog field data type, an object?

josefschiefer27 · 2023-02-16T00:52:45Z

@reta - catalog is flattened. With my proposed index structure, it can be an object or array; both would work as expected (same behavior as for other fields in OpenSearch). By 'as expected' I mean the same result as I would get without flattening.

reta · 2023-02-16T01:18:35Z

@josefschiefer27 sorry, I should have been more precise: catalog is flattened, right. I am trying to understand what the underlying representation of this data structure is in terms of Apache Lucene supported types (so we could apply term queries, etc.). OpenSearch does not support arrays natively but only objects or nested types, which are mapped to Apache Lucene documents.

josefschiefer27 · 2023-02-16T03:38:00Z

@reta - I am a bit confused when you say OpenSearch does not support arrays natively. Every field can also be an array in OpenSearch (https://opensearch.org/docs/2.0/opensearch/supported-field-types/index/), which is a feature supported through Lucene.

josefschiefer27 · 2023-02-16T03:43:56Z

If we go with approach 1, we do something very similar to what Elasticsearch does today with the 'flattened' data type. And the number one limitation and complaint from users today is the lack of support for data types besides strings. In my opinion it's very challenging to get around that limitation. It would be nice if we didn't put ourselves into the same corner.

Here are some Elasticsearch limitations for reference: elastic/elasticsearch#61550 - there are many hearts on this issue ;-) See also elastic/elasticsearch#43805 for possible limitations.

josefschiefer27 · 2023-02-16T06:17:18Z

I tried to create a bigger example based on @lukas-vlcek's JSON examples from above to illustrate how the mapping would work. I added date and numeric fields as well as mixed field values to make it more interesting. Below are my learnings from going through the example...

Sample Data

- // --- Document 0
- // The "catalog.author" is a simple text field. One numeric field.
{
  "ISBN13": "V9781933988175",
  "catalog": {
    "title": "Java in Action",
    "author": "John Doe",
    "publication_score": 1023
  }
}

- // --- Document 1
- // The "catalog.author" is an object. New date field.
{
  "ISBN13": "V9781933988177",
  "catalog": {
    "title": "Lucene in Action",
    "price": 12.5,
    "publication_date": "2010-10-10T10:10:10",
    "author": {
      "surname": "McCandless",
      "given": "Mike",
      "publication_score": 1033
    }
  }
}

- // --- Document 2
- // The "catalog.author" is an array with objects.
- // And each object can be either simple value field or another object with variable "schema". Date is an invalid date value.
{
  "ISBN13": "V9781933988176",
  "catalog": {
    "title": "Test in Action",
    "publication_date": "none",
    "price": 14,
    "author": [
      "John Doe",
      {
        "surname": "Smith",
        "given": "Peter"
      },
      {
        "surname": "Smith2",
        "given": "Peter2"
      },
      {
        "surename": "Green",
        "first_name": "Billy"
      }
    ]
  }
}

These documents would be mapped (under the hood) into the following index structure.

- // --- Mapped Document 0
- // The "catalog.author" is a simple text field. One numeric field.
[
  {
    "key": "ISBN13",
    "value_string": "V9781933988175"
  },
  {
    "key": "catalog.title",
    "value_string": "Java in Action"
  },
  {
    "key": "catalog.author",
    "value_string": "John Doe"
  },
  {
    "key": "catalog.publication_score",
    "value_num": 1023
  }
]

- // --- Mapped Document 1
- // The "catalog.author" is an object. New date field.
[
  {
    "key": "ISBN13",
    "value_string": "V9781933988177"
  },
  {
    "key": "catalog.title",
    "value_string": "Lucene in Action"
  },
  {
    "key": "catalog.price",
    "value_num": 12.5
  },
  {
    "key": "catalog.publication_date",
    "value_date": "2010-10-10T10:10:10"
  },
  {
    "key": "catalog.author.surname",
    "value_string": "McCandless"
  },
  {
    "key": "catalog.author.given",
    "value_string": "Mike"
  },
  {
    "key": "catalog.author.publication_score",
    "value_num": 1033
  }
]

- // --- Mapped Document 2
- // The "catalog.author" is an array with objects.
- // And each object can be either simple value field or another object with variable "schema". Date is an invalid date value.
[
  {
    "key": "ISBN13",
    "value_string": "V9781933988176"
  },
  {
    "key": "catalog.title",
    "value_string": "Test in Action"
  },
  {
    "key": "catalog.publication_date",
    "value_string": "none"
  },
  {
    "key": "catalog.price",
    "value_num": 14
  },
  {
    "key": "catalog.author",
    "value_string": "John Doe"
  },
  {
    "key": "catalog.author.given",
    "value_string": ["Peter", "Peter2"]
  },
  {
    "key": "catalog.author.surname",
    "value_string": ["Smith", "Smith2"]
  },
  {
    "key": "catalog.author.surename",
    "value_string": "Green"
  },
  {
    "key": "catalog.author.first_name",
    "value_string": "Billy"
  }
]

For this index mapping we used 4 Lucene fields ('key', 'value_string', 'value_num', 'value_date') to map all fields into Lucene. You can see that we can also map 'weird' JSON data which wouldn't be supported by OpenSearch without flattening.

Now the real fun starts - how can we query this json data!?!

Queries and aggregations using flattened fields need to be rewritten - any query clause and aggregation needs to use generic value fields and requires an additional filter for the key.

Let's try a query example. Let's assume we want to find all docs which contain the word 'Action' in catalog.title.

Without flattening the query would be:

{
   "query" : {
        "wildcard" : {
                "catalog.title": "*Action*"
         }
    }
}

To get the same result, we could try to rewrite this query as follows:

{
   "query" : {
      "bool": {
        "filter": [
          {
             "term" : {
                   "key": "catalog.title"
             }
          },
          {
             "wildcard" : {
                "value_string": "*Action*"
             }
          }
        ]
      }
    }
}

However, there is a big problem with this query: since we don't use nested docs/queries, it wouldn't always deliver the correct result (e.g., if a document has a 'catalog.title' field and *Action* matches in some other field, we would still get a hit). I could possibly use a scripted query to validate the match, but that wouldn't be an elegant solution anymore. It might work, as discussed above, by using nested docs/queries, but that might lead to a 'nested-doc' explosion.
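
The false positive can be reproduced with a toy model in Python. The sketch below assumes that, without nested docs, all 'key' values and all 'value_string' values of a document are matched independently of each other; the sample entries and function names are made up, and fnmatchcase only approximates a wildcard query.

```python
from fnmatch import fnmatchcase

# One hypothetical document: the title does NOT contain "Action", the author does.
entries = [
    {"key": "catalog.title", "value_string": "Search Basics"},
    {"key": "catalog.author", "value_string": "Jane Action"},
]


def matches_without_nesting(doc_entries: list, key: str, pattern: str) -> bool:
    """Each clause is checked against the whole document, so key and value are decoupled."""
    keys = {e["key"] for e in doc_entries}
    values = {e["value_string"] for e in doc_entries if "value_string" in e}
    return key in keys and any(fnmatchcase(v, pattern) for v in values)


def matches_with_pairing(doc_entries: list, key: str, pattern: str) -> bool:
    """Both clauses must hold for the same entry, which is what nested docs would enforce."""
    return any(
        e["key"] == key and fnmatchcase(e.get("value_string", ""), pattern)
        for e in doc_entries
    )


print(matches_without_nesting(entries, "catalog.title", "*Action*"))  # True (false positive)
print(matches_with_pairing(entries, "catalog.title", "*Action*"))     # False
```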

The example was a helpful exercise for me to better understand the problem. It would be nice if we could find some way to support data types beyond just strings.

josefschiefer27 · 2023-02-16T07:04:27Z

Here is an idea to get around the problem of having only string (sub-)fields in flattened objects. We could allow users to define a parameter in the mapping to exclude certain fields from the flattening (e.g. "exclude": "*date*" for not flattening all date fields). This would allow users to have full search/aggs support for selected fields of the flattened object.

reta · 2023-02-16T12:18:37Z

> @reta - I am a bit confused when you say OpenSearch does not support arrays natively. Every field can be also an array in OpenSearch (https://opensearch.org/docs/2.0/opensearch/supported-field-types/index/) which is a feature supported through Lucene.

"There is no dedicated array field type in OpenSearch. Instead, you can pass an array of values into any field. All values in the array must have the same field type." (taken from the docs)

mingshl · 2023-02-17T05:12:48Z

@josefschiefer27 Approach 3 does create a lot of sub-docs, but not nested docs at multiple levels. To be clear, there will be a root level and level one; that's two levels in total. But level one might have n^n sub-docs in the worst case. Yes, it will support numeric operations. That is an important point for users, but it's not a minimum requirement addressed in this issue.

It seems that you have a clear idea of how to implement approach 3. Would you like to raise a PR, or a draft PR, for approach 3?

mingshl · 2023-02-17T05:46:48Z

> An idea to get around the problem of having only string (sub-)fields in flattened objects. We could allow users to define in the mapping a parameter to exclude certain fields from the flattening (e.g. "exclude": "*date*" for not flattening all date fields). This would allow users to have full search/aggs supports for selected fields of the flattened object.

I thought about dynamically adding subfields to identify typed fields, but if adding unlimited subfields is enabled (for example, millions of date and number subfields), it risks leading to mapping explosion.

There might be a workaround to help with numeric subfields: if a user wants to use one raw field as a flat-object to ingest the entire JSON as a string and then finds a numeric subfield and a date subfield within the JSON object, the user can cherry-pick those subfields and add new fields to update the documents with numeric or date fields. In this example, that gives three fields:

{
  "raw field": {
    "type": "flat-object"
  },
  "date field": {
    "type": "Dates"
  },
  "number field": {
    "type": "numbers"
  }
}

It might need some work, but this can be a workaround to help with typed fields while avoiding mapping explosion.

josefschiefer27 · 2023-02-17T07:26:23Z

@mingshl - I think creating lots of sub-docs for flattened objects is sub-optimal and likely creates other problems. There might be flattened objects where the number of sub-docs can become huge, and nested queries can be expensive. In my attempt at approach 3 I tried to avoid nested docs/queries.

Meanwhile, I do believe that approach 1 with smart string encoding is probably the most promising approach. In your description of approach 1 you are using two fields ('value' and 'content_and_path'). Wouldn't the 'content_and_path' field be sufficient? You mentioned catalog = 'Mike' as an example - I'm not sure when this would be needed in an OpenSearch query.

Edit: Found the answer to my question - such a query is currently supported by the 'flattened' data type.

kotwanikunal · 2023-04-06T18:45:42Z

@lukas-vlcek Reaching out since this is marked as a part of v2.7.0 roadmap. Please let me know if this isn't going to be a part of the release.

mingshl · 2023-04-06T18:53:12Z

Hi @kotwanikunal, the flat-object is going into the v2.7.0 release. We are planning to merge this PR later today. #6507

To fulfill issue #1018, we implement the approach of storing the entire nested object as a String. A `flat_object` creates exactly two internal Lucene [StringField](https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/document/StringField.html) fields ("._value" and "._valueAndPath"), regardless of how many nested fields the flat field contains.

- value: a keyword field that contains the leaf values of all subfields, allowing for efficient searching of leaf values without specifying the path (e.g., catalog = 'Mike').
- valueAndPath: a keyword field that contains the path to the leaf value and its value, enabling efficient searching when the query includes the path to the leaf (e.g., catalog.author.given = 'Mike').

Limitations and future development:

- Enable searching in Painless scripts; we will need to direct the fielddatabuilder to fetch docvalues within the two StringFields in memory.
- Open parameter settings, such as normalizer, docValues, ignoreAbove, nullValue, similarity, and depthlimit.
- Enable wildcard queries.

(cherry picked from commit 75bb3ef)

Signed-off-by: Mingshi Liu <mingshl@amazon.com>
Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
Co-authored-by: Lukáš Vlček <lukas.vlcek@aiven.io>
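
As a rough illustration of the two internal fields described in this commit message, the Python sketch below derives one list of bare leaf values and one list of path-plus-value strings from a nested object. The function name and the "path=value" encoding are assumptions made for this example; the actual token format used by the flat_object mapper may differ.

```python
def flat_object_tokens(field_name: str, obj) -> tuple:
    """Return (values, values_and_paths) for every leaf under obj."""
    values = []
    values_and_paths = []

    def walk(node, path: str) -> None:
        if isinstance(node, dict):
            for key, child in node.items():
                walk(child, f"{path}.{key}")
        elif isinstance(node, list):
            for item in node:
                walk(item, path)
        else:
            text = str(node)  # every leaf is treated as a string
            values.append(text)
            values_and_paths.append(f"{path}={text}")  # assumed encoding

    walk(obj, field_name)
    return values, values_and_paths


catalog = {
    "title": "Lucene in Action",
    "author": {"surname": "McCandless", "given": "Mike"},
}

values, values_and_paths = flat_object_tokens("catalog", catalog)

# catalog = 'Mike' would be answered from the bare values ...
print("Mike" in values)
# ... and catalog.author.given = 'Mike' from the path-qualified values.
print("catalog.author.given=Mike" in values_and_paths)
```

Whatever the depth of the object, only these two lists are produced, which mirrors the point that the number of Lucene fields stays fixed.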

DarshitChanpura · 2023-04-14T15:09:07Z

Hi @dblock, is the issue ready to be closed, since #6507 is merged?

mingshl · 2023-04-14T17:39:06Z

We can close this issue now. flat_object is going into 2.7, and future enhancement issues are here:
#7138
#7137
#7136

dblock added the enhancement Enhancement or improvement to existing feature or request label Jul 28, 2021

anasalkouz added the Indexing & Search label Nov 17, 2021

This was referenced Jun 23, 2022

[RFC] Lucene based kNN search support in core OpenSearch #3545

Closed

[BUG] Failed to parse mapping [_doc]: No handler for type [flattened] declared on field #3733

Closed

reta mentioned this issue Jul 4, 2022

[Meta] Migrations to OpenSearch #3757

Open

CEHENKLE assigned aabukhalil Jul 25, 2022

mingshl mentioned this issue Feb 28, 2023

Add FlatObject FieldMapper #6507

Merged

6 tasks

dblock mentioned this issue Mar 3, 2023

Add support for wildcard field type #5639

Closed

macohen added the roadmap label Mar 15, 2023

anasalkouz added Migration:In Progress and removed Migration:In Progress labels Mar 17, 2023

mingshl mentioned this issue Mar 27, 2023

[DOC] Flat_Object Field Type Documentation opensearch-project/documentation-website#3586

Closed

1 task

kotwanikunal mentioned this issue Apr 6, 2023

Release Version 2.7.0 #6967

Closed

23 tasks

mingshl mentioned this issue Apr 10, 2023

[2.x] Extend the version range to run flat-object field REST Yaml test on 2.7.0 #7081

Merged

6 tasks

dblock closed this as completed Apr 14, 2023

andrross mentioned this issue Apr 28, 2023

moved flat object from changed to added b/c it's new #7322

Merged

4 tasks

macohen mentioned this issue May 23, 2023

[RFC] Add Field Type Label #7693

Closed

peternied mentioned this issue Jun 19, 2023

Adding field level security test cases for FlatFields opensearch-project/security#2876

Merged

3 tasks
