Add flattened field type #1018

Closed
dblock opened this issue Jul 28, 2021 · 83 comments
Labels
enhancement Enhancement or improvement to existing feature or request Indexing & Search v2.7.0

Comments

@dblock
Member

dblock commented Jul 28, 2021

(updated from #1018 (comment) below, @macrakis)

[Design Proposal] The flat data type in OpenSearch

Summary

JSON objects whose components are not indexed are a proposed new field type in OpenSearch; we call them flat objects, and their mapping type is flattened. Subfields within the JSON are accessible using standard dot path notation in DSL, SQL, and Painless, but are not indexed for fast lookup.

Flat subfields support exact match queries and textual sorting.

Flat fields and subfields do not support type-specific parsing, numerical operations such as numerical comparison and sorting, text analyzers, or highlighting.

Motivation

  • Flat fields are useful when a field and its subfields will mostly be read, and not be used as search criteria.
  • Flat fields do not create a large number of fields, one per unique key. The “mapping explosion” caused by mapping many individual fields can lead to heavy RAM use and thrashing. (Memory efficiency)
  • Flat fields do not have inverted indexes, which take space. (Space efficiency)
  • Migration from systems supporting similar flat types is easy. (Compatibility)

Demand

OpenDistro: Provide alternative to datatype flattened #523 (Dec 21, 2020)

  • Prevent mapping explosion.

OpenDistro: Migration via Snapshot restore. Mapping roadblocks (Dec 2021)

  • Migrating corpus with existing flattened mappings.

Specification

Mapping and ingestion

  • flattened is a new mapping type
  • Fields declared flattened are ingested as structured, nested objects.
  • Neither the field as a whole nor its subfields are indexed.
  • The nested fields are uninterpreted keywords, and sort like strings. Nested fields cannot be used as dates or numbers.

Searching and retrieving

  • Supports fetching subfields with the usual dotted notation.
  • Supports aggregations of subfields with the usual dotted notation.
  • Filtering by subfields is supported, but may be inefficient (full scan).

Example

This declares catalog as being of type flattened:

{ "mappings":
  { "properties" :
     { "ISBN13"  : { "type" : "keyword" },
       "catalog" : { "type" : "flattened" } }
}}

Consider the ingestion of the following document:

{ "ISBN13" : "V9781933988177",
  "catalog" :
     { "title" : "Lucene in Action",
       "author1" :
          { "surname" : "McCandless",
            "given"   : "Mike" }
}}

Upon ingestion, this will create two indexes: one for ISBN13, and one for catalog. The surname field, for example, may be accessed as catalog.author1.surname. But the title, author1, surname, and given fields (and the words within them) will not be indexed.
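
To make the dot-path access concrete, here is a minimal Python sketch of how a path such as catalog.author1.surname could be resolved against the document above. The helper name `get_by_dot_path` is hypothetical and not part of OpenSearch; it only illustrates the lookup semantics the proposal describes, not how the field would be stored.

```python
from typing import Any


def get_by_dot_path(doc: dict, path: str) -> Any:
    """Walk a nested dict along a dotted path; return None if any step is missing."""
    current: Any = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


doc = {
    "ISBN13": "V9781933988177",
    "catalog": {
        "title": "Lucene in Action",
        "author1": {"surname": "McCandless", "given": "Mike"},
    },
}

surname = get_by_dot_path(doc, "catalog.author1.surname")
missing = get_by_dot_path(doc, "catalog.author2.surname")
```

The lookup touches one document only, which is why fetching a subfield can be cheap even though nothing below catalog is indexed.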

Performance

  • Performance should be similar to a keyword field.
  • Fetching the value of a nested field using dot paths in a given document should be efficient.
  • Finding a document with a specific value of a nested field (e.g., given = ‘Mike’) is not efficient: it may require a full scan of the index, an expensive operation.

Limitations

Flattened fields and subfields do not support type-specific operations, including parsing, numerical operations (including numerical ranges), text analyzers, or highlighting.

Possible implementation

These are some possible implementations. The implementation may use one of these or something else that satisfies the requirements.

  • Flattened fields could be stored as Lucene blobs whose internal structure is defined by OpenSearch.
  • The internal structure should be such that dot path references (such as .catalog.author1.given) are cheap. Perhaps binary JSON.

Security

Flattened fields are treated as atomic for the purposes of security. Individual subfields cannot have different security properties than the field as a whole.

Possible enhancements

The current specification is minimal. It intentionally does not include many options offered by other vendors.

Depending on the user feedback we receive after the initial release, various enhancements are possible:

  • Querying the field as a whole searches all leaf values. For example, catalog=Mike would match the document above.
  • Fine tune efficiency with various options controlling query interpretation, etc.
  • Provide a concatenated index. In that index, the entry for the given field above would be something like “catalog|author1|given=Mike”. This would provide efficient searching by field (assuming that indexes support prefix compression).
  • Allow specifying that certain subfields should be indexed separately. (This could also be provided in Logstash.)
  • Support wildcards in field names.
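
As a rough illustration of the concatenated-index idea in the list above, the following Python sketch generates one entry per leaf value. The function name `concatenated_entries` and the exact `path=value` layout are assumptions made for illustration; the proposal only says the entry would be "something like" this.

```python
from typing import Any, Iterator


def concatenated_entries(field: str, value: Any, sep: str = "|") -> Iterator[str]:
    """Yield one 'path=value' entry per leaf, joining path segments with sep."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from concatenated_entries(f"{field}{sep}{key}", child, sep)
    elif isinstance(value, list):
        for item in value:
            yield from concatenated_entries(field, item, sep)
    else:
        yield f"{field}={value}"


catalog = {
    "title": "Lucene in Action",
    "author1": {"surname": "McCandless", "given": "Mike"},
}

# Sorting groups entries that share a path prefix next to each other.
entries = sorted(concatenated_entries("catalog", catalog))
author_entries = [e for e in entries if e.startswith("catalog|author1|")]
```

Because sorted entries with a common path sit side by side, a term dictionary with prefix compression would store the shared path only once per run, which is the efficiency assumption stated in the bullet.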
@dblock dblock added the enhancement Enhancement or improvement to existing feature or request label Jul 28, 2021
@mrkamel

mrkamel commented Oct 7, 2021

I'd like to underline the need for an official, feature-complete, and performant alternative to the flattened datatype. It has already been requested many times on several sites: opendistro-for-elasticsearch/opendistro-build#523, https://discuss.opendistrocommunity.dev/t/flattened-type-with-opendistro/5014/4, https://forums.aws.amazon.com/thread.jspa?threadID=32970, ...

@elfisher

@dblock and @nknize is this in Lucene already? I couldn't tell from the Lucene docs. If it isn't should it be contributed there and then pulled into OpenSearch? Seems like it could add value in Lucene too.

@chipzzz

chipzzz commented Dec 15, 2021

@aparo this link is no longer there: https://github.com/aparo/opensearch-flattened-mapper-plugin :(

@abhishek-v

Is there any plan to support this functionality in the near future?

@dblock
Member Author

dblock commented Dec 18, 2021

Is there any plan to support this functionality in the near future?

I don't think anyone is working on it, cc: @anasalkouz?

@reta
Collaborator

reta commented Jan 14, 2022

I was recently involved in a discussion on this subject [1]. It would be really beneficial to have the flattened type, but I believe [2] was the subject of IP / copyright claims (@aparo, it would be great to hear the reasons, many people are asking, thank you 🙏). To keep it short: we could probably add something similar to the flattened type but with a different name (and, obviously, a different implementation), but migrating existing Elasticsearch indices using snapshot / restore would be problematic (unless we internally supported type aliases, etc.), since the type won't match.

[1] https://discuss.opendistrocommunity.dev/t/migration-via-snapshot-restore-mapping-roadblocks/7906/7
[2] https://github.com/aparo/opensearch-flattened-mapper-plugin

@andreaAlkalay

Hi,
Do you know when we can expect to have the new flattened type implemented?
It is crucial for our business scenario.
Thanks,
Andrea.

@dblock
Member Author

dblock commented Feb 2, 2022

@andreaAlkalay @abhishek-v it's not on any roadmap today, but we would gladly accept a PR

@tristandostaler

+1!

@macrakis

@andreaAlkalay Can you tell us more about your business scenario and why the flattened type is important to it? Thanks.

@amalgamm

amalgamm commented Apr 4, 2022

@andreaAlkalay Can you tell us more about your business scenario and why the flattened type is important to it? Thanks.

Let me jump in :)
We have a case where a Kubernetes pod has built-in labels like app=foo, and we also have a bunch of services that use labels like app.kubernetes.io/managed-by: Helm. In the first case, the field app is just a string. In the other case, it is a nested object. When a log shipper sends such an entry to OpenSearch, OpenSearch throws back a mapper_parsing_exception and drops the documents.

The flattened type solves such problems for fields that you don't want to treat as nested objects.
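
The conflict can be reproduced outside OpenSearch with a small Python sketch. It assumes that dotted field names are read as object paths, which is what produces the nested object described above; the helper names `expand_dotted_keys` and `kind` are made up for this illustration.

```python
def expand_dotted_keys(flat: dict) -> dict:
    """Expand keys such as 'a.b.c' into nested dicts, treating dots as object paths."""
    nested: dict = {}
    for dotted, value in flat.items():
        node = nested
        *parents, leaf = dotted.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


def kind(value) -> str:
    """Classify a value the way a mapper would have to: object or plain string."""
    return "object" if isinstance(value, dict) else "string"


# Labels from two different pods, as they might arrive from a log shipper.
pod_a = expand_dotted_keys({"app": "foo"})
pod_b = expand_dotted_keys({"app.kubernetes.io/managed-by": "Helm"})

# The same field name needs two incompatible mappings.
conflict = kind(pod_a["app"]) != kind(pod_b["app"])
```

Whichever document arrives first fixes the mapping of app, and the other shape is then rejected. A field that accepts any JSON below it without mapping the subfields avoids having to choose.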

@CEHENKLE
Member

@anasalkouz heya Anas -- what's the latest on this?

@CEHENKLE
Member

(question is also for @macrakis) :)

@macrakis

[Design Proposal] The flat data type in OpenSearch


@elfisher

@dblock I think we should move the design proposal @macrakis pasted here into the issue summary as it is a pretty comprehensive proposal for the feature. Do you have any issue with that? I can make the change.

@dblock
Member Author

dblock commented Jul 21, 2022

No issues with that! Sounds great.

@CEHENKLE
Member

@dblock @macrakis @elfisher Done.

@elfisher

Thanks @CEHENKLE!

@dblock
Member Author

dblock commented Jul 21, 2022

I think for posterity we might want to do this not by replacing the original content, but adding Update ... and linking and copy-pasting something from below, just so it doesn't look like I did all the work on that proposal (@macrakis did ;)).

@CEHENKLE
Member

Good point, @dblock . Will do it that way going forward.

@aabukhalil Can you pick this up?

@aabukhalil
Contributor

@CEHENKLE yes I will be working on this

@aabukhalil
Contributor

Open questions:

  • Do we have any legal concerns about using flattened or flat as the field type name? We need to settle on what name to use.
    • Since one of the motivations for implementing this feature is ease of migration (Compatibility), what should we do if we agree not to use flattened as the name? Not using a matching name will make migration harder. Should we introduce field type aliasing?

Checklist of things to do:

  • How indexing, document writing, document mapping, and field mapping are done.
  • How to store the data on top of Lucene to support a doc_values-like access pattern for the subfields without causing mapping explosion (the mapping overhead of the flat datatype should be O(1)). In other words, how to overload a single Lucene field to hold an object while allowing efficient dotted access. This is needed to support accessing subfields in the flat object for retrieval and aggregation.
  • How snapshots work and how they can support the new field type. How can we restore a snapshot created by Elasticsearch that contains a “flattened” field type into OpenSearch?
  • After confirming the implementation details and how data will be stored, think about forward compatibility, so that more features can be added to this new field type without forcing a migration, if that is possible at all.

@reta
Collaborator

reta commented Aug 4, 2022

@aabukhalil I agree, going with flattened would be ideal and would significantly ease the migration, but it does indeed pose legal risks. We actually had a very similar discussion regarding the dense_vector field type [1]; maybe a sign-off from one of the PMs (@CEHENKLE ?) would help here.

[1] #3545 (comment)

@aabukhalil
Contributor

@reta yes I'm asking for help regarding legal implications.

@mingshl
Contributor

mingshl commented Feb 15, 2023

@josefschiefer27 Both approach 1 and approach 3 have pros and cons. I think they are both good but quite different implementations that cannot be combined.

You are right that approach 3's biggest pro is that it can support multiple types, for example numerics and dates. But its biggest con is that it creates sub-docs. If the JSON object is very complicated, the number of sub-docs can grow exponentially in the worst case (if each node has n children, summing n^k for k from 0 to n gives on the order of n^n sub-docs). That would be a lot of sub-docs to consider in the worst case.
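
To put numbers on that worst case, here is a small Python sketch. It assumes one particular reading of the estimate: a tree of depth n in which every node has n children, with one sub-doc per node. The function name `subdoc_count` is hypothetical.

```python
def subdoc_count(n: int) -> int:
    """Total nodes in a tree of depth n where every node has n children."""
    return sum(n**k for k in range(n + 1))


# Number of sub-docs for a few small branching factors.
counts = {n: subdoc_count(n) for n in (2, 3, 4, 5)}

# The geometric sum stays between n^n and 2 * n^n for n >= 2.
bounded = all(n**n <= subdoc_count(n) <= 2 * n**n for n in range(2, 10))
```

Under this reading, the total is dominated by the deepest level, so n^n is the right order of magnitude even though the exact sum is somewhat larger.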

Approach 1 treats everything as a string (for example, a number is stored as a string). A very complicated JSON object can produce a long string field to parse in the mapping, but it does not create any additional docs. Subfields work the same way as uninterpreted keywords and sort like strings; nested fields cannot be used as dates or numbers. It is efficient in storage and mapping.

Consider which use case the flat-object should fit. It is the so-called lazy man's approach: we define a field as flat-object so that it is stored as strings to avoid mapping explosion, and its values can be retrieved by exact match, either at the global field level or with dot path notation. If someone wants to use numeric data in a fairly simple JSON object, they can always use dynamic mapping, which lets each subfield have its own field type.

Can you please give a sample use case for approach 3? We would like to see how it can fit in different ways.

@josefschiefer27

josefschiefer27 commented Feb 16, 2023

Approach 1 works well with most search operations (though range queries with numbers/dates get tricky), but falls short with most aggregations (e.g. aggregations for numeric fields).

In the current proposal for Approach 3, I agree there would be lots of sub-docs. However, do we really have to create nested sub-docs? Couldn't we just map the fields as proposed without nested docs? Query capabilities would still be better and more flexible than with approach 1.

Maybe I am missing something - let me give an example. Let's say we have this JSON (note that I added the numeric fields 'reviews' and 'price'):

{ "ISBN13" : "V9781933988177",
  "catalog" :
     { "title" : "Lucene in Action",
       "reviews": 1, "price": 11.5,
       "authors" : [
          { "surname" : "McCandless",
            "given"   : "Mike" },
          { "surname" : "Hatcher",
            "given"   : "Erik" }
         ...
       ...]
}}

If we want to flatten the field 'catalog', we could map it similarly to what is suggested in approach 3, with typed values, as follows:

[
  {
    "key": "catalog.title",
    "value_string": "Lucene in Action"
  },
  {
    "key": "catalog.authors.surname",
    "value_string": ["McCandless", "Hatcher"]
  },
  {
    "key": "catalog.authors.given",
    "value_string": ["Mike","Erik"]
  },
  {
    "key": "catalog.reviews",
    "value_num": 1
  },
  {
    "key": "catalog.price",
    "value_num": 11.5
  }
]

With this index structure for flattened fields we use 3 fields in total ('key', 'value_string', 'value_num'), and writing search and aggregation queries is fairly simple.

Let's say we want to get the average price across all JSON documents. We could write the query as follows:
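
A minimal Python sketch of how the key/value entries above could be derived from the catalog object follows. The function names `flatten_typed` and `to_entries` are hypothetical, and the rule that values from array items are collected under the parent key is inferred from the example rather than from any specification.

```python
from collections import defaultdict
from numbers import Number


def flatten_typed(prefix: str, value, out=None) -> dict:
    """Collect leaf values under (dotted key, typed value field)."""
    if out is None:
        out = defaultdict(list)
    if isinstance(value, dict):
        for key, child in value.items():
            flatten_typed(f"{prefix}.{key}", child, out)
    elif isinstance(value, list):
        for item in value:
            flatten_typed(prefix, item, out)  # array items share the parent's key
    else:
        # bool is a Number subclass in Python, so keep it out of value_num.
        is_num = isinstance(value, Number) and not isinstance(value, bool)
        out[(prefix, "value_num" if is_num else "value_string")].append(value)
    return out


def to_entries(collected: dict) -> list:
    """Turn collected values into entries; a single value is not wrapped in a list."""
    return [
        {"key": key, field: values[0] if len(values) == 1 else values}
        for (key, field), values in collected.items()
    ]


catalog = {
    "title": "Lucene in Action",
    "reviews": 1,
    "price": 11.5,
    "authors": [
        {"surname": "McCandless", "given": "Mike"},
        {"surname": "Hatcher", "given": "Erik"},
    ],
}

entries = to_entries(flatten_typed("catalog", catalog))
```

However many distinct keys the JSON contains, the entries only ever use the same few value fields, which is what keeps the mapping from growing.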

{
  "query": {
    "term": {
      "key": {
        "value": "catalog.price"
      }
    }
  }, 
  "aggs": {
    "average_price": {
      "avg": {
        "field": "value_num"
      }
    }
  }
}

Note that queries and aggs do have to be rewritten to work with this index structure. However, most search and aggregation functions would work with this index structure as intended. There would be fewer limitations than with approach 1.
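
The rewrite step could look roughly like the following Python sketch, which turns a metric aggregation written against the original field into the key-filtered form shown above. The function name `rewrite_metric_agg` is hypothetical, and the sketch only handles a single metric aggregation on a numeric subfield.

```python
def rewrite_metric_agg(name: str, agg: dict) -> dict:
    """Rewrite {'avg': {'field': 'catalog.price'}} to filter on key and aggregate value_num."""
    metric, body = next(iter(agg.items()))
    path = body["field"]
    return {
        "query": {"term": {"key": {"value": path}}},
        "aggs": {name: {metric: {**body, "field": "value_num"}}},
    }


# What a user would write without flattening, and what it becomes internally.
user_agg = {"avg": {"field": "catalog.price"}}
rewritten = rewrite_metric_agg("average_price", user_agg)
```

The user-facing field name moves into a term filter on key, and the aggregation itself always targets the generic value_num field.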

@reta
Collaborator

reta commented Feb 16, 2023

@josefschiefer27 I'm trying to understand what the data type of the catalog field would be. An object?

@josefschiefer27

josefschiefer27 commented Feb 16, 2023

@reta - catalog is flattened. With my proposed index structure, it can be an object or an array; both would work as expected (same behavior as for other fields in OpenSearch). By 'as expected' I mean the same result as I would get without flattening.

@reta
Collaborator

reta commented Feb 16, 2023

@josefschiefer27 sorry, I should have been more precise: catalog is flattened, right. I am trying to understand what the underlying representation of this data structure is in terms of the types Apache Lucene supports (so we could apply term queries, etc.). OpenSearch does not support arrays natively, only objects or nested types, which are mapped to Apache Lucene documents.

@josefschiefer27

josefschiefer27 commented Feb 16, 2023

@reta - I am a bit confused when you say OpenSearch does not support arrays natively. Every field can also be an array in OpenSearch (https://opensearch.org/docs/2.0/opensearch/supported-field-types/index/), which is a feature supported through Lucene.

@josefschiefer27

josefschiefer27 commented Feb 16, 2023

If we go with approach 1, we do something very similar to what Elasticsearch does today with the 'flattened' data type. And the number one limitation and complaint from users today is the lack of support for data types besides strings. In my opinion it's very challenging to get around that limitation. It would be nice if we didn't put ourselves into the same corner.

Here are some Elasticsearch limitations for reference: elastic/elasticsearch#61550 - there are many hearts on this issue ;-) See also elastic/elasticsearch#43805 for possible limitations.

@josefschiefer27

josefschiefer27 commented Feb 16, 2023

I tried to create a bigger example based on @lukas-vlcek's JSON examples from above to illustrate how the mapping would work. I added date and numeric fields as well as mixed field values to make it more interesting. Below are my learnings from going through the example...

Sample Data

// --- Document 0
// The "catalog.author" is a simple text field. One numeric field.
{
  "ISBN13": "V9781933988175",
  "catalog": {
    "title": "Java in Action",
    "author": "John Doe",
    "publication_score": 1023
  }
}

// --- Document 1
// The "catalog.author" is an object. New date field.
{
  "ISBN13": "V9781933988177",
  "catalog": {
    "title": "Lucene in Action",
    "price": 12.5,
    "publication_date": "2010-10-10T10:10:10",
    "author": {
      "surname": "McCandless",
      "given": "Mike",
      "publication_score": 1033
    }
  }
}

// --- Document 2
// The "catalog.author" is an array with objects.
// Each item can be either a simple value or another object with a variable "schema". The date is an invalid date value.
{
  "ISBN13": "V9781933988176",
  "catalog": {
    "title": "Test in Action",
    "publication_date": "none",
    "price": 14,
    "author": [
      "John Doe",
      {
        "surname": "Smith",
        "given": "Peter"
      },
      {
        "surname": "Smith2",
        "given": "Peter2"
      },
      {
        "surename": "Green",
        "first_name": "Billy"
      }
    ]
  }
}

These documents would be mapped (under the hood) into the following index structure.

// --- Mapped Document 0
// The "catalog.author" is a simple text field. One numeric field.
[
  {
    "key": "ISBN13",
    "value_string": "V9781933988175"
  },
  {
    "key": "catalog.title",
    "value_string": "Java in Action"
  },
  {
    "key": "catalog.author",
    "value_string": "John Doe"
  },
  {
    "key": "catalog.publication_score",
    "value_num": 1023
  }
]

// --- Mapped Document 1
// The "catalog.author" is an object. New date field.
[
  {
    "key": "ISBN13",
    "value_string": "V9781933988177"
  },
  {
    "key": "catalog.title",
    "value_string": "Lucene in Action"
  },
  {
    "key": "catalog.price",
    "value_num": 12.5
  },
  {
    "key": "catalog.publication_date",
    "value_date": "2010-10-10T10:10:10"
  },
  {
    "key": "catalog.author.surname",
    "value_string": "McCandless"
  },
  {
    "key": "catalog.author.given",
    "value_string": "Mike"
  },
  {
    "key": "catalog.author.publication_score",
    "value_num": 1033
  }
]

// --- Mapped Document 2
// The "catalog.author" is an array with objects.
// Each item can be either a simple value or another object with a variable "schema". The date is an invalid date value.
[
  {
    "key": "ISBN13",
    "value_string": "V9781933988176"
  },
  {
    "key": "catalog.title",
    "value_string": "Test in Action"
  },
  {
    "key": "catalog.publication_date",
    "value_string": "none"
  },
  {
    "key": "catalog.price",
    "value_num": 14
  },
  {
    "key": "catalog.author",
    "value_string": "John Doe"
  },
  {
    "key": "catalog.author.surname",
    "value_string": ["Smith", "Smith2"]
  },
  {
    "key": "catalog.author.given",
    "value_string": ["Peter", "Peter2"]
  },
  {
    "key": "catalog.author.surename",
    "value_string": "Green"
  },
  {
    "key": "catalog.author.first_name",
    "value_string": "Billy"
  }
]

For this index mapping we used 4 Lucene fields ('key', 'value_string', 'value_num', 'value_date') to map all fields into Lucene. You can see that we can also map 'weird' JSON data which wouldn't be supported by OpenSearch without flattening.

Now the real fun starts - how can we query this json data!?!

Queries and aggregations using flattened fields need to be rewritten - any query clause and aggregation needs to use generic value fields and requires an additional filter for the key.

Let's try a query example. Assume we want to find all docs that contain the word 'Action' in catalog.title.

Without flattening the query would be:

{
   "query" : {
        "wildcard" : {
                "catalog.title": "*Action*"
         }
    }
}

To get the same result, we could try to rewrite this query as follows:

{
   "query" : {
      "bool": {
        "filter": [
          {
             "term" : {
                   "key": "catalog.title"
             }
          },
          {
             "wildcard" : {
                "value_string": "*Action*"
             }
          }
        ]
      }
    }
}

However, there is a big problem with this query: since we don't use nested docs/queries, it wouldn't always deliver the correct result (e.g., if there is a 'catalog.title' field and *Action* matches in some other field, we would still get a hit). I could possibly use a scripted query to validate the match, but that wouldn't be an elegant solution anymore. It might work, as discussed above, by using nested docs/queries; however, that might lead to a 'nested-doc' explosion.

The example was a helpful exercise for me to better understand the problem. It would be nice if we could find some way to support data types beyond just strings.
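
The false hit can be reproduced with a few lines of Python. This is only a simulation of the matching logic, not OpenSearch code: `doc_matches` is a hypothetical helper that evaluates each filter clause against the whole document, and catalog.publisher is an invented field used to trigger the problem.

```python
from fnmatch import fnmatchcase


def doc_matches(fields: dict, key: str, pattern: str) -> bool:
    """Evaluate both filter clauses independently against one non-nested document."""
    key_clause = key in fields["key"]
    value_clause = any(fnmatchcase(v, pattern) for v in fields["value_string"])
    return key_clause and value_clause


# The title does not contain 'Action', but another field's value does.
doc = {
    "key": ["catalog.title", "catalog.publisher"],
    "value_string": ["Lucene Basics", "Action Press"],
}

hit = doc_matches(doc, "catalog.title", "*Action*")

# Checking key and value as a pair gives the intended answer.
paired_hit = any(
    k == "catalog.title" and fnmatchcase(v, "*Action*")
    for k, v in zip(doc["key"], doc["value_string"])
)
```

Here hit is true while paired_hit is false. Keeping each key tied to its value is exactly what nested docs provide, and it is what the flat layout loses.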

@josefschiefer27

josefschiefer27 commented Feb 16, 2023

Here is an idea to get around the problem of having only string (sub-)fields in flattened objects. We could allow users to define a parameter in the mapping to exclude certain fields from the flattening (e.g. "exclude": "*date*" to leave all date fields unflattened). This would give users full search/aggs support for selected fields of the flattened object.
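
A minimal Python sketch of how such a pattern could split subfields follows. The `exclude` parameter is only the idea from this comment, not an existing mapping option, and `split_by_exclude` is a hypothetical helper that uses shell-style wildcards.

```python
from fnmatch import fnmatchcase


def split_by_exclude(paths: list, exclude: str) -> tuple:
    """Return (paths kept flattened, paths that would get their own typed mapping)."""
    flattened = [p for p in paths if not fnmatchcase(p, exclude)]
    mapped = [p for p in paths if fnmatchcase(p, exclude)]
    return flattened, mapped


paths = ["catalog.title", "catalog.publication_date", "catalog.price"]
flattened, mapped = split_by_exclude(paths, "*date*")
```

Only the paths that match the pattern would need entries in the mapping, so the mapping grows with the number of excluded fields rather than with every key in the JSON.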

@reta
Collaborator

reta commented Feb 16, 2023

@reta - I am a bit confused when you say OpenSearch does not support arrays natively. Every field can be also an array in OpenSearch (https://opensearch.org/docs/2.0/opensearch/supported-field-types/index/) which is a feature supported through Lucene.

There is no dedicated array field type in OpenSearch. Instead, you can pass an array of values into any field. All values in the array must have the same field type. - taken from docs

@mingshl
Contributor

mingshl commented Feb 17, 2023

@josefschiefer27 Approach 3 does create a lot of sub-docs, but not nested docs at multiple levels. To be clear, there will be a root level and level one. That's two levels in total. But level one might have n^n sub-docs in the worst case. Yes, it will support numeric operations. That is an important point for users, but it's not a minimum requirement addressed in this issue.

It seems that you have a clear idea of how to implement approach 3. Would you like to raise a PR, or a draft PR, for approach 3?

@mingshl
Contributor

mingshl commented Feb 17, 2023

An idea to get around the problem of having only string (sub-)fields in flattened objects. We could allow users to define in the mapping a parameter to exclude certain fields from the flattening (e.g. "exclude": "*date*" for not flattening all date fields). This would allow users to have full search/aggs support for selected fields of the flattened object.

I thought about dynamically adding subfields to identify typed fields, but if adding an unlimited number of subfields is enabled (for example, millions of date and number subfields), it risks leading to mapping explosion.

There might also be a workaround to help with numeric subfields. If a user wants to use one raw field as a flat-object to ingest the entire JSON as a string, and then finds a numeric subfield and a date subfield within the JSON object, the user can cherry-pick those subfields and add new fields to update the documents with numeric or date fields. In this example, there would be three fields:

{
  "raw field": {
    "type": "flat-object"
  },
  "date field": {
    "type": "Dates"
  },
  "number field": {
    "type": "numbers"
  }
}

It might need some work, but this can be a workaround to help with typed fields while avoiding mapping explosion.

@josefschiefer27

josefschiefer27 commented Feb 17, 2023

@mingshl - I think creating lots of sub-docs for flattened objects is sub-optimal and likely creates other problems. There might be flattened objects where the number of sub-docs becomes huge, and nested queries can be expensive. In my attempt at approach 3, I tried to avoid nested docs/queries.

Meanwhile, I do believe that approach 1 with smart string encoding is probably the most promising approach. In your description of approach 1 you use two fields ('value' and 'content_and_path'). Wouldn't the 'content_and_path' field be sufficient? You mentioned catalog = 'Mike' as an example; I'm not sure when this would be needed in an OpenSearch query.

Edit: Found the answer to my question - such a query is currently supported by the 'flattened' data type.

@kotwanikunal
Member

@lukas-vlcek Reaching out since this is marked as part of the v2.7.0 roadmap. Please let me know if this isn't going to be part of the release.

@kotwanikunal kotwanikunal mentioned this issue Apr 6, 2023
23 tasks
@mingshl
Contributor

mingshl commented Apr 6, 2023

Hi @kotwanikunal, the flat-object is going into the v2.7.0 release. We are planning to merge this PR later today: #6507

tlfeng pushed a commit that referenced this issue Apr 10, 2023
To fulfill issue #1018, we implement the approach of storing the entire nested object as a String. A `flat_object` creates exactly two internal Lucene [StringField](https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/document/StringField.html) fields ("._value" and "._valueAndPath"), regardless of how many nested fields the flat field contains.

- value: a keyword field that contains the leaf values of all subfields, allowing for efficient searching of leaf values without specifying the path (e.g., catalog = 'Mike').
- valueAndPath: a keyword field that contains the path to the leaf value and its value, enabling efficient searching when the query includes the path to the leaf (e.g., catalog.author.given = 'Mike').
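
As a rough picture of what the two fields hold, the following Python sketch collects the terms for both from a nested object. The function name `flat_object_terms` is hypothetical, and the `path=value` encoding is an assumption made for illustration; the actual on-disk encoding is defined by the implementation in the PR.

```python
def flat_object_terms(field: str, value) -> tuple:
    """Return (leaf values, 'path=value' strings) for one flat object."""
    values, values_and_paths = [], []

    def walk(path: str, node) -> None:
        if isinstance(node, dict):
            for key, child in node.items():
                walk(f"{path}.{key}", child)
        elif isinstance(node, list):
            for item in node:
                walk(path, item)
        else:
            text = str(node)  # every leaf is kept as a string
            values.append(text)
            values_and_paths.append(f"{path}={text}")

    walk(field, value)
    return values, values_and_paths


catalog = {
    "title": "Lucene in Action",
    "author": {"surname": "McCandless", "given": "Mike"},
}

values, values_and_paths = flat_object_terms("catalog", catalog)
```

A query such as catalog = 'Mike' can be answered from the first list, and catalog.author.given = 'Mike' from the second, without adding any per-subfield mapping.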

Limitations and future development:
- Enable searching in Painless scripts; we will need to direct the fielddatabuilder to fetch docvalues within the two StringFields in memory.
- Open up parameter settings, such as normalizer, docValues, ignoreAbove, nullValue, similarity, and depthlimit.
- Enable wildcard queries.

(cherry picked from commit 75bb3ef)

Signed-off-by: Mingshi Liu <mingshl@amazon.com>
Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
Co-authored-by: Lukáš Vlček <lukas.vlcek@aiven.io>
@DarshitChanpura
Member

Hi @dblock, is the issue ready to be closed, since #6507 is merged?

@mingshl
Contributor

mingshl commented Apr 14, 2023

We can close this issue now. flat_object is going into 2.7, and future enhancement issues are here:
#7138
#7137
#7136

Projects
Status: Done