Add flattened field type #1018
Comments
I'd like to underline the need for an official, feature-complete, and performant alternative to the flattened datatype. It has also been requested many times on several sites already: opendistro-for-elasticsearch/opendistro-build#523, https://discuss.opendistrocommunity.dev/t/flattened-type-with-opendistro/5014/4, https://forums.aws.amazon.com/thread.jspa?threadID=32970, ... |
@aparo this link is no longer there: https://github.com/aparo/opensearch-flattened-mapper-plugin :( |
Is there any plan to support this functionality in the near future? |
I don't think anyone is working on it, cc: @anasalkouz? |
I was involved in the discussion recently on the subject [1], it would be really beneficial to have the [1] https://discuss.opendistrocommunity.dev/t/migration-via-snapshot-restore-mapping-roadblocks/7906/7 |
Hi, |
@andreaAlkalay @abhishek-v it's not on any roadmap today, but we would gladly accept a PR |
+1! |
@andreaAlkalay Can you tell us more about your business scenario and why the flattened type is important to it? Thanks. |
Let me jump in :) Flattened type solves such problems for fields that you don't want to use as nested |
@anasalkouz heya Anas -- what's the latest on this? |
(question is also for @macrakis) :) |
[Design Proposal] The flat data type in OpenSearch
Summary
JSON objects whose components are not indexed are a proposed new field type in OpenSearch; we call them flat objects, and their mapping type is flattened. Subfields within the JSON are accessible using standard dot path notation in DSL, SQL, and Painless, but are not indexed for fast lookup.
Flat subfields support exact match queries and textual sorting.
Flat fields and subfields do not support type-specific parsing, numerical operations such as numerical comparison and sorting, text analyzers, or highlighting.
Motivation
Demand
OpenDistro: Provide alternative to datatype flattened #523 (Dec 21, 2020)
OpenDistro: Migration via Snapshot restore. Mapping roadblocks (Dec 2021)
Specification
Mapping and ingestion
Searching and retrieving
Example
This declares catalog as being of type flattened:
Consider the ingestion of the following document:
Upon ingestion, this will create two indexes: one for ISBN13, and one for catalog. The surname field (e.g.) may be accessed as catalog.author1.surname. But the title, author1, surname, and given fields (and the words within them) will not be indexed.
Performance
Limitations
Flattened fields and subfields do not support type-specific operations, including parsing, numerical operations (including numerical ranges), text analyzers, or highlighting.
Possible implementation
These are some possible implementations. The implementation may use one of these or something else that satisfies the requirements.
Security
Flattened fields are treated as atomic for the purposes of security. Individual subfields cannot have different security properties than the field as a whole.
Possible enhancements
The current specification is minimal. It intentionally does not include many options offered by other vendors.
Depending on the user feedback we receive after the initial release, various enhancements are possible:
|
No issues with that! Sounds great. |
Thanks @CEHENKLE! |
I think for posterity we might want to do this not by replacing the original content, but adding Update ... and linking and copy-pasting something from below, just so it doesn't look like I did all the work on that proposal (@macrakis did ;)). |
Good point, @dblock . Will do it that way going forward. @aabukhalil Can you pick this up? |
@CEHENKLE yes I will be working on this |
Open questions:
Checklist of things to do:
|
@aabukhalil I agree, going with [1] #3545 (comment) |
@reta yes I'm asking for help regarding legal implications. |
@josefschiefer27 Approach 1 and approach 3 each have pros and cons, and I think they are both good but fundamentally different implementations that cannot be combined. You are right that the biggest pro of approach 3 is that it can support multiple types, for example numerics and dates. Its biggest con is that it creates sub-docs. If the JSON object is very complicated, the number of sub-docs can grow exponentially in the worst case (if each node has n children, summing n^k for k from 0 to n gives on the order of n^n sub-docs). That is a lot of sub-docs to consider in the worst case. Approach 1 treats everything as a string (for example, a number is stored as a string). For a very complicated JSON object this can mean a long string field to parse at mapping time, but it does not create additional docs. Values work the same way as uninterpreted keywords and sort like strings; nested fields cannot be used as dates or numbers. It is efficient in storage and mapping. Consider which use case the flat-object should fit: it is the so-called lazy man's approach. We define a field as flat-object and store it as a string to avoid mapping explosion, and it can be retrieved by exact match at the global field level or with dot path notation. If someone wants to use numeric data in a fairly simple JSON object, they can always use dynamic mapping so that each subfield gets its own field type. Can you please give a sample use case for approach 3? We would like to see how it fits in different scenarios. |
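For reference, the worst-case count mentioned in the comment above can be written out as a geometric sum. This is only a sketch under the comment's own assumption that every node of the JSON tree has n children and the tree is n levels deep (with n ≥ 2); real documents will rarely look like this.

```latex
\sum_{k=0}^{n} n^{k} \;=\; \frac{n^{\,n+1}-1}{n-1},
\qquad
n^{n} \;\le\; \frac{n^{\,n+1}-1}{n-1} \;<\; 2\,n^{n} \quad (n \ge 2)
```

For example, n = 3 gives 1 + 3 + 9 + 27 = 40, which lies between 3^3 = 27 and 2 · 3^3 = 54, so the total stays within a factor of two of n^n.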
Approach 1 does work well with most search operations (although e.g. range queries with numbers/dates get tricky), but it falls short with most aggregations (e.g. aggregations for numeric fields). In the current proposal for approach 3, I agree there would be lots of sub-docs. However, do we really have to create nested sub-docs? Couldn't we just map the fields as proposed without nested docs? Query capabilities would still be better and more flexible than with approach 1. Maybe I am missing something - let me make an example. Let's say we have this JSON (note that I added the numeric fields 'reviews' and 'price'):
{ "ISBN13" : "V9781933988177",
  "catalog" :
    { "title" : "Lucene in Action",
      "reviews": 1, "price": 11.5,
      "authors" : [
        { "surname" : "McCandless",
          "given" : "Mike" },
        { "surname" : "Hatcher",
          "given" : "Erik" }
        ...
        ...]
    }
}
If we want to flatten the field 'catalog', we could map it similarly to what is suggested in approach 3, with typed values, as follows:
[
{
"key": "catalog.title",
"value_string": "Lucene in Action"
},
{
"key": "catalog.authors.surname",
"value_string": ["McCandless", "Hatcher"]
},
{
"key": "catalog.authors.given",
"value_string": ["Mike","Erik"]
},
{
"key": "catalog.reviews",
"value_num": 1
},
{
"key": "catalog.price",
"value_num": 11.5
}
]
With this index structure for flattened fields we used 3 fields in total ('key', 'value_string', 'value_num'), and writing search and aggregation queries is fairly simple. Let's say we want to get the average price for all JSON documents; we could write the query as follows:
{
"query": {
"term": {
"key": {
"value": "catalog.price"
}
}
},
"aggs": {
"average_price": {
"avg": {
"field": "value_num"
}
}
}
}
Note that queries and aggs do have to be rewritten to work with this index structure. However, most search and aggregation functions would work with this index structure as intended. There would be fewer limitations than with approach 1. |
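The key/typed-value mapping described in the comment above can be sketched in a few lines of Python. This is only an illustration of the idea, not how OpenSearch would implement it: the function names `flatten`, `to_entries`, and `average` are hypothetical, and the rule "all numeric values go to value_num, everything else is stored as a string in value_string" is an assumption made for this sketch.

```python
from collections import defaultdict


def flatten(obj, prefix):
    """Yield (dotted_key, leaf_value) pairs; array positions are dropped from the path."""
    if isinstance(obj, dict):
        for name, child in obj.items():
            yield from flatten(child, f"{prefix}.{name}")
    elif isinstance(obj, list):
        for child in obj:
            yield from flatten(child, prefix)
    else:
        yield prefix, obj


def to_entries(field_name, obj):
    """Group leaves by key and store them under a typed value field (assumed names)."""
    grouped = defaultdict(list)
    for key, value in flatten(obj, field_name):
        grouped[key].append(value)
    entries = []
    for key, values in grouped.items():
        # bool is a subclass of int in Python, so it is excluded explicitly.
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if numeric:
            field, stored = "value_num", values
        else:
            field, stored = "value_string", [str(v) for v in values]
        entries.append({"key": key, field: stored[0] if len(stored) == 1 else stored})
    return entries


def average(entries, key):
    """Mimic the rewritten avg aggregation: filter on the key, then average value_num."""
    nums = []
    for entry in entries:
        if entry["key"] == key and "value_num" in entry:
            value = entry["value_num"]
            nums.extend(value if isinstance(value, list) else [value])
    return sum(nums) / len(nums) if nums else None


doc = {
    "ISBN13": "V9781933988177",
    "catalog": {
        "title": "Lucene in Action",
        "reviews": 1,
        "price": 11.5,
        "authors": [
            {"surname": "McCandless", "given": "Mike"},
            {"surname": "Hatcher", "given": "Erik"},
        ],
    },
}
entries = to_entries("catalog", doc["catalog"])
```

With the sample document, `entries` contains one entry per dotted key, and `average(entries, "catalog.price")` plays the role of the term filter plus avg aggregation from the query above.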
@josefschiefer27 trying to understand what would be the |
@reta - catalog is flattened. With my proposed index structure, it can be an object or array, both would work as expected (same behavior as for other fields in Opensearch). With 'as expected' I mean same result as I would get without flattening. |
@josefschiefer27 sorry should have been more precise, |
@reta - I am a bit confused when you say OpenSearch does not support arrays natively. Every field can be also an array in OpenSearch (https://opensearch.org/docs/2.0/opensearch/supported-field-types/index/) which is a feature supported through Lucene. |
If we go with approach 1, we do something very similar to what Elasticsearch does today with the 'flattened' data type. And the number one limitation and complaint from users today is the lack of support for data types besides strings. In my opinion it's very challenging to get around that limitation. It would be nice if we don't put ourselves into the same corner. Here are some Elasticsearch limitations for reference: elastic/elasticsearch#61550 - there are many hearts on this issue ;-) See also elastic/elasticsearch#43805 for possible limitations. |
I tried to create a bigger example based on @lukas-vlcek's JSON examples from above to illustrate how the mapping would work. I added date and numeric fields as well as mixed field values to make it more interesting. Below are my learnings from going through the example...
Sample Data
// --- Document 0
// The "catalog.author" is a simple text field. One numeric field.
{
"ISBN13": "V9781933988175",
"catalog": {
"title": "Java in Action",
"author": "John Doe",
"publication_score": 1023
}
}
// --- Document 1
// The "catalog.author" is an object. New date field.
{
  "ISBN13": "V9781933988177",
  "catalog": {
    "title": "Lucene in Action",
    "price": 12.5,
    "publication_date": "2010-10-10T10:10:10",
    "author": {
      "surname": "McCandless",
      "given": "Mike",
      "publication_score": 1033
    }
  }
}
// --- Document 2
// The "catalog.author" is an array with objects.
// And each object can be either a simple value field or another object with a variable "schema". The date is an invalid date value.
{
"ISBN13": "V9781933988176",
"catalog": {
"title": "Test in Action",
"publication_date": "none",
"price": 14,
"author": [
"John Doe",
{
"surname": "Smith",
"given": "Peter"
},
{
"surname": "Smith2",
"given": "Peter2"
},
{
"surename": "Green",
"first_name": "Billy"
}
]
}
}
These documents would be mapped (under the hood) into the following index structure.
// --- Mapped Document 0
// The "catalog.author" is a simple text field. One numeric field.
[
  {
    "key": "ISBN13",
    "value_string": "V9781933988175"
  },
  {
    "key": "catalog.title",
    "value_string": "Java in Action"
  },
  {
    "key": "catalog.author",
    "value_string": "John Doe"
  },
  {
    "key": "catalog.publication_score",
    "value_num": 1023
  }
]
// --- Mapped Document 1
// The "catalog.author" is an object. New date field.
[
  {
    "key": "ISBN13",
    "value_string": "V9781933988177"
  },
  {
    "key": "catalog.title",
    "value_string": "Lucene in Action"
  },
  {
    "key": "catalog.price",
    "value_num": 12.5
  },
  {
    "key": "catalog.publication_date",
    "value_date": "2010-10-10T10:10:10"
  },
  {
    "key": "catalog.author.surname",
    "value_string": "McCandless"
  },
  {
    "key": "catalog.author.given",
    "value_string": "Mike"
  },
  {
    "key": "catalog.author.publication_score",
    "value_num": 1033
  }
]
// --- Mapped Document 2
// The "catalog.author" is an array with objects.
// And each object can be either a simple value field or another object with a variable "schema". The date is an invalid date value.
[
  {
    "key": "ISBN13",
    "value_string": "V9781933988176"
  },
  {
    "key": "catalog.title",
    "value_string": "Test in Action"
  },
  {
    "key": "catalog.publication_date",
    "value_string": "none"
  },
  {
    "key": "catalog.price",
    "value_num": 14
  },
  {
    "key": "catalog.author",
    "value_string": "John Doe"
  },
  {
    "key": "catalog.author.given",
    "value_string": ["Peter", "Peter2"]
  },
  {
    "key": "catalog.author.surname",
    "value_string": ["Smith", "Smith2"]
  },
  {
    "key": "catalog.author.surename",
    "value_string": "Green"
  },
  {
    "key": "catalog.author.first_name",
    "value_string": "Billy"
  }
]
For this index mapping we used 4 Lucene fields ('key', 'value_string', 'value_num', 'value_date') to map all fields into Lucene. You can see that we can also map 'weird' JSON data which wouldn't be supported by OpenSearch without flattening.
Now the real fun starts - how can we query this json data!?!
Queries and aggregations using flattened fields need to be rewritten: any query clause and aggregation needs to use the generic value fields and requires an additional filter for the key. Let's try a query example. Let's assume we want to find all docs which contain the word 'Action' in catalog.title. Without flattening the query would be:
{
"query" : {
"wildcard" : {
"catalog.title": "*Action*"
}
}
}
To get the same result, we could try to rewrite this query as follows:
{
"query" : {
"bool": {
"filter": [
{
"term" : {
"key": "catalog.title"
}
},
{
"wildcard" : {
"value_string": "*Action*"
}
}
]
}
}
}
However, there is a big problem with this query: since we don't use nested docs/queries, it wouldn't always deliver the correct result (e.g. if a doc has a 'catalog.title' field and *Action* matches the value of some other field, we would still get a hit). I could possibly use a scripted query to validate the match; however, this wouldn't be an elegant solution anymore... It might work as discussed above by using nested docs/queries, however that might lead to a 'nested-doc' explosion. The example was a helpful exercise for me to better understand the problem. It would be nice if we could find some way to support data types beyond just strings. |
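The false-positive problem described above can be reproduced with a small Python model. It is a simplified sketch, not Lucene itself: a non-nested document is modelled as one set of values per field, so the pairing between each key and its value is lost. The helper names and the sample document (including the `catalog.series` key) are hypothetical and exist only for this illustration; `fnmatchcase` stands in for the wildcard query.

```python
from fnmatch import fnmatchcase


def index_flat(entries):
    """Collapse (key, value) entries into per-field value sets, as a single
    non-nested document would store multi-valued fields."""
    fields = {"key": set(), "value_string": set()}
    for key, value in entries:
        fields["key"].add(key)
        fields["value_string"].add(value)
    return fields


def matches_rewritten(fields, key, pattern):
    """The rewritten bool/filter query: both clauses are checked independently
    against the whole document."""
    key_hit = key in fields["key"]
    value_hit = any(fnmatchcase(v, pattern) for v in fields["value_string"])
    return key_hit and value_hit


def matches_intended(entries, key, pattern):
    """What the original query means: the value under this specific key matches."""
    return any(k == key and fnmatchcase(v, pattern) for k, v in entries)


# Hypothetical document: the title does not contain "Action", another field does.
doc = [
    ("catalog.title", "Lucene Basics"),
    ("catalog.series", "In Action Series"),
]

rewritten_hit = matches_rewritten(index_flat(doc), "catalog.title", "*Action*")
intended_hit = matches_intended(doc, "catalog.title", "*Action*")
print(rewritten_hit, intended_hit)
```

The rewritten query reports a hit because the key clause and the wildcard clause are each satisfied somewhere in the document, while the intended per-key check does not. Keeping the key and value together, either through nested docs or by encoding the path into the indexed term, avoids this.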
An idea to get around the problem of having only string (sub-)fields in flattened objects: we could allow users to define a parameter in the mapping that excludes certain fields from the flattening (e.g. "exclude": "*date*" to not flatten any date fields). This would allow users to have full search/aggs support for selected fields of the flattened object. |
There is no dedicated array field type in OpenSearch. Instead, you can pass an array of values into any field. All values in the array must have the same field type. - taken from docs |
@josefschiefer27 Approach 3 does create a lot of sub-docs, but not nested docs at multiple levels. To be clear, there will be a root level and a level one: two levels in total. But level one might have n^n sub-docs in the worst case. Yes, it will support numeric operations. That is an important point for users, but it is not a minimum requirement for this issue. It seems that you have a clear idea of how to implement approach 3; would you like to raise a PR, or a draft PR, for approach 3? |
I thought about dynamically adding subfields to identify typed fields, but allowing an unlimited number of subfields, for example millions of date and number subfields, risks a mapping explosion. There might be a workaround for numeric subfields: a user can use one raw field as a flat-object to ingest the entire JSON as a string and, on finding a numeric subfield and a date subfield within the JSON object, cherry-pick those subfields and add new fields to update the documents with numeric or date fields. In this example, it can be three fields,
It might need some work, but this can be a workaround to help with typed fields and avoid mapping explosion. |
@mingshl - I think creating lots of sub-docs for flattened objects is sub-optimal and likely creates other problems. There might be flattened objects where the number of sub-docs becomes huge, and nested queries can be expensive. In my attempt at approach 3 I tried to avoid nested docs/queries. Meanwhile, I do believe that approach 1 with smart string encoding is probably the most promising approach. In your description of approach 1 you are using two fields ('value' and 'content_and_path'). Wouldn't the 'content_and_path' field be sufficient? You mentioned catalog = 'Mike' as an example - not sure when this would be needed in an OpenSearch query. Edit: Found the answer to my question - such a query is currently supported by the 'flattened' data type. |
@lukas-vlcek Reaching out since this is marked as a part of |
Hi @kotwanikunal, the flat-object is going into the v2.7.0 release. We are planning to merge this PR later today. #6507 |
To fulfill issue #1018, we implement the approach of storing the entire nested object as a String. A `flat_object` creates exactly two internal Lucene [StringField](https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/document/StringField.html) ("._value" and "._valueAndPath"), regardless of how many nested fields the flat field contains.
- value: a keyword field that contains the leaf values of all subfields, allowing efficient searching of leaf values without specifying the path (e.g. catalog = 'Mike').
- valueAndPath: a keyword field that contains the path to each leaf value together with its value, enabling efficient searching when the query includes the path to the leaf (e.g. catalog.author.given = 'Mike').
Limitation and Future Development:
- enable searching in PainlessScript; we will need to direct the fielddatabuilder to fetch docvalues within the two stringfields in memory
- open parameter settings, such as normalizer, docValues, ignoreAbove, nullValue, similarity, and depthlimit
- enable wildcard query
(cherry picked from commit 75bb3ef)
Signed-off-by: Mingshi Liu <mingshl@amazon.com>
Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
Co-authored-by: Lukáš Vlček <lukas.vlcek@aiven.io>
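The two-field layout from the commit message can be sketched in Python as follows. This models the idea only and is not the actual OpenSearch code: the function names are hypothetical, and the `path=value` string used for the `_valueAndPath` terms is an assumed encoding chosen for this illustration.

```python
def flat_object_terms(field_name, obj):
    """Collect the terms for the two internal keyword fields of a flat object."""
    values, values_and_paths = set(), set()

    def walk(node, path):
        if isinstance(node, dict):
            for name, child in node.items():
                walk(child, f"{path}.{name}")
        elif isinstance(node, list):
            for child in node:
                walk(child, path)
        else:
            text = str(node)  # every leaf is treated as a string
            values.add(text)
            values_and_paths.add(f"{path}={text}")  # assumed encoding

    walk(obj, field_name)
    return {
        f"{field_name}._value": values,
        f"{field_name}._valueAndPath": values_and_paths,
    }


def term_match(terms, field_name, query_field, value):
    """Exact match: a query on the root field uses _value, a query on a
    dotted subfield uses _valueAndPath."""
    if query_field == field_name:
        return value in terms[f"{field_name}._value"]
    return f"{query_field}={value}" in terms[f"{field_name}._valueAndPath"]


catalog = {
    "title": "Lucene in Action",
    "author": {"surname": "McCandless", "given": "Mike"},
}
terms = flat_object_terms("catalog", catalog)

root_hit = term_match(terms, "catalog", "catalog", "Mike")
path_hit = term_match(terms, "catalog", "catalog.author.given", "Mike")
wrong_path_hit = term_match(terms, "catalog", "catalog.author.surname", "Mike")
```

Because the path is part of the indexed term, a query on `catalog.author.surname` for 'Mike' does not match even though 'Mike' occurs elsewhere in the object, and the number of Lucene fields stays at two however deep the JSON is.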
(updated from #1018 (comment) below, @macrakis)
[Design Proposal] The flat data type in OpenSearch
Summary
JSON objects whose components are not indexed are a proposed new field type in OpenSearch; we call them flat objects, and their mapping type is flattened. Subfields within the JSON are accessible using standard dot path notation in DSL, SQL, and Painless, but are not indexed for fast lookup.
Flat subfields support exact match queries and textual sorting.
Flat fields and subfields do not support type-specific parsing, numerical operations such as numerical comparison and sorting, text analyzers, or highlighting.
Motivation
Demand
OpenDistro: Provide alternative to datatype flattened #523 (Dec 21, 2020)
OpenDistro: Migration via Snapshot restore. Mapping roadblocks (Dec 2021)
Specification
Mapping and ingestion
Searching and retrieving
Example
This declares catalog as being of type flattened:
Consider the ingestion of the following document:
Upon ingestion, this will create two indexes: one for ISBN13, and one for catalog. The surname field (e.g.) may be accessed as catalog.author1.surname. But the title, author1, surname, and given fields (and the words within them) will not be indexed.
Performance
Limitations
Flattened fields and subfields do not support type-specific operations, including parsing, numerical operations (including numerical ranges), text analyzers, or highlighting.
Possible implementation
These are some possible implementations. The implementation may use one of these or something else that satisfies the requirements.
Security
Flattened fields are treated as atomic for the purposes of security. Individual subfields cannot have different security properties than the field as a whole.
Possible enhancements
The current specification is minimal. It intentionally does not include many options offered by other vendors.
Depending on the user feedback we receive after the initial release, various enhancements are possible: