-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for 'flattened object' fields. #42541
Conversation
Pinging @elastic/es-search |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like how queries and aggregations work as if fields had been mapped on their own. However, this is not the case for stored fields, which makes me wonder whether we should leave it unsupported for now.
@@ -82,6 +84,8 @@ include::types/date.asciidoc[] | |||
|
|||
include::types/date_nanos.asciidoc[] | |||
|
|||
include::types/embedded-json.asciidoc[] | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could we move it next to the object
and keyword
fields that it relates to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these includes are just alphabetized. For the actual links to individual field types, I put it under "Specialised datatypes" to encourage users to think through whether it's appropriate for their data.
- Only one field mapping is created for the whole object, which can help | ||
prevent a <<mapping-limit-settings, mappings explosion>> due to a large | ||
number of field mappings. | ||
- An embedded JSON field may take up less space in the index, as only one underlying |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we skip this one as this is not what your tests suggested? #33003 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree -- I actually have a TODO to rework this docs page, I will ping you for another look when that's done.
keywords. When sorting, this implies that values are compared lexicographically. | ||
|
||
Finally, because of the way leaf values are stored in the index, the null | ||
character `\0` is not allowed to appear in the keys of the JSON object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment might be a bit misleading since the null character is not allowed anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I had forgotten these weren't allowed by default.
==== Stored fields | ||
|
||
If the <<mapping-store,`store`>> option is enabled, the entire JSON object will | ||
be stored in pretty-printed format. It can be retrieved through the top-level |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we pretty-print?
for (int i = 0; i < field.length(); ++i) { | ||
if (field.charAt(i) == '.') { | ||
numDots++; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe use String#indexOf(String) which is an intrinsic and might make this a bit faster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -36,15 +38,24 @@ | |||
final CopyOnWriteHashMap<String, MappedFieldType> fullNameToFieldType; | |||
private final CopyOnWriteHashMap<String, String> aliasToConcreteName; | |||
|
|||
private final CopyOnWriteHashMap<String, JsonFieldMapper> fullNameToJsonMapper; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, it'd be slightly cleaner to me if we referred to the field type rather than mapper here, since the type is supposed to be about read logic while the mapper is about write logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had actually tried this, but found it cleaner to use JsonFieldMapper
here compared to the other option, where we use RootJsonFieldType
and create the KeyedJsonFieldType
objects using it. I found it nice that JsonFieldMapper
contained consistent pairs of methods (fieldType()
and keyedFieldType()
, name()
and keyedFieldName()
). To me the mapper is acting in its role as 'field type provider'.
|
||
public static final String CONTENT_TYPE = "embedded_json"; | ||
public static final NamedAnalyzer WHITESPACE_ANALYZER = new NamedAnalyzer( | ||
"whitespace", AnalyzerScope.INDEX, new WhitespaceAnalyzer()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
having an analyzer shared across indices with the INDEX scope feels wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense -- I'll move the construction of this analyzer to JsonFieldMapper.Builder#build
to match what we do in KeywordFieldMapper
.
Looking at this again, I also find the behavior around stored fields to be a bit unintuitive:
I guess the main use case would be if a user wanted to retrieve the field in a search, but using source filtering is too expensive. I would be okay leaving it unsupported for now because the design is not that clean. @colings86 and @romseygeek checking since you reviewed this earlier -- what do you think about removing support for stored fields? |
I’m fine with removing support for stored field until we have a more
intuitive way of exposing it
…On Wed, 29 May 2019 at 19:20, Julie Tibshirani ***@***.***> wrote:
However, this is not the case for stored fields, which makes me wonder
whether we should leave it unsupported for now.
Looking at this again, I also find the behavior around stored fields to be
a bit unintuitive:
- Only the root field is stored, it is not possible to load the keyed
fields through stored_fields.
- We store the whole JSON input, which differs from what is indexed
and stored in docvalues.
- We don't preserve the original formatting because we parse the JSON
block, then reconstruct it to create the stored field.
I guess the main use case would be if a user wanted to retrieve the field
in a search, but that using source filtering would be too expensive. I
would be okay leaving it unsupported for now because the design is not that
clean. @colings86 <https://github.com/colings86> and @romseygeek
<https://github.com/romseygeek> checking since you reviewed this earlier
-- what do you think about removing support for stored fields?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#42541?email_source=notifications&email_token=AABZZO5HYMRANJE7AG6QDY3PX3CORA5CNFSM4HPR3JQ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWQGWDY#issuecomment-497052431>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AABZZO5CGGGKL4LTYPPKZHLPX3CORANCNFSM4HPR3JQQ>
.
|
63dd3a8
to
a03ca94
Compare
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add tests for the supported query types. * Disallow unbounded range queries on keyed JSON fields. * Make sure MappedFieldType#hasDocValues always returns false.
* Add documentation for JSON fields.
We now track the maximum depth of any JSON field, which allows the JSON field lookup to be short-circuited as soon as that depth is reached. This helps prevent slow lookups when the user is searching over a very deep field that is not in the mappings.
When `doc_values` are enabled, we now add two `SortedSetDocValuesFields` for each token: one containing the raw `value`, and another with `key\0value`. The root JSON field uses the standard `SortedSetDVOrdinalsIndexFieldData`. For keyed fields, this PR introduces a new type ` KeyedJsonIndexFieldData` that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations like `terms`. One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.
One concern around the name `json` is that because the entire document is JSON, new users may see this field and think that they should always use it. We thought that a more verbose name like `embedded_json` would help convey that the field type has a special, non-obvious purpose. This commit updates documentation references to `embedded_json`, but leaves the `JsonField` naming in the code to avoid very long class names.
This PR updates `KeyedJsonAtomicFieldData` to always return ordinals in the range `[0, (maxOrd - minOrd)]`, which is necessary for certain aggregations and sorting options to be supported. As discussed in #41220, I opted not to support `KeyedIndexFieldData#getOrdinalMap`, as it would add substantial complexity. The one place this affects is the 'low cardinality' optimization for terms aggregations, which now needs to be disabled for keyed JSON fields. It was fairly difficult to incorporate this change, and I have a couple follow-up refactors in mind to help simplify the global ordinals code. (I will likely wait until this feature branch is merged though before opening PRs on master).
…r. (#41319) The index warmer iterates through all field types when determining the fields for which global ordinals should be loaded. Previously, keyed JSON field types were not returned from FieldTypeLookup#iterator, so their eager_global_ordinals setting would be ignored. This PR fixes the issue by including keyed JSON fields in FieldTypeLookup#iterator.
In an earlier iteration of the design, it made sense to disallow these query types on the root JSON field. It should now it be fine to allow them.
* Don't explicitly mention that '\0' is not allowed in keys. * Use String#indexOf in FieldTypeLookup#fieldDepth. * Construct the whitespace analyzer once per field mapper.
* Remove comment about saving space. * Emphasize the similarity to keyword fields. * Line wrap at 80 characters.
a03ca94
to
74ebc19
Compare
d38d12a
to
064e26f
Compare
The code refers to 'flat object' in some places for brevity.
064e26f
to
1ec900a
Compare
1ec900a
to
ad35339
Compare
@jpountz @colings86 this is ready for another look. The last commit since you reviewed is 7237b2b ('Remove the experimental tag.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from minor comments, it looks good to me.
whitespace when building a query for this field. Accepts `true` or `false` | ||
(default). | ||
|
||
<<mapping-store,`store`>>:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
didn't we remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, thanks I missed this.
index: flat_object_test | ||
body: | ||
mappings: | ||
dynamic: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is irrelevant to the test, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 removed.
* A helper class for {@link FlatObjectFieldMapper} parses a JSON object | ||
* and produces a pair of indexable fields for each leaf value. | ||
*/ | ||
public class FlatObjectFieldParser { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can it be pkg-private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -1757,5 +1757,5 @@ public void testFieldAliasesForMetaFields() throws Exception { | |||
|
|||
DocumentField field = hit.getFields().get("id-alias"); | |||
assertThat(field.getValue().toString(), equalTo("1")); | |||
} | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe undo changes to this file since they are unrelated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Previously, if multiple `embedded_json` fields were added at once, only the last one would be registered with `FieldTypeLookup`. This bug was uncovered when trying out different scenarios for performance benchmarking.
This PR pulls `FlatObjectFieldMapper` into its own `MapperPlugin`. To do so it introduces a new interface `DynamicKeyFieldMapper` with the method `keyedFieldType(String key)`, which gives the opportunity to return a special field type for a subfield.
This PR pulls the `flattened` mapper plugin into the xpack directory as its own feature.
This commit merges the `object-fields` feature branch. The new 'flattened object' field type allows an entire JSON object to be indexed into a field, and provides limited search functionality over the field's contents.
This PR merges the
object-fields
feature branch. All commits on the branch have been individually code reviewed as part of earlier PRs.Before merging, there are a few open issues to resolve:
Original issue: #25312
Meta-issue tracking design + implementation: #33003