Add support for 'flattened object' fields. #42541

jtibshirani · 2019-05-24T18:41:00Z

This PR merges the object-fields feature branch. All commits on the branch have been individually code reviewed as part of earlier PRs.

Before merging, there are a few open issues to resolve:

- The field type is currently marked 'experimental'. I've started an internal discussion to see if we can remove this tag, since the feature is useful in its current form and we don't expect huge changes in its API.
- There is some chance we want to revise the type name again -- I reopened an internal issue to ask for feedback.
- I will push a commit with some tweaks to the documentation (I've gained more insight on the field type from performance profiling and discussions with the Kibana team).

Original issue: #25312
Meta-issue tracking design + implementation: #33003

elasticmachine · 2019-05-24T18:41:02Z

Pinging @elastic/es-search

jpountz

I really like how queries and aggregations work as if fields had been mapped on their own. However, this is not the case for stored fields, which makes me wonder whether we should leave it unsupported for now.

jpountz · 2019-05-29T08:05:08Z

docs/reference/mapping/types.asciidoc

@@ -82,6 +84,8 @@ include::types/date.asciidoc[]

 include::types/date_nanos.asciidoc[]

+include::types/embedded-json.asciidoc[]
+


nit: could we move it next to the object and keyword fields that it relates to?

I think these includes are just alphabetized. For the actual links to individual field types, I put it under "Specialised datatypes" to encourage users to think through whether it's appropriate for their data.

jpountz · 2019-05-29T08:08:05Z

docs/reference/mapping/types/embedded-json.asciidoc

+- Only one field mapping is created for the whole object, which can help
+  prevent a <<mapping-limit-settings, mappings explosion>> due to a large
+  number of field mappings.
+- An embedded JSON field may take up less space in the index, as only one underlying


Should we skip this one as this is not what your tests suggested? #33003 (comment)

I agree -- I actually have a TODO to rework this docs page, I will ping you for another look when that's done.

jpountz · 2019-05-29T08:11:02Z

docs/reference/mapping/types/embedded-json.asciidoc

+keywords. When sorting, this implies that values are compared lexicographically.
+
+Finally, because of the way leaf values are stored in the index, the null
+character `\0` is not allowed to appear in the keys of the JSON object.


This comment might be a bit misleading since the null character is not allowed anyway?

Thanks, I had forgotten these weren't allowed by default.

jpountz · 2019-05-29T08:11:21Z

docs/reference/mapping/types/embedded-json.asciidoc

+==== Stored fields
+
+If the <<mapping-store,`store`>> option is enabled, the entire JSON object will
+be stored in pretty-printed format. It can be retrieved through the top-level


Why do we pretty-print?

jpountz · 2019-05-29T08:18:35Z

server/src/main/java/org/elasticsearch/index/mapper/FieldTypeLookup.java

+        for (int i = 0; i < field.length(); ++i) {
+            if (field.charAt(i) == '.') {
+                numDots++;
+            }


nit: maybe use String#indexOf(String) which is an intrinsic and might make this a bit faster
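
A minimal sketch of what this suggestion could look like, assuming a standalone helper. The class and method names are hypothetical and not taken from the PR, and the sketch uses the `indexOf(int, int)` overload rather than `indexOf(String)`. Instead of inspecting every character, the loop jumps from one match to the next:

```java
/** Hypothetical helper, for illustration only; not the code from the PR. */
final class DotCounter {
    private DotCounter() {}

    /** Counts '.' characters by jumping between matches with String#indexOf. */
    static int countDots(String field) {
        int numDots = 0;
        int dotIndex = field.indexOf('.');
        while (dotIndex >= 0) {
            numDots++;
            // Resume the search just past the previous match.
            dotIndex = field.indexOf('.', dotIndex + 1);
        }
        return numDots;
    }
}
```

Whether this is measurably faster than the character loop depends on the JIT and the input, so it would need to be benchmarked.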

jpountz · 2019-05-29T08:26:02Z

server/src/main/java/org/elasticsearch/index/mapper/FieldTypeLookup.java

@@ -36,15 +38,24 @@
    final CopyOnWriteHashMap<String, MappedFieldType> fullNameToFieldType;
    private final CopyOnWriteHashMap<String, String> aliasToConcreteName;

+    private final CopyOnWriteHashMap<String, JsonFieldMapper> fullNameToJsonMapper;


nit, it'd be slightly cleaner to me if we referred to the field type rather than mapper here, since the type is supposed to be about read logic while the mapper is about write logic

I had actually tried this, but found it cleaner to use JsonFieldMapper here compared to the other option, where we use RootJsonFieldType and create the KeyedJsonFieldType objects using it. I found it nice that JsonFieldMapper contained consistent pairs of methods (fieldType() and keyedFieldType(), name() and keyedFieldName()). To me the mapper is acting in its role as 'field type provider'.

jpountz · 2019-05-29T08:52:38Z

server/src/main/java/org/elasticsearch/index/mapper/JsonFieldMapper.java

+
+    public static final String CONTENT_TYPE = "embedded_json";
+    public static final NamedAnalyzer WHITESPACE_ANALYZER = new NamedAnalyzer(
+        "whitespace", AnalyzerScope.INDEX, new WhitespaceAnalyzer());


having an analyzer shared across indices with the INDEX scope feels wrong

That makes sense -- I'll move the construction of this analyzer to JsonFieldMapper.Builder#build to match what we do in KeywordFieldMapper.

jtibshirani · 2019-05-29T18:20:15Z

However, this is not the case for stored fields, which makes me wonder whether we should leave it unsupported for now.

Looking at this again, I also find the behavior around stored fields to be a bit unintuitive:

- Only the root field is stored, it is not possible to load the keyed fields through stored_fields.
- We store the whole JSON input, which differs from what is indexed and stored in docvalues.
- We don't preserve the original formatting because we parse the JSON block, then reconstruct it to create the stored field.

I guess the main use case would be if a user wanted to retrieve the field in a search, but using source filtering is too expensive. I would be okay leaving it unsupported for now because the design is not that clean. @colings86 and @romseygeek checking since you reviewed this earlier -- what do you think about removing support for stored fields?

colings86 · 2019-05-29T18:28:50Z

I’m fine with removing support for stored fields until we have a more intuitive way of exposing it.


* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.

We now track the maximum depth of any JSON field, which allows the JSON field lookup to be short-circuited as soon as that depth is reached. This helps prevent slow lookups when the user is searching over a very deep field that is not in the mappings.
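
The short-circuit described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the `FieldTypeLookup` code: the class `DepthBoundedLookup`, its methods, and the use of plain strings for field types are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical lookup that stops probing prefixes once the maximum root depth is exceeded. */
final class DepthBoundedLookup {
    private final Map<String, String> rootFieldTypes = new HashMap<>();
    private int maxDepth = 0; // deepest registered root field, counted in path segments

    void register(String rootField, String typeName) {
        rootFieldTypes.put(rootField, typeName);
        maxDepth = Math.max(maxDepth, depth(rootField));
    }

    /** Returns the registered root that is a proper dotted prefix of fullName, or null. */
    String findRoot(String fullName) {
        int dotIndex = -1;
        int depth = 0;
        while (true) {
            if (++depth > maxDepth) {
                return null; // no registered root is this deep, so stop early
            }
            dotIndex = fullName.indexOf('.', dotIndex + 1);
            if (dotIndex < 0) {
                return null; // no further prefixes to try
            }
            String prefix = fullName.substring(0, dotIndex);
            if (rootFieldTypes.containsKey(prefix)) {
                return prefix;
            }
        }
    }

    private static int depth(String field) {
        int segments = 1;
        for (int i = field.indexOf('.'); i >= 0; i = field.indexOf('.', i + 1)) {
            segments++;
        }
        return segments;
    }
}
```

With only a depth-1 root registered, a lookup for a long path such as `a.b.c.d.e` checks a single prefix and then gives up, rather than testing every prefix of the path.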

When `doc_values` are enabled, we now add two `SortedSetDocValuesFields` for each token: one containing the raw `value`, and another with `key\0value`. The root JSON field uses the standard `SortedSetDVOrdinalsIndexFieldData`. For keyed fields, this PR introduces a new type `KeyedJsonIndexFieldData` that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations like `terms`.

One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.
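
To illustrate the `key\0value` encoding and the prefix filtering, here is a small self-contained sketch. It does not use Lucene or the field data classes named above; `KeyedValues` and its methods are hypothetical, and a `SortedSet<String>` stands in for the sorted doc values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

/** Hypothetical stand-in for keyed doc values; illustrates the encoding only. */
final class KeyedValues {
    static final char SEPARATOR = '\0';

    private KeyedValues() {}

    /** Encodes a leaf value together with its key, as in "key\0value". */
    static String keyedValue(String key, String value) {
        return key + SEPARATOR + value;
    }

    /** Returns the values stored under the given key, in sorted order. */
    static List<String> valuesForKey(SortedSet<String> keyedValues, String key) {
        String prefix = key + SEPARATOR;
        List<String> result = new ArrayList<>();
        // Entries sharing a prefix are contiguous in sorted order, so start at
        // the first candidate and stop at the first entry that does not match.
        for (String keyed : keyedValues.tailSet(prefix)) {
            if (!keyed.startsWith(prefix)) {
                break;
            }
            result.add(keyed.substring(prefix.length()));
        }
        return result;
    }
}
```

Because the separator sorts before any other character, all values for one key form a single contiguous run, which is what makes filtering by prefix cheap.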

One concern around the name `json` is that because the entire document is JSON, new users may see this field and think that they should always use it. We thought that a more verbose name like `embedded_json` would help convey that the field type has a special, non-obvious purpose. This commit updates documentation references to `embedded_json`, but leaves the `JsonField` naming in the code to avoid very long class names.

This PR updates `KeyedJsonAtomicFieldData` to always return ordinals in the range `[0, (maxOrd - minOrd)]`, which is necessary for certain aggregations and sorting options to be supported. As discussed in #41220, I opted not to support `KeyedIndexFieldData#getOrdinalMap`, as it would add substantial complexity. The one place this affects is the 'low cardinality' optimization for terms aggregations, which now needs to be disabled for keyed JSON fields. It was fairly difficult to incorporate this change, and I have a couple of follow-up refactors in mind to help simplify the global ordinals code. (I will likely wait until this feature branch is merged, though, before opening PRs on master.)
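
The ordinal remapping can be pictured with the sketch below. It assumes the terms are available as a sorted list whose indices act as ordinals. `KeyedOrdinalRange` and its methods are hypothetical and much simpler than the actual field data classes.

```java
import java.util.List;

/** Hypothetical helper: remaps the ordinals for one key into a zero-based range. */
final class KeyedOrdinalRange {
    private final int minOrd; // first ordinal whose term starts with key + '\0'
    private final int endOrd; // one past the last such ordinal (maxOrd + 1)

    KeyedOrdinalRange(List<String> sortedTerms, String key) {
        this.minOrd = lowerBound(sortedTerms, key + '\0');
        // key + '\u0001' sorts after every term that starts with key + '\0'.
        this.endOrd = lowerBound(sortedTerms, key + '\u0001');
    }

    /** Number of distinct values under this key. */
    int valueCount() {
        return endOrd - minOrd;
    }

    /** Maps an ordinal of the underlying data into [0, maxOrd - minOrd]. */
    int toKeyedOrd(int ord) {
        if (ord < minOrd || ord >= endOrd) {
            throw new IllegalArgumentException("ordinal " + ord + " does not belong to this key");
        }
        return ord - minOrd;
    }

    /** Index of the first term that is >= target (binary search). */
    private static int lowerBound(List<String> sortedTerms, String target) {
        int lo = 0;
        int hi = sortedTerms.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedTerms.get(mid).compareTo(target) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}
```

Consumers that size arrays by value count or assume ordinals start at zero can then treat the keyed field like an ordinary keyword field.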

…r. (#41319) The index warmer iterates through all field types when determining the fields for which global ordinals should be loaded. Previously, keyed JSON field types were not returned from FieldTypeLookup#iterator, so their eager_global_ordinals setting would be ignored. This PR fixes the issue by including keyed JSON fields in FieldTypeLookup#iterator.

jtibshirani · 2019-06-07T22:09:05Z

@jpountz @colings86 this is ready for another look. The last commit since you reviewed is 7237b2b ('Remove the experimental tag.')

jpountz

Apart from minor comments, it looks good to me.

jpountz · 2019-06-11T12:38:03Z

docs/reference/mapping/types/flattened.asciidoc

+    whitespace when building a query for this field. Accepts `true` or `false`
+    (default).
+
+<<mapping-store,`store`>>::


didn't we remove it?

Oops, thanks I missed this.

jpountz · 2019-06-11T12:42:11Z

rest-api-spec/src/main/resources/rest-api-spec/test/search/160_exists_query.yml

+        index:  flat_object_test
+        body:
+          mappings:
+            dynamic: false


this is irrelevant to the test, isn't it?

👍 removed.

jpountz · 2019-06-11T13:15:54Z

server/src/main/java/org/elasticsearch/index/mapper/FlatObjectFieldParser.java

+ * A helper class for {@link FlatObjectFieldMapper} parses a JSON object
+ * and produces a pair of indexable fields for each leaf value.
+ */
+public class FlatObjectFieldParser {


can it be pkg-private?

jpountz · 2019-06-11T13:19:59Z

server/src/test/java/org/elasticsearch/search/query/SearchQueryIT.java

@@ -1757,5 +1757,5 @@ public void testFieldAliasesForMetaFields() throws Exception {

        DocumentField field = hit.getFields().get("id-alias");
        assertThat(field.getValue().toString(), equalTo("1"));
-   }
+    }


maybe undo changes to this file since they are unrelated

This PR pulls `FlatObjectFieldMapper` into its own `MapperPlugin`. To do so it introduces a new interface `DynamicKeyFieldMapper` with the method `keyedFieldType(String key)`, which gives the opportunity to return a special field type for a subfield.

This commit merges the `object-fields` feature branch. The new 'flattened object' field type allows an entire JSON object to be indexed into a field, and provides limited search functionality over the field's contents.

jtibshirani added >feature, :Search Foundations/Mapping, v8.0.0, v7.3.0 labels May 24, 2019

jpountz reviewed May 29, 2019

jtibshirani force-pushed the object-fields branch from 63dd3a8 to a03ca94 Compare May 29, 2019 23:18

jtibshirani added 18 commits May 29, 2019 16:19

When parsing JSON fields, also create tokens prefixed with the field key. (#34207)

09b68e7

Add support for querying JSON fields based on key. (#34621)

9eb4bd1

Add support for storing JSON fields. (#34942)

7daf406

Enforce a limit on the depth of the JSON object. (#35063)

624d9ca

Disallow doc_values in the JSON field mapping. (#35282)

133f554

Make sure stored JSON fields are properly decoded. (#35279)

9183cbd

Add tests for the supported query types. (#35319)

4507b85

* Add tests for the supported query types. * Disallow unbounded range queries on keyed JSON fields. * Make sure MappedFieldType#hasDocValues always returns false.

Add documentation for JSON fields. (#35281)

fbadb62

* Add documentation for JSON fields.

Add a test around JSON fields and source filtering. (#35399)

d0ec58a

Rename embedded_json.asciidoc for consistency.

78da6c5

Allow non-exact query types on the root JSON field. (#41290)

bf5c457

In an earlier iteration of the design, it made sense to disallow these query types on the root JSON field. It should now be fine to allow them.

Remove the 'experimental' tag.

7237b2b

jtibshirani changed the title ~~Add support for embedded JSON ('queryable object') fields.~~ Add support for embedded JSON fields. May 29, 2019

jtibshirani added 2 commits May 29, 2019 16:31

Address code review feedback.

31d3a58

* Don't explicitly mention that '\0' is not allowed in keys. * Use String#indexOf in FieldTypeLookup#fieldDepth. * Construct the whitespace analyzer once per field mapper.

Remove support for stored fields.

c9420a3

Improve the field type documentation.

74ebc19

* Remove comment about saving space. * Emphasize the similarity to keyword fields. * Line wrap at 80 characters.

jtibshirani force-pushed the object-fields branch from a03ca94 to 74ebc19 Compare May 29, 2019 23:34

jtibshirani changed the title ~~Add support for embedded JSON fields.~~ Add support for 'flattened object' fields. Jun 7, 2019

jtibshirani force-pushed the object-fields branch from d38d12a to 064e26f Compare June 7, 2019 05:52

jtibshirani added 2 commits June 7, 2019 10:51

Rename 'embedded JSON' to 'flattened object'.

9cefddf

The code refers to 'flat object' in some places for brevity.

Merge remote-tracking branch 'upstream/master' into object-fields

58533cf

jtibshirani force-pushed the object-fields branch from 064e26f to 1ec900a Compare June 7, 2019 19:32

Fix full text queries, which were broken after the merge.

ad35339

jtibshirani force-pushed the object-fields branch from 1ec900a to ad35339 Compare June 7, 2019 19:43

jpountz approved these changes Jun 11, 2019

jtibshirani added 9 commits June 11, 2019 12:11

Address code review feedback and remove leftover references to 'json'.

8241627

Merge remote-tracking branch 'upstream/master' into object-fields

e47467b

Ensure all JSON field types are available for lookup. (#41914)

daa200b

Previously, if multiple `embedded_json` fields were added at once, only the last one would be registered with `FieldTypeLookup`. This bug was uncovered when trying out different scenarios for performance benchmarking.

Merge remote-tracking branch 'upstream/master' into object-fields

c995f68

Merge remote-tracking branch 'upstream/master' into object-fields

2c18194

License the flattened mapper as basic. (#43690)

6f33b65

This PR pulls the `flattened` mapper plugin into the xpack directory as its own feature.

Merge remote-tracking branch 'upstream/master' into object-fields

ac0a80b

Merge remote-tracking branch 'upstream/master' into object-fields

c344200

jtibshirani merged commit f3317eb into master Jun 28, 2019

jtibshirani deleted the object-fields branch June 28, 2019 12:33

This was referenced Jun 28, 2019

Add support for 'flattened object' fields. #43762

Merged

Flattened object fields design + implementation #33003

Closed

Queryable object fields #25312

Closed

Mpdreamz mentioned this pull request Aug 7, 2019

[meta] 7.3 Release elastic/elasticsearch-net#4001

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
