Add a simple JSON field mapper. #33923

jtibshirani · 2018-09-20T20:50:46Z

Currently the mapper extracts all leaf values of the JSON object, converts them to their text representations, and indexes each one as a keyword. I plan to add support for key-value pairs in a subsequent PR. As discussed in #33795, we've decided to try an approach where we tokenize the JSON outside of lucene analysis.

Because of the parallel between this data type and keyword, I included the relevant options ignore_above, null_value, and split_queries_on_whitespace.

Lastly, it would be great to get feedback on the naming: I settled on json, but am very happy for other suggestions. I wasn't a huge fan of names like key_value, map, or dict, because they might suggest the input should be a flat list of key-value pairs, as opposed to a general JSON blob. Certainly object would be a nice choice, but it conflicts with the existing type.

elasticmachine · 2018-09-20T20:50:48Z

Pinging @elastic/es-search-aggs

jtibshirani · 2018-09-20T21:09:46Z

server/src/main/java/org/elasticsearch/index/mapper/JsonFieldMapper.java

+        }
+
+        @Override
+        public Query fuzzyQuery(Object value, Fuzziness fuzziness, int prefixLength, int maxExpansions,


I'm disallowing these more advanced queries for now, but will look into supporting them in a follow-up (tracked on the meta-issue).

In addition to fuzzy and regexp, I would also like to disallow wildcard. However, WildcardQueryBuilder doesn't delegate to a method on MappedFieldType. This means we attempt to create wildcard queries on non-text or keyword fields, and usually fail because the value can't be parsed. Do you know if there is a reason for this set-up, or if it's just an oversight that would be good to address?

I'm not sure regarding why wildcard is different. I suspect its just because we haven't before needed for change the behaviour of wildcard queries based on the field type. I can't see any reason not to change it so we can control the behaviour here though. If we do make the change it might be best to make it directly on master and then pull the change into this branch to disallow it as the change may become difficult to maintain in the feature branch

jtibshirani · 2018-09-20T21:13:09Z

server/src/main/java/org/elasticsearch/index/mapper/JsonFieldParser.java

+
+    private void addField(String value, List<IndexableField> fields) {
+        if (value.length() <= ignoreAbove) {
+            fields.add(new Field(fieldType.name(), new BytesRef(value), fieldType));


For now I opted not to index a special stored field with the whole JSON blob (as we discussed in #33795 (comment)), but it's something I'm keeping in mind.

colings86

@jtibshirani I left some comments but I like where its going.

On naming: I'm not crazy about json as the name of the field type, mostly because our documents are all JSON anyway so I'm worried it will lead to some confusion and users will have the wrong expectations from this new field type. That said, I don't currently have a great alternative either so its something we can discuss and should not hold up this PR since renaming the field type should be fairly simple change later.

server/src/main/java/org/elasticsearch/index/mapper/JsonFieldMapper.java

colings86 · 2018-09-25T10:56:55Z

server/src/main/java/org/elasticsearch/index/mapper/JsonFieldMapper.java

+        }
+
+        public Builder copyTo(CopyTo copyTo) {
+            throw new UnsupportedOperationException("[copy_to] is not currently supported for ["


nit: I'd remove currently here since we don't yet know for sure if we are going to be able to implement copy_to or if we'll have to have a different mechanism.

Makes sense.

colings86 · 2018-09-25T11:01:13Z

server/src/main/java/org/elasticsearch/index/mapper/JsonFieldMapper.java

+        }
+
+        @Override
+        public Query fuzzyQuery(Object value, Fuzziness fuzziness, int prefixLength, int maxExpansions,


I'm not sure regarding why wildcard is different. I suspect its just because we haven't before needed for change the behaviour of wildcard queries based on the field type. I can't see any reason not to change it so we can control the behaviour here though. If we do make the change it might be best to make it directly on master and then pull the change into this branch to disallow it as the change may become difficult to maintain in the feature branch

colings86 · 2018-09-25T11:39:17Z

server/src/main/java/org/elasticsearch/index/mapper/JsonFieldMapper.java

+            return;
+        }
+
+        if (fieldType().indexOptions() != IndexOptions.NONE || fieldType().stored()) {


I think the stored check here is obsolete since we aren't adding an option for a stored field as per https://github.com/elastic/elasticsearch/pull/33923/files#r219319713 below? We should probably also throw an exception if store(boolean) is called on the Builder?

Also I wonder if its worth even allowing the indexOptions to be set? Given you can't turn on term vectors or positions the only option would be to turn off indexing altogether which I don't think makes sense since we aren't allowing stored fields? Whats left would be the user wanting to ignore the field completely which is already covered by the object type's enabled" false setting?

The lucene fields we're adding are still stored if store: true, it's just that each value is added as a separate field (as opposed to adding a single stored field containing the whole JSON blob, as I was brainstorming). I think it'd be nice to see how this naive approach looks on the feature branch, so we can make an informed choice about whether we should switch to a single stored field.

For the index_options question, I guess the user could still decide between docs and freqs. I was mostly aiming for consistency with the keyword-like fields here. Is your thought that frequency information doesn't really make sense in this context either?

it's just that each value is added as a separate field (as opposed to adding a single stored field containing the whole JSON blob, as I was brainstorming).

I think I'm still missed something here so maybe we can clear up my confusions with an example? Say we have a document as follows and json_field is a field of this new type:

{ "json_field": { "a": "foo", "b": { "c": "bar" } } }

Are you saying that if store: true we will create two stored fields in Lucene, one for json_field.a containing the value foo and one for json_field.b.c contain the value bar rather than crating a single json_field stored field containing the following?

{ "a": "foo", "b": { "c": "bar" } }

For the index_options question, I guess the user could still decide between docs and freqs.

True

I was mostly aiming for consistency with the keyword-like fields here. Is your thought that frequency information doesn't really make sense in this context either?

I guess the freqs would make sense for a search that is looking for e.g. "json_field": "foo" (i.e. "foo") anywhere in the JSON blob but it wouldn't really make sense for a search for e.g. "json_field.a": "foo" since we wouldn't have the frequencies for the inner field alone. I haven't thought about this too much but maybe for consistency between the two cases (did we say decide we were going to support both cases still?) it would be easier to explain without freqs?

Your understanding of the current behavior for stored fields is correct -- in this PR, a separate stored field is created for each leaf value. I think that later we should switch to the approach where we create a single stored field with the whole JSON blob (and have made a note of this on the meta-issue), but I was curious to play around with this first more naive approach. I'm also okay disallowing stored fields for now until we address the issue in a dedicated PR.

it wouldn't really make sense for a search for e.g. "json_field.a": "foo" since we wouldn't have the frequencies for the inner field alone

I think we would have these frequencies, because we'd calculate the frequency of the term a\0foo (where \0 is a separator character)? And you're right, we are supporting both of those cases.

I think we would have these frequencies, because we'd calculate the frequency of the term a\0foo (where \0 is a separator character)?

Good point. In which case I agree that index options do make sense.

After catching up with @romseygeek and @colings86 offline, I decided to disable stored fields in this PR to cut down on confusion.

colings86 · 2018-09-25T11:40:31Z

server/src/main/java/org/elasticsearch/index/mapper/JsonFieldMapper.java

+
+        if (fieldType().indexOptions() != IndexOptions.NONE || fieldType().stored()) {
+            fields.addAll(fieldParser.parse(context.parser()));
+            if (fieldType.omitNorms()) {


Do norms make sense for this type of field? given we are searching within the object the length of the whole object being taken into account when searching might not be that interesting?

Right, I think they would be misleading unless we had some way to filter them based on a particular key. I also suspect it's not common to use norms with keyword-like fields. I'll disallow norms for now, and potentially revisit if we have other ideas.

jtibshirani · 2018-09-25T17:48:25Z

Thanks @colings86 for the review! Sounds like we can mull over the naming a bit. I'll plan to make wildcard query creation more consistent in a separate PR.

An update: just merged #34062, which delegates wildcard creation to MappedFieldType.

jtibshirani · 2018-09-27T16:39:18Z

@colings86 I think I've addressed your initial round of comments.

romseygeek

I left a couple of questions, but other than that LGTM

romseygeek · 2018-09-28T16:02:51Z

server/src/main/java/org/elasticsearch/index/mapper/JsonFieldParser.java

+        int openObjects = 1;
+
+        while (true) {
+            if (openObjects == 0) {


What happens if we're given malformed JSON? If it's missing a closing brace, will this just loop forever?

I had been wondering the same thing and have a test case for this: https://github.com/elastic/elasticsearch/pull/33923/files#diff-c8a7654deccf33995eeafb732768f18bR233

XContentParser will throw if it encounters an end token (or another non-sensical token) when it is expecting a closing brace.

romseygeek · 2018-09-28T16:05:25Z

server/src/main/java/org/elasticsearch/index/mapper/JsonFieldParser.java

+                String value = fieldType.nullValueAsString();
+                if (value != null) {
+                    addField(value, fields);
+                }


I'm guessing that storing the path to the leaf is something you want to do as a followup?

Yes exactly, that's coming in the next PR.

Currently the mapper extracts all leaf values of the JSON object, converts them to their text representations, and indexes each one as a keyword.

* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.

jtibshirani added >feature :Search Foundations/Mapping Index mappings, including merging and defining field types labels Sep 20, 2018

jtibshirani changed the base branch from master to object-fields September 20, 2018 20:50

jtibshirani commented Sep 20, 2018

View reviewed changes

jtibshirani force-pushed the json-field-mapper branch 3 times, most recently from c3ff13f to 2afe161 Compare September 20, 2018 21:42

jtibshirani requested a review from romseygeek September 21, 2018 06:46

jtibshirani force-pushed the json-field-mapper branch 2 times, most recently from 6fc57d5 to 8773b4f Compare September 24, 2018 16:12

jtibshirani requested a review from colings86 September 24, 2018 20:24

jtibshirani force-pushed the json-field-mapper branch from 8773b4f to 070b6b5 Compare September 25, 2018 02:31

colings86 reviewed Sep 25, 2018

View reviewed changes

jtibshirani mentioned this pull request Sep 25, 2018

Flattened object fields design + implementation #33003

Closed

11 tasks

jtibshirani force-pushed the json-field-mapper branch 3 times, most recently from e098b08 to cb72aca Compare September 26, 2018 19:44

romseygeek approved these changes Sep 28, 2018

View reviewed changes

jtibshirani added 8 commits September 28, 2018 09:13

Add a simple JSON field type.

768a5d7

Currently the mapper extracts all leaf values of the JSON object, converts them to their text representations, and indexes each one as a keyword.

Add support for ignore_above.

09d0301

Add support for null_value.

6d74f54

Add support for split_queries_on_whitespace.

6d58cca

Prevent norms from being enabled.

236d568

Clarify the message around copy_to not being supported.

26884b6

Disallow wildcard queries.

50e6439

For now, disallow the field from being stored.

bef3f0e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a simple JSON field mapper. #33923

Add a simple JSON field mapper. #33923

jtibshirani commented Sep 20, 2018 •

edited

Loading

elasticmachine commented Sep 20, 2018

jtibshirani Sep 20, 2018

colings86 Sep 25, 2018

jtibshirani Sep 20, 2018 •

edited

Loading

colings86 left a comment

colings86 Sep 25, 2018

jtibshirani Sep 25, 2018

colings86 Sep 25, 2018

colings86 Sep 25, 2018

jtibshirani Sep 25, 2018 •

edited

Loading

colings86 Sep 26, 2018

jtibshirani Sep 26, 2018 •

edited

Loading

colings86 Sep 27, 2018

jtibshirani Sep 28, 2018

colings86 Sep 25, 2018

jtibshirani Sep 25, 2018 •

edited

Loading

jtibshirani commented Sep 25, 2018 •

edited

Loading

jtibshirani commented Sep 27, 2018

romseygeek left a comment

romseygeek Sep 28, 2018

jtibshirani Sep 28, 2018

romseygeek Sep 28, 2018

jtibshirani Sep 28, 2018

Add a simple JSON field mapper. #33923

Add a simple JSON field mapper. #33923

Conversation

jtibshirani commented Sep 20, 2018 • edited Loading

elasticmachine commented Sep 20, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jtibshirani Sep 20, 2018 • edited Loading

Choose a reason for hiding this comment

colings86 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jtibshirani Sep 25, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jtibshirani Sep 26, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jtibshirani Sep 25, 2018 • edited Loading

Choose a reason for hiding this comment

jtibshirani commented Sep 25, 2018 • edited Loading

jtibshirani commented Sep 27, 2018

romseygeek left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jtibshirani commented Sep 20, 2018 •

edited

Loading

jtibshirani Sep 20, 2018 •

edited

Loading

jtibshirani Sep 25, 2018 •

edited

Loading

jtibshirani Sep 26, 2018 •

edited

Loading

jtibshirani Sep 25, 2018 •

edited

Loading

jtibshirani commented Sep 25, 2018 •

edited

Loading