Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a simple JSON field mapper. #33923

Merged
merged 8 commits into from
Sep 28, 2018

Conversation

jtibshirani
Copy link
Contributor

@jtibshirani jtibshirani commented Sep 20, 2018

Currently the mapper extracts all leaf values of the JSON object, converts them to their text representations, and indexes each one as a keyword. I plan to add support for key-value pairs in a subsequent PR. As discussed in #33795, we've decided to try an approach where we tokenize the JSON outside of lucene analysis.

Because of the parallel between this data type and keyword, I included the relevant options ignore_above, null_value, and split_queries_on_whitespace.

Lastly, it would be great to get feedback on the naming: I settled on json, but am very happy for other suggestions. I wasn't a huge fan of names like key_value, map, or dict, because they might suggest the input should be a flat list of key-value pairs, as opposed to a general JSON blob. Certainly object would be a nice choice, but it conflicts with the existing type.

@jtibshirani jtibshirani added >feature :Search Foundations/Mapping Index mappings, including merging and defining field types labels Sep 20, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

@jtibshirani jtibshirani changed the base branch from master to object-fields September 20, 2018 20:50
}

@Override
public Query fuzzyQuery(Object value, Fuzziness fuzziness, int prefixLength, int maxExpansions,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm disallowing these more advanced queries for now, but will look into supporting them in a follow-up (tracked on the meta-issue).

In addition to fuzzy and regexp, I would also like to disallow wildcard. However, WildcardQueryBuilder doesn't delegate to a method on MappedFieldType. This means we attempt to create wildcard queries on non-text or keyword fields, and usually fail because the value can't be parsed. Do you know if there is a reason for this set-up, or if it's just an oversight that would be good to address?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure regarding why wildcard is different. I suspect its just because we haven't before needed for change the behaviour of wildcard queries based on the field type. I can't see any reason not to change it so we can control the behaviour here though. If we do make the change it might be best to make it directly on master and then pull the change into this branch to disallow it as the change may become difficult to maintain in the feature branch


private void addField(String value, List<IndexableField> fields) {
if (value.length() <= ignoreAbove) {
fields.add(new Field(fieldType.name(), new BytesRef(value), fieldType));
Copy link
Contributor Author

@jtibshirani jtibshirani Sep 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I opted not to index a special stored field with the whole JSON blob (as we discussed in #33795 (comment)), but it's something I'm keeping in mind.

@jtibshirani jtibshirani force-pushed the json-field-mapper branch 3 times, most recently from c3ff13f to 2afe161 Compare September 20, 2018 21:42
@jtibshirani jtibshirani force-pushed the json-field-mapper branch 2 times, most recently from 6fc57d5 to 8773b4f Compare September 24, 2018 16:12
Copy link
Contributor

@colings86 colings86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtibshirani I left some comments but I like where its going.

On naming: I'm not crazy about json as the name of the field type, mostly because our documents are all JSON anyway so I'm worried it will lead to some confusion and users will have the wrong expectations from this new field type. That said, I don't currently have a great alternative either so its something we can discuss and should not hold up this PR since renaming the field type should be fairly simple change later.

}

public Builder copyTo(CopyTo copyTo) {
throw new UnsupportedOperationException("[copy_to] is not currently supported for ["
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd remove currently here since we don't yet know for sure if we are going to be able to implement copy_to or if we'll have to have a different mechanism.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

}

@Override
public Query fuzzyQuery(Object value, Fuzziness fuzziness, int prefixLength, int maxExpansions,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure regarding why wildcard is different. I suspect its just because we haven't before needed for change the behaviour of wildcard queries based on the field type. I can't see any reason not to change it so we can control the behaviour here though. If we do make the change it might be best to make it directly on master and then pull the change into this branch to disallow it as the change may become difficult to maintain in the feature branch

return;
}

if (fieldType().indexOptions() != IndexOptions.NONE || fieldType().stored()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the stored check here is obsolete since we aren't adding an option for a stored field as per https://github.com/elastic/elasticsearch/pull/33923/files#r219319713 below? We should probably also throw an exception if store(boolean) is called on the Builder?

Also I wonder if its worth even allowing the indexOptions to be set? Given you can't turn on term vectors or positions the only option would be to turn off indexing altogether which I don't think makes sense since we aren't allowing stored fields? Whats left would be the user wanting to ignore the field completely which is already covered by the object type's enabled" false setting?

Copy link
Contributor Author

@jtibshirani jtibshirani Sep 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lucene fields we're adding are still stored if store: true, it's just that each value is added as a separate field (as opposed to adding a single stored field containing the whole JSON blob, as I was brainstorming). I think it'd be nice to see how this naive approach looks on the feature branch, so we can make an informed choice about whether we should switch to a single stored field.

For the index_options question, I guess the user could still decide between docs and freqs. I was mostly aiming for consistency with the keyword-like fields here. Is your thought that frequency information doesn't really make sense in this context either?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's just that each value is added as a separate field (as opposed to adding a single stored field containing the whole JSON blob, as I was brainstorming).

I think I'm still missed something here so maybe we can clear up my confusions with an example? Say we have a document as follows and json_field is a field of this new type:

{
    "json_field": {
        "a": "foo",
        "b": {
            "c": "bar"
        }
    }
}

Are you saying that if store: true we will create two stored fields in Lucene, one for json_field.a containing the value foo and one for json_field.b.c contain the value bar rather than crating a single json_field stored field containing the following?

{
    "a": "foo",
    "b": {
        "c": "bar"
    }
}

For the index_options question, I guess the user could still decide between docs and freqs.

True

I was mostly aiming for consistency with the keyword-like fields here. Is your thought that frequency information doesn't really make sense in this context either?

I guess the freqs would make sense for a search that is looking for e.g. "json_field": "foo" (i.e. "foo") anywhere in the JSON blob but it wouldn't really make sense for a search for e.g. "json_field.a": "foo" since we wouldn't have the frequencies for the inner field alone. I haven't thought about this too much but maybe for consistency between the two cases (did we say decide we were going to support both cases still?) it would be easier to explain without freqs?

Copy link
Contributor Author

@jtibshirani jtibshirani Sep 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your understanding of the current behavior for stored fields is correct -- in this PR, a separate stored field is created for each leaf value. I think that later we should switch to the approach where we create a single stored field with the whole JSON blob (and have made a note of this on the meta-issue), but I was curious to play around with this first more naive approach. I'm also okay disallowing stored fields for now until we address the issue in a dedicated PR.

it wouldn't really make sense for a search for e.g. "json_field.a": "foo" since we wouldn't have the frequencies for the inner field alone

I think we would have these frequencies, because we'd calculate the frequency of the term a\0foo (where \0 is a separator character)? And you're right, we are supporting both of those cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would have these frequencies, because we'd calculate the frequency of the term a\0foo (where \0 is a separator character)?

Good point. In which case I agree that index options do make sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After catching up with @romseygeek and @colings86 offline, I decided to disable stored fields in this PR to cut down on confusion.


if (fieldType().indexOptions() != IndexOptions.NONE || fieldType().stored()) {
fields.addAll(fieldParser.parse(context.parser()));
if (fieldType.omitNorms()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do norms make sense for this type of field? given we are searching within the object the length of the whole object being taken into account when searching might not be that interesting?

Copy link
Contributor Author

@jtibshirani jtibshirani Sep 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I think they would be misleading unless we had some way to filter them based on a particular key. I also suspect it's not common to use norms with keyword-like fields. I'll disallow norms for now, and potentially revisit if we have other ideas.

@jtibshirani
Copy link
Contributor Author

jtibshirani commented Sep 25, 2018

Thanks @colings86 for the review! Sounds like we can mull over the naming a bit. I'll plan to make wildcard query creation more consistent in a separate PR.

An update: just merged #34062, which delegates wildcard creation to MappedFieldType.

@jtibshirani jtibshirani force-pushed the json-field-mapper branch 3 times, most recently from e098b08 to cb72aca Compare September 26, 2018 19:44
@jtibshirani
Copy link
Contributor Author

@colings86 I think I've addressed your initial round of comments.

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple of questions, but other than that LGTM

int openObjects = 1;

while (true) {
if (openObjects == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we're given malformed JSON? If it's missing a closing brace, will this just loop forever?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had been wondering the same thing and have a test case for this: https://github.com/elastic/elasticsearch/pull/33923/files#diff-c8a7654deccf33995eeafb732768f18bR233

XContentParser will throw if it encounters an end token (or another non-sensical token) when it is expecting a closing brace.

String value = fieldType.nullValueAsString();
if (value != null) {
addField(value, fields);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing that storing the path to the leaf is something you want to do as a followup?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly, that's coming in the next PR.

jtibshirani added a commit that referenced this pull request Oct 1, 2018
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request Oct 3, 2018
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request Oct 18, 2018
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request Oct 26, 2018
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request Oct 29, 2018
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request Oct 31, 2018
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request Nov 7, 2018
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit to jtibshirani/elasticsearch that referenced this pull request Nov 8, 2018
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit to jtibshirani/elasticsearch that referenced this pull request Nov 9, 2018
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request Nov 30, 2018
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request Dec 5, 2018
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request Dec 13, 2018
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request Mar 6, 2019
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request Mar 7, 2019
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request Mar 9, 2019
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request Mar 11, 2019
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request Mar 14, 2019
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request Mar 19, 2019
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit to jtibshirani/elasticsearch that referenced this pull request Mar 19, 2019
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request Mar 28, 2019
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request Apr 10, 2019
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request Apr 17, 2019
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request May 1, 2019
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request May 24, 2019
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
jtibshirani added a commit that referenced this pull request May 29, 2019
* Add a simple JSON field type.
* Add support for ignore_above.
* Add support for null_value.
* Add support for split_queries_on_whitespace.
* Prevent norms from being enabled.
* Clarify the message around copy_to not being supported.
* Disallow wildcard queries.
* For now, disallow the field from being stored.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :Search Foundations/Mapping Index mappings, including merging and defining field types
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants