-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a simple JSON field mapper. #33923
Add a simple JSON field mapper. #33923
Conversation
Pinging @elastic/es-search-aggs |
} | ||
|
||
@Override | ||
public Query fuzzyQuery(Object value, Fuzziness fuzziness, int prefixLength, int maxExpansions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm disallowing these more advanced queries for now, but will look into supporting them in a follow-up (tracked on the meta-issue).
In addition to fuzzy
and regexp
, I would also like to disallow wildcard
. However, WildcardQueryBuilder
doesn't delegate to a method on MappedFieldType
. This means we attempt to create wildcard queries on non-text or keyword fields, and usually fail because the value can't be parsed. Do you know if there is a reason for this set-up, or if it's just an oversight that would be good to address?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure regarding why wildcard is different. I suspect its just because we haven't before needed for change the behaviour of wildcard queries based on the field type. I can't see any reason not to change it so we can control the behaviour here though. If we do make the change it might be best to make it directly on master and then pull the change into this branch to disallow it as the change may become difficult to maintain in the feature branch
|
||
private void addField(String value, List<IndexableField> fields) { | ||
if (value.length() <= ignoreAbove) { | ||
fields.add(new Field(fieldType.name(), new BytesRef(value), fieldType)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I opted not to index a special stored field with the whole JSON blob (as we discussed in #33795 (comment)), but it's something I'm keeping in mind.
c3ff13f
to
2afe161
Compare
6fc57d5
to
8773b4f
Compare
8773b4f
to
070b6b5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtibshirani I left some comments but I like where its going.
On naming: I'm not crazy about json
as the name of the field type, mostly because our documents are all JSON anyway so I'm worried it will lead to some confusion and users will have the wrong expectations from this new field type. That said, I don't currently have a great alternative either so its something we can discuss and should not hold up this PR since renaming the field type should be fairly simple change later.
server/src/main/java/org/elasticsearch/index/mapper/JsonFieldMapper.java
Show resolved
Hide resolved
} | ||
|
||
public Builder copyTo(CopyTo copyTo) { | ||
throw new UnsupportedOperationException("[copy_to] is not currently supported for [" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd remove currently
here since we don't yet know for sure if we are going to be able to implement copy_to
or if we'll have to have a different mechanism.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense.
} | ||
|
||
@Override | ||
public Query fuzzyQuery(Object value, Fuzziness fuzziness, int prefixLength, int maxExpansions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure regarding why wildcard is different. I suspect its just because we haven't before needed for change the behaviour of wildcard queries based on the field type. I can't see any reason not to change it so we can control the behaviour here though. If we do make the change it might be best to make it directly on master and then pull the change into this branch to disallow it as the change may become difficult to maintain in the feature branch
return; | ||
} | ||
|
||
if (fieldType().indexOptions() != IndexOptions.NONE || fieldType().stored()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the stored check here is obsolete since we aren't adding an option for a stored field as per https://github.com/elastic/elasticsearch/pull/33923/files#r219319713 below? We should probably also throw an exception if store(boolean)
is called on the Builder
?
Also I wonder if its worth even allowing the indexOptions
to be set? Given you can't turn on term vectors or positions the only option would be to turn off indexing altogether which I don't think makes sense since we aren't allowing stored fields? Whats left would be the user wanting to ignore the field completely which is already covered by the object
type's enabled" false
setting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lucene fields we're adding are still stored if store: true
, it's just that each value is added as a separate field (as opposed to adding a single stored field containing the whole JSON blob, as I was brainstorming). I think it'd be nice to see how this naive approach looks on the feature branch, so we can make an informed choice about whether we should switch to a single stored field.
For the index_options
question, I guess the user could still decide between docs
and freqs
. I was mostly aiming for consistency with the keyword-like fields here. Is your thought that frequency information doesn't really make sense in this context either?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's just that each value is added as a separate field (as opposed to adding a single stored field containing the whole JSON blob, as I was brainstorming).
I think I'm still missed something here so maybe we can clear up my confusions with an example? Say we have a document as follows and json_field
is a field of this new type:
{
"json_field": {
"a": "foo",
"b": {
"c": "bar"
}
}
}
Are you saying that if store: true
we will create two stored fields in Lucene, one for json_field.a
containing the value foo
and one for json_field.b.c
contain the value bar
rather than crating a single json_field
stored field containing the following?
{
"a": "foo",
"b": {
"c": "bar"
}
}
For the index_options question, I guess the user could still decide between docs and freqs.
True
I was mostly aiming for consistency with the keyword-like fields here. Is your thought that frequency information doesn't really make sense in this context either?
I guess the freqs
would make sense for a search that is looking for e.g. "json_field": "foo"
(i.e. "foo"
) anywhere in the JSON blob but it wouldn't really make sense for a search for e.g. "json_field.a": "foo"
since we wouldn't have the frequencies for the inner field alone. I haven't thought about this too much but maybe for consistency between the two cases (did we say decide we were going to support both cases still?) it would be easier to explain without freqs
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your understanding of the current behavior for stored fields is correct -- in this PR, a separate stored field is created for each leaf value. I think that later we should switch to the approach where we create a single stored field with the whole JSON blob (and have made a note of this on the meta-issue), but I was curious to play around with this first more naive approach. I'm also okay disallowing stored fields for now until we address the issue in a dedicated PR.
it wouldn't really make sense for a search for e.g. "json_field.a": "foo" since we wouldn't have the frequencies for the inner field alone
I think we would have these frequencies, because we'd calculate the frequency of the term a\0foo
(where \0
is a separator character)? And you're right, we are supporting both of those cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we would have these frequencies, because we'd calculate the frequency of the term a\0foo (where \0 is a separator character)?
Good point. In which case I agree that index options do make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After catching up with @romseygeek and @colings86 offline, I decided to disable stored fields in this PR to cut down on confusion.
|
||
if (fieldType().indexOptions() != IndexOptions.NONE || fieldType().stored()) { | ||
fields.addAll(fieldParser.parse(context.parser())); | ||
if (fieldType.omitNorms()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do norms make sense for this type of field? given we are searching within the object the length of the whole object being taken into account when searching might not be that interesting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I think they would be misleading unless we had some way to filter them based on a particular key. I also suspect it's not common to use norms with keyword-like fields. I'll disallow norms for now, and potentially revisit if we have other ideas.
Thanks @colings86 for the review! Sounds like we can mull over the naming a bit. I'll plan to make wildcard query creation more consistent in a separate PR. An update: just merged #34062, which delegates wildcard creation to |
e098b08
to
cb72aca
Compare
@colings86 I think I've addressed your initial round of comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a couple of questions, but other than that LGTM
int openObjects = 1; | ||
|
||
while (true) { | ||
if (openObjects == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if we're given malformed JSON? If it's missing a closing brace, will this just loop forever?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had been wondering the same thing and have a test case for this: https://github.com/elastic/elasticsearch/pull/33923/files#diff-c8a7654deccf33995eeafb732768f18bR233
XContentParser
will throw if it encounters an end token (or another non-sensical token) when it is expecting a closing brace.
String value = fieldType.nullValueAsString(); | ||
if (value != null) { | ||
addField(value, fields); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing that storing the path to the leaf is something you want to do as a followup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes exactly, that's coming in the next PR.
Currently the mapper extracts all leaf values of the JSON object, converts them to their text representations, and indexes each one as a keyword.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
* Add a simple JSON field type. * Add support for ignore_above. * Add support for null_value. * Add support for split_queries_on_whitespace. * Prevent norms from being enabled. * Clarify the message around copy_to not being supported. * Disallow wildcard queries. * For now, disallow the field from being stored.
Currently the mapper extracts all leaf values of the JSON object, converts them to their text representations, and indexes each one as a keyword. I plan to add support for key-value pairs in a subsequent PR. As discussed in #33795, we've decided to try an approach where we tokenize the JSON outside of lucene analysis.
Because of the parallel between this data type and
keyword
, I included the relevant optionsignore_above
,null_value
, andsplit_queries_on_whitespace
.Lastly, it would be great to get feedback on the naming: I settled on
json
, but am very happy for other suggestions. I wasn't a huge fan of names likekey_value
,map
, ordict
, because they might suggest the input should be a flat list of key-value pairs, as opposed to a general JSON blob. Certainlyobject
would be a nice choice, but it conflicts with the existing type.