Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose splitOnWhitespace in Query String Query #20965

Merged
merged 3 commits into from
Nov 2, 2016
Merged

Expose splitOnWhitespace in Query String Query #20965

merged 3 commits into from
Nov 2, 2016

Conversation

jimczi
Copy link
Contributor

@jimczi jimczi commented Oct 17, 2016

This change adds an option called split_on_whitespace which prevents the query parser to split free text part on whitespace prior to analysis. Instead the queryparser would parse around only real 'operators'. Default to true.
For instance the query "foo bar" would let the analyzer of the targeted field decide how the tokens should be splitted.
Some options are missing in this change but I'd like to add them in a follow up PR in order to be able to simplify the backport in 5.x. The missing options (changes) are:

  • A type option which similarly to the multi_match query defines how the free text should be parsed when multi fields are defined.
  • Simple range query with additional tokens like ">100 50" are broken when split_on_whitespace is set to false. It should be possible to preserve this syntax and make the parser aware of this special syntax even when split_on_whitespace is set to false.
  • Since all this options would make the query_string_query very similar to a match (multi_match) query we should be able to share the code that produce the final Lucene query.

Fixes #20841

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I left two comments about changes that will be needed for the backport

@@ -200,6 +204,7 @@ public QueryStringQueryBuilder(StreamInput in) throws IOException {
timeZone = in.readOptionalTimeZone();
escape = in.readBoolean();
maxDeterminizedStates = in.readVInt();
splitOnWhitespace = in.readBoolean();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need serialization protection when backported to the 5.x branch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually there should protection on 6.x too since we have not given up the idea about multi-version clusters yet

@@ -234,6 +239,7 @@ protected void doWriteTo(StreamOutput out) throws IOException {
out.writeOptionalTimeZone(timeZone);
out.writeBoolean(this.escape);
out.writeVInt(this.maxDeterminizedStates);
out.writeBoolean(this.splitOnWhitespace);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here about the serialization protection

@jimczi
Copy link
Contributor Author

jimczi commented Oct 28, 2016

@dakrone @jpountz I pushed another commit to protect the serialization on prior versions. Can you take another look ?

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. @dakrone might want another look just to be super sure though.

@@ -200,6 +207,9 @@ public QueryStringQueryBuilder(StreamInput in) throws IOException {
timeZone = in.readOptionalTimeZone();
escape = in.readBoolean();
maxDeterminizedStates = in.readVInt();
if (in.getVersion().onOrAfter(V_5_1_0_UNRELEASED)) {
splitOnWhitespace = in.readBoolean();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably add an else clause that sets splitOnWhitespace to the appropriate value just to be super clear.

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

This change adds an option called `split_on_whitespace` which prevents the query parser to split free text part on whitespace prior to analysis. Instead the queryparser would parse around only real 'operators'. Default to true.
For instance the query `"foo bar"` would let the analyzer of the targeted field decide how the tokens should be splitted.
Some options are missing in this change but I'd like to add them in a follow up PR in order to be able to simplify the backport in 5.x. The missing options (changes) are:
* A `type` option which similarly to the `multi_match` query defines how the free text should be parsed when multi fields are defined.
* Simple range query with additional tokens like ">100 50" are broken when `split_on_whitespace` is set to false. It should be possible to preserve this syntax and make the parser aware of this special syntax even when `split_on_whitespace` is set to false.
* Since all this options would make the `query_string_query` very similar to a match (multi_match) query we should be able to share the code that produce the final Lucene query.
@jimczi jimczi merged commit 9d6fac8 into elastic:master Nov 2, 2016
@jimczi jimczi deleted the split_on_whitespace branch November 2, 2016 09:00
jimczi added a commit that referenced this pull request Nov 2, 2016
This change adds an option called `split_on_whitespace` which prevents the query parser to split free text part on whitespace prior to analysis. Instead the queryparser would parse around only real 'operators'. Default to true.
For instance the query `"foo bar"` would let the analyzer of the targeted field decide how the tokens should be splitted.
Some options are missing in this change but I'd like to add them in a follow up PR in order to be able to simplify the backport in 5.x. The missing options (changes) are:
* A `type` option which similarly to the `multi_match` query defines how the free text should be parsed when multi fields are defined.
* Simple range query with additional tokens like ">100 50" are broken when `split_on_whitespace` is set to false. It should be possible to preserve this syntax and make the parser aware of this special syntax even when `split_on_whitespace` is set to false.
* Since all this options would make the `query_string_query` very similar to a match (multi_match) query we should be able to share the code that produce the final Lucene query.
@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Query DSL labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories v5.1.1 v6.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants