Expose splitOnWhitespace in `Query String Query` #20965

jimczi · 2016-10-17T09:09:40Z

This change adds an option called `split_on_whitespace`. When set to false, it prevents the query parser from splitting the free-text parts of the query on whitespace prior to analysis; the query parser then splits only around real operators. It defaults to true.
For instance, with `split_on_whitespace` set to false, the query "foo bar" lets the analyzer of the targeted field decide how the text is split into tokens.
Some options are missing from this change, but I'd like to add them in a follow-up PR in order to simplify the backport to 5.x. The missing options (changes) are:

* A `type` option which, similarly to the `multi_match` query, defines how the free text should be parsed when multiple fields are defined.
* Simple range queries with additional tokens, like ">100 50", are broken when `split_on_whitespace` is set to false. It should be possible to preserve this syntax by making the parser aware of it even when `split_on_whitespace` is set to false.
* Since all these options would make the `query_string` query very similar to a match (`multi_match`) query, we should be able to share the code that produces the final Lucene query.
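To make the effect of the option concrete, here is a minimal sketch of the two code paths. It is not Elasticsearch or Lucene code: the class `SplitOnWhitespaceSketch`, the method `analyzeFreeText`, and the toy `shingleAnalyzer` are all hypothetical names introduced for illustration, and the real parser also handles operators, fields, and escaping, which are ignored here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only; this is not Elasticsearch or Lucene code.
final class SplitOnWhitespaceSketch {

    // Stand-in "analyzer" (an assumption): lowercases the text and emits each
    // word plus each pair of adjacent words, loosely like a shingle filter.
    static List<String> shingleAnalyzer(String text) {
        String[] words = text.trim().toLowerCase().split("\\s+");
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            tokens.add(words[i]);
            if (i + 1 < words.length) {
                tokens.add(words[i] + " " + words[i + 1]);
            }
        }
        return tokens;
    }

    // With splitOnWhitespace == true the free text is cut on whitespace first
    // and each chunk is analyzed on its own; with false the analyzer receives
    // the whole text and decides how to tokenize it.
    static List<String> analyzeFreeText(String freeText, boolean splitOnWhitespace,
                                        Function<String, List<String>> analyzer) {
        List<String> tokens = new ArrayList<>();
        if (splitOnWhitespace) {
            for (String chunk : freeText.trim().split("\\s+")) {
                tokens.addAll(analyzer.apply(chunk));
            }
        } else {
            tokens.addAll(analyzer.apply(freeText));
        }
        return tokens;
    }

    public static void main(String[] args) {
        Function<String, List<String>> analyzer = SplitOnWhitespaceSketch::shingleAnalyzer;
        System.out.println(analyzeFreeText("foo bar", true, analyzer));  // [foo, bar]
        System.out.println(analyzeFreeText("foo bar", false, analyzer)); // [foo, foo bar, bar]
    }
}
```

In this sketch the multi-word token "foo bar" can only appear when the analyzer sees both words at once, which is the situation the option is meant to allow.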

Fixes #20841

dakrone

LGTM, I left two comments about changes that will be needed for the backport

dakrone · 2016-10-17T15:57:33Z

core/src/main/java/org/elasticsearch/index/query/QueryStringQueryBuilder.java

@@ -200,6 +204,7 @@ public QueryStringQueryBuilder(StreamInput in) throws IOException {
        timeZone = in.readOptionalTimeZone();
        escape = in.readBoolean();
        maxDeterminizedStates = in.readVInt();
+        splitOnWhitespace = in.readBoolean();


This will need serialization protection when backported to the 5.x branch

jpountz · 2016-10-18

Actually, there should be protection on 6.x too, since we have not given up on the idea of multi-version clusters yet.

dakrone · 2016-10-17T15:57:43Z

core/src/main/java/org/elasticsearch/index/query/QueryStringQueryBuilder.java

@@ -234,6 +239,7 @@ protected void doWriteTo(StreamOutput out) throws IOException {
        out.writeOptionalTimeZone(timeZone);
        out.writeBoolean(this.escape);
        out.writeVInt(this.maxDeterminizedStates);
+        out.writeBoolean(this.splitOnWhitespace);


Same here about the serialization protection

jimczi · 2016-10-28T07:58:53Z

@dakrone @jpountz I pushed another commit to protect the serialization for prior versions. Can you take another look?

nik9000

Makes sense to me. @dakrone might want another look just to be super sure though.

nik9000 · 2016-10-30T03:19:11Z

core/src/main/java/org/elasticsearch/index/query/QueryStringQueryBuilder.java

@@ -200,6 +207,9 @@ public QueryStringQueryBuilder(StreamInput in) throws IOException {
        timeZone = in.readOptionalTimeZone();
        escape = in.readBoolean();
        maxDeterminizedStates = in.readVInt();
+        if (in.getVersion().onOrAfter(V_5_1_0_UNRELEASED)) {
+            splitOnWhitespace = in.readBoolean();
+        }


I'd probably add an else clause that sets splitOnWhitespace to the appropriate value just to be super clear.
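The version-gated read and write discussed in this review, including the suggested else clause, could look roughly like the sketch below. It is a self-contained approximation using plain `java.io` streams rather than Elasticsearch's `StreamInput`/`StreamOutput`; the class name, the integer version constants, and the method names are assumptions made for illustration, while the default of true comes from the PR description.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch; the PR itself checks in.getVersion().onOrAfter(V_5_1_0_UNRELEASED).
final class VersionGatedSerializationSketch {

    // Made-up numeric wire versions, used only to compare "older" and "newer" peers.
    static final int V_5_0_0 = 50000;
    static final int V_5_1_0 = 50100;

    // Default taken from the PR description ("Default to true").
    static final boolean DEFAULT_SPLIT_ON_WHITESPACE = true;

    // Writes the new flag only if the receiving node is new enough to read it.
    static byte[] write(int maxDeterminizedStates, boolean splitOnWhitespace, int peerVersion) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(maxDeterminizedStates);
            if (peerVersion >= V_5_1_0) {
                out.writeBoolean(splitOnWhitespace);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the flag only if the sending node wrote it; otherwise falls back
    // to the default explicitly, as suggested in the review.
    static boolean readSplitOnWhitespace(byte[] data, int peerVersion) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readInt(); // maxDeterminizedStates, not needed in this sketch
            if (peerVersion >= V_5_1_0) {
                return in.readBoolean();
            } else {
                return DEFAULT_SPLIT_ON_WHITESPACE;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] newer = write(10000, false, V_5_1_0);
        byte[] older = write(10000, false, V_5_0_0);
        System.out.println(readSplitOnWhitespace(newer, V_5_1_0)); // false
        System.out.println(readSplitOnWhitespace(older, V_5_0_0)); // true
    }
}
```

The write and read sides have to gate on the same version; otherwise an older peer would either miss a byte it expects or read one that belongs to the next field.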

dakrone

LGTM

jimczi added >enhancement :Query DSL v6.0.0-alpha1 v5.1.1 labels Oct 17, 2016

clintongormley mentioned this pull request Oct 17, 2016

Add "all field" execution mode to query_string query #20925

Merged

jimczi added the review label Oct 17, 2016

dakrone approved these changes Oct 17, 2016

nik9000 approved these changes Oct 30, 2016

dakrone approved these changes Oct 31, 2016

jimczi added 3 commits November 2, 2016 09:59

add serialization protection based on the stream version

11d0077

explicitly set default value for split_on_whitespace for versions prior to 5.1

3ee2fae

jimczi merged commit 9d6fac8 into elastic:master Nov 2, 2016

jimczi deleted the split_on_whitespace branch November 2, 2016 09:00

clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Query DSL labels Feb 14, 2018
