Support term presence in queries #332

olivernn · 2018-03-05T20:14:58Z

Adds support for modifying the required presence of a given term in matching documents.

Currently, when searching with multiple terms, each term is optional in the search results. That is, a search for "foo bar" will return results with either "foo" or "bar" or both "foo" and "bar", effectively terms are combined with a logical OR.

This change adds support for marking terms as either required or prohibited. A required term must appear in a document for that document to be returned in the search results. A prohibited term must not appear in a document for that document to be returned.

A term's presence is indicated by either a prefixed "+" (plus) for required terms, and a "-" (minus) for prohibited term when using lunr.Index#search. Constructing queries programatically using lunr.Index#query also support presence using the presence option with a value the lunr.Query.presence enum.

For example, to perform a search for documents that must contain "foo" and optionally contain "bar" use the following search string: +foo bar.

A simple, editable, example should help make this clear.

There is one open question, when a search contains only prohibited terms currently no documents are returned. Technically this should return all documents in the index, but I'm not sure that is too useful. Is the current behaviour correct, i.e. just no search results, or should an InvalidQuery exception be thrown?

An alpha release is available on npm, lunr@2.2.0-alpha.1

Prohibited, required and optional presence modifiers returning expected results in search tests.

drzraf · 2018-03-06T03:37:12Z

Thank you for this nice feature!

IHMO searches that only contain prohibited terms should have their behavior configurable since the results for the "empty field" is something which vary for each application.

Adding another suggestion:.
I found useful that when doing AND-search, some of the terms could be optional according to whether or not they return results.

Ex:

A: 8 matches
B: 3 matches included A + 2 distinct matches
C: 0 matches

search "A or B" : 10
search "A and B" : 2
search "C": 0
search "A and C": 8   # instead of 0: "C" is omitted

Why?
When using autocompletion, it makes sense to have "strict" AND-search.
But without autocompletion, it's often annoying if a non-existing term(s) drop all the results instead of just refining them, like "+restaurant +pizza +vegan +sunday"

I'm pretty sure such a search-mode exist in full-search engines, I'd naively call that "ignore AND-ed terms that don't provide any result"
In such a mode, providing the result-count for each term would let the programmer react to the situation (ex: "no results included "sunday", showing results for the whole week instead")

olivernn · 2018-03-06T18:41:16Z

Thanks for the feedback, some great suggestions...

IHMO searches that only contain prohibited terms should have their behavior configurable since the results for the "empty field" is something which vary for each application.

It depends on what you mean by 'configurable'... If Lunr threw an InvalidQuery exception then at least the consumer would be able to handle that in whichever way they want, if you squint hard that is sort of configurable

I'm pretty sure the current behaviour of returning nothing is not right, I just sidestepped the issue when implementing this. I'd like to see what properly implementing this case looks like. I think it should perform the search, i.e. if all documents do not have the term then return all documents. More interesting is when some documents do have the term, that could easily be a valid use case.

I'm guessing though that you had something more concrete in mind when you mentioned configuring the behaviour. I'm going to pass on this for now, but I'm not ruling it out in the future.

I found useful that when doing AND-search, some of the terms could be optional according to whether or not they return results.

I looked (though not very hard) for this behaviour in Lucene and couldn't find anything, do you have any examples (for any search engine/library)?

It should be possible to implement this in application code though right? If a query returns no results, drop required terms and try again. Possibly not the most efficient, or simple solution I guess.

Rather than implement this directly in Lunr I think Lunr could provide the relevant information (matches per term) so that applications can handle this cases however they like. Probably isn't going to make it into 2.2.x though. It would almost definitely require a major version bump as lunr.Index#query just returns an array of matches currently.

drzraf · 2018-03-07T12:36:25Z

I don't have examples and I surely don't want such a discussion about exotic feature to block AND-search :)

It should be possible to implement this in application code though right?

yes (but with a performance impact because it could not reuse AND search capabilities).
In #264 (comment) I implement the behavior doing one search per term and then doing intersection of result sets, and then looking at terms that returned none from the intersection.

olivernn · 2018-03-07T19:35:16Z

I've hacked together an implementation where by searches containing only prohibited terms work as expected. Specifically, if the document does not contain that term it gets returned, even if that means all documents.

The last sticking point is how these results are scored, and therefore sorted. The way I've got it implemented now would mean all results have a score of NaN. This could be the correct result but it feels wrong. I think a score of 1 would be more intuitive. NaN would mean that the score is unrepresentable, where 1 would mean a perfect match.

Fixing the score to be 1 is trivial, but requires even more special casing for these kind of queries, which is something I'd like to avoid.

Previously calculating the similarity with an empty vector would give a score of NaN. This might technically be correct but probably isn't very useful. Instead we change this to zero. Scoring empty vectors is a side effect of introducing totally negated queries.

In totally negated queries there will be no matches and therefore an empty MatchData object is required for returning in the search results.

A negated query is one in which _all_ terms have a presence of prohibited. When executing a negated query some special handling is required and this method is used to trigger that handling.

A negated query is one in which all terms have a prohibited presence. Handling these queries requires some special handling. Specifically empty match data should be created as well as extracting _all_ field refs to construct the results.

In cases where we know a document/field is not going to make it into the results due to a terms presence (either required or prohibited) there is no point in calculating a similarity score. This leads to a minor performance improvement for queries that contain either required or prohibited terms.

olivernn · 2018-04-25T15:01:09Z

Finally got round to working on this again.

Totally negated queries are now supported, each document that does not contain that term will be in the results, the score should be zero for each of them.

An alpha release is available on npm, lunr@2.2.0-alpha.2

I think that covers everything on this feature, if there are no objections I aim to get a release with this and a couple of other small features in the next few days.

olivernn added 7 commits March 5, 2018 16:25

Add initial implementation of term presence

874e79b

Prohibited, required and optional presence modifiers returning expected results in search tests.

Add tests for lunr.Set

7abfbe8

Add query string support for presence modifiers

4bec556

Add presence queries to search tests

107d03f

Update documentation for term presence queries

16549b4

Fix some lint errors

8e27e2b

Build version 2.2.0-alpha.1

605eef5

olivernn mentioned this pull request Mar 5, 2018

Add AND operator for joining terms #264

Closed

olivernn added 6 commits April 25, 2018 16:36

Add support for empty MatchData

881528c

In totally negated queries there will be no matches and therefore an empty MatchData object is required for returning in the search results.

Add lunr.Query#isNegated

b492faf

A negated query is one in which _all_ terms have a presence of prohibited. When executing a negated query some special handling is required and this method is used to trigger that handling.

Add support for negated queries

0009a40

A negated query is one in which all terms have a prohibited presence. Handling these queries requires some special handling. Specifically empty match data should be created as well as extracting _all_ field refs to construct the results.

BUild version 2.2.0-alpha.2

fa72ba4

olivernn merged commit fa72ba4 into master Apr 30, 2018

imthenachoman mentioned this pull request Apr 13, 2022

Is there a global setting to change the default TERM PRESENCE to AND vs OR #512

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support term presence in queries #332

Support term presence in queries #332

olivernn commented Mar 5, 2018

drzraf commented Mar 6, 2018

olivernn commented Mar 6, 2018

drzraf commented Mar 7, 2018

olivernn commented Mar 7, 2018

olivernn commented Apr 25, 2018

Support term presence in queries #332

Support term presence in queries #332

Conversation

olivernn commented Mar 5, 2018

drzraf commented Mar 6, 2018

olivernn commented Mar 6, 2018

drzraf commented Mar 7, 2018

olivernn commented Mar 7, 2018

olivernn commented Apr 25, 2018