Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support term presence in queries #332

Merged
merged 13 commits into from
Apr 30, 2018
Merged

Support term presence in queries #332

merged 13 commits into from
Apr 30, 2018

Conversation

olivernn
Copy link
Owner

@olivernn olivernn commented Mar 5, 2018

Adds support for modifying the required presence of a given term in matching documents.

Currently, when searching with multiple terms, each term is optional in the search results. That is, a search for "foo bar" will return results with either "foo" or "bar" or both "foo" and "bar", effectively terms are combined with a logical OR.

This change adds support for marking terms as either required or prohibited. A required term must appear in a document for that document to be returned in the search results. A prohibited term must not appear in a document for that document to be returned.

A term's presence is indicated by either a prefixed "+" (plus) for required terms, and a "-" (minus) for prohibited term when using lunr.Index#search. Constructing queries programatically using lunr.Index#query also support presence using the presence option with a value the lunr.Query.presence enum.

For example, to perform a search for documents that must contain "foo" and optionally contain "bar" use the following search string: +foo bar.

A simple, editable, example should help make this clear.

There is one open question, when a search contains only prohibited terms currently no documents are returned. Technically this should return all documents in the index, but I'm not sure that is too useful. Is the current behaviour correct, i.e. just no search results, or should an InvalidQuery exception be thrown?

An alpha release is available on npm, lunr@2.2.0-alpha.1

@drzraf
Copy link

drzraf commented Mar 6, 2018

Thank you for this nice feature!

IHMO searches that only contain prohibited terms should have their behavior configurable since the results for the "empty field" is something which vary for each application.

Adding another suggestion:.
I found useful that when doing AND-search, some of the terms could be optional according to whether or not they return results.

Ex:

A: 8 matches
B: 3 matches included A + 2 distinct matches
C: 0 matches

search "A or B" : 10
search "A and B" : 2
search "C": 0
search "A and C": 8   # instead of 0: "C" is omitted

Why?
When using autocompletion, it makes sense to have "strict" AND-search.
But without autocompletion, it's often annoying if a non-existing term(s) drop all the results instead of just refining them, like "+restaurant +pizza +vegan +sunday"

I'm pretty sure such a search-mode exist in full-search engines, I'd naively call that "ignore AND-ed terms that don't provide any result"
In such a mode, providing the result-count for each term would let the programmer react to the situation (ex: "no results included "sunday", showing results for the whole week instead")

@olivernn
Copy link
Owner Author

olivernn commented Mar 6, 2018

Thanks for the feedback, some great suggestions...

IHMO searches that only contain prohibited terms should have their behavior configurable since the results for the "empty field" is something which vary for each application.

It depends on what you mean by 'configurable'... If Lunr threw an InvalidQuery exception then at least the consumer would be able to handle that in whichever way they want, if you squint hard that is sort of configurable :trollface:

I'm pretty sure the current behaviour of returning nothing is not right, I just sidestepped the issue when implementing this. I'd like to see what properly implementing this case looks like. I think it should perform the search, i.e. if all documents do not have the term then return all documents. More interesting is when some documents do have the term, that could easily be a valid use case.

I'm guessing though that you had something more concrete in mind when you mentioned configuring the behaviour. I'm going to pass on this for now, but I'm not ruling it out in the future.

I found useful that when doing AND-search, some of the terms could be optional according to whether or not they return results.

I looked (though not very hard) for this behaviour in Lucene and couldn't find anything, do you have any examples (for any search engine/library)?

It should be possible to implement this in application code though right? If a query returns no results, drop required terms and try again. Possibly not the most efficient, or simple solution I guess.

Rather than implement this directly in Lunr I think Lunr could provide the relevant information (matches per term) so that applications can handle this cases however they like. Probably isn't going to make it into 2.2.x though. It would almost definitely require a major version bump as lunr.Index#query just returns an array of matches currently.

@drzraf
Copy link

drzraf commented Mar 7, 2018

I don't have examples and I surely don't want such a discussion about exotic feature to block AND-search :)

It should be possible to implement this in application code though right?

yes (but with a performance impact because it could not reuse AND search capabilities).
In #264 (comment) I implement the behavior doing one search per term and then doing intersection of result sets, and then looking at terms that returned none from the intersection.

@olivernn
Copy link
Owner Author

olivernn commented Mar 7, 2018

I've hacked together an implementation where by searches containing only prohibited terms work as expected. Specifically, if the document does not contain that term it gets returned, even if that means all documents.

The last sticking point is how these results are scored, and therefore sorted. The way I've got it implemented now would mean all results have a score of NaN. This could be the correct result but it feels wrong. I think a score of 1 would be more intuitive. NaN would mean that the score is unrepresentable, where 1 would mean a perfect match.

Fixing the score to be 1 is trivial, but requires even more special casing for these kind of queries, which is something I'd like to avoid.

Previously calculating the similarity with an empty vector would give a
score of NaN. This might technically be correct but probably isn't very
useful. Instead we change this to zero.

Scoring empty vectors is a side effect of introducing totally negated
queries.
In totally negated queries there will be no matches and therefore an
empty MatchData object is required for returning in the search results.
A negated query is one in which _all_ terms have a presence of
prohibited. When executing a negated query some special handling is
required and this method is used to trigger that handling.
A negated query is one in which all terms have a prohibited presence.
Handling these queries requires some special handling. Specifically
empty match data should be created as well as extracting _all_ field
refs to construct the results.
In cases where we know a document/field is not going to make it into the
results due to a terms presence (either required or prohibited) there is
no point in calculating a similarity score.

This leads to a minor performance improvement for queries that contain
either required or prohibited terms.
@olivernn
Copy link
Owner Author

Finally got round to working on this again.

Totally negated queries are now supported, each document that does not contain that term will be in the results, the score should be zero for each of them.

An alpha release is available on npm, lunr@2.2.0-alpha.2

I think that covers everything on this feature, if there are no objections I aim to get a release with this and a couple of other small features in the next few days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants