Refactor autocomplete analysis #109

missinglink · 2016-03-16T14:13:22Z

This PR refactors the analyzers used by the /v1/autocomplete endpoint, with the goals of:

- removing all interdependencies with the /v1/search endpoint, making subsequent refactoring easier.
- providing a more robust method of handling synonym substitution:
  - by considering the differences between 'index time' analysis and 'query time' analysis.
  - by handling 'partial tokens' (partially complete words) and 'full tokens' differently.

Currently we use 3 different analyzers in the /v1/autocomplete endpoint:

analyzer	"trade center"
peliasOneEdgeGram	"t", "tr", "tra", "trad", "trade", "c", "ce", "cen", "cent", "cente", "center"
peliasTwoEdgeGram	"tr", "tra", "trad", "trade", "ce", "cen", "cent", "cente", "center"
peliasPhrase	"trade", "ctr"

The peliasPhrase analyzer was originally intended to be used with /v1/search. You can see above that the way it handles synonyms is mismatched with the other 2 analyzers: it contracts the word "center" to "ctr" (for example), which is not among the grams they produce. This is the cause of pelias/pelias#211.
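
To make the mismatch concrete, here is a minimal sketch in plain Python. It is not the Elasticsearch configuration; `edge_grams` and `analyze` are hypothetical helpers that only imitate the token lists in the table above.

```python
def edge_grams(token: str, min_gram: int) -> list[str]:
    """Return every prefix of `token` that is at least `min_gram` characters long."""
    return [token[:n] for n in range(min_gram, len(token) + 1)]


def analyze(text: str, min_gram: int) -> list[str]:
    """Split on whitespace, then emit the edge grams of each word in order."""
    grams: list[str] = []
    for word in text.lower().split():
        grams.extend(edge_grams(word, min_gram))
    return grams


one_gram = analyze("trade center", min_gram=1)  # imitates peliasOneEdgeGram
two_gram = analyze("trade center", min_gram=2)  # imitates peliasTwoEdgeGram

# Tokens copied from the peliasPhrase row of the table above.
phrase_tokens = ["trade", "ctr"]

# "ctr" is not a prefix of "center", so it never appears among the edge grams.
unmatched = [token for token in phrase_tokens if token not in one_gram]
print(unmatched)
```

The only token left in `unmatched` is the contracted form, which is why a query analyzed one way cannot find text indexed the other way.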

new analyzers:

The new analyzers proposed in this PR are:

analyzer	tokenizer	partial safe?	"center"
peliasIndexOneEdgeGram	1gram	×	"c", "ce", "cen", "cent", "cente", "center"
peliasIndexTwoEdgeGram	2gram	×	"ce", "cen", "cent", "cente", "center"
peliasQueryPartialToken	word	✔	"center"
peliasQueryFullToken	keyword	×	"center"

They produce the same tokens when given the abbreviated/contracted form "ctr":

analyzer	tokenizer	partial safe?	"ctr"
peliasIndexOneEdgeGram	1gram	×	"c", "ce", "cen", "cent", "cente", "center"
peliasIndexTwoEdgeGram	2gram	×	"ce", "cen", "cent", "cente", "center"
peliasQueryPartialToken	word	✔	"center"
peliasQueryFullToken	keyword	×	"center"
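
The two tables above can be reproduced with a rough sketch that assumes synonyms are expanded to their long form before any grams are produced. The one-entry `SYNONYMS` table and the helper names are hypothetical and stand in for the real analyzer filters.

```python
# Hypothetical one-entry synonym table, for illustration only.
SYNONYMS = {"ctr": "center"}


def expand(word: str) -> str:
    """Replace a known abbreviation with its long form."""
    word = word.lower()
    return SYNONYMS.get(word, word)


def index_edge_grams(word: str, min_gram: int) -> list[str]:
    """Index time: expand the synonym first, then emit edge grams of the long form."""
    full = expand(word)
    return [full[:n] for n in range(min_gram, len(full) + 1)]


def query_token(word: str) -> str:
    """Query time: expand the synonym and keep the word whole (no grams)."""
    return expand(word)


# Both spellings index to the same grams, and both query to a token
# that is present among those grams.
same_index = index_edge_grams("ctr", 1) == index_edge_grams("center", 1)
query_hits = query_token("ctr") in index_edge_grams("center", 2)
print(same_index, query_hits)
```

Because both sides normalise to the long form, it no longer matters which spelling was indexed and which was typed.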

directionals:

They also handle directional synonyms in a similar way:

analyzer	tokenizer	partial safe?	"north"
peliasIndexOneEdgeGram	1gram	×	"n", "no", "nor", "nort", "north"
peliasIndexTwoEdgeGram	2gram	×	"no", "nor", "nort", "north", "n"
peliasQueryPartialToken	word	✔	"north"
peliasQueryFullToken	keyword	×	"north"

Again, they produce the same tokens when given the abbreviated/contracted form "n", except for peliasQueryPartialToken, which leaves the partial token "n" unexpanded:

analyzer	tokenizer	partial safe?	"n"
peliasIndexOneEdgeGram	1gram	×	"n", "no", "nor", "nort", "north"
peliasIndexTwoEdgeGram	2gram	×	"no", "nor", "nort", "north", "n"
peliasQueryPartialToken	word	✔	"n"
peliasQueryFullToken	keyword	×	"north"

note: there is a bit of a 'hack' in place for the above peliasIndexTwoEdgeGram analysis that is specific to directionals: you can see it adds a single gram 'n' to a token stream which usually only contains grams of size 2+. This improves address matching and reduces 'jitter'.
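
A small sketch of why the extra single gram matters, again in plain Python rather than analyzer configuration. The one-entry `DIRECTIONALS` table and the function names are hypothetical; the behaviour is modelled only on the "north" / "n" tables above.

```python
# Hypothetical one-entry directional table, for illustration only.
DIRECTIONALS = {"n": "north"}
ABBREVIATIONS = {full: short for short, full in DIRECTIONALS.items()}


def index_two_edge_gram(word: str) -> list[str]:
    """Expand a directional, emit grams of length 2+, then re-add its one-letter form."""
    word = word.lower()
    full = DIRECTIONALS.get(word, word)
    grams = [full[:n] for n in range(2, len(full) + 1)]
    if full in ABBREVIATIONS:
        grams.append(ABBREVIATIONS[full])  # the directional-only 'hack'
    return grams


def query_partial_token(word: str) -> str:
    """A possibly unfinished word: a lone directional letter is left unexpanded."""
    return word.lower()


def query_full_token(word: str) -> str:
    """A completed word: the directional is expanded."""
    word = word.lower()
    return DIRECTIONALS.get(word, word)


index_tokens = index_two_edge_gram("north")

# Without the extra 'n' gram, the unexpanded partial token would match nothing.
print(query_partial_token("n") in index_tokens)
print(query_full_token("n") in index_tokens)
```

In this sketch both lookups succeed, so a user who has only typed "n" so far and a user who typed it as a finished word both reach documents indexed with "north".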

api/query changes:

All usages of existing analyzers in /v1/autocomplete must be updated:

peliasOneEdgeGram -> peliasQueryPartialToken
peliasPhrase -> peliasQueryFullToken

Additionally, the autocomplete queries should no longer need to use the phrase.* index; all queries can safely be performed against the name.* index (if they are not already).
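
As an illustration of the renames above, the following sketch rewrites a query body expressed as a Python dict. The `migrate_query` helper, the sample query and the field names `phrase.default` / `name.default` are assumptions made for the example; they are not taken from the API code.

```python
# Analyzer renames listed above.
ANALYZER_RENAMES = {
    "peliasOneEdgeGram": "peliasQueryPartialToken",
    "peliasPhrase": "peliasQueryFullToken",
}


def migrate_query(node):
    """Return a copy of a query body with analyzers renamed and phrase.* fields moved to name.*."""
    if isinstance(node, dict):
        migrated = {}
        for key, value in node.items():
            if key.startswith("phrase."):
                key = "name." + key[len("phrase."):]
            if key == "analyzer" and isinstance(value, str):
                migrated[key] = ANALYZER_RENAMES.get(value, value)
            else:
                migrated[key] = migrate_query(value)
        return migrated
    if isinstance(node, list):
        return [migrate_query(item) for item in node]
    return node


# Hypothetical autocomplete clause using the old analyzer and the phrase.* index.
old_query = {
    "match": {
        "phrase.default": {"query": "trade center", "analyzer": "peliasPhrase"}
    }
}

new_query = migrate_query(old_query)
print(new_query)
```

The function builds a new dict instead of editing in place, so the original query can be kept around for a before/after comparison in acceptance tests.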

note: we can discuss removing the phrase.* index completely! This would greatly reduce the cluster disk/RAM usage, and it might be possible to achieve all the functionality of /v1/search using the prefixGram index. Let's discuss this in another issue.

dataset importer changes:

nil

risks / expected acceptance test changes:

There is not much that can go wrong here. The only differences at index time are that:

peliasIndexOneEdgeGram expands directionals whereas peliasOneEdgeGram does not.
peliasIndexTwoEdgeGram is the same and includes the 'hack' mentioned above.

The differences at query time are:

issue 211 is resolved
expect to see better handling of queries containing a single directional gram such as 'w 26 st'.

I've left some other changes I would like to make for a future PR, in order to reduce the number of changes going in at the same time.
