Refactor autocomplete analysis #109

Closed
wants to merge 6 commits into from

Conversation

missinglink
Member

@missinglink missinglink commented Mar 16, 2016

This PR refactors the analyzers used by the /v1/autocomplete endpoint, with the goals of:

  • removing all interdependencies with the /v1/search endpoint, which makes subsequent refactoring easier.
  • providing a more robust method of handling synonym substitution:
    • by considering the differences between 'index time' analysis and 'query time' analysis.
    • by handling 'partial tokens' (partially complete words) and 'full tokens' differently.

Currently we use 3 different analyzers in the /v1/autocomplete endpoint:

| analyzer | "trade center" |
| --- | --- |
| peliasOneEdgeGram | "t", "tr", "tra", "trad", "trade", "c", "ce", "cen", "cent", "cente", "center" |
| peliasTwoEdgeGram | "tr", "tra", "trad", "trade", "ce", "cen", "cent", "cente", "center" |
| peliasPhrase | "trade", "ctr" |
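For readers unfamiliar with edge n-grams, here is a minimal Python sketch of how the two gram analyzers arrive at the tokens above. It only imitates the prefix expansion; the helper names `edge_grams` and `analyze` are hypothetical, and the real analyzers are Elasticsearch filter chains that do more than this.

```python
def edge_grams(token, min_gram=1):
    """Return every prefix of `token` that is at least `min_gram` characters long."""
    return [token[:n] for n in range(min_gram, len(token) + 1)]


def analyze(text, min_gram=1):
    """Lowercase, split on whitespace, and emit the edge grams of each word."""
    grams = []
    for word in text.lower().split():
        grams.extend(edge_grams(word, min_gram))
    return grams


# min_gram=1 mirrors peliasOneEdgeGram, min_gram=2 mirrors peliasTwoEdgeGram.
one_gram = analyze("trade center", min_gram=1)
two_gram = analyze("trade center", min_gram=2)
print(one_gram)
print(two_gram)
```

Neither list contains "ctr", so a query analyzed with peliasPhrase cannot match the "center" grams stored in the index.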

The peliasPhrase analyzer was originally intended to be used with /v1/search. As the table above shows, it contracts the word center to "ctr", whereas the other 2 analyzers emit grams of the uncontracted word, so the tokens do not line up. This mismatch is the cause of pelias/pelias#211.

new analyzers:

The new analyzers proposed in this PR are:

| analyzer | tokenizer | partial safe? | "center" |
| --- | --- | --- | --- |
| peliasIndexOneEdgeGram | 1gram | × | "c", "ce", "cen", "cent", "cente", "center" |
| peliasIndexTwoEdgeGram | 2gram | × | "ce", "cen", "cent", "cente", "center" |
| peliasQueryPartialToken | word | | "center" |
| peliasQueryFullToken | keyword | × | "center" |

They produce the same tokens when given the abbreviated/contracted form "ctr":

| analyzer | tokenizer | partial safe? | "ctr" |
| --- | --- | --- | --- |
| peliasIndexOneEdgeGram | 1gram | × | "c", "ce", "cen", "cent", "cente", "center" |
| peliasIndexTwoEdgeGram | 2gram | × | "ce", "cen", "cent", "cente", "center" |
| peliasQueryPartialToken | word | | "center" |
| peliasQueryFullToken | keyword | × | "center" |
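The two tables agree because the contraction is expanded before any grams are produced. A rough Python sketch of that ordering follows; the `SYNONYMS` table and the function names are hypothetical stand-ins, not the actual filter configuration in this PR.

```python
# Hypothetical synonym table mapping contractions to their full form.
SYNONYMS = {"ctr": "center"}


def normalize(token):
    """Expand a known contraction; leave every other token untouched."""
    return SYNONYMS.get(token, token)


def index_analyze(text, min_gram):
    """Index-time sketch: expand synonyms first, then emit edge grams."""
    grams = []
    for word in text.lower().split():
        full = normalize(word)
        grams.extend(full[:n] for n in range(min_gram, len(full) + 1))
    return grams


def query_full_token(text):
    """Query-time sketch for a complete single-word token: expand it, no grams."""
    return [normalize(text.lower().strip())]


print(index_analyze("ctr", 1))
print(query_full_token("ctr"))
```

Because index time and query time both settle on "center", a user who types either form reaches the same indexed grams.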

directionals:

They also handle directional synonyms in a similar way:

| analyzer | tokenizer | partial safe? | "north" |
| --- | --- | --- | --- |
| peliasIndexOneEdgeGram | 1gram | × | "n", "no", "nor", "nort", "north" |
| peliasIndexTwoEdgeGram | 2gram | × | "no", "nor", "nort", "north", "n" |
| peliasQueryPartialToken | word | | "north" |
| peliasQueryFullToken | keyword | × | "north" |

Again, they produce the same tokens when given the abbreviated/contracted form "n", with the exception of peliasQueryPartialToken, which leaves "n" unexpanded:

| analyzer | tokenizer | partial safe? | "n" |
| --- | --- | --- | --- |
| peliasIndexOneEdgeGram | 1gram | × | "n", "no", "nor", "nort", "north" |
| peliasIndexTwoEdgeGram | 2gram | × | "no", "nor", "nort", "north", "n" |
| peliasQueryPartialToken | word | | "n" |
| peliasQueryFullToken | keyword | × | "north" |

note: there is a bit of a 'hack' in place for the above peliasIndexTwoEdgeGram analysis that is specific to directionals. You can see that it adds a single gram 'n' to a token stream which usually only contains grams of size 2+. This improves address matching and reduces 'jitter'.
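To make the 'hack' concrete, here is a small Python sketch of a 2-gram index analysis that re-adds the single-letter directional. The `DIRECTIONALS` table and the function name are hypothetical, and the way the real analyzer achieves this in Elasticsearch may differ.

```python
# Hypothetical table of directionals and their single-letter abbreviations.
DIRECTIONALS = {"north": "n", "south": "s", "east": "e", "west": "w"}
ABBREVIATIONS = {abbr: full for full, abbr in DIRECTIONALS.items()}


def index_two_edge_gram(word):
    """Emit grams of length 2+, plus the 1-letter abbreviation for directionals."""
    word = word.lower()
    full = ABBREVIATIONS.get(word, word)  # "n" -> "north"
    grams = [full[:n] for n in range(2, len(full) + 1)]
    if full in DIRECTIONALS:
        # The 'hack': one extra single-character gram, only for directionals.
        grams.append(DIRECTIONALS[full])
    return grams


print(index_two_edge_gram("north"))
print(index_two_edge_gram("n"))
print(index_two_edge_gram("center"))
```

With the extra gram in place, a query token "n" that peliasQueryPartialToken leaves unexpanded can still match a document indexed with "north", while non-directional words keep the usual 2+ character minimum.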

api/query changes:

All usages of existing analyzers in /v1/autocomplete must be updated:

  • peliasOneEdgeGram -> peliasQueryPartialToken
  • peliasPhrase -> peliasQueryFullToken

Additionally, the autocomplete queries should no longer need to use the phrase.* index; all queries can safely be performed against the name.* index (if they are not already).
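As a rough illustration of what an updated query could look like, the Python sketch below builds an Elasticsearch request body that searches only a name field. Several things here are assumptions rather than part of this PR: the field name `name.default`, the `bool`/`must` query shape, and the rule that only the last word is treated as a partial token.

```python
def autocomplete_query(text, field="name.default"):
    """Build a request body: full tokens for completed words, partial for the last."""
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("empty autocomplete input")

    # Assumption: the user is still typing the final word.
    *complete, partial = tokens

    def clause(token, analyzer):
        # A standard match query with an explicit query-time analyzer.
        return {"match": {field: {"query": token, "analyzer": analyzer}}}

    must = [clause(t, "peliasQueryFullToken") for t in complete]
    must.append(clause(partial, "peliasQueryPartialToken"))
    return {"query": {"bool": {"must": must}}}


body = autocomplete_query("w 26 st")
print(body)
```

Nothing in the body refers to a phrase.* field, which is the point of the change described above.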

note: we can discuss removing the phrase.* index completely! This would greatly reduce the cluster disk/RAM usage, and it might be possible to achieve all the functionality of /v1/search using the prefixGram index. Let's discuss this in another issue.

dataset importer changes:

nil

risks / expected acceptance test changes:

There is not much that can go wrong here. The only differences at index time are that:

  • peliasIndexOneEdgeGram expands directionals whereas peliasOneEdgeGram does not.
  • peliasIndexTwoEdgeGram is otherwise the same as peliasTwoEdgeGram, but includes the 'hack' mentioned above.

The differences at query time are:

  • issue 211 is resolved
  • expect to see better handling of queries containing a single directional gram such as 'w 26 st'.

I've left some other changes I would like to make for a future PR, in order to reduce the number of changes going in at the same time.

related:

closes #105
resolves pelias/pelias#211
related pelias/openaddresses#68

@missinglink
Member Author

Travis reports the integration tests failing. This is an intermittent issue which depends on hardware; I tried to solve it in missinglink/elastictest#1 but it seems to still be an issue.

all tests pass when run locally.

@missinglink
Member Author

note: these analyzers should be updated with the new tokenizers in #113 once it's merged.

the integration tests will also need to be updated, see analyzer_peliasPhrase.js for an example.

@orangejulius orangejulius deleted the refactor_autocomplete_analysis branch May 24, 2016 15:23