
custom tokenizers peliasNameTokenizer & peliasStreetTokenizer #113

Closed

Conversation

missinglink
Member

this PR introduces two custom 'tokenizers'.

the motivation behind this is to deal with cases where the data being indexed contains a non-whitespace delimiter:

# before

"a,b" -> ["ab"]
"a/b" -> ["ab"]
"a\b" -> ["ab"]

# after

"a,b" -> ["a","b"]
"a/b" -> ["a","b"]
"a\b" -> ["a","b"]

this was particularly noticeable for cross streets such as Bedell Street/133rd Avenue, which were being incorrectly tokenized as:

{
  "tokens": [
    {
      "token": "bedell",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 1
    },
    {
      "token": "street133rd",
      "start_offset": 7,
      "end_offset": 19,
      "type": "word",
      "position": 2
    },
    {
      "token": "ave",
      "start_offset": 20,
      "end_offset": 26,
      "type": "SYNONYM",
      "position": 3
    }
  ]
}
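for reference, token output like the above comes from Elasticsearch's _analyze API; the exact request form depends on the Elasticsearch version, but on recent versions a request along these lines reproduces it (the index name and analyzer name here are assumptions, not taken from this PR):

GET /pelias/_analyze
{
  "analyzer": "peliasStreet",
  "text": "Bedell Street/133rd Avenue"
}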

closes #91

@orangejulius
Member

👍 on the code. let's run the acceptance tests in autocomplete mode on this once dev is done

@orangejulius
Member

Can you confirm that this code is in the dev build that finished today?

Development

Successfully merging this pull request may close these issues.

character filter should treat commas as spaces