
custom tokenizers peliasNameTokenizer & peliasStreetTokenizer #113

Closed

Conversation

missinglink
Member

this PR introduces two custom 'tokenizers'.

the motivation behind this is to deal with cases where the data being indexed contains a non-whitespace delimiter:

# before

"a,b" -> ["ab"]
"a/b" -> ["ab"]
"a\b" -> ["ab"]

# after

"a,b" -> ["a","b"]
"a/b" -> ["a","b"]
"a\b" -> ["a","b"]

this was particularly noticeable for cross streets such as Bedell Street/133rd Avenue, which were being incorrectly tokenized as:

{
  "tokens": [
    {
      "token": "bedell",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 1
    },
    {
      "token": "street133rd",
      "start_offset": 7,
      "end_offset": 19,
      "type": "word",
      "position": 2
    },
    {
      "token": "ave",
      "start_offset": 20,
      "end_offset": 26,
      "type": "SYNONYM",
      "position": 3
    }
  ]
}
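for reference, token output like the above comes from Elasticsearch's _analyze API; the exact request form depends on the Elasticsearch version, but on recent versions a request along these lines reproduces it (the index name and analyzer name here are assumptions, not taken from this PR):

GET /pelias/_analyze
{
  "analyzer": "peliasStreet",
  "text": "Bedell Street/133rd Avenue"
}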

closes #91

@orangejulius
Member

👍 on the code. let's run the acceptance tests in autocomplete mode on this once dev is done

@orangejulius
Member

Can you confirm that this code is in the dev build that finished today?

Development

Successfully merging this pull request may close these issues.

character filter should treat commas as spaces