How to split words on hyphen? #3
Yes, that seems like a good solution. For what it's worth, I'm pretty shocked that Lunr doesn't split on anything but whitespace. (Commas? Semicolons?) We should file a bug on Lunr.
OK, I see now how it works. It's not just whitespace; it's also trailing and leading non-letters ([Source](https://github.com/olivernn/lunr.js/blob/master/lib/trimmer.js#L22-L23)). So yeah, it just doesn't work for punctuation within a word, such as "Pseudo-Mercator". Depending on your use case, that may or may not be okay.
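To illustrate the behavior described above, here is a minimal sketch of a trimmer in the same spirit as lunr's (illustrative only, not lunr's actual source): it strips leading and trailing non-word characters from a token, but never touches punctuation inside the token.

```javascript
// Sketch of a lunr-style trimmer: removes leading/trailing non-word
// characters, but leaves punctuation inside the token untouched.
function trimmer(token) {
  return token.replace(/^\W+/, '').replace(/\W+$/, '');
}

// Surrounding quotes and the final period are trimmed away,
// but the internal hyphen survives, so the token is never split.
console.log(trimmer('"Pseudo-Mercator".'));
```

This is why "Pseudo-Mercator" ends up in the index as a single token: trimming happens at the edges only, and nothing in the pipeline splits on the internal hyphen.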
Yes, the trim works correctly, but the tokenizer only splits on whitespace (\s). It looks like it's standard to split on hyphen; both PG and CouchDB-Lucene do it. For C-L, I did a simple test indexing the name property and querying it like that:
Opened a PR on Lunr: olivernn/lunr.js#98.
Great! Better to fix it upstream.
I noticed a difference in the way words are tokenized compared to PostgreSQL:
```sql
select to_tsvector('Pseudo-Mercator');
-- "'mercat':3 'pseudo':2 'pseudo-merc':1"
```
Basically, PG indexes both the whole word "Pseudo-Mercator" and the sub-words "Pseudo" and "Mercator". Searching for "mercator" gives me a result with PG. But because Lunr only tokenizes on whitespace, a search for "mercator" won't match.
I could create an afterTokenizer function that splits each token and adds the sub-words to the list.
So,

```js
index.pipeline.run(lunr.tokenizer(text));
```

would become

```js
index.pipeline.run(afterTokenizer(lunr.tokenizer(text)));
```
Is this the best way to achieve the same behavior?
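For reference, the idea sketched above could look something like this. This is a hypothetical afterTokenizer (the name and shape are assumptions from the description, not an existing API): it keeps every original token and, for hyphenated ones, also emits each hyphen-separated part, mirroring what PG's to_tsvector does.

```javascript
// Hypothetical post-tokenizer step: keep each original token, and for
// hyphenated tokens also add the individual sub-words, so both
// "pseudo-mercator" and "mercator" become searchable.
function afterTokenizer(tokens) {
  var out = [];
  tokens.forEach(function (token) {
    out.push(token);
    if (token.indexOf('-') !== -1) {
      token.split('-').forEach(function (part) {
        if (part.length > 0) out.push(part);
      });
    }
  });
  return out;
}

console.log(afterTokenizer(['pseudo-mercator', 'map']));
// → ["pseudo-mercator", "pseudo", "mercator", "map"]
```

Keeping the compound token alongside its parts matters: exact searches for "pseudo-mercator" still hit, while searches for either sub-word now work too.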