How to Split words on Hyphen ? #3

jfgirard · 2014-07-08T14:19:00Z

I noticed a difference in the way words are tokenized compared to PostgreSQL.

select to_tsvector('Pseudo-Mercator');
"'mercat':3 'pseudo':2 'pseudo-merc':1"

Basically, PG indexes both "Pseudo-Mercator" and the sub-words "Pseudo" and "Mercator".

Searching for "mercator" gives me a result with PG.

But because Lunr only tokenizes on whitespace, a search for "mercator" won't work.

I could create an afterTokenizer function that splits each token on hyphens and adds the parts to the list.

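function afterTokenizer(tokens) {
    // Start from a copy so the original tokens (e.g. "pseudo-mercator") are kept.
    var result = tokens.slice();
    tokens.forEach(function (token) {
        // Drop empty strings produced by leading or trailing hyphens.
        var parts = token.split('-').filter(Boolean);
        // Only hyphenated tokens contribute extra sub-word tokens.
        if (parts.length > 1) {
            result = result.concat(parts);
        }
    });
    return result;
}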

So index.pipeline.run(lunr.tokenizer(text)); would become index.pipeline.run(afterTokenizer(lunr.tokenizer(text)));

Is this the best way to achieve the same behavior?
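
For illustration, here is a self-contained sketch of that wrapping step that runs without Lunr. The names `whitespaceTokenizer` and `expandHyphenated` are hypothetical; the first is only a stand-in for lunr.tokenizer, under the assumption that it lowercases the text and splits on whitespace.

```javascript
// Hypothetical stand-in for lunr.tokenizer: lowercase, then split on whitespace.
function whitespaceTokenizer(text) {
  return text.trim().toLowerCase().split(/\s+/);
}

// Keep every original token and append the parts of hyphenated ones.
function expandHyphenated(tokens) {
  return tokens.reduce(function (out, token) {
    var parts = token.split('-').filter(Boolean);
    return parts.length > 1 ? out.concat(parts) : out;
  }, tokens.slice());
}

var tokens = expandHyphenated(whitespaceTokenizer('WGS 84 Pseudo-Mercator'));
console.log(tokens);
// [ 'wgs', '84', 'pseudo-mercator', 'pseudo', 'mercator' ]
```

With the sub-words in the token list, a search for "mercator" has something to match, and the compound form stays searchable as well.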


nolanlawson · 2014-07-08T15:21:07Z

Yes, that seems like a good solution.

For what it's worth, I'm pretty shocked that Lunr doesn't split on anything but whitespace. (Commas? Semicolons?) We should file a bug on Lunr.

nolanlawson · 2014-07-08T15:54:41Z

OK, I see now how it works. It's not just whitespace; it's also trailing and leading non-letters. [Source](https://github.com/olivernn/lunr.js/blob/master/lib/trimmer.js#L22-L23).

So yeah, it just doesn't work for punctuation within a word, such as "Pseudo-Mercator". Depending on your use case that may or may not be okay.
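
A rough sketch of the trimming behavior described above, assuming it amounts to stripping non-word characters (`\W`) from both ends of a token. `trimToken` is a hypothetical name used for illustration, not Lunr's API.

```javascript
// Assumed approximation of a trimmer step: strip non-word characters
// from the start and end of a token, leaving the middle untouched.
function trimToken(token) {
  return token.replace(/^\W+/, '').replace(/\W+$/, '');
}

var trimmed = ['"Pseudo-Mercator",', '(mercator)', 'foo;'].map(trimToken);
console.log(trimmed);
// [ 'Pseudo-Mercator', 'mercator', 'foo' ]
```

The surrounding quotes, comma, parentheses and semicolon are removed, but the hyphen inside "Pseudo-Mercator" survives, so the word stays a single token.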

jfgirard · 2014-07-08T16:50:42Z

Yes, the trim works correctly. But the tokenizer only splits on whitespace (\s).
https://github.com/olivernn/lunr.js/blob/master/lib/tokenizer.js#L28

It looks like it's standard to split on hyphens; both PG and Couchdb-Lucene do it.

For C-L, I did a simple test: I indexed the name property and queried it like this:
localhost:5985/local/epsg/_design/cl/by_name?q=mercator&include_docs=true finds the doc with "pseudo-mercator".
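
To make the whitespace-only splitting mentioned above concrete, here is a minimal sketch that models tokenization as "lowercase, then split on a separator regex". That model and the `tokenize` helper are assumptions for illustration, not Lunr's actual code.

```javascript
// Assumption: tokenization is modeled as lowercase + split on a separator regex.
function tokenize(text, separator) {
  return text.trim().toLowerCase().split(separator);
}

// Splitting on whitespace only keeps the hyphenated word as one token.
var whitespaceOnly = tokenize('EPSG Pseudo-Mercator', /\s+/);
// Treating hyphens as separators too yields the sub-words.
var withHyphens = tokenize('EPSG Pseudo-Mercator', /[\s\-]+/);

console.log(whitespaceOnly); // [ 'epsg', 'pseudo-mercator' ]
console.log(withHyphens);    // [ 'epsg', 'pseudo', 'mercator' ]
```

Note that splitting this way drops the compound token, whereas the to_tsvector output at the top of this issue keeps the compound and both sub-words.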

nolanlawson · 2014-07-08T19:59:31Z

Opened a PR on Lunr: olivernn/lunr.js#98.

jfgirard · 2014-07-08T20:07:05Z

Great! Better to fix it upstream.

nolanlawson · 2014-07-14T23:14:12Z

5d0207a

nolanlawson closed this as completed Jul 8, 2014

nolanlawson mentioned this issue Jul 8, 2014

Split on hyphens as well as whitespace olivernn/lunr.js#98

Merged

jfgirard changed the title ~~How to Split words on Hypen ?~~ How to Split words on Hyphen ? Jul 10, 2014
