Tokenizer trimming causes offset matches if document content begins with whitespace #417

rgaiken · 2019-10-04T15:05:12Z

If a document's content begins with whitespace, the trim() call in tokenizer.js strips it before tokenizing, so the positions in the returned match data are offset from the original text by the number of characters removed.

See this fiddle for a simple example
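Since the fiddle link did not survive here, the following is a minimal self-contained sketch of the effect. It does not use lunr itself: `tokenize` and `extract` are hypothetical helpers, and the `[start, length]` position shape is an assumption made for illustration. It only shows how positions computed against a trimmed string stop lining up with the original document.

```javascript
// Simplified whitespace tokenizer for illustration only.
// This is NOT lunr's tokenizer; it just records where each token
// starts in the string it was given, as a [start, length] pair.
function tokenize(str) {
  const tokens = [];
  const re = /\S+/g;
  let m;
  while ((m = re.exec(str)) !== null) {
    tokens.push({ text: m[0].toLowerCase(), position: [m.index, m[0].length] });
  }
  return tokens;
}

// Cut the matched region back out of the ORIGINAL document,
// the way a highlighter would.
function extract(source, [start, length]) {
  return source.slice(start, start + length);
}

const doc = "  Hello world"; // two leading spaces

// Trimming first: positions are relative to the trimmed string.
const trimmedTokens = tokenize(doc.trim());
// No trimming: positions are relative to the document as stored.
const untrimmedTokens = tokenize(doc);

const fromTrimmed = extract(doc, trimmedTokens[0].position);
const fromUntrimmed = extract(doc, untrimmedTokens[0].position);

console.log(JSON.stringify(fromTrimmed));   // "  Hel" (shifted by the two trimmed characters)
console.log(JSON.stringify(fromUntrimmed)); // "Hello"
```

With the trimmed input, every position is short by the length of the leading whitespace, so highlighting against the original document lands two characters early.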


hoelzro · 2019-10-05T03:32:55Z

Thanks for the report, @rgaiken! I wrote up a fix in #418.


olivernn · 2019-10-06T16:42:30Z

I've just pushed 2.3.7, which includes the fix provided by @hoelzro.

hoelzro added a commit to hoelzro/lunr.js that referenced this issue Oct 5, 2019

Test that left-hand whitespace preserves token positions

28d4f97

Test for GH olivernn#417

hoelzro added a commit to hoelzro/lunr.js that referenced this issue Oct 5, 2019

Account for left-hand whitespace in token positions

c00c2b9

Fixes GH olivernn#417

hoelzro mentioned this issue Oct 5, 2019

Fix #417 #418

Merged

hoelzro added a commit to hoelzro/lunr.js that referenced this issue Oct 6, 2019

Test that left-hand whitespace preserves token positions

b188bf9

Test for GH olivernn#417

hoelzro added a commit to hoelzro/lunr.js that referenced this issue Oct 6, 2019

Don't trim strings during tokenization

c59fd90

This can throw off the token position metadata, as reported in GH olivernn#417. Fixes GH olivernn#417

hoelzro added a commit to hoelzro/lunr.js that referenced this issue Oct 6, 2019

Add test for right-hand whitespace

a67632e

Addresses GH olivernn#417

olivernn closed this as completed in 15e24c4 Oct 6, 2019
