Can it support Chinese? (能够支持中文吗) #53

Closed

lierisme opened this issue May 5, 2017 · 3 comments

Comments

lierisme commented May 5, 2017

Can it support Chinese?

@weixsong (Owner)

Hi @lierisme, if you want to support Chinese you just need to update the tokenizer: either find a Chinese tokenizer, or simply tokenize Chinese one character at a time.
A Chinese tokenizer is a complex package, so if you want a good Chinese tokenizer, I think elasticlunr.js should not be run in the browser.
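
For the per-character option, here is a minimal sketch. It assumes elasticlunr picks up a replaced global elasticlunr.tokenizer when indexing and searching, and uses the \u4e00-\u9fff range as a rough approximation of common Chinese characters:

```js
var elasticlunr = require('elasticlunr');

// Replace the stock whitespace tokenizer with one that first puts
// spaces around every CJK ideograph, so each character becomes a token.
elasticlunr.tokenizer = function (str) {
  if (str === null || str === undefined) return [];
  if (Array.isArray(str)) {
    return str.map(function (t) { return t.toString().toLowerCase(); });
  }
  return str.toString()
    .replace(/[\u4e00-\u9fff]/g, ' $& ')   // "能够支持" -> " 能 够 支 持 "
    .trim()
    .toLowerCase()
    .split(/[\s\-]+/)                       // split on whitespace and hyphens
    .filter(function (t) { return t.length > 0; });
};
```

Per-character tokens give cruder matches than real word segmentation, but they keep everything running in the browser with no extra dependency.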

@caihaibin1991
I replaced the default tokenizer with jieba, but I don't know why the pipeline returns an empty array.
[screenshot attached]

hepezu commented Aug 14, 2022

@caihaibin1991

To support Chinese:

  1. Change the tokenizer, or segment the text before elasticlunr's default tokenizer runs. For example, "能够支持中文吗" -> "能够 支持 中文 吗"; after this preprocessing, the default tokenizer works by splitting on spaces. For segmentation you can use nodejieba, @node-rs/jieba, or a regex that splits out every character.

  2. [Your issue] Remove the default elasticlunr.trimmer with index.pipeline.remove(elasticlunr.trimmer). The default trimmer strips all non-English characters, which is what causes your empty array (a sketch combining points 1 and 2 follows this list).

  3. Further work: build your own pipeline to process the text, including a tokenizer, trimmer, stemmer, and stopword filter. Examples are in weixsong/lunr-languages.
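
Putting points 1 and 2 together, a minimal sketch using nodejieba (the fields and sample text are made-up placeholders; @node-rs/jieba exposes a similar cut function):

```js
var elasticlunr = require('elasticlunr');
var nodejieba = require('nodejieba');

// Point 1: pre-segment Chinese text so the default whitespace
// tokenizer can split it: "能够支持中文吗" -> "能够 支持 中文 吗".
function segment(text) {
  return nodejieba.cut(text).join(' ');
}

var index = elasticlunr(function () {
  this.addField('title');
  this.addField('body');
  this.setRef('id');
});

// Point 2: remove the default trimmer, which strips all
// non-English characters and would leave Chinese tokens empty.
index.pipeline.remove(elasticlunr.trimmer);

index.addDoc({
  id: 1,
  title: segment('能够支持中文吗'),
  body: segment('这是一个中文全文检索的例子。')
});

// Queries run through the same tokenizer, so segment them too.
console.log(index.search(segment('支持中文')));
```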
