v0.28 hebrew tokenizer #1728

Merged
merged 9 commits into from
Jul 7, 2022
18 changes: 9 additions & 9 deletions learn/advanced/tokenization.md
@@ -8,18 +8,18 @@ This allows Meilisearch to function in several different languages with zero set

## Deep dive: The Meilisearch tokenizer

![Chart illustrating the architecture of Meilisearch's tokenizer](https://user-images.githubusercontent.com/6482087/102896344-8560d200-4466-11eb-8cfe-b4ae8741093b.jpg)

When you add documents to a Meilisearch index, the tokenization process is handled by an abstract interface called an **analyzer**. The analyzer is responsible for determining the primary language of each field based on the scripts (e.g., Latin alphabet, Chinese hanzi, etc.) that are present there. Then, it applies the corresponding **pipeline** to each field.
When you add documents to a Meilisearch index, the tokenization process is handled by an abstract interface called the tokenizer. The tokenizer is responsible for splitting each field by writing system (e.g. Latin alphabet, Chinese hanzi). It then applies the corresponding pipeline to each part of each document field.

We can break down the tokenization process like so:

1. Crawl the document(s) and determine the primary language for each field
2. Go back over the documents field-by-field, running the corresponding tokenization pipeline, if it exists
1. Crawl the document(s), splitting each field by script
2. Go back over the documents part-by-part, running the corresponding tokenization pipeline, if it exists
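
The script-splitting step in the updated wording can be illustrated with a small Python sketch. This is not the tokenizer's actual code: `script_of` and `split_by_script` are hypothetical helpers, and taking the script label from the first word of a character's Unicode name is a simplifying assumption made only for illustration.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Rough script label for one character (assumption: first word of its Unicode name)."""
    if not ch.isalpha():
        return "COMMON"  # spaces, digits, punctuation, combining marks
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def split_by_script(text: str) -> list[tuple[str, str]]:
    """Split a field into (script, substring) parts, one part per run of a single script."""
    parts: list[tuple[str, str]] = []
    for ch in text:
        script = script_of(ch)
        if parts and (script == "COMMON" or script == parts[-1][0]):
            # Same script, or a script-neutral character: extend the current part.
            parts[-1] = (parts[-1][0], parts[-1][1] + ch)
        elif parts and parts[-1][0] == "COMMON":
            # Leading script-neutral characters adopt the script of the first letter.
            parts[-1] = (script, parts[-1][1] + ch)
        else:
            parts.append((script, ch))
    return parts


for script, part in split_by_script("Hello 你好"):
    print(script, repr(part))
```

Each part can then be handed to the pipeline registered for its script, which is the second step in the list above.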

Pipelines include many language-specific operations. Currently, we have two pipelines:
Pipelines include many language-specific operations. Currently, we have four pipelines:

1. A specialized Chinese pipeline using [Jieba](https://github.com/messense/jieba-rs)
2. A default Meilisearch pipeline that separates words based on categories. Works with a variety of languages
1. A default Meilisearch pipeline for languages that use whitespace to separate words. Uses [unicode segmenter](https://github.com/unicode-rs/unicode-segmentation)
2. A specialized Chinese pipeline using [Jieba](https://github.com/messense/jieba-rs)
3. A specialized Japanese pipeline using [Lindera](https://github.com/lindera-morphology/lindera)
4. A specialized Hebrew pipeline based on the default Meilisearch pipeline. Uses [Niqqud](https://docs.rs/niqqud/latest/niqqud/) for normalization
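
As a rough illustration of how a script-specific pipeline can build on the default one, here is a hedged Python sketch. All names (`default_pipeline`, `hebrew_pipeline`, `PIPELINES`, `tokenize`) are hypothetical. The default pipeline is assumed to lowercase and split on whitespace, and the Hebrew normalization is approximated by dropping non-spacing marks in the Hebrew block, which may differ from what the Niqqud crate does. The Chinese and Japanese pipelines depend on Jieba and Lindera and are left out.

```python
import unicodedata


def default_pipeline(text: str) -> list[str]:
    """Assumed default: lowercase, then split on whitespace."""
    return text.lower().split()


def strip_hebrew_marks(text: str) -> str:
    """Drop non-spacing marks (vowel points and similar) in the range U+0591..U+05C7."""
    return "".join(
        ch
        for ch in text
        if not ("\u0591" <= ch <= "\u05c7" and unicodedata.category(ch) == "Mn")
    )


def hebrew_pipeline(text: str) -> list[str]:
    """Hebrew: normalize first, then reuse the default pipeline."""
    return default_pipeline(strip_hebrew_marks(text))


# Scripts without a dedicated entry fall back to the default pipeline.
PIPELINES = {"HEBREW": hebrew_pipeline}


def tokenize(script: str, text: str) -> list[str]:
    return PIPELINES.get(script, default_pipeline)(text)


# "shalom" written with vowel points, followed by a space and a second word.
pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd \u05e2\u05d5\u05dc\u05dd"
print(tokenize("HEBREW", pointed))
print(tokenize("LATIN", "Hello World"))
```

Stripping the marks before tokenizing means a query typed without vowel points can match a document that includes them.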

For more details, check out the [feature specification](https://github.com/meilisearch/specifications/blob/master/text/0001-script-based-tokenizer.md).
For more details, check out the [tokenizer contribution guide](https://github.com/meilisearch/charabia/blob/main/CONTRIBUTING.md).
37 changes: 19 additions & 18 deletions learn/what_is_meilisearch/language.md
@@ -1,42 +1,43 @@
# Language

**Meilisearch is multilingual**, featuring optimized support for:
Meilisearch is multilingual, featuring optimized support for:

- **Any language that uses whitespace to separate words**
- **Chinese** (through [Jieba](https://github.com/messense/jieba-rs))
- **Japanese** (through [Lindera](https://github.com/lindera-morphology/lindera))
- Any language that uses whitespace to separate words
- Chinese
- Japanese
- Hebrew

We aim to provide global language support, and your feedback helps us [move closer to that goal](#improving-our-language-support). If you notice inconsistencies in your search results or the way your documents are processed, please open an issue on our [GitHub repository](https://github.com/meilisearch/meilisearch/issues/new/choose).
We aim to provide global language support, and your feedback helps us move closer to that goal. If you notice inconsistencies in your search results or the way your documents are processed, please [open an issue in our tokenizer repo](https://github.com/meilisearch/charabia/issues/new).

If you'd like to learn more about how different languages are processed in Meilisearch, see our [tokenizer documentation](/learn/advanced/tokenization.md).
[Read more about our tokenizer](/learn/advanced/tokenization.md)

## Improving our language support

While we have employees from all over the world at Meilisearch, we don't speak every language. In fact, we rely almost entirely on feedback from external contributors to know how our engine is performing across different languages.
While we have employees from all over the world at Meilisearch, we don't speak every language. We rely almost entirely on feedback from external contributors to understand how our engine is performing across different languages.

If you'd like to help us create a more global Meilisearch, please consider sharing your tests, results, and general feedback with us through [GitHub issues](https://github.com/meilisearch/Meilisearch/issues). Here are some of the languages that have been requested by users and their corresponding issue:
If you'd like to request optimized support for a language that we don't currently support, please upvote the related [discussion in our product repository](https://github.com/meilisearch/product/discussions?discussions_q=label%3Aproduct%3Acore%3Atokenizer) or [open a new one](https://github.com/meilisearch/product/discussions/new) if it doesn't exist.

- [Arabic](https://github.com/meilisearch/meilisearch/issues/554)
- [Lao](https://github.com/meilisearch/meilisearch/issues/563)
- [Persian/Farsi](https://github.com/meilisearch/meilisearch/issues/553)
- [Thai](https://github.com/meilisearch/meilisearch/issues/864)

If you'd like us to add or improve support for a language that isn't in the above list, please create an [issue](https://github.com/meilisearch/meilisearch/issues/new?assignees=&labels=&template=feature_request.md&title=) saying so, and then make a [pull request on the documentation](https://github.com/meilisearch/documentation/edit/master/reference/features/language.md) to add it to the above list.
If you'd like to help by developing a tokenizer pipeline yourself: first of all, thank you! We recommend that you take a look at the [tokenizer contribution guide](https://github.com/meilisearch/charabia/blob/main/CONTRIBUTING.md) before making a PR.

## FAQ

### What do you mean when you say Meilisearch offers _optimized_ support for a language?

Under the hood, Meilisearch relies on tokenizers that identify the most important parts of each document in a given dataset. We currently use two tokenization pipelines: one for languages that separate words with spaces and one specifically tailored for Chinese. Languages that delimit their words in other ways will still work, but the quality and relevancy of search results may vary significantly.
Under the hood, Meilisearch relies on tokenizers that identify the most important parts of each document in a given dataset. We currently use four tokenization pipelines:

- A default pipeline designed for languages that separate words with spaces
- A pipeline specifically tailored for Chinese
- A pipeline specifically tailored for Japanese
- A pipeline specifically tailored for Hebrew

### My language does not use whitespace to separate words. Can I still use Meilisearch?

Yes, but your experience might not be optimized and results might be less relevant than in whitespace-separated languages and Chinese.
Yes, but search results might be less relevant than in one of the fully optimized languages.

### My language does not use the Roman alphabet. Can I still use Meilisearch?

Yes—our users work with many different alphabets and writing systems such as Cyrillic, Thai, and Japanese.
Yes—our users work with many different alphabets and writing systems, such as Cyrillic, Thai, and Japanese.

### Does Meilisearch plan to support additional languages in the future?

Yes, we definitely do. The more feedback we get from native speakers, the easier it is for us to understand how to improve performance for those languages—and the more requests to improve support for a specific language, the more likely we are to devote resources to that project.
Yes, we definitely do. The more [feedback](https://github.com/meilisearch/product/discussions?discussions_q=label%3Aproduct%3Acore%3Atokenizer) we get from native speakers, the easier it is for us to understand how to improve performance for those languages. Similarly, the more requests we get to improve support for a specific language, the more likely we are to devote resources to that project.