ICU analysis plugin

The ICU Analysis plugin integrates the Lucene ICU module into {es}, adding extended Unicode support using the ICU libraries, including better analysis of Asian languages, Unicode normalization, Unicode-aware case folding, collation support, and transliteration.

Important
ICU analysis and backwards compatibility

From time to time, the ICU library receives updates such as adding new characters and emojis, and improving collation (sort) orders. These changes may or may not affect search and sort orders, depending on which character sets you are using.

While we restrict ICU upgrades to major versions, you may find that an index created in the previous major version will need to be reindexed in order to return correct (and correctly ordered) results, and to take advantage of new characters.

ICU analyzer

The icu_analyzer analyzer performs basic normalization, tokenization and character folding, using the icu_normalizer character filter, icu_tokenizer tokenizer and icu_folding token filter.

The following parameters are accepted:

method

Normalization method. Accepts nfkc, nfc or nfkc_cf (default).

mode

Normalization mode. Accepts compose (default) or decompose.
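
For example, a minimal sketch of an analyzer that uses these parameters (the index name icu_sample matches the other examples in this document):

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "type": "icu_analyzer",
            "method": "nfkc",
            "mode": "decompose"
          }
        }
      }
    }
  }
}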

ICU normalization character filter

Normalizes characters using a specified Unicode normalization form. It registers itself as the icu_normalizer character filter, which is available to all indices without any further configuration. The type of normalization can be specified with the name parameter, which accepts nfc, nfkc, and nfkc_cf (default). Set the mode parameter to decompose to convert nfc to nfd or nfkc to nfkd, respectively.

Which letters are normalized can be controlled by specifying the unicode_set_filter parameter, which accepts a UnicodeSet.

Here are two examples: the default usage and a customized character filter:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": { (1)
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "icu_normalizer"
            ]
          },
          "nfd_normalized": { (2)
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "nfd_normalizer"
            ]
          }
        },
        "char_filter": {
          "nfd_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc",
            "mode": "decompose"
          }
        }
      }
    }
  }
}
  1. Uses the default nfkc_cf normalization.

  2. Uses the customized nfd_normalizer character filter, which is set to use nfc normalization with decomposition.
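
To see the effect of the default normalization, you can run a quick _analyze request. As a hedged illustration, nfkc_cf maps the single Roman-numeral character Ⅸ (U+2168) to the two plain letters ix:

GET icu_sample/_analyze
{
  "analyzer": "nfkc_cf_normalized",
  "text": "Ⅸ quick"
}

This should return the tokens ix and quick.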

ICU tokenizer

Tokenizes text into words on word boundaries, as defined in UAX #29: Unicode Text Segmentation. It behaves much like the {ref}/analysis-standard-tokenizer.html[standard tokenizer], but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer text into syllables.

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
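
As a quick, illustrative check of the dictionary-based word breaking (the Thai sample text is an assumption chosen for this example), you can analyze a Thai greeting, which is written without spaces between words:

GET icu_sample/_analyze
{
  "analyzer": "my_icu_analyzer",
  "text": "สวัสดีครับ"
}

The tokenizer should split this into the words สวัสดี and ครับ.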
Rules customization

This functionality is marked as experimental in Lucene.

You can customize the icu_tokenizer behavior by specifying per-script rule files; see the RBBI rules syntax reference for a more detailed explanation.

To add icu tokenizer rules, set the rule_files setting, which should contain a comma-separated list of code:rulefile pairs in the following format: a four-letter ISO 15924 script code, followed by a colon, then a rule file name. Rule files are placed in the ES_HOME/config directory.

As a demonstration of how the rule files can be used, save the following user file to $ES_HOME/config/KeywordTokenizer.rbbi:

.+ {200};

This rule treats the entire input as a single token. Then create an analyzer to use this rule file as follows:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "icu_user_file": {
            "type": "icu_tokenizer",
            "rule_files": "Latn:KeywordTokenizer.rbbi"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "icu_user_file"
          }
        }
      }
    }
  }
}

GET icu_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Elasticsearch. Wow!"
}

The above analyze request returns the following:

{
   "tokens": [
      {
         "token": "Elasticsearch. Wow!",
         "start_offset": 0,
         "end_offset": 19,
         "type": "<ALPHANUM>",
         "position": 0
      }
   ]
}

ICU normalization token filter

Normalizes characters using a specified Unicode normalization form. It registers itself as the icu_normalizer token filter, which is available to all indices without any further configuration. The type of normalization can be specified with the name parameter, which accepts nfc, nfkc, and nfkc_cf (default).

Which letters are normalized can be controlled by specifying the unicode_set_filter parameter, which accepts a UnicodeSet.

You should probably prefer the ICU normalization character filter instead.

Here are two examples: the default usage and a customized token filter:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": { (1)
            "tokenizer": "icu_tokenizer",
            "filter": [
              "icu_normalizer"
            ]
          },
          "nfc_normalized": { (2)
            "tokenizer": "icu_tokenizer",
            "filter": [
              "nfc_normalizer"
            ]
          }
        },
        "filter": {
          "nfc_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc"
          }
        }
      }
    }
  }
}
  1. Uses the default nfkc_cf normalization.

  2. Uses the customized nfc_normalizer token filter, which is set to use nfc normalization.
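
As a brief, hedged illustration, nfkc_cf also folds compatibility characters such as fullwidth Latin letters, so analyzing fullwidth text should yield a plain lowercase ASCII token:

GET icu_sample/_analyze
{
  "analyzer": "nfkc_cf_normalized",
  "text": "ＥＬＡＳＴＩＣ"
}

This should return the single token elastic.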

ICU folding token filter

Case folding of Unicode characters based on UTR#30, like the {ref}/analysis-asciifolding-tokenfilter.html[ASCII-folding token filter] on steroids. It registers itself as the icu_folding token filter and is available to all indices:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "folded": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "icu_folding"
            ]
          }
        }
      }
    }
  }
}
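
For example (a hedged illustration using the folded analyzer above), accented and uppercase characters should fold to their plain ASCII equivalents:

GET icu_sample/_analyze
{
  "analyzer": "folded",
  "text": "Besançon"
}

This should return the single token besancon.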

The ICU folding token filter already performs Unicode normalization, so there is no need to also use the ICU normalization character filter or token filter.

Which letters are folded can be controlled by specifying the unicode_set_filter parameter, which accepts a UnicodeSet.

The following example exempts Swedish characters from folding. Note that both uppercase and lowercase forms must be specified, and that these exempted characters are not lowercased, which is why we add the lowercase filter as well:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "swedish_analyzer": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "swedish_folding",
              "lowercase"
            ]
          }
        },
        "filter": {
          "swedish_folding": {
            "type": "icu_folding",
            "unicode_set_filter": "[^åäöÅÄÖ]"
          }
        }
      }
    }
  }
}
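
To verify the exemption (again a hedged illustration), analyzing a Swedish name should preserve the å and ö characters while the lowercase filter still lowercases them:

GET icu_sample/_analyze
{
  "analyzer": "swedish_analyzer",
  "text": "Åke Öberg"
}

This should return the tokens åke and öberg rather than ake and oberg.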

ICU collation token filter

Warning

This token filter has been deprecated since Lucene 5.0. Please use the ICU collation keyword field instead.

ICU collation keyword field

Collations are used for sorting documents in a language-specific word order. The icu_collation_keyword field type is available to all indices and will encode the terms directly as bytes in a doc values field and a single indexed token just like a standard {ref}/keyword.html[Keyword Field].

Defaults to using {defguide}/sorting-collations.html#uca[DUCET collation], which is a best-effort attempt at language-neutral sorting.

Below is an example of how to set up a field for sorting German names in "phonebook" order:

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "name": {   (1)
        "type": "text",
        "fields": {
          "sort": {  (2)
            "type": "icu_collation_keyword",
            "index": false,
            "language": "de",
            "country": "DE",
            "variant": "@collation=phonebook"
          }
        }
      }
    }
  }
}

GET /my-index-000001/_search (3)
{
  "query": {
    "match": {
      "name": "Fritz"
    }
  },
  "sort": "name.sort"
}
  1. The name field uses the standard analyzer, and so supports full text queries.

  2. The name.sort field is an icu_collation_keyword field that preserves the name as a single-token doc value and applies the German "phonebook" collation order.

  3. An example query which searches the name field and sorts on the name.sort field.

Parameters for ICU collation keyword fields

The following parameters are accepted by icu_collation_keyword fields:

doc_values

Should the field be stored on disk in a column-stride fashion, so that it can later be used for sorting, aggregations, or scripting? Accepts true (default) or false.

index

Should the field be searchable? Accepts true (default) or false.

null_value

Accepts a string value which is substituted for any explicit null values. Defaults to null, which means the field is treated as missing.

{ref}/ignore-above.html[ignore_above]

Strings longer than the ignore_above setting will be ignored. Checking is performed on the original string before the collation. The ignore_above setting can be updated on existing fields using the {ref}/indices-put-mapping.html[PUT mapping API]. By default, there is no limit and all values will be indexed.

store

Whether the field value should be stored and retrievable separately from the {ref}/mapping-source-field.html[_source] field. Accepts true or false (default).

fields

Multi-fields allow the same string value to be indexed in multiple ways for different purposes, such as one field for search and a multi-field for sorting and aggregations.
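
As a hedged sketch that combines several of these parameters (the index name and values are illustrative), a field used only for sorting and aggregations might be mapped like this:

PUT my-index-000002
{
  "mappings": {
    "properties": {
      "name": {
        "type": "icu_collation_keyword",
        "index": false,
        "doc_values": true,
        "ignore_above": 256,
        "language": "en"
      }
    }
  }
}

With index set to false the field is not searchable, but its doc values remain available for sorting, and values longer than 256 characters are skipped.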

Collation options
strength

The strength property determines the minimum level of difference considered significant during comparison. Possible values are: primary, secondary, tertiary, quaternary or identical. See the ICU collation documentation for a more detailed explanation of each value. Defaults to tertiary unless otherwise specified in the collation.

decomposition

Possible values: no (default, but collation-dependent) or canonical. Setting this decomposition property to canonical allows the Collator to handle unnormalized text properly, producing the same results as if the text were normalized. If no is set, it is the user’s responsibility to ensure that all text is already in the appropriate form before a comparison or before getting a CollationKey. Adjusting decomposition mode allows the user to select between faster and more complete collation behavior. Since a great many of the world’s languages do not require text normalization, most locales set no as the default decomposition mode.

The following options are expert only:

alternate

Possible values: shifted or non-ignorable. Sets the alternate handling for strength quaternary to be either shifted or non-ignorable, which boils down to ignoring punctuation and whitespace.

case_level

Possible values: true or false (default). Whether case level sorting is required. When strength is set to primary this will ignore accent differences.

case_first

Possible values: lower or upper. Useful to control which case is sorted first when the case is not ignored for strength tertiary. The default depends on the collation.

numeric

Possible values: true or false (default). Whether digits are sorted according to their numeric representation. For example, the value egg-9 is sorted before the value egg-21 (see the sketch after this list).

variable_top

Single character or contraction. Controls what is variable for alternate.

hiragana_quaternary_mode

Possible values: true or false. Whether to distinguish between Katakana and Hiragana characters at quaternary strength.
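
As a hedged sketch of how these options combine (the index and field names are illustrative), the following mapping sorts case- and accent-insensitively at primary strength and treats embedded digits numerically, so egg-9 sorts before egg-21:

PUT icu_collation_sample
{
  "mappings": {
    "properties": {
      "label": {
        "type": "icu_collation_keyword",
        "language": "en",
        "strength": "primary",
        "numeric": true
      }
    }
  }
}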

ICU transform token filter

Transforms are used to process Unicode text in many different ways, such as case mapping, normalization, transliteration and bidirectional text handling.

You can define which transformation you want to apply with the id parameter (defaults to Null), and specify the text direction with the dir parameter, which accepts forward (default) for LTR and reverse for RTL. Custom rulesets are not yet supported.

For example:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "latin": {
            "tokenizer": "keyword",
            "filter": [
              "myLatinTransform"
            ]
          }
        },
        "filter": {
          "myLatinTransform": {
            "type": "icu_transform",
            "id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC" (1)
          }
        }
      }
    }
  }
}

GET icu_sample/_analyze
{
  "analyzer": "latin",
  "text": "你好" (2)
}

GET icu_sample/_analyze
{
  "analyzer": "latin",
  "text": "здравствуйте" (3)
}

GET icu_sample/_analyze
{
  "analyzer": "latin",
  "text": "こんにちは" (4)
}
  1. This transform transliterates characters to Latin, separates accents from their base characters, removes the accents, and then puts the remaining text into an unaccented form.

  2. Returns ni hao.

  3. Returns zdravstvujte.

  4. Returns kon’nichiha.

For more documentation, please see the ICU user guide on transforms.