Stemming

Stemming is the process of reducing a word to its root form. This ensures variants of a word match during a search.

For example, walking and walked can be stemmed to the same root word: walk. Once stemmed, an occurrence of either word would match the other in a search.

Stemming is language-dependent but often involves removing prefixes and suffixes from words.

In some cases, the root form of a stemmed word may not be a real word. For example, jumping and jumpiness can both be stemmed to jumpi. While jumpi isn’t a real English word, it doesn’t matter for search; if all variants of a word are reduced to the same root form, they will match correctly.
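
For example, you can see stemming in action by running text through the _analyze API with a stemmer token filter. This is a minimal sketch using the built-in porter_stem filter; the exact output depends on the stemmer you choose:

  GET /_analyze
  {
    "tokenizer": "standard",
    "filter": [ "porter_stem" ],
    "text": "walking and walked"
  }

With the Porter stemmer, both walking and walked are reduced to the same token, walk.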

Stemmer token filters

In Elasticsearch, stemming is handled by stemmer token filters. These token filters can be categorized based on how they stem words:

  • Algorithmic stemmers, which stem words based on a set of rules
  • Dictionary stemmers, which stem words by looking them up in a dictionary

Because stemming changes tokens, we recommend using the same stemmer token filters during index and search analysis.
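
A straightforward way to follow this recommendation is to define one custom analyzer that includes the stemmer and reference it from the field mapping; unless you configure a separate search_analyzer, the same analyzer is used at both index and search time. The index, analyzer, and field names below are illustrative:

  PUT /my-index
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "my_stemming_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [ "lowercase", "porter_stem" ]
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "body": {
          "type": "text",
          "analyzer": "my_stemming_analyzer"
        }
      }
    }
  }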

Algorithmic stemmers

Algorithmic stemmers apply a series of rules to each word to reduce it to its root form. For example, an algorithmic stemmer for English may remove the -s and -es suffixes from the end of plural words.

Algorithmic stemmers have a few advantages:

  • They require little setup and usually work well out of the box.
  • They use little memory.
  • They are typically faster than dictionary stemmers.

However, most algorithmic stemmers only alter the existing text of a word. This means they may not work well with irregular words that don’t contain their root form, such as:

  • be, are, and am
  • mouse and mice
  • foot and feet

The following token filters use algorithmic stemming; a configuration sketch follows the list:

  • stemmer, which provides algorithmic stemming for several languages, some with additional variants.
  • kstem, a stemmer for English that combines algorithmic stemming with a built-in dictionary.
  • porter_stem, our recommended algorithmic stemmer for English.
  • snowball, which uses Snowball-based stemming rules for several languages.
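
For example, the stemmer filter takes a language parameter that selects the rule set to apply. The following sketch wires an English stemmer into a custom analyzer; the index, filter, and analyzer names are placeholders, and the supported language values are listed in the stemmer filter's documentation:

  PUT /my-index
  {
    "settings": {
      "analysis": {
        "filter": {
          "english_stemmer": {
            "type": "stemmer",
            "language": "english"
          }
        },
        "analyzer": {
          "my_english_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [ "lowercase", "english_stemmer" ]
          }
        }
      }
    }
  }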

Dictionary stemmers

Dictionary stemmers look up words in a provided dictionary, replacing unstemmed word variants with stemmed words from the dictionary.

In theory, dictionary stemmers are well suited for:

  • Stemming irregular words
  • Discerning between words that are spelled similarly but not related conceptually, such as:

    • organ and organization
    • broker and broken

In practice, algorithmic stemmers typically outperform dictionary stemmers. This is because dictionary stemmers have the following disadvantages:

  • Dictionary quality
    A dictionary stemmer is only as good as its dictionary. To work well, these dictionaries must include a significant number of words, be updated regularly, and change with language trends. Often, by the time a dictionary has been made available, it’s incomplete and some of its entries are already outdated.
  • Size and performance
    A dictionary stemmer must load all words, prefixes, and suffixes from its dictionary into memory. This can use a significant amount of RAM. Low-quality dictionaries may also be less efficient with prefix and suffix removal, which can slow the stemming process significantly.

You can use the hunspell token filter to perform dictionary stemming.

If available, we recommend trying an algorithmic stemmer for your language before using the hunspell token filter.
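
If you do use it, the hunspell filter is configured with a locale and relies on Hunspell dictionary files (.aff and .dic) being installed on every node, by default in a hunspell directory inside the config directory with one subdirectory per locale. A minimal sketch, assuming an en_US dictionary is installed and using illustrative names:

  PUT /my-index
  {
    "settings": {
      "analysis": {
        "filter": {
          "my_dictionary_stemmer": {
            "type": "hunspell",
            "locale": "en_US"
          }
        },
        "analyzer": {
          "my_hunspell_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [ "lowercase", "my_dictionary_stemmer" ]
          }
        }
      }
    }
  }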

Control stemming

Sometimes stemming can produce shared root words that are spelled similarly but not related conceptually. For example, a stemmer may reduce both skies and skiing to the same root word: ski.

To prevent this and better control stemming, you can use the following token filters; a configuration sketch follows the list:

  • stemmer_override, which lets you define rules for stemming specific tokens.
  • keyword_marker, which marks specified tokens as keywords. Keyword tokens are not stemmed by subsequent stemmer token filters.
  • conditional, which can be used to mark tokens as keywords, similar to the keyword_marker filter.
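
For example, stemmer_override rules and keyword_marker exceptions take effect when those filters are placed before the stemmer in the analyzer's filter chain. In this sketch (the names, rule, and keyword are illustrative), skies is rewritten to sky and skiing is protected from further stemming:

  PUT /my-index
  {
    "settings": {
      "analysis": {
        "filter": {
          "my_overrides": {
            "type": "stemmer_override",
            "rules": [ "skies => sky" ]
          },
          "my_protected_words": {
            "type": "keyword_marker",
            "keywords": [ "skiing" ]
          }
        },
        "analyzer": {
          "my_controlled_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [ "lowercase", "my_overrides", "my_protected_words", "porter_stem" ]
          }
        }
      }
    }
  }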

For built-in language analyzers, you can also use the stem_exclusion parameter to specify a list of words that won’t be stemmed.
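
For example, the built-in english analyzer accepts stem_exclusion; the index name, analyzer name, and word list here are illustrative:

  PUT /my-index
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "english_with_exclusions": {
            "type": "english",
            "stem_exclusion": [ "organization", "organizations" ]
          }
        }
      }
    }
  }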