Stop token filter

Removes stop words from a token stream.

When not customized, the filter removes the following English stop words:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with

In addition to English, the stop filter supports predefined stop word lists for several languages. You can also specify your own stop words as an array or file.

The stop filter uses Lucene’s StopFilter.

Example

The following analyze API request uses the stop filter to remove the stop words "a" and "the" from the text "a quick fox jumps over the lazy dog":

Python:
resp = client.indices.analyze(
    tokenizer="standard",
    filter=[
        "stop"
    ],
    text="a quick fox jumps over the lazy dog",
)
print(resp)

Ruby:
response = client.indices.analyze(
  body: {
    tokenizer: 'standard',
    filter: [
      'stop'
    ],
    text: 'a quick fox jumps over the lazy dog'
  }
)
puts response

JavaScript:
const response = await client.indices.analyze({
  tokenizer: "standard",
  filter: ["stop"],
  text: "a quick fox jumps over the lazy dog",
});
console.log(response);

Console:
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "stop" ],
  "text": "a quick fox jumps over the lazy dog"
}

The filter produces the following tokens:

[ quick, fox, jumps, over, lazy, dog ]
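
To make the behavior concrete, the following pure-Python sketch mimics what the request above does conceptually. It is an illustration under stated assumptions, not Elasticsearch's implementation: the function name remove_stop_words is hypothetical, and str.split() stands in for the standard tokenizer, which is only a fair approximation for simple, punctuation-free text like this example.

```python
# Default English stop words, copied from the list earlier on this page.
DEFAULT_ENGLISH_STOP_WORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def remove_stop_words(tokens, stop_words=DEFAULT_ENGLISH_STOP_WORDS):
    """Return the tokens that are not in stop_words (exact, case-sensitive match)."""
    return [token for token in tokens if token not in stop_words]


# str.split() is a stand-in for the standard tokenizer (an assumption that
# only holds for simple text without punctuation).
tokens = "a quick fox jumps over the lazy dog".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['quick', 'fox', 'jumps', 'over', 'lazy', 'dog']
```

Note that "over" survives because it is not in the default English list.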

Add to an analyzer

The following create index API request uses the stop filter to configure a new custom analyzer.

Python:
resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "whitespace",
                    "filter": [
                        "stop"
                    ]
                }
            }
        }
    },
)
print(resp)

Ruby:
response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'whitespace',
            filter: [
              'stop'
            ]
          }
        }
      }
    }
  }
)
puts response

JavaScript:
const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_analyzer: {
          tokenizer: "whitespace",
          filter: ["stop"],
        },
      },
    },
  },
});
console.log(response);

Console:
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "stop" ]
        }
      }
    }
  }
}
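
One way to check a custom analyzer after creating the index is to run sample text through it with the analyze API. The sketch below assumes an Elasticsearch Python client instance like the one in the examples above, and that the response exposes a "tokens" list whose entries carry a "token" field. The helper name analyze_tokens is hypothetical.

```python
def analyze_tokens(client, index, analyzer, text):
    """Run text through an index's analyzer and return the token strings.

    `client` is assumed to be an Elasticsearch Python client instance,
    like the one used in the examples on this page.
    """
    resp = client.indices.analyze(index=index, analyzer=analyzer, text=text)
    # Each entry in resp["tokens"] describes one token; "token" is its text.
    return [entry["token"] for entry in resp["tokens"]]


# Usage against a running cluster (not executed here):
# tokens = analyze_tokens(client, "my-index-000001", "my_analyzer",
#                         "a quick fox jumps over the lazy dog")
```

Because my_analyzer uses the whitespace tokenizer and the stop filter is case sensitive by default, a capitalized token such as "The" should pass through unchanged unless ignore_case is enabled.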

Configurable parameters

stopwords

(Optional, string or array of strings) Language value, such as _arabic_ or _thai_. Defaults to _english_.

Each language value corresponds to a predefined list of stop words in Lucene. See Stop words by language for supported language values and their stop words.

Also accepts an array of stop words.

For an empty list of stop words, use _none_.

stopwords_path

(Optional, string) Path to a file that contains a list of stop words to remove.

This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each stop word in the file must be separated by a line break.

ignore_case

(Optional, Boolean) If true, stop word matching is case insensitive. For example, if true, a stop word of "the" matches and removes "The", "THE", or "the". Defaults to false.

remove_trailing

(Optional, Boolean) If true, the last token of a stream is removed if it’s a stop word. Defaults to true.

Set this parameter to false when using the filter with a completion suggester. This ensures that a query like "green a" matches and suggests "green apple" while still removing other stop words.
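
The effect of ignore_case and remove_trailing can be sketched in plain Python. This is a simplified model based only on the parameter descriptions above, not the Lucene implementation. The function name stop_filter and the two-word stop list are assumptions made for illustration.

```python
def stop_filter(tokens, stop_words, ignore_case=False, remove_trailing=True):
    """Simplified model of the stop filter's ignore_case and remove_trailing options."""
    if ignore_case:
        stop_set = {word.lower() for word in stop_words}
    else:
        stop_set = set(stop_words)

    def is_stop_word(token):
        candidate = token.lower() if ignore_case else token
        return candidate in stop_set

    last_position = len(tokens) - 1
    kept = []
    for position, token in enumerate(tokens):
        # With remove_trailing=False, the final token always survives,
        # even when it is a stop word.
        if position == last_position and not remove_trailing:
            kept.append(token)
        elif not is_stop_word(token):
            kept.append(token)
    return kept


stop_words = ["a", "the"]  # small illustrative list, not the full default

print(stop_filter(["The", "green", "apple"], stop_words))
# ['The', 'green', 'apple']  (case-sensitive match keeps "The")
print(stop_filter(["The", "green", "apple"], stop_words, ignore_case=True))
# ['green', 'apple']
print(stop_filter(["green", "a"], stop_words))
# ['green']
print(stop_filter(["green", "a"], stop_words, remove_trailing=False))
# ['green', 'a']  (the trailing "a" is kept for the suggester)
```

The last call mirrors the completion suggester case: the partial word "a" at the end of the query is kept so it can still match "apple".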

Customize

To customize the stop filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following request creates a custom case-insensitive stop filter that removes stop words from the _english_ stop words list:

Python:
resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "default": {
                    "tokenizer": "whitespace",
                    "filter": [
                        "my_custom_stop_words_filter"
                    ]
                }
            },
            "filter": {
                "my_custom_stop_words_filter": {
                    "type": "stop",
                    "ignore_case": True
                }
            }
        }
    },
)
print(resp)

Ruby:
response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          default: {
            tokenizer: 'whitespace',
            filter: [
              'my_custom_stop_words_filter'
            ]
          }
        },
        filter: {
          my_custom_stop_words_filter: {
            type: 'stop',
            ignore_case: true
          }
        }
      }
    }
  }
)
puts response

JavaScript:
const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        default: {
          tokenizer: "whitespace",
          filter: ["my_custom_stop_words_filter"],
        },
      },
      filter: {
        my_custom_stop_words_filter: {
          type: "stop",
          ignore_case: true,
        },
      },
    },
  },
});
console.log(response);

Console:
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "my_custom_stop_words_filter" ]
        }
      },
      "filter": {
        "my_custom_stop_words_filter": {
          "type": "stop",
          "ignore_case": true
        }
      }
    }
  }
}

You can also specify your own list of stop words. For example, the following request creates a custom case-insensitive stop filter that removes only the stop words "and", "is", and "the":

Python:
resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "default": {
                    "tokenizer": "whitespace",
                    "filter": [
                        "my_custom_stop_words_filter"
                    ]
                }
            },
            "filter": {
                "my_custom_stop_words_filter": {
                    "type": "stop",
                    "ignore_case": True,
                    "stopwords": [
                        "and",
                        "is",
                        "the"
                    ]
                }
            }
        }
    },
)
print(resp)

Ruby:
response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          default: {
            tokenizer: 'whitespace',
            filter: [
              'my_custom_stop_words_filter'
            ]
          }
        },
        filter: {
          my_custom_stop_words_filter: {
            type: 'stop',
            ignore_case: true,
            stopwords: [
              'and',
              'is',
              'the'
            ]
          }
        }
      }
    }
  }
)
puts response

JavaScript:
const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        default: {
          tokenizer: "whitespace",
          filter: ["my_custom_stop_words_filter"],
        },
      },
      filter: {
        my_custom_stop_words_filter: {
          type: "stop",
          ignore_case: true,
          stopwords: ["and", "is", "the"],
        },
      },
    },
  },
});
console.log(response);

Console:
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "my_custom_stop_words_filter" ]
        }
      },
      "filter": {
        "my_custom_stop_words_filter": {
          "type": "stop",
          "ignore_case": true,
          "stopwords": [ "and", "is", "the" ]
        }
      }
    }
  }
}
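
The same three stop words could instead be kept in a file and referenced with the stopwords_path parameter. The sketch below writes a file in the format described for that parameter (UTF-8, one stop word per line) and builds a matching filter definition. The helper names, the file name my_stopwords.txt, and the relative path analysis/my_stopwords.txt are all hypothetical. A temporary directory is used only to demonstrate the file format.

```python
import tempfile
from pathlib import Path


def write_stop_words_file(path, stop_words):
    """Write stop words to path as UTF-8 text, one word per line."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for word in stop_words:
            handle.write(word + "\n")


def stop_filter_from_file(stopwords_path, ignore_case=False):
    """Build a stop filter definition that reads its words from a file."""
    return {
        "type": "stop",
        "ignore_case": ignore_case,
        "stopwords_path": stopwords_path,
    }


# Demonstrate the file format in a temporary directory; a real file would
# live at an absolute path or under the Elasticsearch config location.
with tempfile.TemporaryDirectory() as tmp_dir:
    stop_words_file = Path(tmp_dir) / "my_stopwords.txt"  # hypothetical name
    write_stop_words_file(stop_words_file, ["and", "is", "the"])
    file_contents = stop_words_file.read_text(encoding="utf-8")

print(repr(file_contents))  # 'and\nis\nthe\n'

# "analysis/my_stopwords.txt" is a hypothetical path relative to the config location.
filter_definition = stop_filter_from_file("analysis/my_stopwords.txt", ignore_case=True)
print(filter_definition)
```

The resulting dictionary would take the place of the my_custom_stop_words_filter definition in the settings shown above.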

Stop words by language

The following list contains supported language values for the stopwords parameter, each followed by the name of its predefined stop word list in Lucene.

_armenian_
Armenian stop words
_bengali_
Bengali stop words
_brazilian_ (Brazilian Portuguese)
Brazilian Portuguese stop words
_bulgarian_
Bulgarian stop words
_catalan_
Catalan stop words
_cjk_ (Chinese, Japanese, and Korean)
CJK stop words
_english_
English stop words
_estonian_
Estonian stop words
_finnish_
Finnish stop words
_galician_
Galician stop words
_hungarian_
Hungarian stop words
_indonesian_
Indonesian stop words
_italian_
Italian stop words
_latvian_
Latvian stop words
_lithuanian_
Lithuanian stop words
_norwegian_
Norwegian stop words
_persian_
Persian stop words
_portuguese_
Portuguese stop words
_romanian_
Romanian stop words
_russian_
Russian stop words
_serbian_
Serbian stop words
_spanish_
Spanish stop words
_swedish_
Swedish stop words
_turkish_
Turkish stop words
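
Any language value from this list can be passed as the stopwords parameter. As a minimal sketch, the helper below builds index settings for an analyzer that removes Spanish stop words. The function name language_stop_settings and the filter name my_language_stop are hypothetical. The underscore check only guards against passing a bare language name and does not confirm that the value is supported.

```python
def language_stop_settings(language_value, filter_name="my_language_stop"):
    """Build index settings for an analyzer that removes a language's stop words.

    language_value is a predefined value such as "_spanish_".
    filter_name is a hypothetical name chosen for this sketch.
    """
    wrapped = (
        len(language_value) > 2
        and language_value.startswith("_")
        and language_value.endswith("_")
    )
    if not wrapped:
        raise ValueError("language values are wrapped in underscores, e.g. _spanish_")
    return {
        "analysis": {
            "analyzer": {
                "default": {
                    "tokenizer": "whitespace",
                    "filter": [filter_name],
                }
            },
            "filter": {
                filter_name: {"type": "stop", "stopwords": language_value}
            },
        }
    }


settings = language_stop_settings("_spanish_")
print(settings["analysis"]["filter"])
# {'my_language_stop': {'type': 'stop', 'stopwords': '_spanish_'}}
```

The returned dictionary has the same shape as the settings argument passed to client.indices.create in the Python examples on this page.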