WARNING: Version 6.1 of Elasticsearch has passed its EOL date.

This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.

« Character Filters Mapping Char Filter »

› › ›

HTML Strip Char Filter

edit

HTML Strip Char Filter

edit

The html_strip character filter strips HTML elements from the text and replaces HTML entities with their decoded value (e.g. replacing & with &).

Example output

edit

POST _analyze
{
  "tokenizer":      "keyword", 
  "char_filter":  [ "html_strip" ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

The keyword tokenizer returns a single term.

The above example returns the term:

[ \nI'm so happy!\n ]

The same example with the standard tokenizer would return the following terms:

[ I'm, so, happy ]

Configuration

edit

The html_strip character filter accepts the following parameter:

escaped_tags

An array of HTML tags which should not be stripped from the original text.

Example configuration

edit

In this example, we configure the html_strip character filter to leave <b> tags in place:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

The above example produces the following term:

[ \nI'm so <b>happy</b>!\n ]

« Character Filters Mapping Char Filter »