IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.

« Unique Token Filter Pattern Replace Token Filter »

› › ›

Pattern Capture Token Filter

edit

IMPORTANT: This documentation is no longer updated. Refer to Elastic's version policy and the latest documentation.

Pattern Capture Token Filter

edit

The pattern_capture token filter, unlike the pattern tokenizer, emits a token for every capture group in the regular expression. Patterns are not anchored to the beginning and end of the string, so each pattern can match multiple times, and matches are allowed to overlap.

Beware of Pathological Regular Expressions

The pattern capture token filter uses Java Regular Expressions.

A badly written regular expression could run very slowly or even throw a StackOverflowError and cause the node it is running on to exit suddenly.

For instance a pattern like :

"(([a-z]+)(\d*))"

when matched against:

"abc123def456"

would produce the tokens: [ abc123, abc, 123, def456, def, 456 ]

If preserve_original is set to true (the default) then it would also emit the original token: abc123def456.

This is particularly useful for indexing text like camel-case code, eg stripHTML where a user may search for "strip html" or "striphtml":

PUT test
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "code" : {
               "type" : "pattern_capture",
               "preserve_original" : true,
               "patterns" : [
                  "(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)",
                  "(\\d+)"
               ]
            }
         },
         "analyzer" : {
            "code" : {
               "tokenizer" : "pattern",
               "filter" : [ "code", "lowercase" ]
            }
         }
      }
   }
}

Copy as curl Try in Elastic

When used to analyze the text

import static org.apache.commons.lang.StringEscapeUtils.escapeHtml

this emits the tokens: [ import, static, org, apache, commons, lang, stringescapeutils, string, escape, utils, escapehtml, escape, html ]

Another example is analyzing email addresses:

PUT test
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "email" : {
               "type" : "pattern_capture",
               "preserve_original" : true,
               "patterns" : [
                  "([^@]+)",
                  "(\\p{L}+)",
                  "(\\d+)",
                  "@(.+)"
               ]
            }
         },
         "analyzer" : {
            "email" : {
               "tokenizer" : "uax_url_email",
               "filter" : [ "email", "lowercase",  "unique" ]
            }
         }
      }
   }
}

Copy as curl Try in Elastic

When the above analyzer is used on an email address like:

john-smith_123@foo-bar.com

it would produce the following tokens:

john-smith_123@foo-bar.com, john-smith_123,
john, smith, 123, foo-bar.com, foo, bar, com

Multiple patterns are required to allow overlapping captures, but also means that patterns are less dense and easier to understand.

Note: All tokens are emitted in the same position, and with the same character offsets, so when combined with highlighting, the whole original token will be highlighted, not just the matching subset. For instance, querying the above email address for "smith" would highlight:

  <em>john-smith_123@foo-bar.com</em>

not:

  john-<em>smith</em>_123@foo-bar.com

« Unique Token Filter Pattern Replace Token Filter »

Was this helpful?

Feedback

The Search AI Company

ELK Stack

Elastic Cloud

Generative AI

Search

Security

Observability

By solution

Industries

Customer spotlight

Research

Build

Learn

Connect

Pattern Capture Token Filter

Pattern Capture Token Filter

Follow us

About us

Join us

Partners

Trust & Security

Investor relations

Excellence Awards

About us

Join us

Partners

Trust & Security

Investor relations

Excellence Awards