Pattern replace character filter

The pattern_replace character filter uses a regular expression to match characters which should be replaced with the specified replacement string. The replacement string can refer to capture groups in the regular expression.
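
As a rough analogy, here is the same idea in Python's re module (Elasticsearch itself evaluates the pattern with Java regular expressions, so some syntax details differ):

import re

# "$1" in the filter's replacement string names the first capture group;
# Python spells the same back-reference as "\1".
print(re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world"))  # -> world hello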

Beware of Pathological Regular Expressions

The pattern replace character filter uses Java Regular Expressions.

A badly written regular expression could run very slowly or even throw a StackOverflowError and cause the node it is running on to exit suddenly.

Read more about pathological regular expressions and how to avoid them.
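
Although Elasticsearch evaluates the pattern with Java's engine, the failure mode is easy to reproduce in any backtracking engine. A minimal sketch in Python, using the classic pathological pattern (a+)+$:

import re
import time

# The nested quantifier in "(a+)+$" gives the engine exponentially many
# ways to partition the run of "a"s once the trailing "b" forces the
# match to fail, so runtime roughly doubles with each extra character.
for n in (16, 18, 20, 22):
    text = "a" * n + "b"
    start = time.perf_counter()
    re.match(r"(a+)+$", text)  # returns None, but only after heavy backtracking
    print(f"n={n}: {time.perf_counter() - start:.3f}s")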

Configuration

The pattern_replace character filter accepts the following parameters:

pattern

A Java regular expression. Required.

replacement

The replacement string, which can reference capture groups using the $1..$9 syntax, as described in the Java regular expression documentation.

flags

Java regular expression flags. Flags should be pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS".
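
For example, a filter that rewrites "dog" to "cat" regardless of case might be configured as follows (a minimal sketch using the Python client; the index and filter names are placeholders):

resp = client.indices.create(
    index="my-index-000002",
    settings={
        "analysis": {
            "char_filter": {
                "my_insensitive_filter": {
                    "type": "pattern_replace",
                    "pattern": "dog",
                    "replacement": "cat",
                    # Pipe-separated Java Pattern flag names
                    "flags": "CASE_INSENSITIVE|UNICODE_CASE"
                }
            }
        }
    },
)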

Example configuration

In this example, we configure the pattern_replace character filter to replace any embedded dashes in numbers with underscores, i.e. 123-456-789 → 123_456_789:

Python:

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "standard",
                    "char_filter": [
                        "my_char_filter"
                    ]
                }
            },
            "char_filter": {
                "my_char_filter": {
                    "type": "pattern_replace",
                    "pattern": "(\\d+)-(?=\\d)",
                    "replacement": "$1_"
                }
            }
        }
    },
)
print(resp)

resp1 = client.indices.analyze(
    index="my-index-000001",
    analyzer="my_analyzer",
    text="My credit card is 123-456-789",
)
print(resp1)

Ruby:

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'standard',
            char_filter: [
              'my_char_filter'
            ]
          }
        },
        char_filter: {
          my_char_filter: {
            type: 'pattern_replace',
            pattern: '(\\d+)-(?=\\d)',
            replacement: '$1_'
          }
        }
      }
    }
  }
)
puts response

response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_analyzer',
    text: 'My credit card is 123-456-789'
  }
)
puts response

JavaScript:

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_analyzer: {
          tokenizer: "standard",
          char_filter: ["my_char_filter"],
        },
      },
      char_filter: {
        my_char_filter: {
          type: "pattern_replace",
          pattern: "(\\d+)-(?=\\d)",
          replacement: "$1_",
        },
      },
    },
  },
});
console.log(response);

const response1 = await client.indices.analyze({
  index: "my-index-000001",
  analyzer: "my_analyzer",
  text: "My credit card is 123-456-789",
});
console.log(response1);

Console:

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}

The above example produces the following terms:

[ My, credit, card, is, 123_456_789 ]
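
The lookahead (?=\d) is what allows every dash to be rewritten: it asserts that a digit follows without consuming it, so that digit is still available to start the next match. A quick check with Python's re module (which writes the back-reference as \1 where the filter's replacement string uses $1) shows the difference on closely spaced dashes:

import re

# Zero-width lookahead: the digit after each dash is left in place,
# so consecutive dashes are all rewritten.
print(re.sub(r"(\d+)-(?=\d)", r"\1_", "1-2-3"))   # -> 1_2_3

# Consuming that digit instead makes matches non-overlapping,
# so the second dash is skipped.
print(re.sub(r"(\d+)-(\d)", r"\1_\2", "1-2-3"))   # -> 1_2-3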

Using a replacement string that changes the length of the original text will work for search purposes, but will result in incorrect highlighting, as can be seen in the following example.

This example inserts a space whenever it encounters a lower-case letter followed by an upper-case letter (i.e. fooBarBaz → foo Bar Baz), allowing camelCase words to be queried individually:

Python:

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "standard",
                    "char_filter": [
                        "my_char_filter"
                    ],
                    "filter": [
                        "lowercase"
                    ]
                }
            },
            "char_filter": {
                "my_char_filter": {
                    "type": "pattern_replace",
                    "pattern": "(?<=\\p{Lower})(?=\\p{Upper})",
                    "replacement": " "
                }
            }
        }
    },
    mappings={
        "properties": {
            "text": {
                "type": "text",
                "analyzer": "my_analyzer"
            }
        }
    },
)
print(resp)

resp1 = client.indices.analyze(
    index="my-index-000001",
    analyzer="my_analyzer",
    text="The fooBarBaz method",
)
print(resp1)

Ruby:

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'standard',
            char_filter: [
              'my_char_filter'
            ],
            filter: [
              'lowercase'
            ]
          }
        },
        char_filter: {
          my_char_filter: {
            type: 'pattern_replace',
            pattern: '(?<=\\p{Lower})(?=\\p{Upper})',
            replacement: ' '
          }
        }
      }
    },
    mappings: {
      properties: {
        text: {
          type: 'text',
          analyzer: 'my_analyzer'
        }
      }
    }
  }
)
puts response

response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_analyzer',
    text: 'The fooBarBaz method'
  }
)
puts response

JavaScript:

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_analyzer: {
          tokenizer: "standard",
          char_filter: ["my_char_filter"],
          filter: ["lowercase"],
        },
      },
      char_filter: {
        my_char_filter: {
          type: "pattern_replace",
          pattern: "(?<=\\p{Lower})(?=\\p{Upper})",
          replacement: " ",
        },
      },
    },
  },
  mappings: {
    properties: {
      text: {
        type: "text",
        analyzer: "my_analyzer",
      },
    },
  },
});
console.log(response);

const response1 = await client.indices.analyze({
  index: "my-index-000001",
  analyzer: "my_analyzer",
  text: "The fooBarBaz method",
});
console.log(response1);

Console:

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ],
          "filter": [
            "lowercase"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(?<=\\p{Lower})(?=\\p{Upper})",
          "replacement": " "
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The fooBarBaz method"
}

The above returns the following terms:

[ the, foo, bar, baz, method ]
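
The pattern here is entirely zero-width: both the lookbehind and the lookahead match between characters, so the replacement inserts a space without removing anything. The same split in Python's re module, which lacks \p{Lower} and \p{Upper}, so ASCII character classes stand in:

import re

# Both assertions are zero-width: the space is inserted at each
# lower-to-upper boundary without consuming any characters.
print(re.sub(r"(?<=[a-z])(?=[A-Z])", " ", "fooBarBaz"))  # -> foo Bar Baz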

Querying for bar will find the document correctly, but highlighting on the result will produce incorrect highlights, because our character filter changed the length of the original text:

Python:

resp = client.index(
    index="my-index-000001",
    id="1",
    refresh=True,
    document={
        "text": "The fooBarBaz method"
    },
)
print(resp)

resp1 = client.search(
    index="my-index-000001",
    query={
        "match": {
            "text": "bar"
        }
    },
    highlight={
        "fields": {
            "text": {}
        }
    },
)
print(resp1)

Ruby:

response = client.index(
  index: 'my-index-000001',
  id: 1,
  refresh: true,
  body: {
    text: 'The fooBarBaz method'
  }
)
puts response

response = client.search(
  index: 'my-index-000001',
  body: {
    query: {
      match: {
        text: 'bar'
      }
    },
    highlight: {
      fields: {
        text: {}
      }
    }
  }
)
puts response

JavaScript:

const response = await client.index({
  index: "my-index-000001",
  id: 1,
  refresh: "true",
  document: {
    text: "The fooBarBaz method",
  },
});
console.log(response);

const response1 = await client.search({
  index: "my-index-000001",
  query: {
    match: {
      text: "bar",
    },
  },
  highlight: {
    fields: {
      text: {},
    },
  },
});
console.log(response1);

Console:

PUT my-index-000001/_doc/1?refresh
{
  "text": "The fooBarBaz method"
}

GET my-index-000001/_search
{
  "query": {
    "match": {
      "text": "bar"
    }
  },
  "highlight": {
    "fields": {
      "text": {}
    }
  }
}

The output from the above is:

{
  "timed_out": false,
  "took": $body.took,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped" : 0,
    "failed": 0
  },
  "hits": {
    "total" : {
        "value": 1,
        "relation": "eq"
    },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my-index-000001",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "text": "The fooBarBaz method"
        },
        "highlight": {
          "text": [
            "The foo<em>Ba</em>rBaz method" 
          ]
        }
      }
    ]
  }
}

Note the incorrect highlight.
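
The root cause is that token offsets are computed against the filtered text, while highlighting is applied to the original, shorter text. A simplified illustration (Elasticsearch's actual offset correction differs in detail, but the mismatch stems from the same length change):

original = "The fooBarBaz method"
filtered = "The foo Bar Baz method"  # after the char filter inserts spaces

# "Bar" starts at offset 8 in the filtered text, but offset 8 points
# into the middle of "fooBarBaz" in the original text.
start = filtered.index("Bar")        # 8
print(original[start:start + 3])     # -> arB, not Bar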