normalizer
The normalizer property of keyword fields is similar to analyzer except that it guarantees that the analysis chain produces a single token.
The normalizer is applied prior to indexing the keyword, as well as at search-time when the keyword field is searched via a query parser such as the match query or via a term-level query such as the term query.
A simple normalizer called lowercase ships with Elasticsearch and can be used directly. Custom normalizers can be defined as part of the analysis settings, as follows.
PUT index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "foo": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}
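Since the built-in lowercase normalizer requires no configuration, it can also be referenced by name directly in a mapping, with no analysis settings at all. A minimal sketch (lowercase_index is a placeholder index name):

PUT lowercase_index
{
  "mappings": {
    "properties": {
      "foo": {
        "type": "keyword",
        "normalizer": "lowercase"
      }
    }
  }
}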
PUT index/_doc/1
{
  "foo": "BÀR"
}

PUT index/_doc/2
{
  "foo": "bar"
}

PUT index/_doc/3
{
  "foo": "baz"
}

POST index/_refresh
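To inspect what the normalizer produces before running any searches, the _analyze API accepts a normalizer parameter:

GET index/_analyze
{
  "normalizer": "my_normalizer",
  "text": "BÀR"
}

This should return a single token, bar, showing the lowercase and asciifolding filters applied in sequence.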
GET index/_search
{
  "query": {
    "term": {
      "foo": "BAR"
    }
  }
}
GET index/_search
{
  "query": {
    "match": {
      "foo": "BAR"
    }
  }
}
The above queries match documents 1 and 2 since BÀR is converted to bar at both index and query time.
{
  "took": $body.took,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.4700036,
    "hits": [
      {
        "_index": "index",
        "_id": "1",
        "_score": 0.4700036,
        "_source": {
          "foo": "BÀR"
        }
      },
      {
        "_index": "index",
        "_id": "2",
        "_score": 0.4700036,
        "_source": {
          "foo": "bar"
        }
      }
    ]
  }
}
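Because the normalizer is also applied at search time, querying with the original accented form matches the same documents:

GET index/_search
{
  "query": {
    "term": {
      "foo": "BÀR"
    }
  }
}

The query term BÀR is normalized to bar before the lookup, so documents 1 and 2 match here as well.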
The fact that keywords are converted prior to indexing also means that aggregations return normalized values:
GET index/_search
{
  "size": 0,
  "aggs": {
    "foo_terms": {
      "terms": {
        "field": "foo"
      }
    }
  }
}
returns
{
  "took": 43,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "foo_terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "bar",
          "doc_count": 2
        },
        {
          "key": "baz",
          "doc_count": 1
        }
      ]
    }
  }
}
When synthetic source is enabled on an index, keyword fields with a normalizer store the original pre-normalized value separately so that it can be included in the synthetic source. This can cause elevated disk usage since values are stored twice: once pre-normalized, and once normalized.
To reduce disk usage, the normalizer_skip_store_original_value mapping parameter can be set to true. When enabled, the original value is not stored, and the normalized value is used in the reconstructed source instead.
For custom normalizers, this parameter defaults to false. However, for the built-in lowercase normalizer, it defaults to true.
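A minimal sketch of where the parameter sits in a mapping, assuming the custom normalizer defined earlier and an index on which synthetic source is already enabled (synthetic_index is a placeholder name):

PUT synthetic_index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "foo": {
        "type": "keyword",
        "normalizer": "my_normalizer",
        "normalizer_skip_store_original_value": true
      }
    }
  }
}

With this setting, the reconstructed source returns the normalized bar rather than the original BÀR, in exchange for the reduced disk usage.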