WARNING: Version 2.1 of Elasticsearch has passed its EOL date.
This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.
fielddata
Most fields are indexed by default, which makes them searchable. The inverted index allows queries to look up the search term in a unique sorted list of terms, and from that immediately have access to the list of documents that contain the term.
Sorting, aggregations, and access to field values in scripts require a different data access pattern. Instead of looking up the term and finding documents, we need to be able to look up the document and find the terms that it has in a field.
Most fields can use index-time, on-disk doc_values to support this type of data access pattern, but analyzed string fields do not support doc_values.
Instead, analyzed strings use a query-time data structure called fielddata. This data structure is built on demand the first time that a field is used for aggregations, sorting, or is accessed in a script. It is built by reading the entire inverted index for each segment from disk, inverting the term ↔︎ document relationship, and storing the result in memory, in the JVM heap.
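The inversion described above can be illustrated with a small Python sketch. It is a conceptual model only, not how Elasticsearch stores either structure: the function names `build_inverted_index` and `uninvert` and the sample documents are hypothetical, and plain dictionaries stand in for the on-disk index and the in-heap fielddata.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    # Keep terms in sorted order, mirroring the unique sorted list of terms.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

def uninvert(inverted_index):
    """Invert term -> documents into document -> terms (the fielddata view)."""
    fielddata = defaultdict(list)
    for term, doc_ids in inverted_index.items():
        for doc_id in doc_ids:
            fielddata[doc_id].append(term)
    return dict(fielddata)

# Hypothetical analyzed field values for two documents.
docs = {
    1: ["quick", "brown", "fox"],
    2: ["lazy", "brown", "dog"],
}
inverted = build_inverted_index(docs)   # search: term -> documents
fielddata = uninvert(inverted)          # sort/aggregate: document -> terms
```

Note that `uninvert` has to walk every term in the index before any single document's terms are complete, which is one way to see why building fielddata is expensive.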
Loading fielddata is an expensive process so, once it has been loaded, it remains in memory for the lifetime of the segment.
Fielddata can fill up your heap space
fielddata.format
For analyzed string fields, the fielddata format controls whether fielddata should be enabled or not. It accepts: disabled and paged_bytes (enabled, which is the default). To disable fielddata loading, you can use the following mapping:
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "string",
          "fielddata": {
            "format": "disabled"
          }
        }
      }
    }
  }
}
Fielddata and other datatypes
Historically, other field datatypes also used fielddata, but this has been replaced by index-time, disk-based doc_values.
fielddata.loading
This per-field setting controls when fielddata is loaded into memory. It accepts three options:
| lazy | Fielddata is only loaded into memory when it is needed. (default) |
| eager | Fielddata is loaded into memory before a new search segment becomes visible to search. This can reduce the latency that a user may experience if their search request has to trigger lazy loading from a big segment. |
| eager_global_ordinals | Loading fielddata into memory is only part of the work that is required. After loading the fielddata for each segment, Elasticsearch builds the global ordinals data structure to make a list of all unique terms across all the segments in a shard. By default, global ordinals are built lazily. If the field has a very high cardinality, global ordinals may take some time to build, in which case you can use eager loading instead. |
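To make the idea of global ordinals more concrete, here is a minimal Python sketch, assuming each segment exposes its own sorted list of unique terms. The function `build_global_ordinals` and the sample segments are hypothetical, and the real data structure is considerably more compact than the plain lists used here.

```python
def build_global_ordinals(segment_terms):
    """Build a shard-wide term list from per-segment sorted term lists.

    Returns the sorted list of all unique terms across segments and, for
    each segment, a list mapping segment-local ordinals to global ordinals.
    """
    global_terms = sorted(set().union(*segment_terms))
    global_ord = {term: i for i, term in enumerate(global_terms)}
    seg_to_global = [[global_ord[t] for t in terms] for terms in segment_terms]
    return global_terms, seg_to_global

# Two hypothetical segments in the same shard.
segments = [
    ["apple", "cherry"],
    ["banana", "cherry", "date"],
]
global_terms, seg_to_global = build_global_ordinals(segments)
# "cherry" is ordinal 1 in both segments but maps to one global ordinal.
```

Every unique term in every segment has to be visited to build the mapping, which suggests why a very high cardinality field makes this step slow.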
fielddata.filter
Fielddata filtering can be used to reduce the number of terms loaded into memory, and thus reduce memory usage. Terms can be filtered by frequency or by regular expression, or a combination of the two:
- Filtering by frequency
-
The frequency filter allows you to only load terms whose document frequency falls between a min and max value, which can be expressed as an absolute number (when the number is bigger than 1.0) or as a percentage (e.g. 0.01 is 1% and 1.0 is 100%). Frequency is calculated per segment. Percentages are based on the number of docs which have a value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum number of docs that the segment should contain with min_segment_size:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "tag": {
          "type": "string",
          "fielddata": {
            "filter": {
              "frequency": {
                "min": 0.001,
                "max": 0.1,
                "min_segment_size": 500
              }
            }
          }
        }
      }
    }
  }
}
- Filtering by regex
-
Terms can also be filtered by regular expression; only values which match the regular expression are loaded. Note: the regular expression is applied to each term in the field, not to the whole field value. For instance, to only load hashtags from a tweet, we can use a regular expression which matches terms beginning with #:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "tweet": {
          "type": "string",
          "analyzer": "whitespace",
          "fielddata": {
            "filter": {
              "regex": {
                "pattern": "^#.*"
              }
            }
          }
        }
      }
    }
  }
}
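The effect of the two filters on a single segment can be approximated with the Python sketch below. It is an illustration of the rules as described above, not Elasticsearch's implementation: the helper names and sample frequencies are hypothetical, the bounds are assumed to be inclusive, the regular expression is assumed to match from the start of each term, and min_segment_size is left out because the text does not spell out how excluded segments are handled.

```python
import re

def resolve_bound(value, docs_with_field):
    """Interpret a min/max setting: above 1.0 it is an absolute count,
    otherwise a fraction of the docs that have a value for the field."""
    return value if value > 1.0 else value * docs_with_field

def filter_terms(doc_freqs, docs_with_field,
                 min_freq=None, max_freq=None, pattern=None):
    """Return the terms of one segment that would survive the filters.

    doc_freqs maps each term in the segment to its document frequency.
    """
    lo = resolve_bound(min_freq, docs_with_field) if min_freq is not None else 0
    hi = (resolve_bound(max_freq, docs_with_field)
          if max_freq is not None else float("inf"))
    regex = re.compile(pattern) if pattern else None
    kept = []
    for term, freq in doc_freqs.items():
        if not (lo <= freq <= hi):           # assumption: inclusive bounds
            continue
        if regex and not regex.match(term):  # applied per term, not per field value
            continue
        kept.append(term)
    return sorted(kept)

# Hypothetical segment in which 1000 docs have a value for the field.
doc_freqs = {"#elasticsearch": 50, "#rare": 1, "the": 900, "search": 80}

# min 1% and max 10% resolve to document frequencies of 10 and 100.
by_frequency = filter_terms(doc_freqs, 1000, min_freq=0.01, max_freq=0.1)
by_regex = filter_terms(doc_freqs, 1000, pattern="^#.*")
both = filter_terms(doc_freqs, 1000, min_freq=0.01, max_freq=0.1, pattern="^#.*")
```

Here the frequency filter drops both the very common term `the` and the very rare term `#rare`, while the regular expression keeps only hashtags regardless of how often they occur.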
These filters can be updated on an existing field mapping and will take effect the next time the fielddata for a segment is loaded. Use the Clear Cache API to reload the fielddata using the new filters.