WARNING: Version 1.5 of Elasticsearch has passed its EOL date.
This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.
Field data formats
The field data format controls how field data is stored. Depending on the field type, several field data formats may be available. In particular, string and numeric types support the doc_values format, which computes the field data structures at indexing time and stores them on disk. Although this makes the index larger and may be slightly slower, this implementation is more near-realtime-friendly and requires much less memory from the JVM than other implementations.
Here is an example of how to configure the tag field to use the fst field data format.
{ "tag": { "type": "string", "fielddata": { "format": "fst" } } }
It is possible to change the field data format (and the field data settings in general) on a live index by using the update mapping API. When doing so, field data which had already been loaded for existing segments will remain alive while new segments will use the new field data configuration. Thanks to the background merging process, all segments will eventually use the new field data format.
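As a rough illustration of such a live update, the sketch below only builds the request for the update mapping API and sends nothing over the network. The index name `my_index` and type name `my_type` are hypothetical, the helper `fielddata_mapping_request` is invented for this example, and the `/{index}/_mapping/{type}` path is assumed to be the update mapping endpoint for this version.

```python
import json


def fielddata_mapping_request(index, doc_type, field, field_type, fielddata):
    """Return (method, path, body) for an update mapping request.

    Hypothetical helper: it only assembles the request, the caller decides
    how (and whether) to send it.
    """
    # Assumed endpoint shape for the update mapping API.
    path = "/{}/_mapping/{}".format(index, doc_type)
    body = {
        doc_type: {
            "properties": {
                field: {"type": field_type, "fielddata": fielddata}
            }
        }
    }
    return "PUT", path, json.dumps(body, sort_keys=True)


# Switch the "tag" field to the fst format ("my_index"/"my_type" are made up).
method, path, body = fielddata_mapping_request(
    "my_index", "my_type", "tag", "string", {"format": "fst"}
)
print(method, path)
print(body)
```

Segments that were loaded before the change keep their old format until they are merged away, so the effect of such a request is gradual.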
String field data types
- paged_bytes (default): Stores unique terms sequentially in a large buffer and maps documents to the indices of the terms they contain in this large buffer.
- fst: Stores terms in an FST. Slower to build than paged_bytes but can help lower memory usage if many terms share common prefixes and/or suffixes.
- doc_values: Computes and stores field data structures on disk at indexing time. Lowers memory usage but only works on non-analyzed strings (index: no or not_analyzed).
Numeric field data types
- array (default): Stores field values in memory using arrays.
- doc_values: Computes and stores field data structures on disk at indexing time.
Geo point field data types
- array (default): Stores latitudes and longitudes in arrays.
- doc_values: Computes and stores field data structures on disk at indexing time.
Global ordinals
Global ordinals are a data structure on top of field data that maintains an incremental numbering for all the terms in field data, in lexicographic order. Each term has a unique number, and the number of term A is lower than the number of term B whenever A sorts before B. Global ordinals are only supported on string fields.
Field data on strings also has ordinals, which are a unique numbering for all terms in a particular segment and field. Global ordinals build on top of this by providing a mapping between the segment ordinals and the global ordinals, the latter being unique across the entire shard.
Global ordinals can improve execution time in search features that already use segment ordinals, such as the terms aggregator. These features often need to merge the per-segment ordinal results into a cross-segment terms result. With global ordinals, this mapping happens at field data load time instead of during each query execution. Search features only need to resolve the actual term when building the (shard) response; during execution the unique numbering that global ordinals provide is sufficient, and the actual terms are not needed at all.
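The relationship between segment ordinals and global ordinals can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Elasticsearch or Lucene implement it: the `segments` layout, `build_global_ordinals` and `terms_aggregation` are all invented for illustration, and each document is assumed to hold a single term.

```python
from collections import Counter

# Toy shard with two segments. Each segment has its own sorted term list;
# a term's position in that list is its segment ordinal. "doc_ords" holds
# the segment ordinal of each document's (single) term.
segments = [
    {"terms": ["apple", "cherry"], "doc_ords": [0, 1, 1]},
    {"terms": ["banana", "cherry"], "doc_ords": [1, 0, 1]},
]


def build_global_ordinals(segments):
    """Built once per set of visible segments.

    Returns the sorted shard-wide term list and, for every segment,
    a table mapping segment ordinal -> global ordinal.
    """
    global_terms = sorted({t for seg in segments for t in seg["terms"]})
    global_ord = {term: i for i, term in enumerate(global_terms)}
    seg_to_global = [[global_ord[t] for t in seg["terms"]] for seg in segments]
    return global_terms, seg_to_global


def terms_aggregation(segments, global_terms, seg_to_global):
    """Count documents per term using only integers during execution."""
    counts = Counter()
    for seg, mapping in zip(segments, seg_to_global):
        for seg_ord in seg["doc_ords"]:
            counts[mapping[seg_ord]] += 1  # no string handling here
    # Terms are resolved only when the response is built.
    return {global_terms[g]: n for g, n in sorted(counts.items())}


global_terms, seg_to_global = build_global_ordinals(segments)
result = terms_aggregation(segments, global_terms, seg_to_global)
print(result)  # {'apple': 1, 'banana': 1, 'cherry': 4}
```

The term "cherry" has segment ordinal 1 in both segments here, but that is a coincidence of the data; only the global ordinal (2) is guaranteed to identify the same term across the shard.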
Global ordinals for a specified field are tied to all the segments of a shard (Lucene index), unlike field data for a specific field, which is tied to a single segment. For this reason, global ordinals need to be rebuilt in their entirety once new segments become visible. This one-time cost would be incurred anyway without global ordinals, but then it would be paid on every search execution instead.
The loading time of global ordinals depends on the number of terms in a field, but in general it is low, since the source field data has already been loaded. The memory overhead of global ordinals is small because they are very efficiently compressed. Eager loading of global ordinals can move the loading time from the first search request to the refresh itself.
Fielddata loading
By default, field data is loaded lazily, i.e. the first time that a query that requires it is executed. However, this can make the first requests that follow a merge operation quite slow, since fielddata loading is a heavy operation.
It is possible to force field data to be loaded and cached eagerly through the loading setting of fielddata:
{ "category": { "type": "string", "fielddata": { "loading": "eager" } } }
Global ordinals can also be eagerly loaded:
{ "category": { "type": "string", "fielddata": { "loading": "eager_global_ordinals" } } }
With the above setting both field data and global ordinals for a specific field are eagerly loaded.
Disabling field data loading
Field data can take a lot of RAM, so it makes sense to disable field data loading on fields that don’t need it, for example those that are used for full-text search only. To disable field data loading, change the field data format to disabled. When disabled, all requests that try to load field data, e.g. when they include aggregations and/or sorting, will return an error.
{ "text": { "type": "string", "fielddata": { "format": "disabled" } } }
The disabled format is supported by all field types.
Filtering fielddata
It is possible to control which field values are loaded into memory, which is particularly useful for string fields. When specifying the mapping for a field, you can also specify a fielddata filter.
Fielddata filters can be changed using the PUT mapping API. After changing the filters, use the Clear Cache API to reload the fielddata using the new filters.
Filtering by frequency
The frequency filter allows you to only load terms whose frequency falls between a min and max value, which can be expressed as an absolute number or as a percentage (e.g. 0.01 is 1%). Frequency is calculated per segment. Percentages are based on the number of docs which have a value for the field, as opposed to all docs in the segment.
Small segments can be excluded completely by specifying the minimum number of docs that the segment should contain with min_segment_size:
{ "tag": { "type": "string", "fielddata": { "filter": { "frequency": { "min": 0.001, "max": 0.1, "min_segment_size": 500 } } } } }
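To make the min and max arithmetic concrete, here is a small Python model of the per-segment decision. It is a sketch under assumptions, not the actual implementation: it assumes values below 1 are treated as fractions and values of 1 or more as absolute counts, it assumes both bounds are inclusive, and it does not model min_segment_size. The names `resolve_threshold` and `frequency_filter` are invented for this example.

```python
def resolve_threshold(value, docs_with_value):
    """Turn a min/max setting into an absolute document count.

    Assumption: values below 1 are fractions of the docs that have a value
    for the field; anything else is already an absolute count.
    """
    return value * docs_with_value if value < 1 else value


def frequency_filter(term_doc_freq, docs_with_value, min_freq, max_freq):
    """Return the terms of one segment that would be loaded.

    term_doc_freq maps each term to the number of docs containing it.
    Bounds are treated as inclusive, which is an assumption of this sketch.
    """
    lo = resolve_threshold(min_freq, docs_with_value)
    hi = resolve_threshold(max_freq, docs_with_value)
    return sorted(t for t, df in term_doc_freq.items() if lo <= df <= hi)


# One segment in which 1000 docs have a value for the field.
segment = {"the": 900, "elasticsearch": 40, "typo": 1}
loaded = frequency_filter(segment, docs_with_value=1000, min_freq=0.01, max_freq=0.1)
print(loaded)  # ['elasticsearch']
```

With min 1% and max 10% of 1000 docs, only terms appearing in 10 to 100 docs survive: the very common "the" and the one-off "typo" are both left out of memory.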
Filtering by regex
Terms can also be filtered by regular expression: only values which match the regular expression are loaded. Note: the regular expression is applied to each term in the field, not to the whole field value. For instance, to only load hashtags from a tweet, we can use a regular expression which matches terms beginning with #:
{ "tweet": { "type": "string", "analyzer": "whitespace", "fielddata": { "filter": { "regex": { "pattern": "^#.*" } } } } }
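The distinction between matching each term and matching the whole field value can be seen in a short Python sketch. Python's `re` module and `str.split` stand in here for the real regular expression engine and the whitespace analyzer, so the exact matching semantics may differ; `filter_terms_by_regex` is a name invented for this example.

```python
import re


def filter_terms_by_regex(text, pattern):
    """Return the terms of `text` that match `pattern`.

    str.split() is a stand-in for the whitespace analyzer, and re.match
    is a stand-in for the engine that evaluates the fielddata regex filter.
    """
    compiled = re.compile(pattern)
    return [term for term in text.split() if compiled.match(term)]


tweet = "Loving the new release #elasticsearch #search"

# Applied per term: only the hashtags are kept.
print(filter_terms_by_regex(tweet, "^#.*"))  # ['#elasticsearch', '#search']

# Applied to the whole value, the same pattern would match nothing.
print(re.match("^#.*", tweet))  # None
```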
Combining filters
The frequency and regex filters can be combined:
{ "tweet": { "type": "string", "analyzer": "whitespace", "fielddata": { "filter": { "regex": { "pattern": "^#.*" }, "frequency": { "min": 0.001, "max": 0.1, "min_segment_size": 500 } } } } }