WARNING: Version 1.7 of Elasticsearch has passed its EOL date. This documentation is no longer maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.
Merge
A shard in Elasticsearch is a Lucene index, and a Lucene index is broken down into segments. Segments are internal storage elements in the index where the index data is stored, and they are immutable up to delete markers. Segments are periodically merged into larger segments to keep the index size at bay and to expunge deletes.
The more segments a Lucene index has, the slower the searches and the more memory used. Segment merging is used to reduce the number of segments; however, merges can be expensive to perform, especially in low-I/O environments. Merges can be throttled using store-level throttling, as sketched below.
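As a minimal sketch, store-level throttling is applied cluster-wide through the cluster settings API; the rate below is illustrative, not a recommendation:

```sh
# Throttle merge I/O cluster-wide to 20 MB/s (illustrative rate).
# indices.store.throttle.* are the store-level throttling settings.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "indices.store.throttle.type": "merge",
    "indices.store.throttle.max_bytes_per_sec": "20mb"
  }
}'
```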
Policy
The index merge policy module allows one to control which segments of a shard index are to be merged. There are several types of policies, with the default set to tiered.
tiered
Merges segments of approximately equal size, subject to an allowed number of segments per tier. This is similar to the log_byte_size merge policy, except this merge policy is able to merge non-adjacent segments, and it separates how many segments are merged at once from how many segments are allowed per tier. This merge policy also does not over-merge (i.e., cascade merges).
This policy has the following settings (a sketch of applying them follows this list):
- index.merge.policy.expunge_deletes_allowed: When expungeDeletes is called, we only merge away a segment if its delete percentage is over this threshold. Default is 10.
- index.merge.policy.floor_segment: Segments smaller than this are "rounded up" to this size, i.e. treated as equal (floor) size for merge selection. This is to prevent frequent flushing of tiny segments from creating a long tail in the index. Default is 2mb.
- index.merge.policy.max_merge_at_once: Maximum number of segments to be merged at a time during "normal" merging. Default is 10.
- index.merge.policy.max_merge_at_once_explicit: Maximum number of segments to be merged at a time during optimize or expungeDeletes. Default is 30.
- index.merge.policy.max_merged_segment: Maximum sized segment to produce during normal merging (not explicit optimize). This setting is approximate: the estimate of the merged segment size is made by summing the sizes of the to-be-merged segments (compensating for percent deleted docs). Default is 5gb.
- index.merge.policy.segments_per_tier: Sets the allowed number of segments per tier. Smaller values mean more merging but fewer segments. Default is 10. Note that this value needs to be >= max_merge_at_once, otherwise you'll force too many merges to occur.
- index.merge.policy.reclaim_deletes_weight: Controls how aggressively merges that reclaim more deletions are favored. Higher values favor selecting merges that reclaim deletions. A value of 0.0 means deletions don't impact merge selection. Defaults to 2.0.
- index.compound_format: Whether the index should be stored in compound format. Defaults to false. See index.compound_format in Index Settings.
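As a minimal sketch, these settings might be tuned on a live index through the update settings API; this assumes the merge policy settings are dynamically updatable in your deployment, and the index name and values are illustrative:

```sh
# Raise the maximum merged segment size and reduce segments per tier
# on a hypothetical index named "logs" (values are illustrative).
curl -XPUT 'localhost:9200/logs/_settings' -d '{
  "index.merge.policy.max_merged_segment": "10gb",
  "index.merge.policy.segments_per_tier": 5
}'
```

Fewer segments per tier trades more merge I/O for faster searches; the same settings can also be supplied under "settings" when the index is created.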
For normal merging, this policy first computes a "budget" of how many segments are allowed in the index. If the index is over budget, the policy sorts segments by decreasing size (proportionally considering percent deletes), and then finds the least-cost merge. Merge cost is measured by a combination of the "skew" of the merge (size of the largest segment divided by the smallest), the total merge size, and the percent of deletes reclaimed, so that merges with lower skew, smaller size, and more reclaimed deletes are favored.
If a merge will produce a segment that's larger than max_merged_segment, then the policy will merge fewer segments (down to 1 at once, if that one has deletions) to keep the segment size under budget.
Note that for large shards holding many gigabytes of data, the default max_merged_segment (5gb) can leave many segments in an index, causing searches to be slower. Use the indices segments API to see the segments that an index has, and then either increase max_merged_segment or issue an optimize call for the index (aim to issue it at a low-traffic time).
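For illustration, inspecting segments and then optimizing might look like this ("logs" is a hypothetical index name):

```sh
# List the Lucene segments of each shard, with sizes and doc counts.
curl -XGET 'localhost:9200/logs/_segments?pretty'

# During a quiet period, force-merge the index down to a single segment.
curl -XPOST 'localhost:9200/logs/_optimize?max_num_segments=1'
```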
log_byte_size
Deprecated in 1.6.0. This policy will be removed in 2.0 in favour of the tiered merge policy.
A merge policy that merges segments into levels of exponentially increasing byte size, where each level has fewer segments than the value of the merge factor. Whenever extra segments (beyond the merge factor upper bound) are encountered, all segments within the level are merged.
This policy has the following settings:
| Setting | Description |
|---|---|
| index.merge.policy.merge_factor | Determines how often segment indices are merged by index operation. With smaller values, less RAM is used while indexing, and searches on unoptimized indices are faster, but indexing speed is slower. With larger values, more RAM is used during indexing, and while searches on unoptimized indices are slower, indexing is faster. Larger values (greater than 10) are therefore best for batch index creation, and smaller values (lower than 10) for indices that are interactively maintained. Defaults to 10. |
| index.merge.policy.min_merge_size | A size setting type which sets the minimum size for the lowest level segments. Any segments below this size are considered to be on the same level (even if they vary drastically in size) and will be merged whenever there are merge_factor of them. This effectively truncates the "long tail" of small segments that would otherwise be created into a single level. If you set this too large, it could greatly increase the merging cost during indexing (if you flush many small segments). Defaults to 1.6mb. |
| index.merge.policy.max_merge_size | A size setting type which sets the largest segment (measured by total byte size of the segment's files) that may be merged with other segments. Defaults to unbounded. |
| index.merge.policy.max_merge_docs | Determines the largest segment (measured by document count) that may be merged with other segments. Defaults to unbounded. |
log_doc
Deprecated in 1.6.0. This policy will be removed in 2.0 in favour of the tiered merge policy.
A merge policy that tries to merge segments into levels of exponentially increasing document count, where each level has fewer segments than the value of the merge factor. Whenever extra segments (beyond the merge factor upper bound) are encountered, all segments within the level are merged.
This policy has the following settings:
| Setting | Description |
|---|---|
| index.merge.policy.merge_factor | Determines how often segment indices are merged by index operation. With smaller values, less RAM is used while indexing, and searches on unoptimized indices are faster, but indexing speed is slower. With larger values, more RAM is used during indexing, and while searches on unoptimized indices are slower, indexing is faster. Larger values (greater than 10) are therefore best for batch index creation, and smaller values (lower than 10) for indices that are interactively maintained. Defaults to 10. |
| index.merge.policy.min_merge_docs | Sets the minimum size for the lowest level segments. Any segments below this size are considered to be on the same level (even if they vary drastically in size) and will be merged whenever there are merge_factor of them. This effectively truncates the "long tail" of small segments that would otherwise be created into a single level. If you set this too large, it could greatly increase the merging cost during indexing (if you flush many small segments). Defaults to 1000. |
| index.merge.policy.max_merge_docs | Determines the largest segment (measured by document count) that may be merged with other segments. Defaults to unbounded. |
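Both log policies are selected via index.merge.policy.type at index creation (assuming your version supports choosing the policy type this way); a sketch with illustrative values:

```sh
# Create a hypothetical index "batch-docs" using the log_doc policy with a
# higher merge factor, favoring batch indexing speed over search speed.
curl -XPUT 'localhost:9200/batch-docs' -d '{
  "settings": {
    "index.merge.policy.type": "log_doc",
    "index.merge.policy.merge_factor": 30
  }
}'
```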
Scheduling
The merge scheduler (ConcurrentMergeScheduler) controls the execution of merge operations once they are needed (according to the merge policy). Merges run in separate threads, and when the maximum number of threads is reached, further merges will wait until a merge thread becomes available. The merge scheduler supports this setting:
- index.merge.scheduler.max_thread_count: The maximum number of threads that may be merging at once. Defaults to Math.max(1, Math.min(3, Runtime.getRuntime().availableProcessors() / 2)), which works well for a good solid-state disk (SSD). If your index is on spinning platter drives instead, decrease this to 1.
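For example, an index that lives on spinning disks might be created with merge concurrency capped at one thread ("archive" is a hypothetical index name):

```sh
# Limit this index to a single concurrent merge thread, which avoids
# thrashing spinning disks with interleaved merge I/O.
curl -XPUT 'localhost:9200/archive' -d '{
  "settings": {
    "index.merge.scheduler.max_thread_count": 1
  }
}'
```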
SerialMergeScheduler
This is accepted for backwards compatibility, but just uses ConcurrentMergeScheduler with index.merge.scheduler.max_thread_count set to 1, so that only one merge may run at a time.