Tune for search speed
Give memory to the filesystem cache
Elasticsearch relies heavily on the filesystem cache to make search fast. In general, make sure that at least half of the available memory goes to the filesystem cache so that Elasticsearch can keep hot regions of the index in physical memory.
Avoid page cache thrashing by using modest readahead values on Linux
Search can cause a lot of randomized read I/O. When the underlying block device has a high readahead value, a lot of unnecessary read I/O may be done, especially when files are accessed using memory mapping (see storage types).
Most Linux distributions use a sensible readahead value of 128KiB for a single plain device. However, when using software RAID, LVM, or dm-crypt, the resulting block device (backing Elasticsearch path.data) may end up having a very large readahead value (in the range of several MiB). This usually results in severe page (filesystem) cache thrashing, adversely affecting search (or update) performance.
You can check the current value in KiB using lsblk -o NAME,RA,MOUNTPOINT,TYPE,SIZE. Consult the documentation of your distribution on how to alter this value (for example with a udev rule to persist across reboots, or via blockdev --setra as a transient setting). We recommend a value of 128KiB for readahead.
blockdev expects values in 512-byte sectors, whereas lsblk reports values in KiB. As an example, to temporarily set readahead to 128KiB for /dev/nvme0n1, specify blockdev --setra 256 /dev/nvme0n1.
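The unit mismatch between the two tools is easy to get wrong. As a minimal sketch (the helper names are hypothetical and not part of either tool), the conversion between the KiB values reported by lsblk and the 512-byte sector counts expected by blockdev --setra can be written as:

```python
SECTOR_BYTES = 512  # blockdev --setra counts 512-byte sectors
KIB_BYTES = 1024    # lsblk reports readahead (RA) in KiB


def kib_to_sectors(readahead_kib: int) -> int:
    """Convert a readahead value in KiB to 512-byte sectors."""
    return readahead_kib * KIB_BYTES // SECTOR_BYTES


def sectors_to_kib(sectors: int) -> int:
    """Convert a readahead value in 512-byte sectors to KiB."""
    return sectors * SECTOR_BYTES // KIB_BYTES


# 128 KiB corresponds to 256 sectors, the value passed to blockdev --setra.
print(kib_to_sectors(128))
```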
Use faster hardware
If your searches are I/O-bound, consider increasing the size of the filesystem cache (see above) or using faster storage. Each search involves a mix of sequential and random reads across multiple files, and there may be many searches running concurrently on each shard, so SSD drives tend to perform better than spinning disks.
If your searches are CPU-bound, consider using a larger number of faster CPUs.
Local vs. remote storage
Directly-attached (local) storage generally performs better than remote storage because it is simpler to configure well and avoids communication overhead.
Some remote storage performs very poorly, especially under the kind of load that Elasticsearch imposes. However, with careful tuning, it is sometimes possible to achieve acceptable performance using remote storage too. Before committing to a particular storage architecture, benchmark your system with a realistic workload to determine the effects of any tuning parameters. If you cannot achieve the performance you expect, work with the vendor of your storage system to identify the problem.
Document modeling
Documents should be modeled so that search-time operations are as cheap as possible.
In particular, joins should be avoided. The nested field type can make queries several times slower, and parent-child relations can make queries hundreds of times slower. So if the same questions can be answered without joins by denormalizing documents, significant speedups can be expected.
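As an illustration of denormalizing, the sketch below copies the fields that searches need from a related record into each document before indexing, so no join is required at search time. The order and customer records, their field names, and the denormalize helper are all hypothetical.

```python
def denormalize(order: dict, customer: dict) -> dict:
    """Copy the customer fields needed at search time into the order document."""
    return {
        **order,
        "customer_name": customer["name"],
        "customer_country": customer["country"],
    }


# Hypothetical records that would otherwise be related through a join.
customer = {"id": "c1", "name": "Ada", "country": "UK"}
order = {"id": "o1", "customer_id": "c1", "total": 42}

# The flattened document can be indexed and searched on its own.
print(denormalize(order, customer))
```

The trade-off is that a change to the related record has to be propagated to every document that carries a copy of it.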
Search as few fields as possible
The more fields a query_string or multi_match query targets, the slower it is. A common technique to improve search speed over multiple fields is to copy their values into a single field at index time, and then use this field at search time. This can be automated with the copy_to directive of mappings without having to change the source of documents. Here is an example of an index containing movies that optimizes queries that search over both the name and the plot of the movie by indexing both values into the name_and_plot field.
resp = client.indices.create(
    index="movies",
    mappings={
        "properties": {
            "name_and_plot": {"type": "text"},
            "name": {"type": "text", "copy_to": "name_and_plot"},
            "plot": {"type": "text", "copy_to": "name_and_plot"},
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'movies',
  body: {
    mappings: {
      properties: {
        name_and_plot: { type: 'text' },
        name: { type: 'text', copy_to: 'name_and_plot' },
        plot: { type: 'text', copy_to: 'name_and_plot' }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "movies",
  mappings: {
    properties: {
      name_and_plot: { type: "text" },
      name: { type: "text", copy_to: "name_and_plot" },
      plot: { type: "text", copy_to: "name_and_plot" },
    },
  },
});
console.log(response);

PUT movies
{
  "mappings": {
    "properties": {
      "name_and_plot": { "type": "text" },
      "name": { "type": "text", "copy_to": "name_and_plot" },
      "plot": { "type": "text", "copy_to": "name_and_plot" }
    }
  }
}
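At search time, a query then only needs to target the combined field. The sketch below builds both query bodies for comparison as plain dictionaries, which could be passed as the query argument of client.search. The search text is a made-up example, and scoring may differ between the two forms.

```python
import json

search_text = "heist in Paris"  # hypothetical user input

# Searching the two original fields: the query fans out over both.
multi_field_query = {
    "multi_match": {"query": search_text, "fields": ["name", "plot"]}
}

# Searching the single copy_to target instead.
single_field_query = {"match": {"name_and_plot": search_text}}

print(json.dumps(single_field_query, indent=2))
```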
Pre-index data
You should leverage patterns in your queries to optimize the way data is indexed. For instance, if all your documents have a price field and most queries run range aggregations on a fixed list of ranges, you could make this aggregation faster by pre-indexing the ranges into the index and using a terms aggregation.
For example, if documents look like:
resp = client.index(
    index="index",
    id="1",
    document={"designation": "spoon", "price": 13},
)
print(resp)

response = client.index(
  index: 'index',
  id: 1,
  body: { designation: 'spoon', price: 13 }
)
puts response

const response = await client.index({
  index: "index",
  id: 1,
  document: {
    designation: "spoon",
    price: 13,
  },
});
console.log(response);

PUT index/_doc/1
{
  "designation": "spoon",
  "price": 13
}
and search requests look like:
resp = client.search(
    index="index",
    aggs={
        "price_ranges": {
            "range": {
                "field": "price",
                "ranges": [
                    {"to": 10},
                    {"from": 10, "to": 100},
                    {"from": 100},
                ],
            }
        }
    },
)
print(resp)

response = client.search(
  index: 'index',
  body: {
    aggregations: {
      price_ranges: {
        range: {
          field: 'price',
          ranges: [
            { to: 10 },
            { from: 10, to: 100 },
            { from: 100 }
          ]
        }
      }
    }
  }
)
puts response

const response = await client.search({
  index: "index",
  aggs: {
    price_ranges: {
      range: {
        field: "price",
        ranges: [
          { to: 10 },
          { from: 10, to: 100 },
          { from: 100 },
        ],
      },
    },
  },
});
console.log(response);

GET index/_search
{
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 10 },
          { "from": 10, "to": 100 },
          { "from": 100 }
        ]
      }
    }
  }
}
Then documents could be enriched by a price_range field at index time, which should be mapped as a keyword:
resp = client.indices.create(
    index="index",
    mappings={
        "properties": {
            "price_range": {"type": "keyword"}
        }
    },
)
print(resp)

resp1 = client.index(
    index="index",
    id="1",
    document={
        "designation": "spoon",
        "price": 13,
        "price_range": "10-100",
    },
)
print(resp1)

response = client.indices.create(
  index: 'index',
  body: {
    mappings: {
      properties: {
        price_range: { type: 'keyword' }
      }
    }
  }
)
puts response

response = client.index(
  index: 'index',
  id: 1,
  body: {
    designation: 'spoon',
    price: 13,
    price_range: '10-100'
  }
)
puts response

const response = await client.indices.create({
  index: "index",
  mappings: {
    properties: {
      price_range: { type: "keyword" },
    },
  },
});
console.log(response);

const response1 = await client.index({
  index: "index",
  id: 1,
  document: {
    designation: "spoon",
    price: 13,
    price_range: "10-100",
  },
});
console.log(response1);

PUT index
{
  "mappings": {
    "properties": {
      "price_range": { "type": "keyword" }
    }
  }
}

PUT index/_doc/1
{
  "designation": "spoon",
  "price": 13,
  "price_range": "10-100"
}
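How the price_range value is computed is left to the indexing application. One possible sketch follows. The function name and the labels for the two open-ended buckets are assumptions; the bounds mirror the range aggregation above, where from is inclusive and to is exclusive.

```python
def price_range_label(price: float) -> str:
    """Return the bucket label to store in the price_range keyword field."""
    if price < 10:
        return "*-10"    # assumed label for the bucket with only "to": 10
    if price < 100:
        return "10-100"  # label used in the example document
    return "100-*"       # assumed label for the bucket with only "from": 100


doc = {"designation": "spoon", "price": 13}
# Enrich the document before sending it to the index API.
doc["price_range"] = price_range_label(doc["price"])
print(doc)
```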
And then search requests could aggregate this new field rather than running a range aggregation on the price field.
resp = client.search(
    index="index",
    aggs={
        "price_ranges": {
            "terms": {"field": "price_range"}
        }
    },
)
print(resp)

response = client.search(
  index: 'index',
  body: {
    aggregations: {
      price_ranges: {
        terms: { field: 'price_range' }
      }
    }
  }
)
puts response

const response = await client.search({
  index: "index",
  aggs: {
    price_ranges: {
      terms: { field: "price_range" },
    },
  },
});
console.log(response);

GET index/_search
{
  "aggs": {
    "price_ranges": {
      "terms": { "field": "price_range" }
    }
  }
}
Consider mapping identifiers as keyword
Not all numeric data should be mapped as a numeric field data type. Elasticsearch optimizes numeric fields, such as integer or long, for range queries. However, keyword fields are better for term and other term-level queries.
Identifiers, such as an ISBN or a product ID, are rarely used in range queries. However, they are often retrieved using term-level queries.
Consider mapping a numeric identifier as a keyword if:
- You don’t plan to search for the identifier data using range queries.
- Fast retrieval is important. term query searches on keyword fields are often faster than term searches on numeric fields.
If you’re unsure which to use, you can use a multi-field to map the data as both a keyword and a numeric data type.
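As a sketch of the multi-field option, the mapping body below indexes an identifier as a keyword and adds a numeric sub-field. The field name product_id, the sub-field name as_long, and the query values are hypothetical; the mappings dictionary could be passed as the mappings argument of client.indices.create.

```python
import json

mappings = {
    "properties": {
        "product_id": {
            "type": "keyword",  # main field, used by term-level queries
            "fields": {
                # numeric sub-field, addressed as product_id.as_long
                "as_long": {"type": "long"}
            },
        }
    }
}

# Term-level lookups use the keyword field.
term_query = {"term": {"product_id": "12345"}}
# Range queries, if ever needed, use the numeric sub-field.
range_query = {"range": {"product_id.as_long": {"gte": 10000, "lt": 20000}}}

print(json.dumps(mappings, indent=2))
```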
Avoid scripts
If possible, avoid using script-based sorting, scripts in aggregations, and the script_score query. See Scripts, caching, and search speed.
Search rounded dates
Queries on date fields that use now are typically not cacheable, since the range that is being matched changes all the time. However, switching to a rounded date is often acceptable in terms of user experience, and has the benefit of making better use of the query cache.
For instance, the query below:
resp = client.index(
    index="index",
    id="1",
    document={"my_date": "2016-05-11T16:30:55.328Z"},
)
print(resp)

resp1 = client.search(
    index="index",
    query={
        "constant_score": {
            "filter": {
                "range": {
                    "my_date": {"gte": "now-1h", "lte": "now"}
                }
            }
        }
    },
)
print(resp1)

response = client.index(
  index: 'index',
  id: 1,
  body: { my_date: '2016-05-11T16:30:55.328Z' }
)
puts response

response = client.search(
  index: 'index',
  body: {
    query: {
      constant_score: {
        filter: {
          range: {
            my_date: { gte: 'now-1h', lte: 'now' }
          }
        }
      }
    }
  }
)
puts response

const response = await client.index({
  index: "index",
  id: 1,
  document: {
    my_date: "2016-05-11T16:30:55.328Z",
  },
});
console.log(response);

const response1 = await client.search({
  index: "index",
  query: {
    constant_score: {
      filter: {
        range: {
          my_date: { gte: "now-1h", lte: "now" },
        },
      },
    },
  },
});
console.log(response1);

PUT index/_doc/1
{
  "my_date": "2016-05-11T16:30:55.328Z"
}

GET index/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "my_date": { "gte": "now-1h", "lte": "now" }
        }
      }
    }
  }
}
could be replaced with the following query:
resp = client.search(
    index="index",
    query={
        "constant_score": {
            "filter": {
                "range": {
                    "my_date": {"gte": "now-1h/m", "lte": "now/m"}
                }
            }
        }
    },
)
print(resp)

response = client.search(
  index: 'index',
  body: {
    query: {
      constant_score: {
        filter: {
          range: {
            my_date: { gte: 'now-1h/m', lte: 'now/m' }
          }
        }
      }
    }
  }
)
puts response

const response = await client.search({
  index: "index",
  query: {
    constant_score: {
      filter: {
        range: {
          my_date: { gte: "now-1h/m", lte: "now/m" },
        },
      },
    },
  },
});
console.log(response);

GET index/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "my_date": { "gte": "now-1h/m", "lte": "now/m" }
        }
      }
    }
  }
}
In that case we rounded to the minute, so if the current time is 16:31:29, the range query will match everything whose value of the my_date field is between 15:31:00 and 16:31:59. And if several users run a query that contains this range in the same minute, the query cache could help speed things up a bit. The longer the interval that is used for rounding, the more the query cache can help, but beware that too aggressive rounding might also hurt user experience.
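To make the rounding concrete, the sketch below reproduces the two bounds in plain Python for the example time. It only illustrates the rounding rule and is not Elasticsearch code. The helper names and the calendar date are hypothetical, and it assumes that a gte bound rounds down while an lte bound rounds up to the last millisecond of the unit.

```python
from datetime import datetime, timedelta, timezone


def floor_to_minute(t: datetime) -> datetime:
    """Start of the minute containing t (assumed behavior of gte with /m)."""
    return t.replace(second=0, microsecond=0)


def ceil_to_minute(t: datetime) -> datetime:
    """Last millisecond of the minute containing t (assumed behavior of lte with /m)."""
    return floor_to_minute(t) + timedelta(minutes=1) - timedelta(milliseconds=1)


now = datetime(2016, 5, 11, 16, 31, 29, tzinfo=timezone.utc)
gte = floor_to_minute(now - timedelta(hours=1))  # now-1h/m
lte = ceil_to_minute(now)                        # now/m

print(gte.time(), lte.time())
```

Any request issued between 16:31:00 and 16:31:59 resolves to the same pair of bounds, which is what lets the cached result be reused.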
It might be tempting to split ranges into a large cacheable part and smaller non-cacheable parts in order to leverage the query cache, as shown below:
resp = client.search(
    index="index",
    query={
        "constant_score": {
            "filter": {
                "bool": {
                    "should": [
                        {"range": {"my_date": {"gte": "now-1h", "lte": "now-1h/m"}}},
                        {"range": {"my_date": {"gt": "now-1h/m", "lt": "now/m"}}},
                        {"range": {"my_date": {"gte": "now/m", "lte": "now"}}},
                    ]
                }
            }
        }
    },
)
print(resp)

response = client.search(
  index: 'index',
  body: {
    query: {
      constant_score: {
        filter: {
          bool: {
            should: [
              { range: { my_date: { gte: 'now-1h', lte: 'now-1h/m' } } },
              { range: { my_date: { gt: 'now-1h/m', lt: 'now/m' } } },
              { range: { my_date: { gte: 'now/m', lte: 'now' } } }
            ]
          }
        }
      }
    }
  }
)
puts response

const response = await client.search({
  index: "index",
  query: {
    constant_score: {
      filter: {
        bool: {
          should: [
            { range: { my_date: { gte: "now-1h", lte: "now-1h/m" } } },
            { range: { my_date: { gt: "now-1h/m", lt: "now/m" } } },
            { range: { my_date: { gte: "now/m", lte: "now" } } },
          ],
        },
      },
    },
  },
});
console.log(response);

GET index/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should": [
            { "range": { "my_date": { "gte": "now-1h", "lte": "now-1h/m" } } },
            { "range": { "my_date": { "gt": "now-1h/m", "lt": "now/m" } } },
            { "range": { "my_date": { "gte": "now/m", "lte": "now" } } }
          ]
        }
      }
    }
  }
}
However, this practice might make the query run slower in some cases, since the overhead introduced by the bool query may outweigh the savings from better use of the query cache.
Force-merge read-only indices
Indices that are read-only may benefit from being merged down to a single segment. This is typically the case with time-based indices: only the index for the current time frame is getting new documents while older indices are read-only. Shards that have been force-merged into a single segment can use simpler and more efficient data structures to perform searches.
Do not force-merge indices to which you are still writing, or to which you will write again in the future. Instead, rely on the automatic background merge process to perform merges as needed to keep the index running smoothly. If you continue to write to a force-merged index then its performance may become much worse.
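As an illustration, a force merge down to one segment could be requested as follows. This is a sketch that assumes `client` is an instance of the official Python `Elasticsearch` client; the helper name `force_merge_to_single_segment` is hypothetical.

```python
def force_merge_to_single_segment(client, index: str):
    """Ask Elasticsearch to merge every shard of `index` down to one segment."""
    # Only call this on indices that will never receive writes again.
    return client.indices.forcemerge(index=index, max_num_segments=1)
```

A call such as `force_merge_to_single_segment(client, "old-index")` (the index name is a placeholder) would typically be issued once, after the index has stopped receiving documents.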
Warm up global ordinals
Global ordinals are a data structure that is used to optimize the performance of aggregations. They are calculated lazily and stored in the JVM heap as part of the field data cache. For fields that are heavily used for bucketing aggregations, you can tell Elasticsearch to construct and cache the global ordinals before requests are received. This should be done carefully because it will increase heap usage and can make refreshes take longer. The option can be updated dynamically on an existing mapping by setting the eager global ordinals mapping parameter:
resp = client.indices.create(
    index="index",
    mappings={
        "properties": {
            "foo": {"type": "keyword", "eager_global_ordinals": True}
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'index',
  body: {
    mappings: {
      properties: {
        foo: { type: 'keyword', eager_global_ordinals: true }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "index",
  mappings: {
    properties: {
      foo: { type: "keyword", eager_global_ordinals: true },
    },
  },
});
console.log(response);

PUT index
{
  "mappings": {
    "properties": {
      "foo": { "type": "keyword", "eager_global_ordinals": true }
    }
  }
}
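The request above sets the option when the index is created. For an index that already exists, the same parameter could be applied through the update mapping API. The sketch below assumes `client` is the official Python `Elasticsearch` client and that the field is already mapped as keyword; `enable_eager_global_ordinals` is a hypothetical helper name.

```python
def enable_eager_global_ordinals(client, index: str, field: str):
    """Turn on eager global ordinals for an existing keyword field."""
    return client.indices.put_mapping(
        index=index,
        properties={
            # The type must match the existing mapping of the field.
            field: {"type": "keyword", "eager_global_ordinals": True}
        },
    )
```

For the mapping shown above, this would be called as `enable_eager_global_ordinals(client, "index", "foo")`.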
Warm up the filesystem cache
If the machine running Elasticsearch is restarted, the filesystem cache will be empty, so it will take some time before the operating system loads hot regions of the index into memory so that search operations are fast. You can explicitly tell the operating system which files should be loaded into memory eagerly depending on the file extension using the index.store.preload setting.
Loading data into the filesystem cache eagerly on too many indices or too many files will make search slower if the filesystem cache is not large enough to hold all the data. Use with caution.
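A cautious configuration preloads only a few file types rather than everything. The sketch below builds index settings that preload norms (`nvd`) and doc values (`dvd`); the choice of extensions is an assumption made for illustration, and the right list depends on which structures your searches actually touch.

```python
import json

# Preload only norms (.nvd) and doc values (.dvd); "*" would preload every file.
preload_settings = {"settings": {"index.store.preload": ["nvd", "dvd"]}}

# index.store.preload is a static setting, so it is supplied at index creation,
# for example as the body of a create-index request.
print(json.dumps(preload_settings, indent=2))
```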
Use index sorting to speed up conjunctions
Index sorting can be useful in order to make conjunctions faster at the cost of slightly slower indexing. Read more about it in the index sorting documentation.
Use preference to optimize cache utilization
There are multiple caches that can help with search performance, such as the filesystem cache, the request cache or the query cache. Yet all these caches are maintained at the node level, meaning that if you run the same request twice in a row, have 1 replica or more and use round-robin, the default routing algorithm, then those two requests will go to different shard copies, preventing node-level caches from helping.
Since it is common for users of a search application to run similar requests one after another, for instance in order to analyze a narrower subset of the index, using a preference value that identifies the current user or session could help optimize usage of the caches.
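One way to derive such a value is to hash the session identifier, as in the sketch below. The helper name `search_kwargs_for_session`, the index name and the session id are hypothetical; the returned dictionary is meant to be passed to the search call of the Python client, which accepts a `preference` argument.

```python
import hashlib


def search_kwargs_for_session(session_id: str, query: dict) -> dict:
    """Build search arguments that keep one session on the same shard copies."""
    # Hash the session id so the value is opaque and never starts with "_",
    # a prefix Elasticsearch reserves for its built-in preference values.
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()[:16]
    return {"index": "index", "query": query, "preference": digest}


first = search_kwargs_for_session("user-42", {"match_all": {}})
second = search_kwargs_for_session("user-42", {"match": {"title": "x"}})
print(first["preference"] == second["preference"])
```

Both requests carry the same preference string, so they are routed to the same shard copies and can benefit from caches warmed by the earlier request.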
Replicas might help with throughput, but not always
In addition to improving resiliency, replicas can help improve throughput. For instance if you have a single-shard index and three nodes, you will need to set the number of replicas to 2 in order to have 3 copies of your shard in total so that all nodes are utilized.
Now imagine that you have a two-shard index and two nodes. In one case, the number of replicas is 0, meaning that each node holds a single shard. In the second case the number of replicas is 1, meaning that each node has two shards. Which setup is going to perform best in terms of search performance? Usually, the setup that has fewer shards per node in total will perform better. The reason is that it gives a greater share of the available filesystem cache to each shard, and the filesystem cache is probably Elasticsearch’s number 1 performance factor. At the same time, beware that a setup without replicas cannot survive the failure of a single node, so there is a trade-off between throughput and availability.
So what is the right number of replicas? If your cluster has num_nodes nodes and num_primaries primary shards in total, and you want to be able to cope with at most max_failures node failures at once, then the right number of replicas for you is max(max_failures, ceil(num_nodes / num_primaries) - 1).
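The formula is easy to evaluate by hand, but a small function makes it simpler to try different cluster shapes. This is a direct transcription of the expression above; the function name `recommended_replicas` is hypothetical.

```python
import math


def recommended_replicas(num_nodes: int, num_primaries: int, max_failures: int) -> int:
    """Evaluate max(max_failures, ceil(num_nodes / num_primaries) - 1)."""
    return max(max_failures, math.ceil(num_nodes / num_primaries) - 1)


# Single-shard index on three nodes, tolerating one node failure.
print(recommended_replicas(num_nodes=3, num_primaries=1, max_failures=1))
# Two-shard index on two nodes, with no failure tolerance required.
print(recommended_replicas(num_nodes=2, num_primaries=2, max_failures=0))
```

The first call returns 2, matching the single-shard, three-node example earlier in this section; the second returns 0, the throughput-oriented setup with one shard per node.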
Tune your queries with the Search Profiler
The Profile API provides detailed information about how each component of your queries and aggregations impacts the time it takes to process the request.
The Search Profiler in Kibana makes it easy to navigate and analyze the profile results and gives you insight into how to tune your queries to improve performance and reduce load.
Because the Profile API itself adds significant overhead to the query, this information is best used to understand the relative cost of the various query components. It does not provide a reliable measure of actual processing time.
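Since the timings are most useful in relative terms, one option is to convert them into shares of the total. The sketch below walks the top-level query components of a profile response (returned when a search is sent with `"profile": true`) and ranks them by their share of time. The helper name `relative_query_costs` is hypothetical, and the sample response contains made-up numbers trimmed to the fields the helper reads.

```python
def relative_query_costs(profile_response: dict) -> list[tuple[str, float]]:
    """Return (query type, share of total time) pairs, most expensive first."""
    totals: dict[str, int] = {}
    for shard in profile_response["profile"]["shards"]:
        for search in shard["searches"]:
            # Only top-level components are summed; a parent's time already
            # includes the time of its children.
            for component in search["query"]:
                totals[component["type"]] = (
                    totals.get(component["type"], 0) + component["time_in_nanos"]
                )
    grand_total = sum(totals.values()) or 1
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(name, nanos / grand_total) for name, nanos in ranked]


# Illustrative data only, not real measurements.
sample = {
    "profile": {
        "shards": [
            {"searches": [{"query": [
                {"type": "BooleanQuery", "time_in_nanos": 300},
                {"type": "TermQuery", "time_in_nanos": 100},
            ]}]}
        ]
    }
}
print(relative_query_costs(sample))
```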
Faster phrase queries with index_phrases
The text field has an index_phrases option that indexes 2-shingles and is automatically leveraged by query parsers to run phrase queries that don’t have a slop. If your use case involves running lots of phrase queries, this can speed up queries significantly.
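To make the idea of 2-shingles concrete, the sketch below shows a mapping that enables the option and a toy function that produces the adjacent word pairs for a token stream. The field name `body` and the function `two_shingles` are hypothetical, and the function only illustrates the concept; it is not how Elasticsearch stores shingles internally.

```python
# Mapping fragment enabling index_phrases on a hypothetical text field.
phrase_mapping = {"properties": {"body": {"type": "text", "index_phrases": True}}}


def two_shingles(tokens: list[str]) -> list[str]:
    """Return every pair of adjacent tokens, joined by a space."""
    return [f"{left} {right}" for left, right in zip(tokens, tokens[1:])]


print(two_shingles(["quick", "brown", "fox"]))
```

With the pairs indexed as single terms, a phrase query without slop can look up pairs directly instead of checking term positions.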
Faster prefix queries with index_prefixes
The text field has an index_prefixes option that indexes prefixes of all terms and is automatically leveraged by query parsers to run prefix queries. If your use case involves running lots of prefix queries, this can speed up queries significantly.
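The sketch below shows a mapping that enables the option with explicit `min_chars` and `max_chars` bounds, together with a toy function listing the prefixes that would fall within those bounds for a single term. The field name `body` and the function `term_prefixes` are hypothetical; the function illustrates the concept and is not Elasticsearch's internal implementation.

```python
# Mapping fragment enabling index_prefixes on a hypothetical text field.
prefix_mapping = {
    "properties": {
        "body": {
            "type": "text",
            "index_prefixes": {"min_chars": 2, "max_chars": 5},
        }
    }
}


def term_prefixes(term: str, min_chars: int = 2, max_chars: int = 5) -> list[str]:
    """List the prefixes of `term` whose length is within the given bounds."""
    longest = min(max_chars, len(term))
    return [term[:length] for length in range(min_chars, longest + 1)]


print(term_prefixes("elastic"))
```

Wider bounds cover more prefix queries but increase index size, so the bounds are worth matching to the prefix lengths users actually type.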
Use constant_keyword to speed up filtering
There is a general rule that the cost of a filter is mostly a function of the number of matched documents. Imagine that you have an index containing cycles. There are a large number of bicycles and many searches perform a filter on cycle_type: bicycle. This very common filter is unfortunately also very costly since it matches most documents. There is a simple way to avoid running this filter: move bicycles to their own index and filter bicycles by searching this index instead of adding a filter to the query.
Unfortunately this can make client-side logic tricky, which is where constant_keyword helps. By mapping cycle_type as a constant_keyword with value bicycle on the index that contains bicycles, clients can keep running the exact same queries as they used to run on the monolithic index and Elasticsearch will do the right thing on the bicycles index by ignoring filters on cycle_type if the value is bicycle and returning no hits otherwise.
Here is what mappings could look like:
resp = client.indices.create(
    index="bicycles",
    mappings={
        "properties": {
            "cycle_type": {"type": "constant_keyword", "value": "bicycle"},
            "name": {"type": "text"},
        }
    },
)
print(resp)

resp1 = client.indices.create(
    index="other_cycles",
    mappings={
        "properties": {
            "cycle_type": {"type": "keyword"},
            "name": {"type": "text"},
        }
    },
)
print(resp1)

response = client.indices.create(
  index: 'bicycles',
  body: {
    mappings: {
      properties: {
        cycle_type: { type: 'constant_keyword', value: 'bicycle' },
        name: { type: 'text' }
      }
    }
  }
)
puts response

response = client.indices.create(
  index: 'other_cycles',
  body: {
    mappings: {
      properties: {
        cycle_type: { type: 'keyword' },
        name: { type: 'text' }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "bicycles",
  mappings: {
    properties: {
      cycle_type: { type: "constant_keyword", value: "bicycle" },
      name: { type: "text" },
    },
  },
});
console.log(response);

const response1 = await client.indices.create({
  index: "other_cycles",
  mappings: {
    properties: {
      cycle_type: { type: "keyword" },
      name: { type: "text" },
    },
  },
});
console.log(response1);

PUT bicycles
{
  "mappings": {
    "properties": {
      "cycle_type": { "type": "constant_keyword", "value": "bicycle" },
      "name": { "type": "text" }
    }
  }
}

PUT other_cycles
{
  "mappings": {
    "properties": {
      "cycle_type": { "type": "keyword" },
      "name": { "type": "text" }
    }
  }
}
We are splitting our index in two: one that will contain only bicycles, and another one that contains other cycles: unicycles, tricycles, etc. Then at search time, we need to search both indices, but we don’t need to modify queries.
resp = client.search(
    index="bicycles,other_cycles",
    query={
        "bool": {
            "must": {"match": {"description": "dutch"}},
            "filter": {"term": {"cycle_type": "bicycle"}},
        }
    },
)
print(resp)

response = client.search(
  index: 'bicycles,other_cycles',
  body: {
    query: {
      bool: {
        must: { match: { description: 'dutch' } },
        filter: { term: { cycle_type: 'bicycle' } }
      }
    }
  }
)
puts response

const response = await client.search({
  index: "bicycles,other_cycles",
  query: {
    bool: {
      must: { match: { description: "dutch" } },
      filter: { term: { cycle_type: "bicycle" } },
    },
  },
});
console.log(response);

GET bicycles,other_cycles/_search
{
  "query": {
    "bool": {
      "must": { "match": { "description": "dutch" } },
      "filter": { "term": { "cycle_type": "bicycle" } }
    }
  }
}
On the bicycles index, Elasticsearch will simply ignore the cycle_type filter and rewrite the search request to the one below:
resp = client.search(
    index="bicycles,other_cycles",
    query={"match": {"description": "dutch"}},
)
print(resp)

response = client.search(
  index: 'bicycles,other_cycles',
  body: {
    query: {
      match: { description: 'dutch' }
    }
  }
)
puts response

const response = await client.search({
  index: "bicycles,other_cycles",
  query: {
    match: { description: "dutch" },
  },
});
console.log(response);

GET bicycles,other_cycles/_search
{
  "query": {
    "match": { "description": "dutch" }
  }
}
On the other_cycles index, Elasticsearch will quickly figure out that bicycle doesn’t exist in the terms dictionary of the cycle_type field and return a search response with no hits.
This is a powerful way of making queries cheaper by putting common values in a dedicated index. This idea can also be combined across multiple fields: for instance, if you track the color of each cycle and your bicycles index ends up having a majority of black bikes, you could split it into a bicycles-black index and a bicycles-other-colors index.
The constant_keyword field is not strictly required for this optimization: it is also possible to update the client-side logic to route queries to the relevant indices based on filters. However, constant_keyword does this transparently and decouples search requests from the index topology in exchange for very little overhead.
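For comparison, the client-side alternative might look like the sketch below, which picks the target indices from the cycle_type filter before the request is sent. The function name `indices_for_cycle_type` is hypothetical, and the index names are the ones used in the example above.

```python
from typing import Optional


def indices_for_cycle_type(cycle_type: Optional[str]) -> str:
    """Choose which indices to search based on the cycle_type filter."""
    if cycle_type is None:
        # No filter on cycle_type: every index may contain matches.
        return "bicycles,other_cycles"
    if cycle_type == "bicycle":
        return "bicycles"
    return "other_cycles"


print(indices_for_cycle_type("bicycle"))
print(indices_for_cycle_type("tricycle"))
```

Every client has to carry this mapping and update it whenever the index layout changes, which is the coupling that constant_keyword avoids.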