WARNING: Version 5.5 of Elasticsearch has passed its EOL date.

This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.

General recommendations

Don’t return large result sets

Elasticsearch is designed as a search engine, which makes it very good at getting back the top documents that match a query. However, it is not as good for workloads that fall into the database domain, such as retrieving all documents that match a particular query. If you need to do this, make sure to use the Scroll API.
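
As a rough sketch of what using the Scroll API looks like from a client, the Python below pages through every hit for a query. The `send` callable is a hypothetical stand-in for whatever HTTP client you use (it takes a method, a path and a JSON body, and returns the parsed response); the index name, page size and keep-alive value are also assumptions made for illustration.

```python
from typing import Any, Callable, Dict, Iterator

# Hypothetical transport: send(method, path, body) -> parsed JSON response.
Send = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def scroll_all(send: Send, index: str, query: Dict[str, Any],
               page_size: int = 1000, keep_alive: str = "1m") -> Iterator[Dict[str, Any]]:
    """Yield every hit matching `query`, one scroll page at a time."""
    # The first request opens the scroll context and returns the first page.
    page = send("POST", f"/{index}/_search?scroll={keep_alive}",
                {"size": page_size, "query": query})
    scroll_id = page.get("_scroll_id")
    try:
        while page["hits"]["hits"]:
            yield from page["hits"]["hits"]
            # Later requests pass only the scroll id and a renewed keep-alive.
            page = send("POST", "/_search/scroll",
                        {"scroll": keep_alive, "scroll_id": scroll_id})
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Release the search context when done, or if the caller stops early.
        if scroll_id is not None:
            send("DELETE", "/_search/scroll", {"scroll_id": [scroll_id]})
```

Because the function is a generator, only one page of hits is held in memory at a time, which is the point of scrolling rather than asking for one very large result set.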

Avoid large documents

Given that the default http.max_content_length is set to 100MB, Elasticsearch will refuse to index any document that is larger than that. You might decide to increase that particular setting, but Lucene still has a limit of about 2GB.

Even without considering hard limits, large documents are usually not practical. Large documents put more stress on network, memory usage and disk, even for search requests that do not request the _source, since Elasticsearch needs to fetch the _id of the document in all cases, and the cost of getting this field is bigger for large documents due to how the filesystem cache works. Indexing such a document can use an amount of memory that is a multiple of the original size of the document. Proximity search (phrase queries for instance) and highlighting also become more expensive since their cost directly depends on the size of the original document.

It is sometimes useful to reconsider what the unit of information should be. For instance, the fact that you want to make books searchable doesn’t necessarily mean that a document should consist of a whole book. It might be a better idea to use chapters or even paragraphs as documents, and then have a property in these documents that identifies which book they belong to. This not only avoids the issues with large documents, it also makes the search experience better. For instance, if a user searches for two words foo and bar, a match across different chapters is probably very poor, while a match within the same paragraph is likely good.
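
One possible way to break a book into paragraph-sized documents is sketched below. The field names (`book_id`, `chapter`, `paragraph`, `text`) and the choice of blank lines as paragraph separators are assumptions for illustration, not part of any Elasticsearch API.

```python
from typing import Dict, Iterator, List


def paragraph_documents(book_id: str, chapters: List[str]) -> Iterator[Dict[str, object]]:
    """Yield one small document per paragraph, each pointing back to its book."""
    for chapter_no, chapter_text in enumerate(chapters, start=1):
        # Treat blank lines as paragraph separators (an assumption about the input).
        paragraphs = [p.strip() for p in chapter_text.split("\n\n") if p.strip()]
        for paragraph_no, text in enumerate(paragraphs, start=1):
            yield {
                "book_id": book_id,  # identifies the book this paragraph belongs to
                "chapter": chapter_no,
                "paragraph": paragraph_no,
                "text": text,
            }
```

Each yielded dictionary can then be indexed as its own document, and results can be grouped or filtered by `book_id` at search time.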

Avoid sparsity

The data structures behind Lucene, which Elasticsearch relies on to index and store data, work best with dense data, i.e. when all documents have the same fields. This is especially true for fields that have norms enabled (which is the case for text fields by default) or doc values enabled (which is the case for numerics, date, ip and keyword by default).

The reason is that Lucene internally identifies documents with so-called doc ids, which are integers between 0 and the total number of documents in the index. These doc ids are used for communication between the internal APIs of Lucene: for instance searching on a term with a match query produces an iterator of doc ids, and these doc ids are then used to retrieve the value of the norm in order to compute a score for these documents. The way this norm lookup is implemented currently is by reserving one byte for each document. The norm value for a given doc id can then be retrieved by reading the byte at index doc_id. While this is very efficient and helps Lucene quickly have access to the norm values of every document, this has the drawback that documents that do not have a value will also require one byte of storage.

In practice, this means that if an index has M documents, norms will require M bytes of storage per field, even for fields that only appear in a small fraction of the documents of the index. The situation is slightly more complex with doc values, which can be encoded in several ways depending on the type of field and on the actual data that the field stores, but the problem is very similar. In case you wonder: fielddata, which was used in Elasticsearch pre-2.0 before being replaced with doc values, also suffered from this issue, except that the impact was only on the memory footprint since fielddata was not explicitly materialized on disk.
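
To make the arithmetic concrete, here is a back-of-the-envelope sketch that follows the one-byte-per-document model described above. The function names and the example numbers are made up for illustration; real on-disk sizes depend on the Lucene version and codec.

```python
def norms_bytes(total_docs: int, fields_with_norms: int) -> int:
    """One byte per document per field with norms, whether or not the field is present."""
    return total_docs * fields_with_norms


def wasted_fraction(docs_with_field: int, total_docs: int) -> float:
    """Share of a field's norm bytes spent on documents that lack the field."""
    return 1.0 - docs_with_field / total_docs


# Hypothetical index: 10 million documents, 50 fields with norms enabled.
print(norms_bytes(10_000_000, 50))           # 500000000 bytes, roughly 500MB
# A field present in only 100,000 of those documents wastes about 99% of its norm bytes.
print(wasted_fraction(100_000, 10_000_000))
```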

Note that even though the most notable impact of sparsity is on storage requirements, it also has an impact on indexing speed and search speed since these bytes for documents that do not have a field still need to be written at index time and skipped over at search time.

It is totally fine to have a minority of sparse fields in an index. But beware that if sparsity becomes the rule rather than the exception, then the index will not be as efficient as it could be.

This section mostly focused on norms and doc values because those are the two features that are most affected by sparsity. Sparsity also affects the efficiency of the inverted index (used to index text/keyword fields) and dimensional points (used to index geo_point and numerics), but to a lesser extent.

Here are some recommendations that can help avoid sparsity:

Avoid putting unrelated data in the same index

You should avoid putting documents that have totally different structures into the same index in order to avoid sparsity. It is often better to put these documents into different indices. You could also consider giving fewer shards to these smaller indices since they will contain fewer documents overall.

Note that this advice does not apply in the case that you need to use parent/child relations between your documents since this feature is only supported on documents that live in the same index.

Normalize document structures

Even if you really need to put different kinds of documents in the same index, maybe there are opportunities to reduce sparsity. For instance if all documents in the index have a timestamp field but some call it timestamp and others call it creation_date, it would help to rename it so that all documents have the same field name for the same data.
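
A minimal sketch of this kind of renaming, applied to documents before they are indexed, might look like the following. The alias table and function name are hypothetical; only the `timestamp` and `creation_date` field names come from the example above.

```python
from typing import Any, Dict

# Hypothetical alias table: variant field name -> canonical field name.
FIELD_ALIASES = {"creation_date": "timestamp"}


def normalize_fields(doc: Dict[str, Any],
                     aliases: Dict[str, str] = FIELD_ALIASES) -> Dict[str, Any]:
    """Return a copy of `doc` with variant field names renamed to canonical ones."""
    normalized: Dict[str, Any] = {}
    for name, value in doc.items():
        canonical = aliases.get(name, name)
        if canonical in normalized:
            # Refuse to silently overwrite when both spellings are present.
            raise ValueError(f"conflicting values for field {canonical!r}")
        normalized[canonical] = value
    return normalized
```

Running every incoming document through such a step means the index ends up with a single dense `timestamp` field instead of two sparse ones.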

Avoid types

Types might sound like a good way to store multiple tenants in a single index. They are not: given that types store everything in a single index, having multiple types that have different fields in a single index will also cause problems due to sparsity as described above. If your types do not have very similar mappings, you might want to consider moving them to a dedicated index.

Disable `norms` and `doc_values` on sparse fields

If none of the above recommendations apply in your case, you might want to check whether you actually need norms and doc_values on your sparse fields. norms can be disabled if producing scores is not necessary on a field, which is typically true for fields that are only used for filtering. doc_values can be disabled on fields that are used neither for sorting nor for aggregations. Beware that this decision should not be made lightly since these parameters cannot be changed on a live index, so you would have to reindex if you realize that you need norms or doc_values.
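
As an illustration, the Python below builds an index-creation body that disables norms on one field and doc values on another. The mapping type name (`doc`) and the field names are hypothetical; the `norms` and `doc_values` mapping parameters are the ones discussed above.

```python
import json

# Hypothetical mapping type and field names, chosen for illustration only.
mapping = {
    "mappings": {
        "doc": {
            "properties": {
                # Text field used only for filtering: no scoring needed, so norms are off.
                "tags_text": {"type": "text", "norms": False},
                # Keyword field never sorted or aggregated on: doc values are off.
                "session_id": {"type": "keyword", "doc_values": False},
            }
        }
    }
}

# JSON body that could be sent when creating the index.
body = json.dumps(mapping, indent=2)
print(body)
```

Since these parameters cannot be changed on a live index, a body like this has to be supplied when the index is created, before any documents are indexed into it.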
