- Elasticsearch - The Definitive Guide:
- Foreword
- Preface
- Getting Started
- You Know, for Search…
- Installing and Running Elasticsearch
- Talking to Elasticsearch
- Document Oriented
- Finding Your Feet
- Indexing Employee Documents
- Retrieving a Document
- Search Lite
- Search with Query DSL
- More-Complicated Searches
- Full-Text Search
- Phrase Search
- Highlighting Our Searches
- Analytics
- Tutorial Conclusion
- Distributed Nature
- Next Steps
- Life Inside a Cluster
- Data In, Data Out
- What Is a Document?
- Document Metadata
- Indexing a Document
- Retrieving a Document
- Checking Whether a Document Exists
- Updating a Whole Document
- Creating a New Document
- Deleting a Document
- Dealing with Conflicts
- Optimistic Concurrency Control
- Partial Updates to Documents
- Retrieving Multiple Documents
- Cheaper in Bulk
- Distributed Document Store
- Searching—The Basic Tools
- Mapping and Analysis
- Full-Body Search
- Sorting and Relevance
- Distributed Search Execution
- Index Management
- Inside a Shard
- You Know, for Search…
- Search in Depth
- Structured Search
- Full-Text Search
- Multifield Search
- Proximity Matching
- Partial Matching
- Controlling Relevance
- Theory Behind Relevance Scoring
- Lucene’s Practical Scoring Function
- Query-Time Boosting
- Manipulating Relevance with Query Structure
- Not Quite Not
- Ignoring TF/IDF
- function_score Query
- Boosting by Popularity
- Boosting Filtered Subsets
- Random Scoring
- The Closer, The Better
- Understanding the price Clause
- Scoring with Scripts
- Pluggable Similarity Algorithms
- Changing Similarities
- Relevance Tuning Is the Last 10%
- Dealing with Human Language
- Aggregations
- Geolocation
- Modeling Your Data
- Administration, Monitoring, and Deployment
WARNING: This documentation covers Elasticsearch 2.x. The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Add an Index
editAdd an Index
editTo add data to Elasticsearch, we need an index—a place to store related data. In reality, an index is just a logical namespace that points to one or more physical shards.
A shard is a low-level worker unit that holds just a slice of all the data in the index. In Inside a Shard, we explain in detail how a shard works, but for now it is enough to know that a shard is a single instance of Lucene, and is a complete search engine in its own right. Our documents are stored and indexed in shards, but our applications don’t talk to them directly. Instead, they talk to an index.
Shards are how Elasticsearch distributes data around your cluster. Think of shards as containers for data. Documents are stored in shards, and shards are allocated to nodes in your cluster. As your cluster grows or shrinks, Elasticsearch will automatically migrate shards between nodes so that the cluster remains balanced.
A shard can be either a primary shard or a replica shard. Each document in your index belongs to a single primary shard, so the number of primary shards that you have determines the maximum amount of data that your index can hold.
While a primary shard can technically contain up to Integer.MAX_VALUE - 128 documents, the practical limit depends on your use case: the hardware you have, the size and complexity of your documents, how you index and query your documents, and your expected response times.
A replica shard is just a copy of a primary shard. Replicas are used to provide redundant copies of your data to protect against hardware failure, and to serve read requests like searching or retrieving a document.
The number of primary shards in an index is fixed at the time that an index is created, but the number of replica shards can be changed at any time.
Let’s create an index called blogs
in our empty one-node cluster. By
default, indices are assigned five primary shards, but for the purpose of this
demonstration, we’ll assign just three primary shards and one replica (one replica
of every primary shard):
PUT /blogs { "settings" : { "number_of_shards" : 3, "number_of_replicas" : 1 } }
Our cluster now looks like Figure 2, “A single-node cluster with an index”. All three primary shards have been allocated to Node 1
.

If we were to check the
cluster-health
now, we would see this:
{ "cluster_name": "elasticsearch", "status": "yellow", "timed_out": false, "number_of_nodes": 1, "number_of_data_nodes": 1, "active_primary_shards": 3, "active_shards": 3, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 3, "delayed_unassigned_shards": 0, "number_of_pending_tasks": 0, "number_of_in_flight_fetch": 0, "task_max_waiting_in_queue_millis": 0, "active_shards_percent_as_number": 50 }
A cluster health of yellow
means that all primary shards are up and
running (the cluster is capable of serving any request successfully) but
not all replica shards are active. In fact, all three replica shards
are currently unassigned
—they haven’t been allocated to a node. It
doesn’t make sense to store copies of the same data on the same node. If we
were to lose that node, we would lose all copies of our data.
Currently, our cluster is fully functional but at risk of data loss in case of hardware failure.