WARNING: Version 2.4 of Elasticsearch has passed its EOL date.
This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.
Reindex API
The reindex API is new and should still be considered experimental. The API may change in ways that are not backwards compatible.
Reindex does not attempt to set up the destination index. It does
not copy the settings of the source index. You should set up the destination
index prior to running a _reindex action, including setting up mappings, shard
counts, replicas, etc.
The most basic form of _reindex just copies documents from one index to another.
This will copy documents from the twitter index into the new_twitter index:
POST /_reindex { "source": { "index": "twitter" }, "dest": { "index": "new_twitter" } }
That will return something like this:
{ "took" : 147, "timed_out": false, "created": 120, "updated": 0, "batches": 1, "version_conflicts": 0, "failures" : [ ] }
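As a rough illustration, the basic request above could be assembled from Python with only the standard library. The host http://localhost:9200 and the helper name build_reindex_request are assumptions made for this sketch; they are not part of Elasticsearch.

```python
import json
import urllib.request


def build_reindex_request(host, source_index, dest_index):
    """Build (but do not send) a POST /_reindex request.

    `host` and this helper's name are illustrative assumptions.
    """
    body = {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }
    return urllib.request.Request(
        url=host.rstrip("/") + "/_reindex",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_reindex_request("http://localhost:9200", "twitter", "new_twitter")

# Sending it needs a running cluster, so it is left commented out:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Building the request separately from sending it makes it easy to inspect the body before anything is written to the destination index.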
Just like _update_by_query, _reindex gets a snapshot of the source index, but
its target must be a different index, so version conflicts are unlikely. The
dest element can be configured like the index API to control optimistic
concurrency. Just leaving out version_type (as above) or setting it to
internal will cause Elasticsearch to blindly dump documents into the target,
overwriting any that happen to have the same type and id:
POST /_reindex { "source": { "index": "twitter" }, "dest": { "index": "new_twitter", "version_type": "internal" } }
Setting version_type to external will cause Elasticsearch to preserve the
version from the source, create any documents that are missing, and update
any documents that have an older version in the destination index than they do
in the source index:
POST /_reindex { "source": { "index": "twitter" }, "dest": { "index": "new_twitter", "version_type": "external" } }
Setting op_type to create will cause _reindex to only create missing
documents in the target index. All existing documents will cause a version
conflict:
POST /_reindex { "source": { "index": "twitter" }, "dest": { "index": "new_twitter", "op_type": "create" } }
By default version conflicts abort the _reindex process, but you can just
count them by setting "conflicts": "proceed" in the request body:
POST /_reindex { "conflicts": "proceed", "source": { "index": "twitter" }, "dest": { "index": "new_twitter", "op_type": "create" } }
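The dest and conflicts options described above can be combined freely in one request body. The following is a minimal sketch of a body builder; the function name reindex_body and its keyword arguments are hypothetical, and it does not validate the values it is given.

```python
import json


def reindex_body(source_index, dest_index, version_type=None,
                 op_type=None, conflicts=None):
    """Assemble a _reindex body, adding only the options that were set."""
    dest = {"index": dest_index}
    if version_type is not None:
        dest["version_type"] = version_type  # e.g. "internal" or "external"
    if op_type is not None:
        dest["op_type"] = op_type            # e.g. "create"
    body = {"source": {"index": source_index}, "dest": dest}
    if conflicts is not None:
        body["conflicts"] = conflicts        # e.g. "proceed"
    return body


# Only create missing documents and count conflicts instead of aborting.
print(json.dumps(reindex_body("twitter", "new_twitter",
                              op_type="create", conflicts="proceed")))
```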
You can limit the documents by adding a type to the source or by adding a
query. This will only copy tweets made by kimchy into new_twitter:
POST /_reindex { "source": { "index": "twitter", "type": "tweet", "query": { "term": { "user": "kimchy" } } }, "dest": { "index": "new_twitter" } }
index and type in source can both be lists, allowing you to copy from
lots of sources in one request. This will copy documents from the tweet and
post types in the twitter and blog indices. It’d include the post type in
the twitter index and the tweet type in the blog index. If you want to be
more specific you’ll need to use the query. It also makes no effort to handle
ID collisions. The target index will remain valid but it’s not easy to predict
which document will survive because the iteration order isn’t well defined.
POST /_reindex { "source": { "index": ["twitter", "blog"], "type": ["tweet", "post"] }, "dest": { "index": "all_together" } }
It’s also possible to limit the number of processed documents by setting
size. This will only copy a single document from twitter to new_twitter:
POST /_reindex { "size": 1, "source": { "index": "twitter" }, "dest": { "index": "new_twitter" } }
If you want a particular set of documents from the twitter index you’ll
need to sort. Sorting makes the scroll less efficient but in some contexts
it’s worth it. If possible, prefer a more selective query to size and sort.
This will copy 10000 documents from twitter into new_twitter:
POST /_reindex { "size": 10000, "source": { "index": "twitter", "sort": { "date": "desc" } }, "dest": { "index": "new_twitter" } }
The source section supports all the elements that are supported in a
search request. For instance, only a subset of the
fields from the original documents can be reindexed using source filtering
as follows:
POST _reindex { "source": { "index": "twitter", "_source": ["user", "tweet"] }, "dest": { "index": "new_twitter" } }
Like _update_by_query, _reindex supports a script that modifies the
document. Unlike _update_by_query, the script is allowed to modify the
document’s metadata. This example bumps the version of the source document:
POST /_reindex { "source": { "index": "twitter" }, "dest": { "index": "new_twitter", "version_type": "external" }, "script": { "inline": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}" } }
Think of the possibilities! Just be careful! With great power…. You can change:
- _id
- _type
- _index
- _version
- _routing
- _parent
- _timestamp
- _ttl
Setting _version to null or clearing it from the ctx map is just like not
sending the version in an indexing request. It will cause that document to be
overwritten in the target index regardless of the version on the target or the
version type you use in the _reindex request.
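To make the effect of the example script concrete, here is a plain Python model of the same logic applied to a dictionary standing in for ctx. It only illustrates what the script does to one document; it is not how Elasticsearch executes scripts, and the sample document is made up.

```python
def apply_script(ctx):
    """Python model of the example script: bump the version, drop 'foo'."""
    if ctx["_source"].get("foo") == "bar":
        ctx["_version"] += 1
        ctx["_source"].pop("foo")
    return ctx


# Hypothetical document, shaped like the ctx map the script sees.
doc = {"_id": "1", "_version": 3,
       "_source": {"foo": "bar", "user": "kimchy"}}
apply_script(doc)

# A document whose foo is not 'bar' is left untouched.
other = {"_id": "2", "_version": 7, "_source": {"foo": "baz"}}
apply_script(other)
```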
By default if _reindex sees a document with routing then the routing is
preserved unless it’s changed by the script. You can set routing on the
dest request to change this:
- keep: Sets the routing on the bulk request sent for each match to the routing on the match. The default.
- discard: Sets the routing on the bulk request sent for each match to null.
- =<some text>: Sets the routing on the bulk request sent for each match to all text after the =.
For example, you can use the following request to copy all documents from
the source index with the company name cat into the dest index with
routing set to cat.
POST /_reindex { "source": { "index": "source", "query": { "match": { "company": "cat" } } }, "dest": { "index": "dest", "routing": "=cat" } }
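The three routing options can be summarized as a small decision function. This is a sketch of the semantics described above, not Elasticsearch's implementation; resolve_routing is a hypothetical name, and raising ValueError for any other option is an assumption of this sketch.

```python
def resolve_routing(option, match_routing):
    """Return the routing to put on the bulk request for one match."""
    if option == "keep":
        return match_routing   # the default: preserve the match's routing
    if option == "discard":
        return None            # drop the routing
    if option.startswith("="):
        return option[1:]      # everything after the '='
    # Assumption: anything else is treated as invalid in this sketch.
    raise ValueError("unsupported routing option: %r" % option)


kept = resolve_routing("keep", "dog")
dropped = resolve_routing("discard", "dog")
forced = resolve_routing("=cat", "dog")
```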
By default _reindex uses scroll batches of 1000. You can change the
batch size with the size field in the source element:
POST _reindex { "source": { "index": "source", "size": 100 }, "dest": { "index": "dest" } }
URL Parameters
In addition to the standard parameters like pretty, the Reindex API also
supports refresh, wait_for_completion, consistency, timeout, and
requests_per_second.
Sending the refresh URL parameter will cause all indexes to which the request
wrote to be refreshed. This is different than the Index API’s refresh
parameter, which causes just the shard that received the new data to be refreshed.
If the request contains wait_for_completion=false then Elasticsearch will
perform some preflight checks, launch the request, and then return a task
which can be used with Tasks APIs to cancel or get
the status of the task. For now, once the request is finished the task is gone
and the only place to look for the ultimate result of the task is in the
Elasticsearch log file. This will be fixed soon.
consistency controls how many copies of a shard must respond to each write
request. timeout controls how long each write request waits for unavailable
shards to become available. Both work exactly how they work in the
Bulk API.
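URL parameters are appended to the request path as an ordinary query string. A minimal sketch of doing that with the Python standard library follows; the helper name reindex_url and the host are assumptions, and parameter values are passed as strings or numbers exactly as they should appear in the URL.

```python
from urllib.parse import urlencode


def reindex_url(host, **params):
    """Build the _reindex URL with optional query-string parameters."""
    url = host.rstrip("/") + "/_reindex"
    query = urlencode(params)
    return url + "?" + query if query else url


# Launch in the background and throttle; "false" is passed as text so it
# is not rendered as Python's "False".
url = reindex_url("http://localhost:9200",
                  wait_for_completion="false",
                  requests_per_second=500)
```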
requests_per_second can be set to any decimal number (1.4, 6, 1000, etc.)
and throttles the number of requests per second that the reindex issues. The
throttling is done by waiting between bulk batches so that it can manipulate the
scroll timeout. The wait time is the difference between the target time for the
batch, requests_in_the_batch / requests_per_second, and the time it took the
batch to complete.
Since the batch isn’t broken into multiple bulk requests, large batch sizes will
cause Elasticsearch to create many requests and then wait for a while before
starting the next set. This is "bursty" instead of "smooth". The default is
unlimited, which is also the only non-number value that it accepts.
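The wait between batches can be worked through with a small calculation. This sketch assumes the target time for a batch is its size divided by requests_per_second, and that the wait is never negative; both the function name and the clamping to zero are assumptions of this illustration rather than documented behavior.

```python
def throttle_wait_seconds(batch_size, requests_per_second, batch_took_seconds):
    """Estimate how long reindex sleeps after one batch."""
    if requests_per_second == "unlimited":
        return 0.0  # throttling disabled
    target_seconds = batch_size / float(requests_per_second)
    # Assumption: a batch slower than its target time causes no wait.
    return max(0.0, target_seconds - batch_took_seconds)


# 1000 documents at 500 requests per second gives a 2 second target;
# if the batch took 0.5 seconds, the remainder is spent waiting.
wait = throttle_wait_seconds(1000, 500, 0.5)
```

With the default batch size of 1000, each wait covers a whole batch at once, which is the "bursty" behavior described above.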
Response body
The JSON response looks like this:
{ "took" : 639, "updated": 0, "created": 123, "batches": 1, "version_conflicts": 2, "retries": 0, "throttled_millis": 0, "failures" : [ ] }
- took: The number of milliseconds from start to end of the whole operation.
- updated: The number of documents that were successfully updated.
- created: The number of documents that were successfully created.
- batches: The number of scroll responses pulled back by the reindex.
- version_conflicts: The number of version conflicts that reindex hit.
- retries: The number of retries that the reindex did in response to a full queue.
- throttled_millis: Number of milliseconds the request slept to conform to requests_per_second.
- failures: Array of all indexing failures. If this is non-empty then the request aborted because of those failures. See conflicts for how to prevent version conflicts from aborting the operation.
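A client will usually want to reduce these fields to a quick verdict. The sketch below parses the sample response shown above and summarizes it; the summarize helper and the keys of its result are hypothetical names chosen for this example.

```python
import json

# The sample response from this section.
response = json.loads(
    '{ "took" : 639, "updated": 0, "created": 123, "batches": 1, '
    '"version_conflicts": 2, "retries": 0, "throttled_millis": 0, '
    '"failures" : [ ] }'
)


def summarize(resp):
    """Condense a _reindex response into a few headline values."""
    return {
        "written": resp["created"] + resp["updated"],
        "conflicts": resp["version_conflicts"],
        # A non-empty failures array means the request aborted.
        "aborted": bool(resp["failures"]),
    }


summary = summarize(response)
```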
Works with the Task API
While Reindex is running you can fetch its status using the Task API:
GET /_tasks/?pretty&detailed=true&actions=*reindex
The response looks like:
{ "nodes" : { "r1A2WoRbTwKZ516z6NEs5A" : { "name" : "Tyrannus", "transport_address" : "127.0.0.1:9300", "host" : "127.0.0.1", "ip" : "127.0.0.1:9300", "attributes" : { "testattr" : "test", "portsfile" : "true" }, "tasks" : { "r1A2WoRbTwKZ516z6NEs5A:36619" : { "node" : "r1A2WoRbTwKZ516z6NEs5A", "id" : 36619, "type" : "transport", "action" : "indices:data/write/reindex", "status" : { "total" : 6154, "updated" : 3500, "created" : 0, "deleted" : 0, "batches" : 4, "version_conflicts" : 0, "noops" : 0, "retries": 0, "throttled_millis": 0 }, "description" : "" } } } } }
this object contains the actual status. It is just like the response json
with the important addition of the |
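The tasks response is nested by node and then by task, so pulling out the reindex status takes a short walk over the structure. The following sketch assumes the response shape shown above; the helper name is hypothetical and the sample is a trimmed, made-up version of that response with one unrelated task added for contrast.

```python
def reindex_task_statuses(tasks_response):
    """Map task id -> status for every running reindex task."""
    statuses = {}
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            if task.get("action") == "indices:data/write/reindex":
                statuses[task_id] = task.get("status", {})
    return statuses


# Trimmed sample in the same shape as the response above.
sample = {
    "nodes": {
        "r1A2WoRbTwKZ516z6NEs5A": {
            "tasks": {
                "r1A2WoRbTwKZ516z6NEs5A:36619": {
                    "action": "indices:data/write/reindex",
                    "status": {"total": 6154, "updated": 3500, "created": 0},
                },
                "r1A2WoRbTwKZ516z6NEs5A:36620": {
                    "action": "cluster:monitor/tasks/lists",
                },
            }
        }
    }
}

statuses = reindex_task_statuses(sample)
```

The keys of the result are the task ids accepted by the cancel and rethrottle requests.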
Works with the Cancel Task API
Any Reindex can be canceled using the Task Cancel API:
POST /_tasks/{task_id}/_cancel
The task_id can be found using the tasks API above.
Cancellation should happen quickly but might take a few seconds. The task status API above will continue to list the task until it wakes to cancel itself.
Rethrottling
The value of requests_per_second can be changed on a running reindex using
the _rethrottle API:
POST /_reindex/{task_id}/_rethrottle?requests_per_second=unlimited
The task_id can be found using the tasks API above.
Just like when setting it on the _reindex API, requests_per_second can be
either unlimited to disable throttling or any decimal number like 1.7 or
12 to throttle to that level. Rethrottling that speeds up the query takes
effect immediately, but rethrottling that slows down the query will take effect
only after completing the current batch. This prevents scroll timeouts.
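Since requests_per_second accepts only unlimited or a number, a client can check the value before building the rethrottle URL. This is a minimal sketch; the helper name and host are assumptions, and the check simply relies on Python's float() to reject non-numeric text.

```python
def rethrottle_url(host, task_id, requests_per_second):
    """Build the _rethrottle URL for a running reindex task."""
    if requests_per_second != "unlimited":
        # Raises ValueError for anything that is not a number.
        float(requests_per_second)
    return "{}/_reindex/{}/_rethrottle?requests_per_second={}".format(
        host.rstrip("/"), task_id, requests_per_second
    )


url = rethrottle_url("http://localhost:9200",
                     "r1A2WoRbTwKZ516z6NEs5A:36619", "unlimited")
slower = rethrottle_url("http://localhost:9200",
                        "r1A2WoRbTwKZ516z6NEs5A:36619", 1.7)
```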
Reindex to change the name of a field
_reindex can be used to build a copy of an index with renamed fields. Say you
create an index containing documents that look like this:
POST test/test/1?refresh&pretty { "text": "words words", "flag": "foo" }
But you don’t like the name flag and want to replace it with tag.
_reindex can create the other index for you:
POST _reindex?pretty { "source": { "index": "test" }, "dest": { "index": "test2" }, "script": { "inline": "ctx._source.tag = ctx._source.remove(\"flag\")" } }
Now you can get the new document:
GET test2/test/1?pretty
and it’ll look like:
{ "text": "words words", "tag": "foo" }
Or you can search by tag or whatever you want.
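The rename script works because removing a key from the source map returns its old value, which is then stored under the new name. The same idea in plain Python, as a model of the script's effect on one document rather than of how Elasticsearch runs it (rename_field is a hypothetical helper):

```python
def rename_field(source, old, new):
    """Move the value stored under `old` to `new`, like the script does."""
    source[new] = source.pop(old)
    return source


original = {"text": "words words", "flag": "foo"}
# Work on a copy so the original document is left unchanged.
renamed = rename_field(dict(original), "flag", "tag")
```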