Reindex is coming!

_reindex and _update_by_query are coming to Elasticsearch 2.3.0 and 5.0.0-alpha1! Hurray!
_reindex reads documents from one index and writes them to another index. It can be used to copy documents from one index to another, enrich documents with fields, or recreate the index to change settings that are locked when the index is created.
_update_by_query reads documents from an index and writes them back to the same index. It can be used to update fields in many documents at once or to pick up mapping changes that can be made online.
_reindex copies documents

The _reindex API is really just a convenient way to copy documents from one index to another. Everything else that it can do is an outgrowth of that. If all you want to do is copy all the documents from the src index into the dest index, you invoke _reindex like this:
```shell
curl -XPOST localhost:9200/_reindex?pretty -d'{
  "source": {
    "index": "src"
  },
  "dest": {
    "index": "dest"
  }
}'
```
If you want to be a little more selective and, say, only copy documents tagged with bananas, you invoke _reindex like this:
```shell
curl -XPOST localhost:9200/_reindex?pretty -d'{
  "source": {
    "index": "src",
    "query": {
      "match": { "tags": "bananas" }
    }
  },
  "dest": {
    "index": "dest"
  }
}'
```
If you want to copy documents tagged with bananas but also add the chocolate tag to all copied documents, you invoke _reindex like this:
```shell
curl -XPOST localhost:9200/_reindex?pretty -d'{
  "source": {
    "index": "src",
    "query": {
      "match": { "tags": "bananas" }
    }
  },
  "dest": {
    "index": "dest"
  },
  "script": {
    "inline": "ctx._source.tags += \"chocolate\""
  }
}'
```
That requires that you have dynamic scripts enabled, but you can do the same thing with non-inline scripts.
Recreating an index to change settings that are locked at index creation is a bit more involved, but still simpler than it was before _reindex:
```shell
# Say you have an old index that you made like this
curl -XPUT localhost:9200/test_1 -d'{
  "aliases": {
    "test": {}
  }
}'
for i in $(seq 1 1000); do
  curl -XPOST localhost:9200/test/test -d'{"tags": ["bananas"]}'
  echo
done
curl -XPOST localhost:9200/test/_refresh?pretty

# But you don't like having the default number of shards.
# You can make a copy of it with the new number of shards:
curl -XPUT localhost:9200/test_2 -d'{
  "settings": {
    "number_of_shards": 1
  }
}'
curl -XPOST 'localhost:9200/_reindex?pretty&refresh' -d'{
  "source": {
    "index": "test"
  },
  "dest": {
    "index": "test_2"
  }
}'

# Then just swing the alias to the new index
curl -XPOST localhost:9200/_aliases?pretty -d'{
  "actions": [
    { "remove": { "index": "test_1", "alias": "test" } },
    { "add":    { "index": "test_2", "alias": "test" } }
  ]
}'

# Then, when you are good and sure you are done with it, you can
curl -XDELETE localhost:9200/test_1?pretty
```
_update_by_query modifies documents
The simplest way to invoke update by query isn't particularly useful on its own:
```shell
curl -XPOST localhost:9200/test/_update_by_query?pretty
```
That will just increment the document version number on each document in the test index, and it will fail if you modify a document while it is running.
A more interesting example is adding the chocolate tag to all documents with the bananas tag:
```shell
curl -XPOST 'localhost:9200/test/_update_by_query?pretty&refresh' -d'{
  "query": {
    "bool": {
      "must": [ {"match": {"tags": "bananas"}} ],
      "must_not": [ {"match": {"tags": "chocolate"}} ]
    }
  },
  "script": {
    "inline": "ctx._source.tags += \"chocolate\""
  }
}'
```
Like the last version, this will fail if any documents are changed while it is running, but it is written in such a way that you can just retry it and it'll pick up where it left off. If you've already modified whatever application is making the concurrent updates so that it adds the chocolate tag whenever it sees bananas, then you can safely ignore version conflicts in the _update_by_query. You can tell it to do so by setting conflicts=proceed. It will just count the version conflicts and continue performing updates. Now the command looks like this:
```shell
curl -XPOST 'localhost:9200/test/_update_by_query?pretty&refresh&conflicts=proceed' -d'{
  "query": {
    "bool": {
      "must": [ {"match": {"tags": "bananas"}} ],
      "must_not": [ {"match": {"tags": "chocolate"}} ]
    }
  },
  "script": {
    "inline": "ctx._source.tags += \"chocolate\""
  }
}'
```
Finally, you can use _update_by_query to pick up mapping changes that only take effect when a document is modified, like adding a new sub-field to an existing field. For example:
```shell
# Say I made an index with tags not_analyzed because, you know, they are tags after all
curl -XPUT localhost:9200/test_3?pretty -d'{
  "mappings": {
    "test": {
      "properties": {
        "tags": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}'
for i in $(seq 1 1000); do
  curl -XPOST localhost:9200/test_3/test -d'{"tags": ["bananas"]}'
  echo
done
curl -XPOST localhost:9200/test_3/_refresh?pretty

# But now I want to search on tags using the standard analyzer
# so I can search for banana and find bananas
curl -XPUT localhost:9200/test_3/_mapping/test?pretty -d'{
  "properties": {
    "tags": {
      "type": "string",
      "index": "not_analyzed",
      "fields": {
        "analyzed": {
          "type": "string",
          "analyzer": "standard"
        }
      }
    }
  }
}'

# This doesn't take effect immediately
curl 'localhost:9200/test_3/_search?pretty' -d'{
  "query": {
    "match": { "tags.analyzed": "bananas" }
  }
}'
# :(

# But we can _update_by_query to pick up the new mapping on all documents
curl -XPOST 'localhost:9200/test_3/_update_by_query?pretty&conflicts=proceed&refresh'

# And now the new mapping has been applied to the whole index!
curl 'localhost:9200/test_3/_search?pretty' -d'{
  "query": {
    "match": { "tags.analyzed": "bananas" }
  }
}'
```
Getting the status
_reindex and _update_by_query can touch millions of documents, so they can take a long time. You can fetch their status with:
```shell
curl 'localhost:9200/_tasks?pretty&detailed&actions=*reindex,*byquery'
```
The response will contain an entry that looks like this:
```json
"BHgHr0cETkOehwqZ2N_-aQ:28295" : {
  "node" : "BHgHr0cETkOehwqZ2N_-aQ",
  "id" : 28295,
  "type" : "transport",
  "action" : "indices:data/write/reindex",
  "start_time_in_millis" : 1458767149108,
  "running_time_in_nanos" : 5475314,
  "status" : {
    "total" : 6154,
    "updated" : 3500,
    "created" : 0,
    "deleted" : 0,
    "batches" : 36,
    "version_conflicts" : 0,
    "noops" : 0,
    "retries" : 0,
    "throttled_millis" : 0
  }
}
```
You can read the docs for more, but the gist is that _reindex plans to do total operations and has already done updated + created + deleted + noops of them. So you can estimate how complete the request is by dividing those numbers.
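That estimate is just arithmetic on the status fields. A minimal sketch, plugging in the sample values from the status block above (hypothetical numbers, not live output):

```shell
# Values copied from the example status block above
total=6154
updated=3500
created=0
deleted=0
noops=0

done_ops=$((updated + created + deleted + noops))
echo "$((100 * done_ops / total))% complete"   # prints "56% complete"
```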
Cancelling
_reindex was so long in coming because Elasticsearch lacked a way to cancel running tasks. For short-running tasks like _search and indexing that is fine. But, like I wrote above, _reindex and _update_by_query can touch millions of documents and take a long time. The tasks themselves are OK with that, but you may not be. Say you realize ten minutes into a three-hour _update_by_query that you made a mistake in the script. There isn't a way to roll back the changes that the reindex has already made, but you can cancel it so it won't make any more such changes:
```shell
curl -XPOST localhost:9200/_tasks/{taskId}/_cancel
```
And where do you get the taskId? It is the name of the object returned by the task listing API in the last section of this blog post. The one in the example response is BHgHr0cETkOehwqZ2N_-aQ:28295.
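If you'd rather not pick the name out by eye, here is a crude sketch of extracting it from a saved _tasks response with grep. This assumes the response text is in a shell variable; in real use you would capture the output of the curl above, and a JSON-aware tool would be more robust:

```shell
# A snippet of the sample _tasks response from above, saved in a variable
response='"BHgHr0cETkOehwqZ2N_-aQ:28295" : { "node" : "BHgHr0cETkOehwqZ2N_-aQ", "id" : 28295 }'

# The task name is the only token shaped like <node>:<number>
task_id=$(printf '%s' "$response" | grep -oE '[A-Za-z0-9_-]+:[0-9]+' | head -n1)
echo "$task_id"   # prints "BHgHr0cETkOehwqZ2N_-aQ:28295"

# ...which you could then feed to the cancel API:
# curl -XPOST "localhost:9200/_tasks/$task_id/_cancel"
```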
In Elasticsearch, task cancelation is opt-in. It kind of has to be that way in any Java application. Anyway, tasks that can be canceled, like _reindex and _update_by_query, periodically check to see whether they have been canceled and then shut themselves down. This means that you might still see the task if you list its status immediately after it has been canceled. It will go away on its own, and you can't cancel it any harder without stopping the node it is running on.
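The "periodically check" part is easy to picture with a toy model (this is illustrative shell, not Elasticsearch code): a worker that only looks at its cancel flag between batches always finishes the batch it is on before stopping, which is why a just-canceled task can still show up in the listing.

```shell
# Toy model of cooperative cancellation: the flag is only checked
# between batches, so a cancel takes effect before the *next* batch.
flag=$(mktemp) && rm -f "$flag"          # flag file absent = not cancelled

for batch in 1 2 3 4 5; do
  if [ -e "$flag" ]; then
    echo "cancelled before batch $batch"
    break
  fi
  echo "processing batch $batch"
  if [ "$batch" -eq 2 ]; then
    touch "$flag"                        # simulate a _cancel arriving mid-run
  fi
done
```

This prints batches 1 and 2 and then "cancelled before batch 3".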
Remember that Elasticsearch is a search engine

Every update has to mark the old document as deleted and index the entire new document. The deleted documents then have to be merged out of the index. _reindex and _update_by_query don't save anything in that process. They work just as though you had performed a scroll query and indexed all the results. Running a zillion _reindexes or _update_by_querys is unlikely to be the most efficient use of computing resources. You will almost always be better off making changes to the application that adds data to Elasticsearch rather than updating the data after the fact.
_reindex and _update_by_query are most useful for turning the data that you already have in Elasticsearch into the data that you want to be in Elasticsearch.