Where are my documents?
Refreshing news...
When you send Elasticsearch a request that modifies or creates documents and
it replies with 200 OK or 201 CREATED, it has synced the changes to disk on all
active shards (see footnote 1).
That means that the changes will survive catastrophic system shutdown but it doesn't mean that the changes are available for search. The process that
makes changes available for search is called a "refresh" and it is the topic of
this post.
Refreshes are performed periodically (index.refresh_interval), when the
indexing buffer is full, and on demand (?refresh); see footnote 2.
On demand refreshing is rarely used outside of testing because it creates small
index segments which are inefficient to create and search and must later be merged
into larger segments. Waiting for the indexing buffer to be full is
unpredictable so we can't rely on it either. That means that we mostly think of
the index as being refreshed every index.refresh_interval, which defaults to
1 second.
The problem
Refreshing every second is fine if you are indexing something like logs where you expect to be some amount of time behind real time, but if you are indexing blog posts or comments or calendars then it can be a bit difficult. For anything where a user might expect to make a change and immediately be able to search for that change (blog, forum, scheduling app) your application needs some way to know that the change is visible for search. This is doubly true for applications that want to use search for something interesting after the user's change (think scheduling or aggregations). In those cases you have a few options all of which have interesting tradeoffs:
- Wait for the refresh, perhaps polling to check that it is there.
- Force a refresh with ?refresh
- Wait for the refresh to occur with ?refresh=wait_for (coming in 5.0-alpha4!)
Wait for the refresh
You could just wait for the refresh interval to pass. This has the advantage of being something you can do totally asynchronously. The disadvantage is that you have to wait for the whole one second and even then it is not guaranteed. Refresh isn't instant. Usually it is pretty quick but some refreshes will be slower than others so you can't really predict it. For applications where you can tolerate not knowing for sure if something is available for search then this is totally the right choice. But this blog post really isn't about those applications. So, for the sake of this blog post, we're going to assume this option isn't good enough for you.
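If you do go the polling route, the loop might look something like the sketch below. It is only an illustration: is_visible is a hypothetical callable that you would write yourself (for example, a search for the document you just indexed), and the timeout and polling numbers are arbitrary.

```python
import time
from typing import Callable


def wait_until_visible(is_visible: Callable[[], bool],
                       timeout: float = 5.0,
                       poll_every: float = 0.1,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll is_visible() until it returns True or roughly `timeout` seconds pass."""
    waited = 0.0
    while True:
        if is_visible():
            return True
        if waited >= timeout:
            return False  # gave up; the change may still show up later
        sleep(poll_every)
        waited += poll_every


# Stand-in for a real search: the change shows up on the third check.
answers = iter([False, False, True])
print(wait_until_visible(lambda: next(answers), sleep=lambda seconds: None))
```

The sleep parameter exists only so the example runs without actually pausing. Note that a False result doesn't tell you the change is lost, just that you stopped looking.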
Force a refresh with ?refresh
You could force an immediate refresh. This has the advantage of being pretty quick. Like I said a few paragraphs up, it has the disadvantage of creating small segments that are inefficient to create, search, and merge. For plenty of use cases this inefficiency is worth the speed. Don't be afraid to force a refresh if it makes sense for your use case.
For example, say you are loading something into Elasticsearch and plan to analyze the results. This search index is just for you so you know when you are done loading documents. At that point you shouldn't hesitate to refresh the index. Waiting isn't going to help.
I should mention that adding ?refresh to an index, update, delete, or bulk
request is subtly different than performing a refresh API call. Refresh API
calls will refresh all the shards on the index. ?refresh will only refresh the
shards that have been modified. So for index, update, and delete requests that
is just the shard to which the document was routed. For bulk requests that is
all shards to which any document was routed.
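To make the two flavors concrete, here is a small Python sketch that builds the request paths side by side. The index name blog, the type post, and the id 1 are placeholders, and the /{index}/{type}/{id} path shape is an assumption about your version, so adjust it to match yours.

```python
from urllib.parse import urlencode


def index_doc_path(index: str, doc_type: str, doc_id: str, refresh: str = "") -> str:
    """Path for an index request; `refresh` may be "", "true", or "wait_for"."""
    path = f"/{index}/{doc_type}/{doc_id}"
    if refresh:
        path += "?" + urlencode({"refresh": refresh})
    return path


def refresh_api_path(index: str) -> str:
    """Path for the refresh API, which refreshes every shard of the index."""
    return f"/{index}/_refresh"


# Refreshes only the shard this document is routed to:
print("PUT", index_doc_path("blog", "post", "1", refresh="true"))
# Refreshes all shards of the index:
print("POST", refresh_api_path("blog"))
```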
?refresh might also be a bad choice because it affects other indexing in the
same index. Say you have a bulk loading process that works quite well. But now
you want to start inserting a few documents into the same index interactively.
If you do it with ?refresh then, suddenly, you've started refreshing documents
outside of whatever refresh interval you were using for the bulk load. If you
do that frequently enough that'll change the search and index performance of
the bulk loading process.
Wait for the refresh with ?refresh=wait_for (coming in 5.0-alpha4!)
Elasticsearch 5.0 brings a hybrid approach between the two options. Adding
?refresh=wait_for to an index, update, delete, or bulk request will cause the
request to wait until its changes have been made visible for search before
returning to the user. This has the advantage of being correct without
creating inefficient segments. It has the disadvantage of having to wait for
the refresh. You don't have to wait for as long as the "wait for the refresh"
option because Elasticsearch signals you as soon as the document is ready for
search. So if the change comes halfway through the refresh interval you only
have to wait for half of the time.
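The "half of the time" arithmetic can be spelled out with a toy model. This sketch assumes refreshes fire exactly on schedule and take no time, which real refreshes don't, so treat the result as a rough estimate. The function name remaining_wait is made up for illustration.

```python
def remaining_wait(elapsed_since_refresh: float, interval: float = 1.0) -> float:
    """Seconds until the next periodic refresh, in an idealized model.

    Assumes refreshes fire exactly every `interval` seconds and are instant.
    """
    if not 0 <= elapsed_since_refresh <= interval:
        raise ValueError("elapsed_since_refresh must be between 0 and interval")
    return interval - elapsed_since_refresh


# A change that arrives halfway through a 1 second interval waits about half a second.
print(remaining_wait(0.5, interval=1.0))
```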
Unlike ?refresh, ?refresh=wait_for won't affect concurrent indexing on the same
index. It has no effect on segment size because it doesn't force a refresh
immediately. If you must know when the refresh happens, you can wait for the
refresh, and you plan to upgrade to 5.0 (see footnote 3), then this is the
right choice!
Even if you are super excited to upgrade to 5.0-alpha4 to get this feature, keep in mind that Elasticsearch's alphas and betas are for testing purposes only because they aren't compatible with the GA release. We are still finalizing the wire level communications and on disk layout so 5.0's alphas and betas aren't guaranteed to upgrade properly to 5.0.0, either with rolling restarts or a full cluster restart. Please test this feature to see if it fits for you but don't upgrade production clusters to alphas or betas.
Back to the feature, there is a limit to the number of ?refresh=wait_for API
calls that can be waiting on any one shard: index.max_refresh_listeners, which
defaults to 1000. If a request with ?refresh=wait_for comes in while all the
slots are full then Elasticsearch will refresh the shard and reply to the
request immediately.
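Here is a toy Python model of that slot limit, just to make the behavior easier to picture. It is not how Elasticsearch implements it. The class and method names are invented, and the sketch assumes that a forced refresh also releases the requests that were already waiting, since a refresh makes their changes visible too.

```python
class ShardRefreshListeners:
    """Toy model of the ?refresh=wait_for slots on a single shard."""

    def __init__(self, max_listeners: int = 1000):
        self.max_listeners = max_listeners
        self.waiting = []  # ids of requests waiting for the next refresh

    def add(self, request_id: str) -> str:
        """Register a wait_for request and report what happened to it."""
        if len(self.waiting) >= self.max_listeners:
            # All slots are taken: refresh now and answer this request right away.
            self.refresh()
            return "forced_refresh"
        self.waiting.append(request_id)
        return "waiting"

    def refresh(self) -> list:
        """Release every waiting request (assumption: a refresh answers them all)."""
        released, self.waiting = self.waiting, []
        return released


shard = ShardRefreshListeners(max_listeners=2)
print([shard.add(request_id) for request_id in ("a", "b", "c")])
```

With only two slots, the third request finds them full and triggers the refresh instead of waiting.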
What does ?refresh=wait_for do if you set the index.refresh_interval to -1,
disabling periodic refreshes, you may ask? Well, the answer is that
?refresh=wait_for will honor whatever refresh interval you configure. The
request will only return when you fill the indexing buffer, force an explicit
refresh, or try to wait on more than index.max_refresh_listeners requests in
the same shard.
index.refresh_interval is just about the maximum amount of time that
?refresh=wait_for will have to wait for the changes to become visible. If you
use ?refresh=wait_for, raising the refresh interval will make indexing feel
slower and slower to your users. And lowering it will make indexing feel faster
and faster. So it might be tempting to lower the refresh interval. Doing so
will make less and less efficient segments. Making the refresh interval the
same as the write rate is as inefficient as using ?refresh on every request.
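A bit of back-of-the-envelope arithmetic shows why. The sketch below assumes a steady write rate on a single shard and that every refresh picks up whatever arrived since the previous one. The function name and the numbers are made up for illustration.

```python
def docs_per_refresh(writes_per_second: float, refresh_interval_s: float) -> float:
    """Average number of changes folded into each periodic refresh (toy model)."""
    return writes_per_second * refresh_interval_s


# At 4 writes per second, shrinking the interval shrinks what each refresh covers.
for interval in (1.0, 0.5, 0.25):
    changes = docs_per_refresh(4, interval)
    print(f"interval={interval}s: {changes} changes per refresh")
```

Once the interval gets down to one change per refresh, you are in the same place as sending ?refresh with every request.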
Pick the refresh strategy that makes sense for you
Ultimately there is no silver bullet for refreshes. Elasticsearch's
index.refresh_interval is useful because it coalesces several changes into one
big change to the search index, making a more efficient index. You either wait
for the refresh interval, potentially slowing down your users, or you force an
immediate refresh and pay the price at search and merge time.
?refresh=wait_for gives you a tool to make waiting for the refresh interval
interactive so you can make whatever tradeoffs make sense for you.
Footnotes
1 This is not always true but it is the recommended configuration. See index.translog.durability.
2 Refreshes also occur during recovery, the process that moves shards between nodes. This ought to be rare enough not to factor into most thinking about refreshes.
3 You can cobble together something that works alright in older versions of Elasticsearch using the steps here and/or here. The trouble is that it doesn't work well with bulk and/or replicas. It is far from perfect.