Elasticsearch highlights

This list summarizes the most important enhancements in Elasticsearch 7.7. For the complete list, go to Elasticsearch release highlights.

Fixed index corruption on shrunk indices

Applying deletes or updates to an index after it had been shrunk would likely corrupt the index. We advise users of Elasticsearch 6.x who have opted in to soft deletes on some of their indices, and all users of Elasticsearch 7.x, to upgrade to 7.7 as soon as possible so that they are no longer exposed to this corruption bug. If upgrading in the near future is not an option, we recommend that you stop using _shrink on read-write indices entirely and that you force-merge read-only indices right after shrinking them, which significantly reduces the likelihood of being affected if deletes or updates are applied by mistake. This bug is fixed as of Elasticsearch 7.7.0. Low-level details can be found on the corresponding issue.

Significant reduction of heap usage of segments

This release of Elasticsearch significantly reduces the amount of heap memory that is needed to keep Lucene segments open. In addition to helping with cluster stability, this helps reduce costs by allowing each node to store much more data before hitting memory limits.

Transforms – now in GA!

In 7.7, we move transforms from beta to general availability.

Transforms enable you to pivot existing Elasticsearch indices using group-by and aggregations into a destination feature index, which provides opportunities for new insights and analytics. For example, you can use transforms to pivot your data into entity-centric indices that summarize the behavior of users or sessions or other entities in your data.

Transforms now include support for cross-cluster search, which allows you to create your destination feature index on a separate cluster from the source indices.

Aggregation support has been expanded within transforms to include support for multi-value (percentiles) and filter aggregations. We also optimized the performance of the date histogram aggregations.
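
As a rough illustration of the pivot described above, the following sketch builds a request body for an entity-centric transform. The index names (`web-logs`, `user-summary`), the field names (`user_id`, `bytes`, `response_time`), and the helper `build_pivot_transform` are hypothetical and chosen only for this example; the overall `source` / `pivot` / `dest` layout is meant to follow the create transform API.

```python
import json


def build_pivot_transform(source_index, dest_index, entity_field):
    """Build a transform body that summarizes documents per entity.

    All index and field names used here are illustrative assumptions.
    """
    return {
        "source": {"index": source_index},
        "pivot": {
            # One destination document per distinct value of entity_field.
            "group_by": {
                entity_field: {"terms": {"field": entity_field}}
            },
            "aggregations": {
                "total_bytes": {"sum": {"field": "bytes"}},
                # A multi-value aggregation (percentiles).
                "response_time_pct": {
                    "percentiles": {
                        "field": "response_time",
                        "percents": [50, 95],
                    }
                },
            },
        },
        "dest": {"index": dest_index},
    }


body = build_pivot_transform("web-logs", "user-summary", "user_id")
print(json.dumps(body, indent=2))
```

A body like this would typically be sent with a `PUT _transform/<transform_id>` request; the transform ID is yours to choose.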

Introducing multiclass classification

Classification using multiple classes is now available in data frame analytics. Classification is a supervised machine learning technique that was already available in binary form in the previous release. Multiclass classification works well with up to 30 distinct categories.
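
A minimal sketch of what a multiclass job configuration might look like, assuming a hypothetical source index `iris` with a `species` field to predict. The helper name, the index names, and the pre-flight check on the number of classes are illustrative additions; the check simply mirrors the 30-category guidance above and is not something the API is claimed to enforce.

```python
MAX_RECOMMENDED_CLASSES = 30  # guidance quoted in the text above


def build_classification_job(source_index, dest_index, dependent_variable, class_labels):
    """Build a data frame analytics body for a classification job.

    class_labels is the collection of label values seen in the training
    data; it is used only for a local sanity check.
    """
    distinct = set(class_labels)
    if len(distinct) < 2:
        raise ValueError("classification needs at least two classes")
    if len(distinct) > MAX_RECOMMENDED_CLASSES:
        raise ValueError(
            f"{len(distinct)} classes exceeds the recommended maximum of "
            f"{MAX_RECOMMENDED_CLASSES}"
        )
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "analysis": {
            "classification": {
                "dependent_variable": dependent_variable,
                # Report probabilities for every class in this small example.
                "num_top_classes": len(distinct),
            }
        },
    }


job = build_classification_job(
    "iris", "iris-predictions", "species", ["setosa", "versicolor", "virginica"]
)
print(job["analysis"]["classification"])
```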

Feature importance at inference time

Feature importance can now be calculated at inference time. This value provides further insight into the results of a classification or regression job and therefore helps you interpret these results.

Finer memory control for bucket aggregations

While building buckets, aggregations now periodically check the real-memory circuit breaker before continuing to allocate more buckets. This makes them more responsive to memory pressure and avoids OutOfMemory situations caused by allocating more buckets than the node can handle.
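
The idea of checking a breaker periodically while allocating can be sketched with a toy model. Everything here is an assumption made for illustration: the class and function names, the per-bucket size estimate, and the check interval do not reflect Elasticsearch internals.

```python
class CircuitBreakingError(Exception):
    """Raised when the toy breaker trips."""


class RealMemoryBreaker:
    """Toy stand-in for a memory circuit breaker."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def check(self):
        if self.used_bytes > self.limit_bytes:
            raise CircuitBreakingError(
                f"{self.used_bytes} bytes used, limit is {self.limit_bytes}"
            )


def build_buckets(keys, breaker, bytes_per_bucket=1024, check_every=1024):
    """Count keys into buckets, consulting the breaker every check_every new buckets."""
    buckets = {}
    for key in keys:
        if key not in buckets:
            buckets[key] = 0
            breaker.used_bytes += bytes_per_bucket
            # Checking periodically keeps overhead low; the overshoot is
            # bounded by check_every * bytes_per_bucket.
            if len(buckets) % check_every == 0:
                breaker.check()
        buckets[key] += 1
    return buckets


small = build_buckets(["a", "b", "a"], RealMemoryBreaker(10 * 1024), check_every=2)
print(small)

try:
    build_buckets(range(100), RealMemoryBreaker(4 * 1024), check_every=8)
except CircuitBreakingError as err:
    print("aborted:", err)
```

The second call stops after a handful of buckets rather than allocating all 100, which is the behavior the periodic check is meant to provide.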

A new way of searching: asynchronously

You can now submit long-running searches using the new _async_search API. The new API accepts the same parameters and request body as the Search API. However, instead of blocking and returning the final response only when it’s entirely finished, you can retrieve results from an async search as they become available.

The request takes a parameter, wait_for_completion, which controls how long the server waits before it sends back a response. The first response contains, among other things, a unique search ID, a response version, an indication of whether the response is partial, the usual metadata (shards involved, number of hits, and so on), and potentially results. If the response is not complete and final, the client can continue polling for results by issuing a new request with the provided search ID. If new results are available, the returned version is incremented and the new batch of results is returned. This can continue until all the results are fetched.

Unless deleted earlier by the user, the asynchronous searches are kept alive for a given interval. This defaults to 5 days and can be controlled by another request parameter, keep_alive.
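
The submit-then-poll flow described above can be sketched as a small client loop. This is not a real Elasticsearch client: `submit` and `fetch` are hypothetical callables standing in for the HTTP requests, `FakeAsyncSearch` is an in-memory stand-in for the server, and the response keys `id`, `version`, `is_partial`, and `hits` are assumptions modeled on the description rather than a verified response schema.

```python
import time


def poll_async_search(submit, fetch, poll_interval=0.0, max_polls=100):
    """Submit a search, then poll by ID until the response is final."""
    response = submit()
    polls = 0
    while response["is_partial"]:
        if polls >= max_polls:
            raise TimeoutError("search did not complete in time")
        time.sleep(poll_interval)
        response = fetch(response["id"])
        polls += 1
    return response


class FakeAsyncSearch:
    """In-memory stand-in that releases one batch of hits per request."""

    def __init__(self, batches):
        self._batches = batches
        self._version = 0

    def _snapshot(self):
        available = self._batches[: self._version + 1]
        return {
            "id": "search-1",
            "version": self._version,
            "is_partial": self._version < len(self._batches) - 1,
            "hits": [hit for batch in available for hit in batch],
        }

    def submit(self):
        return self._snapshot()

    def fetch(self, search_id):
        # Each poll makes one more batch available and bumps the version.
        self._version = min(self._version + 1, len(self._batches) - 1)
        return self._snapshot()


fake = FakeAsyncSearch([[1, 2], [3], [4, 5]])
final = poll_async_search(fake.submit, fake.fetch)
print(final["version"], final["hits"])
```

In a real client, a nonzero `poll_interval` would avoid hammering the cluster, and a caller that wants incremental results could inspect each intermediate response instead of only the final one.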

Password protection for the keystore

Elasticsearch uses a custom on-disk keystore for secure settings such as passwords and SSL certificates. Up until now, this prevented users with command-line access from viewing secure files by listing commands, but nothing prevented such users from changing values in the keystore, or removing values from it. Furthermore, the values were only obfuscated by a hash; no user-specific secret protected the secure settings.

This new feature changes all of that by adding password protection to the keystore. This is not a breaking change: if a keystore has no password, there won’t be any new prompts. A user must choose to password-protect their keystore in order to benefit from the new behavior.

A new aggregation: top_metrics

The new top_metrics aggregation "selects" a metric from a document according to a criterion on a given, different field. That criterion is currently the largest or smallest "sort" value. It is fairly similar to top_hits in spirit, but because it is more limited, top_metrics uses less memory and is often faster.
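
To make the selection rule concrete, the sketch below builds a top_metrics request body and emulates the same selection locally on a few in-memory documents. The field names `m` and `s`, the aggregation name `tm`, and both helper functions are hypothetical; the local emulation only illustrates the semantics and says nothing about how the aggregation is implemented.

```python
def top_metrics_request(metric_field, sort_field, order="desc"):
    """Build a search body asking for metric_field from the top-sorted document."""
    return {
        "aggs": {
            "tm": {
                "top_metrics": {
                    "metrics": {"field": metric_field},
                    "sort": {sort_field: order},
                }
            }
        }
    }


def emulate_top_metrics(docs, metric_field, sort_field, order="desc"):
    """Pick the metric from the document with the largest or smallest sort value."""
    choose = max if order == "desc" else min
    top = choose(docs, key=lambda doc: doc[sort_field])
    return {"sort": [top[sort_field]], "metrics": {metric_field: top[metric_field]}}


docs = [
    {"s": 1, "m": 3.14},
    {"s": 2, "m": 1.0},
    {"s": 3, "m": 2.72},
]
print(top_metrics_request("m", "s"))
print(emulate_top_metrics(docs, "m", "s", order="desc"))
print(emulate_top_metrics(docs, "m", "s", order="asc"))
```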

Query speed-up for sorted queries on time-based indices

We’ve optimized sorted queries that return only the top documents when they run on time-based indices. The optimization relies on the fact that the ranges of document timestamps in different shards don’t overlap. Shard search requests are rewritten based on the partial results already available from other shards: if the bottom entry of the sorted result set after a partial merge is already better than every value contained in the current shard, the query cannot yield any results from that shard.
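
The skipping rule can be illustrated with a simplified, sequential model. This is a sketch under stated assumptions, not the actual implementation: the `Shard` class and `newest_k` function are hypothetical, each shard is assumed to know its minimum and maximum timestamp up front, and shards are visited one after another rather than concurrently.

```python
import heapq
from dataclasses import dataclass
from typing import List


@dataclass
class Shard:
    min_ts: int
    max_ts: int
    timestamps: List[int]


def newest_k(shards, k):
    """Return the k largest timestamps and the number of shards actually searched."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = []  # min-heap; top[0] is the bottom entry of the current result set
    searched = 0
    for shard in shards:
        # If the result set is full and its bottom entry beats everything
        # this shard could contain, the shard cannot contribute.
        if len(top) == k and shard.max_ts < top[0]:
            continue
        searched += 1
        for ts in shard.timestamps:
            if len(top) < k:
                heapq.heappush(top, ts)
            elif ts > top[0]:
                heapq.heapreplace(top, ts)
    return sorted(top, reverse=True), searched


shards = [
    Shard(300, 302, [300, 301, 302]),
    Shard(200, 202, [200, 201, 202]),
    Shard(100, 102, [100, 101, 102]),
]
print(newest_k(shards, 2))
print(newest_k(list(reversed(shards)), 2))
```

Visiting the newest shard first lets the other two be skipped, while visiting the oldest first gives the same hits but has to search all three. The benefit therefore depends on which partial results are available when each shard request is considered.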

A new aggregation: boxplot

The interquartile range (IQR) is a common robust measure of statistical dispersion. Compared to the standard deviation, the IQR is less sensitive to outliers in the data, with a breakdown point of 0.25. Along with the median, it is often used in creating a box plot, a simple yet common way to summarize data and identify potential outliers.

The new boxplot aggregation calculates the min, max, and median as well as the first and third quartiles of a given data set.
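
For intuition, the same five-number summary can be computed locally with the Python standard library. This sketch uses exact quantiles with the `inclusive` interpolation method, which is an arbitrary choice made here; the aggregation itself may use a different estimator, so its values need not match exactly. The field name `load_time`, the aggregation name `load_time_boxplot`, and both helpers are hypothetical.

```python
import statistics


def boxplot_request(field):
    """Build a search body that requests a boxplot summary of one field."""
    return {"aggs": {"load_time_boxplot": {"boxplot": {"field": field}}}}


def boxplot_summary(values):
    """Compute min, max, and the three quartiles of a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "max": max(values), "q1": q1, "q2": q2, "q3": q3}


summary = boxplot_summary([1, 2, 3, 4, 5])
iqr = summary["q3"] - summary["q1"]  # interquartile range
print(summary, iqr)
print(boxplot_request("load_time"))
```

Here `q2` is the median, and the interquartile range is simply the difference between the third and first quartiles.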

AArch64 support

Elasticsearch now provides AArch64 packaging, including a bundled AArch64 JDK distribution. Some restrictions apply: machine learning is not supported, and class data sharing is disabled depending on the underlying page size.