What’s new in 7.10

edit

Here are the highlights of what’s new and improved in Elasticsearch 7.10! For detailed information about this release, see the Release notes and Breaking changes.

Other versions: 7.9 | 7.8 | 7.7 | 7.6 | 7.5 | 7.4 | 7.3 | 7.2 | 7.1 | 7.0

Indexing speed improvement

edit

Elasticsearch 7.10 improves indexing speed by up to 20%. We’ve reduced the coordination needed to add entries to the transaction log. This reduction allows for more concurrency and increases the transaction log buffer size from 8KB to 1MB. However, performance gains are lower for full-text search and other analysis-intensive use cases. The heavier the indexing chain, the lower the gains, so indexing chains that involve many fields, ingest pipelines or full-text indexing will see lower gains.

More space-efficient indices

edit

Elasticsearch 7.10 depends on Apache Lucene 8.7, which introduces higher compression of stored fields, the part of the index that notably stores the _source. On the various data sets that we benchmark against, we noticed space reductions between 0% and 10%. This change especially helps on data sets that have lots of redundant data across documents, which is typically the case of the documents that are produced by our Observability solutions, which repeat metadata about the host that produced the data on every document.

Elasticsearch offers the ability to configure the index.codec setting to tell Elasticsearch how aggressively to compress stored fields. Both supported values default and best_compression will get better compression with this change.

Data tiers

edit

7.10 introduces the concept of formalized data tiers within Elasticsearch. Data tiers are a simple, integrated approach that gives users control over optimizing for cost, performance, and breadth/depth of data. Prior to this formalization, many users configured their own tier topology using custom node attributes as well as using ILM to manage the lifecycle and location of data within a cluster.

With this formalization, data tiers (content, hot, warm, and cold) can be explicitly configured using node roles, and indices can be configured to be allocated within a specific tier using index-level data tier allocation filtering. ILM will make use of these tiers to automatically migrate data between nodes as an index goes through the phases of its lifecycle.

Newly created indices abstracted by a data stream will be allocated to the data_hot tier automatically, while standalone indices will be allocated to the data_content tier automatically. Nodes with the pre-existing data role are considered to be part of all tiers.

AUC ROC evaluation metrics for classification analysis

edit

Area under the curve of receiver operating characteristic (AUC ROC) is an evaluation metric that has been available for outlier detection since 7.3 and now is available for classification analysis. AUC ROC represents the performance of the classification process at different predicted probability thresholds. The true positive rate for a specific class is compared against the rate of all the other classes combined at the different threshold levels to create the curve.

Custom feature processors in data frame analytics

edit

Feature processors enable you to extract process features from document fields. You can use these features in model training and model deployment. Custom feature processors provide a mechanism to create features that can be used at search and ingest time and they don’t take up space in the index. This process more tightly couples feature generation with the resulting model. The result is simplified model management as both the features and the model can easily follow the same life cycle.

Points in time (PITs) for search

edit

In 7.10, we’re introducing points in time (PITs), a lightweight way to preserve index state over searches. PITs improve end-user experience by making UIs more reactive.

By default, a search request waits for complete results before returning a response. For example, a search that retrieves top hits and aggregations returns a response only after both top hits and aggregations are computed. However, aggregations are usually slower and more expensive to compute than top hits. Instead of sending a combined request, you can send two separate requests: one for top hits and another one for aggregations. With separate search requests, a UI can display top hits as soon as they’re available and display aggregation data after the slower aggregation request completes. You can use a PIT to ensure both search requests run on the same data and index state.

To use a PIT in a search, you must first explicitly create the PIT using the new open PIT API. PITs get automatically garbage-collected after keep_alive if no follow-up request extends their duration.

POST /my-index-000001/_pit?keep_alive=1m

The API returns a PIT ID you can use in search requests. You can also configure by how long to extend your PIT’s lifespan using the search request’s keep_alive parameter.

POST /_search
{
    "size": 100,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "pit": {
	    "id":  "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWICBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==",
	    "keep_alive": "1m"
    }
}

PITs automatically close when their keep_alive period ends. You can also manually close PITs you no longer need using the close PIT API. Closing a PIT releases the resources needed to maintain the PIT’s index state.

DELETE /_pit
{
    "id" : "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWIBBXV1aWQyAAA="
}

For more information about using PITs in search, see Paginate search results with search_after or the PIT API documentation.

Request-level circuit breakers on coordinating nodes

edit

You can now use a coordinating node to account for memory used to perform partial and final reduce of aggregations in the request circuit breaker. The search coordinator adds the memory that it used to save and reduce the results of shard aggregations in the request circuit breaker. Before any partial or final reduce, the memory needed to reduce the aggregations is estimated and a CircuitBreakingException is thrown if exceeds the maximum memory allowed in this breaker.

This size is estimated as roughly 1.5 times the size of the serialized aggregations that need to be reduced. This estimation can be completely off for some aggregations but it is corrected with the real size after the reduce completes. If the reduce is successful, we update the circuit breaker to remove the size of the source aggregations and replace the estimation with the serialized size of the newly reduced result.

EQL: Case-sensitivity and the : operator

edit

In 7.10, we made most EQL operators and functions case-sensitive by default. We’ve also added :, a new case-insensitive equal operator. Designed for security use cases, you can use the : operator to search for strings in Windows event logs and other event data containing a mix of letter cases.

GET /my-index-000001/_eql/search
{
  "query": """
    process where process.executable : "c:\\\\windows\\\\system32\\\\cmd.exe"
  """
}

For more information, see the EQL syntax documentation.

REST API access to system indices is deprecated

edit

We are deprecating REST API access to system indices. Most REST API requests that attempt to access system indices will return the following deprecation warning:

this request accesses system indices: [.system_index_name], but in a future
major version, direct access to system indices will be prevented by default

The following REST API endpoints access system indices as part of their implementation and will not return the deprecation warning:

  • GET _cluster/health
  • GET {index}/_recovery
  • GET _cluster/allocation/explain
  • GET _cluster/state
  • POST _cluster/reroute
  • GET {index}/_stats
  • GET {index}/_segments
  • GET {index}/_shard_stores
  • GET _cat/[indices,aliases,health,recovery,shards,segments]

We are also adding a new metadata flag to track indices. Elasticsearch will automatically add this flag to any existing system indices during upgrade.

New thread pools for system indices

edit

We’ve added two new thread pools for system indices: system_read and system_write. These thread pools ensure system indices critical to the Elastic Stack, such as those used by security or Kibana, remain responsive when a cluster is under heavy query or indexing load.

system_read is a fixed thread pool used to manage resources for read operations targeting system indices. Similarly, system_write is a fixed thread pool used to manage resources for write operations targeting system indices. Both have a maximum number of threads equal to 5 or half of the available processors, whichever is smaller.