What are the current options to migrate from OpenSearch to Elasticsearch®?
OpenSearch is a fork of Elasticsearch 7.10 that has diverged significantly from Elasticsearch since then, resulting in a different feature set and different performance characteristics, as this benchmark shows (hint: OpenSearch is currently much slower than Elasticsearch).
Given the differences between the two solutions, restoring a snapshot from OpenSearch into Elasticsearch is not possible, nor is reindex-from-remote, so our only option is to use something in between that reads from OpenSearch and writes to Elasticsearch.
This blog will show you how easy it is to migrate from OpenSearch to Elasticsearch for better performance and less disk usage!
1 billion log lines
We are going to use part of the data set we used for the benchmark, which takes about half a terabyte on disk, including replicas, and spans one week (January 1–7, 2023).
We have in total 1,009,165,775 documents that take 453.5GB of space in OpenSearch, including the replicas. That's 241.2 bytes per document in primary storage. This is going to be important later when we enable a couple of optimizations in Elasticsearch that will bring this total size way down without sacrificing performance!
This billion-log-line data set is spread over nine indices that are part of a datastream we are calling logs-myapplication-prod. The primary shards are about 25GB in size, following the best practices for optimal shard sizing. A GET _cat/indices shows us the indices we are dealing with:
index docs.count pri rep pri.store.size store.size
.ds-logs-myapplication-prod-000049 102519334 1 1 22.1gb 44.2gb
.ds-logs-myapplication-prod-000048 114273539 1 1 26.1gb 52.3gb
.ds-logs-myapplication-prod-000044 111093596 1 1 25.4gb 50.8gb
.ds-logs-myapplication-prod-000043 113821016 1 1 25.7gb 51.5gb
.ds-logs-myapplication-prod-000042 113859174 1 1 24.8gb 49.7gb
.ds-logs-myapplication-prod-000041 112400019 1 1 25.7gb 51.4gb
.ds-logs-myapplication-prod-000040 113362823 1 1 25.9gb 51.9gb
.ds-logs-myapplication-prod-000038 110994116 1 1 25.3gb 50.7gb
.ds-logs-myapplication-prod-000037 116842158 1 1 25.4gb 50.8gb
Both OpenSearch and Elasticsearch clusters have the same configuration: 3 nodes with 64GB RAM and 12 CPU cores. Just like in the benchmark, the clusters are running in Kubernetes.
Moving data from A to B
Typically, moving data from one Elasticsearch cluster to another is as easy as a snapshot and restore if the clusters are on compatible versions, or a reindex from remote if you need real-time synchronization and minimal downtime. Neither method applies when migrating data from OpenSearch to Elasticsearch because the projects have significantly diverged since the 7.10 fork. However, there is one method that will work: scrolling.
Scrolling
Scrolling involves using an external tool, such as Logstash®, to read data from the source cluster and write it to the destination cluster. This method provides a high degree of customization, allowing us to transform the data during the migration if needed. Here are a couple of advantages of using Logstash:
- Easy parallelization: It’s really easy to write concurrent jobs that can read from different “slices” of the indices, essentially maximizing our throughput.
- Queuing: Logstash automatically queues documents before sending.
- Automatic retries: If a failure or an error occurs during data transmission, Logstash will automatically attempt to resend the data. It will also back off, querying the source cluster less often until the connection is re-established, all without manual intervention.
Scrolling allows us to perform an initial search and keep pulling batches of results from the cluster until there are none left, similar to how a cursor works in relational databases.
A scrolled search takes a snapshot in time by freezing the segments that make up the index at the moment the request is made, preventing those segments from merging. As a result, the scroll doesn't see any changes made to the index after the initial search request.
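The loop behind a scrolled search can be sketched in a few lines of Python. This is a minimal illustration of the pattern only, assuming a client object with elasticsearch-py-style search, scroll, and clear_scroll methods; the actual migration uses Logstash rather than hand-rolled code like this:

```python
def scroll_all(client, index, page_size=500):
    """Yield every hit from `index`, page by page, using the scroll pattern."""
    page = client.search(index=index, size=page_size, scroll="5m")
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        # Request the next page; each call must arrive within the 5m keep-alive.
        page = client.scroll(scroll_id=page["_scroll_id"], scroll="5m")
    # Release the search context once we have exhausted the results.
    client.clear_scroll(scroll_id=page["_scroll_id"])
```

The caller simply iterates until the generator is exhausted, which mirrors how a database cursor is consumed.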
Migration strategies
Reading from A and writing to B can be slow without optimization because it involves paginating through the results and transferring each batch over the network to Logstash, which assembles the documents into another batch and transfers those batches over the network again to Elasticsearch, where the documents are finally indexed. With such a large data set, we must be very efficient and extract every bit of performance where we can.
Let’s start with the facts — what do we know about the data we need to transfer? We have nine indices in the datastream, each with about 100 million documents. Let’s test with just one of the indices and measure the indexing rate to see how long it takes to migrate. The indexing rate can be seen by activating the monitoring functionality in Elastic® and then navigating to the index you want to inspect.
Scrolling in the deep
The simplest approach for transferring the log lines would be to scroll over the entire data set in one go and check back when it finishes. Here we introduce our first two variables: PAGE_SIZE and BATCH_SIZE. The former is how many records we bring from the source every time we query it, and the latter is how many documents Logstash assembles together and writes to the destination index.
With such a large data set, the scroll slows down as the pagination gets deeper: the indexing rate starts at 6,000 docs/second and steadily drops to 700 docs/second. Without any optimization, it would take us 19 days (!) to migrate the 1 billion documents. We can do better than that!
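The 19-day figure is simple arithmetic. Assuming the effective average rate settles around 600 docs/second once deep pagination kicks in (an illustrative estimate, not a measured value):

```python
total_docs = 1_009_165_775
avg_rate = 600                           # docs/sec, assumed sustained average
days = total_docs / avg_rate / 86_400    # 86,400 seconds per day
print(f"about {days:.0f} days")          # about 19 days
```

Even doubling that rate would still leave us with more than a week of migration time, which is why the optimizations below matter.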
Slice me nice
We can optimize scrolling by using an approach called sliced scroll, where we split the index into different slices and consume them independently.
Here we will introduce our last two variables: SLICES and WORKERS. The number of slices cannot be too small, as performance decreases drastically over time, and it cannot be too big, as the overhead of maintaining the scrolls would cancel out the benefits of smaller searches.
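For reference, here is roughly what a single slice's initial request looks like in Dev Tools syntax. The Logstash opensearch input issues these requests for us, so this is only to illustrate the mechanism; the index name and sizes mirror the parameters used above:

```
GET .ds-logs-myapplication-prod-000037/_search?scroll=5m
{
  "slice": {
    "id": 0,
    "max": 10
  },
  "size": 500,
  "query": {
    "match_all": {}
  }
}
```

Each of the max slices (ids 0 through 9 here) can be scrolled independently and in parallel.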
Let’s start by migrating a single index (out of the nine we have) with different parameters to see what combination gives us the highest throughput.
SLICES | PAGE_SIZE | WORKERS | BATCH_SIZE | Average Indexing Rate |
3 | 500 | 3 | 500 | 13,319 docs/sec |
3 | 1,000 | 3 | 1,000 | 13,048 docs/sec |
4 | 250 | 4 | 250 | 10,199 docs/sec |
4 | 500 | 4 | 500 | 12,692 docs/sec |
4 | 1,000 | 4 | 1,000 | 10,900 docs/sec |
5 | 500 | 5 | 500 | 12,647 docs/sec |
5 | 1,000 | 5 | 1,000 | 10,334 docs/sec |
5 | 2,000 | 5 | 2,000 | 10,405 docs/sec |
10 | 250 | 10 | 250 | 14,083 docs/sec |
10 | 250 | 4 | 1,000 | 12,014 docs/sec |
10 | 500 | 4 | 1,000 | 10,956 docs/sec |
It looks like we have a good set of candidates for maximizing the throughput of a single index, between 12K and 14K documents per second. That doesn't mean we have reached our ceiling. Even though search operations are single-threaded and every slice triggers sequential search operations to read the data, nothing prevents us from reading several indices in parallel.
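Since each index migration is independent, the fan-out across indices is trivial to script. Here is a toy sketch of the idea; migrate_index is a placeholder name, since in practice each job is a Logstash pipeline (or a Pod, as deployed later in this post):

```python
from concurrent.futures import ThreadPoolExecutor

# The nine backing indices of the source datastream.
indices = [f".ds-logs-myapplication-prod-0000{n}"
           for n in (37, 38, 40, 41, 42, 43, 44, 48, 49)]

def migrate_index(name):
    # Placeholder: in practice this would launch a Logstash job for `name`.
    return (name, "done")

# One worker per index, so all nine indices are read in parallel.
with ThreadPoolExecutor(max_workers=len(indices)) as pool:
    results = dict(pool.map(migrate_index, indices))
```

With nine indices running in parallel at 12K–14K docs/second each, the aggregate throughput is what brings the total migration time down to hours.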
By default, the maximum number of open scrolls is 500. This limit can be updated with the search.max_open_scroll_context cluster setting, but the default value is enough for this particular migration.
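If you ever do hit that limit (for example, with many slices across many indices at once), it can be raised with a cluster settings update like this (1,000 is an arbitrary illustrative value):

```
PUT _cluster/settings
{
  "persistent": {
    "search.max_open_scroll_context": 1000
  }
}
```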
Let’s migrate
Preparing our destination indices
We are going to create a datastream called logs-myapplication-reindex to write the data to, but before indexing any data, let’s ensure our index template and index lifecycle management configurations are properly set up. An index template acts as a blueprint for creating new indices, allowing you to define various settings that should be applied consistently across your indices.
Index lifecycle management policy
Index lifecycle management (ILM) is equally vital, as it automates the management of indices throughout their lifecycle. With ILM, you can define policies that determine how long data should be retained, when it should be rolled over into new indices, and when old indices should be deleted or archived. Our policy is really straightforward:
PUT _ilm/policy/logs-myapplication-lifecycle-policy
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_primary_shard_size": "25gb"
}
}
},
"warm": {
"min_age": "0d",
"actions": {
"forcemerge": {
"max_num_segments": 1
}
}
}
}
}
}
Index template (and saving 23% in disk space)
While we are here, we're going to go ahead and enable synthetic _source, a clever feature that discards the original JSON document after indexing while still being able to reconstruct it when needed from the stored fields.
For our example, enabling synthetic _source resulted in a remarkable 23.4% improvement in storage efficiency, reducing the average storage required per document from 241.2 bytes in OpenSearch to just 185 bytes in Elasticsearch.
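The percentage follows directly from those per-document figures. A quick sanity check, using the rounded averages (which come from the size comparison table at the end of this post):

```python
# Average primary storage per document, before and after the migration.
opensearch_per_doc = 241.2      # bytes per document in OpenSearch
elasticsearch_per_doc = 184.8   # bytes per document in Elasticsearch (~185)

saving = 1 - elasticsearch_per_doc / opensearch_per_doc
print(f"{saving:.1%}")  # 23.4%
```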
Our full index template is therefore:
PUT _index_template/logs-myapplication-reindex
{
"index_patterns": [
"logs-myapplication-reindex"
],
"priority": 500,
"data_stream": {},
"template": {
"settings": {
"index": {
"lifecycle.name": "logs-myapplication-lifecycle-policy",
"codec": "best_compression",
"number_of_shards": "1",
"number_of_replicas": "1",
"query": {
"default_field": [
"message"
]
}
}
},
"mappings": {
"_source": {
"mode": "synthetic"
},
"_data_stream_timestamp": {
"enabled": true
},
"date_detection": false,
"properties": {
"@timestamp": {
"type": "date"
},
"agent": {
"properties": {
"ephemeral_id": {
"type": "keyword",
"ignore_above": 1024
},
"id": {
"type": "keyword",
"ignore_above": 1024
},
"name": {
"type": "keyword",
"ignore_above": 1024
},
"type": {
"type": "keyword",
"ignore_above": 1024
},
"version": {
"type": "keyword",
"ignore_above": 1024
}
}
},
"aws": {
"properties": {
"cloudwatch": {
"properties": {
"ingestion_time": {
"type": "keyword",
"ignore_above": 1024
},
"log_group": {
"type": "keyword",
"ignore_above": 1024
},
"log_stream": {
"type": "keyword",
"ignore_above": 1024
}
}
}
}
},
"cloud": {
"properties": {
"region": {
"type": "keyword",
"ignore_above": 1024
}
}
},
"data_stream": {
"properties": {
"dataset": {
"type": "keyword",
"ignore_above": 1024
},
"namespace": {
"type": "keyword",
"ignore_above": 1024
},
"type": {
"type": "keyword",
"ignore_above": 1024
}
}
},
"ecs": {
"properties": {
"version": {
"type": "keyword",
"ignore_above": 1024
}
}
},
"event": {
"properties": {
"dataset": {
"type": "keyword",
"ignore_above": 1024
},
"id": {
"type": "keyword",
"ignore_above": 1024
},
"ingested": {
"type": "date"
}
}
},
"host": {
"type": "object"
},
"input": {
"properties": {
"type": {
"type": "keyword",
"ignore_above": 1024
}
}
},
"log": {
"properties": {
"file": {
"properties": {
"path": {
"type": "keyword",
"ignore_above": 1024
}
}
}
}
},
"message": {
"type": "match_only_text"
},
"meta": {
"properties": {
"file": {
"type": "keyword",
"ignore_above": 1024
}
}
},
"metrics": {
"properties": {
"size": {
"type": "long"
},
"tmin": {
"type": "long"
}
}
},
"process": {
"properties": {
"name": {
"type": "keyword",
"ignore_above": 1024
}
}
},
"tags": {
"type": "keyword",
"ignore_above": 1024
}
}
}
}
}
Building a custom Logstash image
We are going to use a containerized Logstash for this migration because both clusters sit on Kubernetes infrastructure, so it's easier to just spin up a Pod that can communicate with both clusters.
Since OpenSearch is not an official Logstash input, we must build a custom Logstash image that contains the logstash-input-opensearch plugin. Let’s use the base image from docker.elastic.co/logstash/logstash:8.15.3 and just install the plugin:
FROM docker.elastic.co/logstash/logstash:8.15.3
USER logstash
WORKDIR /usr/share/logstash
RUN bin/logstash-plugin install logstash-input-opensearch
Writing a Logstash pipeline
Now that we have our Logstash Docker image, we need to write a pipeline that reads from OpenSearch and writes to Elasticsearch.
The input
input {
opensearch {
hosts => ["os-cluster:9200"]
ssl => true
ca_file => "/etc/logstash/certificates/opensearch-ca.crt"
user => "${OPENSEARCH_USERNAME}"
password => "${OPENSEARCH_PASSWORD}"
index => "${SOURCE_INDEX_NAME}"
slices => "${SOURCE_SLICES}"
size => "${SOURCE_PAGE_SIZE}"
scroll => "5m"
docinfo => true
docinfo_target => "[@metadata][doc]"
}
}
Let’s break down the most important input parameters. The values are all represented as environment variables here:
- hosts: Specifies the host and port of the OpenSearch cluster. In this case, it’s connecting to “os-cluster” on port 9200.
- index: Specifies the index in the OpenSearch cluster from which to retrieve logs. In this case, it’s “logs-myapplication-prod” which is a datastream that contains the actual indices (e.g., .ds-logs-myapplication-prod-000049).
- size: Specifies the maximum number of logs to retrieve in each request.
- scroll: Defines how long a search context will be kept open on the OpenSearch server. In this case, it's set to "5m," which means each new "page" must be requested within five minutes of the previous one, or the search context expires.
- docinfo and docinfo_target: These settings control whether document metadata should be included in the Logstash output and where it should be stored. In this case, document metadata is being stored in the [@metadata][doc] field — this is important because the document’s _id will be used as the destination id as well.
The ssl and ca_file options are highly recommended if you are migrating between clusters in different infrastructures (e.g., separate cloud providers). You don't need to specify a ca_file if your TLS certificates are signed by a public authority, which is likely the case if you are using a SaaS offering and your endpoint is reachable over the internet; in that case, ssl => true alone would suffice. In our case, all our TLS certificates are self-signed, so we must also provide the Certificate Authority (CA) certificate.
The (optional) filter
We could use a filter to drop or alter the documents before they are written to Elasticsearch, but we are not going to, as we want to migrate the documents as-is. We only remove the extra metadata fields that Logstash adds to all documents, such as "@version" and "host". We also remove the original "data_stream" field, as it contains the source datastream name, which might not be the same in the destination.
filter {
mutate {
remove_field => ["@version", "host", "data_stream"]
}
}
The output
The output is really simple: we are going to name our datastream logs-myapplication-reindex, and we use the document id of the original documents as document_id to ensure there are no duplicate documents. In Elasticsearch, datastream names follow the convention <type>-<dataset>-<namespace>, so our logs-myapplication-reindex datastream has "myapplication" as the dataset and "reindex" as the namespace.
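The naming convention is easy to see in a hypothetical one-liner (illustrative only, not part of the pipeline), matching the namespace used in the deployed pipeline configuration:

```python
# Data stream names follow <type>-<dataset>-<namespace>.
ds_type, dataset, namespace = "logs", "myapplication", "reindex"
name = f"{ds_type}-{dataset}-{namespace}"
print(name)  # logs-myapplication-reindex
```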
elasticsearch {
hosts => "${ELASTICSEARCH_HOST}"
user => "${ELASTICSEARCH_USERNAME}"
password => "${ELASTICSEARCH_PASSWORD}"
document_id => "%{[@metadata][doc][_id]}"
data_stream => "true"
data_stream_type => "logs"
data_stream_dataset => "myapplication"
data_stream_namespace => "reindex"
}
Deploying Logstash
We have a few options to deploy Logstash: it can be run locally from the command line, as a systemd service, via Docker, or on Kubernetes.
Since both of our clusters are deployed in a Kubernetes environment, we are going to deploy Logstash as a Pod referencing our Docker image created earlier. Let’s put our pipeline inside a ConfigMap along with some configuration files (pipelines.yml and config.yml).
In the configuration below, SOURCE_INDEX_NAME, SOURCE_SLICES, SOURCE_PAGE_SIZE, LOGSTASH_WORKERS, and LOGSTASH_BATCH_SIZE are conveniently exposed as environment variables, so you just need to fill them out.
apiVersion: v1
kind: Pod
metadata:
name: logstash-1
spec:
containers:
- name: logstash
image: ugosan/logstash-opensearch-input:8.15.3
imagePullPolicy: Always
env:
- name: SOURCE_INDEX_NAME
value: ".ds-logs-myapplication-prod-000037"
- name: SOURCE_SLICES
value: "10"
- name: SOURCE_PAGE_SIZE
value: "500"
- name: LOGSTASH_WORKERS
value: "4"
- name: LOGSTASH_BATCH_SIZE
value: "1000"
- name: OPENSEARCH_USERNAME
valueFrom:
secretKeyRef:
name: os-cluster-admin-password
key: username
- name: OPENSEARCH_PASSWORD
valueFrom:
secretKeyRef:
name: os-cluster-admin-password
key: password
- name: ELASTICSEARCH_USERNAME
value: "elastic"
- name: ELASTICSEARCH_PASSWORD
valueFrom:
secretKeyRef:
name: es-cluster-es-elastic-user
key: elastic
resources:
limits:
memory: "4Gi"
cpu: "2500m"
requests:
memory: "1Gi"
cpu: "300m"
volumeMounts:
- name: config-volume
mountPath: /usr/share/logstash/config
- name: etc
mountPath: /etc/logstash
readOnly: true
volumes:
- name: config-volume
projected:
sources:
- configMap:
name: logstash-configmap
items:
- key: pipelines.yml
path: pipelines.yml
- key: logstash.yml
path: logstash.yml
- name: etc
projected:
sources:
- configMap:
name: logstash-configmap
items:
- key: pipeline.conf
path: pipelines/pipeline.conf
- secret:
name: os-cluster-http-cert
items:
- key: ca.crt
path: certificates/opensearch-ca.crt
- secret:
name: es-cluster-es-http-ca-internal
items:
- key: tls.crt
path: certificates/elasticsearch-ca.crt
---
apiVersion: v1
kind: ConfigMap
metadata:
name: logstash-configmap
data:
pipelines.yml: |
- pipeline.id: reindex-os-es
path.config: "/etc/logstash/pipelines/pipeline.conf"
pipeline.batch.size: ${LOGSTASH_BATCH_SIZE}
pipeline.workers: ${LOGSTASH_WORKERS}
logstash.yml: |
log.level: info
pipeline.unsafe_shutdown: true
pipeline.ordered: false
pipeline.conf: |
input {
opensearch {
hosts => ["os-cluster:9200"]
ssl => true
ca_file => "/etc/logstash/certificates/opensearch-ca.crt"
user => "${OPENSEARCH_USERNAME}"
password => "${OPENSEARCH_PASSWORD}"
index => "${SOURCE_INDEX_NAME}"
slices => "${SOURCE_SLICES}"
size => "${SOURCE_PAGE_SIZE}"
scroll => "5m"
docinfo => true
docinfo_target => "[@metadata][doc]"
}
}
filter {
mutate {
remove_field => ["@version", "host", "data_stream"]
}
}
output {
elasticsearch {
hosts => "https://es-cluster-es-http:9200"
ssl => true
ssl_certificate_authorities => ["/etc/logstash/certificates/elasticsearch-ca.crt"]
ssl_verification_mode => "full"
user => "${ELASTICSEARCH_USERNAME}"
password => "${ELASTICSEARCH_PASSWORD}"
document_id => "%{[@metadata][doc][_id]}"
data_stream => "true"
data_stream_type => "logs"
data_stream_dataset => "myapplication"
data_stream_namespace => "reindex"
}
}
That’s it.
After a couple of hours, we successfully migrated 1 billion documents from OpenSearch to Elasticsearch and even saved over 23% in disk storage! Now that we have the logs in Elasticsearch, how about extracting actual business value from them? Logs contain so much valuable information: with AIOps we can automatically categorize them, and we can also extract business metrics and detect anomalies from them. Give it a try.
OpenSearch index | docs | size (bytes) | Elasticsearch index | docs | size (bytes) | Diff.
.ds-logs-myapplication-prod-000037 | 116842158 | 27285520870 | logs-myapplication-reindex-000037 | 116842158 | 21998435329 | 21.46% |
.ds-logs-myapplication-prod-000038 | 110994116 | 27263291740 | logs-myapplication-reindex-000038 | 110994116 | 21540011082 | 23.45% |
.ds-logs-myapplication-prod-000040 | 113362823 | 27872438186 | logs-myapplication-reindex-000040 | 113362823 | 22234641932 | 22.50% |
.ds-logs-myapplication-prod-000041 | 112400019 | 27618801653 | logs-myapplication-reindex-000041 | 112400019 | 22059453868 | 22.38% |
.ds-logs-myapplication-prod-000042 | 113859174 | 26686723701 | logs-myapplication-reindex-000042 | 113859174 | 21093766108 | 23.41% |
.ds-logs-myapplication-prod-000043 | 113821016 | 27657006598 | logs-myapplication-reindex-000043 | 113821016 | 22059454752 | 22.52% |
.ds-logs-myapplication-prod-000044 | 111093596 | 27281936915 | logs-myapplication-reindex-000044 | 111093596 | 21559513422 | 23.43% |
.ds-logs-myapplication-prod-000048 | 114273539 | 28111420495 | logs-myapplication-reindex-000048 | 114273539 | 22264398939 | 23.21% |
.ds-logs-myapplication-prod-000049 | 102519334 | 23731274338 | logs-myapplication-reindex-000049 | 102519334 | 19307250001 | 20.56% |
Interested in trying Elasticsearch? Start our 14-day free trial.
The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.