Preload Elasticsearch with your data set
Recently, I got a question on the discussion forum about how to modify the official Docker image to provide a ready-to-use Elasticsearch® cluster that already contains data.
Honestly, this is not ideal because you'd have to hack the way the Elasticsearch service starts by providing a forked version of the entrypoint.sh script. Unfortunately, this would make your life harder when it comes to maintenance and upgrades. Instead, I found it better to use other solutions to achieve the same goal.
Setting up the problem
To start with this idea, we will consider that we are using the Elasticsearch Docker image and following the documentation:
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.7.0
docker network create elastic
docker run --name es01 --net elastic -p 9200:9200 -it docker.elastic.co/elasticsearch/elasticsearch:8.7.0
Note that we are not mounting the data dir here, so the data for this cluster will be ephemeral and will disappear once the node shuts down.
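In the rest of this post, I will use CHANGEME as the elastic password. If you want to do the same instead of copying the generated password from the logs, you can reset it once the node is up, for example with the elasticsearch-reset-password tool that ships with the image (type CHANGEME when prompted):
docker exec -it es01 /usr/share/elasticsearch/bin/elasticsearch-reset-password -u elastic -i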
After it starts, we can use the elastic password to check that it's running properly:
curl -s -k -u elastic:CHANGEME https://localhost:9200 | jq
This gives:
{
  "name": "697bf734a5d5",
  "cluster_name": "docker-cluster",
  "cluster_uuid": "cMISiT__RSWkoKDYql1g4g",
  "version": {
    "number": "8.7.0",
    "build_flavor": "default",
    "build_type": "docker",
    "build_hash": "09520b59b6bc1057340b55750186466ea715e30e",
    "build_date": "2023-03-27T16:31:09.816451435Z",
    "build_snapshot": false,
    "lucene_version": "9.5.0",
    "minimum_wire_compatibility_version": "7.17.0",
    "minimum_index_compatibility_version": "7.0.0"
  },
  "tagline": "You Know, for Search"
}
So, we want to have a data set already available. Let's take the sample data set I often use while demoing Elasticsearch: the person data set. I wrote a generator that produces fake person documents.
First, let's download the injector:
wget https://repo1.maven.org/maven2/fr/pilato/elasticsearch/injector/injector/8.7/injector-8.7.jar
Then, we will generate our data set on disk using the following options:
mkdir data
java -jar injector-8.7.jar --console --silent > data/persons.json
We have 1,000,000 JSON documents, and the data set should look like this:
head -2 data/persons.json
{"name":"Charlene Mickael","dateofbirth":"2000-11-01","gender":"female","children":3,"marketing":{"cars":1236,"shoes":null,"toys":null,"fashion":null,"music":null,"garden":null,"electronic":null,"hifi":1775,"food":null},"address":{"country":"Italy","zipcode":"80100","city":"Ischia","countrycode":"IT","location":{"lon":13.935138341699972,"lat":40.71842684204817}}}
{"name":"Kim Hania","dateofbirth":"1998-05-18","gender":"male","children":4,"marketing":{"cars":null,"shoes":null,"toys":132,"fashion":null,"music":null,"garden":null,"electronic":null,"hifi":null,"food":null},"address":{"country":"Germany","zipcode":"9998","city":"Berlin","countrycode":"DE","location":{"lon":13.164834451298645,"lat":52.604673827377155}}}
Using a shell script
Here, we have 1m documents, so we cannot really send them as-is in a single bulk request. Instead, we need to split the file into smaller chunks, turn each chunk into a bulk request, and send those requests to Elasticsearch one by one:
#!/bin/bash
ELASTIC_PASSWORD=CHANGEME
mkdir -p tmp

echo "Splitting the source into chunks of 10000 documents"
split -d -l10000 ../data/persons.json tmp/part

BULK_REQUEST_FILE="tmp/bulk_request.ndjson"
FILES="tmp/part*"
for f in $FILES
do
  rm -f "$BULK_REQUEST_FILE"
  echo "Preparing $f file..."
  # Prefix every document with a bulk "index" action line
  while read -r p; do
    echo '{"index":{}}' >> "$BULK_REQUEST_FILE"
    echo "$p" >> "$BULK_REQUEST_FILE"
  done <"$f"
  echo "Calling Elasticsearch Bulk API"
  curl -XPOST -s -k -u elastic:$ELASTIC_PASSWORD https://localhost:9200/person/_bulk \
    -H 'Content-Type: application/json' \
    --data-binary "@$BULK_REQUEST_FILE" \
    | jq '"Bulk executed in \(.took) ms with errors=\(.errors)"'
done
This basically prints:
Preparing tmp/part00 file...
Calling Elasticsearch Bulk API
"Bulk executed in 1673 ms with errors=false"
Preparing tmp/part01 file...
Calling Elasticsearch Bulk API
"Bulk executed in 712 ms with errors=false"
...
Preparing tmp/part99 file...
Calling Elasticsearch Bulk API
"Bulk executed in 366 ms with errors=false"
On my machine, it took more than eight minutes to run, with most of the time spent writing the bulk request files. There's probably a lot of room for improvement, but I must confess that I'm not that good at writing shell scripts. Ha! You guessed that already, huh? For example, the line-by-line loop could be replaced by a single awk pass per chunk, as sketched below.
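Here is a hedged variation of the same script built on that idea. It keeps the same assumptions as before (CHANGEME as the elastic password, persons.json in ../data) and simply lets awk prepend the bulk action line to every document:
#!/bin/bash
ELASTIC_PASSWORD=CHANGEME
mkdir -p tmp
split -d -l10000 ../data/persons.json tmp/part
for f in tmp/part*
do
  # Prepend a bulk "index" action line to every document in one pass
  awk '{print "{\"index\":{}}"; print}' "$f" > tmp/bulk_request.ndjson
  curl -XPOST -s -k -u elastic:$ELASTIC_PASSWORD https://localhost:9200/person/_bulk \
    -H 'Content-Type: application/json' \
    --data-binary "@tmp/bulk_request.ndjson" \
    | jq '"Bulk executed in \(.took) ms with errors=\(.errors)"'
done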
Using Logstash
Logstash® can do a similar job to what we have done manually, but it can also provide many more features, such as error handling and monitoring. Plus, we don't even need to write the code. We will be using Docker again here:
docker pull docker.elastic.co/logstash/logstash:8.7.0
Let's write a job for this:
input {
  file {
    path => "/usr/share/logstash/persons/persons.json"
    mode => "read"
    codec => json { }
    exit_after_read => true
  }
}
filter {
  mutate {
    remove_field => [ "log", "@timestamp", "event", "@version" ]
  }
}
output {
  elasticsearch {
    hosts => "${ELASTICSEARCH_URL}"
    index => "person"
    user => "elastic"
    password => "${ELASTIC_PASSWORD}"
    ssl_certificate_verification => false
  }
}
We can now run the job:
docker run --rm -it --name ls01 --net elastic \
-v $(pwd)/../data/:/usr/share/logstash/persons/:ro \
-v $(pwd)/pipeline/:/usr/share/logstash/pipeline/:ro \
-e XPACK_MONITORING_ENABLED=false \
-e ELASTICSEARCH_URL="https://es01:9200" \
-e ELASTIC_PASSWORD="CHANGEME" \
docker.elastic.co/logstash/logstash:8.7.0
On my machine, it took less than two minutes to run it.
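Once the job has finished, you can verify that the person index holds the expected number of documents (the count should be 1000000):
curl -s -k -u elastic:CHANGEME https://localhost:9200/person/_count | jq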
Using docker compose
Instead of running everything manually, you can easily use the docker compose command to start everything as needed and provide your users with a ready-to-use cluster.
Here is a simple .env file:
ELASTIC_PASSWORD=CHANGEME
STACK_VERSION=8.7.0
ES_PORT=9200
And the docker-compose.yml file:
version: "2.2"
services:
es01:
image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
ports:
- ${ES_PORT}:9200
environment:
- node.name=es01
- cluster.initial_master_nodes=es01
- ELASTIC_PASSWORD=${ELASTIC_PASSWORD}
- bootstrap.memory_lock=true
ulimits:
memlock:
soft: -1
hard: -1
healthcheck:
test:
[
"CMD-SHELL",
"curl -s -k https://localhost:9200 | grep -q 'missing authentication credentials'",
]
interval: 10s
timeout: 10s
retries: 120
logstash:
depends_on:
es01:
condition: service_healthy
image: docker.elastic.co/logstash/logstash:${STACK_VERSION}
volumes:
- type: bind
source: ../data
target: /usr/share/logstash/persons
read_only: true
- type: bind
source: pipeline
target: /usr/share/logstash/pipeline
read_only: true
environment:
- ELASTICSEARCH_URL=https://es01:9200
- ELASTIC_PASSWORD=${ELASTIC_PASSWORD}
- XPACK_MONITORING_ENABLED=false
We still have our persons.json file in the ../data dir. It's mounted as /usr/share/logstash/persons/persons.json, as in the previous example, and we are using the same pipeline/persons.conf file as seen before.
To run this, we can now just type:
docker compose up
And wait for the with-compose-logstash-1 container to exit:
with-compose-logstash-1 | [2023-04-21T15:17:55,335][INFO ][logstash.runner ] Logstash shut down.
with-compose-logstash-1 exited with code 0
This indicates that our cluster is now ready to use and fully loaded with our sample data set.
Using snapshot and restore
You can also use the Create Snapshot API to back up an existing dataset living in Elasticsearch to a shared filesystem or to S3, for example, and then restore it to your new cluster using the Restore API.
Let's say you have already registered a repository named sample. You can create the snapshot with:
# We force merge the segments first
POST /person/_forcemerge?max_num_segments=1
# Snapshot the data
PUT /_snapshot/sample/persons
{
  "indices": "person",
  "include_global_state": false
}
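If the sample repository does not exist yet, here is one way you could register it, for example as a shared filesystem repository. This is just a sketch: the location below is an arbitrary example and must be listed under path.repo in elasticsearch.yml for the registration to succeed.
PUT /_snapshot/sample
{
  "type": "fs",
  "settings": {
    "location": "/usr/share/elasticsearch/backup"
  }
}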
So, any time that you are starting a new cluster, you can just restore the snapshot with:
POST /_snapshot/sample/persons/_restore
You just need to be careful with this method: the snapshot must remain restorable in your cluster when you upgrade it to a new major release. For example, if you created a snapshot with version 6.3, you won't be able to restore it in 8.2. See Snapshot Index Compatibility for more details. But, good news! With Archive Indices, Elasticsearch now has the ability to access older snapshot repositories (going back to version 5); you just need to be aware of some of the restrictions. To guarantee that your snapshot will always be fully compatible, you might want to snapshot your index again with the most recent version using the same script. Note that the Force Merge API call is important in that case, as it rewrites all the segments using the latest Elasticsearch and Lucene versions.
Using a mounted directory
Remember when we started the cluster?
docker run --name es01 --net elastic -p 9200:9200 -it docker.elastic.co/elasticsearch/elasticsearch:8.7.0
We did not mount the data and the config dirs. But we can actually do this with named volumes:
docker run --name es01 --net elastic -p 9200:9200 -it -v persons-data:/usr/share/elasticsearch/data -v persons-config:/usr/share/elasticsearch/config docker.elastic.co/elasticsearch/elasticsearch:8.7.0
We can inspect the Docker volumes that have just been created:
docker volume inspect persons-data persons-config
[
  {
    "CreatedAt": "2023-05-09T10:20:14Z",
    "Driver": "local",
    "Labels": null,
    "Mountpoint": "/var/lib/docker/volumes/persons-data/_data",
    "Name": "persons-data",
    "Options": null,
    "Scope": "local"
  },
  {
    "CreatedAt": "2023-05-09T10:19:51Z",
    "Driver": "local",
    "Labels": null,
    "Mountpoint": "/var/lib/docker/volumes/persons-config/_data",
    "Name": "persons-config",
    "Options": null,
    "Scope": "local"
  }
]
You can reuse these volumes later if you want to start your Elasticsearch node again with the same command line.
If you need to share your volumes with other users, you can back up the data from /var/lib/docker/volumes/persons-config/ and /var/lib/docker/volumes/persons-data/ to /tmp/volume-backup:
docker run --rm -it -v /tmp/volume-backup:/backup -v /var/lib/docker:/docker alpine:edge tar cfz /backup/persons.tgz /docker/volumes/persons-config /docker/volumes/persons-data
Then, you can just share the /tmp/volume-backup/persons.tgz file with the other users and let them restore it:
docker volume create persons-config
docker volume create persons-data
docker run --rm -it -v /tmp/volume-backup:/backup -v /var/lib/docker:/docker alpine:edge tar xfz /backup/persons.tgz -C /
And start the container again:
docker run --name es01 --net elastic -p 9200:9200 -it -v persons-data:/usr/share/elasticsearch/data -v persons-config:/usr/share/elasticsearch/config docker.elastic.co/elasticsearch/elasticsearch:8.7.0
Using Elastic Cloud
Of course, instead of starting and managing a local Elasticsearch instance by yourself, you can provision a new Elasticsearch Cloud instance using a snapshot you did previously. The following code supposes that you have already defined an API key.
POST /api/v1/deployments?validate_only=false
{
  "resources": {
    "elasticsearch": [
      {
        "region": "gcp-europe-west1",
        "plan": {
          "cluster_topology": [
            {
              "zone_count": 2,
              "elasticsearch": {
                "node_attributes": {
                  "data": "hot"
                }
              },
              "instance_configuration_id": "gcp.es.datahot.n2.68x10x45",
              "node_roles": [
                "master",
                "ingest",
                "transform",
                "data_hot",
                "remote_cluster_client",
                "data_content"
              ],
              "id": "hot_content",
              "size": {
                "resource": "memory",
                "value": 8192
              }
            }
          ],
          "elasticsearch": {
            "version": "8.7.1"
          },
          "deployment_template": {
            "id": "gcp-storage-optimized-v5"
          },
          "transient": {
            "restore_snapshot": {
              "snapshot_name": "__latest_success__",
              "source_cluster_id": "CLUSTER_ID"
            }
          }
        },
        "ref_id": "main-elasticsearch"
      }
    ],
    "kibana": [
      {
        "elasticsearch_cluster_ref_id": "main-elasticsearch",
        "region": "gcp-europe-west1",
        "plan": {
          "cluster_topology": [
            {
              "instance_configuration_id": "gcp.kibana.n2.68x32x45",
              "zone_count": 1,
              "size": {
                "resource": "memory",
                "value": 1024
              }
            }
          ],
          "kibana": {
            "version": "8.7.1"
          }
        },
        "ref_id": "main-kibana"
      }
    ]
  },
  "settings": {
    "autoscaling_enabled": false
  },
  "name": "persons",
  "metadata": {
    "system_owned": false
  }
}
Just replace CLUSTER_ID with the ID of the source cluster you took the snapshot from.
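If you want to send this request from the command line, a minimal sketch could look like the following. It assumes the request body above is saved in a deployment.json file and that EC_API_KEY holds your Elastic Cloud API key:
curl -XPOST -s "https://api.elastic-cloud.com/api/v1/deployments?validate_only=false" \
  -H "Authorization: ApiKey $EC_API_KEY" \
  -H "Content-Type: application/json" \
  --data-binary "@deployment.json" | jq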
Once the cluster is up, you have a fully functional instance that's available on the internet with the default data set you want.
Once you're done, you can easily shut down the deployment using:
POST /api/v1/deployments/DEPLOYMENT_ID/_shutdown
Here again, just replace DEPLOYMENT_ID with the deployment ID you got back when you created the deployment.
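As before, this can be done from the command line with the same API key, assuming DEPLOYMENT_ID holds your deployment ID:
curl -XPOST -s "https://api.elastic-cloud.com/api/v1/deployments/$DEPLOYMENT_ID/_shutdown" \
  -H "Authorization: ApiKey $EC_API_KEY" | jq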
Conclusion
As usual with Elastic® and Elasticsearch specifically, you have many ways to achieve your goals. Although there are probably other options, I've listed some of them here:
- Using a shell script: You don't really need any third-party tool, but this requires some code to be written. The code looks trivial and is OK when you just run it from time to time. If you need it to be safer, like having catch-and-retry behavior, you will have to write and maintain even more code.
- Using Logstash: It's super flexible, as you can also send the data to destinations other than Elasticsearch or use multiple filters to modify or enrich the source data set. It's a bit slower to start, but that shouldn't be an issue for testing purposes.
- Using docker compose: One of my favorite ways. You just run docker compose up and voilà, you're done after a few minutes. But this can still take a while to start and uses hardware resources.
- Using snapshot and restore: Faster than the previous methods as the data is already indexed, but less flexible since the snapshot needs to be compatible with the cluster you are restoring it to. In general, I always prefer injecting the data again because everything is fresh, and I can benefit from all the improvements from Elasticsearch and Lucene under the hood.
- Using a mounted directory: Like a snapshot but more local. I honestly prefer using Elastic APIs over mounting an existing directory manually.
- Using Elastic Cloud: It's in my opinion the easiest way to share a data set with other people like customers or internal testers. It's all set, secure, and ready-to-use with proper SSL certificates.
Depending on your taste and your constraints, you can pick one of those ideas and adapt it for your needs. If you have ideas to share, please tell us on Twitter or on the discussion forum. A lot of great ideas and features are coming from the community. Share yours!
The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.