Rebuilding a broken Zookeeper quorum
This article covers an advanced recovery method that involves directly modifying Zookeeper. This process can potentially corrupt your data. Elastic recommends following this outline only after receiving confirmation from Elastic Support.
This article describes how to recover a broken Zookeeper leader or follower within Elastic Cloud Enterprise.
When to recover
When an ECE director host’s Zookeeper status cannot be determined to be healthy using the Verify Zookeeper sync status command or from Elastic Cloud Enterprise > Platform > Settings, you might need to recover Zookeeper.
This situation might surface when recovering the Elastic Cloud Enterprise director host from a full disk issue.
A healthy Zookeeper quorum returns a sync status similar to the following. Any other responses require further investigation.
$ # Zookeeper leader with id:10
$ echo mntr | nc 127.0.0.1 2191
zk_server_state leader
# ...
zk_followers 2
zk_synced_followers 2

$ # Zookeeper follower with id:11
$ echo mntr | nc 127.0.0.1 2192
zk_server_state follower

$ # Zookeeper follower with id:12
$ echo mntr | nc 127.0.0.1 2193
zk_server_state follower
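To repeat this check quickly across all three nodes, the loop below is a minimal sketch; the client ports 2191-2193 and the localhost address are assumptions taken from the example above, so adjust them to your environment.

#!/usr/bin/env bash
# Sketch: query each local Zookeeper client port from the example above
# and print the server state plus follower counters where present.
for port in 2191 2192 2193; do
  echo "--- port $port ---"
  echo mntr | nc 127.0.0.1 "$port" | grep -E 'zk_server_state|zk_(synced_)?followers'
done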
Back up data directories
Before recovering the Zookeeper leader or follower, back up the Zookeeper data directories on all Elastic Cloud Enterprise hosts. Normally this is only applicable to director hosts, but it may apply to other hosts during migrations.
Perform the following steps on each host to back up the Zookeeper data directory:
- Extract the Zookeeper /data directory path:

  docker inspect --format '{{ range .Mounts }}{{ .Source }} {{ end }}' frc-zookeeper-servers-zookeeper | grep --color=auto "zookeeper/data"
- Make a copy or backup of the emitted directory. For example, if the data directory is /mnt/data/elastic/172.16.0.30/services/zookeeper/data, then run the following command:

  cp -R /mnt/data/elastic/172.16.0.30/services/zookeeper/data /mnt/data/elastic/ZK_data_backup
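Taken together, the two steps above can be scripted. The following is a minimal sketch, not part of the official procedure; the timestamped destination path is an assumption, so adjust it to your layout.

#!/usr/bin/env bash
# Sketch: locate the Zookeeper data directory from the container mounts
# and copy it to a timestamped backup location (path is an assumption).
set -euo pipefail
data_dir=$(docker inspect --format '{{ range .Mounts }}{{ .Source }} {{ end }}' \
  frc-zookeeper-servers-zookeeper | tr ' ' '\n' | grep 'zookeeper/data')
cp -R "$data_dir" "/mnt/data/elastic/ZK_data_backup_$(date +%Y%m%d%H%M%S)"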
Determine the Zookeeper leader
If a Zookeeper quorum is broken, you must identify the best Zookeeper leader to use before you start the recovery process.
The simplest way to check is to use the Zookeeper sync status command.
If this command does not report any leaders, then perform the following actions on each director host:
- SSH into the host.
- Enter the Docker frc-zookeeper-servers-zookeeper container and check its /app/logs/zookeeper.log logs for LEADING:

  $ docker exec -it frc-zookeeper-servers-zookeeper bash
  root@XXXXX:/# cat /app/logs/zookeeper.log | grep 'LEADING'
This command will return results similar to the following:
INFO [QuorumPeer[myid=10](plain=0.0.0.0:2191)(secure=disabled):o.a.z.s.q.QuorumPeer@1549] - LEADING
INFO [QuorumPeer[myid=10](plain=0.0.0.0:2191)(secure=disabled):o.a.z.s.q.Leader@588] - LEADING - LEADER ELECTION TOOK - 225 MS
- If multiple directors report this log entry, then determine the one with the latest timestamp, which will contain the latest Zookeeper state. A sketch for comparing the entries follows this list.
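A minimal sketch for pulling the most recent LEADING entry on each director host so the timestamps can be compared side by side; it assumes the log location shown above and that each log line begins with a timestamp.

# Sketch: print the newest LEADING entry from this host's Zookeeper log.
# Run it on each director host and compare the timestamps.
docker exec frc-zookeeper-servers-zookeeper \
  sh -c "grep 'LEADING' /app/logs/zookeeper.log | tail -n 1"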
Recover Zookeeper nodes
In the following recovery steps, the steps for the determined leader are marked with [leader], and the steps for all other Zookeepers are marked with [followers]. The [leader] should be recovered as needed before its [followers]. Steps marked [followers] should be performed on each follower director host, and steps marked [director] should be performed only on problematic director hosts.
Recover the Zookeeper leader
Restart the Zookeeper container
editTo recover the Zookeeper leader, you should first try to restart the Docker Zookeeper container. Restarting the container is often enough to trigger the leader to resync its connection to its followers.
Within an SSH session on the Zookeeper host, run the following command:
docker restart frc-zookeeper-servers-zookeeper
Wait a few minutes for the state to sync across the leader and followers, then check the Zookeeper sync status to see whether the quorum has recovered.
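As a convenience, the wait can be scripted; a minimal sketch, assuming this host's client port is 2191 as in the earlier example (substitute the correct port for the host).

# Sketch: poll the local Zookeeper client port after the restart until
# the node reports a server state again, up to roughly five minutes.
for i in $(seq 1 30); do
  state=$(echo mntr | nc 127.0.0.1 2191 | awk '/zk_server_state/ {print $2}')
  [ -n "$state" ] && echo "zk_server_state: $state" && break
  sleep 10
done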
If the Zookeeper leader is still not recovered, proceed to the next section.
Manually set the Zookeeper leader
If restarting the Zookeeper container does not recover the leader, you can manually set the leader and rebuild the quorum.
- [followers] Shut down the Docker Runner and Zookeeper containers:

  docker stop frc-runners-runner
  docker stop frc-zookeeper-servers-zookeeper
- [leader] Stop the Zookeeper service within the Docker container. Note that this stops the service inside the Docker container; it does not stop the Zookeeper Docker container itself:

  docker exec -it frc-zookeeper-servers-zookeeper sv stop zookeeper
- [leader] Enter the Docker Zookeeper container and determine its Zookeeper ID:

  $ docker exec -it frc-zookeeper-servers-zookeeper bash
  root@XXXXX:/# cat /app/data/myid
  10
- [leader] In the directory /app/managed/, modify the Zookeeper file replicated.cfg.dynamic (an illustration follows this list):
  - Remove the lines referencing other Zookeeper hosts.
  - If multiple lines reference localhost, then remove all but the one containing the Zookeeper ID from the previous step.
- [leader] Restart the Docker Zookeeper and Director containers:

  docker restart frc-zookeeper-servers-zookeeper
  docker restart frc-directors-director
- [leader] Check the Zookeeper sync status. The response should now show this director host as the Zookeeper leader.
- Confirm that Elastic Cloud Enterprise is now also able to check the Zookeeper status and make changes.
- [followers] Restart the Docker Zookeeper, Director, and Runner containers:

  docker restart frc-zookeeper-servers-zookeeper
  docker restart frc-directors-director
  docker restart frc-runners-runner
- Verify that the Zookeeper sync status reports an odd number for zk_quorum_size and that no Zookeeper hosts are marked as lost.
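For illustration of the replicated.cfg.dynamic edit above, the lines below show a hypothetical before-and-after. The server addresses and quorum/election ports are assumptions; only the IDs and client ports are taken from the earlier sync-status example.

# Hypothetical /app/managed/replicated.cfg.dynamic before the edit:
server.10=localhost:2898:3898:participant;2191
server.11=172.16.0.31:2899:3899:participant;2192
server.12=172.16.0.32:2900:3900:participant;2193

# After the edit on the [leader] whose Zookeeper ID is 10, only its own
# line remains:
server.10=localhost:2898:3898:participant;2191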
Recover the Zookeeper follower
Zookeeper followers can sometimes refuse a [leader] election, or their state can become corrupted. The following steps can be used to recover a broken or corrupted Zookeeper [follower]. These steps should only be considered after confirming a Zookeeper leader, as the [follower] will be reset to copy the state from the [leader].
On the [follower], do the following:
- Get the director host’s Zookeeper /data directory path:

  docker inspect --format '{{ range .Mounts }}{{ .Source }} {{ end }}' frc-zookeeper-servers-zookeeper | grep --color=auto "zookeeper/data"
- Stop the Docker Runner and Zookeeper containers:

  docker stop frc-runners-runner
  docker stop frc-zookeeper-servers-zookeeper
- Under the determined /data directory, remove the sub-directory data/version-NUMBER, replacing the NUMBER placeholder:

  /mnt/data/elastic/MY_IP/services/zookeeper/data$ rm -R ./version-NUMBER/

  Make sure that the myid file exists and is retained.
- Start the Runner container, which will auto-start the Docker Zookeeper container:

  docker start frc-runners-runner
- Wait a few minutes for the Zookeeper states to sync. Then check the Zookeeper sync status to confirm the following:
  - zk_server_state follower
  - zk_outstanding_requests 0
- Confirm that the [leader] recognizes the added [follower] by checking the Zookeeper sync status for an incremented zk_synced_followers count.
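The follower reset above can be consolidated into one script. This is a minimal sketch under the assumptions of the steps above (container names and the data-directory layout); review the resolved paths before running anything destructive.

#!/usr/bin/env bash
# Sketch: reset a corrupted [follower] so it re-copies state from the
# [leader]. Stops the containers, removes the version-NUMBER directory
# while keeping myid, then restarts the Runner container.
set -euo pipefail
data_dir=$(docker inspect --format '{{ range .Mounts }}{{ .Source }} {{ end }}' \
  frc-zookeeper-servers-zookeeper | tr ' ' '\n' | grep 'zookeeper/data')
echo "Zookeeper data directory: $data_dir"
docker stop frc-runners-runner frc-zookeeper-servers-zookeeper
rm -R "$data_dir"/version-*/      # myid lives alongside and is retained
docker start frc-runners-runner   # auto-starts the Zookeeper container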