Troubleshooting container engines

This article describes how to troubleshoot container engine services in Elastic Cloud Enterprise. We refer to Docker by default, as it’s the most common container engine, but these steps are also valid for Podman. You can simply replace docker in the commands with podman as needed.

Do not restart the Docker daemon unless Elastic Support directly prescribes it after reviewing an Elastic Cloud Enterprise diagnostic, as Docker has historically left residual orphan processes behind. We also advise against running any variation of Docker’s prune commands, to avoid accidental data loss.

Use supported configuration

Make sure to use a supported combination of Linux operating system and container engine version, following our official Support matrix. Unsupported combinations can cause a range of intermittent or potentially permanent issues with your Elastic Cloud Enterprise environment, such as failures to create system deployments or to upgrade workload deployments, proxy timeouts, data loss, and more.

Troubleshoot unhealthy containers

While troubleshooting the stability of an Elastic Cloud Enterprise host, you may encounter unhealthy Docker containers as reported by docker ps.
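
As a minimal sketch of that check, the container listing can be narrowed to unhealthy containers with the engine's health filter. The function name list_unhealthy_containers is hypothetical and only wraps the command for reuse; replace docker with podman as needed.

```shell
# Print the name and status of every running container whose health
# check currently reports "unhealthy".
# list_unhealthy_containers is a hypothetical helper name, not an ECE tool.
list_unhealthy_containers() {
  docker ps --filter "health=unhealthy" --format 'table {{.Names}}\t{{.Status}}'
}
```

Running list_unhealthy_containers on the host prints a header row followed by one line per unhealthy container, or only the header when there are none.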

System containers rarely report as unhealthy. When they do, it is usually after an unexpected event or a problem during operating system maintenance. If operating system maintenance needs to be performed, see our perform host maintenance guide.

Restart deployment instances

If the unhealthy Docker container is a deployment’s instance (name format fac-{cluster_id}-instance-{node_id}), we recommend restarting the instance from the Elastic Cloud Enterprise UI via its pause and resume mechanism rather than via Docker.

If the unhealthy status returns, we recommend investigating via our troubleshooting bootlooping guide.

This typically indicates an issue with the Elasticsearch configuration rather than a Docker-level problem. An isolated exception, affecting air-gapped environments, is when the expected Docker image does not yet exist on the allocator, in which case its logs report Unable to pull image.
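
To check for the air-gapped case, one option is to search a container's recent log output for that message. This is a sketch only: find_pull_errors is a hypothetical helper, the container name is left as a parameter because which container's logs to search is an assumption on your side, and the 1000-line window is an arbitrary choice.

```shell
# Search a container's recent log output for the image-pull failure message.
# $1: name of the container whose logs you are reviewing (your choice)
# Prints matching lines; returns non-zero when nothing matches.
find_pull_errors() {
  docker logs --tail 1000 "$1" 2>&1 | grep -F "Unable to pull image"
}
```

A match suggests loading the expected image onto the allocator before resuming the instance from the UI.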

Restart service containers

While troubleshooting unhealthy Elastic Cloud Enterprise system containers (name prefix frc-), note that some can be restarted safely while others should not be restarted.

Elastic Cloud Enterprise’s runners automatically create or restart missing system containers. If you want to permanently remove a system container by removing its role from the host, update the runner roles instead. If eligible system containers return to an unhealthy status after a restart, we recommend reviewing their start-up Docker logs.
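
One way to review start-up logs is to ask the engine when the container last started and show only the log lines written since then. The sketch below assumes Docker: the StartedAt timestamp printed by Podman's inspect may use a different format that --since does not accept. The helper name startup_logs and the 50-line default are hypothetical choices.

```shell
# Show the first log lines written since the container's most recent start.
# $1: container name, $2: number of lines to show (default 50)
# startup_logs is a hypothetical helper name, not an ECE tool.
startup_logs() {
  started_at="$(docker inspect --format '{{.State.StartedAt}}' "$1")" || return 1
  docker logs --timestamps --since "$started_at" "$1" 2>&1 | head -n "${2:-50}"
}
```

For example, startup_logs frc-directors-director 100 prints up to 100 timestamped lines from the current run of that container.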

It is safe to restart the following via docker stop followed by docker rm:

  • frc-allocator-metricbeats-allocator-metricbeat
  • frc-allocators-allocator
  • frc-beats-runners-beats-runner
  • frc-constructors-constructor
  • frc-proxies-proxyv2

It is safe to restart the following via docker restart:

  • frc-client-forwarders-client-forwarder
  • frc-directors-director
  • frc-services-forwarders-services-forwarder

It is not safe to restart the following without explicit steps from Elastic Support upon reviewing an Elastic Cloud Enterprise diagnostic:

  • any container whose name starts with fac-
  • frc-runners-runner
  • frc-zookeeper-servers-zookeeper

For unhealthy Zookeeper, instead see verify Zookeeper sync status and resolving Zookeeper quorum.

For any Elastic Cloud Enterprise system container not listed, contact Elastic Support for guidance.
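
The three lists above can be encoded as a small guard so that a restart is only attempted with the method listed for that container. This is an illustrative sketch, not an Elastic-provided tool: the function names restart_method and safe_restart are hypothetical, and the container names are copied from the lists in this section. Replace docker with podman as needed.

```shell
# Print the restart method this section lists for a container name:
#   stop-rm  -> stop and remove; the runner recreates the container
#   restart  -> restart in place
#   support  -> do not restart without explicit steps from Elastic Support
#   unlisted -> not covered here; ask Elastic Support
restart_method() {
  case "$1" in
    frc-allocator-metricbeats-allocator-metricbeat | frc-allocators-allocator | \
    frc-beats-runners-beats-runner | frc-constructors-constructor | \
    frc-proxies-proxyv2)
      echo "stop-rm" ;;
    frc-client-forwarders-client-forwarder | frc-directors-director | \
    frc-services-forwarders-services-forwarder)
      echo "restart" ;;
    fac-* | frc-runners-runner | frc-zookeeper-servers-zookeeper)
      echo "support" ;;
    *)
      echo "unlisted" ;;
  esac
}

# Restart a container only when this section lists a safe method for it.
safe_restart() {
  case "$(restart_method "$1")" in
    stop-rm) docker stop "$1" && docker rm "$1" ;;
    restart) docker restart "$1" ;;
    *)
      echo "Refusing to restart $1 from the command line; see the guidance above." >&2
      return 1 ;;
  esac
}
```

For example, safe_restart frc-proxies-proxyv2 stops and removes the proxy container, while safe_restart frc-runners-runner prints a refusal and returns a non-zero status.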