Troubleshoot common Elastic Cloud on Kubernetes issues

Operator crashes on startup with `OOMKilled`

On very large Kubernetes clusters with many hundreds of resources (pods, secrets, config maps, and so on), the operator may fail to start, with its pod terminated with an OOMKilled reason:

		kubectl -n elastic-system \
  get pods -o=jsonpath='{.items[].status.containerStatuses}' | jq

		[
  {
    "containerID": "containerd://...",
    "image": "docker.elastic.co/eck/eck-operator:3.3.0",
    "imageID": "docker.elastic.co/eck/eck-operator@sha256:...",
    "lastState": {
      "terminated": {
        "containerID": "containerd://...",
        "exitCode": 137,
        "finishedAt": "2022-07-04T09:47:02Z",
        "reason": "OOMKilled",
        "startedAt": "2022-07-04T09:46:43Z"
      }
    },
    "name": "manager",
    "ready": false,
    "restartCount": 2,
    "started": false,
    "state": {
      "waiting": {
        "message": "back-off 20s restarting failed container=manager pod=elastic-operator-0_elastic-system(57de3efd-57e0-4c1e-8151-72b0ac4d6b14)",
        "reason": "CrashLoopBackOff"
      }
    }
  }
]
		
	

This is an issue with the controller-runtime framework on top of which the operator is built. Even though the operator is only interested in the resources created by itself, the framework code needs to gather information about all relevant resources in the Kubernetes cluster in order to provide the filtered view of cluster state required by the operator. On very large clusters, this information gathering can use up a lot of memory and exceed the default resource limit defined for the operator pod.
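
To check the termination reason without reading the full container status, a small helper along these lines may be enough. It is only a sketch: the function name is hypothetical, and it assumes the operator pod is named elastic-operator-0 (as in the output above) and that the manager container is the first container in the pod.

```shell
# Hypothetical helper: print why the operator container was last terminated.
# Assumes the pod is elastic-operator-0 in the elastic-system namespace and
# that "manager" is the first entry in containerStatuses.
operator_last_termination_reason() {
  kubectl -n elastic-system get pod elastic-operator-0 \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
}

# Example (requires cluster access, so it is left commented out):
# operator_last_termination_reason
```

An empty result means the container has no recorded previous termination.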

The default memory limit for the operator pod is set to 1 Gi. You can increase (or decrease) this limit to a value suited to your cluster as follows:

		kubectl patch sts elastic-operator -n elastic-system -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager", "resources":{"limits":{"memory":"2Gi"}}}]}}}}'
		
	

Note

Set limits (spec.containers[].resources.limits) that match requests (spec.containers[].resources.requests) to prevent the operator's Pod from being terminated during node-pressure eviction.
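
As a sketch of the note above, the following helper builds a patch that sets the memory request and the memory limit of the manager container to the same value. The function name is hypothetical and 2Gi is only an example value. CPU settings are not touched, so adjust them separately if needed.

```shell
# Hypothetical helper: print a strategic merge patch that sets the memory
# request and the memory limit of the "manager" container to the same value.
build_operator_memory_patch() {
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"manager","resources":{"requests":{"memory":"%s"},"limits":{"memory":"%s"}}}]}}}}' "$1" "$1"
}

# Example (requires cluster access, so it is left commented out):
# kubectl patch sts elastic-operator -n elastic-system -p "$(build_operator_memory_patch 2Gi)"
```

Generating the patch from a single argument keeps the request and the limit from drifting apart when the value is changed later.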

Timeout when submitting a resource manifest

When submitting an ECK resource manifest, you may encounter an error message similar to the following:

		Error from server (Timeout): error when creating "elasticsearch.yaml": Timeout: request did not complete within requested timeout 30s
		
	

This error usually indicates a problem communicating with the validating webhook. If you are running ECK on a private Google Kubernetes Engine (GKE) cluster, you may need to add a firewall rule allowing port 9443 from the API server. Another possible cause is a strict network policy. Refer to the webhook troubleshooting documentation for more details and workarounds.

Copying secrets with Owner References

Copying the Elasticsearch Secrets generated by ECK (for instance, the certificate authority or the elastic user) into another namespace wholesale can trigger a Kubernetes bug that can delete all of the Elasticsearch-related resources, including the data volumes. Since ECK 1.3.1, the OwnerReference has been removed from both the Elasticsearch Secrets containing public certificates and the Secret holding the elastic user credentials, because these are the Secrets most likely to be copied. If Secrets were copied into other namespaces before ECK 1.3.1, make sure you manually remove the OwnerReference, as these Secrets might still be affected even if ECK has been upgraded.
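
For Secrets that were already copied before ECK 1.3.1, the manual removal could look like the sketch below. The function name and its arguments are hypothetical. It uses a JSON Patch remove operation, which fails when the field is absent, so the helper checks for the field first.

```shell
# Hypothetical helper: drop metadata.ownerReferences from a copied Secret.
# Usage: remove_owner_references <secret-name> <namespace>
remove_owner_references() (
  name="$1"
  namespace="$2"
  # A JSON Patch "remove" fails if the path does not exist, so check first.
  owners="$(kubectl get secret "$name" -n "$namespace" \
    -o jsonpath='{.metadata.ownerReferences}')" || exit 1
  if [ -z "$owners" ]; then
    echo "secret/$name in $namespace has no ownerReferences"
    exit 0
  fi
  kubectl patch secret "$name" -n "$namespace" --type=json \
    -p '[{"op":"remove","path":"/metadata/ownerReferences"}]'
)

# Example (requires cluster access, so it is left commented out):
# remove_owner_references quickstart-es-elastic-user other-namespace
```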

For example, a source secret might be:

		kubectl get secret quickstart-es-elastic-user -o yaml
apiVersion: v1
data:
  elastic: NGw2Q2REMjgwajZrMVRRS0hxUDVUUTU0
kind: Secret
metadata:
  creationTimestamp: "2020-06-09T19:11:41Z"
  labels:
    common.k8s.elastic.co/type: elasticsearch
    eck.k8s.elastic.co/credentials: "true"
    elasticsearch.k8s.elastic.co/cluster-name: quickstart
  name: quickstart-es-elastic-user
  namespace: default
  ownerReferences:
  - apiVersion: elasticsearch.k8s.elastic.co/v1
    blockOwnerDeletion: true
    controller: true
    kind: Elasticsearch
    name: quickstart
    uid: c7a9b436-aa07-4341-a2cc-b33b3dfcbe29
  resourceVersion: "13048277"
  selfLink: /api/v1/namespaces/default/secrets/quickstart-es-elastic-user
  uid: 04cdf334-77d3-4de6-a2e8-7a2b23366a27
type: Opaque
		
	

To copy it to a different namespace, strip the metadata.ownerReferences field as well as the object-specific data:

		apiVersion: v1
data:
  elastic: NGw2Q2REMjgwajZrMVRRS0hxUDVUUTU0
kind: Secret
metadata:
  labels:
    common.k8s.elastic.co/type: elasticsearch
    eck.k8s.elastic.co/credentials: "true"
    elasticsearch.k8s.elastic.co/cluster-name: quickstart
  name: quickstart-es-elastic-user
  namespace: default
type: Opaque
		
	

Failure to do so can cause data loss.
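
One way to avoid carrying over ownerReferences at all is to create a fresh Secret from the decoded value instead of copying the object. The sketch below assumes the elastic user Secret from the example above. The function name is hypothetical, labels are not carried over, and the password is passed on the command line, which may be visible to other users on the same machine.

```shell
# Hypothetical helper: recreate the elastic user Secret in another namespace
# without any metadata from the source object.
# Usage: copy_elastic_user_secret <secret-name> <source-ns> <target-ns>
copy_elastic_user_secret() (
  name="$1"
  src_ns="$2"
  dst_ns="$3"
  # Read and decode only the "elastic" key of the source Secret.
  password="$(kubectl get secret "$name" -n "$src_ns" \
    -o go-template='{{.data.elastic | base64decode}}')" || exit 1
  if [ -z "$password" ]; then
    echo "no elastic key found in secret/$name" >&2
    exit 1
  fi
  # A newly created Secret has no ownerReferences.
  kubectl create secret generic "$name" -n "$dst_ns" \
    --from-literal=elastic="$password"
)

# Example (requires cluster access, so it is left commented out):
# copy_elastic_user_secret quickstart-es-elastic-user default other-namespace
```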

Scale down of Elasticsearch master-eligible Pods seems stuck

If a master-eligible Elasticsearch Pod was never successfully scheduled and the Elasticsearch cluster is running version 7.8 or earlier, ECK may fail to scale down the Pod. To find out whether you are affected, check if the Pod in question is pending:

		> kubectl get pods
pod/<cluster-name>-es-<nodeset>-1                    0/1     Pending   0          26m    <none>        <none>

Check the operator logs for an error similar to:

		"unable to add to voting_config_exclusions: 400 Bad Request: add voting config exclusions request for [<cluster-name>-es-<nodeset>-1] matched no master-eligible nodes",
		
	

To work around this issue, scale down the underlying StatefulSet manually. First, identify the affected StatefulSet and the number of Pods that are ready (represented by m in this example):

		> kubectl get sts -l elasticsearch.k8s.elastic.co/cluster-name=<cluster-name>
NAME                       READY   AGE
<cluster-name>-es-<nodeset>   m/n     44h
		
	

Then, scale down the StatefulSet to the right size m, removing the pending Pod:

> kubectl scale --replicas=m sts/<cluster-name>-es-<nodeset>

Warning

Do not use this method to scale down Pods that have already joined the Elasticsearch cluster, because it sidesteps the additional data loss protection that ECK applies.
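
The two steps can be combined in a small helper, sketched below under the assumption that every Pod that is not ready is a Pod that was never scheduled. The function name is hypothetical. Check the Pod list yourself before running it, because a Pod that has joined the cluster but is temporarily not ready would also be removed.

```shell
# Hypothetical helper: scale a StatefulSet down to its number of ready Pods.
# Usage: scale_to_ready_replicas <statefulset-name>
scale_to_ready_replicas() (
  sts="$1"
  ready="$(kubectl get sts "$sts" -o jsonpath='{.status.readyReplicas}')" || exit 1
  # Refuse to continue when the value is missing or not a number, so that
  # the StatefulSet is never scaled on a guess.
  case "$ready" in
    ''|*[!0-9]*)
      echo "could not determine ready replicas for $sts" >&2
      exit 1
      ;;
  esac
  kubectl scale --replicas="$ready" "sts/$sts"
)

# Example (requires cluster access, so it is left commented out):
# scale_to_ready_replicas <cluster-name>-es-<nodeset>
```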

Pods are not replaced after a configuration update

The update of an existing Elasticsearch cluster configuration can fail because the operator is unable to apply the changes required while replacing the pods of a given Elasticsearch cluster.

A key indicator is that the Phase of the Elasticsearch resource remains in the ApplyingChanges state for too long:

		kubectl get es

NAME                  HEALTH   NODES    VERSION   PHASE            AGE
elasticsearch-sample  yellow   2        7.9.2     ApplyingChanges  36m
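
When many clusters run in the same namespace, a filter such as the following sketch can list only those currently in the ApplyingChanges phase. The function name is hypothetical. It expects two columns (name and phase) on standard input and does not measure how long a cluster has been in that phase.

```shell
# Hypothetical helper: read "NAME PHASE" lines on stdin and print the names
# of the resources whose phase is ApplyingChanges.
filter_applying_changes() {
  awk '$2 == "ApplyingChanges" { print $1 }'
}

# Example (requires cluster access, so it is left commented out):
# kubectl get elasticsearch --no-headers \
#   -o custom-columns=NAME:.metadata.name,PHASE:.status.phase | filter_applying_changes
```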

Possible causes include:

The Elasticsearch cluster is not healthy

		kubectl get elasticsearch

NAME                                                              HEALTH   NODES   VERSION   PHASE   AGE
elasticsearch.elasticsearch.k8s.elastic.co/elasticsearch-sample   yellow   1       7.9.2     Ready   3m50s

In this case, you have to check and fix your shard allocations. The cluster health, cat shards, and get Elasticsearch APIs can assist in tracking the shard recovery process.

Scheduling issues

The scheduling fails with the following message:

		kubectl get events --sort-by='{.lastTimestamp}' | tail

LAST SEEN   TYPE      REASON             OBJECT                        MESSAGE
10s         Warning   FailedScheduling   pod/quickstart-es-default-2   0/3 nodes are available: 3 Insufficient memory.

As an alternative, to get more specific information about a given pod, you can use the following command:

		kubectl get pod elasticsearch-sample-es-default-2  -o go-template="{{.status}}"
map[conditions:[map[lastProbeTime:<nil> lastTransitionTime:2020-12-07T09:31:06Z message:0/3 nodes are available: 3 Insufficient cpu. reason:Unschedulable status:False type:PodScheduled]] phase:Pending qosClass:Guaranteed]
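
To narrow the event list down to scheduling failures of Pods that are still pending, the two commands can be combined as in the sketch below. The function name is hypothetical, and the helper works on the current namespace only.

```shell
# Hypothetical helper: print FailedScheduling events for every Pending Pod
# in the current namespace.
failed_scheduling_events() (
  kubectl get pods --field-selector=status.phase=Pending -o name |
    while IFS= read -r pod; do
      # "-o name" prints "pod/<name>", so strip the "pod/" prefix.
      kubectl get events \
        --field-selector "involvedObject.name=${pod#pod/},reason=FailedScheduling"
    done
)

# Example (requires cluster access, so it is left commented out):
# failed_scheduling_events
```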

The operator is not able to restart some nodes

		kubectl -n elastic-system logs statefulset.apps/elastic-operator | tail

{"log.level":"info","@timestamp":"2020-11-19T17:34:48.769Z","log.logger":"driver","message":"Cannot restart some nodes for upgrade at this time","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","namespace":"default","es_name":"quickstart","failed_predicates":{"do_not_restart_healthy_node_if_MaxUnavailable_reached":["quickstart-es-default-1","quickstart-es-default-0"]}}

A pod is stuck in a Pending status:

		kubectl get pods

NAME                      READY   STATUS    RESTARTS   AGE
quickstart-es-default-0   1/1     Running   0          146m
quickstart-es-default-1   1/1     Running   0          146m
quickstart-es-default-2   0/1     Pending   0          134m
		
	

In this case, you have to add more Kubernetes nodes or free up resources.

For more information, check Troubleshooting methods.

ECK operator upgrade stays pending when using OLM

When you use Operator Lifecycle Manager (OLM) to install and upgrade the ECK operator, an upgrade of ECK will not complete on older versions of OLM. This is due to an issue in OLM itself that is fixed in version 0.16.0 and later. OLM is also used behind the scenes when you install ECK as a Red Hat Certified Operator on OpenShift or as a community operator through operatorhub.io.

		> oc get csv
NAME                           DISPLAY                        VERSION     REPLACES                   PHASE
elastic-cloud-eck.v1.3.1       Elasticsearch (ECK) Operator   1.3.1       elastic-cloud-eck.v1.3.0   Replacing
elastic-cloud-eck.v1.4.0       Elasticsearch (ECK) Operator   1.4.0       elastic-cloud-eck.v1.3.1   Pending
		
	

If you are using one of the affected versions of OLM and upgrading OLM to a newer version is not possible, ECK can still be upgraded by uninstalling and reinstalling it. To do so, remove the Subscription and both ClusterServiceVersion resources and add them again. On OpenShift, the same workaround can be performed in the UI by clicking "Uninstall Operator" and then reinstalling the operator through OperatorHub.
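
On the command line, the removal step might look like the following sketch. The function name is hypothetical, and the namespace and Subscription name depend on how ECK was installed, so both are passed as arguments. Reinstalling the operator afterwards is not covered here.

```shell
# Hypothetical helper: remove the Subscription and the given
# ClusterServiceVersions so that ECK can be reinstalled through OLM.
# Usage: remove_eck_olm_resources <namespace> <subscription> <csv>...
remove_eck_olm_resources() (
  namespace="$1"
  subscription="$2"
  shift 2
  kubectl delete subscriptions.operators.coreos.com "$subscription" -n "$namespace" || exit 1
  # The remaining arguments are the CSV names, for example both the
  # "Replacing" and the "Pending" entries shown by "oc get csv".
  kubectl delete csv "$@" -n "$namespace"
)

# Example (requires cluster access, so it is left commented out):
# remove_eck_olm_resources <namespace> <subscription-name> \
#   elastic-cloud-eck.v1.3.1 elastic-cloud-eck.v1.4.0
```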

If you upgraded Elasticsearch to the wrong version

If you accidentally upgrade one of your Elasticsearch clusters to a version that does not exist, or to a version that cannot be reached by a direct upgrade from your currently deployed version, a validation prevents you from going back to the previous version. ECK does not allow downgrades because Elasticsearch does not support them: once the data directory of Elasticsearch has been upgraded, there is no way back to the old version without a snapshot restore.

These two upgrade scenarios, however, are exceptions because Elasticsearch never started up successfully. If you annotate the Elasticsearch resource with eck.k8s.elastic.co/disable-downgrade-validation=true, ECK allows you to go back to the old version at your own risk. If you also attempted an upgrade of other related Elastic Stack applications at the same time, you can use the same annotation to go back. Remove the annotation afterwards to prevent accidental downgrades and reduced availability.
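
Setting and removing the annotation could be done with kubectl as sketched below. The function names are hypothetical and the resource name quickstart in the example is only a placeholder. A trailing dash after the annotation key is the kubectl syntax for removing an annotation.

```shell
# Hypothetical helpers around the downgrade validation annotation.
# Usage: allow_downgrade <elasticsearch-resource-name>
allow_downgrade() {
  kubectl annotate elasticsearch "$1" \
    eck.k8s.elastic.co/disable-downgrade-validation=true
}

# Usage: restore_downgrade_validation <elasticsearch-resource-name>
restore_downgrade_validation() {
  # The trailing "-" removes the annotation again.
  kubectl annotate elasticsearch "$1" \
    eck.k8s.elastic.co/disable-downgrade-validation-
}

# Example (requires cluster access, so it is left commented out):
# allow_downgrade quickstart
# ... revert the version in the manifest and apply it ...
# restore_downgrade_validation quickstart
```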

Reconfigure stack config policy based role mappings after an upgrade to 8.15.3 from 8.14.x or 8.15.x

This applies if you have role mappings defined in a StackConfigPolicy and you upgraded from 8.14.x or 8.15.x to 8.15.3.

Examples:

- 8.14.2 → 8.15.2 → 8.15.3
- 8.14.2 → 8.15.3
- 8.15.2 → 8.15.3

The best option is to upgrade to 8.16.0, which fixes the problem automatically. If this is not possible and you are stuck on 8.15.3, you have to perform two manual steps to correctly reconfigure role mappings, because a bug caused the role mappings to be duplicated.

Force reload the StackConfigPolicy configuration

Force reload the StackConfigPolicy configuration containing the role mappings definition, by adding metadata to any of the mappings:

		apiVersion: stackconfigpolicy.k8s.elastic.co/v1alpha1
kind: StackConfigPolicy
spec:
  elasticsearch:
    securityRoleMappings:
      <roleName>:
        metadata:
          force_reload: anything
		
	

The force_reload entry is dummy metadata whose only purpose is to force a reload of the configuration.

Check that the role mapping is now in the cluster state:

		GET /_cluster/state/metadata?filter_path=metadata.role_mappings.role_mappings
{"metadata":{"role_mappings":{"role_mappings":[{"enabled":true,"roles":["superuser"],"rules":{"all":[{"field":{"realm.name":"oidc1"}},{"field":{"username":"*"}}]},"metadata":{"force_reload":"dummy"}}]}}}

Remove duplicated role mappings exposed via the API

Start by listing all the role mappings defined in your StackConfigPolicy:

		kubectl get scp <scpName> -o json | jq '.spec.elasticsearch.securityRoleMappings | to_entries[].key' -r
<roleName>

Delete each role mapping:

		DELETE /_security/role_mapping/<roleName>
{"found": true}

Check that the role mapping was deleted:

		GET /_security/role_mapping/<roleName>
{}
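
With many role mappings, the list and delete steps can be chained. The sketch below reads mapping names from standard input and issues one DELETE request per name with curl. The function name, the ES_PASSWORD variable, and the URL in the example are assumptions, and TLS options such as --cacert are omitted and will likely be needed in a real cluster.

```shell
# Hypothetical helper: delete every role mapping whose name is read from stdin.
# Usage: <list of names> | delete_role_mappings <elasticsearch-url>
# Assumes the elastic user's password is available in ES_PASSWORD.
delete_role_mappings() (
  es_url="$1"
  while IFS= read -r name; do
    # Skip empty lines in the input.
    [ -n "$name" ] || continue
    curl -sS -u "elastic:${ES_PASSWORD}" -X DELETE \
      "${es_url}/_security/role_mapping/${name}" </dev/null
    echo
  done
)

# Example (requires cluster access, so it is left commented out):
# kubectl get scp <scpName> -o json \
#   | jq -r '.spec.elasticsearch.securityRoleMappings | to_entries[].key' \
#   | delete_role_mappings https://localhost:9200
```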

Volume expansion failed

If you attempted an expansion of an Elasticsearch data volume via its volume claim template, you may have encountered scenarios where the operation failed. For example, older versions of the Azure Disk CSI driver did not allow volume expansion without shutting down the Virtual Machine to which the volume was attached. If you try to adjust the volume claim template back to the original size, you will encounter an error:

		Failed to apply spec change: handle volume expansion: decreasing storage size is not supported: an attempt was made to decrease storage size for claim elasticsearch-data
		
	

In this scenario, the best course of action is to rename the existing nodeSet while simultaneously restoring the volume claim template to the original size. This operation brings a new StatefulSet online, moves all existing data to the new volumes, and deletes the old StatefulSet and its volumes once the operation is complete.