Common Issues

edit

This page describes how to resolve common problems you might encounter with Alerting.

Rules with small check intervals run late

edit

Problem

Rules with a small check interval, such as every two seconds, run later than scheduled.

Solution

Rules run as background tasks at a cadence defined by their check interval. When a Rule check interval is smaller than the Task Manager poll_interval, the rule will run late.

Either tweak the Kibana Task Manager settings or increase the check interval of the rules in question.

For more details, see Tasks with small schedule intervals run late.

Rules with the inconsistent cadence

edit

Problem

Scheduled rules run at an inconsistent cadence, often running late.

Actions run long after the status of a rule changes, sending a notification of the change too late.

Solution

Rules and actions run as background tasks by each Kibana instance at a default rate of ten tasks every three seconds. When diagnosing issues related to alerting, focus on the tasks that begin with alerting: and actions:.

Alerting tasks always begin with alerting:. For example, the alerting:.index-threshold tasks back the index threshold stack rule. Action tasks always begin with actions:. For example, the actions:.index tasks back the index action.

For more details on monitoring and diagnosing task execution in Task Manager, see Health monitoring.

Connectors have TLS errors when executing actions

edit

Problem

When executing actions, a connector gets a TLS socket error when connecting to the server.

Solution

Configuration options are available to specialize connections to TLS servers, including ignoring server certificate validation, and providing certificate authority data to verify servers using custom certificates. For more details, see Action settings.

Rules take a long time to run

edit

Problem

Rules are taking a long time to execute and are impacting the overall health of your deployment.

By default, only users with a superuser role can query the [preview] This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features. Kibana event log because it is a system index. To enable additional users to execute this query, assign read privileges to the .kibana-event-log* index.

Solution

Query for a list of rule ids, bucketed by their execution times:

GET /.kibana-event-log*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-1d", 
              "lte": "now"
            }
          }
        },
        {
          "term": {
            "event.action": {
              "value": "execute"
            }
          }
        },
        {
          "term": {
            "event.provider": {
              "value": "alerting" 
            }
          }
        }
      ]
    }
  },
  "runtime_mappings": { 
    "event.duration_in_seconds": {
      "type": "double",
      "script": {
        "source": "emit(doc['event.duration'].value / 1E9)"
      }
    }
  },
  "aggs": {
    "ruleIdsByExecutionDuration": {
      "histogram": {
        "field": "event.duration_in_seconds",
        "min_doc_count": 1,
        "interval": 1 
      },
      "aggs": {
        "ruleId": {
          "nested": {
            "path": "kibana.saved_objects"
          },
          "aggs": {
            "ruleId": {
              "terms": {
                "field": "kibana.saved_objects.id",
                "size": 10 
              }
            }
          }
        }
      }
    }
  }
}

This queries for rules executed in the last day. Update the values of lte and gte to query over a different time range.

Use event.provider: actions to query for long-running action executions.

Execution durations are stored as nanoseconds. This adds a runtime field to convert that duration into seconds.

This interval buckets the event.duration_in_seconds runtime field into 1 second intervals. Update this value to change the granularity of the buckets. If you are unable to use runtime fields, make sure this aggregation targets event.duration and use nanoseconds for the interval.

This retrieves the top 10 rule ids for this duration interval. Update this value to retrieve more rule ids.

This query returns the following:

{
  "took" : 322,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 326,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "ruleIdsByExecutionDuration" : {
      "buckets" : [
        {
          "key" : 0.0, 
          "doc_count" : 320,
          "ruleId" : {
            "doc_count" : 320,
            "ruleId" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [
                {
                  "key" : "1923ada0-a8f3-11eb-a04b-13d723cdfdc5",
                  "doc_count" : 140
                },
                {
                  "key" : "15415ecf-cdb0-4fef-950a-f824bd277fe4",
                  "doc_count" : 130
                },
                {
                  "key" : "dceeb5d0-6b41-11eb-802b-85b0c1bc8ba2",
                  "doc_count" : 50
                }
              ]
            }
          }
        },
        {
          "key" : 30.0, 
          "doc_count" : 6,
          "ruleId" : {
            "doc_count" : 6,
            "ruleId" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [
                {
                  "key" : "41893910-6bca-11eb-9e0d-85d233e3ee35",
                  "doc_count" : 6
                }
              ]
            }
          }
        }
      ]
    }
  }
}

Most rule execution durations fall within the first bucket (0 - 1 seconds).

A single rule with id 41893910-6bca-11eb-9e0d-85d233e3ee35 took between 30 and 31 seconds to execute.

Use the Get Rule API to retrieve additional information about rules that take a long time to execute.

Rule cannot decrypt apiKey

edit

Problem:

The rule fails to execute and has an Unable to decrypt attribute "apiKey" error.

Solution:

This error happens when the xpack.encryptedSavedObjects.encryptionKey value used to create the rule does not match the value used during rule execution. Depending on the scenario, there are different ways to solve this problem:

If the value in xpack.encryptedSavedObjects.encryptionKey was manually changed, and the previous encryption key is still known.

Ensure any previous encryption key is included in the keys used for decryption only.

If another Kibana instance with a different encryption key connects to the cluster.

The other Kibana instance might be trying to run the rule using a different encryption key than what the rule was created with. Ensure the encryption keys among all the Kibana instances are the same, and setting decryption only keys for previously used encryption keys.

If other scenarios don’t apply.

Generate a new API key for the rule by disabling then enabling the rule.