Alerting production considerations
editAlerting production considerations
editAlerting runs both rule checks and actions as persistent background tasks managed by the Task Manager.
When relying on rules and actions as mission critical services, make sure you follow the production considerations for Task Manager.
Running background rule checks and actions
editKibana uses background tasks to run rules and actions, distributed across all Kibana instances in the cluster.
By default, each Kibana instance polls for work at three second intervals, and can run a maximum of ten concurrent tasks. These tasks are then run on the Kibana server.
Rules are recurring background tasks which are rescheduled according to the check interval on completion. Actions are non-recurring background tasks which are deleted on completion.
For more details on Task Manager, see Running background tasks.
Rule and action tasks can run late or at an inconsistent schedule. This is typically a symptom of the specific usage of the cluster in question.
You can address such issues by tweaking the Task Manager settings or scaling the deployment to better suit your use case.
For detailed guidance, see Alerting Troubleshooting.
Scaling guidance
editAs rules and actions leverage background tasks to perform the majority of work, scaling Alerting is possible by following the Task Manager Scaling Guidance.
When estimating the required task throughput, keep the following in mind:
- Each rule uses a single recurring task that is scheduled to run at the cadence defined by its check interval.
- Each action uses a single task. However, because actions are taken per instance, alerts can generate a large number of non-recurring tasks.
It is difficult to predict how much throughput is needed to ensure all rules and actions are executed at consistent schedules. By counting rules as recurring tasks and actions as non-recurring tasks, a rough throughput can be estimated as a tasks per minute measurement.
Predicting the buffer required to account for actions depends heavily on the rule types you use, the amount of alerts they might detect, and the number of actions you might choose to assign to action groups. With that in mind, regularly monitor the health of your Task Manager instances.
Event log index lifecycle management
editThis functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.
Alerts and actions log activity in a set of "event log" indices. These indices are configured with an index lifecycle management (ILM) policy, which you can customize. The default policy rolls over the index when it reaches 50GB, or after 30 days. Indices over 90 days old are deleted.
The name of the index policy is kibana-event-log-policy
. Kibana creates the index policy on startup, if it doesn’t already exist. The index policy can be customized for your environment, but Kibana never modifies the index policy after creating it.
Because Kibana uses the documents to display historic data, you should set the delete phase longer than you would like the historic data to be shown. For example, if you would like to see one month’s worth of historic data, you should set the delete phase to at least one month.
For more information on index lifecycle management, see: Index Lifecycle Policies.
Circuit breakers
editThere are several scenarios where running alerting rules and actions can start to negatively impact the overall health of a Kibana instance either by clogging up Task Manager throughput or by consuming so much CPU/memory that other operations cannot complete in a reasonable amount of time. There are several configurable circuit breakers to help minimize these effects.
Rules with very short intervals
editRunning large numbers of rules at very short intervals can quickly clog up Task Manager throughput, leading to higher schedule drift. Use xpack.alerting.rules.minimumScheduleInterval.value
to set a minimum schedule interval for rules. The default (and recommended) value for this configuration is 1m
. Use xpack.alerting.rules.minimumScheduleInterval.enforce
to specify whether to strictly enforce this minimum. While the default value for this setting is false
to maintain backwards compatibility with existing rules, set this to true
to prevent new and updated rules from running at an interval below the minimum.
Rules that run for a long time
editRules that run for a long time typically do so because they are issuing resource-intensive Elasticsearch queries or performing CPU-intensive processing. This can block the event loop, making Kibana inaccessible while the rule runs. By default, rule processing is cancelled after 5m
but this can be overriden using the xpack.alerting.rules.run.timeout
configuration. This value can also be configured per rule type using xpack.alerting.rules.run.ruleTypeOverrides
. For example, the following configuration sets the global timeout value to 1m
while allowing Index Threshold rules to run for 10m
before being cancelled.
xpack.alerting.rules.run: timeout: '1m' ruleTypeOverrides: - id: '.index-threshold' timeout: '10m'
When a rule run is cancelled, any alerts and actions that were generated during the run are discarded. This behavior is controlled by the xpack.alerting.cancelAlertsOnRuleTimeout
configuration, which defaults to true
. Set this to false
to receive alerts and actions after the timeout, although be aware that these may be incomplete and possibly inaccurate.
Rules that spawn too many actions
editRules that spawn too many actions can quickly clog up Task Manager throughput. This can occur if:
- A rule configured with a single action generates many alerts. For example, if a rule configured to run a single email action generates 100,000 alerts, then 100,000 actions will be scheduled during a run.
- A rule configured with multiple actions generates alerts. For example, if a rule configured to run an email action, a server log action and a webhook action generates 30,000 alerts, then 90,000 actions will be scheduled during a run.
Use xpack.alerting.rules.run.actions.max
to limit the maximum number of actions a rule can generate per run. This value can also be configured by connector type using xpack.alerting.rules.run.actions.connectorTypeOverrides
. For example, the following config sets the global maximum number of actions to 100 while allowing rules with Email actions to generate up to 200 actions.
xpack.alerting.rules.run: actions: max: 100 connectorTypeOverrides: - id: '.email' max: 200