Alerting production considerations
Alerting runs both alert checks and actions as persistent background tasks managed by the Task Manager.
When relying on alerts and actions as mission-critical services, make sure you follow the production considerations for Task Manager.
Running background alert checks and actions
Kibana uses background tasks to run alerts and actions, distributed across all Kibana instances in the cluster.
By default, each Kibana instance polls for work at three-second intervals and can run a maximum of ten concurrent tasks. These tasks are then run on the Kibana server.
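To make those defaults concrete, here is a minimal sketch of the upper bound they imply. The function and parameter names are hypothetical, and the calculation assumes every task finishes within a single poll interval, which real workloads may not do, so treat the result as a ceiling rather than an expected rate.

```python
def max_tasks_per_minute(
    poll_interval_s: float = 3.0,    # default polling interval described above
    max_concurrent_tasks: int = 10,  # default concurrency described above
    instances: int = 1,              # number of Kibana instances (assumed input)
) -> float:
    """Theoretical ceiling on tasks started per minute.

    Assumption: each task completes before the next poll, so every
    poll can claim a full batch of tasks.
    """
    if poll_interval_s <= 0 or max_concurrent_tasks < 0 or instances < 0:
        raise ValueError("inputs must be positive")
    polls_per_minute = 60.0 / poll_interval_s
    return polls_per_minute * max_concurrent_tasks * instances


print(max_tasks_per_minute())             # 200.0
print(max_tasks_per_minute(instances=3))  # 600.0
```

Long-running tasks occupy a slot across several polls, so the throughput actually achieved is usually lower than this figure.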
Alerts are recurring background tasks that are rescheduled on completion according to their check interval. Actions are non-recurring background tasks that are deleted on completion.
For more details on Task Manager, see Running background tasks.
Alert and action tasks can run late or on an inconsistent schedule. This is typically a symptom of how the cluster in question is used.
You can address such issues by adjusting the Task Manager settings or scaling the deployment to better suit your use case.
For detailed guidance, see Alerting Troubleshooting.
Scaling Guidance
Because alerts and actions use background tasks to perform most of their work, you can scale Alerting by following the Task Manager Scaling Guidance.
When estimating the required task throughput, keep the following in mind:
- Each alert uses a single recurring task that is scheduled to run at the cadence defined by its check interval.
- Each action uses a single task. However, because actions are taken per alert instance, a single alert can generate a large number of non-recurring tasks.
It is difficult to predict how much throughput is needed to ensure all alerts and actions are executed at consistent schedules. By counting alerts as recurring tasks and actions as non-recurring tasks, you can estimate a rough throughput in tasks per minute.
Predicting the buffer required to account for actions depends heavily on the alert types you use, the number of alert instances they might detect, and the number of actions you might choose to assign to action groups. With that in mind, regularly monitor the health of your Task Manager instances.
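The rough estimate described above can be sketched as follows. This is an illustration under stated assumptions, not an official sizing formula: `AlertGroup`, `estimate_tasks_per_minute`, and especially `actions_per_run` are hypothetical names, and the average number of action tasks per check is a guess you would need to replace with figures observed in your own deployment.

```python
from dataclasses import dataclass


@dataclass
class AlertGroup:
    count: int                    # alerts sharing this configuration
    check_interval_s: float       # check interval, in seconds
    actions_per_run: float = 0.0  # ASSUMED average action tasks per check


def estimate_tasks_per_minute(groups):
    """Return (recurring, non_recurring) tasks per minute.

    Each alert counts as one recurring task per check interval;
    each action counts as one non-recurring task.
    """
    recurring = 0.0
    non_recurring = 0.0
    for g in groups:
        runs_per_minute = g.count * 60.0 / g.check_interval_s
        recurring += runs_per_minute
        non_recurring += runs_per_minute * g.actions_per_run
    return recurring, non_recurring


groups = [
    AlertGroup(count=100, check_interval_s=60, actions_per_run=0.5),
    AlertGroup(count=20, check_interval_s=10, actions_per_run=2),
]
recurring, non_recurring = estimate_tasks_per_minute(groups)
print(recurring, non_recurring, recurring + non_recurring)  # 220.0 290.0 510.0
```

In this made-up example the action buffer is larger than the recurring load itself, which is why the non-recurring term is kept separate: it is the part of the estimate most sensitive to how many alert instances are detected.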