Production considerations

Kibana alerting runs both alert checks and actions as persistent background tasks managed by the Kibana Task Manager. This has two major benefits:

  • Persistence: all task state and scheduling information are stored in Elasticsearch, so if you restart Kibana, alerts and actions pick up where they left off. Task definitions for alerts and actions are stored in the index specified by xpack.task_manager.index. The default is .kibana_task_manager. You must have at least one replica of this index for production deployments. If you lose this index, all scheduled alerts and actions are lost.
  • Scaling: multiple Kibana instances can read from and update the same task queue in Elasticsearch, allowing the alerting and action load to be distributed across instances. If a Kibana instance no longer has capacity to run alert checks or actions, you can increase capacity by adding Kibana instances.

Running background alert checks and actions

Kibana background tasks are managed as follows:

  • Kibana polls an Elasticsearch task index for overdue tasks at 3-second intervals. You can change this interval using the xpack.task_manager.poll_interval setting.
  • Kibana claims overdue tasks by updating them in the Elasticsearch index, using optimistic concurrency control to prevent conflicts. Each Kibana instance can run a maximum of 10 concurrent tasks, so a maximum of 10 tasks are claimed each interval.
  • Claimed tasks are run on the Kibana server.
  • For alerts, which are recurring background checks, the task is scheduled again upon completion according to the check interval.
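
The poll, claim, run, and reschedule cycle described above can be illustrated with a small in-memory simulation. This is a minimal sketch, not Kibana's actual implementation: TaskDoc, TaskIndex, poll_once, and the version counter are hypothetical names standing in for task documents and the versioned updates that optimistic concurrency control relies on. It shows why two instances that read the same overdue tasks cannot both claim them.

```python
from dataclasses import dataclass
from typing import Optional

MAX_WORKERS = 10  # concurrent tasks per instance, as stated in the text


@dataclass
class TaskDoc:
    """Hypothetical stand-in for one task document in the task index."""
    task_id: str
    run_at: float                 # time (seconds) at which the task is due
    version: int = 0              # bumped on every successful update
    owner: Optional[str] = None   # instance that currently holds the task


class TaskIndex:
    """In-memory stand-in for the Elasticsearch task index (illustrative only)."""

    def __init__(self, tasks):
        self._docs = {t.task_id: t for t in tasks}

    def overdue(self, now, limit):
        """Return (task_id, version) pairs for unclaimed tasks that are due."""
        due = [d for d in self._docs.values()
               if d.owner is None and d.run_at <= now]
        due.sort(key=lambda d: d.run_at)
        return [(d.task_id, d.version) for d in due[:limit]]

    def claim(self, task_id, seen_version, owner):
        """Compare-and-set: succeed only if the version is unchanged since read."""
        doc = self._docs[task_id]
        if doc.version != seen_version:
            return False  # another instance updated the task first
        doc.owner = owner
        doc.version += 1
        return True

    def reschedule(self, task_id, next_run_at):
        """Release a finished recurring task and set its next due time."""
        doc = self._docs[task_id]
        doc.owner = None
        doc.run_at = next_run_at
        doc.version += 1


def poll_once(index, instance, now, free_slots=MAX_WORKERS):
    """One polling cycle: read overdue tasks, then try to claim each one."""
    candidates = index.overdue(now, free_slots)
    return [tid for tid, ver in candidates if index.claim(tid, ver, instance)]


# Twelve overdue alert tasks, two instances that read the same snapshot.
index = TaskIndex([TaskDoc(f"alert-{i}", run_at=0.0) for i in range(12)])
seen_by_a = index.overdue(now=0.0, limit=MAX_WORKERS)
seen_by_b = index.overdue(now=0.0, limit=MAX_WORKERS)

claimed_a = [tid for tid, ver in seen_by_a if index.claim(tid, ver, "kibana-a")]
claimed_b = [tid for tid, ver in seen_by_b if index.claim(tid, ver, "kibana-b")]

# On its next poll, the second instance picks up the two leftover tasks.
remaining_b = poll_once(index, "kibana-b", now=3.0)

# A finished recurring check is released and becomes due again later.
index.reschedule("alert-0", next_run_at=60.0)
```

In this sketch the second instance's claims all fail because the versions it read are stale, so each task ends up with exactly one owner. The real index uses Elasticsearch's own concurrency checks rather than a Python counter.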

Because tasks are polled at 3-second intervals by default and only 10 tasks can run concurrently per Kibana instance, alert and action tasks can run late. This can happen if:

  • Alerts use a small check interval. The lowest interval possible is 3 seconds, though intervals of 30 seconds or higher are recommended.
  • Many alerts or actions must be run at once. In this case, pending tasks queue in Elasticsearch and are pulled 10 at a time from the queue at 3-second intervals.
  • Long-running tasks occupy slots for an extended time, leaving fewer slots for other tasks.
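
To get a rough feel for how late a task can run when many are due at once, the queueing arithmetic above can be written out. This is a back-of-the-envelope sketch under stated assumptions: the function names are hypothetical, every task is assumed to finish within one poll interval (so all slots are free at each poll), and load is assumed to spread evenly across instances. Long-running tasks would make the real delay larger.

```python
import math


def polls_to_drain(pending, max_workers=10, instances=1):
    """Number of polling cycles needed to claim `pending` overdue tasks."""
    capacity_per_poll = max_workers * instances
    return math.ceil(pending / capacity_per_poll)


def worst_case_claim_delay(pending, poll_interval_s=3.0,
                           max_workers=10, instances=1):
    """Seconds between the first poll and the poll that claims the last task.

    Assumes all slots are free at every poll and that instances share the
    queue evenly; both are simplifications for illustration.
    """
    if pending <= 0:
        return 0.0
    polls = polls_to_drain(pending, max_workers, instances)
    return (polls - 1) * poll_interval_s


# 100 tasks due at once on one instance: 10 polls, last batch 27 s after the first.
delay_one_instance = worst_case_claim_delay(100)

# The same backlog shared by two instances: 5 polls, last batch 12 s after the first.
delay_two_instances = worst_case_claim_delay(100, instances=2)
```

Under these assumptions, adding a second instance roughly halves the time the last task waits, which is the scaling effect described earlier in this section.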

For details on the settings that can influence the performance and throughput of Task Manager, see Task Manager Settings.