Task Manager

Kibana Task Manager is used by features such as Alerting, Actions, and Reporting to run mission-critical work as persistent background tasks. These background tasks distribute work across multiple Kibana instances. This has three major benefits:

  • Persistence: All task state and scheduling is stored in Elasticsearch, so if you restart Kibana, tasks will pick up where they left off.
  • Scaling: Multiple Kibana instances can read from and update the same task queue in Elasticsearch, allowing the workload to be distributed across instances. If a Kibana instance no longer has capacity to run tasks, you can increase capacity by adding more Kibana instances.
  • Load Balancing: Task Manager is equipped with a reactive self-healing mechanism, which allows it to reduce the amount of work it executes in reaction to an increased load-related error rate in Elasticsearch. Additionally, when Task Manager experiences an increase in recurring tasks, it attempts to space out the work to better balance the load.

Task definitions for alerts and actions are stored in the index specified by xpack.task_manager.index. The default is .kibana_task_manager.

You must have at least one replica of this index for production deployments.

If you lose this index, all scheduled alerts and actions are lost.

Running background tasks

Kibana background tasks are managed as follows:

  • An Elasticsearch task index is polled for overdue tasks at 3-second intervals. You can change this interval using the xpack.task_manager.poll_interval setting.
  • Tasks are claimed by updating them in the Elasticsearch index, using optimistic concurrency control to prevent conflicts. Each Kibana instance can run a maximum of 10 concurrent tasks, so a maximum of 10 tasks are claimed each interval.
  • Tasks are run on the Kibana server.
  • Task Manager ensures that tasks:

    • Are only executed once
    • Are retried when they fail (if configured to do so)
    • Are rescheduled to run again at a future point in time (if configured to do so)

It is possible for tasks to run late or at an inconsistent schedule.

This is usually a symptom of the specific usage or scaling strategy of the cluster in question.

To address these issues, tweak the Kibana Task Manager settings or the cluster scaling strategy to better suit the unique use case.

For details on the settings that can influence the performance and throughput of Task Manager, see Task Manager Settings.

For detailed troubleshooting guidance, see Troubleshooting.

Deployment considerations

Elasticsearch and Kibana instances use the system clock to determine the current time. To ensure schedules are triggered when expected, synchronize the clocks of all nodes in the cluster using a time service such as Network Time Protocol.

Scaling guidance

How you deploy Kibana largely depends on your use case. Predicting the throughput a deployment might require to support Task Management is difficult because features can schedule an unpredictable number of tasks at a variety of scheduled cadences.

However, there is a relatively straightforward method you can follow to produce a rough estimate based on your expected usage.

Default scale

By default, Kibana polls for tasks at a rate of 10 tasks every 3 seconds. This means that you can expect a single Kibana instance to support up to 200 tasks per minute (200/tpm).

In practice, a Kibana instance will only achieve the upper bound of 200/tpm if the duration of task execution is below the polling rate of 3 seconds. For the most part, the duration of tasks is below that threshold, but it can vary greatly as Elasticsearch and Kibana usage grow and task complexity increases (such as alerts executing heavy queries across large datasets).
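The 200/tpm figure follows directly from the two defaults quoted above. The following is a minimal sketch of that arithmetic. The function and parameter names are illustrative and are not part of Kibana. It gives a theoretical upper bound only, and assumes every task finishes within one polling interval.

```python
def max_throughput_tpm(max_workers: int = 10, poll_interval_s: float = 3.0) -> float:
    """Theoretical upper bound on tasks per minute for one Kibana instance.

    Assumes every claimed task completes before the next poll, so each poll
    can claim a full batch of `max_workers` tasks.
    """
    polls_per_minute = 60 / poll_interval_s
    return max_workers * polls_per_minute


# Defaults described in this section: 10 tasks every 3 seconds.
print(max_throughput_tpm())  # 200.0
```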

By evaluating the workload, you can make a rough estimate as to the required throughput as a tasks per minute measurement.

For example, suppose your current workload reveals a required throughput of 440/tpm. You can address this scale by provisioning 3 Kibana instances, with an upper throughput of 600/tpm. This scale would provide approximately 25% additional capacity to handle ad-hoc non-recurring tasks and potential growth in recurring tasks.

It is highly recommended that you maintain at least 20% additional capacity, beyond your expected workload, as spikes in ad-hoc tasks are possible at times of high activity (such as a spike in actions in response to an active alert).
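The 440/tpm example can be checked with a short calculation. This is a rough sketch, not an official sizing tool. The helper names are hypothetical, and spare capacity is expressed here as a share of the total provisioned throughput, which is one reasonable reading of the figure quoted above.

```python
import math


def instances_needed(required_tpm: float,
                     per_instance_tpm: float = 200.0,
                     buffer: float = 0.0) -> int:
    """Smallest number of default Kibana instances covering the workload plus a buffer."""
    return math.ceil(required_tpm * (1 + buffer) / per_instance_tpm)


def spare_capacity(required_tpm: float,
                   instances: int,
                   per_instance_tpm: float = 200.0) -> float:
    """Fraction of the provisioned throughput left unused by the workload."""
    total = instances * per_instance_tpm
    return (total - required_tpm) / total


n = instances_needed(440, buffer=0.2)   # 3 instances, 600/tpm in total
print(n, f"{spare_capacity(440, n):.0%}")  # 3 27%
```

With 3 instances, roughly a quarter of the provisioned throughput remains free, which is above the recommended 20% buffer.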

For details on monitoring the health of Kibana Task Manager, follow the guidance in Health monitoring.

Scaling horizontally

At times, the sustainable approach might be to expand the throughput of your cluster by provisioning additional Kibana instances. By default, each additional Kibana instance adds 10 tasks that your cluster can run concurrently, but you can also scale each Kibana instance vertically, if your diagnosis indicates that they can handle the additional workload.

Scaling vertically

Other times, it might be preferable to increase the throughput of individual Kibana instances.

Tweak the Max Workers via the xpack.task_manager.max_workers setting, which allows each Kibana instance to pull a higher number of tasks per interval. This could impact the performance of each Kibana instance as the workload will be higher.

Tweak the Poll Interval via the xpack.task_manager.poll_interval setting, which allows each Kibana instance to pull scheduled tasks at a higher rate. This could impact the performance of the Elasticsearch cluster as the workload will be higher.
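To get a feel for how far either setting would need to move, you can invert the tasks per minute arithmetic. The sketch below is illustrative only. The function names are hypothetical, the 440/tpm target is an arbitrary example, and the result says nothing about whether a Kibana instance or the Elasticsearch cluster can actually sustain the resulting load.

```python
import math


def workers_for_target(target_tpm: float, poll_interval_s: float = 3.0) -> int:
    """Max Workers value needed to reach `target_tpm` at a fixed poll interval."""
    return math.ceil(target_tpm * poll_interval_s / 60)


def poll_interval_for_target(target_tpm: float, max_workers: int = 10) -> float:
    """Poll interval (in seconds) needed to reach `target_tpm` at a fixed worker count."""
    return 60 * max_workers / target_tpm


# Reaching 440/tpm on a single instance, changing one setting at a time:
print(workers_for_target(440))                   # 22
print(round(poll_interval_for_target(440), 2))   # 1.36
```

Treat these numbers as a starting point for incremental changes, and monitor the impact of each change before going further.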

Choosing a scaling strategy

Each scaling strategy comes with its own considerations, and the appropriate strategy largely depends on your use case.

Scaling Kibana instances vertically causes higher resource usage in each Kibana instance, as it will perform more concurrent work. Scaling Kibana instances horizontally requires a higher degree of coordination, which can impact overall performance.

A recommended strategy is to follow these steps:

  1. Produce a rough throughput estimate as a guide to provisioning as many Kibana instances as needed. Include any growth in tasks that you predict experiencing in the near future, and a buffer to better address ad-hoc tasks.
  2. After provisioning a deployment, assess whether the provisioned Kibana instances achieve the required throughput by evaluating the Health monitoring as described in Insufficient throughput to handle the scheduled workload.
  3. If the throughput is insufficient, and Kibana instances exhibit low resource usage, incrementally scale vertically while monitoring the impact of these changes.
  4. If the throughput is insufficient, and Kibana instances are exhibiting high resource usage, incrementally scale horizontally by provisioning new Kibana instances and reassess.
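Steps 3 and 4 above amount to a small decision rule. The following sketch expresses it as a function. The name and the boolean inputs are hypothetical, and deciding what counts as "sufficient throughput" or "high resource usage" is left to your own assessment of the health monitoring output.

```python
def next_scaling_step(throughput_sufficient: bool, high_resource_usage: bool) -> str:
    """Suggest the next action after assessing a provisioned deployment."""
    if throughput_sufficient:
        return "no change: keep monitoring"
    if high_resource_usage:
        # Instances are already busy, so add more of them.
        return "scale horizontally: provision a new Kibana instance and reassess"
    # Instances have headroom, so let each one do more work.
    return "scale vertically: adjust max_workers or poll_interval incrementally"


print(next_scaling_step(throughput_sufficient=False, high_resource_usage=False))
print(next_scaling_step(throughput_sufficient=False, high_resource_usage=True))
```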

Task Manager, like the rest of the Elastic Stack, is designed to scale horizontally. Take advantage of this ability to ensure mission-critical services, such as Alerting and Reporting, always have the capacity they need.

Scaling horizontally requires a higher degree of coordination between Kibana instances. One way Task Manager coordinates with other instances is by delaying its polling schedule to avoid conflicts with other instances. By using health monitoring to evaluate the date of the last_polling_delay across a deployment, you can estimate the frequency at which Task Manager resets its delay mechanism. A higher frequency suggests Kibana instances conflict at a high rate, which you can address by scaling vertically rather than horizontally, reducing the required coordination.

Rough throughput estimation

Predicting the required throughput a deployment might need to support Task Management is difficult, as features can schedule an unpredictable number of tasks at a variety of scheduled cadences. However, a rough lower bound can be estimated, which is then used as a guide.

Throughput is best thought of as a measurement in tasks per minute.

A default Kibana instance can support up to 200/tpm.

Given a deployment of 100 recurring tasks, estimating the required throughput depends on the scheduled cadence. Suppose you expect to run 50 tasks at a cadence of 10s, the other 50 tasks at 20m. In addition, you expect a couple dozen non-recurring tasks every minute.

A non-recurring task requires a single execution, which means that a single Kibana instance could execute all 100 tasks in less than a minute, using only half of its capacity. As these tasks are only executed once, the Kibana instance will sit idle once all tasks are executed. For that reason, don’t include non-recurring tasks in your tasks per minute calculation. Instead, include a buffer in the final lower bound to absorb the cost of ad-hoc non-recurring tasks.

A recurring task requires as many executions as its cadence can fit in a minute. A recurring task with a 10s schedule will require 6/tpm, as it will execute 6 times per minute. A recurring task with a 20m schedule only executes 3 times per hour and only requires a throughput of 0.05/tpm, a number so small that it is difficult to take into account.

For this reason, we recommend grouping tasks by tasks per minute and tasks per hour, as demonstrated in Evaluate your workload, averaging the per hour measurement across all minutes.
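The per-task figures above come from dividing the measurement window by the task's cadence. A minimal sketch, using hypothetical helper names:

```python
def executions_per_minute(cadence_seconds: float) -> float:
    """How many times a recurring task runs in one minute."""
    return 60 / cadence_seconds


def executions_per_hour(cadence_seconds: float) -> float:
    """How many times a recurring task runs in one hour."""
    return 3600 / cadence_seconds


print(executions_per_minute(10))       # 6.0  (10s cadence)
print(executions_per_hour(20 * 60))    # 3.0  (20m cadence)
print(executions_per_minute(20 * 60))  # 0.05 (20m cadence, per minute)
```

Expressing slow tasks per hour keeps the numbers readable. You can convert back to tasks per minute by dividing the hourly total by 60.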

Given the predicted workload, you can estimate a lower bound throughput of 340/tpm (6/tpm * 50 + 3/tph * 50 + 20% buffer). As a default, a Kibana instance provides a throughput of 200/tpm. A good starting point for your deployment is to provision 2 Kibana instances. You could then monitor their performance and reassess as the required throughput becomes clearer.

Although this is a rough estimate, the tasks per minute provides the lower bound needed to execute tasks on time. Once you calculate the rough tasks per minute estimate, add a 20% buffer for non-recurring tasks. How much of a buffer is required largely depends on your use case, so evaluate your workload as it grows to ensure enough of a buffer is provisioned.
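The estimate for the example workload can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the names are illustrative, and the 20% buffer is applied as a multiplier on the recurring throughput. Applied that way, the result comes out at roughly 363/tpm, slightly above the 340/tpm quoted earlier. The exact figure depends on how the buffer is applied, and either value leads to the same starting point of 2 instances.

```python
import math

PER_INSTANCE_TPM = 200  # default throughput of one Kibana instance

# Recurring workload as (number of tasks, cadence in seconds).
workload = [
    (50, 10),       # 50 tasks every 10 seconds
    (50, 20 * 60),  # 50 tasks every 20 minutes
]


def recurring_tpm(groups) -> float:
    """Total tasks per minute required by the recurring tasks."""
    return sum(count * 60 / cadence for count, cadence in groups)


def estimate_instances(groups, buffer: float = 0.2) -> int:
    """Lower-bound instance count, with a buffer for non-recurring tasks."""
    return math.ceil(recurring_tpm(groups) * (1 + buffer) / PER_INSTANCE_TPM)


print(recurring_tpm(workload))       # 302.5
print(estimate_instances(workload))  # 2
```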