Task Manager troubleshooting

edit

Task Manager is used by a wide range of services in Kibana, such as Alerting, Reporting, and Telemetry. Unexpected behavior in these services might be a downstream issue originating in Task Manager.

This page describes how to resolve common problems you might encounter with Task Manager. If your problem isn’t described here, please review open issues in the following GitHub repositories:

Have a question? Contact us in the discuss forum.

Tasks with small schedule intervals run late

edit

Problem:

Tasks are scheduled to run every 2 seconds, but seem to be running late.

Solution:

Task Manager polls for tasks at the cadence specified by the xpack.task_manager.poll_interval setting, which is 3 seconds by default. This means that a task could run late if it uses a schedule that is smaller than this setting.

You can adjust the xpack.task_manager.poll_interval setting. However, this will add additional load to both Kibana and Elasticsearch instances in the cluster, as they will perform more queries.

Tasks run late

edit

Problem:

The most common symptom of an underlying problem in Task Manager is that tasks appear to run late. For instance, recurring tasks might run at an inconsistent cadence, or long after their scheduled time.

Solution:

By default, Kibana polls for tasks at a rate of 10 tasks every 3 seconds.

If many tasks are scheduled to run at the same time, pending tasks will queue in Elasticsearch. Each Kibana instance then polls for pending tasks at a rate of up to 10 tasks at a time, at 3 second intervals. It is possible for pending tasks in the queue to exceed this capacity and run late as a result.

This type of delay is known as drift.The root cause for drift depends on the specific usage, and there are no hard and fast rules for addressing drift.

For example:

  • If drift is caused by an excess of concurrent tasks relative to the available capacity of Kibana instances in the cluster, expand the throughput of the cluster.
  • If drift is caused by long running tasks that overrun their scheduled cadence, reconfigure the tasks in question.

Refer to Diagnose a root cause for drift for step-by-step instructions on identifying the correct resolution.

Drift is often addressed by adjusting the scaling the deployment to better suit your usage. For details on scaling Task Manager, see Scaling guidance.

Diagnose a root cause for drift

edit

The following guide helps you identify a root cause for drift by making sense of the output from the Health monitoring endpoint.

By analyzing the different sections of the output, you can evaluate different theories that explain the drift in a deployment.

Retrieve the latest monitored health stats of a Kibana instance Task Manager:

$ curl -X GET api/task_manager/_health

The API returns the following:

{
  "id": "15415ecf-cdb0-4fef-950a-f824bd277fe4",
  "timestamp": "2021-02-16T11:38:10.077Z",
  "status": "OK",
  "last_update": "2021-02-16T11:38:09.934Z",
  "stats": {
    "configuration": {
      "timestamp": "2021-02-16T11:29:05.055Z",
      "value": {
        "request_capacity": 1000,
        "max_poll_inactivity_cycles": 10,
        "monitored_aggregated_stats_refresh_rate": 60000,
        "monitored_stats_running_average_window": 50,
        "monitored_task_execution_thresholds": {
          "default": {
            "error_threshold": 90,
            "warn_threshold": 80
          },
          "custom": {}
        },
        "poll_interval": 3000,
        "max_workers": 10
      },
      "status": "OK"
    },
    "runtime": {
      "timestamp": "2021-02-16T11:38:09.934Z",
      "value": {
        "polling": {
          "last_successful_poll": "2021-02-16T11:38:09.934Z",
          "last_polling_delay": "2021-02-16T11:29:05.053Z",
          "duration": {
            "p50": 13,
            "p90": 128,
            "p95": 143,
            "p99": 168
          },
          "claim_conflicts": {
            "p50": 0,
            "p90": 0,
            "p95": 0,
            "p99": 0
          },
          "claim_mismatches": {
            "p50": 0,
            "p90": 0,
            "p95": 0,
            "p99": 0
          },
          "result_frequency_percent_as_number": {
            "Failed": 0,
            "NoAvailableWorkers": 0,
            "NoTasksClaimed": 80,
            "RanOutOfCapacity": 0,
            "RunningAtCapacity": 0,
            "PoolFilled": 20
          }
        },
        "drift": {
          "p50": 99,
          "p90": 1245,
          "p95": 1845,
          "p99": 2878
        },
        "load": {
          "p50": 0,
          "p90": 0,
          "p95": 10,
          "p99": 20
        },
        "execution": {
          "duration": {
            "alerting:.index-threshold": {
              "p50": 95,
              "p90": 1725,
              "p95": 2761,
              "p99": 2761
            },
            "alerting:xpack.uptime.alerts.monitorStatus": {
              "p50": 149,
              "p90": 1071,
              "p95": 1171,
              "p99": 1171
            },
            "actions:.index": {
              "p50": 166,
              "p90": 166,
              "p95": 166,
              "p99": 166
            }
          },
          "result_frequency_percent_as_number": {
            "alerting:.index-threshold": {
              "Success": 100,
              "RetryScheduled": 0,
              "Failed": 0,
              "status": "OK"
            },
            "alerting:xpack.uptime.alerts.monitorStatus": {
              "Success": 100,
              "RetryScheduled": 0,
              "Failed": 0,
              "status": "OK"
            },
            "actions:.index": {
              "Success": 10,
              "RetryScheduled": 0,
              "Failed": 90,
              "status": "error"
            }
          }
        }
      },
      "status": "OK"
    },
    "workload": {
      "timestamp": "2021-02-16T11:38:05.826Z",
      "value": {
        "count": 26,
        "task_types": {
          "alerting:.index-threshold": {
            "count": 2,
            "status": {
              "idle": 2
            }
          },
          "actions:.index": {
            "count": 14,
            "status": {
              "idle": 2,
              "running": 2,
              "failed": 10
            }
          },
          "alerting:xpack.uptime.alerts.monitorStatus": {
            "count": 10,
            "status": {
              "idle": 10
            }
          },
        },
        "schedule": [
          ["10s", 2],
          ["1m", 2],
          ["60s", 2],
          ["5m", 2],
          ["60m", 4]
        ],
        "overdue": 0,
        "estimated_schedule_density": [0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 3, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0]
      },
      "status": "OK"
    }
  }
}

Evaluate the Configuration

edit

Theory: Kibana is configured to poll for tasks at a reduced rate.

Diagnosis: Evaluating the health stats, you can see the following output under stats.configuration.value:

{
  "request_capacity": 1000,
  "max_poll_inactivity_cycles": 10,
  "monitored_aggregated_stats_refresh_rate": 60000,
  "monitored_stats_running_average_window": 50,
  "monitored_task_execution_thresholds": {
    "default": {
      "error_threshold": 90,
      "warn_threshold": 80
    },
    "custom": {}
  },
  "poll_interval": 3000, 
  "max_workers": 10 
}

poll_interval is set to the default value of 3000 milliseconds

max_workers is set to the default value of 10 workers

You can infer from this output that the Kibana instance polls for work every 3 seconds and can run 10 concurrent tasks.

Now suppose the output under stats.configuration.value is the following:

{
  "request_capacity": 1000,
  "max_poll_inactivity_cycles": 10,
  "monitored_aggregated_stats_refresh_rate": 60000,
  "monitored_stats_running_average_window": 50,
  "monitored_task_execution_thresholds": {
    "default": {
      "error_threshold": 90,
      "warn_threshold": 80
    },
    "custom": {}
  },
  "poll_interval": 60000, 
  "max_workers": 1 
}

poll_interval is set to 60000 milliseconds, far higher than the default

max_workers is set to 1 worker, far lower than the default

You can infer from this output that the Kibana instance only polls for work once a minute and only picks up one task at a time. This throughput is unlikely to support mission critical services, such as Alerting or Reporting, and tasks will usually run late.

There are two possible reasons for such a configuration:

  • These settings have been configured manually, which can be resolved by reconfiguring these settings. For details, see Task Manager Settings.
  • Kibana has reduced its own throughput in reaction to excessive load on the Elasticsearch cluster.

    Task Manager is equipped with a reactive self-healing mechanism in response to an increase in load related errors in Elasticsearch. This mechanism will increase the poll_interval setting (reducing the rate at which it queries Elasticsearch), and decrease the max_workers (reducing the amount of operations it executes against Elasticsearch). Once the error rate reduces, these settings are incrementally dialed up again, returning them to the configured settings.

    This scenario can be identified by searching the Kibana Server Log for messages such as:

    Max workers configuration is temporarily reduced after Elasticsearch returned 25 "too many request" error(s).

    Deeper investigation into the high error rate experienced by the Elasticsearch cluster is required.

Evaluate the Runtime

edit

Theory: Kibana is not polling as frequently as it should

Diagnosis: Evaluating the health stats, you see the following output under stats.runtime.value.polling:

{
  "last_successful_poll": "2021-02-16T11:38:09.934Z", 
  "last_polling_delay": "2021-02-14T11:29:05.053Z",
  "duration": { 
    "p50": 13,
    "p90": 128,
    "p95": 143,
    "p99": 168
  },
  "claim_conflicts": { 
    "p50": 0,
    "p90": 0,
    "p95": 0,
    "p99": 2
  },
  "claim_mismatches": {
    "p50": 0,
    "p90": 0,
    "p95": 0,
    "p99": 0
  },
  "result_frequency_percent_as_number": { 
    "Failed": 0,
    "NoAvailableWorkers": 0,
    "NoTasksClaimed": 80,
    "RanOutOfCapacity": 0,
    "RunningAtCapacity": 0,
    "PoolFilled": 20
  }
}

Ensure the last successful polling cycle was completed no more than a couple of multiples of poll_interval in the past.

Ensure the duration of polling cycles is usually below 100ms. Longer durations are possible, but unexpected.

Ensure Kibana instances in the cluster are not encountering a high rate of version conflicts.

Ensure the majority of polling cycles result in positive outcomes, such as RunningAtCapacity or PoolFilled.

You can infer from this output that the Kibana instance is polling regularly. This assessment is based on the following:

  • Comparing the last_successful_poll to the timestamp (value of 2021-02-16T11:38:10.077Z) at the root, where you can see the last polling cycle took place 1 second before the monitoring stats were exposed by the health monitoring API.
  • Comparing the last_polling_delay to the timestamp (value of 2021-02-16T11:38:10.077Z) at the root, where you can see the last polling cycle delay took place 2 days ago, suggesting Kibana instances are not conflicting often.
  • The p50 of the duration shows that at least 50% of polling cycles take, at most, 13 millisconds to complete.
  • Evaluating the result_frequency_percent_as_number:

    • 80% of the polling cycles completed without claiming any tasks (suggesting that there aren’t any overdue tasks).
    • 20% completed with Task Manager claiming tasks that were then executed.
    • None of the polling cycles ended up occupying all of the available workers, as RunningAtCapacity has a frequency of 0%, suggesting there is enough capacity in Task Manager to handle the workload.

All of these stats are tracked as a running average, which means that they give a snapshot of a period of time (by default Kibana tracks up to 50 cycles), rather than giving a complete history.

Suppose the output under stats.runtime.value.polling.result_frequency_percent_as_number was the following:

{
  "Failed": 30, 
  "NoAvailableWorkers": 20, 
  "NoTasksClaimed": 10,
  "RanOutOfCapacity": 10, 
  "RunningAtCapacity": 10, 
  "PoolFilled": 20
}

30% of polling cycles failed, which is a high rate.

20% of polling cycles are skipped as Task Manager has no capacity left to run tasks.

10% of polling cycles result in Task Manager claiming more tasks than it has capacity to run.

10% of polling cycles result in Task Manager claiming precisely as many tasks as it has capacity to run.

You can infer from this output that Task Manager is not healthy, as the failure rate is high, and Task Manager is fetching tasks it has no capacity to run. Analyzing the Kibana Server Log should reveal the underlying issue causing the high error rate and capacity issues.

The high NoAvailableWorkers rate of 20% suggests that there are many tasks running for durations longer than the poll_interval. For details on analyzing long task execution durations, see the long running tasks theory.

Theory: Kibana is polling as frequently as it should, but that isn’t often enough to keep up with the workload

Diagnosis: Evaluating the health stats, you can see the following output of drift and load under stats.runtime.value:

{
  "drift": { 
    "p50": 99,
    "p90": 1245,
    "p95": 1845,
    "p99": 2878
  },
  "load": { 
    "p50": 0,
    "p90": 0,
    "p95": 10,
    "p99": 20
  },
}

drift shows us that at least 95% of tasks are running within 2 seconds of their scheduled time.

load shows us that Task Manager is idle at least 90% of the time, and never uses more than 20% of its available workers.

You can infer from these stats that this Kibana has plenty of capacity, and any delays you might be experiencing are unlikely to be addressed by expanding the throughput.

Suppose the output of drift and load was the following:

{
  "drift": { 
    "p50": 2999,
    "p90": 3845,
    "p95": 3845.75,
    "p99": 4078
  },
  "load": { 
    "p50": 80,
    "p90": 100,
    "p95": 100,
    "p99": 100
  }
}

drift shows us that all tasks are running 3 to 4 seconds after their scheduled time.

load shows us that at least half of the time Task Manager is running at a load of 80%.

You can infer from these stats that this Kibana is using most of its capacity, but seems to keep up with the work most of the time. This assessment is based on the following:

  • The p90 of load is at 100%, and p50 is also quite high at 80%. This means that there is little to no room for maneuvering, and a spike of work might cause Task Manager to exceed its capacity.
  • Tasks run soon after their scheduled time, which is to be expected. A poll_interval of 3000 milliseconds would often experience a consistent drift of somewhere between 0 and 3000 milliseconds. A p50 drift of 2999 suggests that there is room for improvement, and you could benefit from a higher throughput.

For details on achieving higher throughput by adjusting your scaling strategy, see Scaling guidance.

Theory: Tasks run for too long, overrunning their schedule

Diagnosis: The Insufficient throughtput to handle the scheduled workload theory analyzed a hypothetical scenario where both drift and load were unusually high.

Suppose an alternate scenario, where drift is high, but load is not, such as the following:

{
    "drift": { 
        "p50": 9799,
        "p90": 83845,
        "p95": 90328,
        "p99": 123845
    },
    "load": { 
        "p50": 40,
        "p90": 75,
        "p95": 80,
        "p99": 100
    }
}

drift shows that most (if not all) tasks are running at least 32 seconds too late.

load shows that, for the most part, you have capacity to run more concurrent tasks.

In the preceding scenario, the tasks are running far too late, but you have sufficient capacity to run more concurrent tasks. A high capacity allows Kibana to run multiple different tasks concurrently. If a task is already running when its next schedule run is due, Kibana will avoid running it a second time, and instead wait for the first execution to complete.

If a task takes longer to execute than the cadence of its schedule, then that task will always overrun and experience a high drift. For example, suppose a task is scheduled to execute every 3 seconds, but takes 6 seconds to complete. It will consistently suffer from a drift of, at least, 3 seconds.

Evaluating the health stats in this hypothetical scenario, you see the following output under stats.runtime.value.execution.duration:

{
  "alerting:.index-threshold": { 
    "p50": 95,
    "p90": 1725,
    "p95": 2761,
    "p99": 2761
  },
  "alerting:.es-query": { 
    "p50": 7149,
    "p90": 40071,
    "p95": 45282,
    "p99": 121845
  },
  "actions:.index": {
    "p50": 166,
    "p90": 166,
    "p95": 166,
    "p99": 166
  }
}

50% of the tasks backing index threshold alerts complete in less than 100 milliseconds.

50% of the tasks backing Elasticsearch query alerts complete in 7 seconds, but at least 10% take longer than 40 seconds.

You can infer from these stats that the high drift the Task Manager is experiencing is most likely due to Elasticsearch query alerts that are running for a long time.

Resolving this issue is context dependent and changes from case to case. In the preceding example above, this would be resolved by modifying the queries in these alerts to make them faster, or improving the Elasticsearch throughput to speed up the exiting query.

Theory: Tasks take multiple attempts to succeed

Diagnosis: A high error rate could cause a task to appear to run late, when in fact it runs on time, but experiences a high failure rate.

Evaluating the preceding health stats, you see the following output under stats.runtime.value.execution.result_frequency_percent_as_number:

{
  "alerting:.index-threshold": { 
    "Success": 100,
    "RetryScheduled": 0,
    "Failed": 0,
    "status": "OK"
  },
  "alerting:xpack.uptime.alerts.monitorStatus": {
    "Success": 100,
    "RetryScheduled": 0,
    "Failed": 0,
    "status": "OK"
  },
  "actions:.index": { 
    "Success": 8,
    "RetryScheduled": 0,
    "Failed": 92,
    "status": "error" 
  }
}

100% of the tasks backing index threshold alerts successfully complete.

92% of the tasks backing ES index actions fail to complete.

The tasks backing ES index actions have exceeded the default monitored_task_execution_thresholds error configuration.

You can infer from these stats that most actions:.index tasks, which back the ES Index Kibana action, fail. Resolving that would require deeper investigation into the Kibana Server Log, where the exact errors are logged, and addressing these specific errors.

Evaluate the Workload

edit

Predicting the required throughput a deplyment might need to support Task Manager is difficult, as features can schedule an unpredictable number of tasks at a variety of scheduled cadences.

Health monitoring provides statistics that make it easier to monitor the adequacy of the existing throughput. By evaluating the workload, the required throughput can be estimated, which is used when following the Task Manager Scaling guidance.

Evaluating the preceding health stats above, you see the following output under stats.workload.value:

{
  "count": 26, 
  "task_types": {
    "alerting:.index-threshold": {
      "count": 2, 
      "status": {
        "idle": 2
      }
    },
    "actions:.index": {
      "count": 14,
      "status": {
        "idle": 2,
        "running": 2,
        "failed": 10 
      }
    },
    "alerting:xpack.uptime.alerts.monitorStatus": {
      "count": 10,
      "status": {
        "idle": 10
      }
    },
  },
  "schedule": [ 
    ["10s", 2],
    ["1m", 2],
    ["90s", 2],
    ["5m", 8]
  ],
  "overdue": 0, 
  "estimated_schedule_density": [  
    0, 1, 0, 0, 0, 1, 0, 1, 0, 1,
    0, 0, 0, 1, 0, 0, 1, 1, 1, 0,
    0, 3, 0, 0, 0, 1, 0, 1, 0, 1,
    0, 0, 0, 1, 0, 0, 1, 1, 1, 0
  ]
}

There are 26 tasks in the system, including regular tasks, recurring tasks, and failed tasks.

There are 2 idle index threshold alert tasks, meaning they are scheduled to run at some point in the future.

Of the 14 tasks backing the ES index action, 10 have failed and 2 are running.

A histogram of all scheduled recurring tasks shows that 2 tasks are scheduled to run every 10 seconds, 2 tasks are scheduled to run once a minute, and so on.

There are no tasks overdue, which means that all tasks that should have run by now have run.

This histogram shows the tasks scheduled to run throughout the upcoming 20 polling cycles. The histogram represents the entire deployment, rather than just this Kibana instance

The workload section summarizes the work load across the cluster, listing the tasks in the system, their types, schedules, and current status.

You can infer from these stats that a default deployment should suffice. This assessment is based on the following:

  • The estimated schedule density is low.
  • There aren’t many tasks in the system relative to the default capacity.

Suppose the output of stats.workload.value looked something like this:

{
  "count": 2191, 
  "task_types": {
    "alerting:.index-threshold": {
      "count": 202,
      "status": {
        "idle": 183,
        "claiming": 2,
        "running": 19
      }
    },
    "alerting:.es-query": {
      "count": 225,
      "status": {
        "idle": 225,
      }
    },
    "actions:.index": {
      "count": 89,
      "status": {
        "idle": 24,
        "running": 2,
        "failed": 63
      }
    },
    "alerting:xpack.uptime.alerts.monitorStatus": {
      "count": 87,
      "status": {
        "idle": 74,
        "running": 13
      }
    },
  },
  "schedule": [ 
    ["10s", 38],
    ["1m", 101],
    ["90s", 55],
    ["5m", 89],
    ["20m", 62],
    ["60m", 106],
    ["1d", 61]
  ],
  "overdue": 0, 
  "estimated_schedule_density": [  
    10, 1, 0, 10, 0, 20, 0, 1, 0, 1,
    9, 0, 3, 10, 0, 0, 10, 10, 7, 0,
    0, 31, 0, 12, 16, 31, 0, 10, 0, 10,
    3, 22, 0, 10, 0, 2, 10, 10, 1, 0
  ]
}

There are 2,191 tasks in the system.

The scheduled tasks are distributed across a variety of cadences.

The schedule density shows that you expect to exceed the default 10 concurrent tasks.

You can infer several important attributes of your workload from this output:

  • There are many tasks in your system and ensuring these tasks run on their scheduled cadence will require attention to the Task Manager throughput.
  • Assessing the high frequency tasks (tasks that recur at a cadence of a couple of minutes or less), you must support a throughput of approximately 400 tasks per minute (38 every 10 seconds + 101 every minute + 55 every 90 seconds).
  • Assessing the medium frequency tasks (tasks that recur at a cadence of an hour or less), you must support an additional throughput of over 2000 tasks per hour (89 every 5 minutes, + 62 every 20 minutes + 106 each hour). You can average the needed throughput for the hour by counting these tasks as an additional 30 to 40 tasks per minute.
  • Assessing the estimated schedule density, there are cycles that are due to run upwards of 31 tasks concurrently, and along side these cycles, there are empty cycles. You can expect Task Manager to load balance these tasks throughout the empty cycles, but this won’t leave much capacity to handle spikes in fresh tasks that might be scheduled in the future.

These rough calculations give you a lower bound to the required throughput, which is at least 440 tasks per minute to ensure recurring tasks are executed, at their scheduled time. This throughput doesn’t account for nonrecurring tasks that might have been scheduled, nor does it account for tasks (recurring or otherwise) that might be scheduled in the future.

Given these inferred attributes, it would be safe to assume that a single Kibana instance with default settings would not provide the required throughput. It is possible that scaling horizontally by adding a couple more Kibana instances will.

For details on scaling Task Manager, see Scaling guidance.