Backlogged task queue

A backlogged task queue can prevent tasks from completing and lead to an unhealthy cluster state. Contributing factors include resource constraints, a large number of tasks triggered at once, and long-running tasks.

Diagnose a backlogged task queue

To identify the cause of the backlog, try these diagnostic actions.

Check the thread pool status

A depleted thread pool can result in rejected requests.

Use the cat thread pool API to monitor active threads, queued tasks, rejections, and completed tasks:

Python:

resp = client.cat.thread_pool(
    v=True,
    s="t,n",
    h="type,name,node_name,active,queue,rejected,completed",
)
print(resp)

Ruby:

response = client.cat.thread_pool(
  v: true,
  s: 't,n',
  h: 'type,name,node_name,active,queue,rejected,completed'
)
puts response

JavaScript:

const response = await client.cat.threadPool({
  v: "true",
  s: "t,n",
  h: "type,name,node_name,active,queue,rejected,completed",
});
console.log(response);

Console:

GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,completed

  • Look for high active and queue metrics, which indicate potential bottlenecks and opportunities to reduce CPU usage.
  • Determine whether thread pool issues are specific to a data tier.
  • Check whether a specific node’s thread pool is depleting faster than others. This might indicate hot spotting.
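
The checks above can be partly automated. The following is a minimal Python sketch, assuming the rows were fetched in JSON form (for example by adding `format="json"` to the cat thread pool request). The function name `find_backlogged_pools`, the `queue_threshold` parameter, and the sample rows are illustrative, not part of any Elasticsearch client.

```python
from typing import Dict, List


def find_backlogged_pools(
    rows: List[Dict[str, str]], queue_threshold: int = 0
) -> List[Dict[str, str]]:
    """Return rows whose queue exceeds the threshold or that report rejections."""
    flagged = []
    for row in rows:
        # The cat APIs return every value as a string, so convert before comparing.
        queued = int(row.get("queue") or 0)
        rejected = int(row.get("rejected") or 0)
        if queued > queue_threshold or rejected > 0:
            flagged.append(row)
    # Put the largest queues first.
    return sorted(flagged, key=lambda r: int(r.get("queue") or 0), reverse=True)


# Invented rows, shaped like the JSON form of the cat thread pool response.
sample_rows = [
    {"name": "search", "node_name": "node-1", "active": "13", "queue": "250", "rejected": "4"},
    {"name": "write", "node_name": "node-1", "active": "2", "queue": "0", "rejected": "0"},
    {"name": "search", "node_name": "node-2", "active": "1", "queue": "0", "rejected": "0"},
]

for row in find_backlogged_pools(sample_rows):
    print(row["node_name"], row["name"], row["queue"], row["rejected"])
```

If the flagged rows all share one `node_name`, that points toward hot spotting rather than a cluster-wide shortage.
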
Inspect hot threads on each node

If a particular thread pool queue is backed up, periodically poll the nodes hot threads API to gauge the thread’s progression and ensure it has sufficient resources:

Python:

resp = client.nodes.hot_threads()
print(resp)

Ruby:

response = client.nodes.hot_threads
puts response

JavaScript:

const response = await client.nodes.hotThreads();
console.log(response);

Console:

GET /_nodes/hot_threads

Although the hot threads API response does not list the specific tasks running on a thread, it provides a summary of the thread’s activities. You can correlate a hot threads response with a task management API response to identify any overlap with specific tasks. For example, if the hot threads response indicates the thread is performing a search query, you can check for long-running search tasks using the task management API.

Identify long-running node tasks

Long-running tasks can also cause a backlog. Use the task management API to check for excessive running_time_in_nanos values:

Python:

resp = client.tasks.list(
    pretty=True,
    human=True,
    detailed=True,
)
print(resp)

JavaScript:

const response = await client.tasks.list({
  pretty: "true",
  human: "true",
  detailed: "true",
});
console.log(response);

Console:

GET /_tasks?pretty=true&human=true&detailed=true

You can filter on a specific action, such as bulk indexing or search-related tasks. These tend to be long-running.

  • Filter on bulk index actions:

    Python:

    resp = client.tasks.list(
        human=True,
        detailed=True,
        actions="indices:data/write/bulk",
    )
    print(resp)

    JavaScript:

    const response = await client.tasks.list({
      human: "true",
      detailed: "true",
      actions: "indices:data/write/bulk",
    });
    console.log(response);

    Console:

    GET /_tasks?human&detailed&actions=indices:data/write/bulk

  • Filter on search actions:

    Python:

    resp = client.tasks.list(
        human=True,
        detailed=True,
        actions="indices:data/read/search",
    )
    print(resp)

    JavaScript:

    const response = await client.tasks.list({
      human: "true",
      detailed: "true",
      actions: "indices:data/read/search",
    });
    console.log(response);

    Console:

    GET /_tasks?human&detailed&actions=indices:data/read/search

Long-running tasks might need to be canceled.
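
To make "excessive" concrete, you could scan a task management API response for tasks whose `running_time_in_nanos` exceeds a limit you choose. The sketch below assumes the response has already been loaded as a Python dictionary. The helper name `find_long_running_tasks`, the `max_seconds` parameter, and all sample IDs and timings are hypothetical.

```python
from typing import Any, Dict, List


def find_long_running_tasks(
    tasks_response: Dict[str, Any], max_seconds: float
) -> List[Dict[str, Any]]:
    """Return tasks that have been running longer than max_seconds, slowest first."""
    threshold_nanos = max_seconds * 1_000_000_000
    slow = []
    # Tasks are grouped by node, then keyed by task ID.
    for node_id, node in tasks_response.get("nodes", {}).items():
        for task_id, task in node.get("tasks", {}).items():
            running_nanos = task.get("running_time_in_nanos", 0)
            if running_nanos > threshold_nanos:
                slow.append(
                    {
                        "task_id": task_id,
                        "node": node_id,
                        "action": task.get("action"),
                        "running_seconds": running_nanos / 1_000_000_000,
                        "cancellable": task.get("cancellable", False),
                    }
                )
    return sorted(slow, key=lambda t: t["running_seconds"], reverse=True)


# Invented response fragment; node IDs, task IDs, and timings are made up.
sample_response = {
    "nodes": {
        "node-a": {
            "name": "node-1",
            "tasks": {
                "node-a:101": {
                    "action": "indices:data/read/search",
                    "running_time_in_nanos": 95_000_000_000,
                    "cancellable": True,
                },
                "node-a:102": {
                    "action": "indices:data/write/bulk",
                    "running_time_in_nanos": 1_200_000,
                },
            },
        }
    }
}

for task in find_long_running_tasks(sample_response, max_seconds=30):
    print(task["task_id"], task["action"], task["running_seconds"])
```

The 30-second limit is arbitrary. A sensible threshold depends on the workload, since a large reindex or force merge can legitimately run much longer than a search.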

Look for long-running cluster tasks

Use the cluster pending tasks API to identify delays in cluster state synchronization:

Python:

resp = client.cluster.pending_tasks()
print(resp)

JavaScript:

const response = await client.cluster.pendingTasks();
console.log(response);

Console:

GET /_cluster/pending_tasks

Tasks with a high time_in_queue value are likely contributing to the backlog and might need to be canceled.
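
As a rough illustration, the pending tasks response can be sorted by queue time to surface the oldest entries. This sketch assumes the response is available as a Python dictionary with a `tasks` list whose entries carry `time_in_queue_millis`. The helper name `slowest_pending_tasks`, the `min_millis` parameter, and the sample entries are illustrative.

```python
from typing import Any, Dict, List


def slowest_pending_tasks(
    pending_response: Dict[str, Any], min_millis: int
) -> List[Dict[str, Any]]:
    """Return pending cluster tasks queued for at least min_millis, oldest first."""
    tasks = pending_response.get("tasks", [])
    waiting = [t for t in tasks if t.get("time_in_queue_millis", 0) >= min_millis]
    return sorted(waiting, key=lambda t: t["time_in_queue_millis"], reverse=True)


# Invented response; sources, priorities, and timings are made up.
sample_pending = {
    "tasks": [
        {"insert_order": 101, "priority": "URGENT", "source": "create-index [example-1]", "time_in_queue_millis": 86},
        {"insert_order": 46, "priority": "HIGH", "source": "shard-started [example-2]", "time_in_queue_millis": 42_000},
        {"insert_order": 45, "priority": "HIGH", "source": "shard-started [example-3]", "time_in_queue_millis": 61_500},
    ]
}

for task in slowest_pending_tasks(sample_pending, min_millis=10_000):
    print(task["insert_order"], task["priority"], task["time_in_queue_millis"], task["source"])
```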

Recommendations

After identifying problematic threads and tasks, resolve the issue by increasing resources or canceling tasks.

Increase available resources

If tasks are progressing slowly, try reducing CPU usage.

In some cases, you might need to increase the thread pool size. For example, the force_merge thread pool defaults to a single thread. Increasing the size to 2 might help reduce a backlog of force merge requests.

Cancel stuck tasks

If an active task’s hot thread shows no progress, consider canceling the task.

Address hot spotting

If a specific node’s thread pool is depleting faster than others, try addressing uneven node resource utilization, also known as hot spotting. For details on actions you can take, such as rebalancing shards, see Hot spotting.

Resources

Related symptoms: