Backlogged task queue
A backlogged task queue can prevent tasks from completing and lead to an unhealthy cluster state. Contributing factors include resource constraints, a large number of tasks triggered at once, and long-running tasks.
Diagnose a backlogged task queue
To identify the cause of the backlog, try these diagnostic actions.
Check the thread pool status
A depleted thread pool can result in rejected requests.
Use the cat thread pool API to monitor active threads, queued tasks, rejections, and completed tasks:
resp = client.cat.thread_pool(
    v=True,
    s="t,n",
    h="type,name,node_name,active,queue,rejected,completed",
)
print(resp)

response = client.cat.thread_pool(
  v: true,
  s: 't,n',
  h: 'type,name,node_name,active,queue,rejected,completed'
)
puts response

const response = await client.cat.threadPool({
  v: "true",
  s: "t,n",
  h: "type,name,node_name,active,queue,rejected,completed",
});
console.log(response);
GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,completed
- Look for high active and queue metrics, which indicate potential bottlenecks and opportunities to reduce CPU usage.
- Determine whether thread pool issues are specific to a data tier.
- Check whether a specific node’s thread pool is depleting faster than others. This might indicate hot spotting.
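The checks in the list above can be partly automated. The sketch below assumes the cat thread pool API was called with format=json (so each row is a dict whose values are strings) and flags pools that have queued or rejected work, plus nodes whose queue for a pool is far above that pool's median across nodes, a rough hint of hot spotting. The function name, the skew factor of 3, and the sample rows are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import median


def find_backlogged_pools(rows, skew_factor=3.0):
    """Flag busy thread pools and nodes whose queue stands out from their peers.

    `rows` is assumed to be the parsed JSON from
    GET /_cat/thread_pool?format=json&h=name,node_name,active,queue,rejected,
    where every value is a string.
    """
    busy = []
    queues_by_pool = defaultdict(dict)  # pool name -> {node name: queue length}
    for row in rows:
        active, queue, rejected = (int(row[k]) for k in ("active", "queue", "rejected"))
        queues_by_pool[row["name"]][row["node_name"]] = queue
        if queue > 0 or rejected > 0:
            busy.append((row["node_name"], row["name"], active, queue, rejected))

    hot_spots = []
    for pool, per_node in queues_by_pool.items():
        if len(per_node) < 2:
            continue  # only one node reports this pool, nothing to compare against
        baseline = median(per_node.values())
        for node, queue in per_node.items():
            # A queue far above the pool's median across nodes may indicate hot spotting.
            if queue > skew_factor * max(baseline, 1):
                hot_spots.append((node, pool, queue, baseline))
    return busy, hot_spots


# Hypothetical sample rows; in practice they might come from
# client.cat.thread_pool(format="json", h="name,node_name,active,queue,rejected").
rows = [
    {"name": "write", "node_name": "node-1", "active": "8", "queue": "120", "rejected": "5"},
    {"name": "write", "node_name": "node-2", "active": "2", "queue": "0", "rejected": "0"},
    {"name": "write", "node_name": "node-3", "active": "3", "queue": "1", "rejected": "0"},
    {"name": "search", "node_name": "node-1", "active": "1", "queue": "0", "rejected": "0"},
]
busy, hot_spots = find_backlogged_pools(rows)
print(busy)
print(hot_spots)
```

Comparing each node against the median, rather than the mean, keeps a single overloaded node from inflating its own baseline.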
Inspect hot threads on each node
If a particular thread pool queue is backed up, periodically poll the nodes hot threads API to gauge the thread’s progression and ensure it has sufficient resources:
resp = client.nodes.hot_threads()
print(resp)

response = client.nodes.hot_threads
puts response

const response = await client.nodes.hotThreads();
console.log(response);
GET /_nodes/hot_threads
Although the hot threads API response does not list the specific tasks running on a thread, it provides a summary of the thread’s activities. You can correlate a hot threads response with a task management API response to identify any overlap with specific tasks. For example, if the hot threads response indicates the thread is performing a search query, you can check for long-running search tasks using the task management API.
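Periodic polling of hot threads can be sketched in Python. The helper below assumes the plain-text response contains summary lines of the form "89.7% ... cpu usage by thread 'elasticsearch[node-1][search][T#2]'", where the second bracketed name is the thread pool. It counts how often each node and thread pool pair shows up as hot across several samples. The regular expression, the function names, and the abbreviated sample text are assumptions for illustration, and the exact output format may vary between versions.

```python
import re
import time
from collections import Counter

# Assumed shape of a hot thread summary line, e.g.
#   89.7% [cpu=88.9%, other=0.8%] (448.6ms out of 500ms) cpu usage by thread 'elasticsearch[node-1][search][T#2]'
HOT_LINE = re.compile(
    r"(?P<pct>\d+(?:\.\d+)?)%.*?usage by thread "
    r"'elasticsearch\[(?P<node>[^\]]+)\]\[(?P<pool>[^\]]+)\]"
)


def hot_pools(text):
    """Return (node, thread pool, percent) for each hot thread summary line."""
    return [(m["node"], m["pool"], float(m["pct"])) for m in HOT_LINE.finditer(text)]


def poll_hot_threads(fetch, samples=3, interval_s=10.0):
    """Call `fetch` several times and count how often each (node, pool) is hot.

    `fetch` is any zero-argument callable returning the plain-text response,
    for example one that wraps client.nodes.hot_threads() (assumed).
    """
    seen = Counter()
    for i in range(samples):
        for node, pool, _pct in hot_pools(fetch()):
            seen[(node, pool)] += 1
        if i < samples - 1:
            time.sleep(interval_s)  # wait between samples to observe progression
    return seen


# Hypothetical, abbreviated response text.
SAMPLE = """::: {node-1}{abc}{def}{127.0.0.1}{127.0.0.1:9300}
   Hot threads at 2024-01-01T00:00:00.000Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

   89.7% [cpu=88.9%, other=0.8%] (448.6ms out of 500ms) cpu usage by thread 'elasticsearch[node-1][search][T#2]'
     10/10 snapshots sharing following 29 elements
   12.0% [cpu=12.0%, other=0.0%] (60ms out of 500ms) cpu usage by thread 'elasticsearch[node-1][write][T#1]'
"""

counts = poll_hot_threads(lambda: SAMPLE, samples=2, interval_s=0)
print(counts)
```

A pool that stays hot in every sample is a candidate for the task correlation described above.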
Identify long-running node tasks
Long-running tasks can also cause a backlog. Use the task management API to check for excessive running_time_in_nanos values:
resp = client.tasks.list(
    pretty=True,
    human=True,
    detailed=True,
)
print(resp)

const response = await client.tasks.list({
  pretty: "true",
  human: "true",
  detailed: "true",
});
console.log(response);
GET /_tasks?pretty=true&human=true&detailed=true
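A task list response can be long, so it can help to sort it by running_time_in_nanos. The sketch below assumes the default grouping by node ({"nodes": {node_id: {"name": ..., "tasks": {task_id: {...}}}}}) and returns tasks that have run for at least a chosen number of seconds, longest first. The 60-second threshold, the function name, and the sample response are hypothetical.

```python
def long_running_tasks(tasks_response, min_seconds=60.0):
    """List tasks whose running_time_in_nanos meets the threshold, longest first.

    `tasks_response` is assumed to be the JSON body of GET /_tasks?detailed=true
    with the default grouping by node.
    """
    threshold_ns = int(min_seconds * 1_000_000_000)
    found = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            running_ns = task["running_time_in_nanos"]
            if running_ns >= threshold_ns:
                found.append({
                    "task_id": task_id,
                    "node": node.get("name"),
                    "action": task["action"],
                    "running_s": running_ns / 1e9,
                    "description": task.get("description", ""),
                })
    return sorted(found, key=lambda t: t["running_s"], reverse=True)


# Hypothetical response body; node IDs and task IDs are made up.
sample = {
    "nodes": {
        "node-1-id": {
            "name": "node-1",
            "tasks": {
                "node-1-id:124": {"action": "indices:data/write/bulk",
                                  "running_time_in_nanos": 95_000_000_000,
                                  "description": "requests[5000], indices[logs]"},
                "node-1-id:125": {"action": "cluster:monitor/tasks/lists",
                                  "running_time_in_nanos": 2_000_000},
            },
        },
        "node-2-id": {
            "name": "node-2",
            "tasks": {
                "node-2-id:77": {"action": "indices:data/read/search",
                                 "running_time_in_nanos": 300_000_000_000},
            },
        },
    }
}

for task in long_running_tasks(sample, min_seconds=60):
    print(task["task_id"], task["action"], task["running_s"])
```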
You can filter on specific actions, such as bulk indexing or search-related tasks, which tend to be long-running.
- Filter on bulk index actions:
resp = client.tasks.list(
    human=True,
    detailed=True,
    actions="indices:data/write/bulk",
)
print(resp)

const response = await client.tasks.list({
  human: "true",
  detailed: "true",
  actions: "indices:data/write/bulk",
});
console.log(response);
GET /_tasks?human&detailed&actions=indices:data/write/bulk
- Filter on search actions:
resp = client.tasks.list(
    human=True,
    detailed=True,
    actions="indices:data/read/search",
)
print(resp)

const response = await client.tasks.list({
  human: "true",
  detailed: "true",
  actions: "indices:data/read/search",
});
console.log(response);

GET /_tasks?human&detailed&actions=indices:data/read/search
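The filters above use exact action names, which match only the top-level tasks. Shard-level child tasks carry longer names such as indices:data/write/bulk[s] or indices:data/read/search[phase/query]. The actions parameter also accepts wildcards, so a request such as GET /_tasks?human&detailed&actions=indices:data/write/bulk* should include those child tasks as well. The sketch below mimics that wildcard filtering client-side over action names you have already fetched. It assumes "*" is the only wildcard character, and the helper name and sample action list are illustrative.

```python
import re


def action_matcher(patterns):
    """Build a predicate for comma-separated action patterns with '*' wildcards.

    Approximates the tasks API `actions` filter: only '*' is special,
    so '[' and ']' in action names are matched literally.
    """
    regexes = [
        re.compile(".*".join(re.escape(part) for part in pattern.split("*")))
        for pattern in patterns.split(",")
        if pattern
    ]
    return lambda action: any(r.fullmatch(action) for r in regexes)


actions = [
    "indices:data/write/bulk",
    "indices:data/write/bulk[s]",
    "indices:data/write/bulk[s][p]",
    "indices:data/read/search",
    "indices:data/read/search[phase/query]",
    "cluster:monitor/tasks/lists",
]

exact = action_matcher("indices:data/write/bulk")
wide = action_matcher("indices:data/write/bulk*,indices:data/read/search*")
print([a for a in actions if exact(a)])
print([a for a in actions if wide(a)])
```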
Long-running tasks might need to be canceled.
Look for long-running cluster tasks
Use the cluster pending tasks API to identify delays in cluster state synchronization:
resp = client.cluster.pending_tasks()
print(resp)

const response = await client.cluster.pendingTasks();
console.log(response);
GET /_cluster/pending_tasks
Tasks with a high time_in_queue (or time_in_queue_millis) value are likely contributing to the backlog and might need to be canceled.
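To spot the slowest entries quickly, the pending tasks response can be sorted by time_in_queue_millis. This sketch assumes the JSON body has the shape {"tasks": [{"insert_order", "priority", "source", "executing", "time_in_queue_millis", "time_in_queue"}, ...]}. The one-second threshold, the helper name, and the sample tasks are hypothetical.

```python
def slow_pending_tasks(pending_response, min_ms=1000):
    """Return (time_in_queue_millis, priority, source) for slow tasks, longest first.

    `pending_response` is assumed to be the JSON body of GET /_cluster/pending_tasks.
    """
    slow = [
        task for task in pending_response.get("tasks", [])
        if task["time_in_queue_millis"] >= min_ms
    ]
    slow.sort(key=lambda task: task["time_in_queue_millis"], reverse=True)
    return [(t["time_in_queue_millis"], t["priority"], t["source"]) for t in slow]


# Hypothetical response body.
sample = {
    "tasks": [
        {"insert_order": 101, "priority": "URGENT", "source": "create-index [foo_9], cause [api]",
         "executing": True, "time_in_queue_millis": 86, "time_in_queue": "86ms"},
        {"insert_order": 46, "priority": "HIGH", "source": "shard-started [logs][0]",
         "executing": False, "time_in_queue_millis": 4200, "time_in_queue": "4.2s"},
        {"insert_order": 45, "priority": "HIGH", "source": "shard-started [logs][1]",
         "executing": False, "time_in_queue_millis": 15000, "time_in_queue": "15s"},
    ]
}

for entry in slow_pending_tasks(sample):
    print(entry)
```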
Recommendations
After identifying problematic threads and tasks, resolve the issue by increasing resources or canceling tasks.
Increase available resources
If tasks are progressing slowly, try reducing CPU usage.
In some cases, you might need to increase the thread pool size. For example, the force_merge thread pool defaults to a single thread. Increasing the size to 2 might help reduce a backlog of force merge requests.
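Before raising a pool size, it may be worth confirming that the pool is actually saturated, meaning every thread is busy and requests are still queued. The sketch below assumes rows from GET /_cat/thread_pool/force_merge?format=json&h=node_name,name,size,active,queue, where size is taken to be the configured thread count of a fixed pool. The helper name and sample rows are hypothetical. If saturation persists, the size is usually raised with the static node setting thread_pool.force_merge.size in elasticsearch.yml (for example, thread_pool.force_merge.size: 2), which takes effect after the node restarts.

```python
def saturated_nodes(rows):
    """Return (node, active, size, queue) where all threads are busy and work is queued."""
    result = []
    for row in rows:
        if not row.get("size"):
            continue  # skip rows without a fixed configured size
        size, active, queue = (int(row[k]) for k in ("size", "active", "queue"))
        # All threads busy with requests waiting: a larger pool might drain the
        # queue, provided the node still has spare CPU and disk I/O.
        if active >= size and queue > 0:
            result.append((row["node_name"], active, size, queue))
    return result


# Hypothetical sample rows for the force_merge pool.
rows = [
    {"node_name": "node-1", "name": "force_merge", "size": "1", "active": "1", "queue": "14"},
    {"node_name": "node-2", "name": "force_merge", "size": "1", "active": "0", "queue": "0"},
]
print(saturated_nodes(rows))
```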
Cancel stuck tasks
If an active task’s hot thread shows no progress, consider canceling the task.
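Once you have decided which tasks look stuck, cancellation can be scripted. This sketch assumes a task list response grouped by node, in which each task reports a cancellable flag, and it cancels only the IDs you pass in that are reported as cancellable. The cancel callable is injected so the logic can run without a cluster. With the Python client it could wrap client.tasks.cancel(task_id=...), which corresponds to POST /_tasks/<task_id>/_cancel. The helper name and sample data are hypothetical.

```python
def cancel_stuck_tasks(tasks_response, stuck_ids, cancel):
    """Cancel the given task IDs if the tasks API reports them as cancellable.

    Returns (cancelled, skipped) lists of task IDs.
    """
    cancellable = {}
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            cancellable[task_id] = task.get("cancellable", False)

    cancelled, skipped = [], []
    for task_id in stuck_ids:
        if cancellable.get(task_id):
            cancel(task_id)
            cancelled.append(task_id)
        else:
            skipped.append(task_id)  # unknown, already finished, or not cancellable
    return cancelled, skipped


# Hypothetical response; in practice cancel might be
# lambda task_id: client.tasks.cancel(task_id=task_id).
sample = {"nodes": {"n1": {"tasks": {
    "n1:124": {"action": "indices:data/read/search", "cancellable": True},
    "n1:200": {"action": "indices:data/write/bulk[s][p]", "cancellable": False},
}}}}
calls = []
cancelled, skipped = cancel_stuck_tasks(sample, ["n1:124", "n1:200", "n1:999"], calls.append)
print(cancelled, skipped)
```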
Address hot spotting
If a specific node’s thread pool is depleting faster than others, try addressing uneven node resource utilization, also known as hot spotting. For details on actions you can take, such as rebalancing shards, see Hot spotting.
Resources
Related symptoms: