Health Report Pipeline Flow: Worker Utilization

The Pipeline indicator has a flow:worker_utilization probe that can produce one of several diagnoses about blockages in the pipeline.

A pipeline is considered "blocked" when its workers are fully utilized: if they consistently spend 100% of their time processing events, they cannot pick up new events from the queue. This can cause back-pressure to cascade to upstream services, which can result in data loss or duplicate processing depending on upstream configuration.

The issue typically stems from one or more causes:

  • a downstream resource being blocked,
  • a plugin consuming more resources than expected, and/or
  • insufficient resources being allocated to the pipeline.

To address the issue, observe the plugin flow rates from the Node Stats API, and identify which plugins have the highest worker_utilization. This tells you which plugins are consuming the largest share of the pipeline’s worker resources.

  • If the offending plugin connects to a downstream service or another pipeline that is exerting back-pressure, the issue needs to be addressed in the downstream service or pipeline.
  • If the offending plugin connects to a downstream service with high network latency, throughput for the pipeline may be improved by allocating more worker resources to the pipeline (for example, by raising the pipeline.workers setting).
  • If the offending plugin is a computation-heavy filter such as grok or kv, its configuration may need to be tuned to eliminate wasted computation.
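The lookup described above can be sketched in Python. This is a minimal illustration, not an official client: it assumes the Node Stats API is reachable at http://localhost:9600 and that each filter and output entry under pipelines.<id>.plugins carries a flow.worker_utilization object with windows such as last_1_minute. The function names fetch_pipeline_stats and rank_plugins_by_utilization are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_pipeline_stats(base_url="http://localhost:9600"):
    """Fetch per-pipeline stats from the Node Stats API.

    The default base_url is an assumption; adjust it to your deployment.
    """
    with urlopen(f"{base_url}/_node/stats/pipelines") as response:
        return json.load(response)


def rank_plugins_by_utilization(stats, pipeline_id, window="last_1_minute"):
    """Return (utilization, plugin_type, name, id) tuples, highest first.

    Assumes the response shape described in the lead-in. Plugins that do
    not report a numeric value for the requested window are skipped.
    """
    plugins = stats.get("pipelines", {}).get(pipeline_id, {}).get("plugins", {})
    ranked = []
    for plugin_type in ("filters", "outputs"):
        for plugin in plugins.get(plugin_type, []):
            flow = plugin.get("flow", {})
            value = flow.get("worker_utilization", {}).get(window)
            if isinstance(value, (int, float)):
                ranked.append(
                    (value, plugin_type, plugin.get("name"), plugin.get("id"))
                )
    # Sort on the utilization value only, so missing names never get compared.
    ranked.sort(key=lambda row: row[0], reverse=True)
    return ranked
```

Calling rank_plugins_by_utilization(fetch_pipeline_stats(), "main") would list the plugins of a pipeline named "main" (a placeholder id) with the heaviest consumers first; the first few rows are the candidates to check against the cases in the list above.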

Blocked Pipeline (5 minutes)

A pipeline that has been completely blocked for five minutes or more represents a critical blockage to the flow of events through your pipeline that needs to be addressed immediately to avoid or limit data loss. See above for troubleshooting steps.

Nearly Blocked Pipeline (5 minutes)

A pipeline that has been nearly blocked for five minutes or more may be intermittently blocking the flow of events through your pipeline, which creates a risk of data loss. See above for troubleshooting steps.

Blocked Pipeline (1 minute)

A pipeline that has been completely blocked for one minute or more represents a high-risk or upcoming blockage to the flow of events through your pipeline that likely needs to be addressed soon to avoid or limit data loss. See above for troubleshooting steps.

Nearly Blocked Pipeline (1 minute)

A pipeline that has been nearly blocked for one minute or more may be intermittently blocking the flow of events through your pipeline, which creates a risk of data loss. See above for troubleshooting steps.
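To make the relationship between the four diagnoses concrete, the following Python sketch maps a pipeline's worker_utilization windows to a diagnosis name. The numeric thresholds and the precedence order are assumptions chosen for illustration, since this page does not state the values the probe uses; the window names last_5_minutes and last_1_minute are likewise assumed to match the flow metrics reported by the Node Stats API.

```python
# Assumed thresholds (percent); this page does not give the probe's values.
BLOCKED_THRESHOLD = 99.999
NEARLY_BLOCKED_THRESHOLD = 95.0


def diagnose(worker_utilization):
    """Return the first matching diagnosis name, or None if none applies.

    worker_utilization is a dict such as
    {"last_1_minute": 97.2, "last_5_minutes": 64.0}.
    Checks run in the order the diagnoses appear on this page, which is
    an assumed precedence rather than documented behavior.
    """
    five = worker_utilization.get("last_5_minutes")
    one = worker_utilization.get("last_1_minute")
    checks = [
        (five, BLOCKED_THRESHOLD, "Blocked Pipeline (5 minutes)"),
        (five, NEARLY_BLOCKED_THRESHOLD, "Nearly Blocked Pipeline (5 minutes)"),
        (one, BLOCKED_THRESHOLD, "Blocked Pipeline (1 minute)"),
        (one, NEARLY_BLOCKED_THRESHOLD, "Nearly Blocked Pipeline (1 minute)"),
    ]
    for value, threshold, label in checks:
        # Skip windows that are missing or not yet reported as a number.
        if isinstance(value, (int, float)) and value >= threshold:
            return label
    return None
```

A sustained five-minute reading takes precedence here because it indicates a longer-lived problem than a one-minute spike; reorder the checks if your monitoring should react to the shorter window first.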