Tune APM Server
Version 6.5 of the APM Server introduced a new intake API. APM agents that have not been upgraded continue to use the deprecated v1 intake API endpoint. Upgraded agents use the new v2 intake API endpoint. Knowing which endpoint your APM agents use to communicate with the APM Server is vital to tuning data ingestion correctly, because some settings apply to only one of the two APIs. Read the intake API changes documentation to learn more.
Tune APM Server output parameters for your Elasticsearch cluster
If your Elasticsearch cluster is not ingesting the amount of data you expect, you can adjust a few APM Server settings:
- Adjust output.elasticsearch.workers. See tune for indexing speed for an overview.
- Ensure that output.elasticsearch.bulk_max_size is set to a high value, for example 5120. The default of 50 is very conservative.
- Ensure that queue.mem.events is set to a reasonable value compared to your other settings. A good rule of thumb is that queue.mem.events should equal output.elasticsearch.workers multiplied by output.elasticsearch.bulk_max_size.
The output configuration section shows more details.
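As a worked example of the rule of thumb above, the following Python sketch computes a queue.mem.events value from the other two settings. The worker count of 4 is an assumed value chosen for illustration, not a recommendation; 5120 is the example bulk_max_size from the list above.

```python
def recommended_queue_size(workers: int, bulk_max_size: int) -> int:
    """Rule of thumb: queue.mem.events = output.elasticsearch.workers
    * output.elasticsearch.bulk_max_size."""
    if workers < 1 or bulk_max_size < 1:
        raise ValueError("workers and bulk_max_size must be positive")
    return workers * bulk_max_size


# Assumed: 4 workers, bulk_max_size raised to 5120.
print(recommended_queue_size(4, 5120))  # 20480
```

With these numbers the queue holds exactly as many events as the workers can ship in one round of full bulk requests.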
Adjust internal queue size
APM Server uses an internal queue to buffer incoming events. A larger queue can retain more data if Elasticsearch is unavailable for longer periods, and it alleviates problems that might result from sudden spikes of traffic. You can adjust the queue size by overriding queue.mem.events.
Increasing queue.mem.events can significantly affect APM Server memory usage, because queued events are held in memory.
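To get a rough feel for this trade-off, the Python sketch below estimates how long an empty queue can absorb incoming events while nothing is drained (for example, while Elasticsearch is unavailable), and roughly how much memory a full queue might occupy. The event rate and the 2 KiB average event size are hypothetical, and the memory figure ignores any per-event overhead inside APM Server, so measure before relying on it.

```python
def seconds_until_full(queue_events: int, incoming_events_per_sec: float) -> float:
    """Time for incoming events to fill an empty queue when nothing is drained."""
    if incoming_events_per_sec <= 0:
        raise ValueError("incoming_events_per_sec must be positive")
    return queue_events / incoming_events_per_sec


def approx_queue_memory_mib(queue_events: int, avg_event_bytes: int) -> float:
    """Rough estimate of a full queue's size; ignores per-event overhead."""
    return queue_events * avg_event_bytes / (1024 * 1024)


QUEUE_EVENTS = 20480     # assumed queue.mem.events value
EVENTS_PER_SEC = 1000    # hypothetical incoming event rate
AVG_EVENT_BYTES = 2048   # hypothetical average event size

print(seconds_until_full(QUEUE_EVENTS, EVENTS_PER_SEC))        # 20.48
print(approx_queue_memory_mib(QUEUE_EVENTS, AVG_EVENT_BYTES))  # 40.0
```

Both quantities grow linearly with queue.mem.events: doubling the queue doubles the outage it can ride out and roughly doubles its memory footprint.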
Adjust concurrent requests (deprecated in 6.5)
This setting only affects agents using v1 of the APM Server intake API.
APM Server limits how many requests it processes concurrently. This limit is set by the apm-server.concurrent_requests setting.
Increasing this value will improve throughput, but it can significantly affect APM Server memory usage.
Add APM Server instances
If the APM Server cannot process data quickly enough, you will see request timeouts.
One way to solve this problem is to increase processing power, either by migrating your APM Server to a more powerful machine or by adding more APM Server instances. Running several instances also increases availability.
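Once you know what a single instance can sustain, sizing the number of instances is a division. In the sketch below, per_instance_events_per_sec is a hypothetical figure you would obtain by load-testing one APM Server, and the 30% headroom reserved for traffic spikes is an assumed policy, not an Elastic recommendation.

```python
import math


def instances_needed(peak_events_per_sec: float,
                     per_instance_events_per_sec: float,
                     headroom: float = 0.3) -> int:
    """Smallest instance count whose combined capacity, after reserving
    `headroom` of each instance for spikes, covers the peak event rate."""
    if per_instance_events_per_sec <= 0:
        raise ValueError("per_instance_events_per_sec must be positive")
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    usable_per_instance = per_instance_events_per_sec * (1.0 - headroom)
    return max(1, math.ceil(peak_events_per_sec / usable_per_instance))


# Hypothetical: 12,000 events/s at peak, 5,000 events/s measured per instance.
print(instances_needed(12_000, 5_000))                # 4
print(instances_needed(12_000, 5_000, headroom=0.0))  # 3
```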
Reduce the payload size
Large payloads may result in request timeouts. You can reduce the payload size by decreasing the flush interval in the agents, which causes agents to send smaller, more frequent requests.
Optionally, you can also reduce the sample rate or reduce the number of stack traces collected.
Read more in the agents documentation.
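Payload size scales roughly with the number of events an agent accumulates between flushes. The sketch below makes that estimate explicit. The event rate and the 1 KiB average event size are hypothetical, and the effects of a lower sample rate or fewer stack traces (which reduce the number or size of events) are not modeled.

```python
def approx_payload_bytes(events_per_sec: float, flush_interval_sec: float,
                         avg_event_bytes: int) -> int:
    """Events buffered between two flushes times their average encoded size."""
    return round(events_per_sec * flush_interval_sec * avg_event_bytes)


# Hypothetical agent producing 200 events/s of about 1 KiB each.
for interval in (10, 1):
    print(interval, approx_payload_bytes(200, interval, 1024))
# 10 2048000
# 1 204800
```

Cutting the flush interval from 10 seconds to 1 second shrinks each request tenfold while the total volume sent stays the same.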
Adjust RUM event rate limit
This setting affects agents using v2 of the APM Server intake API. If you're using an agent that uses v1, see the server configuration changes documentation.
Agents use long-running requests and flush as many events as possible over a single request. For this reason, the RUM rate limit applies to the number of events sent per second, per IP address, rather than to the number of requests.
If the rate limit is hit while events are being sent on an established request, the request is not terminated immediately. Instead, event intake is throttled to event_rate.limit, which means that events are queued and processed more slowly. The request is terminated with a 429 - rate limit exceeded HTTP response only when the allowed buffer queue is also full. If an agent tries to establish a new request after the rate limit has already been hit, a 429 is sent immediately.
Increasing event_rate.limit above its default value will help avoid rate limit exceeded errors.
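The behavior described above can be modeled with a per-IP token bucket that counts events rather than requests. The Python sketch below is an illustration under stated assumptions, not APM Server's actual implementation: event_rate_limit stands in for event_rate.limit, while burst and max_backlog are hypothetical parameters for the bucket size and the buffer queue. A negative balance represents events that are queued and waiting for budget (throttled), and an event that would overflow the backlog corresponds to terminating the request with a 429.

```python
import time
from typing import Callable, Dict


class PerIPEventLimiter:
    """Token bucket per client IP that counts events, not requests."""

    def __init__(self, event_rate_limit: float, burst: float, max_backlog: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = event_rate_limit    # events/s per IP (stand-in for event_rate.limit)
        self.burst = burst              # hypothetical: budget an idle IP can build up
        self.max_backlog = max_backlog  # hypothetical: events allowed to wait for budget
        self.clock = clock
        self.tokens: Dict[str, float] = {}
        self.last: Dict[str, float] = {}

    def _refill(self, ip: str) -> float:
        now = self.clock()
        if ip not in self.tokens:
            self.tokens[ip], self.last[ip] = self.burst, now
        elapsed = now - self.last[ip]
        # Budget grows at `rate` per second; a negative balance is paid down first.
        self.tokens[ip] = min(self.burst, self.tokens[ip] + elapsed * self.rate)
        self.last[ip] = now
        return self.tokens[ip]

    def accept_new_request(self, ip: str) -> bool:
        """False means the limit is already hit: answer with 429 immediately."""
        return self._refill(ip) >= 1

    def on_event(self, ip: str) -> str:
        """Admit one event arriving on an established request."""
        remaining = self._refill(ip) - 1
        if remaining < -self.max_backlog:
            return "rejected_429"       # buffer full: terminate the request with 429
        self.tokens[ip] = remaining     # negative balance = event queued (throttled)
        return "processed" if remaining >= 0 else "queued"


class FakeClock:
    """Deterministic clock so the example does not depend on wall time."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


clock = FakeClock()
limiter = PerIPEventLimiter(event_rate_limit=3, burst=3, max_backlog=2, clock=clock)
ip = "203.0.113.7"
print(limiter.accept_new_request(ip))            # True
print([limiter.on_event(ip) for _ in range(6)])
# ['processed', 'processed', 'processed', 'queued', 'queued', 'rejected_429']
print(limiter.accept_new_request(ip))            # False (answered with 429)
clock.now = 2.0                                  # two seconds later
print(limiter.accept_new_request(ip))            # True
```

Raising event_rate_limit in this model refills the budget faster, so fewer events end up queued and the backlog overflows less often, which mirrors why increasing event_rate.limit helps avoid rate limit exceeded errors.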
Tuning APM Server using both v1 and v2 intake API
Depending on your agent versions, the APM Server may need to communicate with agents over both versions of the intake API.
For example, if you have a v1.x Node.js agent and a v4.x Python agent, the Node.js agent will communicate via the v1 endpoint, and the Python agent will communicate via the v2 endpoint.
In this case, you'll need to adjust both the deprecated v1 configuration options and the newly introduced v2 options.
Check the agent/server compatibility matrix for compatibility information.