Jonas Kunz

Combining Elastic Universal Profiling with Java APM Services and Traces

Learn how to combine the power of Elastic universal profiling with APM data from Java services to easily pinpoint CPU bottlenecks. Compatible with both OpenTelemetry and the classic Elastic APM Agent.

In a previous blog post, we introduced the technical details of how we managed to correlate eBPF profiling data with APM traces. This time, we'll show you how to get this feature up and running to pinpoint CPU bottlenecks in your Java services! The correlation is supported for both OpenTelemetry and the classic Elastic APM Agent. We'll show you how to enable it for both.

Demo Application

For this blog post, we’ll be using the cpu-burner demo application to showcase the correlation capabilities of APM, tracing, and profiling in Elastic. This application was built to continuously execute several CPU-intensive tasks:

  • It computes Fibonacci numbers using the naive, recursive algorithm.
  • It hashes random data with the SHA-2 and SHA-3 hashing algorithms.
  • It performs numerous large background allocations to stress the garbage collector.

The computations of the Fibonacci numbers and the hashing will each be visible as transactions in Elastic: They have been manually instrumented using the OpenTelemetry API.
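As a rough illustration of what that manual instrumentation looks like, here is a minimal sketch of a CPU-heavy task wrapped in a span via the OpenTelemetry API. The class, method, and attribute names are illustrative and may differ from the actual cpu-burner code:

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class FibonacciTask {

    private static final Tracer tracer =
        GlobalOpenTelemetry.getTracer("cpu-burner");

    // Wraps one Fibonacci computation in a manually created span,
    // which then shows up as a transaction in Elastic
    public static long fibonacciTransaction(int n) {
        Span span = tracer.spanBuilder("fibonacci")
            .setAttribute("n", (long) n)
            .startSpan();
        try (Scope scope = span.makeCurrent()) {
            return fib(n);
        } finally {
            span.end();
        }
    }

    // Intentionally naive, exponential-time recursion to burn CPU
    private static long fib(int n) {
        if (n < 2) {
            return n;
        }
        return fib(n - 1) + fib(n - 2);
    }
}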

Setting up Profiling and APM

First, we’ll need to set up the universal profiling host agent on the host where the demo application will run. Starting from version 8.14.0, correlation with APM data is supported and enabled out of the box for the profiler. There is no special configuration needed; we can just follow the standard setup guide. Note that at the time of writing, universal profiling only supports Linux. On Windows, you'll have to use a VM to try the demo. On macOS, you can use colima as the Docker engine and run the profiling host agent and the demo app in containers.

In addition, we’ll need to instrument our demo application with an APM agent. We can either use the classic Elastic APM agent or the Elastic OpenTelemetry Distribution.

Using the Classic Elastic APM Agent

Starting with version 1.50.0, the classic Elastic APM agent ships with the capability to correlate the traces it captures with the profiling data from universal profiling. We’ll just need to enable it explicitly via the universal_profiling_integration_enabled config option. Here are the commands for downloading the agent and running the demo application with this setting enabled:

curl -o 'elastic-apm-agent.jar' -L 'https://oss.sonatype.org/service/local/artifact/maven/redirect?r=releases&g=co.elastic.apm&a=elastic-apm-agent&v=LATEST'
java -javaagent:elastic-apm-agent.jar \
-Delastic.apm.service_name=cpu-burner-elastic \
-Delastic.apm.secret_token=XXXXX \
-Delastic.apm.server_url=<elastic-apm-server-endpoint> \
-Delastic.apm.application_packages=co.elastic.demo \
-Delastic.apm.universal_profiling_integration_enabled=true \
-jar ./target/cpu-burner.jar

Using OpenTelemetry

The feature is also available as an OpenTelemetry SDK extension. This means you can use it as a plugin for the vanilla OpenTelemetry Java agent or add it to your OpenTelemetry SDK if you are not using an agent. In addition, the feature ships by default with the Elastic OpenTelemetry Distribution for Java and can be used with any of its supported usage methods. While the extension is currently Elastic-specific, we are already working with the various OpenTelemetry SIGs on standardizing the correlation mechanism, especially now that the eBPF profiling agent has been contributed to OpenTelemetry.
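If you are not running a Java agent, the extension can be wired into an SDK-based setup instead. The following is only a minimal sketch, assuming the extension is added as a dependency and registers itself through the OpenTelemetry SDK autoconfiguration SPI (check the extension’s documentation for the exact artifact coordinates and registration mechanism):

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;

public class OtelSetup {

    // Builds the SDK via OpenTelemetry's autoconfiguration module. Extensions on the
    // classpath that hook into the autoconfigure SPI are picked up automatically
    // (assumption: the profiling-integration extension registers itself this way),
    // so no extension-specific code is needed here.
    public static OpenTelemetry create() {
        return AutoConfiguredOpenTelemetrySdk.initialize()
            .getOpenTelemetrySdk();
    }
}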

For this demo, we’ll be using the Elastic OpenTelemetry Distro Java agent to run the extension:

curl -o 'elastic-otel-javaagent.jar' -L 'https://oss.sonatype.org/service/local/artifact/maven/redirect?r=releases&g=co.elastic.otel&a=elastic-otel-javaagent&v=LATEST'
java -javaagent:./elastic-otel-javaagent.jar \
-Dotel.exporter.otlp.endpoint=<elastic-cloud-OTLP-endpoint> \
"-Dotel.exporter.otlp.headers=Authorization=Bearer XXXX" \
-Dotel.service.name=cpu-burner-otel \
-Delastic.otel.universal.profiling.integration.enabled=true \
-jar ./target/cpu-burner.jar

Here, we explicitly enabled the profiling integration feature via the elastic.otel.universal.profiling.integration.enabled property. Note that with an upcoming release of the universal profiling feature, this won’t be necessary anymore! The OpenTelemetry extension will then automatically detect the presence of the profiler and enable the correlation feature based on that.

The demo repository also comes with a Dockerfile, so you can alternatively build and run the app in Docker:

docker build -t cpu-burner .
docker run --rm -e OTEL_EXPORTER_OTLP_ENDPOINT=<elastic-cloud-OTLP-endpoint> -e OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer XXXX" cpu-burner

And that’s it for setup; we are now ready to inspect the correlated profiling data!

Analyzing Service CPU Usage

The first thing we can do now is head to the “Flamegraph” view in Universal Profiling and inspect flamegraphs filtered on APM services. Without the APM correlation, universal profiling is limited to filtering on infrastructure concepts, such as hosts, containers, and processes. Below is a screencast showing a flamegraph filtered on the service name of our demo application:

With this filter applied, we get a flamegraph aggregated over all instances of our service. If that is not desired, we could narrow down the filter, e.g. based on the host or container names. Note that the same service-level flamegraph view is also available on the “Universal Profiling” tab in the APM service UI.

The flamegraphs show exactly how the demo application spends its CPU time, regardless of whether that time is covered by instrumentation or not. From left to right, we can first see the time spent in application tasks: We can identify the background allocations not covered by APM transactions as well as the SHA-computation and Fibonacci transactions. Interestingly, this application logic only accounts for roughly 60% of the total CPU time! The remaining time is spent mostly in the G1 garbage collector due to the high allocation rate of our application. The flamegraph shows all G1-related activities and the timing of the individual phases of its concurrent tasks. We can easily identify those based on the native function names. This is made possible by universal profiling being capable of profiling and symbolizing the JVM’s C++ code in addition to the Java code.

Pinpointing Transaction Bottlenecks

The service-level flamegraph already gives good insight into where our transactions consume the most CPU, but mainly because of the simplicity of the demo application. In real-world applications, it can be much harder to attribute individual stack frames to specific transactions. For this reason, the APM agent also correlates the CPU profiling data from universal profiling at the transaction level.

We can navigate to the “Universal Profiling” tab on the transaction details page to get per-transaction flamegraphs:

For example, let’s have a look at the flamegraph of our transaction computing SHA-2 and SHA-3 hashes of randomly generated data:

Interestingly, the flamegraph uncovers an unexpected result: The transactions spend more time generating the random bytes to be hashed than on the hashing itself! So if this were a real-world application, a possible optimization would be to use a faster random number generator.

In addition, we can see that the MessageDigest.update call for computing the hash values fans out into two different code paths: one is a call into the BouncyCastle cryptography library; the other is a JVM stub routine, meaning that the JIT compiler has inserted specialized assembly code for the function.

The flamegraph shown in the screenshot displays the aggregated data for all “shaShenanigans” transactions in the given time filter. We can narrow this down further using the transaction filter bar at the top. To make the best use of this, the demo application annotates its transactions with the hashing algorithm in use via an OpenTelemetry attribute:

public static void shaShenanigans(MessageDigest digest) {
    // The algorithm name (e.g. "SHA-256" or "SHA3-256") becomes a filterable attribute
    Span span = tracer.spanBuilder("shaShenanigans")
        .setAttribute("algorithm", digest.getAlgorithm())
        .startSpan();
    ...
    span.end();
}

So, let’s filter our flamegraph based on the used hashing algorithm:

Note that “SHA-256” is the name of the JVM built-in SHA-2 256-bit implementation. This now gives the following flamegraph:

We can see that the BouncyCastle stack frames are gone and that MessageDigest.update now spends all its time in the JVM stub routines. The stub routine is therefore likely hand-crafted assembly provided by the JVM maintainers for the SHA-2 algorithm.

If we instead filter on “SHA3-256”, we get the following result:

Now, as expected, MessageDigest.update spends all its time in the BouncyCastle library for the SHA-3 implementation. Note that the hashing here takes up more time relative to the random data generation, showing that the SHA-2 JVM stub routine is significantly faster than BouncyCastle’s Java SHA-3 implementation.

This filtering is not limited to custom attributes like the one shown in this demo. You can filter on any transaction attribute, including latency, HTTP headers, and so on. For example, in a typical HTTP application you could analyze how the efficiency of the JSON serializer in use varies with the payload size. Note that while it is possible to filter on single transaction instances (e.g. based on trace.id), this is not recommended: To allow continuous profiling in production systems, the profiler runs with a low sampling rate of 20 Hz by default. At that rate, a transaction that burns, say, 50 ms of CPU contributes on average only a single sample, so looking at a single transaction execution will not yield enough data for typical real-world applications. Instead, we gain insights by monitoring multiple executions of a group of transactions over time and aggregating their samples, for example in a flamegraph.
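To illustrate the JSON serializer example, here is a small, hypothetical sketch of how an HTTP service could attach the payload size to the current transaction as an OpenTelemetry attribute, so that flamegraphs can later be filtered by it. The "payload.bytes" attribute name and the serializer interface are made up for this example:

import io.opentelemetry.api.trace.Span;

public class PayloadTagging {

    // Hypothetical serializer interface, standing in for whatever JSON library is used.
    public interface JsonSerializer {
        byte[] toJsonBytes(Object value);
    }

    // Serializes the response and records its size on the current transaction, so
    // profiling flamegraphs can later be filtered by payload size.
    public static byte[] serializeAndTag(Object dto, JsonSerializer serializer) {
        byte[] payload = serializer.toJsonBytes(dto);
        Span.current().setAttribute("payload.bytes", (long) payload.length);
        return payload;
    }
}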

Summary

A common reason for applications to degrade is overly high CPU usage. In this blog post, we showed how to combine universal profiling with APM to find the actual root cause in such cases: We explained how to analyze the CPU time using profiling flamegraphs on service and transaction levels. In addition, we further drilled down into data using custom filters. We used a simple demo application for this purpose, so go ahead and try it yourself with your own, real-world applications to uncover the actual power of the feature!