Troubleshoot your Universal Profiling
This functionality is in beta and is subject to change. The design and code are less mature than official GA features and are being provided as-is with no warranties. Beta features are not subject to the support SLA of official GA features.
You can use the host-agent logs to find errors.
The following is an example of a healthy host-agent output:
time="..." level=info msg="Starting Prodfiler Host Agent v2.4.0 (revision develop-5cce978a, build timestamp 12345678910)"
time="..." level=info msg="Interpreter tracers: perl,php,python,hotspot,ruby,v8"
time="..." level=info msg="Automatically determining environment and machine ID ..."
time="..." level=warning msg="Environment tester (gcp) failed: failed to get GCP metadata: Get \"http://169.254.169.254/computeMetadata/v1/instance/id\": dial tcp 169.254.169.254:80: i/o timeout"
time="..." level=warning msg="Environment tester (azure) failed: failed to get azure metadata: Get \"http://169.254.169.254/metadata/instance/compute?api-version=2020-09-01&format=json\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
time="..." level=warning msg="Environment tester (aws) failed: failed to get aws metadata: EC2MetadataRequestError: failed to get EC2 instance identity document\ncaused by: RequestError: send request failed\ncaused by: Get \"http://169.254.169.254/latest/dynamic/instance-identity/document\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
time="..." level=info msg="Environment: hardware, machine ID: 0xdeadbeefdeadbeef"
time="..." level=info msg="Assigned ProjectID: 5"
time="..." level=info msg="Start CPU metrics"
time="..." level=info msg="Start I/O metrics"
time="..." level=info msg="Found tpbase offset: 9320 (via x86_fsbase_write_task)"
time="..." level=info msg="Environment variable KUBERNETES_SERVICE_HOST not set"
time="..." level=info msg="Supports eBPF map batch operations"
time="..." level=info msg="eBPF tracer loaded"
time="..." level=info msg="Attached tracer program"
time="..." level=info msg="Attached sched monitor"
A host-agent deployment is working if the output of the following command is empty:
head -n 15 host-agent.log | grep "level=error"
If running this command outputs error-level logs, the following are possible causes:
- The host-agent is running on an unsupported version of the Linux kernel, or the kernel is missing required features.
If the host-agent is running on an unsupported kernel version, the following is logged:
Host Agent requires kernel version 4.15 or newer but got 3.10.0
If eBPF features are not available in the kernel, the host-agent fails to start, and one of the following is logged:
Failed to probe eBPF syscall
or
Failed to probe tracepoint
- The host-agent is not able to connect to Elastic Cloud. In this case, a message similar to the following is logged:
Failed to setup gRPC connection (retrying...): context deadline exceeded
Verify that the collection-agent configuration value is set and matches the value displayed in Kibana when you click Add Data.
- The secret token is not valid, or it has been changed. In this case, the host-agent shuts down and logs a message similar to the following:
rpc error: code = Unauthenticated desc = authentication failed
- The host-agent is unable to send data to your deployment. In this case, a message similar to the following is logged:
Failed to report hostinfo (retrying...): rpc error: code = Unimplemented desc = unknown service collectionagent.CollectionAgent
This typically means that your Elastic Cloud cluster has not been configured for Universal Profiling. To configure your Elastic Cloud cluster, follow the steps in configure data ingestion.
- The APM server (part of the backend in Elastic Cloud that receives data from the host-agent) ran out of memory. In this case, a message similar to the following is logged:
Error: failed to invoke XXX(): Unavailable rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 502 (Bad Gateway); transport: received unexpected content-type "application/json; charset=UTF-8"
Verify that the APM server is running by navigating to Elastic Cloud → Deployments → <Deployment Name> → Integrations Server. If the Copy endpoint link next to APM is grayed out, restart the APM server by clicking Force Restart under Integrations Server Management.
For non-demo workloads, verify that the Integrations Server has at least the recommended 4GB of RAM. You can check this on the Integrations Server page under Instances.
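The causes listed above can be matched against a log mechanically. The following is a minimal sketch, assuming the log contains the messages exactly as quoted in this guide; the function names `_hint` and `diagnose_host_agent_log` are hypothetical helpers and are not part of the host-agent.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not shipped with the host-agent. It only knows the
# error messages quoted in this guide and may miss other failures.

# Print "$3" and succeed if the log text "$1" contains the substring "$2".
_hint() {
  case "$1" in
    *"$2"*) echo "$3"; return 0 ;;
  esac
  return 1
}

# Read host-agent log lines on stdin and print one hint per matching message.
diagnose_host_agent_log() {
  local log found=1
  log="$(cat)"
  _hint "$log" "requires kernel version" "unsupported kernel version" && found=0
  _hint "$log" "Failed to probe eBPF syscall" "eBPF features missing in kernel" && found=0
  _hint "$log" "Failed to probe tracepoint" "eBPF features missing in kernel" && found=0
  _hint "$log" "Failed to setup gRPC connection" "cannot connect: check collection-agent value" && found=0
  _hint "$log" "code = Unauthenticated" "secret token invalid or changed" && found=0
  _hint "$log" "code = Unimplemented" "deployment not configured for Universal Profiling" && found=0
  _hint "$log" "502 (Bad Gateway)" "APM server unavailable: check Integrations Server" && found=0
  [ "$found" -eq 0 ] || echo "no known error signature found"
}
```

For example, `head -n 15 host-agent.log | diagnose_host_agent_log` prints a hint for each recognized message in the first 15 log lines.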
If you’re unable to find a solution to the host-agent failure, you can raise a support request indicating Universal Profiling and host-agent as the source of the problem.
Enable verbose logging in host-agent
During the support process, you may be asked to provide debug logs from one of the host-agent installations in your deployment.
To enable debug logs, add the -verbose command-line flag or the verbose true setting in the configuration file.
We recommend only enabling debug logs on a single instance of host-agent rather than an entire deployment to limit the amount of logs produced.
Improve load times
The amount of data loaded for the flamegraph, topN functions, and traces views can cause noticeable latency on a slow connection (for example, DSL or mobile).
Setting the Kibana cluster option server.compression.brotli.enabled: true reduces the amount of data transferred and should reduce load time.
Troubleshoot host-agent Kubernetes deployments
When the Helm chart installation finishes, the output has instructions on how to check the host-agent pod status and read logs. The following sections describe scenarios in which the host-agent installation may not be healthy.
Taints
Kubernetes clusters often include taints and tolerations in their setup. In these cases, a host-agent installation may show no pods or very few pods running, even for a large cluster.
This is because a taint prevents pods from running on a node unless the workload tolerates that taint.
The Helm chart tolerations key in the values.yaml sets the toleration of taints using the official Kubernetes scheduling API format.
The following examples provide a tolerations config that you can add to the Helm chart values.yaml:
- To deploy the host-agent on all nodes with taint workload=python:NoExecute, add the following to the values.yaml:
tolerations:
  - key: "workload"
    value: "python"
    effect: "NoExecute"
- To deploy the host-agent on all nodes tainted with key production and effect NoSchedule (no value provided), add the following to the values.yaml:
tolerations:
  - key: "production"
    effect: "NoSchedule"
    operator: Exists
- To deploy the host-agent on all nodes, tolerating all taints, add the following to the values.yaml:
tolerations:
  - effect: NoSchedule
    operator: Exists
  - effect: NoExecute
    operator: Exists
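Before choosing one of the tolerations above, it can help to see which taints the nodes actually carry and how many host-agent pods were scheduled. The following is a minimal sketch using standard `kubectl get` options. The function names are hypothetical, the `KUBECTL` override variable is an assumption added for illustration, and the namespace is whatever you installed the Helm chart into.

```shell
#!/usr/bin/env bash
# KUBECTL is an illustrative override (for example, to point at a wrapper).
KUBECTL="${KUBECTL:-kubectl}"

# List every node with the keys and effects of its taints.
list_node_taints() {
  "$KUBECTL" get nodes \
    -o 'custom-columns=NODE:.metadata.name,TAINT_KEYS:.spec.taints[*].key,TAINT_EFFECTS:.spec.taints[*].effect'
}

# Show desired vs. ready pod counts for DaemonSets in the given namespace.
# $1: namespace used for the Helm chart installation (supplied by the caller).
show_daemonset_status() {
  "$KUBECTL" get daemonset --namespace "$1"
}
```

If the DESIRED count reported for the host-agent DaemonSet is much lower than the number of nodes, the taints listed by `list_node_taints` are a likely explanation.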
Security policy enforcement
Some Kubernetes clusters are configured with hardened security add-ons to limit the blast radius of exploited application vulnerabilities.
Different hardening methodologies can impair host-agent operations and may, for example, result in pods continuously restarting after displaying a CrashLoopBackOff status.
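When a pod is restarting in a loop, the output of the crashed container and the pod events usually show what blocked it. The following is a minimal sketch using standard `kubectl describe` and `kubectl logs` options. The function names and the `KUBECTL` override variable are hypothetical, and the namespace and pod name are supplied by the caller.

```shell
#!/usr/bin/env bash
# KUBECTL is an illustrative override (for example, to point at a wrapper).
KUBECTL="${KUBECTL:-kubectl}"

# Show container state and recent events for a pod.
# $1: namespace, $2: pod name.
describe_pod() {
  "$KUBECTL" describe pod "$2" --namespace "$1"
}

# Print the last 50 lines logged by the previous (crashed) container instance.
# $1: namespace, $2: pod name.
previous_logs() {
  "$KUBECTL" logs "$2" --namespace "$1" --previous --tail=50
}
```

The `--previous` flag matters here: without it, `kubectl logs` reads from the container that has just been restarted, which may not have logged the failure yet.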
Kubernetes PodSecurityPolicy (deprecated)
This Kubernetes API has been deprecated, but some clusters still use it. A PodSecurityPolicy (PSP) may explicitly prevent the execution of privileged containers across the entire cluster.
Because the host-agent needs privileges on most kernels and container runtimes (CRI), you need to create a PSP that allows the host-agent DaemonSet to run.
Kubernetes policy engines
Read more about Kubernetes policy engines in the SIG-Security documentation.
The following tools may prevent the execution of host-agent pods, because the Helm chart creates a cluster role and binds it to the host-agent service account (it is used to retrieve container metadata):
- Open Policy Agent Gatekeeper
- Kyverno
- Fairwinds Polaris
If you have a policy engine in place, configure it to allow the host-agent execution and RBAC configs.
Network configurations
In some cases, the host-agent pods run without errors but cannot connect to the remote data collector's gRPC interface. They stay in the startup phase and retry the connection periodically.
The following are potential causes:
- Kubernetes NetworkPolicies define connectivity rules that prevent all outgoing traffic unless explicitly allow-listed.
- Cloud or datacenter provider network rules are restricting egress traffic to allowed destinations only (ACLs).
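A quick way to tell a blocked network path from other failures is to attempt a plain TCP connection to the collector endpoint. The following is a minimal sketch, assuming `bash` and the coreutils `timeout` command are available. The function name `can_connect` is hypothetical, and the host and port are the ones from your own collection-agent setting. Run it from a network context that is subject to the same rules as the host-agent pods, otherwise the result says little about NetworkPolicies.

```shell
#!/usr/bin/env bash
# Return 0 if a TCP connection to host:port can be opened within 5 seconds,
# non-zero otherwise. This checks reachability only, not TLS or gRPC.
# $1: host, $2: port.
can_connect() {
  local host="$1" port="$2"
  # bash opens a TCP socket when redirecting to /dev/tcp/<host>/<port>.
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

A failure here points at NetworkPolicies or provider ACLs. A success with the host-agent still unable to connect suggests looking at the endpoint value or the secret token instead.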
OS-level security
These settings are not part of Kubernetes and may have been included in the node setup. They can prevent the host-agent from working properly, as they intercept syscalls from the host-agent to the kernel and modify or block them.
If you have implemented security hardening, make sure you know which privileges the host-agent needs. Some common hardening mechanisms are:
- gVisor on GKE
- seccomp filters
- AppArmor LSM
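To check whether a seccomp filter applies to a running process, Linux exposes the mode in the `Seccomp:` field of `/proc/<pid>/status` (0 is disabled, 1 is strict, 2 is filter). The following is a minimal sketch that reads this field. The function name `seccomp_mode` is hypothetical, and finding the PID of the host-agent process is left to the caller.

```shell
#!/usr/bin/env bash
# Print the seccomp mode of a process: disabled, strict, filter, or unknown.
# $1: process ID to inspect.
seccomp_mode() {
  local pid="$1" mode
  # Match the "Seccomp:" field exactly, not the newer "Seccomp_filters:" field.
  mode="$(awk '/^Seccomp:/ {print $2}' "/proc/$pid/status" 2>/dev/null || true)"
  case "$mode" in
    0) echo "disabled" ;;
    1) echo "strict" ;;
    2) echo "filter" ;;
    *) echo "unknown" ;;
  esac
}
```

A result of `filter` means that some syscalls may be blocked or altered before they reach the kernel, which matches the behavior described above.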
Submit a support request
You can submit a support request from the support request page in the Elastic Cloud console.
In the support request, specify whether your issue concerns the host-agent or the Kibana app.
Send feedback
If troubleshooting and support do not resolve your issues, or you have any other feedback that you want to share about the product, send the Universal Profiling team an email at profiling-feedback@elastic.co.