Troubleshoot your Universal Profiling

This functionality is in beta and is subject to change. The design and code are less mature than official GA features and are being provided as-is with no warranties. Beta features are not subject to the support SLA of official GA features.

You can use the host-agent logs to find errors.

The following is an example of a healthy host-agent output:

time="..." level=info msg="Starting Prodfiler Host Agent v2.4.0 (revision develop-5cce978a, build timestamp 12345678910)"
time="..." level=info msg="Interpreter tracers: perl,php,python,hotspot,ruby,v8"
time="..." level=info msg="Automatically determining environment and machine ID ..."
time="..." level=warning msg="Environment tester (gcp) failed: failed to get GCP metadata: Get \"http://169.254.169.254/computeMetadata/v1/instance/id\": dial tcp 169.254.169.254:80: i/o timeout"
time="..." level=warning msg="Environment tester (azure) failed: failed to get azure metadata: Get \"http://169.254.169.254/metadata/instance/compute?api-version=2020-09-01&format=json\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
time="..." level=warning msg="Environment tester (aws) failed: failed to get aws metadata: EC2MetadataRequestError: failed to get EC2 instance identity document\ncaused by: RequestError: send request failed\ncaused by: Get \"http://169.254.169.254/latest/dynamic/instance-identity/document\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
time="..." level=info msg="Environment: hardware, machine ID: 0xdeadbeefdeadbeef"
time="..." level=info msg="Assigned ProjectID: 5"
time="..." level=info msg="Start CPU metrics"
time="..." level=info msg="Start I/O metrics"
time="..." level=info msg="Found tpbase offset: 9320 (via x86_fsbase_write_task)"
time="..." level=info msg="Environment variable KUBERNETES_SERVICE_HOST not set"
time="..." level=info msg="Supports eBPF map batch operations"
time="..." level=info msg="eBPF tracer loaded"
time="..." level=info msg="Attached tracer program"
time="..." level=info msg="Attached sched monitor"

A host-agent deployment is working if the output of the following command is empty:

head -n 15 host-agent.log | grep "level=error"

If running this command outputs error-level logs, the following are possible causes:

  • The host-agent is running on an unsupported version of the Linux kernel, or the kernel is missing required features.

    If the host-agent is running on an unsupported kernel version, the following is logged:

    Host Agent requires kernel version 4.15 or newer but got 3.10.0

    If eBPF features are not available in the kernel, the host-agent fails to start, and one of the following is logged:

    Failed to probe eBPF syscall

    or

    Failed to probe tracepoint

  • The host-agent is not able to connect to Elastic Cloud. In this case, a message similar to the following is logged:

    Failed to setup gRPC connection (retrying...): context deadline exceeded

    Verify that the collection-agent configuration value is set and matches the value shown in Kibana when you click Add Data.

  • The secret token is not valid, or it has been changed. In this case, the host-agent shuts down and logs a message similar to the following:

    rpc error: code = Unauthenticated desc = authentication failed

  • The host-agent is unable to send data to your deployment. In this case, a message similar to the following is logged:

    Failed to report hostinfo (retrying...): rpc error: code = Unimplemented desc = unknown service collectionagent.CollectionAgent

    This typically means that your Elastic Cloud cluster has not been configured for Universal Profiling. To configure your Elastic Cloud cluster, follow the steps in configure data ingestion.

  • The APM server (part of the backend in Elastic Cloud that receives data from the host-agent) ran out of memory. In this case, a message similar to the following is logged:

    Error: failed to invoke XXX(): Unavailable rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 502 (Bad Gateway); transport: received unexpected content-type "application/json; charset=UTF-8"

    Verify that the APM server is running by navigating to Elastic Cloud → Deployments → <Deployment Name> → Integrations Server. If the Copy endpoint link next to APM is grayed out, restart the APM server by clicking Force Restart under Integrations Server Management.

    For non-demo workloads, verify that the Integrations Server has at least the recommended 4 GB of RAM. You can check this on the Integrations Server page under Instances.
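
The checks above can be sketched as a small triage script. This is a minimal sketch, not an official tool: the substrings are copied from the messages quoted above, while the function names, the cause descriptions, and the sample error line (whose exact format is an assumption) are illustrative only.

```python
from itertools import islice

# Substrings taken from the messages quoted above, paired with the likely cause.
# The pairing is an illustrative summary, not an exhaustive or official list.
KNOWN_FAILURES = [
    ("requires kernel version", "unsupported Linux kernel version"),
    ("Failed to probe eBPF syscall", "eBPF features missing from the kernel"),
    ("Failed to probe tracepoint", "eBPF features missing from the kernel"),
    ("Failed to setup gRPC connection", "cannot connect to Elastic Cloud"),
    ("code = Unauthenticated", "secret token invalid or changed"),
    ("unknown service collectionagent.CollectionAgent",
     "deployment not configured for Universal Profiling"),
    ("502 (Bad Gateway)", "APM server may be out of memory or not running"),
]


def diagnose(line):
    """Return the likely cause for one log line, or None if nothing matches."""
    for needle, cause in KNOWN_FAILURES:
        if needle in line:
            return cause
    return None


def triage(lines, limit=15):
    """Inspect the first `limit` lines and return (line, cause) pairs."""
    findings = []
    for line in islice(lines, limit):
        cause = diagnose(line)
        if cause is None and "level=error" in line:
            # Error-level line that matches none of the known messages.
            cause = "unrecognized error; consider a support request"
        if cause is not None:
            findings.append((line.rstrip("\n"), cause))
    return findings


# Hypothetical input: the second line is composed from the kernel message
# quoted above; the real log line format may differ.
sample = [
    'time="..." level=info msg="eBPF tracer loaded"',
    'time="..." level=error msg="Host Agent requires kernel version 4.15 or newer but got 3.10.0"',
]
for line, cause in triage(sample):
    print(cause)
```

Because `triage` accepts any iterable of lines, an open log file can be passed directly, for example `triage(open("host-agent.log"))`.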

If you’re unable to find a solution to the host-agent failure, you can raise a support request indicating Universal Profiling and host-agent as the source of the problem.

Enable verbose logging in host-agent

During the support process, you may be asked to provide debug logs from one of the host-agent installations from your deployment.

To enable debug logs, add the -verbose command-line flag or the verbose true setting in the configuration file.

We recommend enabling debug logs on a single host-agent instance only, rather than on an entire deployment, to limit the volume of logs produced.

Improve load times

The amount of data loaded for the flamegraph, topN functions, and traces views can cause noticeable latency on a slow connection (for example, DSL or mobile).

Setting the Kibana cluster option server.compression.brotli.enabled: true reduces the amount of data transferred and should reduce load time.

Troubleshoot host-agent Kubernetes deployments

When the Helm chart installation finishes, the output has instructions on how to check the host-agent pod status and read logs. The following sections describe scenarios in which a host-agent installation may be unhealthy.

Taints

Kubernetes clusters often include taints and tolerations in their setup. In these cases, a host-agent installation may show no pods or very few pods running, even for a large cluster.

This is because a taint prevents pods from running on a node unless the pod tolerates that taint. The tolerations key in the Helm chart's values.yaml sets the tolerations for the host-agent pods, using the official Kubernetes scheduling API format.

The following examples provide a tolerations config that you can add to the Helm chart values.yaml:

  • To deploy the host-agent on all nodes with taint workload=python:NoExecute, add the following to the values.yaml:

    tolerations:
    - key: "workload"
      value: "python"
      effect: "NoExecute"
  • To deploy the host-agent on all nodes tainted with key production and effect NoSchedule (no value provided), add the following to the values.yaml:

    tolerations:
      - key: "production"
        effect: "NoSchedule"
        operator: Exists
  • To deploy the host-agent on all nodes, tolerating all taints, add the following to the values.yaml:

    tolerations:
      - effect: NoSchedule
        operator: Exists
      - effect: NoExecute
        operator: Exists
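
To reason about which nodes a given tolerations block covers, the matching rules can be modeled in a few lines. The following is a simplified sketch of the rules described in the Kubernetes taints and tolerations documentation, not the scheduler's implementation; the function names are hypothetical, and the dictionaries mirror the three values.yaml examples above.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    operator = toleration.get("operator", "Equal")  # Equal is the default
    key = toleration.get("key", "")
    effect = toleration.get("effect", "")

    # An empty effect matches every effect.
    if effect and effect != taint["effect"]:
        return False
    # An empty key is only meaningful with Exists and matches every key.
    if not key:
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(tolerations, taints):
    """A pod can land on a node only if every blocking taint is tolerated."""
    # PreferNoSchedule is a soft preference, so it is ignored here.
    blocking = [t for t in taints if t["effect"] != "PreferNoSchedule"]
    return all(any(tolerates(tol, taint) for tol in tolerations)
               for taint in blocking)


# The three tolerations examples above, expressed as Python data.
first = [{"key": "workload", "value": "python", "effect": "NoExecute"}]
second = [{"key": "production", "effect": "NoSchedule", "operator": "Exists"}]
third = [{"effect": "NoSchedule", "operator": "Exists"},
         {"effect": "NoExecute", "operator": "Exists"}]

python_node = [{"key": "workload", "value": "python", "effect": "NoExecute"}]
for name, tolerations in (("first", first), ("second", second), ("third", third)):
    print(name, schedulable(tolerations, python_node))
```

In this model, a node tainted with workload=python:NoExecute is covered by the first and third examples but not by the second, which is consistent with seeing few or no host-agent pods when the tolerations do not match the cluster's taints.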
Security policy enforcement

Some Kubernetes clusters are configured with hardened security add-ons to limit the blast radius of exploited application vulnerabilities. Different hardening methodologies can impair host-agent operations and may, for example, result in pods restarting continuously and showing a CrashLoopBackOff status.

Kubernetes PodSecurityPolicy (deprecated)

This Kubernetes API has been deprecated, but some clusters still use it. A PodSecurityPolicy (PSP) may explicitly prevent the execution of privileged containers across the entire cluster.

Because the host-agent needs privileges on most kernels and container runtimes (CRI), you need to create a PSP that allows the host-agent DaemonSet to run.

Kubernetes policy engines

Read more about Kubernetes policy engines in the SIG-Security documentation.

The following tools may prevent the execution of host-agent pods, because the Helm chart builds a cluster role and binds it to the host-agent service account (used to retrieve container metadata):

  • Open Policy Agent Gatekeeper
  • Kyverno
  • Fairwinds Polaris

If you have a policy engine in place, configure it to allow the host-agent execution and RBAC configs.

Network configurations

In some instances, your host-agent pods may be running, but they cannot connect to the remote data collector gRPC interface. They stay in the startup phase and retry the connection periodically.

The following are potential causes:

  • Kubernetes NetworkPolicies define connectivity rules that prevent all outgoing traffic unless explicitly allow-listed.
  • Cloud or datacenter provider network rules are restricting egress traffic to allowed destinations only (ACLs).
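
To tell a blocked network path apart from a configuration problem, a plain TCP connection attempt from a pod or node in the same network can help. The sketch below is an assumption-laden illustration: `can_connect` is a hypothetical helper, the host and port are placeholders for your own collection-agent endpoint, and a successful TCP connection does not prove that TLS or gRPC will work.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example call (placeholder endpoint, replace with your own host and port):
# can_connect("collector.example.com", 443)
```

If this returns False from inside the cluster but True from a machine outside it, a NetworkPolicy or provider-level egress rule is a likely candidate.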
OS-level security

These settings are not part of Kubernetes and may have been included in the node setup. They can prevent the host-agent from working properly, as they intercept syscalls from the host-agent to the kernel and modify or block them.

If you have implemented security hardening, for example with one of the mechanisms listed below, be aware of the privileges that the host-agent needs.

  • gVisor on GKE
  • seccomp filters
  • AppArmor LSM

Submit a support request

You can submit a support request from the support request page in the Elastic Cloud console.

In the support request, specify whether your issue concerns the host-agent or the Kibana app.

Send feedback

If troubleshooting and support do not resolve your issues, or you have any other feedback about the product, email the Universal Profiling team at profiling-feedback@elastic.co.