The 3 pillars of observability: Unified logs, metrics, and traces

Understanding telemetry signals for better decision-making, improved performance, and enhanced customer experiences


Telemetry signals have evolved significantly over the years — if you blinked, you could have missed it. In fact, much of the common wisdom about observability needs a refresh. If your observability solution doesn’t consider the current state of telemetry, you might need an upgrade. Monitoring and observability in software systems have transformed in several phases alongside technological advancements, more complex systems, and new approaches to monitoring.

As always, data is at the heart of the story. The evolution of telemetry closely follows the evolution of computing. As the need for data expands from an indicator of mechanical performance to an indicator (or several indicators) of business performance, so do telemetry signals. Computing has evolved to distributed cloud environments, so monitoring practices must evolve to meet the complexity of these architectures and sprawling digital environments.

Enter: Observability — the ability to not only observe your environment but also understand it. 

What are the 3 pillars of observability?

Three telemetry signals are the foundational pillars of observability: metrics, logs, and traces. But for modern observability, these might not be enough. So, we propose a new, fourth pillar: profiling.

Metrics, traces, logs, and profiles are the cogs that enable you to do observability right: to monitor performance and productivity and, in turn, produce actionable insights that support profitability. They’re foundations for a reason. Misunderstand them, and you’re gone with the wind. Let’s dig in.

What are metrics? (Or, what’s happening?)

Metrics are the raw numeric data collected from various sources, such as hardware, application software, and websites, measuring known knowns. They provide information about resource usage, performance, and user behavior. Metrics are the original telemetry signals of monitoring and can be categorized into several types:

  1. Host metrics: CPU, memory, and disk usage

  2. Application metrics: Response times, error rates, and request rates

  3. Network performance metrics: Throughput, uptime, latency, availability, and more

  4. Server pool metrics: Scaling-adjustment metrics, number of running instances, and total number of instances 

  5. External dependencies metrics: Service status, response speed, availability, and error rates

These metrics are necessary for setting alerting, warning, and error condition thresholds. They enable teams to surveil system and network performance overall and identify issues when they arise. In this way, metrics inform a reactive stance in monitoring.
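To make the thresholding idea concrete, here is a minimal Python sketch. The `HostMetrics` class and the threshold values are illustrative assumptions, not taken from any particular monitoring tool:

```python
# Hypothetical sketch: evaluating host metrics against warning/error thresholds.
from dataclasses import dataclass

@dataclass
class HostMetrics:
    cpu_percent: float
    memory_percent: float
    disk_percent: float

# (warning, error) thresholds per metric, in percent used — invented values.
THRESHOLDS = {
    "cpu_percent": (70.0, 90.0),
    "memory_percent": (75.0, 95.0),
    "disk_percent": (80.0, 95.0),
}

def evaluate(metrics: HostMetrics) -> dict:
    """Classify each metric as ok, warning, or error."""
    status = {}
    for name, (warn, err) in THRESHOLDS.items():
        value = getattr(metrics, name)
        if value >= err:
            status[name] = "error"
        elif value >= warn:
            status[name] = "warning"
        else:
            status[name] = "ok"
    return status
```

For example, a host at 85% CPU, 40% memory, and 96% disk would report a CPU warning and a disk error, which an alerting pipeline could then route to the right team.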

What are logs? (Why is it happening?)

Logs are structured and unstructured data from your infrastructure, applications, networks, and systems made up of timestamped entries relating to specific events. Different types of devices and systems emit logs:

  • Network devices 

  • Operating systems 

  • Applications

  • IoT devices

  • Third-party applications

Each of these sources generates its own type of log, be it a system log, which records events such as connection attempts, errors, and configuration changes, or an application log, which records anything from software changes to CRUD operations, application authentication, and more. Storing logs is complicated due to the number of log sources (consider how many of the devices, applications, and systems listed above your organization runs on), inconsistent formats and timestamps, and — with increased requests, use, and traffic — behemoth volumes of log data. 

Logs have evolved from being used primarily for troubleshooting problems to now serving many functions, such as recording the actions of users, optimizing system and network performance, and providing data that is useful for security monitoring investigations. When debugging issues and events, IT practitioners will refer to logs to pinpoint and remedy them. While logs themselves are flexible and filled with information — most of it unnecessary, generating a lot of noise — log data stores are siloed and not inherently correlated.
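The inconsistent-format problem above can be sketched in a few lines of Python. The field names (`ts`, `msg`) and the two log shapes are invented for illustration; real pipelines handle far more variants:

```python
# Illustrative sketch: normalizing a structured (JSON) log line and an
# unstructured plain-text line into one common timestamped shape.
import json
from datetime import datetime, timezone

def normalize(line: str) -> dict:
    """Return {"timestamp": datetime, "message": str} for either log style."""
    try:
        # Structured log, e.g. {"ts": "2024-05-01T12:00:00+00:00", "msg": "login ok"}
        record = json.loads(line)
        ts = datetime.fromisoformat(record["ts"])
        msg = record["msg"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Unstructured log, e.g. "2024-05-01T12:00:05+00:00 Connection refused"
        ts_text, _, msg = line.partition(" ")
        ts = datetime.fromisoformat(ts_text)
    return {"timestamp": ts.astimezone(timezone.utc), "message": msg}
```

Once every entry shares the same timestamp and message fields, events from different sources can finally be ordered and compared on one timeline.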

What are traces? (Where is it happening?)

Traces are the first signal to show the application from a user’s perspective by recording the actions a user performs while using an application or service. Distributed tracing is a way of monitoring requests as they make their way through a distributed architecture, from the front end to the back end and back again. In this way, traces are necessary for distributed, cloud-native applications that rely on complex technologies, such as Kubernetes.

By pulling together siloed data, traces help discover bottlenecks and debug and monitor applications that rely on multiple shared resources, such as disks, networks, and mutual exclusions (mutexes). Traces also lay the groundwork for monitoring to take a proactive stance. For example, SREs, ITOps, and DevOps teams can analyze the trace data to discover the most useful metrics or logs related to any given issue and mitigate it in the future.  

Metrics, logs, and traces each provide a different set of data that can answer a user’s questions about their system and/or application. Without correlating the data, however, users face a variety of blind spots. One more piece of information can help fill the gap. Enter the next step in the evolution of telemetry signals: profiling.

What are profiles?

Profiles are collections of stack traces associated with a code performance issue, each annotated with the number of times that stack trace was encountered. They offer a deeper level of insight than other signals, enabling the identification of obscure issues related to data structures, memory allocation, and code execution, with visibility at both the kernel and userspace levels. The act of gathering and analyzing profiles is referred to as profiling. 

Profiling provides an unprecedented breadth and depth of visibility, unlocking the ability to uncover the unknown unknowns within your system. In so doing, it’s the final piece of comprehensive system-wide visibility. That’s why it’s our fourth pillar of observability.
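Conceptually, a profile boils down to a count per unique stack trace. The sketch below uses invented sample stacks; the semicolon-joined format mirrors the "folded stack" input commonly fed to flame graphs:

```python
# Sketch of what a profile is: occurrence counts per unique stack trace.
from collections import Counter

# Each sample is the call stack observed at one profiling tick, root frame first.
samples = [
    "main;handle_request;parse_json",
    "main;handle_request;parse_json",
    "main;handle_request;query_db",
    "main;gc",
]

def build_profile(stacks: list) -> Counter:
    """Aggregate raw stack samples into (stack trace -> occurrence count)."""
    return Counter(stacks)

profile = build_profile(samples)
# The hottest stack is the one sampled most often.
hottest, count = profile.most_common(1)[0]
```

Here the profile would show that `parse_json` dominates the samples, pointing the investigation at code that no metric, log, or trace would have singled out.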

Connection between observability and telemetry signals

Telemetry signals are the foundation of observability. Observability is achieved by collecting, analyzing, and visualizing telemetry signals to understand the internal state of a system.

Phase 1: Basic monitoring and observability

Before the cloud, monitoring was manual, local, and limited. As computers moved from a specialized business sphere — yes, computers were niche office machines barely 50 years ago — to homes and offices everywhere, monitoring also evolved as a crucial IT practice and the first step toward observability.

To meet the increase in data circulating within servers and networks, IT engineers developed monitoring tools that automate detection. With the arrival of open source monitoring tools, basic network and desktop monitoring moved from local to web-based. Now, telemetry signals are about more than monitoring operations and local traffic — they are necessary business metrics, which is how you should think of metrics, logs, and traces today. Are people able to make purchases online without glitches? How fast is the connection? Have there been any data breaches? The answers to these questions are directly related to profitability through customer experience and satisfaction. 

And since websites now receive customer data, security monitoring is an increasingly important component of any monitoring solution.

Phase 2: Application monitoring and observability

In a quest for increased scalability, developers move from monolithic applications to service-oriented architectures and on to microservices and serverless. 

Moving from tightly coupled components to modular ones while improving application flexibility so that developers can quickly scale means there is a significant increase in the observability signals necessary for monitoring. The need to monitor not only the infrastructure but also the application itself adds response times, throughput, and error rates to the list of telemetry signals required for effective monitoring. Application performance monitoring (APM) has emerged as a new, necessary practice and set of tools that will become crucial to observability.

Phase 3: Monitoring in the cloud and observability

The transition to cloud-native technologies adds complexity to monitoring practices in the same way that the evolution of application development paradigms did — the volume and variety of telemetry signals keeps growing. 

In distributed architectures, new telemetry signals are necessary to ensure businesses are meeting customer expectations for fast, seamless, and secure online experiences. Here, we meet the great three: metrics, logs, and traces. These signals capture the underlying activity of an operating system and its applications, and together, ultimately enable businesses to achieve modern observability.

Phase 4: The evolution of telemetry signals for observability

Microservices, serverless, cloud-native. In a world driven by data, understanding the data that drives the data is crucial. From meeting profitability goals to productivity targets, metrics, logs, and traces are the links in an operations chain — they contain the data your business needs to deliver on its SLAs.

While metrics, logs, traces, and now, profiles all play crucial roles in monitoring practices, manually reviewing this data for debugging is, if not impossible, ineffective. And though automation relieves some pressure, it does generate a significant amount of alert noise. The variety of data types and sources also makes managing the flow of information and correlating it for actionable insights that much more difficult. 

So, we’re back to data (though we never left it) telling the story. With teams left in a constant state of ‘playing catch-up,’ monitoring must evolve to meet the demands of an extraordinary data influx and the need for information inextricably linked to business performance. The solution? A unified telemetry data platform.

Modern observability

Modern observability is the practice of observing the overall behavior of a distributed system to understand its internal state. An effective modern observability solution reunites metrics, logs, and traces on one unified telemetry platform, enabling IT practitioners to gain a holistic view of their tech environment and generate actionable insights that impact system resilience and productivity. 

With data- and information-heavy capabilities, a unified telemetry data platform requires storing as much data as possible to extract insights and analytics when needed or desired. As a result, an effective modern observability solution encourages both proactive and reactive investigations. Correlating data — metrics, logs, traces, and profiles — enables teams to address known knowns and uncover unknown unknowns. In other words, a successful modern observability solution positions teams to quickly answer the “why.”

So, how do you achieve modern observability? Consider open standards, and OpenTelemetry in particular, to organize and analyze your data. OpenTelemetry helps SREs generate observability data in consistent, open-standards formats for easier analysis and storage while minimizing incompatibility between vendor data types. Most industry analysts believe that OpenTelemetry will become the de facto standard for observability data in the next five years. Unified data collection with consistent data standards across all your telemetry signals means you will be future-proofing your observability. In broader terms, you’ll also need to:

  1. Create a baseline: Establish consistency in your formats and data for all your telemetry data.

  2. Filter data: Abstract out vendor inconsistencies for end users.

  3. Consolidate your tools: Ensure that you have an overarching observability strategy.
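As an illustration of the first two steps, the sketch below renames vendor-specific fields onto one baseline schema and rejects records that cannot be made consistent. The vendor field names and the baseline schema are hypothetical, invented for the example:

```python
# Hypothetical sketch: abstracting vendor field-name inconsistencies behind
# one baseline telemetry schema.
BASELINE_FIELDS = {"timestamp", "service", "severity", "body"}

# Per-vendor renames into the baseline schema (all names are made up).
VENDOR_MAPPINGS = {
    "vendor_a": {"@timestamp": "timestamp", "svc": "service",
                 "level": "severity", "message": "body"},
    "vendor_b": {"time": "timestamp", "app": "service",
                 "sev": "severity", "log": "body"},
}

def to_baseline(vendor: str, record: dict) -> dict:
    """Rename a vendor record's fields to the baseline, dropping extras."""
    mapping = VENDOR_MAPPINGS[vendor]
    out = {mapping[key]: value for key, value in record.items() if key in mapping}
    missing = BASELINE_FIELDS - out.keys()
    if missing:
        raise ValueError(f"record missing baseline fields: {missing}")
    return out
```

Downstream dashboards and alert rules then only ever see `timestamp`, `service`, `severity`, and `body`, regardless of which tool emitted the record — which is the point of establishing a baseline before consolidating tools.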

Telemetry data comes in many forms: cloud-native apps produce one variety, while serverless environments and virtual machines produce others. What’s more, this data is produced at high velocity. So, achieving observability means unifying data, which ensures relevant, accurate, and fast insights.

The future of telemetry signals is all about predictability

It’s predictable: As the tech landscape evolves, so too will the telemetry signals required to ensure that systems are operating optimally and that resources are spent efficiently. Today, telemetry signals are plugged into sophisticated systems that leverage AI/ML for predictive insights and automated responses. This trend is here to stay. 

Sprawling, distributed environments mean monitoring has become difficult, read: ineffective. As a result, monitoring must expand its reach into the realm of observability. To achieve observability, organizations are consolidating monitoring tools onto a unified platform to alleviate some of the strain and obtain a holistic view of system performance. Basic performance metrics remain vital for troubleshooting and optimization while an iterative approach helps improve application resiliency. And by uniting siloed data, organizations can shift from reactive to proactive monitoring, relying on both real-time and historical telemetry data. 

With the rise of generative AI, companies are feeling the pressure to secure their systems more than ever. So, integrating security signals into observability frameworks is becoming more important with a focus on real-time threat detection and response. And as generative AI technologies evolve, we’re likely to see self-healing systems driven by advanced automated remediation capabilities. 

But before any of these next levels are unlocked, organizations have to lay the groundwork by building strong data foundations with open standards — after all, the single constant is change. Who wants vendor lock-in?

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.