The 3 Pillars of Observability: What to Do When Your System is a Black Box

Hey readers,

I once spent six hours debugging an error in our new microservices architecture. A user would click “Purchase,” and a generic “Something went wrong” error would pop up. The request was passing through five different services, and from the outside, the whole system was a black box. We had no idea where the failure was happening. We were flying blind.

That’s the dark side of distributed systems. When they work, they’re beautiful. When they break, they can be nearly impossible to debug without the right tools.

This is where Observability comes in. Observability is the practice of instrumenting your systems to give you the data you need to understand what’s happening on the inside. It’s about turning that black box into a glass box.

To diagnose a sick patient, a doctor doesn’t just guess. They collect data: they check vital signs, read the patient’s history, and trace the path of their symptoms. Observability does the same for your software, and it rests on three pillars.

1. Logging (The Patient’s Diary)

Logs are detailed, timestamped records of discrete events that happened within a service. They are the most common and oldest form of instrumentation.

Imagine you’ve asked a patient to keep a diary of their symptoms. It might read:

10:00 AM: Woke up feeling fine.
10:15 AM: Received a request to process payment for order #123.
10:16 AM: ERROR: Failed to connect to the payment provider.
10:17 AM: Sent a 'failed' response back to the order service.

Logs are incredibly useful for drilling down into a specific error or a specific request. They give you the ground-level, detailed story of what happened.
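To make the diary idea concrete, here is a minimal sketch using Python’s standard logging module. The service name, the process_payment function, and the provider_up flag are made up for illustration; a real service would call an actual payment provider instead of checking a flag.

```python
import io
import logging


def build_logger(stream):
    """Return a logger that writes timestamped lines to the given stream."""
    logger = logging.getLogger("payment-service")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    return logger


def process_payment(logger, order_id, provider_up):
    """Log each discrete step of a simulated payment attempt."""
    logger.info("Received request to process payment for order #%s", order_id)
    if not provider_up:  # stand-in for a real connection failure
        logger.error("Failed to connect to the payment provider")
        logger.info("Sent 'failed' response back to the order service")
        return False
    logger.info("Payment for order #%s succeeded", order_id)
    return True


buffer = io.StringIO()
log = build_logger(buffer)
ok = process_payment(log, 123, provider_up=False)
print(buffer.getvalue())
```

Each call produces one timestamped line, so the output reads much like the diary above: a request arrives, the connection fails, and a failed response goes back.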

The Challenge: In a microservices architecture, you might have hundreds of services, each producing its own log files. Searching through them one service at a time quickly becomes impractical. This is why tools like Splunk, Datadog, or the ELK Stack (Elasticsearch, Logstash, Kibana) are used to aggregate all logs into one central, searchable place.

2. Metrics (The Vital Signs)

Metrics are high-level, numeric data aggregated over a period of time. They are your system’s vital signs.

A doctor monitoring a patient isn’t reading a diary; they’re looking at a chart showing heart rate, blood pressure, and temperature over the last 24 hours. They are looking for trends, spikes, and anomalies.

In software, metrics answer questions like:

What is our average CPU usage? (Is the patient’s heart racing?)
How many errors are we seeing per minute? (Is the patient’s temperature rising?)
What is the 95th percentile response time for our API? (How long is it taking for the patient to respond to questions?)

The Power of Metrics: Metrics are much more efficient to store and query than logs. They are perfect for building dashboards and, most importantly, for setting up alerts. You don’t want to find out your site is down from a user; you want an alert to fire the moment your error rate spikes. This is what tools like Prometheus (for collecting metrics and alerting on them) and Grafana (for dashboards) are for.
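Here is a rough sketch of the arithmetic behind two of those questions: an error rate with a simple alert threshold, and a 95th percentile response time. The sample numbers and the 5% threshold are invented for illustration, and the percentile uses the nearest-rank method, which is only one of several common definitions. In practice a tool like Prometheus would do this work for you.

```python
import math
from statistics import mean


def error_rate(total_requests, failed_requests):
    """Fraction of requests that failed in the measured window."""
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def should_alert(rate, threshold=0.05):
    """Fire when the error rate exceeds the (assumed) 5% threshold."""
    return rate > threshold


# Hypothetical response times in milliseconds for one minute of traffic.
latencies_ms = [20, 22, 25, 30, 31, 35, 40, 45, 50, 500]

avg = mean(latencies_ms)
p95 = percentile(latencies_ms, 95)
rate = error_rate(total_requests=200, failed_requests=14)

print(f"average latency: {avg} ms")
print(f"p95 latency: {p95} ms")
print(f"error rate: {rate:.1%}, alert: {should_alert(rate)}")
```

Notice that the average barely hints at the one very slow request while the 95th percentile exposes it. That is why percentiles tend to be a better vital sign for response times than averages.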

3. Tracing (Following the Patient’s Journey)

Metrics can tell you that something is wrong, and logs can give you details about a specific service, but how do you understand a request’s full journey through your complex system? That’s the job of Distributed Tracing.

A trace is a complete story of a single request as it moves through all the different services. It’s like following a patient on their entire journey through the hospital.

The trace would show:

The request started at the API Gateway. (10ms)
It was sent to the Order Service. (50ms)
The Order Service called the User Service. (25ms)
The Order Service then called the Payment Service. (500ms) -> Here’s the problem!
The Payment Service returned an error.

The “Aha!” Moment: With a trace, you can instantly visualize the entire call graph, see where the latency is, and pinpoint exactly which service is the source of the error. It can turn a six-hour debugging session into a six-minute one. Tools like Jaeger and Zipkin are designed for this.
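To show the core idea without any tracing library, here is a toy model of the trace above. The Span class and the helper functions are hypothetical and are not the API of Jaeger or Zipkin. It also assumes each duration is the time spent inside that one service, as in the listing above; real tracers record start and end timestamps per span.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str                 # shared by every span of one request
    name: str                     # service that did the work
    duration_ms: int              # time spent in this service (assumed)
    parent: Optional[str] = None  # name of the calling service, if any
    error: bool = False


def record_purchase_trace():
    """Build the spans for one purchase request, all sharing a trace ID."""
    trace_id = uuid.uuid4().hex
    return [
        Span(trace_id, "api-gateway", 10),
        Span(trace_id, "order-service", 50, parent="api-gateway"),
        Span(trace_id, "user-service", 25, parent="order-service"),
        Span(trace_id, "payment-service", 500, parent="order-service", error=True),
    ]


def slowest_span(spans):
    """Return the span where the most time was spent."""
    return max(spans, key=lambda s: s.duration_ms)


def failing_spans(spans):
    """Return the names of the services that reported an error."""
    return [s.name for s in spans if s.error]


def path_from_root(spans, name):
    """Walk parent links to rebuild the call path that led to a service."""
    by_name = {s.name: s for s in spans}
    path = []
    current = by_name.get(name)
    while current is not None:
        path.append(current.name)
        current = by_name.get(current.parent) if current.parent else None
    return list(reversed(path))


spans = record_purchase_trace()
culprit = slowest_span(spans)
print("slowest:", culprit.name, culprit.duration_ms, "ms")
print("errors in:", failing_spans(spans))
print("call path:", " -> ".join(path_from_root(spans, culprit.name)))
```

The important design choice is the shared trace ID: because every service tags its span with the same ID, the pieces can be stitched back into one story afterwards.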

What’s the next move?

Challenge: Look at the logs for a simple application you’ve worked on. Are they structured (like JSON, which is machine-readable) or unstructured (just plain text)? Structured logs are far more powerful because they can be easily searched and filtered.
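If your logs turn out to be plain text, here is one possible way to emit JSON instead, using only Python’s standard library. The JsonFormatter class and the field names (service, order_id, trace_id) are my own choices for this sketch, not a standard; dedicated structured-logging libraries exist as well.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }
        # Values passed through extra= become attributes on the record.
        for key in ("order_id", "trace_id"):  # hypothetical field names
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("order-service")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.handlers.clear()
logger.propagate = False
logger.addHandler(handler)

logger.info("Order created", extra={"order_id": 124})
logger.error("Payment failed", extra={"order_id": 123})

# Because every line is JSON, filtering is a dictionary lookup, not a regex.
events = [json.loads(line) for line in stream.getvalue().splitlines()]
errors = [e for e in events if e["level"] == "ERROR"]
print(errors)
```

The last three lines are the payoff: once logs are machine-readable, “show me every error for order 123” becomes a simple filter.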

Now, think about what key metrics you would want to monitor for that application. Don’t just think about system metrics like CPU. Think about business metrics. Number of user signups per hour? Average value of a shopping cart? Number of failed payments? These are the numbers that truly tell you if your application is healthy.
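As a starting point, a business metric can be as simple as counting events per hour. The event names and timestamps below are invented sample data, and count_per_hour is a hypothetical helper; a real system would feed these events from the application into a metrics tool.

```python
from collections import Counter
from datetime import datetime

# Hypothetical application events: (kind, ISO timestamp).
sample_events = [
    ("signup", "2024-01-01T10:05:00"),
    ("signup", "2024-01-01T10:40:00"),
    ("failed_payment", "2024-01-01T10:41:00"),
    ("signup", "2024-01-01T11:02:00"),
    ("failed_payment", "2024-01-01T11:15:00"),
    ("failed_payment", "2024-01-01T11:16:00"),
]


def count_per_hour(events, kind):
    """Count events of one kind, grouped by the hour they occurred in."""
    counts = Counter()
    for event_kind, timestamp in events:
        if event_kind == kind:
            hour = datetime.fromisoformat(timestamp).strftime("%Y-%m-%d %H:00")
            counts[hour] += 1
    return counts


signups = count_per_hour(sample_events, "signup")
failed = count_per_hour(sample_events, "failed_payment")

for hour in sorted(set(signups) | set(failed)):
    print(hour, "signups:", signups[hour], "failed payments:", failed[hour])
```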

Thanks for reading!

Bou~codes and Naima from 10xdev blog.