The Observability Pyramid: Logs, Metrics, Traces in 2026

Logs, metrics, and traces get discussed constantly as the three pillars of observability. Which one to reach for depends on the question you’re answering.

Metrics: for “is it broken and how much”

Aggregated numerical data over time. Good for:

Dashboards and alerts
Trends (is latency increasing week-over-week?)
Capacity planning

Not good for:

Explaining why a specific request was slow
Finding causality between events

Stack: Prometheus + Grafana remains the default. OpenTelemetry Metrics if you want vendor-neutral instrumentation.
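To make "aggregated numerical data" concrete, here is a minimal sketch in plain Python (no Prometheus client involved) of a fixed-bucket latency histogram with a linear-interpolation percentile estimate, similar in spirit to what Prometheus's histogram_quantile does. The `LatencyHistogram` class and its bucket bounds are assumptions made up for illustration.

```python
import bisect


class LatencyHistogram:
    """Hypothetical fixed-bucket histogram; bounds are upper limits in seconds."""

    def __init__(self, bounds=(0.05, 0.1, 0.25, 0.5, 1.0)):
        self.bounds = list(bounds)
        # One counter per bucket, plus a final overflow (+Inf) bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0

    def observe(self, seconds):
        # bisect_left returns the index of the first bound >= seconds.
        idx = bisect.bisect_left(self.bounds, seconds)
        self.counts[idx] += 1
        self.total += 1

    def quantile(self, q):
        """Estimate the q-quantile by interpolating inside the matching bucket."""
        if self.total == 0:
            return None
        rank = q * self.total
        seen = 0
        for i, count in enumerate(self.counts):
            if count and seen + count >= rank:
                if i == len(self.bounds):
                    # Overflow bucket: the highest finite bound is the best guess.
                    return self.bounds[-1]
                lower = self.bounds[i - 1] if i > 0 else 0.0
                upper = self.bounds[i]
                return lower + (upper - lower) * (rank - seen) / count
            seen += count
        return self.bounds[-1]


hist = LatencyHistogram()
for _ in range(90):
    hist.observe(0.02)   # fast requests
for _ in range(10):
    hist.observe(0.3)    # slow tail
p95 = hist.quantile(0.95)
```

Note what the histogram can and cannot say: it gives a usable p95 from six counters, but it has no idea which request was slow or why.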

Logs: for “what happened”

Discrete events with context. Good for:

Debugging specific failures
Audit trails
Security investigations

Not good for:

Real-time alerting (too noisy)
Aggregation across high-cardinality dimensions (expensive)

Stack: Loki for log aggregation, with Grafana Alloy as the collector (Promtail, the older agent, is deprecated). ClickHouse for high-volume structured logs.

Traces: for “why was this slow”

Distributed request flows across services. Good for:

Latency analysis
Understanding service dependencies
Finding hidden bottlenecks

Not good for:

Aggregate analysis (traces are per-request)
Long-term storage (volume)

Stack: OpenTelemetry SDK + Jaeger or Tempo backend.
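As a rough illustration of how a trace exposes a hidden bottleneck, the sketch below models spans as plain Python dataclasses (not the OpenTelemetry data model) and computes each span's self time: its duration minus the time spent in its direct children. The `Span` fields and the service names are invented for the example, and it assumes child spans do not overlap in time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_times(spans):
    """Return {span name: duration minus the durations of its direct children}.

    Assumes span names are unique within the trace and children run sequentially.
    """
    child_total = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0.0) + s.duration_ms
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


def bottleneck(spans):
    """Name of the span that spent the most time in its own work."""
    times = self_times(spans)
    return max(times, key=times.get)


# Hypothetical trace: the gateway looks slow, but the time is spent downstream.
trace = [
    Span("a", None, "api-gateway", 0, 480),
    Span("b", "a", "checkout", 10, 470),
    Span("c", "b", "inventory-db", 20, 60),
    Span("d", "b", "payment-provider", 70, 450),
]
slowest = bottleneck(trace)
```

A latency metric on api-gateway would show 480 ms and nothing more; the per-span breakdown is what points at the payment call.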

The pyramid

┌─ Metrics ─────┐
│  (always on)  │
├───────────────┤
│     Logs      │  (sampled in prod)
├───────────────┤
│    Traces     │  (head-sampled)
└───────────────┘

Metrics are always on. Logs are sampled in production to control cost. Traces are head-sampled: the keep-or-drop decision is made when a request starts, and typically 1-5% of requests get traced.
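The sampling decision for traces can be sketched as a deterministic check on the trace ID, so that every service seeing the same ID reaches the same verdict without coordination. This is a minimal sketch assuming 128-bit hex trace IDs; it is similar in spirit to OpenTelemetry's TraceIdRatioBased sampler but is not that implementation, and `should_sample` is a hypothetical helper.

```python
import secrets


def new_trace_id() -> str:
    # 128-bit random trace ID rendered as 32 hex characters.
    return secrets.token_hex(16)


def should_sample(trace_id: str, ratio: float) -> bool:
    """Keep a trace if the low 64 bits of its ID fall under the ratio threshold."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    low_bits = int(trace_id[-16:], 16)
    return low_bits < int(ratio * (1 << 64))


# The decision depends only on the ID, so repeating it gives the same answer.
trace_id = new_trace_id()
first = should_sample(trace_id, 0.05)
second = should_sample(trace_id, 0.05)
```

Because the verdict is made up front, a trace for a request that later turns out slow or failed may already have been dropped. That is the trade-off for the low overhead.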

OpenTelemetry is the standard

If starting fresh in 2026, use OpenTelemetry for everything:

Unified instrumentation (one SDK per language)
Export to any backend (Prometheus, Jaeger, Datadog, Honeycomb)
Vendor-neutral: switch backends later without touching instrumentation code

The SDKs have matured significantly. It is no longer a case of “stable API is coming soon.”

Correlation

The real power comes from correlation: a metric alert fires → you click through to the related logs → from a slow log line, you pivot to the trace.

To enable this:

Include trace_id in log messages
Tag metrics with service/environment labels that match log fields
Use exemplars to link metrics to specific traces
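One way to get trace_id into every log line is a logging filter that reads the current ID from a context variable. The sketch below uses only the Python standard library; `current_trace_id`, `TraceIdFilter`, `JsonFormatter`, the "checkout" service name, and the JSON field names are all assumptions for illustration. With the OpenTelemetry SDK you would read the ID from the active span context instead.

```python
import contextvars
import io
import json
import logging

# Hypothetical per-request holder; "-" means no trace is active.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    def filter(self, record):
        # Attach the active trace ID to every record passing through.
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "checkout",  # assumed name; should match the metric label
            "trace_id": record.trace_id,
        })


def build_logger(stream):
    logger = logging.getLogger("correlation-demo")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# Usage: set the ID when a request starts, then log as normal.
buffer = io.StringIO()
log = build_logger(buffer)
token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("payment took %d ms", 912)
current_trace_id.reset(token)
```

The application code never mentions the trace ID when logging, which is the point: correlation fields that depend on developers remembering to add them tend to go missing exactly where you need them.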

Without correlation, the three pillars are three disconnected tools. With correlation, you move fluidly between views during an incident.

Cost reality

At scale, observability can cost more than the infrastructure it monitors. Tricks:

Sample aggressively: 1% tracing is usually enough
Downsample: keep 1-second resolution for 24h, 1-minute for 7d, 5-minute for 30d
Use native histograms (Prometheus) for efficient percentile storage
Structured logs (JSON) compress better than unstructured
Tiered storage: hot tier (fast queries) for recent data, cold tier (S3) for old
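The downsampling tiers above amount to re-bucketing old samples at a coarser step. Here is a minimal sketch, assuming samples arrive as (unix_seconds, value) pairs and that averaging is an acceptable aggregation (for latency you would normally keep a max or a histogram as well). `downsample` is a hypothetical helper, not a Prometheus or Thanos API.

```python
from collections import defaultdict


def downsample(samples, step_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of its bucket.
        bucket_start = ts - (ts % step_seconds)
        buckets[bucket_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Two minutes of 1-second samples collapse into two 1-minute points.
raw = [(t, float(t % 60)) for t in range(0, 120)]
per_minute = downsample(raw, 60)
```

Going from 1-second to 1-minute resolution cuts the stored points by 60x, at the price of hiding any spike shorter than a minute inside the average.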

What I skip

APM tools with proprietary agents (vendor lock-in, confusing pricing)
Real User Monitoring unless you have frontend teams that need it
Synthetic monitoring of every possible path (test the critical ones)

Start here

For a small team:

Prometheus + Grafana for metrics (1 week setup)
Loki for logs (1 day setup)
Tempo for traces (1 week setup + instrumentation)

Budget ~1 engineer-month to get solid coverage. After that, you iterate based on real incidents.
