The three pillars of observability (metrics, logs, and traces) get talked about a lot, usually as a bundle. Which one to reach for depends on the question you’re answering.
Metrics: for “is it broken and how much”
Aggregated numerical data over time. Good for:
- Dashboards and alerts
- Trends (is latency increasing week-over-week?)
- Capacity planning
Not good for:
- Explaining why a specific request was slow
- Finding causality between events
Stack: Prometheus + Grafana remains the default. OpenTelemetry Metrics if you want vendor-neutral instrumentation.
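To make "aggregated numerical data" concrete, here is a minimal sketch of a bucketed latency histogram. It is not the Prometheus client library: the class name `LatencyHistogram` and the bucket bounds are assumptions picked for illustration. It shows why metrics are cheap (a handful of counters no matter how much traffic you serve) and why they cannot explain one slow request (the individual observations are thrown away).

```python
import bisect
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LatencyHistogram:
    """Bucketed histogram, similar in spirit to a classic Prometheus histogram."""

    # Upper bounds in seconds (illustrative values, not a recommendation).
    bounds: Tuple[float, ...] = (0.05, 0.1, 0.25, 0.5, 1.0, 2.5)
    counts: List[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        # One counter per bound, plus a final overflow (+Inf) bucket.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, seconds: float) -> None:
        # bisect_left picks the first bucket whose upper bound is >= seconds.
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1

    def quantile_upper_bound(self, q: float) -> float:
        """Upper bound of the bucket that contains the q-th quantile."""
        total = sum(self.counts)
        if total == 0:
            raise ValueError("no observations recorded")
        running = 0
        for bound, count in zip(self.bounds + (float("inf"),), self.counts):
            running += count
            if running >= q * total:
                return bound
        return float("inf")


if __name__ == "__main__":
    hist = LatencyHistogram()
    for latency in (0.02, 0.03, 0.07, 0.2, 3.0):
        hist.observe(latency)
    print(hist.counts, hist.quantile_upper_bound(0.5))
```

The answer is only as precise as the bucket boundaries, which is the usual trade-off with pre-aggregated data.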
Logs: for “what happened”
Discrete events with context. Good for:
- Debugging specific failures
- Audit trails
- Security investigations
Not good for:
- Real-time alerting (too noisy)
- Aggregation across high-cardinality dimensions (expensive)
Stack: Loki for log aggregation, with Grafana Alloy as the collector (Promtail is deprecated in favor of Alloy). ClickHouse for high-volume structured logs.
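A "discrete event with context" is easiest to query when the context is structured. The sketch below uses only the Python standard library; the `JsonFormatter` class and the convention of passing fields under a `context` key in `extra=` are assumptions for this example, not a Loki or ClickHouse requirement.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"context": {...}} lands on the record
        # as an attribute; merge it into the top-level JSON object.
        context = getattr(record, "context", None)
        if isinstance(context, dict):
            payload.update(context)
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("checkout")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info(
        "payment declined",
        extra={"context": {"order_id": "o-123", "reason": "card_expired"}},
    )
```

With fields like `order_id` as real keys, a failure investigation becomes a filter rather than a regex.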
Traces: for “why was this slow”
Distributed request flows across services. Good for:
- Latency analysis
- Understanding service dependencies
- Finding hidden bottlenecks
Not good for:
- Aggregate analysis (traces are per-request)
- Long-term storage (volume)
Stack: OpenTelemetry SDK + Jaeger or Tempo backend.
The pyramid
┌───────────────┐
│    Metrics    │ (always on)
├───────────────┤
│     Logs      │ (sampled in prod)
├───────────────┤
│    Traces     │ (head-sampled)
└───────────────┘
Metrics are always on. Logs are sampled in production to control cost. Traces are head-sampled: the keep-or-drop decision is made when the request starts, and typically 1-5% of requests get traced.
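A head sampler can be sketched as a pure function of the trace ID, so every service that sees the same ID reaches the same decision without coordinating. The function below is a simplified illustration under that assumption; `should_sample` is a hypothetical name, and this is not the exact algorithm any particular SDK ships.

```python
import secrets


def should_sample(trace_id: str, ratio: float) -> bool:
    """Decide once, from the trace ID alone, whether to record this trace."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Trace IDs are random hex strings, so the low 64 bits behave like a
    # uniform random number in [0, 2**64).
    value = int(trace_id[-16:], 16)
    return value < int(ratio * 2**64)


if __name__ == "__main__":
    trace_id = secrets.token_hex(16)  # 128-bit ID as 32 hex characters
    print(trace_id, should_sample(trace_id, 0.05))
```

Because the decision is deterministic, a trace is either recorded end to end or not at all, which avoids traces with missing spans in the middle.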
OpenTelemetry is the standard
If starting fresh in 2026, use OpenTelemetry for everything:
- Unified instrumentation (one SDK per language)
- Export to any backend (Prometheus, Jaeger, Datadog, Honeycomb)
- Vendor neutral: switch backends later without touching instrumentation code
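The SDKs have matured significantly. Tracing and metrics are stable in the specification and in most major language SDKs, so this is no longer a case of “stable API is coming soon.”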
Correlation
The real power comes from correlation: a metric alerts → click through to related logs → from a slow log line, pivot to the trace.
To enable this:
- Include trace_id in log messages
- Tag metrics with service/environment labels that match log fields
- Use exemplars to link metrics to specific traces
Without correlation, the three pillars are three disconnected tools. With correlation, you move fluidly between views during an incident.
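The first item, putting trace_id into logs, can be done once at the logging layer instead of in every call site. Here is a minimal standard-library sketch: the `current_trace_id` context variable, `TraceIdFilter`, and `build_logger` are hypothetical names, and in a real service the ID would be set by your tracing middleware rather than by hand.

```python
import contextvars
import logging

# Set by request middleware in a real service; "-" means "no active trace".
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only annotate it


def build_logger(name: str, stream=None) -> logging.Logger:
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
    log.info("slow response: 2.3s")
    current_trace_id.reset(token)
```

Once every log line carries the ID, pivoting from a slow log line to its trace is a copy-paste, or a single click if the log viewer is configured to link on that field.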
Cost reality
At scale, observability can cost more than the infrastructure it monitors. Tricks:
- Sample aggressively: 1% tracing is usually enough
- Downsample: keep 1-second resolution for 24h, 1-minute for 7d, 5-minute for 30d
- Use native histograms (Prometheus) for efficient percentile storage
- Structured logs (JSON) compress better than unstructured
- Tiered storage: hot tier (fast queries) for recent data, cold tier (S3) for old
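Downsampling is the easiest of these to reason about in code. The sketch below rolls raw points up into fixed windows; the `downsample` function and the `(unix_ts, value)` tuple layout are assumptions for illustration, and in practice your metrics backend or a recording rule would do this for you.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable, List, Tuple


def downsample(
    points: Iterable[Tuple[int, float]], window_seconds: int
) -> List[Tuple[int, float, float]]:
    """Roll (unix_ts, value) points up into (window_start, avg, max) rows."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its window.
        buckets[ts - ts % window_seconds].append(value)
    return [
        (start, mean(values), max(values))
        for start, values in sorted(buckets.items())
    ]


if __name__ == "__main__":
    raw = [(0, 10.0), (30, 20.0), (59, 30.0), (60, 100.0)]
    print(downsample(raw, 60))
```

Keeping the max alongside the average is a deliberate choice here: averaging alone flattens the short spikes you usually care about when looking back at an incident.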
What I skip
- APM tools with proprietary agents (vendor lock-in, confusing pricing)
- Real User Monitoring unless you have frontend teams that need it
- Synthetic monitoring of every possible path (test the critical ones)
Start here
For a small team:
- Prometheus + Grafana for metrics (1 week setup)
- Loki for logs (1 day setup)
- Tempo for traces (1 week setup + instrumentation)
Budget ~1 engineer-month to get solid coverage. After that, you iterate based on real incidents.