Monitoring

The Observability Pyramid: Logs, Metrics, Traces in 2026

The three pillars of observability are talked about a lot. Which one to reach for depends on the question you’re answering.

Metrics: for “is it broken and how much”

Aggregated numerical data over time.

Good for:
- Dashboards and alerts
- Trends (is latency increasing week-over-week?)
- Capacity planning

Not good for:
- Explaining why a specific request was slow
- Finding causality between events

Stack: Prometheus + Grafana remains the default. OpenTelemetry Metrics if you want vendor-neutral instrumentation. ...
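To make the metrics trade-off above concrete, here is a minimal Python sketch of what "aggregated numerical data over time" means in practice. The sample latencies and the helper names `summarize` and `week_over_week_change` are made up for illustration and are not part of Prometheus or OpenTelemetry. It collapses raw request latencies into a few numbers and compares two weeks.

```python
from statistics import mean, quantiles


def summarize(latencies_ms):
    """Collapse raw per-request latencies into a few aggregate numbers."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = quantiles(latencies_ms, n=100, method="inclusive")[94]
    return {
        "count": len(latencies_ms),
        "mean_ms": mean(latencies_ms),
        "p95_ms": p95,
    }


def week_over_week_change(previous, current, key="p95_ms"):
    """Relative change of one aggregate between two weekly summaries."""
    return (current[key] - previous[key]) / previous[key]


# Made-up latency samples (milliseconds) for two consecutive weeks.
last_week = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
this_week = [value * 1.2 for value in last_week]

change = week_over_week_change(summarize(last_week), summarize(this_week))
print(f"p95 latency changed by {change:.0%} week-over-week")
```

The summary answers the trend question, but the individual requests are gone after aggregation, which is why a metric alone cannot explain why one specific request was slow.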

Grafana Dashboards That Don't Suck: Principles and Anti-Patterns

Most Grafana dashboards are bad. Too many panels, unclear queries, inconsistent color schemes, no clear purpose. Here are the principles I apply now.

Rule 1: Every dashboard has one question

Start by writing down: “What question does this dashboard answer?”

Good:
- “Is the order service healthy right now?”
- “How is the nightly ETL job progressing?”
- “What is the cost trend for our compute in the last 30 days?”

Bad:
- “Production metrics”
- “Database overview”

If you can’t state the question in one sentence, you don’t know what the dashboard is for. ...
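One way to enforce Rule 1 mechanically is a small lint over exported dashboard JSON. The sketch below assumes a team convention (not a Grafana requirement) of storing the dashboard's question in its `description` field. The function name `dashboard_question_problems` and the sample dashboards are hypothetical, and the one-sentence check is deliberately crude.

```python
def dashboard_question_problems(dashboard):
    """Return a list of problems with the dashboard's stated question."""
    problems = []
    # Assumed convention: the question lives in the "description" field.
    question = (dashboard.get("description") or "").strip()
    if not question:
        problems.append("no question stated")
        return problems
    if not question.endswith("?"):
        problems.append("description is not phrased as a question")
    # Crude one-sentence check: no sentence terminator before the last character.
    if any(mark in question[:-1] for mark in ".?!"):
        problems.append("question spans more than one sentence")
    return problems


# Hypothetical dashboards shaped like a trimmed-down dashboard JSON export.
good = {
    "title": "Order service health",
    "description": "Is the order service healthy right now?",
    "panels": [],
}
bad = {"title": "Production metrics", "description": "", "panels": []}

for dashboard in (good, bad):
    print(dashboard["title"], "->", dashboard_question_problems(dashboard) or "ok")
```

A check like this could run in CI against dashboards kept in version control. It only catches a missing or malformed question, not a bad one.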

Alert Fatigue: Prometheus Rules That Actually Help

Most alerts are noise. The hardest part of monitoring is deciding what NOT to alert on. Here is the framework I use.

Rule 1: Every alert must be actionable

If you get paged and there is nothing to do, the alert should not exist. Either fix the root cause, automate the response, or let it be a metric trend instead of a page.

Rule 2: Alert on user-visible symptoms

Instead of HighCPUUsage, prefer HighRequestLatency. CPU usage high with good latency means the system is working as designed. Latency high means users are hurting. ...
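The two rules above can be sketched as a review check over alert definitions. The dictionaries below mirror the fields of a Prometheus alerting rule (`alert`, `expr`, `for`, `labels`, `annotations`). Everything else is an assumption for illustration: the metric names, the thresholds, the example runbook URL, the `CAUSE_ONLY_ALERTS` set, and the choice to treat a `runbook_url` annotation as a stand-in for "actionable".

```python
# Hypothetical list of cause-level alert names this team chose not to page on.
CAUSE_ONLY_ALERTS = {"HighCPUUsage", "HighMemoryUsage"}


def review_alert(rule):
    """Return the reasons a paging alert violates Rule 1 or Rule 2."""
    problems = []
    annotations = rule.get("annotations", {})
    # Rule 1 proxy (assumption): an actionable alert links to a runbook.
    if not annotations.get("runbook_url"):
        problems.append("not actionable: no runbook_url annotation")
    # Rule 2: page on symptoms users feel, not on internal causes.
    if rule.get("alert") in CAUSE_ONLY_ALERTS:
        problems.append("alerts on a cause, not a user-visible symptom")
    return problems


high_latency = {
    "alert": "HighRequestLatency",
    # Assumed histogram metric name; adjust to your own instrumentation.
    "expr": "histogram_quantile(0.99, sum by (le) "
            "(rate(http_request_duration_seconds_bucket[5m]))) > 0.5",
    "for": "10m",
    "labels": {"severity": "page"},
    "annotations": {
        "summary": "p99 request latency above 500 ms for 10 minutes",
        "runbook_url": "https://runbooks.example.com/high-request-latency",
    },
}

high_cpu = {
    "alert": "HighCPUUsage",
    "expr": '1 - avg by (instance) '
            '(rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9',
    "for": "10m",
    "labels": {"severity": "page"},
    "annotations": {},
}

for rule in (high_latency, high_cpu):
    print(rule["alert"], "->", review_alert(rule) or "ok")
```

The CPU expression is still useful on a dashboard. The check only flags it as something that should not page a person.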