Most alerts are noise. The hardest part of monitoring is deciding what NOT to alert on. Here is the framework I use.

Rule 1: Every alert must be actionable

If you get paged and there is nothing to do, the alert should not exist. Either fix the root cause, automate the response, or let it be a metric trend instead of a page.

Rule 2: Alert on user-visible symptoms

Instead of HighCPUUsage, prefer HighRequestLatency. High CPU usage with good latency means the system is working as designed. High latency means users are hurting.
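To make the symptom-first idea concrete, here is a minimal sketch of a latency check. The function names, the choice of p99, and the 500 ms threshold are all assumptions for illustration, not values from any particular monitoring system. Note that CPU usage is not an input at all.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Multiply before dividing so integer percentiles avoid float rounding surprises.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def high_request_latency(latencies_ms: Sequence[float], threshold_ms: float = 500.0) -> bool:
    """Fire when p99 request latency exceeds the threshold (hypothetical 500 ms default)."""
    return percentile(latencies_ms, 99) > threshold_ms
```

In a real setup the same condition would usually live in the monitoring system's own rule language. The point of the sketch is only that the alert condition is expressed in terms of what users experience.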

Rule 3: SLO-based alerting

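Define SLOs up front, then alert on burn rate: how fast the error budget is being consumed relative to the rate the SLO allows. A burn rate of 1 uses up exactly the whole budget over the SLO period. Multi-window, multi-burn-rate alerts catch both fast and slow failures with minimal noise.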
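A rough sketch of how a multi-window, multi-burn-rate check can be evaluated. The window pairs and thresholds below (14.4, 6, 1) are commonly cited starting points for a 30-day SLO and are assumptions here, as are all the names. Each rule fires only when both its long and its short window are burning above the threshold, which is what keeps an alert from lingering after the problem has stopped.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str
    short_window: str
    threshold: float
    severity: str


# Illustrative values only; tune them to your own SLO period and budget.
RULES = [
    BurnRateRule("1h", "5m", 14.4, "critical"),
    BurnRateRule("6h", "30m", 6.0, "critical"),
    BurnRateRule("3d", "6h", 1.0, "warning"),
]


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("SLO target must be below 1.0")
    return error_ratio / budget


def firing_severities(error_ratio_by_window: Dict[str, float], slo_target: float) -> List[str]:
    """Return the severity of every rule whose long AND short windows exceed its threshold."""
    fired = []
    for rule in RULES:
        long_burn = burn_rate(error_ratio_by_window[rule.long_window], slo_target)
        short_burn = burn_rate(error_ratio_by_window[rule.short_window], slo_target)
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            fired.append(rule.severity)
    return fired
```

With a 99.9% SLO the budget is 0.1% of requests, so a 2% error ratio over the last hour is a burn rate of about 20, well above the fast-burn threshold.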

Rule 4: Severity levels matter

  • Critical (page immediately): user-visible outage, data loss risk
  • Warning (ticket for next business day): degraded service, upcoming capacity issue
  • Info (log only): interesting but not actionable

Most rules I write end up as warning or info. Very few are critical.
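The three levels above map naturally to three destinations. A minimal sketch, assuming the severity arrives as a string label and that the destination names ("page", "ticket", "log") are placeholders for whatever your paging and ticketing tools are called. Rejecting unknown labels is my choice here, so a typo in a rule cannot silently become a page or disappear.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


# Hypothetical destinations mirroring the list above.
ROUTES = {
    Severity.CRITICAL: "page",    # page immediately
    Severity.WARNING: "ticket",   # ticket for next business day
    Severity.INFO: "log",         # log only
}


def route(severity_label: str) -> str:
    """Map a severity label to its destination; raises ValueError for unknown labels."""
    severity = Severity(severity_label.strip().lower())
    return ROUTES[severity]
```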

Rule 5: Include context in annotations

An on-call engineer at 3am does not remember anything. Every alert should link to a runbook and a dashboard.
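One way to enforce this is a small lint step that runs before alert rules are deployed. This is a sketch under assumptions: rules are already parsed into dictionaries with an "alert" name and an "annotations" mapping, and the annotation keys `runbook_url` and `dashboard_url` are a naming convention I picked for the example.

```python
from typing import Dict, List

# Assumed convention for annotation keys; rename to match your own rules.
REQUIRED_ANNOTATIONS = ("runbook_url", "dashboard_url")


def missing_annotations(rule: dict) -> List[str]:
    """Return the required annotation keys that are absent or blank on one rule."""
    annotations = rule.get("annotations") or {}
    missing = []
    for key in REQUIRED_ANNOTATIONS:
        value = annotations.get(key)
        if not (isinstance(value, str) and value.strip()):
            missing.append(key)
    return missing


def lint_rules(rules: List[dict]) -> Dict[str, List[str]]:
    """Map each offending alert name to the annotations it lacks."""
    problems = {}
    for rule in rules:
        missing = missing_annotations(rule)
        if missing:
            problems[rule.get("alert", "<unnamed>")] = missing
    return problems
```

Failing the build when `lint_rules` returns anything means an alert without a runbook never reaches the person holding the pager.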

Rule 6: Review and prune quarterly

Look at fired alerts. How many led to actual action? How many were silenced? How many fired during known maintenance? The ratio of noise to signal is a real metric you should track.
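The review questions above can be turned into a number. The sketch below assumes you can export fired alerts with a recorded outcome. The outcome labels and the pruning cutoffs are hypothetical. It computes the fraction of fired alerts that were noise, a bounded stand-in for the noise-to-signal ratio, and lists the alerts that fire often but almost never lead to action.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class FiredAlert:
    name: str
    outcome: str  # assumed labels: "actioned", "silenced", "maintenance", "no_action"


def noise_fraction(alerts: Sequence[FiredAlert]) -> float:
    """Share of fired alerts that did not lead to action (0.0 when nothing fired)."""
    if not alerts:
        return 0.0
    noisy = sum(1 for alert in alerts if alert.outcome != "actioned")
    return noisy / len(alerts)


def prune_candidates(
    alerts: Sequence[FiredAlert], min_fires: int = 3, min_noise: float = 0.8
) -> List[Tuple[str, float]]:
    """Alerts that fired at least min_fires times and were mostly noise, noisiest first."""
    by_name = defaultdict(list)
    for alert in alerts:
        by_name[alert.name].append(alert)
    candidates = []
    for name, fired in by_name.items():
        fraction = noise_fraction(fired)
        if len(fired) >= min_fires and fraction >= min_noise:
            candidates.append((name, fraction))
    return sorted(candidates, key=lambda item: (-item[1], item[0]))
```

Tracking `noise_fraction` quarter over quarter shows whether pruning is actually working, and `prune_candidates` gives the review meeting a short list to start from.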