Alert Fatigue: Prometheus Rules That Actually Help

Most alerts are noise. The hardest part of monitoring is deciding what NOT to alert on. Here is the framework I use.

Rule 1: Every alert must be actionable

If you get paged and there is nothing to do, the alert should not exist. Either fix the root cause, automate the response, or let it be a metric trend instead of a page.

Rule 2: Alert on user-visible symptoms

Instead of HighCPUUsage, prefer HighRequestLatency. High CPU usage with good latency means the system is working as designed. High latency means users are hurting.
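
Here is a minimal sketch of what a symptom-based rule could look like. The metric name http_request_duration_seconds_bucket, the service label, the 500ms threshold, and the 10m hold time are all assumptions for illustration. Substitute whatever your services actually export and whatever latency your users actually notice.

```yaml
groups:
  - name: symptom-alerts
    rules:
      - alert: HighRequestLatency
        # p99 latency per service over the last 5 minutes.
        # Metric and label names are hypothetical placeholders.
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        # Require the condition to hold for 10 minutes to ignore brief spikes.
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "p99 latency above 500ms for {{ $labels.service }}"
```

Note that there is no CPU expression anywhere in the rule. If CPU is the cause, it will show up on the dashboard once the latency alert has told you users are affected.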

Rule 3: SLO-based alerting

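Define SLOs upfront, then alert on burn rate: how fast the service is consuming its error budget relative to the pace that would use it up exactly at the end of the SLO window. Multi-window, multi-burn-rate alerts catch both fast and slow failures with minimal noise.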
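
As a rough sketch, assume a 99.9% availability SLO over 30 days (an error budget of 0.001) and a hypothetical http_requests_total counter with a code label. The burn-rate factors and window pairs below (14.4 over 1h/5m, 3 over 1d/2h) follow the commonly cited values from the Google SRE Workbook. Treat them as a starting point, not as tuned numbers for your service.

```yaml
groups:
  - name: slo-burn-rate
    rules:
      # Fast burn: about 2% of a 30-day budget gone in one hour (14.4x).
      # The short 5m window makes the alert reset quickly once errors stop.
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
          )
        labels:
          severity: critical
      # Slow burn: about 10% of the budget gone in one day (3x).
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1d]))
              / sum(rate(http_requests_total[1d])) > (3 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[2h]))
              / sum(rate(http_requests_total[2h])) > (3 * 0.001)
          )
        labels:
          severity: warning
```

Requiring both the long and the short window to exceed the threshold is what keeps the noise down: the long window shows the burn is significant, and the short window shows it is still happening.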

Rule 4: Severity levels matter

Critical (page immediately): user-visible outage, data loss risk
Warning (ticket for next business day): degraded service, upcoming capacity issue
Info (log only): interesting but not actionable

Most rules I write end up as warning or info. Very few are critical.
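
Severity only matters if something acts on it. One way to wire the three levels up is an Alertmanager routing tree keyed on a severity label. This is a sketch: the receiver names are hypothetical, and the actual pager, ticketing, or webhook settings for each receiver are left out.

```yaml
route:
  # Anything without a more specific match (for example severity="info")
  # falls through to the log-only receiver.
  receiver: log-only
  routes:
    - matchers:
        - severity="critical"
      receiver: pager
    - matchers:
        - severity="warning"
      receiver: ticket-queue

receivers:
  # Hypothetical receivers. Add the relevant integration config under each.
  - name: pager
  - name: ticket-queue
  - name: log-only
```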

Rule 5: Include context in annotations

An oncall engineer at 3am does not remember anything. Every alert should link to a runbook and dashboard.
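
A sketch of annotations that carry that context. The URLs use example.com placeholders, and the recorded series service:request_latency_seconds:p99_5m is a hypothetical name. The annotation keys runbook_url and dashboard_url are a convention I am assuming here, not something Prometheus enforces.

```yaml
groups:
  - name: annotated-alerts
    rules:
      - alert: HighRequestLatency
        # Hypothetical recording rule holding p99 latency in seconds.
        expr: service:request_latency_seconds:p99_5m > 0.5
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "p99 latency above 500ms for {{ $labels.service }}"
          description: "Current p99 is {{ $value | humanizeDuration }}."
          # Placeholder links. Point these at your real runbook and dashboard.
          runbook_url: "https://runbooks.example.com/high-request-latency"
          dashboard_url: "https://grafana.example.com/d/latency?var-service={{ $labels.service }}"
```

Templating the service label into the dashboard link saves the on-call engineer from hunting for the right filter at 3am.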

Rule 6: Review and prune quarterly

Look at fired alerts. How many led to actual action? How many were silenced? How many fired during known maintenance? The ratio of noise to signal is a real metric you should track.
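
Prometheus can supply part of the data for this review. It exposes a synthetic ALERTS series for pending and firing alerts, so a recording rule along these lines gives you a history of what fired. This is only a sketch: the rule name is my own choice, and it tells you nothing about whether anyone acted on an alert. That part still has to come from your incident or ticket tracker.

```yaml
groups:
  - name: alert-review
    rules:
      # Number of currently firing alert instances per alert name and severity.
      # ALERTS is generated by Prometheus itself. The record name is hypothetical.
      - record: alertname:alerts_firing:count
        expr: count by (alertname, severity) (ALERTS{alertstate="firing"})
```

At review time, graphing this series over the quarter shows which alerts fire most often. Those are the first candidates for pruning or demotion.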
