Alert Fatigue: Prometheus Rules That Actually Help

Mon, 10 Jun 2024 00:00:00 +0000

Most alerts are noise. The hardest part of monitoring is deciding what NOT to alert on. Here is the framework I use.

Rule 1: Every alert must be actionable

If you get paged and there is nothing to do, the alert should not exist. Either fix the root cause, automate the response, or demote it to a metric you watch as a trend instead of a page.

Rule 2: Alert on user-visible symptoms

Instead of HighCPUUsage, prefer HighRequestLatency. High CPU usage with good latency means the system is working as designed. High latency means users are hurting.
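
As a rough sketch, a symptom-based rule for HighRequestLatency could look like the following. The metric name http_request_duration_seconds_bucket, the job label, the 0.5 second threshold, the 10 minute hold, and the group name are all assumptions for illustration; swap in whatever your services actually export and whatever latency your users tolerate.

```yaml
groups:
  - name: latency-symptoms          # hypothetical group name
    rules:
      - alert: HighRequestLatency
        # p99 request latency per job over the last 5 minutes.
        # Assumes a histogram named http_request_duration_seconds.
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        # Only fire if the condition holds for 10 minutes, to skip short spikes.
        for: 10m
        labels:
          severity: page            # assumed label; match your Alertmanager routing
        annotations:
          summary: "p99 request latency above 500ms for {{ $labels.job }}"
```

Note that there is no CPU rule next to it. If CPU saturation ever hurts users, this rule fires anyway, and CPU stays a graph you look at while debugging.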

Prometheus on Besterry — Linux & DevOps Notes
