Prometheus

Most alerts are noise. The hardest part of monitoring is deciding what NOT to alert on. Here is the framework I use.

Rule 1: Every alert must be actionable

If you get paged and there is nothing to do, the alert should not exist. Either fix the root cause, automate the response, or let it be a metric trend instead of a page.

Rule 2: Alert on user-visible symptoms

Instead of HighCPUUsage, prefer HighRequestLatency. CPU usage high with good latency means the system is working as designed. Latency high means users are hurting.

...
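
As a rough illustration of Rules 1 and 2, here is a minimal Python sketch of a routing check for candidate alerts. The `Alert` class, the `route` function, and the `actionable` / `user_visible` flags are hypothetical names made up for this example, not part of Prometheus or Alertmanager. The flag values for the two sample alerts are also assumptions about a typical service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical description of a candidate alert (not a Prometheus type)."""
    name: str
    actionable: bool    # Rule 1: is there something a human can do when paged?
    user_visible: bool  # Rule 2: does it describe a symptom users actually feel?


def route(alert: Alert) -> str:
    """Return "page" only when both rules hold, otherwise "trend"."""
    if not alert.actionable:
        # Rule 1: nothing to do, so it should not wake anyone up.
        return "trend"
    if not alert.user_visible:
        # Rule 2: a cause without a user-facing symptom stays on a dashboard.
        return "trend"
    return "page"


# Assumed flag values, for illustration only.
candidates = [
    Alert("HighRequestLatency", actionable=True, user_visible=True),
    Alert("HighCPUUsage", actionable=False, user_visible=False),
]

routes = {alert.name: route(alert) for alert in candidates}
print(routes)
```

In practice this decision is usually made once, when writing the alerting rules, rather than at runtime. The sketch only makes the two questions explicit.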