Kubernetes Troubleshooting: The First 10 Minutes of an Outage
When PagerDuty wakes you up about a Kubernetes cluster issue, the first 10 minutes matter. Here is the runbook I work through before anything else.

Get your bearings

First, confirm what’s actually broken from the user side. Check the status page or synthetic monitor. Many “outages” are monitoring issues, not real problems.

Cluster-level check

    kubectl get nodes
    kubectl top nodes

Look for NotReady nodes and resource pressure. If multiple nodes are down, the problem is probably infrastructure; check the cloud provider console.

...
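
On a large cluster, scanning the full node list by eye is slow at 3 a.m. As a rough sketch of the cluster-level check above, the helper below narrows `kubectl get nodes` output to the nodes that are not plainly Ready. The function name `not_ready_nodes` is made up for this example and is not part of kubectl. It assumes the default column layout, where STATUS is the second column.

```shell
# Hypothetical helper (not a kubectl feature): reads `kubectl get nodes
# --no-headers` output on stdin and prints the name and status of every node
# whose STATUS column is not exactly "Ready". This also catches cordoned
# nodes such as "Ready,SchedulingDisabled".
# Assumption: default output columns, with NAME first and STATUS second.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 " " $2 }'
}

# Usage against a live cluster (requires kubectl and a working kubeconfig):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Empty output means every node reports Ready. Several lines at once point back to the infrastructure check in the cloud provider console.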