When PagerDuty wakes you up about a Kubernetes cluster issue, the first 10 minutes matter. Here is the runbook I work through before anything else.

Get your bearings

First, confirm what’s actually broken from the user side. Check the status page or synthetic monitor. Many “outages” are monitoring issues, not real problems.
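If there is no status page to look at, a quick probe from your own terminal answers the same question. This is a minimal sketch: the function name probe and the URL in the usage line are placeholders I made up, so substitute whatever endpoint your users actually hit.

```shell
# Print UP or DOWN plus the HTTP status code for a URL.
# 2xx and 3xx count as UP; anything else counts as DOWN, including 000,
# which curl reports when it cannot connect at all.
probe() {
  code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$1") || true
  case "$code" in
    2??|3??) echo "UP $code" ;;
    *)       echo "DOWN ${code:-000}" ;;
  esac
}

# Usage (hypothetical URL):
#   probe https://example.com/healthz
```

Run it from outside the cluster if you can, so the request follows the same path as real user traffic.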

Cluster-level check

kubectl get nodes
kubectl top nodes

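Look for NotReady nodes and resource pressure. kubectl top needs metrics-server installed; if it errors, check the node conditions in kubectl describe node instead. If multiple nodes are down at once, the problem is probably infrastructure: check the cloud provider console.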
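On a large cluster the node list scrolls off the screen, so it helps to filter it. A small sketch, assuming the default column layout of kubectl get nodes (name first, status second); the function name not_ready_nodes is my own.

```shell
# Read `kubectl get nodes --no-headers` output on stdin and print the name
# and status of every node whose STATUS does not start with "Ready".
# "Ready,SchedulingDisabled" is a cordoned but healthy node, so it is skipped.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1, $2 }'
}

# Usage:
#   kubectl get nodes --no-headers | not_ready_nodes
```

Empty output means every node reports Ready.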

Workload state

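kubectl get pods -A | grep -Ev 'Running|Completed'
kubectl get events -A --sort-by='.lastTimestamp' | tail -50

Filter on the STATUS column rather than on status.phase: a pod in CrashLoopBackOff usually still reports phase Running, so a phase-based field selector would hide it.

Pending pods usually mean the scheduler cannot place them (not enough capacity for the pod’s resource requests, taints, affinity rules, or an unbound PVC). CrashLoopBackOff usually means an application problem: get logs. ImagePullBackOff means the image cannot be pulled: a registry outage, a wrong tag, or missing pull credentials.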
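When hundreds of pods are unhealthy, a count per status tells you which of those cases you are in faster than reading the list. A sketch, assuming the default kubectl get pods -A column layout (STATUS is the fourth column); the function name unhealthy_pod_counts is my own.

```shell
# Read `kubectl get pods -A --no-headers` on stdin and count pods per
# STATUS (column 4), skipping Running and Completed.
# Output is sorted by count (highest first), then by status name.
unhealthy_pod_counts() {
  awk '$4 != "Running" && $4 != "Completed" { n[$4]++ }
       END { for (s in n) print n[s], s }' | sort -k1,1nr -k2,2
}

# Usage:
#   kubectl get pods -A --no-headers | unhealthy_pod_counts
```

The STATUS column shows CrashLoopBackOff even when the pod phase is still Running, which is why this reads the column instead of the phase.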

Application logs

kubectl logs -n myns mypod --previous
kubectl logs -n myns mypod --tail=200

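--previous gets logs from the previous instance of the container (the one that crashed), which is usually what you want during an outage. It only works while the pod still exists; once the pod has been deleted and recreated, those logs are gone unless you ship them to a log aggregator. For multi-container pods, add -c with the container name.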
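kubectl logs --previous exits with an error when the container has never restarted, so at 3 a.m. I prefer one command that tries the crashed instance first and falls back to the current one. A sketch; the function name crash_logs is my own, and myns and mypod are the same placeholders as above.

```shell
# Print the last 200 log lines from the previous (crashed) container
# instance; if there is none, fall back to the current instance.
# Errors from the first attempt are hidden so the fallback output stays clean.
crash_logs() {
  kubectl logs -n "$1" "$2" --previous --tail=200 2>/dev/null \
    || kubectl logs -n "$1" "$2" --tail=200
}

# Usage:
#   crash_logs myns mypod
```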

Network diagnosis

kubectl run debug --image=nicolaka/netshoot --rm -it -- bash

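From inside the debug pod you have tcpdump, dig, nslookup, curl, and netstat. Test DNS resolution, hit service endpoints directly, and check whether NetworkPolicies are blocking traffic. The pod starts in your current namespace, so add -n to test from the namespace the affected workload runs in.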
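Testing with the full service DNS name removes any doubt about search domains. A sketch of a helper that builds it, assuming the default cluster domain cluster.local (yours may differ); svc_fqdn is my own name, and myservice and port 8080 in the usage lines are hypothetical.

```shell
# Build the in-cluster DNS name for a Service:
#   <service>.<namespace>.svc.<cluster-domain>
# The third argument is optional and defaults to cluster.local.
svc_fqdn() {
  echo "$1.$2.svc.${3:-cluster.local}"
}

# Usage from inside the debug pod (service name and port are placeholders):
#   dig +short "$(svc_fqdn myservice myns)"
#   curl -sS -m 5 "http://$(svc_fqdn myservice myns):8080/"
```

If dig returns an address but curl times out, DNS is fine and the problem is further down: endpoints, policies, or the application itself.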

The classic three

Most K8s outages reduce to one of:

  1. Resource exhaustion (OOM kills, CPU throttling, disk full on nodes)
  2. Configuration drift (bad ConfigMap update, broken secret rotation)
  3. Network policy changes (recent NetworkPolicy or Istio rule blocking traffic)

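Run kubectl describe deploy on the affected workload and check the pod template for the image tag and any config hash annotations your tooling adds, then run kubectl rollout history to see recent revisions. Compare both with what’s in your GitOps repo.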
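If you have the rendered manifests checked out locally, kubectl diff can do the comparison for you. A sketch that wraps it and turns the exit code into a one-line verdict; drift_check is my own name and ./manifests is a hypothetical path.

```shell
# Compare live cluster state with local manifests.
# kubectl diff exits 0 when nothing differs, 1 when differences were found,
# and with a higher code when kubectl or diff itself failed.
drift_check() {
  rc=0
  kubectl diff -f "$1" || rc=$?
  case $rc in
    0) echo "no drift" ;;
    1) echo "drift detected" ;;
    *) echo "kubectl diff failed (exit $rc)" ;;
  esac
  return $rc
}

# Usage (hypothetical path):
#   drift_check ./manifests
```

The diff itself is printed above the verdict, so you see what changed as well as whether anything did.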

Escalation triggers

  • More than half the cluster nodes affected → call cloud provider
  • Data plane OK but control plane degraded → issue with kube-apiserver or etcd, page the platform team
  • Persistent volume claim issues → could be storage provider, page storage team
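The first trigger is easy to misjudge by eye under stress, so here is a sketch that does the arithmetic. It assumes the default kubectl get nodes column layout; the function name majority_nodes_down is my own.

```shell
# Read `kubectl get nodes --no-headers` on stdin. Prints a one-line summary
# and returns 0 (escalate) only when more than half the nodes are not Ready.
majority_nodes_down() {
  awk '{ total++ } $2 !~ /^Ready/ { bad++ }
       END {
         printf "%d/%d nodes not Ready\n", bad, total
         exit !(bad * 2 > total)
       }'
}

# Usage:
#   kubectl get nodes --no-headers | majority_nodes_down && echo "call cloud provider"
#
# For the control plane trigger, the API server's own readiness checks are at:
#   kubectl get --raw='/readyz?verbose'
```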

Don’t heroically debug alone for more than 20 minutes. Escalate, and document what you’ve tried.