Incident Response Playbook That Actually Gets Used

Most incident playbooks end up as wiki pages nobody reads during an actual incident. Here’s what survives contact with a real 3am pager. The first five minutes: One person is the Incident Commander (IC). If that’s not clear, declare yourself IC. Acknowledge the page. Post in #incidents: “Incident: [brief]. I am IC.” Start a timeline document (even just a text file). Check the public status page, and update it if the incident is user-visible. Don’t dig into the problem yet. Set up the command structure first. ...
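The "even just a text file" timeline can be sketched in a few lines. This is a minimal illustration, not part of the playbook itself; the `log_event` helper is a hypothetical name, and UTC timestamps are an assumption chosen so entries line up with most monitoring tools.

```python
from datetime import datetime, timezone
from pathlib import Path


def log_event(timeline: Path, message: str) -> str:
    """Append one UTC-timestamped entry to the timeline file and return it."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")
    entry = f"{stamp}  {message}"
    # Append mode keeps earlier entries intact, so the file is a running log.
    with timeline.open("a", encoding="utf-8") as fh:
        fh.write(entry + "\n")
    return entry
```

Calling `log_event(Path("incident-timeline.txt"), "Page acknowledged. I am IC.")` (the file name is only an example) appends one line per decision, which is usually enough to reconstruct the sequence afterwards.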

December 22, 2024 · 3 min · Besterry

The Observability Pyramid: Logs, Metrics, Traces in 2026

The three pillars of observability are talked about a lot. Which one to reach for depends on the question you’re answering. Metrics: for “is it broken and how much”. Aggregated numerical data over time. Good for: dashboards and alerts; trends (is latency increasing week-over-week?); capacity planning. Not good for: explaining why a specific request was slow; finding causality between events. Stack: Prometheus + Grafana remains the default. Use OpenTelemetry Metrics if you want vendor-neutral instrumentation. ...

December 10, 2024 · 3 min · Besterry

PostgreSQL Backup Strategies: Not All Backups Are Equal

A backup you can’t restore isn’t a backup. After losing data once (fortunately from a test environment), here’s the framework I apply now. The three levels of recovery: Point-in-time recovery (PITR) restores to any second in the last N days and requires WAL archiving plus base backups. Daily snapshots restore to yesterday’s 3am state; they are simple and cheap, with a 24h RPO. Logical dumps restore specific tables or data subsets and are useful for selective recovery. Most production databases should have all three. ...
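As a rough sketch of the logical-dump level, the helpers below build `pg_dump` and `pg_restore` commands for a single table so the dump can be restored into a scratch database and checked. The function names and any database, table, or file names passed to them are hypothetical; the flags are standard `pg_dump` and `pg_restore` options, but connection settings are assumed to come from the environment.

```python
import subprocess


def dump_table_cmd(database: str, table: str, dump_file: str) -> list[str]:
    """Build a pg_dump command that writes one table to a custom-format archive."""
    return ["pg_dump", "--format=custom", "--table", table,
            "--file", dump_file, database]


def restore_table_cmd(target_db: str, table: str, dump_file: str) -> list[str]:
    """Build a pg_restore command that loads that table into a scratch database."""
    return ["pg_restore", "--dbname", target_db, "--table", table, dump_file]


def run(cmd: list[str]) -> None:
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(cmd, check=True)
```

Running the dump command followed by the restore command against a throwaway database is one way to test that a backup actually restores, rather than assuming it does.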

August 18, 2024 · 2 min · Besterry

Kubernetes Troubleshooting: The First 10 Minutes of an Outage

When PagerDuty wakes you up about a Kubernetes cluster issue, the first 10 minutes matter. Here is the runbook I work through before anything else. Get your bearings: First, confirm what’s actually broken from the user side. Check the status page or synthetic monitor. Many “outages” are monitoring issues, not real problems. Cluster-level check: run kubectl get nodes, then kubectl top nodes. Look for NotReady nodes and resource pressure. If multiple nodes are down, the problem is probably infrastructure, so check the cloud provider console. ...
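The cluster-level check can also be scripted. The sketch below assumes `kubectl` is on the PATH and already pointed at the affected cluster; `summarize_nodes` and `fetch_nodes` are hypothetical helper names. It reads the standard node conditions (`Ready`, `MemoryPressure`, `DiskPressure`, `PIDPressure`) from `kubectl get nodes -o json`.

```python
import json
import subprocess

PRESSURE_TYPES = ("MemoryPressure", "DiskPressure", "PIDPressure")


def summarize_nodes(node_list: dict) -> dict:
    """Return names of nodes that are not Ready and nodes reporting pressure."""
    not_ready, pressure = [], []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        conditions = {c["type"]: c["status"]
                      for c in node["status"].get("conditions", [])}
        # Ready can be "True", "False", or "Unknown"; only "True" is healthy.
        if conditions.get("Ready") != "True":
            not_ready.append(name)
        if any(conditions.get(t) == "True" for t in PRESSURE_TYPES):
            pressure.append(name)
    return {"not_ready": not_ready, "pressure": pressure}


def fetch_nodes() -> dict:
    """Call kubectl and parse the node list as JSON."""
    result = subprocess.run(["kubectl", "get", "nodes", "-o", "json"],
                            check=True, capture_output=True, text=True)
    return json.loads(result.stdout)
```

`summarize_nodes(fetch_nodes())` gives a quick answer to "how many nodes are affected"; more than one name under `not_ready` is the cue to look at the infrastructure layer.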

July 22, 2024 · 2 min · Besterry

Alert Fatigue: Prometheus Rules That Actually Help

Most alerts are noise. The hardest part of monitoring is deciding what NOT to alert on. Here is the framework I use. Rule 1: Every alert must be actionable. If you get paged and there is nothing to do, the alert should not exist. Either fix the root cause, automate the response, or let it be a metric trend instead of a page. Rule 2: Alert on user-visible symptoms. Instead of HighCPUUsage, prefer HighRequestLatency. High CPU usage with good latency means the system is working as designed. High latency means users are hurting. ...
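One way to enforce Rule 1 mechanically is a small lint over the alert rules. The sketch below is an assumption-heavy illustration: it treats a rule as a plain dict shaped like an entry in a Prometheus rules file (`alert`, `labels`, `annotations`), and it assumes the team's own conventions of a `severity: page` label and a `runbook_url` annotation, neither of which Prometheus requires.

```python
def lint_rule(rule: dict) -> list[str]:
    """Flag paging alerts that give the on-call engineer nothing to act on."""
    problems = []
    name = rule.get("alert", "<unnamed>")
    labels = rule.get("labels") or {}
    annotations = rule.get("annotations") or {}
    # Assumed convention: severity=page means a human gets woken up.
    pages_human = labels.get("severity") == "page"
    # Assumed convention: runbook_url points at the steps to take.
    if pages_human and not annotations.get("runbook_url"):
        problems.append(f"{name}: pages a human but has no runbook_url annotation")
    return problems
```

A paging rule with no runbook comes back with one problem string, while a non-paging rule or one with a runbook returns an empty list, so the check can run in CI against the rules before they are deployed.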

June 10, 2024 · 2 min · Besterry