Sre on Besterry — Linux & DevOps Notes

Incident Response Playbook That Actually Gets Used

Sun, 22 Dec 2024 00:00:00 +0000

Most incident playbooks end up as wiki pages nobody reads during an actual incident. Here’s what survives contact with a real 3am pager.

The first five minutes

Exactly one person is the Incident Commander (IC). If it isn’t clear who that is, declare yourself IC.

Acknowledge the page
Post in #incidents: “Incident: [brief]. I am IC.”
Start a timeline document (even just a text file)
Check the public status page and update it if the incident is user-visible

Don’t dig into the problem yet. Set up the command structure first.
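The timeline step can be as small as a shell function that appends timestamped notes to a text file. The sketch below is one way to do it; the function name timeline_note and the TIMELINE_FILE variable are made up for illustration and are not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: append a UTC-timestamped note to an incident timeline file.
# TIMELINE_FILE is an assumed variable name; override it per incident.
TIMELINE_FILE="${TIMELINE_FILE:-./incident-timeline.txt}"

timeline_note() {
  # Join all arguments into one line and prefix it with an ISO 8601 UTC timestamp.
  printf '%s  %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$TIMELINE_FILE"
}

# Example usage (commented out so that sourcing this file has no side effects):
# timeline_note "Page acknowledged. I am IC."
# timeline_note "Status page updated: investigating elevated error rates."
```

Using UTC keeps the timeline unambiguous when responders sit in different time zones, and a plain file can be pasted into the postmortem later.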

The Observability Pyramid: Logs, Metrics, Traces in 2026

Tue, 10 Dec 2024 00:00:00 +0000

The three pillars of observability (logs, metrics, traces) get talked about a lot, usually as a set. Which one to reach for depends on the question you’re answering.

Metrics: for “is it broken and how much”

Aggregated numerical data over time. Good for:

Dashboards and alerts
Trends (is latency increasing week-over-week?)
Capacity planning

Not good for:

Explaining why a specific request was slow
Finding causality between events

Stack: Prometheus + Grafana remains the default. OpenTelemetry Metrics if you want vendor-neutral instrumentation.
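As an illustration of the week-over-week trend question, the sketch below builds a PromQL expression that divides the current p95 latency by the same value seven days earlier. It assumes a histogram metric named http_request_duration_seconds_bucket, which is a common convention and not something the post specifies; the function name and PROM_URL are also hypothetical.

```shell
#!/bin/sh
# Print a PromQL expression: p95 latency now divided by p95 latency one week ago.
# A result above 1 means latency has grown week-over-week.
# The default metric name is an assumption; pass your own histogram as $1.
wow_latency_query() {
  metric="${1:-http_request_duration_seconds_bucket}"
  printf 'histogram_quantile(0.95, sum by (le) (rate(%s[5m]))) / histogram_quantile(0.95, sum by (le) (rate(%s[5m] offset 7d)))\n' "$metric" "$metric"
}

# Example usage against the Prometheus HTTP API (PROM_URL is a placeholder):
# curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$(wow_latency_query)"
```

The same expression can be pasted into a Grafana panel; wrapping it in a function only makes it easy to reuse with different metric names.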

PostgreSQL Backup Strategies: Not All Backups Are Equal

Sun, 18 Aug 2024 00:00:00 +0000

A backup you can’t restore isn’t a backup. After losing data once (fortunately in a test environment), here’s the framework I apply now.

The three levels of recovery

Point-in-time recovery (PITR): Restore to any second in the last N days. Requires WAL archiving + base backups.
Daily snapshots: Restore to yesterday’s 3am state. Simple and cheap, with an RPO of up to 24 hours.
Logical dumps: Restore specific tables or data subsets. Useful for selective recovery.

Most production databases should have all three.
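A rough sketch of the PITR and logical-dump levels using the standard PostgreSQL client tools follows. The directory, the archive path in the comments, and the function names are assumptions for illustration. Daily snapshots are left out because they depend on the storage or cloud provider.

```shell
#!/bin/sh
# Assumed location for backups; adjust to your environment.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/postgres}"

# PITR prerequisite 1: WAL archiving, configured in postgresql.conf.
# Shown as comments; the archive path is a placeholder:
#   archive_mode = on
#   archive_command = 'test ! -f /mnt/wal-archive/%f && cp %p /mnt/wal-archive/%f'

# PITR prerequisite 2: a base backup (tar format, gzip, WAL streamed during the copy).
take_base_backup() {
  pg_basebackup -D "$BACKUP_DIR/base-$(date -u +%Y%m%d)" -Ft -z -X stream
}

# Logical dump of one database in custom format, which pg_restore can restore selectively.
take_logical_dump() {
  db="$1"
  pg_dump -Fc -f "$BACKUP_DIR/$db-$(date -u +%Y%m%d).dump" "$db"
}

# Selective recovery: restore a single table from a custom-format dump.
restore_table() {
  db="$1"; table="$2"; dump_file="$3"
  pg_restore -d "$db" -t "$table" "$dump_file"
}

# Example usage (database and table names are made up):
# take_logical_dump appdb
# restore_table appdb_restore_test orders "$BACKUP_DIR/appdb-20240818.dump"
```

Running restore_table against a scratch database on a schedule is a cheap way to check that the dumps actually restore.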

Kubernetes Troubleshooting: The First 10 Minutes of an Outage

Mon, 22 Jul 2024 00:00:00 +0000

When PagerDuty wakes you up about a Kubernetes cluster issue, the first 10 minutes matter. Here is the runbook I work through before anything else.

Get your bearings

First, confirm what’s actually broken from the user side. Check the status page or synthetic monitor. Many “outages” are monitoring issues, not real problems.

Cluster-level check

kubectl get nodes
kubectl top nodes

Look for NotReady nodes and resource pressure (kubectl top needs the Metrics Server to be installed). If multiple nodes are down, the problem is probably at the infrastructure level, so check the cloud provider console.
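On a large cluster it helps to filter the node list down to the ones that need attention. The helper below is a small sketch, not part of kubectl; the name not_ready_nodes is made up. It reads the node list on stdin so it can be tried without a cluster.

```shell
#!/bin/sh
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Expects the output of `kubectl get nodes --no-headers` on stdin.
# Cordoned nodes ("Ready,SchedulingDisabled") are listed too, which is usually worth knowing.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Example usage during an incident:
# kubectl get nodes --no-headers | not_ready_nodes
# kubectl describe node <node-name>   # Conditions show MemoryPressure, DiskPressure, PIDPressure
```

An empty result means every node reports Ready, which points the investigation away from the node layer.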

Alert Fatigue: Prometheus Rules That Actually Help

Mon, 10 Jun 2024 00:00:00 +0000

Most alerts are noise. The hardest part of monitoring is deciding what NOT to alert on. Here is the framework I use.

Rule 1: Every alert must be actionable

If you get paged and there is nothing to do, the alert should not exist. Either fix the root cause, automate the response, or let it be a metric trend instead of a page.

Rule 2: Alert on user-visible symptoms

Instead of HighCPUUsage, prefer HighRequestLatency. High CPU usage with good latency means the system is working as designed. High latency means users are hurting.
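A symptom-based alert in the spirit of Rule 2 might look like the rule file written by the sketch below. The metric name, the 500ms threshold, the 10-minute duration, and the severity label are all assumptions chosen for illustration; the function name write_latency_rule is hypothetical.

```shell
#!/bin/sh
# Write a sample Prometheus rule file to the path given as $1.
# Metric name, threshold, and duration are assumptions; tune them to your service.
write_latency_rule() {
  cat > "${1:-latency-rules.yml}" <<'EOF'
groups:
  - name: user-visible-symptoms
    rules:
      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "p95 request latency has been above 500ms for 10 minutes"
EOF
}

# Example usage, validating the file if promtool is installed:
# write_latency_rule latency-rules.yml && promtool check rules latency-rules.yml
```

The for clause keeps short spikes from paging anyone, which also serves Rule 1: a brief blip that resolves itself is not actionable.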