Posts

Incident Response Playbook That Actually Gets Used

Most incident playbooks end up as wiki pages nobody reads during an actual incident. Here’s what survives contact with a real 3am pager.

The first five minutes

One person is the Incident Commander (IC). If that’s not clear, declare yourself IC.

- Acknowledge the page
- Post in #incidents: “Incident: [brief]. I am IC.”
- Start a timeline document (even just a text file)
- Check public status page; update if user-visible

Don’t dig into the problem yet. Set up the command structure first. ...

The Observability Pyramid: Logs, Metrics, Traces in 2026

The three pillars of observability are talked about a lot. Which one to reach for depends on the question you’re answering.

Metrics: for “is it broken and how much”

Aggregated numerical data over time.

Good for:
- Dashboards and alerts
- Trends (is latency increasing week-over-week?)
- Capacity planning

Not good for:
- Explaining why a specific request was slow
- Finding causality between events

Stack: Prometheus + Grafana remains the default. OpenTelemetry Metrics if you want vendor-neutral instrumentation. ...

Rust vs Go for CLI Tools: A Practical Comparison

After writing CLI tools in both Rust and Go over the last few years, here are the things that actually matter when choosing between them.

Startup time

A tie. A trivial Go program starts in ~1-5ms. A trivial Rust program also starts in ~1-5ms. Both are negligible for CLI tools. (The old argument about Go’s startup was mostly about JVM-vs-Go, not Go-vs-Rust.)

Binary size

Out of the box:
- Go: 5-15 MB for a small program
- Rust: 2-8 MB for a small program (with LTO and strip)

After aggressive optimization: ...

tcpdump Filters Cheatsheet for When the Network is On Fire

tcpdump has a weird little filter language (BPF syntax) that I never remember under pressure. This page is my cheatsheet.

Basic syntax

    tcpdump -i <interface> -n <filter>

    -n      don't resolve addresses/ports
    -i      interface (eth0, any, lo)
    -v      verbose (-vv, -vvv more)
    -w      write to file for later wireshark
    -r      read from file
    -c N    stop after N packets
    -s 0    capture full packet (not truncated)

Host and network filters

    host 192.0.2.1            # to or from
    src host 192.0.2.1        # from only
    dst host 192.0.2.1        # to only
    net 192.0.2.0/24          # subnet
    src net 192.0.2.0/24      # subnet as source

Port filters

    port 443                  # source or dest port 443
    src port 443              # source only
    dst port 443              # dest only
    portrange 50000-60000     # range

Protocol filters

    tcp                       # TCP only
    udp                       # UDP only
    icmp                      # ICMP only
    arp                       # ARP
    tcp port 443              # combine
    'tcp[tcpflags] & tcp-syn != 0'    # TCP with SYN flag

TCP flag combinations

    # SYN only (connection attempts)
    'tcp[tcpflags] == tcp-syn'
    # SYN-ACK
    'tcp[tcpflags] == tcp-syn|tcp-ack'
    # RST (connection resets)
    'tcp[tcpflags] & tcp-rst != 0'
    # FIN (connection closes)
    'tcp[tcpflags] & tcp-fin != 0'

Combining filters

    host 192.0.2.1 and tcp port 443
    'host 192.0.2.1 and (port 80 or port 443)'
    'not arp and not port 22'

Boolean operators: and, or, not (or &&, ||, !). ...
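The difference between the `&` filters and the `==` filters in the flag section is plain bitmask arithmetic. Here is a minimal Python sketch of that logic, assuming the standard TCP flag bit values (FIN 0x01, SYN 0x02, RST 0x04, ACK 0x10). The function names are hypothetical and exist only for illustration; they are not part of tcpdump.

```python
# Standard TCP flag bits, as found in the flags byte of the TCP header.
TCP_FIN = 0x01
TCP_SYN = 0x02
TCP_RST = 0x04
TCP_ACK = 0x10


def has_syn(flags):
    """Mirrors 'tcp[tcpflags] & tcp-syn != 0': SYN is set, other flags are ignored."""
    return (flags & TCP_SYN) != 0


def is_syn_only(flags):
    """Mirrors 'tcp[tcpflags] == tcp-syn': SYN is set and nothing else is."""
    return flags == TCP_SYN


def is_syn_ack(flags):
    """Mirrors 'tcp[tcpflags] == tcp-syn|tcp-ack': exactly SYN and ACK are set."""
    return flags == (TCP_SYN | TCP_ACK)


if __name__ == "__main__":
    syn_ack = TCP_SYN | TCP_ACK
    # A SYN-ACK matches the mask test but not the equality test for SYN.
    print(has_syn(syn_ack), is_syn_only(syn_ack), is_syn_ack(syn_ack))
```

The mask form is the one to reach for when any packet carrying the flag is interesting; the equality form is stricter and only matches when no other flag is set.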

Self-Host vs SaaS: The Actual Tradeoffs

The self-host vs SaaS debate has passionate advocates on both sides. Reality is nuanced. Here’s the framework I use when deciding.

Cost isn’t the main factor

Many self-host advocates lead with cost savings. That framing is usually misleading:

- SaaS at small scale is often free or cheap ($0-50/mo)
- Self-hosting on cheap VPS starts around $5/mo
- But self-hosting eats engineer time: 2-10 hours/month for maintenance
- At $100/hr engineering time, self-hosting often costs MORE than SaaS

Cost-wise, self-hosting wins when you’re either: ...
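The cost comparison can be worked through with the figures quoted in the excerpt ($5/mo VPS, 2-10 maintenance hours per month, $100/hr, SaaS at $0-50/mo). This is a rough sketch; the function name is hypothetical and the model deliberately ignores everything except hosting and maintenance time.

```python
def self_host_monthly_cost(vps_usd, maintenance_hours, hourly_rate_usd):
    """Monthly self-hosting cost: hosting bill plus engineer time spent on upkeep."""
    return vps_usd + maintenance_hours * hourly_rate_usd


# Figures taken from the excerpt.
VPS_USD = 5
HOURLY_RATE_USD = 100
SAAS_MAX_USD = 50

low = self_host_monthly_cost(VPS_USD, 2, HOURLY_RATE_USD)    # 5 + 2 * 100 = 205
high = self_host_monthly_cost(VPS_USD, 10, HOURLY_RATE_USD)  # 5 + 10 * 100 = 1005

if __name__ == "__main__":
    print(f"self-host: ${low}-${high}/mo, SaaS: up to ${SAAS_MAX_USD}/mo")
```

Even at the low end of the maintenance estimate, engineer time dominates the VPS bill.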

Grafana Dashboards That Don't Suck: Principles and Anti-Patterns

Most Grafana dashboards are bad. Too many panels, unclear queries, inconsistent color schemes, no clear purpose. Here are the principles I apply now.

Rule 1: Every dashboard has one question

Start by writing down: “What question does this dashboard answer?”

Good:
- “Is the order service healthy right now?”
- “How is the nightly ETL job progressing?”
- “What is the cost trend for our compute in the last 30 days?”

Bad:
- “Production metrics”
- “Database overview”

If you can’t state the question in one sentence, you don’t know what the dashboard is for. ...

Terraform State Locking: Why You Need It and How It Goes Wrong

Terraform state without locking is a bug waiting to happen. Two engineers running apply simultaneously can corrupt state in ways that take hours to untangle. Here’s what I learned after one such incident.

Why state locking matters

Terraform reads state, computes a plan, and writes new state. Without locking, two concurrent runs can:

- Both read the same initial state
- Both compute their plans based on it
- Both write conflicting state; last one wins
- Now state doesn’t match real infrastructure

The symptoms are weird: resources exist but Terraform wants to create them again. Or state references resources that were already destroyed. ...
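The lost-update sequence described in this excerpt can be reproduced with a toy in-memory model. This is only a sketch: the state is a plain list of resource names, the function names are hypothetical, and nothing here reflects how Terraform's backends actually implement locking.

```python
import threading


def apply_without_lock(state, new_resources):
    """Every run reads the same initial state, then each writes back its own copy."""
    snapshots = [list(state) for _ in new_resources]  # all runs read first
    final = list(state)
    for snapshot, resource in zip(snapshots, new_resources):
        snapshot.append(resource)  # plan computed from the stale read
        final = snapshot           # write: the last writer wins
    return final


def apply_with_lock(state, new_resources):
    """Each run holds a lock across its whole read-plan-write cycle."""
    lock = threading.Lock()
    store = {"state": list(state)}

    def run(resource):
        with lock:
            current = list(store["state"])  # read
            current.append(resource)        # plan
            store["state"] = current        # write

    threads = [threading.Thread(target=run, args=(r,)) for r in new_resources]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store["state"]


if __name__ == "__main__":
    # Without a lock, subnet-a is silently dropped from state.
    print(apply_without_lock(["vpc"], ["subnet-a", "subnet-b"]))  # ['vpc', 'subnet-b']
    # With a lock, both resources survive (write order may vary).
    print(sorted(apply_with_lock(["vpc"], ["subnet-a", "subnet-b"])))
```

In the unlocked case the dropped resource still exists in the real infrastructure, which is the "resources exist but Terraform wants to create them again" symptom.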

ZFS on Linux: Six Months of Production Use

Migrated our build server array from ext4+mdadm to ZFS on Linux six months ago. Here’s what I learned.

Why ZFS

- Checksumming catches silent data corruption (we found 14 affected files on the old array)
- Snapshots are cheap and instant (100ms for a 10TB dataset)
- Compression often makes things faster: it trades a little CPU for less I/O
- Send/receive for efficient replication
- No separate mdadm/LVM layer to debug

Pool design

For the build server, 6 x 4TB NVMe in RAIDZ2: ...

PostgreSQL Backup Strategies: Not All Backups Are Equal

A backup you can’t restore isn’t a backup. After losing data once (fortunately from a test environment), here’s the framework I apply now.

The three levels of recovery

- Point-in-time recovery (PITR): Restore to any second in the last N days. Requires WAL archiving + base backups.
- Daily snapshots: Restore to yesterday’s 3am state. Simple, cheap, 24h RPO.
- Logical dumps: Restore specific tables or data subsets. Useful for selective recovery.

Most production databases should have all three. ...

Modern TLS Cipher Configuration in 2026

Configuring TLS ciphers used to involve copying a magic list from the Mozilla SSL Configuration Generator and moving on. In 2026 the landscape has shifted enough that revisiting is worth it.

What changed

- TLS 1.3 is now supported by 95%+ of clients. Serving TLS 1.0 or 1.1 is an active liability.
- OpenSSL 3.x became the default on most modern distros. Some older ciphers are simply gone.
- Post-quantum hybrid key exchange (X25519-Kyber768) started rolling out in Chrome and Firefox.
- Perfect Forward Secrecy is universally expected. No more RSA key exchange.

Recommended nginx config

    ssl_protocols TLSv1.2 TLSv1.3;
    # ssl_ciphers applies to TLS 1.2; with OpenSSL, the TLS 1.3 suites
    # are negotiated from the library defaults, so they are not listed here
    ssl_ciphers 'ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256';
    ssl_prefer_server_ciphers off;
    ssl_ecdh_curve X25519:secp521r1:secp384r1;

ssl_prefer_server_ciphers off is correct for modern deployments, because clients know better than servers which ciphers perform well on their hardware. ...
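One way to sanity-check a cipher list against the "no more RSA key exchange" rule is to look at the key-exchange prefix of each OpenSSL suite name. This is a rough sketch that assumes OpenSSL-style names, where TLS 1.2 forward-secret suites start with ECDHE- or DHE- and TLS 1.3 suites start with TLS_ (their key exchange is negotiated separately, so they are skipped here). The function name and the sample list are hypothetical.

```python
# Name prefixes that indicate an ephemeral (forward-secret) key exchange
# in OpenSSL-style TLS 1.2 cipher names.
FORWARD_SECRET_PREFIXES = ("ECDHE-", "DHE-")


def non_pfs_suites(cipher_list):
    """Return the suites in a colon-separated list that lack an ephemeral key exchange."""
    suites = [s for s in cipher_list.split(":") if s]
    return [
        s for s in suites
        if not s.startswith(FORWARD_SECRET_PREFIXES)
        and not s.startswith("TLS_")  # TLS 1.3 names carry no key-exchange prefix
    ]


if __name__ == "__main__":
    # Sample list with one static-RSA suite mixed in.
    sample = "ECDHE-RSA-AES256-GCM-SHA384:AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305"
    print(non_pfs_suites(sample))  # ['AES256-GCM-SHA384']
```

A name-based check like this is only a lint, not proof of what a server negotiates; testing against the live endpoint is still the authoritative check.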