<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Sre on Besterry — Linux &amp; DevOps Notes</title><link>https://besterry.com/tags/sre/</link><description>Recent content in Sre on Besterry — Linux &amp; DevOps Notes</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sun, 22 Dec 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://besterry.com/tags/sre/index.xml" rel="self" type="application/rss+xml"/><item><title>Incident Response Playbook That Actually Gets Used</title><link>https://besterry.com/posts/incident-response-playbook/</link><pubDate>Sun, 22 Dec 2024 00:00:00 +0000</pubDate><guid>https://besterry.com/posts/incident-response-playbook/</guid><description>&lt;p&gt;Most incident playbooks end up as wiki pages nobody reads during an actual incident. Here&amp;rsquo;s what survives contact with a real 3am pager.&lt;/p&gt;
&lt;h2 id="the-first-five-minutes"&gt;The first five minutes&lt;/h2&gt;
&lt;p&gt;Exactly one person is the Incident Commander (IC). If nobody has claimed the role, declare yourself IC.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Acknowledge the page&lt;/li&gt;
&lt;li&gt;Post in #incidents: &amp;ldquo;Incident: [brief]. I am IC.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;Start a timeline document (even just a text file)&lt;/li&gt;
&lt;li&gt;Check the public status page and update it if the incident is user-visible&lt;/li&gt;
&lt;/ol&gt;
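&lt;p&gt;Step 3 needs nothing fancy. As a minimal sketch, assuming a POSIX shell, each timeline entry can be appended with a UTC timestamp. The helper name note and the default file name are made up for illustration and do not come from any tool.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical helper: append a UTC-timestamped line to the incident timeline.
# TIMELINE_FILE is an assumed variable; point it wherever the team keeps notes.
TIMELINE_FILE="${TIMELINE_FILE:-incident-timeline.txt}"

note() {
  # tee -a appends to the file and echoes the entry back to the terminal.
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" | tee -a "$TIMELINE_FILE"
}

# Example:
#   note "Paged for checkout errors. I am IC."
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Timestamps in UTC keep the timeline unambiguous when responders sit in different time zones.&lt;/p&gt;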
&lt;p&gt;Don&amp;rsquo;t dig into the problem yet. Set up the command structure first.&lt;/p&gt;</description></item><item><title>The Observability Pyramid: Logs, Metrics, Traces in 2026</title><link>https://besterry.com/posts/observability-pyramid/</link><pubDate>Tue, 10 Dec 2024 00:00:00 +0000</pubDate><guid>https://besterry.com/posts/observability-pyramid/</guid><description>&lt;p&gt;The three pillars of observability are talked about a lot. Which one to reach for depends on the question you&amp;rsquo;re answering.&lt;/p&gt;
&lt;h2 id="metrics-for-is-it-broken-and-how-much"&gt;Metrics: for &amp;ldquo;is it broken and how much&amp;rdquo;&lt;/h2&gt;
&lt;p&gt;Aggregated numerical data over time. Good for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dashboards and alerts&lt;/li&gt;
&lt;li&gt;Trends (is latency increasing week-over-week?)&lt;/li&gt;
&lt;li&gt;Capacity planning&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Not good for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Explaining why a specific request was slow&lt;/li&gt;
&lt;li&gt;Finding causality between events&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Stack: Prometheus + Grafana remains the default. Use OpenTelemetry Metrics if you want vendor-neutral instrumentation.&lt;/p&gt;</description></item><item><title>PostgreSQL Backup Strategies: Not All Backups Are Equal</title><link>https://besterry.com/posts/postgres-backup-strategies/</link><pubDate>Sun, 18 Aug 2024 00:00:00 +0000</pubDate><guid>https://besterry.com/posts/postgres-backup-strategies/</guid><description>&lt;p&gt;A backup you can&amp;rsquo;t restore isn&amp;rsquo;t a backup. After losing data once (fortunately only in a test environment), here&amp;rsquo;s the framework I apply now.&lt;/p&gt;
&lt;h2 id="the-three-levels-of-recovery"&gt;The three levels of recovery&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Point-in-time recovery (PITR)&lt;/strong&gt;: Restore to any second in the last N days. Requires WAL archiving + base backups.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Daily snapshots&lt;/strong&gt;: Restore to yesterday&amp;rsquo;s 3am state. Simple, cheap, 24h RPO.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Logical dumps&lt;/strong&gt;: Restore specific tables or data subsets. Useful for selective recovery.&lt;/li&gt;
&lt;/ol&gt;
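&lt;p&gt;As a rough sketch of levels 1 and 3, the standard pg_basebackup and pg_dump tools can be wrapped as below. The function names, the BACKUP_DIR variable and its default path are assumptions for illustration. WAL archiving, which PITR also needs, is configured separately on the server and is not shown.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Assumed location for backups; adjust for your environment.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/postgres}"

# Level 1 (PITR): physical base backup as gzip-compressed tar, with progress output.
take_base_backup() {
  pg_basebackup -D "$BACKUP_DIR/base-$(date -u +%Y%m%d)" -Ft -z -P
}

# Level 3 (logical): dump a single table in custom format for selective restore.
# Usage: dump_table DBNAME TABLE
dump_table() {
  pg_dump -Fc -t "$2" -f "$BACKUP_DIR/$1-$2.dump" "$1"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A custom-format dump can later be loaded with pg_restore, which is what makes the logical level useful for recovering one table without touching the rest of the database.&lt;/p&gt;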
&lt;p&gt;Most production databases should have all three.&lt;/p&gt;</description></item><item><title>Kubernetes Troubleshooting: The First 10 Minutes of an Outage</title><link>https://besterry.com/posts/k8s-troubleshooting/</link><pubDate>Mon, 22 Jul 2024 00:00:00 +0000</pubDate><guid>https://besterry.com/posts/k8s-troubleshooting/</guid><description>&lt;p&gt;When PagerDuty wakes you up about a Kubernetes cluster issue, the first 10 minutes matter. Here is the runbook I work through before anything else.&lt;/p&gt;
&lt;h2 id="get-your-bearings"&gt;Get your bearings&lt;/h2&gt;
&lt;p&gt;First, confirm what&amp;rsquo;s actually broken from the user side. Check the status page or synthetic monitor. Many &amp;ldquo;outages&amp;rdquo; are monitoring issues, not real problems.&lt;/p&gt;
&lt;h2 id="cluster-level-check"&gt;Cluster-level check&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;kubectl get nodes
kubectl top nodes
&lt;/code&gt;&lt;/pre&gt;
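&lt;p&gt;On a large cluster the node list gets long. A small filter along these lines keeps only the nodes whose STATUS column is not exactly Ready. The function name not_ready_nodes is made up for illustration, and the sketch assumes the default column layout of kubectl get nodes.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Reads `kubectl get nodes --no-headers` output on stdin and prints
# the name and status of every node whose STATUS is not exactly "Ready".
not_ready_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Usage against a live cluster (assumes kubectl is already configured):
#   kubectl get nodes --no-headers | not_ready_nodes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Cordoned nodes report a status such as Ready,SchedulingDisabled, so they show up here too, which is usually worth knowing during an outage.&lt;/p&gt;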
&lt;p&gt;Look for NotReady nodes and resource pressure. If multiple nodes are down, the problem is probably infrastructure — check the cloud provider console.&lt;/p&gt;</description></item><item><title>Alert Fatigue: Prometheus Rules That Actually Help</title><link>https://besterry.com/posts/prometheus-alerts/</link><pubDate>Mon, 10 Jun 2024 00:00:00 +0000</pubDate><guid>https://besterry.com/posts/prometheus-alerts/</guid><description>&lt;p&gt;Most alerts are noise. The hardest part of monitoring is deciding what NOT to alert on. Here is the framework I use.&lt;/p&gt;
&lt;h2 id="rule-1-every-alert-must-be-actionable"&gt;Rule 1: Every alert must be actionable&lt;/h2&gt;
&lt;p&gt;If you get paged and there is nothing to do, the alert should not exist. Either fix the root cause, automate the response, or let it be a metric trend instead of a page.&lt;/p&gt;
&lt;h2 id="rule-2-alert-on-user-visible-symptoms"&gt;Rule 2: Alert on user-visible symptoms&lt;/h2&gt;
&lt;p&gt;Instead of HighCPUUsage, prefer HighRequestLatency. High CPU usage with good latency means the system is working as designed. High latency means users are hurting.&lt;/p&gt;</description></item></channel></rss>