Kubernetes Troubleshooting: The First 10 Minutes of an Outage

When PagerDuty wakes you up about a Kubernetes cluster issue, the first 10 minutes matter. Here is the runbook I work through before anything else. Get your bearings: first, confirm what’s actually broken from the user side. Check the status page or synthetic monitor. Many “outages” are monitoring issues, not real problems. Cluster-level check: run kubectl get nodes and kubectl top nodes, then look for NotReady nodes and resource pressure. If multiple nodes are down, the problem is probably infrastructure; check the cloud provider console. ...
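For the cluster-level check, a small filter can pull out only the nodes that need attention. This is a minimal sketch, assuming a POSIX shell with awk available; the function name not_ready_nodes is hypothetical, and it relies only on the default column order of kubectl get nodes (NAME first, STATUS second).

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Reads the output of `kubectl get nodes --no-headers` on stdin.
# Note: cordoned nodes report "Ready,SchedulingDisabled" and are listed too.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage against a live cluster (requires kubectl and a valid context):
#   kubectl get nodes --no-headers | not_ready_nodes
```

An empty result means every node reports plain Ready, which points the investigation away from the nodes and toward workloads or networking.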

July 22, 2024 · 2 min · Besterry

Alert Fatigue: Prometheus Rules That Actually Help

Most alerts are noise. The hardest part of monitoring is deciding what NOT to alert on. Here is the framework I use. Rule 1: Every alert must be actionable. If you get paged and there is nothing to do, the alert should not exist. Either fix the root cause, automate the response, or let it be a metric trend instead of a page. Rule 2: Alert on user-visible symptoms. Instead of HighCPUUsage, prefer HighRequestLatency. High CPU usage with good latency means the system is working as designed. High latency means users are hurting. ...

June 10, 2024 · 2 min · Besterry

Reducing Container Image Size: Multi-Stage Builds and Alpine

Small images boot faster, save bandwidth, and have a smaller attack surface. Here are the techniques that actually work. Multi-stage builds: the single biggest win. Build in one stage, then copy only the artifacts to a minimal runtime stage. A Go binary of 15 MB ends up in a 17 MB image. Compare that to a naive golang:1.22 image at 900+ MB. Base image choice, from smallest to largest for Go/Rust static binaries: ...

May 20, 2024 · 1 min · Besterry

Useful bpftrace One-Liners for System Debugging

bpftrace makes the kernel event space accessible from a bash one-liner. Here are the scripts I keep reaching for.

Count syscalls by process:

    bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

Distribution of file read sizes:

    bpftrace -e 'tracepoint:syscalls:sys_enter_read { @ = hist(args->count); }'

TCP retransmissions by remote address:

    bpftrace -e '
    kprobe:tcp_retransmit_skb {
      $sk = (struct sock *)arg0;
      $daddr = $sk->__sk_common.skc_daddr;
      @[ntop($daddr)] = count();
    }'

Process creation stream:

    bpftrace -e 'tracepoint:sched:sched_process_exec { printf("%s\n", str(args->filename)); }'

When to use bpftrace vs perf vs strace:

- strace: simple, but adds significant overhead. Fine for debugging a single misbehaving process.
- perf: best for sampling-based profiling (CPU time, cache misses). Low overhead.
- bpftrace: best for event-driven tracing across the whole system. Tiny overhead if used sparingly.

All three should be in your toolbox.
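One way to stay on the "used sparingly" side is to make each trace stop by itself instead of waiting for Ctrl-C. The following is a minimal sketch, assuming a POSIX shell; the helper name bounded_trace is hypothetical, while the interval:s:N probe and exit() are standard bpftrace features.

```shell
# Append an interval probe so the trace exits on its own after N seconds.
# Args: duration in seconds, then the bpftrace program text.
bounded_trace() {
  seconds="$1"
  program="$2"
  printf '%s interval:s:%s { exit(); }\n' "$program" "$seconds"
}

# Usage (requires root and an installed bpftrace):
#   sudo bpftrace -e "$(bounded_trace 10 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }')"
```

When the interval probe fires, bpftrace exits and prints the collected maps, so the syscall counts still appear after the ten seconds are up.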

May 2, 2024 · 1 min · Besterry

WireGuard vs AmneziaWG: When Obfuscation Matters

Plain WireGuard is simple and fast. AmneziaWG adds obfuscation to the handshake. When do you need which? Plain WireGuard is enough when you control both endpoints, no DPI is filtering your traffic, and the main concern is performance and simplicity. WireGuard shines for site-to-site VPN between your own servers, remote access to a home lab, and point-to-point tunnels on a LAN. The handshake is small, fast, and provably secure. It uses Noise framework primitives and completes in 1 RTT. ...

April 15, 2024 · 2 min · Besterry

SSH Hardening Checklist for Public VPS

Every public-facing server gets port-scanned within minutes of going online. Default SSH settings are decent but not great. Here is the checklist I run through on every new VPS. Disable password authentication: in /etc/ssh/sshd_config, set PasswordAuthentication no, PubkeyAuthentication yes, ChallengeResponseAuthentication no, and KbdInteractiveAuthentication no. Restrict root login: set PermitRootLogin prohibit-password. This allows root login with a key but not a password, which is fine for automation. For stricter setups, use no and sudo from an unprivileged user. ...

April 1, 2024 · 1 min · Besterry

Docker Network Debugging: nsenter and tcpdump Patterns

When a container cannot reach something, the instinct is often to exec into it and curl. But most slim containers lack curl, dig, tcpdump, or even ping. A better pattern: use nsenter from the host. Enter the container network namespace: get the container PID with docker inspect -f '{{.State.Pid}}' myapp, then run sudo nsenter -t PID -n bash. You are now in the container network namespace, but with the host binaries. tcpdump, ip, ss, and dig all work. ...
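The two steps (look up the PID, then enter the namespace) can be folded into one helper. This is a minimal sketch, assuming a POSIX shell on the Docker host with sudo rights; the function name netns_exec is hypothetical, and it uses only the docker inspect and nsenter invocations already shown.

```shell
# Run a host binary inside a container's network namespace.
# Usage: netns_exec <container> <command> [args...]
netns_exec() {
  container="$1"
  shift
  pid="$(docker inspect -f '{{.State.Pid}}' "$container")" || return 1
  # Docker reports a PID of 0 when the container is not running.
  if [ -z "$pid" ] || [ "$pid" = "0" ]; then
    echo "container $container is not running" >&2
    return 1
  fi
  sudo nsenter -t "$pid" -n "$@"
}

# Usage:
#   netns_exec myapp ss -tn
#   netns_exec myapp tcpdump -ni eth0 port 443
```

Passing the command through instead of always starting bash keeps one-off checks short and makes the helper usable from scripts.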

March 20, 2024 · 2 min · Besterry

nginx Performance Tuning: Practical Notes from Production

After running nginx on everything from 512 MB VPS instances to multi-socket bare metal, here are the settings I’ve found actually matter.

worker_processes and worker_connections: start with worker_processes auto;.

    worker_processes auto;
    worker_rlimit_nofile 65535;

    events {
        worker_connections 4096;
        use epoll;
        multi_accept on;
    }

Keepalive tuning:

    http {
        keepalive_timeout 30s;
        keepalive_requests 1000;

        upstream backend {
            server 10.0.0.1:8080;
            keepalive 32;
        }
    }

Buffer sizes:

    client_body_buffer_size 128k;
    client_max_body_size 50m;
    proxy_buffer_size 8k;
    proxy_buffers 8 8k;

gzip and brotli:

    gzip on;
    gzip_comp_level 5;
    gzip_types text/plain text/css application/json;

    brotli on;
    brotli_comp_level 4;
    brotli_types text/plain text/css application/json;

Measurement: none of this matters if you don’t measure. Install nginx-module-vts or expose stub_status, feed metrics to Prometheus, and compare before/after for any changes.
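As a rough sanity check on the worker settings, the ceiling on concurrent clients is worker_processes times worker_connections, and about half that when nginx proxies, because worker_connections also counts connections to upstream servers. A minimal sketch of that arithmetic, assuming a POSIX shell; the function name max_clients and the 4-worker figure are hypothetical.

```shell
# Rough upper bound on concurrent clients for one nginx instance.
# Args: worker count, worker_connections, and 1 if nginx proxies to an
# upstream (each client then also holds one upstream connection), else 0.
max_clients() {
  workers="$1"
  connections="$2"
  proxied="$3"
  total=$((workers * connections))
  if [ "$proxied" -eq 1 ]; then
    total=$((total / 2))
  fi
  echo "$total"
}

# Example: 4 workers with worker_connections 4096.
max_clients 4 4096 0   # prints 16384 (serving static files directly)
max_clients 4 4096 1   # prints 8192  (reverse proxy)
```

worker_rlimit_nofile applies per worker, so it should stay comfortably above worker_connections; with 65535 against 4096 the file descriptor limit is not the binding constraint.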

March 5, 2024 · 1 min · Besterry

systemd Timers vs Cron: When to Use Which

Cron has been the standard scheduler on Unix for decades. systemd timers are newer and more powerful, but also more verbose. Cron wins for one-line scripts that need to run on a simple schedule. Writing 0 3 * * * /usr/local/bin/backup.sh is fast, requires no other files, and works on every Unix-like system since the 1970s. systemd timers win when you want any of these: logging integrated with journalctl; dependencies on other units (After=network-online.target); resource limits (MemoryMax=, CPUQuota=); randomized delays to avoid a thundering herd (RandomizedDelaySec=); the ability to trigger a run manually with systemctl start; catch-up behavior after the system was off (Persistent=true). Minimal systemd timer example, /etc/systemd/system/backup.service: ...

February 17, 2024 · 1 min · Besterry

Linux Networking Deep Dive: From Socket to Wire

Every time a packet leaves your Linux machine, it travels through a surprisingly long sequence of stages. Understanding this path helps enormously when debugging network issues. The socket layer: when your application calls send() or write() on a socket, the kernel’s socket layer takes over. For a TCP socket this means handing the data to tcp_sendmsg(), which in turn enqueues it into the socket’s send buffer. You can observe the send queue depth with ss -tipm: ...

February 10, 2024 · 2 min · Besterry