Debugging

tcpdump Filters Cheatsheet for When the Network is On Fire

tcpdump has a weird little filter language (BPF syntax) that I never remember under pressure. This page is my cheatsheet.

Basic syntax

    tcpdump -i <interface> -n <filter>

    -n      don't resolve addresses/ports
    -i      interface (eth0, any, lo)
    -v      verbose (-vv, -vvv more)
    -w      write to file for later wireshark
    -r      read from file
    -c N    stop after N packets
    -s 0    capture full packet (not truncated)

Host and network filters

    host 192.0.2.1            # to or from
    src host 192.0.2.1        # from only
    dst host 192.0.2.1        # to only
    net 192.0.2.0/24          # subnet
    src net 192.0.2.0/24      # subnet as source

Port filters

    port 443                  # source or dest port 443
    src port 443              # source only
    dst port 443              # dest only
    portrange 50000-60000     # range

Protocol filters

    tcp                       # TCP only
    udp                       # UDP only
    icmp                      # ICMP only
    arp                       # ARP
    tcp port 443              # combine
    'tcp[tcpflags] & tcp-syn != 0'    # TCP with SYN flag

TCP flag combinations

    # SYN only (connection attempts)
    'tcp[tcpflags] == tcp-syn'
    # SYN-ACK
    'tcp[tcpflags] == tcp-syn|tcp-ack'
    # RST (connection resets)
    'tcp[tcpflags] & tcp-rst != 0'
    # FIN (connection closes)
    'tcp[tcpflags] & tcp-fin != 0'

Combining filters

    host 192.0.2.1 and tcp port 443
    'host 192.0.2.1 and (port 80 or port 443)'
    'not arp and not port 22'

Boolean operators: and, or, not (or &&, ||, !).

...
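As a rough sketch of the combining rules above, the helper below builds a quoted filter for one host and several ports and prints the resulting command instead of running it. The name capture_cmd and its argument order are made up for illustration; only the tcpdump flags and filter keywords come from the cheatsheet.

```shell
#!/bin/sh
# capture_cmd is a hypothetical helper, not part of tcpdump.
# It prints (does not run) a tcpdump command for one host and one or more ports.
# Usage: capture_cmd <interface> <host> <port> [port...]
capture_cmd() {
    if [ "$#" -lt 3 ]; then
        echo "usage: capture_cmd <interface> <host> <port> [port...]" >&2
        return 2
    fi
    iface=$1
    host=$2
    shift 2
    ports=""
    # Join the remaining arguments as "port A or port B or ..."
    for p in "$@"; do
        if [ -z "$ports" ]; then
            ports="port $p"
        else
            ports="$ports or port $p"
        fi
    done
    # Parentheses group the port alternatives; the single quotes in the
    # printed command keep the shell from interpreting them.
    printf "tcpdump -i %s -n 'host %s and (%s)'\n" "$iface" "$host" "$ports"
}

capture_cmd eth0 192.0.2.1 80 443
# prints: tcpdump -i eth0 -n 'host 192.0.2.1 and (port 80 or port 443)'
```

Printing the command first makes it easy to check the quoting before pasting it into a root shell.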

Kubernetes Troubleshooting: The First 10 Minutes of an Outage

When PagerDuty wakes you up about a Kubernetes cluster issue, the first 10 minutes matter. Here is the runbook I work through before anything else.

Get your bearings

First, confirm what’s actually broken from the user side. Check the status page or synthetic monitor. Many “outages” are monitoring issues, not real problems.

Cluster-level check

    kubectl get nodes
    kubectl top nodes

Look for NotReady nodes and resource pressure. If multiple nodes are down, the problem is probably infrastructure, so check the cloud provider console.

...
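On a large cluster the NotReady nodes can be hard to spot by eye. A minimal sketch, assuming the default `kubectl get nodes` table (a header row, NAME in the first column, STATUS in the second); not_ready_nodes is a hypothetical helper name, not a kubectl feature.

```shell
#!/bin/sh
# not_ready_nodes is a hypothetical helper, not part of kubectl.
# It reads `kubectl get nodes` output on stdin and prints the names of
# nodes whose STATUS column does not start with "Ready".
not_ready_nodes() {
    # NR > 1 skips the header row; $1 is NAME, $2 is STATUS.
    # "Ready,SchedulingDisabled" still starts with "Ready", so cordoned
    # but healthy nodes are not reported.
    awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Typical use during an incident:
#   kubectl get nodes | not_ready_nodes
```

Empty output means every node reports Ready. The helper assumes the header row is present, so it would skip the first node if kubectl were run with --no-headers.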

Useful bpftrace One-Liners for System Debugging

bpftrace makes the kernel event space accessible from a bash one-liner. Here are the scripts I keep reaching for.

Count syscalls by process

    bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

Distribution of file read sizes

    bpftrace -e 'tracepoint:syscalls:sys_enter_read { @ = hist(args->count); }'

TCP retransmissions by remote address

    bpftrace -e '
    kprobe:tcp_retransmit_skb {
        $sk = (struct sock *)arg0;
        $daddr = $sk->__sk_common.skc_daddr;
        @[ntop($daddr)] = count();
    }'

Process creation stream

    bpftrace -e 'tracepoint:sched:sched_process_exec { printf("%s\n", str(args->filename)); }'

When to use bpftrace vs perf vs strace

strace: simple, but adds significant overhead. Fine for debugging a single misbehaving process.
perf: best for sampling-based profiling (CPU time, cache misses). Low overhead.
bpftrace: best for event-driven tracing across the whole system. Tiny overhead if used sparingly.

All three should be in your toolbox.
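One way to keep tracing "sparing" is to bound how long a probe stays attached. The sketch below is only an illustration: trace_for is a hypothetical wrapper name, and it assumes the `timeout` utility from GNU coreutils is available on the host.

```shell
#!/bin/sh
# trace_for is a hypothetical wrapper, not part of bpftrace, perf, or strace.
# It runs a command and interrupts it after a fixed number of seconds.
# Usage: trace_for <seconds> <command> [args...]
trace_for() {
    if [ "$#" -lt 2 ]; then
        echo "usage: trace_for <seconds> <command> [args...]" >&2
        return 2
    fi
    secs=$1
    shift
    # SIGINT is what Ctrl-C sends, so the traced tool can exit cleanly
    # (bpftrace prints its maps when it exits).
    timeout -s INT "$secs" "$@"
}

# Example, from a root shell: count syscalls by process for 10 seconds.
#   trace_for 10 bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
```

The same wrapper works for strace or perf, since it only needs a command that stops on SIGINT.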

Docker Network Debugging: nsenter and tcpdump Patterns

When a container cannot reach something, the instinct is often to exec into it and curl. But most slim containers lack curl, dig, tcpdump, or even ping. A better pattern: use nsenter from the host.

Enter the container network namespace

Get the container PID:

    docker inspect -f '{{.State.Pid}}' myapp

Then:

    sudo nsenter -t PID -n bash

You are now in the container network namespace, but with the host binaries: tcpdump, ip, ss, and dig all work.

...
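The two steps above can be wrapped so that a bad PID is caught before nsenter runs. This is a sketch, not a finished tool: run_in_netns is a hypothetical helper name, and the only commands it relies on are the nsenter and docker inspect invocations already shown.

```shell
#!/bin/sh
# run_in_netns is a hypothetical helper, not part of docker or util-linux.
# It runs a host command inside the network namespace of the given PID.
# Usage: run_in_netns <pid> <command> [args...]
run_in_netns() {
    if [ "$#" -lt 2 ]; then
        echo "usage: run_in_netns <pid> <command> [args...]" >&2
        return 2
    fi
    pid=$1
    shift
    case $pid in
        ''|*[!0-9]*)
            echo "run_in_netns: expected a numeric PID, got '$pid'" >&2
            return 2
            ;;
    esac
    if [ "$pid" -eq 0 ]; then
        # docker inspect reports Pid 0 when the container is not running.
        echo "run_in_netns: PID is 0, container is not running" >&2
        return 2
    fi
    # -t selects the target process, -n enters only its network namespace,
    # so the command still sees the host filesystem and binaries.
    sudo nsenter -t "$pid" -n "$@"
}

# Example (myapp is the container name used above):
#   pid=$(docker inspect -f '{{.State.Pid}}' myapp)
#   run_in_netns "$pid" ss -tn
```

Passing the command as arguments, rather than opening an interactive bash, makes one-off checks easier to script.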