A backup you can’t restore isn’t a backup. After losing data once (fortunately only in a test environment), here’s the framework I apply now.
The three levels of recovery
- Point-in-time recovery (PITR): Restore to any second in the last N days. Requires WAL archiving + base backups.
- Daily snapshots: Restore to yesterday’s 3am state. Simple, cheap, 24h RPO.
- Logical dumps: Restore specific tables or data subsets. Useful for selective recovery.
Most production databases should have all three.
Base backup + WAL archiving (pgBackRest)
pgBackRest is a widely used tool for this in 2026. A minimal config:
[global]
repo1-path=/var/backup/pgbackrest
repo1-retention-full=2
repo1-retention-diff=6
repo1-retention-archive=7
[main]
pg1-path=/var/lib/postgresql/16/main
pg1-port=5432
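Weekly full backup, daily diff, continuous WAL archiving. The schedule itself comes from cron; the config above only sets retention. The retention values count backups rather than days, so keeping 2 weekly fulls lets you restore to any point in at least the last 7 days.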
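Here is a minimal sketch of the cron side of that schedule. It assumes the stanza is called main (as in the config above), that it has already been initialised with pgbackrest --stanza=main stanza-create, and that postgresql.conf sets archive_mode = on with archive_command = 'pgbackrest --stanza=main archive-push %p'. The function names and the choice of Sunday for the full backup are mine, not anything pgBackRest prescribes.

```shell
# Sketch only: pick tonight's backup type by weekday, then call pgBackRest.
# STANZA defaults to "main" to match the config above.
STANZA="${STANZA:-main}"

# Print "full" on Sunday (ISO weekday 7), "diff" on any other day.
backup_type_for_day() {
  local weekday
  weekday="$1"   # 1 = Monday ... 7 = Sunday, as printed by `date +%u`
  if [ "$weekday" -eq 7 ]; then
    echo full
  else
    echo diff
  fi
}

# Run tonight's backup; meant to be called once per day from cron
# as the postgres OS user.
run_backup() {
  local kind
  kind="$(backup_type_for_day "$(date +%u)")"
  pgbackrest --stanza="$STANZA" --type="$kind" backup
}
```

A single daily cron entry that sources this file and calls run_backup covers both backup types. For the restore side, point-in-time recovery uses pgbackrest --stanza=main --type=time --target="<timestamp>" restore, where the timestamp is whatever moment you need to get back to.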
Testing restores
Schedule a weekly test restore to a temporary PostgreSQL instance, then verify data integrity, for example by checking row counts on a few key tables. Without regular restore testing, you don’t know whether your backups work, and by the time you find out, it’s too late.
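A rough sketch of what that weekly job could look like. Everything named here is an assumption for illustration: the scratch directory, the port, and the idea of using a minimum row count as the integrity check. It also assumes postgresql.conf lives inside the data directory; on Debian-style layouts, where config sits under /etc/postgresql, the pg_ctl step would need to be pointed at a config file.

```shell
# Hypothetical scratch location and port for the throwaway instance.
SCRATCH_DIR="${SCRATCH_DIR:-/var/tmp/restore-test}"
SCRATCH_PORT="${SCRATCH_PORT:-5499}"

# Succeed only if the restored table holds at least the expected rows.
rows_ok() {
  local actual minimum
  actual="$1"
  minimum="$2"
  [ "$actual" -ge "$minimum" ]
}

# Restore the latest backup into the scratch directory, start it on a
# side port, count rows in one table, then shut it down again.
# Run as the postgres OS user.
test_restore() {
  local db table minimum count
  db="$1"
  table="$2"
  minimum="$3"

  # Start from an empty directory with the permissions PostgreSQL expects.
  rm -rf "$SCRATCH_DIR"
  mkdir -p "$SCRATCH_DIR" && chmod 700 "$SCRATCH_DIR" || return 1

  # Override pg1-path so the restore does not touch the live cluster.
  pgbackrest --stanza=main --pg1-path="$SCRATCH_DIR" restore || return 1

  # archive_mode=off keeps the scratch instance from pushing WAL
  # into the real backup repository.
  pg_ctl -D "$SCRATCH_DIR" -o "-p $SCRATCH_PORT -c archive_mode=off" -w start || return 1

  count="$(psql -p "$SCRATCH_PORT" -d "$db" -Atc "SELECT count(*) FROM $table")"
  pg_ctl -D "$SCRATCH_DIR" -m fast -w stop

  rows_ok "$count" "$minimum"
}
```

Called as test_restore mydb some_table 1000, the function returns non-zero if any step fails or the table comes back smaller than expected, which is enough to hang an alert on.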
pg_dump for logical backups
pg_dump -Fc -f mydb.dump mydb
-Fc selects the custom format, which supports parallel restore into an existing database:
pg_restore -j 4 -d mydb_new mydb.dump
Use pg_dumpall --globals-only for roles and tablespaces across the cluster, since pg_dump does not include them.
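Putting the two commands together, a nightly logical dump might look roughly like this. The backup directory and the date-stamped file naming are assumptions of mine; adjust them to whatever your retention tooling expects.

```shell
# Hypothetical target directory for logical dumps.
BACKUP_DIR="${BACKUP_DIR:-/var/backup/logical}"

# Build a date-stamped dump path, e.g. <dir>/mydb-2026-01-15.dump
dump_path() {
  printf '%s/%s-%s.dump\n' "$BACKUP_DIR" "$1" "$2"
}

# Dump one database in custom format, plus cluster-wide roles and
# tablespaces, which pg_dump leaves out.
nightly_dump() {
  local db today
  db="$1"
  today="$(date +%Y-%m-%d)"
  pg_dump -Fc -f "$(dump_path "$db" "$today")" "$db" || return 1
  pg_dumpall --globals-only -f "$BACKUP_DIR/globals-$today.sql"
}
```

On restore, load the globals file with psql first so that the roles exist before pg_restore tries to assign ownership.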
Cross-region replication
For HA, consider streaming replication. On the standby (PostgreSQL 12 or later, with an empty standby.signal file in the data directory):
primary_conninfo = 'host=primary.example.com port=5432 user=repl'
restore_command = 'pgbackrest --stanza=main archive-get %f "%p"'
A standby in a different region keeps you running through a zone or datacenter failure.
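One way to create that standby is to clone the primary with pg_basebackup. The sketch below reuses the host, port, and user from the settings above; the helper names are hypothetical, and it assumes the repl role has the REPLICATION attribute and is allowed in the primary's pg_hba.conf.

```shell
# Build the libpq connection string shared by pg_basebackup and
# primary_conninfo.
conninfo() {
  printf 'host=%s port=%s user=%s\n' "$1" "$2" "$3"
}

# Clone the primary into an empty data directory and mark it as a standby.
#   -R         write primary_conninfo and create standby.signal
#   -X stream  stream WAL during the copy so the clone is consistent
#   -P         show progress
bootstrap_standby() {
  local data_dir
  data_dir="$1"
  pg_basebackup -d "$(conninfo primary.example.com 5432 repl)" \
    -D "$data_dir" -R -X stream -P
}
```

pg_basebackup -R only writes the connection settings, so restore_command still has to be added by hand if you want the standby to fall back to the WAL archive when streaming falls behind.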
Mistakes I’ve made
- Keeping backups on the same server as the database (lost everything in one disk failure)
- Forgetting to back up pg_hba.conf, postgresql.conf, and other config (restore worked but auth was broken)
- Not backing up roles (pg_dump by default doesn’t include CREATE ROLE)
- Assuming compression at the filesystem level (ZFS) was “enough” (rebuilding on a different filesystem took longer)
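For the config-file mistake in particular, a small sketch of how the files could be captured alongside the data. The function names are mine, and it assumes the script runs as a role allowed to read these settings (a superuser or a member of pg_read_all_settings).

```shell
# Archive the given files into one compressed tarball.
archive_config() {
  local out
  out="$1"
  shift
  tar -czf "$out" "$@"
}

# Ask the running server where its config files live, then archive them.
backup_server_config() {
  local out conf hba ident
  out="$1"
  conf="$(psql -Atc 'SHOW config_file')" || return 1
  hba="$(psql -Atc 'SHOW hba_file')" || return 1
  ident="$(psql -Atc 'SHOW ident_file')" || return 1
  archive_config "$out" "$conf" "$hba" "$ident"
}
```

The tarball then needs to leave the database server like every other backup artifact, otherwise the first mistake in the list still applies.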
What I tell junior team members
Your backup system should answer three questions without hesitation:
- What is the RPO? (How much data can we lose?)
- What is the RTO? (How long to restore?)
- When was the last successful test restore?
If any of these are uncertain, that’s the first thing to fix.
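The third question is the easiest to automate. As a sketch, suppose whatever job runs your test restore writes the current epoch time to a stamp file on success; the file path and function names below are hypothetical.

```shell
# Hypothetical stamp file holding the epoch seconds of the last
# successful test restore.
STAMP_FILE="${STAMP_FILE:-/var/backup/last-test-restore}"

# Whole days between two epoch timestamps (earlier first).
days_between() {
  echo $(( ($2 - $1) / 86400 ))
}

# Succeed only if the last successful test restore is recent enough.
restore_test_is_fresh() {
  local max_days last
  max_days="$1"
  [ -r "$STAMP_FILE" ] || return 1
  last="$(cat "$STAMP_FILE")"
  [ "$(days_between "$last" "$(date +%s)")" -le "$max_days" ]
}
```

Wiring restore_test_is_fresh 7 into monitoring turns "when was the last successful test restore?" from a question someone has to remember to ask into an alert.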