A backup you can’t restore isn’t a backup. I lost data once (fortunately only in a test environment), and here’s the framework I apply now.

The three levels of recovery

  1. Point-in-time recovery (PITR): Restore to any second in the last N days. Requires WAL archiving + base backups.
  2. Daily snapshots: Restore to yesterday’s 3am state. Simple and cheap, but you can lose up to 24 hours of data (24h RPO).
  3. Logical dumps: Restore specific tables or data subsets. Useful for selective recovery.

Most production databases should have all three.

Base backup + WAL archiving (pgBackRest)

pgBackRest is a widely used tool for this. A minimal pgbackrest.conf:

[global]
# Assumption: this path is a mount backed by separate storage, not the database server's own disk
repo1-path=/var/backup/pgbackrest
repo1-retention-full=2
repo1-retention-diff=6
repo1-retention-archive=7

[main]
pg1-path=/var/lib/postgresql/16/main
pg1-port=5432

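This config only sets retention. The schedule (weekly full backup, daily diff) comes from cron, and continuous WAL archiving needs archive_mode and archive_command set in postgresql.conf. The retention values count backups, not days: with weekly fulls and repo1-retention-full=2, the oldest retained full backup is between 7 and 14 days old, so you can restore to any point in at least the last 7 days.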
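
The config above only covers the repository side. Here is a minimal sketch of the remaining wiring, assuming pgbackrest and psql are on PATH, the commands run as the postgres OS user, and the stanza is named main as in the config. The function names and cron times are my own placeholders, not an official setup procedure.

```shell
# Point PostgreSQL's WAL archiver at pgBackRest.
# archive_mode only takes effect after a server restart.
enable_archiving() {
    psql -c "ALTER SYSTEM SET archive_mode = 'on'" &&
    psql -c "ALTER SYSTEM SET archive_command = 'pgbackrest --stanza=main archive-push %p'"
}

# One-time repository setup, run after the restart.
init_repo() {
    pgbackrest --stanza=main stanza-create &&
    pgbackrest --stanza=main check    # confirms WAL actually reaches the repo
}

# Example crontab entries for the postgres user (times are arbitrary):
#   30 3 * * 0    pgbackrest --stanza=main --type=full backup
#   30 3 * * 1-6  pgbackrest --stanza=main --type=diff backup
```

Run enable_archiving, restart PostgreSQL, then run init_repo. If the check command fails, fix archiving before trusting any backup taken afterwards.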

Testing restores

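Schedule a weekly test restore to a temporary PostgreSQL instance, then verify the data: confirm the instance starts, run sanity queries (row counts, latest timestamps) against key tables, and record how long the restore took. Without regular restore testing, you don’t know whether your backups work, and by the time you find out, it’s too late.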
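
A rough sketch of what such a test restore could look like, meant for a scratch host. It assumes pgbackrest, pg_ctl and psql are on PATH, that postgresql.conf lives inside the data directory (Debian-style layouts keep it under /etc, so pg_ctl would need a config_file override there), and that there are no tablespaces outside the data directory. RESTORE_DIR, RESTORE_PORT and test_restore are names I made up for the example.

```shell
RESTORE_DIR=${RESTORE_DIR:-/var/lib/postgresql/restore-test}   # hypothetical scratch path
RESTORE_PORT=${RESTORE_PORT:-5499}                              # hypothetical free port

test_restore() {
    # pgBackRest expects an empty target directory.
    rm -rf "$RESTORE_DIR" && mkdir -p -m 700 "$RESTORE_DIR" || return 1

    # --archive-mode=off keeps the scratch instance from pushing WAL
    # into the production repository.
    pgbackrest --stanza=main --pg1-path="$RESTORE_DIR" \
        --archive-mode=off restore || return 1

    pg_ctl -D "$RESTORE_DIR" -o "-p $RESTORE_PORT" -w start || return 1

    # Smoke check only: replace with row counts or freshness queries
    # against the tables you care about.
    psql -p "$RESTORE_PORT" -d postgres -Atc "SELECT count(*) FROM pg_database"
    status=$?

    pg_ctl -D "$RESTORE_DIR" -m fast stop
    return "$status"
}
```

Wrapping the call in time (time test_restore) gives you a measured restore duration to compare against your RTO.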

pg_dump for logical backups

pg_dump -Fc -f mydb.dump mydb

-Fc writes the custom format, which is compressed by default and supports selective and parallel restore (the target database must already exist):

pg_restore -j 4 -d mydb_new mydb.dump

pg_dump covers a single database and skips cluster-wide objects. Use pg_dumpall --globals-only to capture roles and tablespaces.
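
Putting the two tools together, here is a small sketch that dumps globals plus one database, and pulls a single table back out of a custom-format dump. The function names, the /mnt/backup/logical path and the orders table are hypothetical; the flags are standard pg_dumpall, pg_dump and pg_restore options.

```shell
# Dump cluster-wide objects (roles, tablespaces) plus one database into out_dir.
dump_logical() {
    db=$1
    out_dir=$2
    pg_dumpall --globals-only -f "$out_dir/globals.sql" &&
    pg_dump -Fc -f "$out_dir/$db.dump" "$db"
}

# Restore just one table from a custom-format dump into an existing database.
restore_table() {
    dump=$1
    target_db=$2
    table=$3
    pg_restore -d "$target_db" -t "$table" "$dump"
}

# Usage ("orders" and the backup path are hypothetical):
#   dump_logical mydb /mnt/backup/logical
#   restore_table /mnt/backup/logical/mydb.dump mydb_new orders
```

Note that pg_restore -t does not pull in objects the table depends on, so treat restore_table as a way to get rows back rather than a full schema restore.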

Cross-region replication

For HA, consider streaming replication. On the standby (PostgreSQL 12+, with a standby.signal file in the data directory):

primary_conninfo = 'host=primary.example.com port=5432 user=repl'
restore_command = 'pgbackrest --stanza=main archive-get %f "%p"'

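A standby in a different region or datacenter keeps you running through a zone or datacenter failure. Replication is not a backup, though: a bad DELETE or DROP TABLE reaches the standby within seconds.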
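
For completeness, a sketch of how the standby itself might be created and monitored, assuming pg_basebackup and psql are on PATH and the repl role from the settings above has replication access. make_standby and standby_lag_seconds are illustrative names.

```shell
# Clone the primary into an empty data directory and configure it as a standby.
# -R writes standby.signal and primary_conninfo; -X stream copies WAL during the copy.
make_standby() {
    data_dir=$1
    pg_basebackup -h primary.example.com -p 5432 -U repl \
        -D "$data_dir" -R -X stream -P
}

# Replay delay on the standby, in seconds.
# Prints an empty line if no transaction has been replayed yet.
standby_lag_seconds() {
    psql -Atc "SELECT extract(epoch FROM now() - pg_last_xact_replay_timestamp())"
}
```

pg_basebackup -R does not write restore_command, so that line still has to be added by hand. The lag query also reads high when the primary is idle, since it measures time since the last replayed transaction.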

Mistakes I’ve made

  • Keeping backups on the same server as the database (lost everything in one disk failure)
  • Forgetting to back up pg_hba.conf, postgresql.conf, and other config (restore worked but auth was broken)
  • Not backing up roles (pg_dump never emits CREATE ROLE; roles come from pg_dumpall --globals-only)
  • Assuming filesystem-level compression (ZFS) was “enough”: the backup files themselves were uncompressed, so rebuilding on a different filesystem took longer
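
The config-file mistake is cheap to avoid. A minimal sketch, assuming psql can connect as a superuser and tar is available; backup_config and the destination path are names I picked for the example:

```shell
# Archive the server's config files. Ask the running server where they live,
# since Debian-style layouts keep them outside the data directory.
backup_config() {
    out=$1
    conf=$(psql -Atc "SHOW config_file") || return 1
    hba=$(psql -Atc "SHOW hba_file") || return 1
    ident=$(psql -Atc "SHOW ident_file") || return 1
    tar -czf "$out" "$conf" "$hba" "$ident"
}

# Usage (hypothetical off-host mount):
#   backup_config "/mnt/backup/config/pg-config-$(date +%F).tar.gz"
```

Write the archive to the same off-host storage as the database backups, otherwise it repeats the first mistake in the list.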

What I tell junior team members

Your backup system should answer three questions without hesitation:

  1. What is the RPO? (How much data can we lose?)
  2. What is the RTO? (How long to restore?)
  3. When was the last successful test restore?

If any of these answers is uncertain, that’s the first thing to fix.
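
One way to keep the RPO answer honest is a freshness check that alerts when the newest backup is older than you think. The helper below is pure shell arithmetic; the commented usage assumes jq is installed and that pgBackRest's JSON info output exposes the last backup's stop time as a Unix epoch at .[0].backup[-1].timestamp.stop, which you should confirm against your version.

```shell
# Succeeds if the newest backup finished within max_age_s seconds of now.
backup_is_fresh() {
    stop_epoch=$1
    now_epoch=$2
    max_age_s=$3
    [ $((now_epoch - stop_epoch)) -le "$max_age_s" ]
}

# Usage (25 hours of slack for a daily schedule):
#   stop=$(pgbackrest --stanza=main --output=json info | jq '.[0].backup[-1].timestamp.stop')
#   backup_is_fresh "$stop" "$(date +%s)" 90000 || echo "backup is stale" >&2
```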