# Production checklist
Everything to verify before enabling Backstop in a production environment.
Run through this checklist before routing production traffic through the gateway. Each section maps to a specific failure mode we've seen in the field.
## Connectivity and auth
- Gateway is reachable from all agent hosts: `curl http://<gateway>/health`
- Token file contains separate tokens for agents, operators, and admin (no shared tokens)
- Agent tokens have only `query:execute` and `query:analyze`, with no approval or admin scopes
- `BACKSTOP_DEV_MODE` is not set in the production environment
- `dev-token` does not appear in the production token file
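The last two checks are easy to automate. The sketch below assumes nothing beyond the checklist: it only verifies that `BACKSTOP_DEV_MODE` is unset and that the literal string `dev-token` is absent from your token file (the file path you pass is deployment-specific).

```shell
# Pre-flight sketch for the dev-mode checks above.
# Usage: preflight_auth <token-file>
preflight_auth() {
  token_file="$1"
  if [ -n "${BACKSTOP_DEV_MODE:-}" ]; then
    echo "FAIL: BACKSTOP_DEV_MODE is set in this environment"
    return 1
  fi
  if grep -q 'dev-token' "$token_file"; then
    echo "FAIL: dev-token found in $token_file"
    return 1
  fi
  echo "auth pre-flight: OK"
}

# Example: preflight_auth /etc/backstop/tokens.yaml   (path is an assumption)
```

Run it from the same host and environment that will run the gateway, so the environment-variable check reflects reality.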
## Storage and snapshots
- S3 bucket exists and the gateway can write to it: `backstop doctor storage-permissions --storage s3://your-bucket --strict`
- Sidecar is running and its heartbeat is fresh: check `sidecar_heartbeat_age_seconds` in `/metadata/health`
- At least one valid, checksummed, non-quarantined snapshot exists for each table that agents will modify
- Snapshot age is under your RTO threshold (typically `last_snapshot_age_seconds < 600`)
- `backstop doctor launch --storage s3://your-bucket --table <critical-table>` returns `ready`, or every degraded item has an explicit owner and remediation plan
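The snapshot-age check can be scripted against `/metadata/health`. This is a minimal sketch that assumes `last_snapshot_age_seconds` appears as a flat numeric field in the JSON body; it uses `sed` so there is no `jq` dependency, but swap in `jq` if you have it.

```shell
# Freshness gate: reads /metadata/health JSON on stdin, fails when the last
# snapshot is older than the RTO threshold (default 600s, per the checklist).
snapshot_fresh() {
  threshold="${1:-600}"
  age=$(sed -n 's/.*"last_snapshot_age_seconds"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p')
  [ -n "$age" ] && [ "$age" -lt "$threshold" ]
}

# Example:
# curl -s -H "Authorization: Bearer ops-token" \
#   http://localhost:8080/metadata/health | snapshot_fresh 600
```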
To inspect the raw health payload:

```shell
curl -H "Authorization: Bearer ops-token" \
  http://localhost:8080/metadata/health
```

## Recovery drills
Run all three drill types at least once before go-live. The production checklist is blocked until each returns clean.
- WAL archive fetch drill passed: `backstop drill wal-archive-fetch --storage s3://... --cluster-id prod`
- PITR prepare drill passed: `backstop drill pitr-prepare --storage s3://... --cluster-id prod --simulate`
- Logical backup/restore drill passed: `backstop drill logical-backup-restore --source-db ... --target-db ...`
- Local OSS lifecycle passed: `npm run e2e`
- Local OSS PITR/WAL drill passed: `npm run e2e:pitr`
- Guided table recovery has been practiced with `backstop recover` against a disposable target
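In CI, the `--json` output of the doctor and drill commands can gate the pipeline. The report schema is an assumption here: this sketch only looks for a top-level `"status": "ready"` field, matching the `ready` result described above.

```shell
# Minimal CI gate over `backstop doctor launch ... --json` output on stdin.
# Assumes the JSON report carries a top-level "status" field.
doctor_ready() {
  grep -q '"status"[[:space:]]*:[[:space:]]*"ready"'
}

# Example:
# backstop doctor launch --storage s3://prod-snapshots --table users --json \
#   | doctor_ready || { echo "doctor not ready; blocking deploy"; exit 1; }
```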
```shell
# All drills, JSON output for CI
backstop doctor launch --storage s3://prod-snapshots --table users --metadata-db /metadata/backstop.db --json
backstop drill wal-archive-fetch --storage s3://prod-snapshots --cluster-id prod --json
backstop drill pitr-prepare --storage s3://prod-snapshots --cluster-id prod --simulate --json
backstop drill logical-backup-restore \
  --source-db postgresql://postgres@localhost:5432/mydb \
  --target-db postgresql://postgres@localhost:5432/mydb_drill \
  --storage s3://prod-snapshots --json
```

## Policy configuration
- `BACKSTOP_POLICY_CRITICAL` is set to `approve` or `block` (not `execute`)
- `BACKSTOP_POLICY_HIGH` is set appropriately for your risk tolerance
- Protected tables list is correct and includes PII and financial data tables
- Protected columns list covers sensitive fields (email, ssn, card_number, etc.)
- Policy decisions have been tested against your expected query patterns
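As a reference point, the policy settings above might live in an environment file. Only `BACKSTOP_POLICY_CRITICAL` and `BACKSTOP_POLICY_HIGH` are named in this checklist; the `PROTECTED_*` variable names and the table/column values below are illustrative assumptions, not a documented interface, so map them onto whatever mechanism your deployment actually reads.

```shell
# Policy environment sketch (e.g. a systemd EnvironmentFile).
export BACKSTOP_POLICY_CRITICAL=approve   # never "execute" in production
export BACKSTOP_POLICY_HIGH=approve       # per your risk tolerance
export BACKSTOP_PROTECTED_TABLES="users,payments,invoices"   # assumption
export BACKSTOP_PROTECTED_COLUMNS="email,ssn,card_number"    # assumption
```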
## Approval workflow
- At least one operator token is configured with `approval:read` and `approval:write` scopes
- Operators know how to check for pending approvals: `GET /pending`
- Alert routing is configured (PagerDuty, Slack, or similar) so operators are notified of `approval_required` events
- Approval SLA is defined: how long will agents wait before timing out?
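Until alert routing is wired up, operators can fall back to polling `GET /pending`. The sketch below assumes the endpoint returns a JSON array (`[]` when the queue is empty); the gateway URL and token are placeholders.

```shell
# True (exit 0) when a /pending response body contains waiting approvals.
has_pending() {
  [ -n "$1" ] && [ "$1" != "[]" ]
}

# Example poll loop:
# while sleep 30; do
#   body=$(curl -s -H "Authorization: Bearer ops-token" http://localhost:8080/pending)
#   has_pending "$body" && echo "pending approvals: $body"
# done
```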
## Observability
- Prometheus is scraping `/metrics`; verify with `curl http://<gateway>/metrics`
- Audit log is queryable: `GET /metadata/audit` returns recent events
- Alert on `sidecar_status != "healthy"` in your monitoring system
- Alert on `last_snapshot_age_seconds > 900` (15 minutes)
- Alert on any `risk_level: CRITICAL` event in the audit log (for visibility even when approved)
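Before writing the real alert rules, you can sanity-check the thresholds directly against the scrape output. This sketch assumes the standard Prometheus text exposition format (`metric_name value` lines); the metric names and the 900-second threshold come from the checklist above.

```shell
# Exit 0 when the named gauge on stdin exceeds the threshold.
# Usage: metric_over <metric-name> <threshold>   (Prometheus text on stdin)
metric_over() {
  awk -v m="$1" -v t="$2" '$1 == m && ($2 + 0) > t { found = 1 } END { exit !found }'
}

# Example:
# curl -s http://localhost:8080/metrics \
#   | metric_over last_snapshot_age_seconds 900 && echo "ALERT: snapshots stale"
```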
## Load and performance
- Gateway has been load-tested at expected query rate
- Database connection pool is sized correctly (gateway reuses connections)
- Snapshot sidecar is not scheduled during peak query windows (snapshots lock tables briefly)
## Incident response
- Team knows the emergency pause command: `POST /admin/pause`
- Team knows the guided recovery command: `backstop recover --db <url> --storage <s3-url> --table <table>`
- Emergency admin token is stored in a secrets manager (not in code or config files)
- Runbook is written and accessible: see Runbooks
- At least two people have been walked through a recovery drill