Runbooks

Step-by-step procedures for common operational scenarios — pauses, restores, approval backlogs, and incident response.

Keep this page bookmarked and accessible before an incident occurs.

Emergency pause

Use when you observe suspicious agent behavior, runaway queries, or any situation where you need to stop all writes immediately.

  1. Pause the gateway

    curl -X POST \
      -H "Authorization: Bearer bsp_admin_token" \
      -H "Content-Type: application/json" \
      -d '{"reason": "Suspicious agent behavior — investigating"}' \
      http://localhost:8080/admin/pause

    After pausing: all WRITE, HIGH, and CRITICAL queries are rejected. SELECT (SAFE) queries continue.

  2. Verify the pause took effect

    curl -H "Authorization: Bearer bsp_admin_token" \
      http://localhost:8080/admin/status

    Look for "paused": true in the response.
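
    With jq installed, you can check the flag directly:

    curl -s -H "Authorization: Bearer bsp_admin_token" \
      http://localhost:8080/admin/status | jq '.paused'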

  3. Investigate

    Check the audit log for the triggering event:

    curl -H "Authorization: Bearer bsp_ops_token" \
      "http://localhost:8080/metadata/audit?limit=50&risk=CRITICAL"

    Check alerts:

    curl -H "Authorization: Bearer bsp_ops_token" \
      http://localhost:8080/metadata/alerts

  4. Resume when safe

    curl -X POST \
      -H "Authorization: Bearer bsp_admin_token" \
      http://localhost:8080/admin/resume
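
If you run pause drills often enough to want tooling, a small wrapper keeps these commands at hand. A minimal sketch, assuming BACKSTOP_URL and BACKSTOP_ADMIN_TOKEN are set in your environment (both names are placeholders, not something Backstop itself reads):

#!/usr/bin/env bash
# Pause/status/resume helper (illustrative sketch).
set -euo pipefail

case "${1:-}" in
  pause)
    # Optional second argument becomes the pause reason.
    curl -s -X POST \
      -H "Authorization: Bearer $BACKSTOP_ADMIN_TOKEN" \
      -H "Content-Type: application/json" \
      -d "{\"reason\": \"${2:-Emergency pause}\"}" \
      "$BACKSTOP_URL/admin/pause"
    ;;
  status)
    curl -s -H "Authorization: Bearer $BACKSTOP_ADMIN_TOKEN" \
      "$BACKSTOP_URL/admin/status"
    ;;
  resume)
    curl -s -X POST \
      -H "Authorization: Bearer $BACKSTOP_ADMIN_TOKEN" \
      "$BACKSTOP_URL/admin/resume"
    ;;
  *)
    echo "usage: $0 {pause [reason]|status|resume}" >&2
    exit 1
    ;;
esac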

Restore a table from snapshot

Use after a destructive query has executed, whether it was accidental or approved.

  1. Run guided recovery first

    backstop recover \
      --db postgresql://postgres@localhost:5432/mydb \
      --storage s3://prod-snapshots \
      --table users

    This restores to users_recovered, validates the result, and prints copyback SQL only after validation passes.

  2. Find the right snapshot

    Steps 2 through 5 are the lower-level path; use them when scripting or when you need a specific snapshot ID.

    backstop snapshots list \
      --db postgresql://postgres@localhost:5432/mydb \
      --storage s3://prod-snapshots \
      --table users

    Note the snapshot_id of the snapshot taken before the destructive operation.

  3. Dry run first

    backstop restore \
      --db postgresql://postgres@localhost:5432/mydb \
      --storage s3://prod-snapshots \
      --snapshot-id snap_a3f9e2c1 \
      --table users \
      --dry-run

    The dry run verifies the manifest checksum and reports what would be written. Always run this first.

  4. Execute the restore to a recovered table

    backstop restore \
      --db postgresql://postgres@localhost:5432/mydb \
      --storage s3://prod-snapshots \
      --snapshot-id snap_a3f9e2c1 \
      --table users \
      --target-table users_recovered

    Do not restore over the original table during first response. Restore to a recovered table, validate, then copy back or rename after review.
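
    As a quick spot check before the formal validation in the next step, compare row counts between the two tables (schema-specific invariants are even better if you have them):

    psql postgresql://postgres@localhost:5432/mydb -c \
      "SELECT (SELECT count(*) FROM users) AS live,
              (SELECT count(*) FROM users_recovered) AS recovered"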

  5. Validate before copyback

    backstop restore-validate \
      --db postgresql://postgres@localhost:5432/mydb \
      --storage s3://prod-snapshots \
      --snapshot-id snap_a3f9e2c1 \
      --table users \
      --target-table users_recovered

    Then generate reviewable copyback SQL:

    backstop restore-copyback-plan \
      --source-table users_recovered \
      --target-table users
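
    The plan is meant to be reviewed before it is run, and its exact output depends on your version. Purely to set expectations, a copyback often amounts to something like the following (run what the tool actually prints, not this):

    BEGIN;
    -- Keep the damaged table until the copyback is verified.
    ALTER TABLE users RENAME TO users_damaged;
    ALTER TABLE users_recovered RENAME TO users;
    COMMIT;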

Point-in-time recovery (PITR)

Use when the destructive operation happened between snapshots or when you need sub-second precision.

  1. Identify the target time

    Find the timestamp just before the incident from the audit log:

    curl -H "Authorization: Bearer bsp_ops_token" \
      "http://localhost:8080/metadata/audit?limit=100" | jq '.[] | select(.risk_level == "CRITICAL")'
  2. Prepare the restore

    backstop pitr prepare-restore \
      --storage s3://prod-snapshots \
      --cluster-id prod \
      --backup-id backup_2026-05-06 \
      --target-dir /var/lib/postgresql/pitr-restore \
      --target-time "2026-05-06 12:29:00+00"

    This prepares a PostgreSQL recovery directory with recovery.signal and a restore_command that fetches archived WAL through Backstop.
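
    The target directory follows standard PostgreSQL PITR conventions: an empty recovery.signal file plus recovery settings along these lines (illustrative; the actual restore_command is generated by prepare-restore and should not be hand-edited):

    # postgresql.auto.conf in the target directory (illustrative)
    restore_command = '<Backstop-generated WAL fetch command> %f %p'
    recovery_target_time = '2026-05-06 12:29:00+00'
    recovery_target_action = 'pause'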

  3. Start a recovery instance

    Point a PostgreSQL instance at the target directory and start it. It will replay WAL up to the target time and then pause in recovery mode.
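
    For example, with pg_ctl, using port 5433 so the instance does not collide with production on 5432:

    pg_ctl start -D /var/lib/postgresql/pitr-restore -o "-p 5433"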

    Verify the recovery completed to the right point:

    psql postgresql://postgres@localhost:5433/mydb \
      -c "SELECT pg_last_xact_replay_timestamp()"
  4. Promote or export

    Either promote the recovery instance to take over, or export specific tables back to production:

    pg_dump -t users postgresql://postgres@localhost:5433/mydb | \
      psql postgresql://postgres@localhost:5432/mydb
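
    Note that pg_dump also emits CREATE TABLE, so if users still exists in production, rename it out of the way first (or dump with --data-only into a truncated table). To promote the recovery instance instead:

    pg_ctl promote -D /var/lib/postgresql/pitr-restore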

Clear an approval backlog

When a large number of queries are awaiting approval (for example, after a policy change), process them in bulk.

# List all pending
curl -H "Authorization: Bearer bsp_ops_token" \
  http://localhost:8080/pending | jq '.pending[] | {id, sql, risk_level, agent_id}'
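
# Group pending requests by agent to see where the backlog comes from
curl -s -H "Authorization: Bearer bsp_ops_token" http://localhost:8080/pending \
  | jq '.pending | group_by(.agent_id) | map({agent_id: .[0].agent_id, count: length})'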

# Approve by ID
curl -X POST \
  -H "Authorization: Bearer bsp_ops_token" \
  http://localhost:8080/approve/appr_4f9e2c1a

# Deny by ID
curl -X POST \
  -H "Authorization: Bearer bsp_ops_token" \
  http://localhost:8080/deny/appr_4f9e2c1a

For bulk operations, use a script:

# Approve all pending from a specific agent
curl -H "Authorization: Bearer bsp_ops_token" http://localhost:8080/pending \
  | jq -r '.pending[] | select(.agent_id == "cursor-local") | .id' \
  | while read id; do
      curl -s -X POST \
        -H "Authorization: Bearer bsp_ops_token" \
        "http://localhost:8080/approve/$id"
      echo "Approved $id"
    done

Sidecar not heartbeating

Use when /metadata/health shows sidecar_status: "stale" or sidecar_heartbeat_age_seconds > 120.

  1. Check the sidecar logs: docker logs backstop-sidecar or journalctl -u backstop-sidecar
  2. Verify the sidecar can reach the gateway: run curl http://<gateway>/health from the sidecar host
  3. Check S3 connectivity from the sidecar: backstop doctor storage-permissions --storage s3://...
  4. Restart the sidecar: docker restart backstop-sidecar
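
After restarting, confirm the heartbeat recovers. Assuming the ops token is accepted on /metadata/health as on the other /metadata endpoints:

curl -s -H "Authorization: Bearer bsp_ops_token" \
  http://localhost:8080/metadata/health \
  | jq '{sidecar_status, sidecar_heartbeat_age_seconds}'

The age should drop back well under the 120-second threshold; if it does not, escalate rather than restarting repeatedly.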