Runbooks

Step-by-step procedures for common operational scenarios — pauses, restores, approval backlogs, and incident response.

Keep this page bookmarked and accessible before an incident occurs.

Emergency pause

Use when you observe suspicious agent behavior, runaway queries, or any situation where you need to stop all writes immediately.

  1. Pause the gateway

    curl -X POST \
      -H "Authorization: Bearer bsp_admin_token" \
      -H "Content-Type: application/json" \
      -d '{"reason": "Suspicious agent behavior — investigating"}' \
      http://localhost:8080/admin/pause

    After pausing: all WRITE, HIGH, and CRITICAL queries are rejected. SELECT (SAFE) queries continue.

  2. Verify the pause took effect

    curl -H "Authorization: Bearer bsp_admin_token" \
      http://localhost:8080/admin/status

    Look for "paused": true in the response.
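
    With jq installed, you can check the flag directly:

    curl -s -H "Authorization: Bearer bsp_admin_token" \
      http://localhost:8080/admin/status | jq '.paused'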

  3. Investigate

    Check the audit log for the triggering event:

    curl -H "Authorization: Bearer bsp_ops_token" \
      "http://localhost:8080/metadata/audit?limit=50&risk=CRITICAL"

    Check alerts:

    curl -H "Authorization: Bearer bsp_ops_token" \
      http://localhost:8080/metadata/alerts

  4. Resume when safe

    curl -X POST \
      -H "Authorization: Bearer bsp_admin_token" \
      http://localhost:8080/admin/resume
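
If you run pause drills often enough to want tooling, a small wrapper keeps these commands at hand. A minimal sketch, assuming BACKSTOP_URL and BACKSTOP_ADMIN_TOKEN are set in your environment (both names are placeholders, not something Backstop itself reads):

#!/usr/bin/env bash
# Pause/status/resume helper (illustrative sketch).
set -euo pipefail

case "${1:-}" in
  pause)
    # Optional second argument becomes the pause reason.
    curl -s -X POST \
      -H "Authorization: Bearer $BACKSTOP_ADMIN_TOKEN" \
      -H "Content-Type: application/json" \
      -d "{\"reason\": \"${2:-Emergency pause}\"}" \
      "$BACKSTOP_URL/admin/pause"
    ;;
  status)
    curl -s -H "Authorization: Bearer $BACKSTOP_ADMIN_TOKEN" \
      "$BACKSTOP_URL/admin/status"
    ;;
  resume)
    curl -s -X POST \
      -H "Authorization: Bearer $BACKSTOP_ADMIN_TOKEN" \
      "$BACKSTOP_URL/admin/resume"
    ;;
  *)
    echo "usage: $0 {pause [reason]|status|resume}" >&2
    exit 1
    ;;
esac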

Restore a table from snapshot

Use after a destructive query has executed, whether it was accidental or approved.

  1. Run guided recovery first

    backstop recover \
      --db postgresql://postgres@localhost:5432/mydb \
      --storage s3://prod-snapshots \
      --table users

    This restores to users_recovered, validates the result, and prints copyback SQL only after validation passes.

  2. Find the right snapshot

    Steps 2 through 5 are the lower-level path; use them when scripting or when you need a specific snapshot ID.

    backstop snapshots list \
      --db postgresql://postgres@localhost:5432/mydb \
      --storage s3://prod-snapshots \
      --table users

    Note the snapshot_id of the snapshot taken before the destructive operation.

  3. Dry run first

    backstop restore \
      --db postgresql://postgres@localhost:5432/mydb \
      --storage s3://prod-snapshots \
      --snapshot-id snap_a3f9e2c1 \
      --table users \
      --dry-run

    The dry run verifies the manifest checksum and reports what would be written. Always run this first.

  4. Execute the restore to a recovered table

    backstop restore \
      --db postgresql://postgres@localhost:5432/mydb \
      --storage s3://prod-snapshots \
      --snapshot-id snap_a3f9e2c1 \
      --table users \
      --target-table users_recovered

    Do not restore over the original table during first response. Restore to a recovered table, validate, then copy back or rename after review.
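
    As a quick spot check before the formal validation in the next step, compare row counts between the two tables (schema-specific invariants are even better if you have them):

    psql postgresql://postgres@localhost:5432/mydb -c \
      "SELECT (SELECT count(*) FROM users) AS live,
              (SELECT count(*) FROM users_recovered) AS recovered"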

  5. Validate before copyback

    backstop restore-validate \
      --db postgresql://postgres@localhost:5432/mydb \
      --storage s3://prod-snapshots \
      --snapshot-id snap_a3f9e2c1 \
      --table users \
      --target-table users_recovered

    Then generate reviewable copyback SQL:

    backstop restore-copyback-plan \
      --source-table users_recovered \
      --target-table users
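
    The plan is meant to be reviewed before it is run, and its exact output depends on your version. Purely to set expectations, a copyback often amounts to something like the following (run what the tool actually prints, not this):

    BEGIN;
    -- Keep the damaged table until the copyback is verified.
    ALTER TABLE users RENAME TO users_damaged;
    ALTER TABLE users_recovered RENAME TO users;
    COMMIT;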

Point-in-time recovery (PITR)

Use when the destructive operation happened between snapshots or when you need sub-second precision.

  1. Identify the target time

    Find the timestamp just before the incident from the audit log:

    curl -H "Authorization: Bearer bsp_ops_token" \
      "http://localhost:8080/metadata/audit?limit=100" | jq '.[] | select(.risk_level == "CRITICAL")'
  2. Prepare the restore

    backstop pitr prepare-restore \
      --storage s3://prod-snapshots \
      --cluster-id prod \
      --backup-id backup_2026-05-06 \
      --target-dir /var/lib/postgresql/pitr-restore \
      --target-time "2026-05-06 12:29:00+00"

    This prepares a PostgreSQL recovery directory with recovery.signal and a restore_command that fetches archived WAL through Backstop.
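
    The target directory follows standard PostgreSQL PITR conventions: an empty recovery.signal file plus recovery settings along these lines (illustrative; the actual restore_command is generated by prepare-restore and should not be hand-edited):

    # postgresql.auto.conf in the target directory (illustrative)
    restore_command = '<Backstop-generated WAL fetch command> %f %p'
    recovery_target_time = '2026-05-06 12:29:00+00'
    recovery_target_action = 'pause'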

  3. Start a recovery instance

    Point a PostgreSQL instance at the target directory and start it. It will replay WAL up to the target time and then pause in recovery mode.
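
    For example, with pg_ctl, using port 5433 so the instance does not collide with production on 5432:

    pg_ctl start -D /var/lib/postgresql/pitr-restore -o "-p 5433"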

    Verify the recovery completed to the right point:

    psql postgresql://postgres@localhost:5433/mydb \
      -c "SELECT pg_last_xact_replay_timestamp()"
  4. Promote or export

    Either promote the recovery instance to take over, or export specific tables back to production:

    pg_dump -t users postgresql://postgres@localhost:5433/mydb | \
      psql postgresql://postgres@localhost:5432/mydb
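
    Note that pg_dump also emits CREATE TABLE, so if users still exists in production, rename it out of the way first (or dump with --data-only into a truncated table). To promote the recovery instance instead:

    pg_ctl promote -D /var/lib/postgresql/pitr-restore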

Clear an approval backlog

When a large number of queries are awaiting approval (for example, after a policy change), process them in bulk.

# List all pending
curl -H "Authorization: Bearer bsp_ops_token" \
  http://localhost:8080/pending | jq '.pending[] | {id, sql, risk_level, agent_id}'
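
# Group pending requests by agent to see where the backlog comes from
curl -s -H "Authorization: Bearer bsp_ops_token" http://localhost:8080/pending \
  | jq '.pending | group_by(.agent_id) | map({agent_id: .[0].agent_id, count: length})'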

# Approve by ID
curl -X POST \
  -H "Authorization: Bearer bsp_ops_token" \
  http://localhost:8080/approve/appr_4f9e2c1a

# Deny by ID
curl -X POST \
  -H "Authorization: Bearer bsp_ops_token" \
  http://localhost:8080/deny/appr_4f9e2c1a

For bulk operations, use a script:

# Approve all pending from a specific agent
curl -H "Authorization: Bearer bsp_ops_token" http://localhost:8080/pending \
  | jq -r '.pending[] | select(.agent_id == "cursor-local") | .id' \
  | while read id; do
      curl -s -X POST \
        -H "Authorization: Bearer bsp_ops_token" \
        "http://localhost:8080/approve/$id"
      echo "Approved $id"
    done

Sidecar not heartbeating

Use when /metadata/health shows sidecar_status: "stale" or sidecar_heartbeat_age_seconds > 120.

  1. Check the sidecar logs: docker logs backstop-sidecar or journalctl -u backstop-sidecar
  2. Verify the sidecar can reach the gateway: run curl http://<gateway>/health from the sidecar host
  3. Check S3 connectivity from the sidecar: backstop doctor storage-permissions --storage s3://...
  4. Restart the sidecar: docker restart backstop-sidecar
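
After restarting, confirm the heartbeat recovers. Assuming the ops token is accepted on /metadata/health as on the other /metadata endpoints:

curl -s -H "Authorization: Bearer bsp_ops_token" \
  http://localhost:8080/metadata/health \
  | jq '{sidecar_status, sidecar_heartbeat_age_seconds}'

The age should drop back well under the 120-second threshold; if it does not, escalate rather than restarting repeatedly.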