# migration-observability

> Make database migrations safe and observable. Define progress + safety metrics, dashboards, and runbook gates (go/no-go criteria) for live migrations, backfills, and cutovers. Works standalone and is database/tooling agnostic.

- Author: Daniel Montero
- Repository: dmonteroh/curated-agent-skills
- Version: 20260208022942
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-08
- Source: https://github.com/dmonteroh/curated-agent-skills
- Web: https://mule.run/skillshub/@@dmonteroh/curated-agent-skills~migration-observability:20260208022942

---

---
name: migration-observability
description: "Make database migrations safe and observable. Define progress + safety metrics, dashboards, and runbook gates (go/no-go criteria) for live migrations, backfills, and cutovers. Works standalone and is database/tooling agnostic."
category: observability
---

# migration-observability

Provides guidance for **running migrations safely** (not for authoring SQL/ORM migration steps). It focuses on:

- progress visibility (are we moving? how fast? ETA?)
- safety signals (are we harming prod? are errors rising? is lag growing?)
- runbook gates (objective go/no-go thresholds and rollback triggers)

## Use this skill when

- Running a production migration/backfill/cutover that can’t be “fire and forget”.
- Dashboards/alerts and objective gates (pause, slow down, rollback, proceed) are required.
- A shared, deterministic runbook is required for the migration.

## Do not use this skill when

- The change is a tiny, low-risk schema tweak with trivial rollback.
- Only authoring migration scripts (this skill is about operating them).

## Required inputs

- Migration summary: what changes, why, scope, expected duration, rollback complexity.
- Migration type: schema-only / backfill / online rewrite / cutover / dual-write / reindex.
- Operational constraints: maintenance window, allowed error budget, throttling knobs.
- Available telemetry: metrics, logs, traces, dashboards, alerting system.
- Data correctness expectations: invariants, validation approach, sampling strategy.

## Outputs (artifacts produced)

Minimum artifacts (paths are suggestions; adapt to repo conventions):

- docs/runbooks/migrations/.md (the runbook with gates)
- docs/runbooks/migrations/-dashboard.md (dashboard/queries, tool-agnostic)
- docs/runbooks/migrations/-alerts.md (alert thresholds + paging policy)

Templates:

- `references/runbook-template.md`
- `references/metrics-and-gates.md`

## Workflow (step-by-step with outputs)

1) Classify the migration (controls what must be observed)
   - Decide type, blast radius, and rollback complexity.
   - Output: short classification summary to include in the runbook.

2) Define progress metrics (prove it is moving)
   - Pick counters and derived rates (rows processed, throughput, ETA, lag).
   - Decision point: if instrumentation is unavailable, define structured logs and manual sampling cadence.
   - Output: a progress metrics list with target values or expected ranges.

3) Define safety metrics (prove it is not hurting prod)
   - Select DB, query, app, and correctness signals that exist today.
   - Decision point: if a metric is missing, either add lightweight instrumentation or choose a proxy metric.
   - Output: a safety metrics list with thresholds and measurement sources.

4) Turn metrics into gates (objective go/no-go)
   - Write Proceed (green), Pause/Throttle (yellow), Rollback (red) per phase.
   - Decision point: if rollback is impossible, define a “stop and stabilize” gate plus explicit escalation steps.
   - Output: a gate table for each phase.

5) Build the runbook, dashboard, and alert specs
   - Use the templates to capture phases, checks, and thresholds.
   - Output: the three runbook artifacts in your repo structure.

6) Execute with controls and document outcomes
   - Start with a canary, ramp cautiously, and validate invariants continuously.
   - Decision point: if any gate triggers, follow the runbook’s pause/rollback steps.
   - Output: a short closeout section with verification results and follow-ups.

## Common pitfalls

- Missing baselines, so regressions cannot be detected.
- Gates with vague language instead of numeric thresholds.
- No throttle/rollback plan for long-running backfills.
- Correctness checks that rely on assumptions that cannot be measured.
- Alerting that pages the wrong team or has no escalation path.

## Tooling guidance (agnostic)

This skill does not require a specific stack. Common setups:

- Metrics: Prometheus/OpenTelemetry/Cloud metrics
- Dashboards: Grafana / cloud dashboards
- Alerts: Grafana alerting / PagerDuty / OpsGenie / Slack

If none exist, remain “observable” by emitting structured logs + writing a runbook with manual checks and thresholds.

## Examples

Example input:

- Migration type: backfill
- Operational constraints: 2-hour maintenance window, batch size throttle
- Available telemetry: Prometheus metrics + Grafana dashboarding
- Data correctness: checksum sampling every 10k rows

Example output summary (abbreviated):

- Runbook: docs/runbooks/migrations/2024-09-customer-backfill.md
- Dashboard spec: docs/runbooks/migrations/2024-09-customer-backfill-dashboard.md
- Alerts spec: docs/runbooks/migrations/2024-09-customer-backfill-alerts.md
- Gates: proceed/pause/rollback thresholds for canary, ramp, full run

## Output contract (reporting format)

Return a single summary using labeled bullets with the following fields, in this order:

- Migration classification (type, blast radius, rollback complexity).
- Progress metrics with thresholds + sources.
- Safety metrics with thresholds + sources.
- Gate table by phase (proceed/pause/rollback).
- Runbook/dashboard/alerts file paths produced.
- Risks, assumptions, and missing instrumentation.

## References

- `references/README.md`
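
## Appendix: gate evaluation sketch

The workflow asks for numeric gates (proceed / pause / rollback) and for progress metrics such as throughput and ETA. One way this could look in code is sketched below. This is a minimal illustration, not part of the skill's templates: the metric names (`replication_lag_seconds`, `error_rate_percent`), the threshold values, and the helpers `Threshold`, `evaluate_gate`, and `eta_seconds` are all hypothetical. The sketch assumes every safety metric is "higher is worse".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Gate(Enum):
    PROCEED = "proceed"    # green
    PAUSE = "pause"        # yellow: pause or throttle
    ROLLBACK = "rollback"  # red


@dataclass(frozen=True)
class Threshold:
    """Pause/rollback limits for one safety metric (higher is worse)."""
    metric: str
    pause_at: float
    rollback_at: float


def evaluate_gate(readings: dict, thresholds: list) -> Gate:
    """Return the most severe gate triggered by the current readings.

    A missing reading counts as PAUSE: a metric that cannot be observed
    should not be interpreted as healthy.
    """
    worst = Gate.PROCEED
    for t in thresholds:
        value = readings.get(t.metric)
        if value is None:
            worst = Gate.PAUSE
            continue
        if value >= t.rollback_at:
            return Gate.ROLLBACK
        if value >= t.pause_at:
            worst = Gate.PAUSE
    return worst


def eta_seconds(rows_done: int, rows_total: int,
                rows_per_second: float) -> Optional[float]:
    """Estimate remaining seconds; None when throughput is zero (stalled)."""
    if rows_per_second <= 0:
        return None
    return max(rows_total - rows_done, 0) / rows_per_second


# Hypothetical thresholds for a backfill phase; real values should come
# from measured baselines, not from this example.
thresholds = [
    Threshold("replication_lag_seconds", pause_at=5.0, rollback_at=30.0),
    Threshold("error_rate_percent", pause_at=1.0, rollback_at=5.0),
]
readings = {"replication_lag_seconds": 7.2, "error_rate_percent": 0.3}

print(evaluate_gate(readings, thresholds).value)  # pause
print(eta_seconds(250_000, 1_000_000, 500.0))     # 1500.0
```

Treating a missing reading as a pause is a design choice in this sketch; a runbook could instead require a manual check at the sampling cadence defined for that metric. When rollback is impossible, the red outcome would map to the "stop and stabilize" gate rather than a rollback step.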