2026-05-11 / 2026-05-12 jpcite-api 14h+ Outage Post-Mortem v2

api.jpcite.com served 5xx for ~14 hours across the 2026-05-11 → 2026-05-12 boundary, in a cascade that recurred even after the Wave 18 fix that closed the original 5h12m outage (post-mortem v1: docs/postmortem/2026-05-11_integrity_check_outage.md). The second incident was not a regression of Wave 18 — the size-based skip for PRAGMA integrity_check held. Instead, four independent root causes fired in sequence, each one masking the next:

| # | Root cause | Mitigation | Status |
|---|------------|------------|--------|
| RC1 | entrypoint.sh §4 PRAGMA integrity_check on the 9.7 GB autonomath.db hangs 30+ min, exceeding the Fly 60s health-grace. | Wave 18 §4 size-based skip (commit 81922433f). | RESOLVED (re-verified on Wave 36-40 images) |
| RC2 | Depot remote builder stalls on build-context push for the 12+ GB repo + image. Three consecutive flyctl deploy --remote-only attempts hung 60+ min each. | Switched to GHA workflow_dispatch deploy.yml (Strategy F). | MITIGATED (workaround landed; root cause not yet diagnosed) |
| RC3 | Local Docker Desktop build (Strategy E fallback) hangs in the apt unpack step due to Docker Desktop VM memory pressure on the engineer laptop. | Abandoned the local build path; depended on GHA. | KNOWN; do not repeat |
| RC4 | autonomath_boot_manifest.txt ships empty on the Wave 22 baked-seed image; on a fresh /data/autonomath.db volume the 5 required migrations (049/075/090/115/121) are never applied; schema_guard FAILs at boot and the machine restart-loops. | Wave 40 PR #75 (commit 82df31bd8) authorizes the 5 migrations in the manifest. | RESOLVED (deploy 25699495154 in flight) |

This post-mortem documents the full timeline, the four root causes, the mitigations we applied (and the ones that failed), and the new SOP work landed in Wave 41 to prevent the next recurrence.

TL;DR

  • Wave 22 (2026-05-10) baked jpintel.db into the Docker image to eliminate the flyctl ssh sftp get hydrate dependency. On a new volume, the freshly bootstrapped autonomath.db has none of the legacy migrations applied; it carries only the schema created by the entrypoint bootstrap.
  • Wave 36-40 deploys cold-booted on new volumes. schema_guard checks that schema_migrations records all migrations in AM_REQUIRED_MIGRATIONS (5 entries: 049 / 075 / 090 / 115 / 121). On a fresh volume those rows are absent, so schema_guard raises required migrations missing from schema_migrations, the entrypoint exits non-zero, and Fly restart-loops the machine.
  • autonomath_boot_manifest.txt is the opt-in allowlist that gates which migrations the entrypoint is permitted to apply at boot. Until PR #75 it contained zero migration filenames (intentional, to force offline review of every schema change). So even though the 5 required migrations existed on disk in scripts/migrations/, the entrypoint was forbidden to apply them, leading to a deterministic boot fail.
  • PR #75 authorizes those 5 specific migrations (all additive, all boot-time safe) in the manifest and unsticks the cold-boot path.

Total customer-visible downtime: ~840 minutes (~14 hours). Continued recovery is gated on deploy 25699495154 completing the rolling restart and the post-deploy smoke confirming healthz=200 + size-based skip log + schema_guard pass.

Timeline (UTC)

| Time (UTC) | Event | Source |
|------------|-------|--------|
| 2026-05-11 11:40 | Machine 85e273f4 cold-boots on a pre-Wave-18 image; entrypoint.sh §4 runs PRAGMA integrity_check on /data/autonomath.db. | flyctl logs |
| 11:40–12:18 | RC1 fires. The 9.7 GB integrity_check walk hangs; no progress output. | volume IO profile |
| 12:18 | First external 5xx detected; Fly proxy returns "could not find a good candidate". | UptimeRobot |
| 13:00–16:00 | Wave 25 recovery: first attempts to push the Wave 18 §4 size-skip fix via GHA deploy.yml. Multiple sub-step failures (Checkout / hydrate sftp / Deploy / smoke). | gh run list |
| 16:30 | Local flyctl deploy --remote-only via the depot builder lands a new image; the Wave 18 §4 fix activates. | flyctl deploy |
| 16:48 | First machine on the new image logs size-based integrity_check skip. | flyctl logs |
| 16:52 | healthz=200 restored; end of the first outage window. Wave 18 fix on main (commit 81922433f). | UptimeRobot |
| 17:00 | 5-min stability hold met; v1 post-mortem authored. | session log |
| 18:00 | Wave 22 baked-seed deploy attempt 1. Goal: bake jpintel.db into the image to remove the sftp dependency. | session log |
| 18:00–20:00 | RC2 fires. Depot builder hangs on build-context push 3× consecutively, each ≥60 min. | flyctl deploy |
| 20:00–22:00 | RC3 fires. Engineer attempts a local Docker build as fallback; the Docker Desktop VM hangs in the apt unpack step. | session log |
| 22:00 | RC2 escape: switch to GHA workflow_dispatch deploy.yml (Strategy F). | gh workflow run |
| 23:30 | GHA Strategy F deploy completes; image pushed; rolling restart begins. | gh run view |
| 23:35 | New machine cold-boots on the Wave 22 baked-seed image; entrypoint.sh §4 logs size-based skip (Wave 18 fix re-verified; RC1 stays closed). | flyctl logs |
| 23:36 | schema_guard runs. RC4 fires. Log line: autonomath: required migrations missing from schema_migrations: ['049_provenance_strengthen.sql', '075_am_amendment_diff.sql', '090_law_article_body_en.sql', '115_source_manifest_view.sql', '121_jpi_programs_subsidy_rate_text_column.sql']. Entrypoint exits non-zero; Fly restart-loops. | flyctl logs |
| 23:40 | External 5xx resumes; UptimeRobot fires again; the outage re-opens. | UptimeRobot |
| 2026-05-12 00:00–03:00 | Re-deploy attempts on the same broken image yield an identical failure shape. Root cause not yet isolated to the manifest. | flyctl logs |
| 03:15 | RC4 isolated. flyctl ssh console + cat scripts/migrations/autonomath_boot_manifest.txt confirms the empty allowlist; the 5 required migrations cannot be applied. | session log |
| 04:00 | Wave 40 PR #75 drafted: authorize the 5 migrations in the manifest, with a comment header documenting the Wave 22 baked-seed context. | session log |
| 04:30 | PR #75 admin-merged to main (commit 82df31bd8). | gh pr view 75 |
| 04:35 | GHA Strategy F re-triggered (run 25699495154); build + push begins. | gh run view |
| 05:30 | Deploy 25699495154 build complete; rolling restart begins. | gh run view |
| 06:30 | Wave 41 work begins (this post-mortem, pre-deploy manifest verify, runbook v2, alert extension). | session log |

Total customer-visible downtime: ~840 minutes (12:18 UTC 2026-05-11 → ~06:30 UTC 2026-05-12, with a 68-min recovered window at 16:52–18:00). Net 5xx exposure ≈ 772 min (840 min gross − 68 min recovered), but we state 14h+ for conservative external-uptime accounting.

Four root causes

RC1 — PRAGMA integrity_check on 9.7 GB autonomath.db at boot

Status: RESOLVED by Wave 18 §4 fix (commit 81922433f).

Documented exhaustively in the v1 post-mortem (2026-05-11_integrity_check_outage.md). The Wave 18 size-based skip held across all subsequent Wave 22-40 boots: every new image logged either size-based integrity_check skip or trusted stamp match rather than running the pragma.

What broke: §4 was retained as a "structural correctness probe" when §2 SHA256 was size-skipped in Wave 13. CLAUDE.md SOT explicitly noted the retention. For a 9.7 GB DB, integrity_check costs 30+ min, far beyond the Fly 60s health-grace.

How fixed: extended AUTONOMATH_DB_MIN_PRODUCTION_BYTES (≥5 GB) threshold to §4. schema_guard (metadata only, ~milliseconds) is now the single structural probe.
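The §4 decision reduces to a single size comparison. A minimal Python sketch (the real gate lives in entrypoint.sh as shell; the 5 GB default here is assumed from the ≥5 GB figure above, and the function name is illustrative):

```python
import os

# Default mirrors the >=5 GB production threshold quoted in this post-mortem.
MIN_PRODUCTION_BYTES = int(
    os.environ.get("AUTONOMATH_DB_MIN_PRODUCTION_BYTES", 5 * 1024**3)
)

def should_run_integrity_check(db_path: str, threshold: int = MIN_PRODUCTION_BYTES) -> bool:
    """Skip the boot-time PRAGMA integrity_check for production-sized DBs,
    where the pragma would blow past the Fly 60s health-grace window."""
    return os.path.getsize(db_path) < threshold
```

schema_guard (metadata only, milliseconds) remains the structural probe for the databases this sketch would skip.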

RC2 — Depot remote builder stalls on build-context push

Status: MITIGATED via Strategy F (GHA workflow_dispatch); root not isolated.

What broke: three consecutive flyctl deploy --remote-only attempts between 18:00 and 22:00 UTC hung in the depot builder's build-context upload phase, each for ≥60 min before manual cancellation. The local working tree is 12+ GB (autonomath.db 9.7 GB, jpintel.db 1.7 GB, plus dist/, dist.bak*/, .venv*/, etc.). .dockerignore excludes the DBs, but the depot uploader may be streaming the full pre-ignore tree to the remote, or the depot-side cache may be missing.

What we know:

  • flyctl deploy --remote-only worked on 2026-05-11 16:30 (Wave 25 recovery) on the same machine, same operator, same network.
  • The repo size grew between 16:30 and 18:00 from the Wave 22 baked-seed prep (data dirs, sitemap regen).
  • .dockerignore was updated in Wave 22 but not exhaustively re-tested.

Open work (next wave): instrument the depot upload step, audit .dockerignore coverage against the du tree, and reproduce the hang in a clean checkout to confirm whether it's a deterministic size trip or a flaky depot infra issue.
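The .dockerignore coverage audit could start from a rough script like the following. This is a sketch under stated assumptions: it only checks top-level entries by simple fnmatch, whereas real .dockerignore semantics (negation, **, path anchoring) are richer, so it can only flag candidates, not prove coverage:

```python
import fnmatch
import os

def large_entries_not_ignored(root: str, dockerignore_lines, threshold_bytes: int):
    """Flag top-level entries over threshold_bytes whose names match no
    .dockerignore pattern. Approximation only: fnmatch is not the full
    .dockerignore matcher (no negation, no ** handling, no anchoring)."""
    patterns = [
        ln.strip().rstrip("/") for ln in dockerignore_lines
        if ln.strip() and not ln.startswith("#")
    ]
    offenders = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if os.path.isfile(path):
            size = os.path.getsize(path)
        else:
            # Recursive directory size, like `du -s`.
            size = sum(
                os.path.getsize(os.path.join(dirpath, f))
                for dirpath, _, files in os.walk(path)
                for f in files
            )
        if size >= threshold_bytes and not any(
            fnmatch.fnmatch(name, pat) for pat in patterns
        ):
            offenders.append((name, size))
    return offenders
```

Anything this flags is a candidate for either a .dockerignore entry or an explanation in the RC2 diagnosis.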

RC3 — Local Docker Desktop build hangs in apt unpack

Status: KNOWN, do not repeat.

What broke: when RC2 stalled, we tried docker build directly on the engineer laptop. The build hung in the RUN apt-get install ... unpack step. Docker Desktop on macOS runs a Linux VM with a fixed memory cap (default 4 GB or 8 GB), and the engineer laptop was simultaneously running browser sessions + Claude Code subprocesses; the VM thrashed.

This was not a code problem: the Dockerfile is correct and builds fine both in CI and on the depot builder. Local Docker Desktop builds of the production image are not a supported escape path on this hardware. The runbook v2 documents this as Option D (last-resort only; requires laptop quiesce).

RC4 — autonomath_boot_manifest.txt empty on new volume

Status: RESOLVED by Wave 40 PR #75 (commit 82df31bd8).

What broke: entrypoint.sh defaults to AUTONOMATH_BOOT_MIGRATION_MODE=manifest, which means the entrypoint only applies boot-time migrations from scripts/migrations/ if their filenames are explicitly listed in scripts/migrations/autonomath_boot_manifest.txt. Before Wave 40, that file contained zero migration filenames (intentionally — the design forced offline review of every schema change to prevent accidental DROP on prod).
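The manifest gate described above amounts to an intersection of what is on disk with what the allowlist authorizes. A minimal Python sketch (the real gate is shell in entrypoint.sh; manifest comment/blank-line handling is assumed):

```python
import os

def allowed_boot_migrations(migrations_dir: str, manifest_path: str):
    """Migrations the entrypoint may apply at boot: files present on disk
    whose filenames are explicitly listed in the manifest allowlist.
    Blank lines and # comments in the manifest are ignored (assumption)."""
    with open(manifest_path) as fh:
        allowlist = {
            ln.strip() for ln in fh
            if ln.strip() and not ln.lstrip().startswith("#")
        }
    on_disk = {f for f in os.listdir(migrations_dir) if f.endswith(".sql")}
    # The pre-PR-#75 state: an empty allowlist yields an empty result,
    # so nothing is ever applied on a fresh volume.
    return sorted(on_disk & allowlist)
```

This is why the 5 required migrations sitting in scripts/migrations/ did not help: the intersection with an empty allowlist is empty.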

Wave 22 baked the jpintel.db seed into the image, but autonomath.db continued to live on the Fly volume. When Wave 36/37 machines provisioned new volumes (machine destroy + redeploy on Strategy C/F), the fresh autonomath.db had only the schema baked into the entrypoint bootstrap — none of the legacy migrations (049, 075, 090, 115, 121) had ever been recorded in schema_migrations.

schema_guard (scripts/schema_guard.py) defines AM_REQUIRED_MIGRATIONS as exactly those 5 filenames. On a fresh volume, the set difference is all 5 — schema_guard raises, entrypoint exits non-zero, Fly restart-loops the machine.
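The set-difference check can be sketched as follows. This is illustrative, not the actual scripts/schema_guard.py: the schema_migrations column name ("filename") and the error handling here are assumptions:

```python
import sqlite3

# The 5-entry contract quoted in this post-mortem.
AM_REQUIRED_MIGRATIONS = {
    "049_provenance_strengthen.sql",
    "075_am_amendment_diff.sql",
    "090_law_article_body_en.sql",
    "115_source_manifest_view.sql",
    "121_jpi_programs_subsidy_rate_text_column.sql",
}

def missing_required_migrations(db_path: str):
    """Required migrations not recorded in schema_migrations. On a fresh
    volume the table (or all its rows) is absent, so the result is all 5."""
    conn = sqlite3.connect(db_path)
    try:
        try:
            # Column name "filename" is an assumption for this sketch.
            applied = {r[0] for r in conn.execute(
                "SELECT filename FROM schema_migrations")}
        except sqlite3.OperationalError:  # table absent on a fresh DB
            applied = set()
        return sorted(AM_REQUIRED_MIGRATIONS - applied)
    finally:
        conn.close()
```

A non-empty result is the RC4 signature: schema_guard raises, the entrypoint exits non-zero, and Fly restart-loops.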

PR #75 lifts the manifest to include those 5 migrations explicitly, with a comment header tying the change to Wave 22 baked-seed context. All 5 are pure additive (CREATE IF NOT EXISTS / ALTER TABLE ADD COLUMN / CREATE VIEW IF NOT EXISTS) — boot-time safe.

Why all four cascaded into one outage

Each root cause was independent — RC1 was Wave 18 territory, RC2 was infra flake, RC3 was laptop, RC4 was a Wave 22 packaging mismatch. The cascade pattern:

  1. RC1 closed (Wave 18 fix landed at 16:52).
  2. RC2 fired when we tried to ship Wave 22 (baked-seed).
  3. RC3 fired as a failed escape from RC2.
  4. Strategy F escaped RC2 and RC3.
  5. Wave 22 image cold-booted on a new volume; RC4 fired.
  6. The combination of the empty-by-default manifest design and a new volume was a latent foot-gun that Wave 22 silently armed.

The lesson: a packaging-mode change (baked seed) re-arms latent boot-time invariants. Wave 22's design review should have asked "what does a fresh volume look like under this packaging?" and discovered the manifest gap before shipping.

Detection

External (uptime / customer-visible):

  • UptimeRobot 502/504 on /v1/healthz (3× 60s cadence). Fired twice — first at 12:18 UTC, second at 23:40 UTC.
  • Fly proxy log: "could not find a good candidate within 40 attempts at load balancing" — same pattern across both windows.
  • Cloudflare Pages stayed up throughout (static site/, llms.txt, companion .md, OpenAPI JSON, MCP manifest) because CF Pages is decoupled from api.jpcite.com.

Internal (post-detection diagnosis):

  • flyctl logs -a autonomath-api -n 500 showed two distinct signatures across the windows:
      ◦ Window 1 (12:18–16:52 UTC): running integrity_check on /data/autonomath.db with no follow-up ok line (RC1 signature).
      ◦ Window 2 (23:40 UTC onward): autonomath: required migrations missing from schema_migrations: [...] followed by entrypoint exit (RC4 signature).
  • scripts/cron/db_boot_hang_alert.py (Wave 25) caught Window 1 within 5 min of the first integrity_check log line. It did not alert on Window 2 because the boot pattern was different: schema_guard exits cleanly with a non-zero return code rather than hanging. The watchdog detected hang shape, not FAIL shape. The Wave 41 alert extension fixes that gap.

Mitigation

Code fixes landed

| Fix | Commit | PR | Scope |
|-----|--------|----|-------|
| Wave 18 §4 size-based skip | 81922433f | #35 | RC1 |
| Wave 40 manifest: authorize 5 migrations | 82df31bd8 | #75 | RC4 |

Deploy escape path

  • Strategy F (GHA workflow_dispatch on deploy.yml) was the only deploy path that worked through the Wave 22 cascading failure. Local flyctl deploy --remote-only (the Wave 25 escape) failed three times at the depot builder upload phase. The local docker build failed in the Docker Desktop VM's apt unpack phase.
  • Runbook v2 promotes Strategy F to a documented first-class option, with the trigger pattern: gh workflow run deploy.yml --ref main.

Operator actions during the incident

  • Cancelled three depot uploads to free queue.
  • Did not flyctl machine destroy between Wave 22 and Wave 40 — a destroy + redeploy would have re-armed RC4 deterministically. Volume contents are authoritative; we kept the volumes intact.
  • Did not modify BOOT_ENFORCE_INTEGRITY_CHECK because Wave 18 was already on the image — RC1 was not the issue in Window 2.
  • Did not manually run the 5 missing migrations via flyctl ssh console because (a) the entrypoint exits before any ssh session can persist, and (b) running migrations outside the entrypoint defeats the auditability of schema_migrations.

Impact

  • Customer-visible: api.jpcite.com 5xx for ~14h gross (12:18 UTC 2026-05-11 → 06:30 UTC 2026-05-12 with a 68-min recovered window in the middle). MCP stdio + DXT bundle surfaces unaffected (they read the bundled snapshot, not the API).
  • CF Pages: healthy throughout (organic AI agent crawls hit the static surface). Bing / Perplexity / ChatGPT citation reachability uninterrupted.
  • Stripe billing: zero metered events during 5xx windows. No incorrect charges.
  • Cron: the 6 weekly / 4 daily cron jobs didn't fire during the windows; morning_briefing.py re-armed cleanly on the Wave 40 image boot.
  • Organic acquisition signal: the outage windows map to 5/11 21:40–01:30 JST and 5/12 08:40–15:30 JST; the second straddles Japanese morning work hours. Likely some cohort impact, but no telemetry to quantify it. Solo zero-touch ops means no per-customer notification was issued.

Lessons learned

L1 — Packaging-mode changes re-arm latent boot invariants

Wave 22 baked-seed eliminated the sftp hydrate step but did not audit what a fresh volume looks like under the new packaging. The autonomath_boot_manifest.txt empty-by-default design was correct for the legacy sftp-hydrate flow (the hydrated DB already had all migrations recorded). On a baked-seed flow with a new volume, the manifest must list every migration required by schema_guard. This invariant should be a pre-deploy gate: pre_deploy_manifest_verify.py (Wave 41 lands this).

L2 — schema_guard required migrations list is a deploy contract

The set AM_REQUIRED_MIGRATIONS (and JPINTEL_REQUIRED_MIGRATIONS) in scripts/schema_guard.py is the single source of truth for which migrations must be present on disk and recorded in schema_migrations. The boot manifest must be a superset of that contract — otherwise the entrypoint cannot bring a fresh volume up to the required state.
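The superset contract reduces to one set check. A sketch of what a pre-deploy gate like scripts/ops/pre_deploy_manifest_verify.py might assert (the real script's interface and output are not shown here; this only illustrates the invariant):

```python
def manifest_shortfall(manifest_entries, required_migrations):
    """Pre-deploy gate: the boot manifest must be a superset of the
    schema_guard contract, or a fresh volume can never reach the required
    state. An empty result means the gate passes."""
    return sorted(set(required_migrations) - set(manifest_entries))

# The 5-entry contract from this post-mortem.
REQUIRED = {
    "049_provenance_strengthen.sql",
    "075_am_amendment_diff.sql",
    "090_law_article_body_en.sql",
    "115_source_manifest_view.sql",
    "121_jpi_programs_subsidy_rate_text_column.sql",
}
```

The pre-PR-#75 state (empty manifest) fails this gate on all 5 entries, which is exactly the condition RC4 surfaced at boot time instead.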

L3 — Detection watchdog needs to cover FAIL shape, not just HANG shape

scripts/cron/db_boot_hang_alert.py (Wave 25) was scoped to the integrity_check hang signature. RC4's signature is a clean exit with a schema_guard FAIL message. Wave 41 extends the watchdog to detect both shapes.
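The two shapes can be told apart from the log tail alone. A sketch of the classification (patterns taken from the log lines quoted in this document; the real watchdog also tracks elapsed time before declaring a hang, which is omitted here):

```python
import re

HANG_START = re.compile(r"running integrity_check on .+autonomath\.db")
HANG_OK = re.compile(r"integrity_check.*\bok\b")
FAIL_SHAPE = re.compile(r"required migrations missing from schema_migrations")

def classify_boot_logs(lines):
    """Classify a boot log tail: 'fail' (schema_guard FAIL then exit),
    'hang' (integrity_check started, no ok line), or 'healthy'."""
    saw_start, saw_ok = False, False
    for ln in lines:
        if FAIL_SHAPE.search(ln):
            return "fail"
        if HANG_START.search(ln):
            saw_start = True
        elif saw_start and HANG_OK.search(ln):
            saw_ok = True
    if saw_start and not saw_ok:
        return "hang"
    return "healthy"
```

Window 1 classifies as "hang" and Window 2 as "fail"; the Wave 25 watchdog only alerted on the former.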

L4 — Depot builder upload reliability is a single point of failure

When flyctl deploy --remote-only is the only operator escape path and the depot upload hangs, the operator is stuck. Strategy F (GHA workflow_dispatch) provides a second path that doesn't require local upload. The runbook v2 makes Strategy F a first-class option.

L5 — Multi-root-cause incidents need fan-out in the post-mortem

The v1 post-mortem documented a single root cause and a clean recovery. V2 documents four root causes that fired in sequence. A single-cause TL;DR misreads a multi-RC incident as a regression; the v2 layout (table of RCs up front, then one section per RC, then a mitigation summary across all four) better reflects the actual incident shape and gives future operators a faster diagnostic lookup.

Action items

| ID | Action | Owner | Status |
|----|--------|-------|--------|
| AI-1 | Land Wave 18 §4 size-based skip on main. | 梅田 / Claude | DONE (commit 81922433f, PR #35) |
| AI-2 | Authorize the 5 required migrations in autonomath_boot_manifest.txt. | 梅田 / Claude | DONE (commit 82df31bd8, PR #75) |
| AI-3 | Add scripts/ops/pre_deploy_manifest_verify.py: a pre-deploy gate that asserts the manifest is a superset of the schema_guard required migrations. | Claude | DONE (Wave 41) |
| AI-4 | Extend scripts/cron/db_boot_hang_alert.py to detect the schema_guard FAIL pattern (not just hang). | Claude | DONE (Wave 41) |
| AI-5 | Expand scripts/ops/post_deploy_verify_v4.sh from 10 to 15 checks (schema_guard pass evidence + 30+ endpoint 200 sweep). | Claude | DONE (Wave 41) |
| AI-6 | Update docs/runbook/incident_response_db_boot_hang.md → v2 with the 4-RC strategy + Strategy F first-class. | Claude | DONE (Wave 41) |
| AI-7 | Update memory feedback_no_quick_check_on_huge_sqlite with the Wave 40 manifest learning. | Claude | DONE (Wave 41) |
| AI-8 | New memory feedback_pre_deploy_manifest_verify documenting the boot manifest invariant. | Claude | DONE (Wave 41) |
| AI-9 | Diagnose RC2 (depot builder upload stall): instrument the upload step, audit .dockerignore, reproduce in a clean tree. | 梅田 | OPEN — next wave |
| AI-10 | Audit remaining boot-time SQLite ops (quick_check / VACUUM / REINDEX / ANALYZE) on autonomath.db. | 梅田 | OPEN — next wave |

References

  • v1 post-mortem (RC1 only): docs/postmortem/2026-05-11_integrity_check_outage.md
  • Wave 18 §4 fix: commit 81922433f, PR #35
  • Wave 40 manifest fix: commit 82df31bd8, PR #75
  • Wave 41 SOP work: this post-mortem, scripts/ops/pre_deploy_manifest_verify.py, scripts/ops/post_deploy_verify_v4.sh, docs/runbook/incident_response_db_boot_hang_v2.md, memory feedback_no_quick_check_on_huge_sqlite (updated) + feedback_pre_deploy_manifest_verify (new)
  • entrypoint.sh §2 / §4 (current main HEAD)
  • scripts/schema_guard.py — defines AM_REQUIRED_MIGRATIONS + JPINTEL_REQUIRED_MIGRATIONS
  • scripts/migrations/autonomath_boot_manifest.txt — boot allowlist
  • CLAUDE.md SOT: §entrypoint.sh §2/§4 SIZE-BASED + §autonomath manifest allowlist
  • Memory: feedback_no_quick_check_on_huge_sqlite, feedback_pre_deploy_manifest_verify, feedback_post_deploy_smoke_propagation, feedback_deploy_yml_4_fix_pattern

Last reviewed: 2026-05-12 (Wave 41). Solo zero-touch ops — no team rotation, no PagerDuty.