2026-05-11 / 2026-05-12 jpcite-api 14h+ Outage Post-Mortem v2

api.jpcite.com served 5xx for ~14 hours across the 2026-05-11 → 2026-05-12 boundary, in a cascade that recurred even after the Wave 18 fix that closed the original 5h12m outage (post-mortem v1: docs/postmortem/2026-05-11_integrity_check_outage.md). The second incident was not a regression of Wave 18 — the size-based skip for PRAGMA integrity_check held. Instead, four independent root causes fired in sequence, each one masking the next:

| # | Root cause | Mitigation | Status |
|---|------------|------------|--------|
| RC1 | entrypoint.sh §4 PRAGMA integrity_check on the 9.7 GB autonomath.db hangs 30+ min, exceeding the Fly 60s health-grace. | Wave 18 §4 size-based skip (commit 81922433f). | RESOLVED (re-verified on Wave 36-40 images) |
| RC2 | Depot remote builder stalls on build-context push for the 12+ GB repo + image. Three consecutive flyctl deploy --remote-only attempts hung 60+ min each. | Switched to GHA workflow_dispatch deploy.yml (Strategy F). | MITIGATED (workaround landed; root cause not yet diagnosed) |
| RC3 | Local Docker Desktop build (Strategy E fallback) hangs in the apt unpack step due to Docker Desktop VM memory pressure on the engineer laptop. | Abandoned the local build path; depended on GHA. | KNOWN; do not repeat |
| RC4 | autonomath_boot_manifest.txt ships empty on the Wave 22 baked-seed image; on a fresh /data/autonomath.db volume the 5 required migrations (049/075/090/115/121) are never applied; schema_guard FAILs at boot and the machine restart-loops. | Wave 40 PR #75 (commit 82df31bd8) authorizes the 5 migrations in the manifest. | RESOLVED (deploy 25699495154 in flight) |

This post-mortem documents the full timeline, the four root causes, the mitigations we applied (and the ones that failed), and the new SOP work landed in Wave 41 to prevent the next recurrence.

TL;DR

  • Wave 22 (2026-05-10) baked jpintel.db into the Docker image to eliminate the flyctl ssh sftp get hydrate dependency. On a new volume, the freshly bootstrapped autonomath.db has none of the legacy migrations applied; it carries only the schema created by the entrypoint bootstrap.
  • Wave 36-40 deploys cold-booted on new volumes. schema_guard checks that schema_migrations records all migrations in AM_REQUIRED_MIGRATIONS (5 entries: 049 / 075 / 090 / 115 / 121). On a fresh volume those rows are absent, so schema_guard raises required migrations missing from schema_migrations, the entrypoint exits non-zero, and Fly restart-loops the machine.
  • autonomath_boot_manifest.txt is the opt-in allowlist that gates which migrations the entrypoint is permitted to apply at boot. Until PR #75 it contained zero migration filenames (intentional, to force offline review of every schema change). So even though the 5 required migrations existed on disk in scripts/migrations/, the entrypoint was forbidden to apply them, leading to a deterministic boot fail.
  • PR #75 authorizes those 5 specific migrations (all additive, all boot-time safe) in the manifest and unsticks the cold-boot path.

Total customer-visible downtime: ~840 minutes (~14 hours). Continued recovery is gated on deploy 25699495154 completing the rolling restart and the post-deploy smoke confirming healthz=200 + size-based skip log + schema_guard pass.

Timeline (UTC)

| Time (UTC) | Event | Source |
|------------|-------|--------|
| 2026-05-11 11:40 | Machine 85e273f4 cold-boots on a pre-Wave-18 image; entrypoint.sh §4 runs PRAGMA integrity_check on /data/autonomath.db. | flyctl logs |
| 11:40–12:18 | RC1 fires. The 9.7 GB integrity_check walk hangs; no progress output. | volume IO profile |
| 12:18 | First external 5xx detected; Fly proxy returns "could not find a good candidate". | UptimeRobot |
| 13:00–16:00 | Wave 25 recovery: first attempts to push the Wave 18 §4 size-skip fix via GHA deploy.yml. Multiple sub-step failures (Checkout / hydrate sftp / Deploy / smoke). | gh run list |
| 16:30 | Local flyctl deploy --remote-only via the depot builder lands a new image; the Wave 18 §4 fix activates. | flyctl deploy |
| 16:48 | First machine on the new image logs size-based integrity_check skip. | flyctl logs |
| 16:52 | healthz=200 restored; end of the first outage window. Wave 18 fix on main (commit 81922433f). | UptimeRobot |
| 17:00 | 5-min stability hold met; v1 post-mortem authored. | session log |
| 18:00 | Wave 22 baked-seed deploy attempt 1. Goal: bake jpintel.db into the image to remove the sftp dependency. | session log |
| 18:00–20:00 | RC2 fires. Depot builder hangs on build-context push 3× consecutively, each ≥60 min. | flyctl deploy |
| 20:00–22:00 | RC3 fires. Engineer attempts a local Docker build as fallback; the Docker Desktop VM hangs in the apt unpack step. | session log |
| 22:00 | RC2 escape: switch to GHA workflow_dispatch deploy.yml (Strategy F). | gh workflow run |
| 23:30 | GHA Strategy F deploy completes; image pushed; rolling restart begins. | gh run view |
| 23:35 | New machine cold-boots on the Wave 22 baked-seed image; entrypoint.sh §4 logs size-based skip (Wave 18 fix re-verified; RC1 stays closed). | flyctl logs |
| 23:36 | schema_guard runs. RC4 fires. Log line: autonomath: required migrations missing from schema_migrations: ['049_provenance_strengthen.sql', '075_am_amendment_diff.sql', '090_law_article_body_en.sql', '115_source_manifest_view.sql', '121_jpi_programs_subsidy_rate_text_column.sql']. Entrypoint exits non-zero; Fly restart-loops. | flyctl logs |
| 23:40 | External 5xx resumes; UptimeRobot fires again; the outage re-opens. | UptimeRobot |
| 2026-05-12 00:00–03:00 | Re-deploy attempts on the same broken image yield an identical failure shape. Root cause not yet isolated to the manifest. | flyctl logs |
| 03:15 | RC4 isolated. flyctl ssh console + cat scripts/migrations/autonomath_boot_manifest.txt confirms the empty allowlist; the 5 required migrations cannot be applied. | session log |
| 04:00 | Wave 40 PR #75 drafted: authorize the 5 migrations in the manifest, with a comment header documenting the Wave 22 baked-seed context. | session log |
| 04:30 | PR #75 admin-merged to main (commit 82df31bd8). | gh pr view 75 |
| 04:35 | GHA Strategy F re-triggered (run 25699495154); build + push begins. | gh run view |
| 05:30 | Deploy 25699495154 build complete; rolling restart begins. | gh run view |
| 06:30 | Wave 41 work begins (this post-mortem, pre-deploy manifest verify, runbook v2, alert extension). | session log |

Total customer-visible downtime: ~840 minutes (12:18 UTC 2026-05-11 → ~06:30 UTC 2026-05-12, with a 68-min recovered window at 16:52–18:00). Net 5xx exposure ≈ 772 min (840 min gross − 68 min recovered), but we state 14h+ for conservative external-uptime accounting.

Four root causes

RC1 — PRAGMA integrity_check on 9.7 GB autonomath.db at boot

Status: RESOLVED by Wave 18 §4 fix (commit 81922433f).

Documented exhaustively in the v1 post-mortem (2026-05-11_integrity_check_outage.md). The Wave 18 size-based skip held across all subsequent Wave 22-40 boots: every new image logged either size-based integrity_check skip or trusted stamp match rather than running the pragma.

What broke: §4 was retained as a "structural correctness probe" when §2 SHA256 was size-skipped in Wave 13. CLAUDE.md SOT explicitly noted the retention. For a 9.7 GB DB, integrity_check costs 30+ min, far beyond the Fly 60s health-grace.

How fixed: extended AUTONOMATH_DB_MIN_PRODUCTION_BYTES (≥5 GB) threshold to §4. schema_guard (metadata only, ~milliseconds) is now the single structural probe.
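The §4 decision reduces to a single size comparison. A minimal Python sketch (the real gate lives in entrypoint.sh as shell; the 5 GB default here is assumed from the ≥5 GB figure above, and the function name is illustrative):

```python
import os

# Default mirrors the >=5 GB production threshold quoted in this post-mortem.
MIN_PRODUCTION_BYTES = int(
    os.environ.get("AUTONOMATH_DB_MIN_PRODUCTION_BYTES", 5 * 1024**3)
)

def should_run_integrity_check(db_path: str, threshold: int = MIN_PRODUCTION_BYTES) -> bool:
    """Skip the boot-time PRAGMA integrity_check for production-sized DBs,
    where the pragma would blow past the Fly 60s health-grace window."""
    return os.path.getsize(db_path) < threshold
```

schema_guard (metadata only, milliseconds) remains the structural probe for the databases this sketch would skip.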

RC2 — Depot remote builder stalls on build-context push

Status: MITIGATED via Strategy F (GHA workflow_dispatch); root not isolated.

What broke: three consecutive flyctl deploy --remote-only attempts between 18:00 and 22:00 UTC hung in the depot builder's build-context upload phase, each for ≥60 min before manual cancellation. The local working tree is 12+ GB (autonomath.db 9.7 GB, jpintel.db 1.7 GB, plus dist/, dist.bak*/, .venv*/, etc.). .dockerignore excludes the DBs, but the depot uploader may be streaming the full pre-ignore tree to the remote, or the depot-side cache may be missing.

What we know:

  • flyctl deploy --remote-only worked on 2026-05-11 16:30 (Wave 25 recovery) on the same machine, same operator, same network.
  • The repo size grew between 16:30 and 18:00 from the Wave 22 baked-seed prep (data dirs, sitemap regen).
  • .dockerignore was updated in Wave 22 but not exhaustively re-tested.

Open work (next wave): instrument the depot upload step, audit .dockerignore coverage against the du tree, and reproduce the hang in a clean checkout to confirm whether it's a deterministic size trip or a flaky depot infra issue.
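The .dockerignore coverage audit could start from a rough script like the following. This is a sketch under stated assumptions: it only checks top-level entries by simple fnmatch, whereas real .dockerignore semantics (negation, **, path anchoring) are richer, so it can only flag candidates, not prove coverage:

```python
import fnmatch
import os

def large_entries_not_ignored(root: str, dockerignore_lines, threshold_bytes: int):
    """Flag top-level entries over threshold_bytes whose names match no
    .dockerignore pattern. Approximation only: fnmatch is not the full
    .dockerignore matcher (no negation, no ** handling, no anchoring)."""
    patterns = [
        ln.strip().rstrip("/") for ln in dockerignore_lines
        if ln.strip() and not ln.startswith("#")
    ]
    offenders = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if os.path.isfile(path):
            size = os.path.getsize(path)
        else:
            # Recursive directory size, like `du -s`.
            size = sum(
                os.path.getsize(os.path.join(dirpath, f))
                for dirpath, _, files in os.walk(path)
                for f in files
            )
        if size >= threshold_bytes and not any(
            fnmatch.fnmatch(name, pat) for pat in patterns
        ):
            offenders.append((name, size))
    return offenders
```

Anything this flags is a candidate for either a .dockerignore entry or an explanation in the RC2 diagnosis.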

RC3 — Local Docker Desktop build hangs in apt unpack

Status: KNOWN, do not repeat.

What broke: when RC2 stalled, we tried docker build directly on the engineer laptop. The build hung in the RUN apt-get install ... unpack step. Docker Desktop on macOS runs a Linux VM with a fixed memory cap (default 4 GB or 8 GB), and the engineer laptop was simultaneously running browser sessions + Claude Code subprocesses; the VM thrashed.

This was not a code problem: the Dockerfile is correct and builds fine both in CI and on the depot builder. Local Docker Desktop builds of the production image are not a supported escape path on this hardware. The runbook v2 documents this as Option D (last-resort only; requires laptop quiesce).

RC4 — autonomath_boot_manifest.txt empty on new volume

Status: RESOLVED by Wave 40 PR #75 (commit 82df31bd8).

What broke: entrypoint.sh defaults to AUTONOMATH_BOOT_MIGRATION_MODE=manifest, which means the entrypoint only applies boot-time migrations from scripts/migrations/ if their filenames are explicitly listed in scripts/migrations/autonomath_boot_manifest.txt. Before Wave 40, that file contained zero migration filenames (intentionally — the design forced offline review of every schema change to prevent accidental DROP on prod).
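The manifest gate described above amounts to an intersection of what is on disk with what the allowlist authorizes. A minimal Python sketch (the real gate is shell in entrypoint.sh; manifest comment/blank-line handling is assumed):

```python
import os

def allowed_boot_migrations(migrations_dir: str, manifest_path: str):
    """Migrations the entrypoint may apply at boot: files present on disk
    whose filenames are explicitly listed in the manifest allowlist.
    Blank lines and # comments in the manifest are ignored (assumption)."""
    with open(manifest_path) as fh:
        allowlist = {
            ln.strip() for ln in fh
            if ln.strip() and not ln.lstrip().startswith("#")
        }
    on_disk = {f for f in os.listdir(migrations_dir) if f.endswith(".sql")}
    # The pre-PR-#75 state: an empty allowlist yields an empty result,
    # so nothing is ever applied on a fresh volume.
    return sorted(on_disk & allowlist)
```

This is why the 5 required migrations sitting in scripts/migrations/ did not help: the intersection with an empty allowlist is empty.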

Wave 22 baked the jpintel.db seed into the image, but autonomath.db continued to live on the Fly volume. When Wave 36/37 machines provisioned new volumes (machine destroy + redeploy on Strategy C/F), the fresh autonomath.db had only the schema baked into the entrypoint bootstrap — none of the legacy migrations (049, 075, 090, 115, 121) had ever been recorded in schema_migrations.

schema_guard (scripts/schema_guard.py) defines AM_REQUIRED_MIGRATIONS as exactly those 5 filenames. On a fresh volume, the set difference is all 5 — schema_guard raises, entrypoint exits non-zero, Fly restart-loops the machine.
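The set-difference check can be sketched as follows. This is illustrative, not the actual scripts/schema_guard.py: the schema_migrations column name ("filename") and the error handling here are assumptions:

```python
import sqlite3

# The 5-entry contract quoted in this post-mortem.
AM_REQUIRED_MIGRATIONS = {
    "049_provenance_strengthen.sql",
    "075_am_amendment_diff.sql",
    "090_law_article_body_en.sql",
    "115_source_manifest_view.sql",
    "121_jpi_programs_subsidy_rate_text_column.sql",
}

def missing_required_migrations(db_path: str):
    """Required migrations not recorded in schema_migrations. On a fresh
    volume the table (or all its rows) is absent, so the result is all 5."""
    conn = sqlite3.connect(db_path)
    try:
        try:
            # Column name "filename" is an assumption for this sketch.
            applied = {r[0] for r in conn.execute(
                "SELECT filename FROM schema_migrations")}
        except sqlite3.OperationalError:  # table absent on a fresh DB
            applied = set()
        return sorted(AM_REQUIRED_MIGRATIONS - applied)
    finally:
        conn.close()
```

A non-empty result is the RC4 signature: schema_guard raises, the entrypoint exits non-zero, and Fly restart-loops.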

PR #75 lifts the manifest to include those 5 migrations explicitly, with a comment header tying the change to Wave 22 baked-seed context. All 5 are pure additive (CREATE IF NOT EXISTS / ALTER TABLE ADD COLUMN / CREATE VIEW IF NOT EXISTS) — boot-time safe.

Why all four cascaded into one outage

Each root cause was independent — RC1 was Wave 18 territory, RC2 was infra flake, RC3 was laptop, RC4 was a Wave 22 packaging mismatch. The cascade pattern:

  1. RC1 closed (Wave 18 fix landed at 16:52).
  2. RC2 fired when we tried to ship Wave 22 (baked-seed).
  3. RC3 fired as a failed escape from RC2.
  4. Strategy F escaped RC2 and RC3.
  5. Wave 22 image cold-booted on a new volume; RC4 fired.
  6. The combination of the empty-by-default manifest design and a new volume was a latent foot-gun that Wave 22 silently armed.

The lesson: a packaging-mode change (baked seed) re-arms latent boot-time invariants. Wave 22's design review should have asked "what does a fresh volume look like under this packaging?" and discovered the manifest gap before shipping.

Detection

External (uptime / customer-visible):

  • UptimeRobot 502/504 on /v1/healthz (3× 60s cadence). Fired twice — first at 12:18 UTC, second at 23:40 UTC.
  • Fly proxy log: "could not find a good candidate within 40 attempts at load balancing" — same pattern across both windows.
  • Cloudflare Pages stayed up throughout (static site/, llms.txt, companion .md, OpenAPI JSON, MCP manifest) because CF Pages is decoupled from api.jpcite.com.

Internal (post-detection diagnosis):

  • flyctl logs -a autonomath-api -n 500 showed two distinct signatures across the windows:
      ◦ Window 1 (12:18–16:52 UTC): running integrity_check on /data/autonomath.db with no follow-up ok line (RC1 signature).
      ◦ Window 2 (23:40 UTC onward): autonomath: required migrations missing from schema_migrations: [...] followed by entrypoint exit (RC4 signature).
  • scripts/cron/db_boot_hang_alert.py (Wave 25) caught Window 1 within 5 min of the first integrity_check log line. It did not alert on Window 2 because the boot pattern was different: schema_guard exits cleanly with a non-zero return code rather than hanging. The watchdog detected hang shape, not FAIL shape. The Wave 41 alert extension fixes that gap.

Mitigation

Code fixes landed

| Fix | Commit | PR | Scope |
|-----|--------|----|-------|
| Wave 18 §4 size-based skip | 81922433f | #35 | RC1 |
| Wave 40 manifest: authorize 5 migrations | 82df31bd8 | #75 | RC4 |

Deploy escape path

  • Strategy F (GHA workflow_dispatch on deploy.yml) was the only deploy path that worked through the Wave 22 cascading failure. Local flyctl deploy --remote-only (the Wave 25 escape) failed three times at the depot builder upload phase. The local docker build failed in the Docker Desktop VM's apt unpack phase.
  • Runbook v2 promotes Strategy F to a documented first-class option, with the trigger pattern: gh workflow run deploy.yml --ref main.

Operator actions during the incident

  • Cancelled three depot uploads to free queue.
  • Did not flyctl machine destroy between Wave 22 and Wave 40 — a destroy + redeploy would have re-armed RC4 deterministically. Volume contents are authoritative; we kept the volumes intact.
  • Did not modify BOOT_ENFORCE_INTEGRITY_CHECK because Wave 18 was already on the image — RC1 was not the issue in Window 2.
  • Did not manually run the 5 missing migrations via flyctl ssh console because (a) the entrypoint exits before any ssh session can persist, and (b) running migrations outside the entrypoint defeats the auditability of schema_migrations.

Impact

  • Customer-visible: api.jpcite.com 5xx for ~14h gross (12:18 UTC 2026-05-11 → 06:30 UTC 2026-05-12 with a 68-min recovered window in the middle). MCP stdio + DXT bundle surfaces unaffected (they read the bundled snapshot, not the API).
  • CF Pages: healthy throughout (organic AI agent crawls hit the static surface). Bing / Perplexity / ChatGPT citation reachability uninterrupted.
  • Stripe billing: zero metered events during 5xx windows. No incorrect charges.
  • Cron: the 6 weekly / 4 daily cron jobs didn't fire during the windows; morning_briefing.py re-armed cleanly on the Wave 40 image boot.
  • Organic acquisition signal: the outage windows map to 5/11 21:40–01:30 JST and 5/12 08:40–15:30 JST; the second straddles Japanese morning work hours. Likely some cohort impact, but no telemetry to quantify it. Solo zero-touch ops means no per-customer notification was issued.

Lessons learned

L1 — Packaging-mode changes re-arm latent boot invariants

Wave 22 baked-seed eliminated the sftp hydrate step but did not audit what a fresh volume looks like under the new packaging. The autonomath_boot_manifest.txt empty-by-default design was correct for the legacy sftp-hydrate flow (the hydrated DB already had all migrations recorded). On a baked-seed flow with a new volume, the manifest must list every migration required by schema_guard. This invariant should be a pre-deploy gate: pre_deploy_manifest_verify.py (Wave 41 lands this).

L2 — schema_guard required migrations list is a deploy contract

The set AM_REQUIRED_MIGRATIONS (and JPINTEL_REQUIRED_MIGRATIONS) in scripts/schema_guard.py is the single source of truth for which migrations must be present on disk and recorded in schema_migrations. The boot manifest must be a superset of that contract — otherwise the entrypoint cannot bring a fresh volume up to the required state.
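The superset contract reduces to one set check. A sketch of what a pre-deploy gate like scripts/ops/pre_deploy_manifest_verify.py might assert (the real script's interface and output are not shown here; this only illustrates the invariant):

```python
def manifest_shortfall(manifest_entries, required_migrations):
    """Pre-deploy gate: the boot manifest must be a superset of the
    schema_guard contract, or a fresh volume can never reach the required
    state. An empty result means the gate passes."""
    return sorted(set(required_migrations) - set(manifest_entries))

# The 5-entry contract from this post-mortem.
REQUIRED = {
    "049_provenance_strengthen.sql",
    "075_am_amendment_diff.sql",
    "090_law_article_body_en.sql",
    "115_source_manifest_view.sql",
    "121_jpi_programs_subsidy_rate_text_column.sql",
}
```

The pre-PR-#75 state (empty manifest) fails this gate on all 5 entries, which is exactly the condition RC4 surfaced at boot time instead.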

L3 — Detection watchdog needs to cover FAIL shape, not just HANG shape

scripts/cron/db_boot_hang_alert.py (Wave 25) was scoped to the integrity_check hang signature. RC4's signature is a clean exit with a schema_guard FAIL message. Wave 41 extends the watchdog to detect both shapes.
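The two shapes can be told apart from the log tail alone. A sketch of the classification (patterns taken from the log lines quoted in this document; the real watchdog also tracks elapsed time before declaring a hang, which is omitted here):

```python
import re

HANG_START = re.compile(r"running integrity_check on .+autonomath\.db")
HANG_OK = re.compile(r"integrity_check.*\bok\b")
FAIL_SHAPE = re.compile(r"required migrations missing from schema_migrations")

def classify_boot_logs(lines):
    """Classify a boot log tail: 'fail' (schema_guard FAIL then exit),
    'hang' (integrity_check started, no ok line), or 'healthy'."""
    saw_start, saw_ok = False, False
    for ln in lines:
        if FAIL_SHAPE.search(ln):
            return "fail"
        if HANG_START.search(ln):
            saw_start = True
        elif saw_start and HANG_OK.search(ln):
            saw_ok = True
    if saw_start and not saw_ok:
        return "hang"
    return "healthy"
```

Window 1 classifies as "hang" and Window 2 as "fail"; the Wave 25 watchdog only alerted on the former.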

L4 — Depot builder upload reliability is a single point of failure

When flyctl deploy --remote-only is the only operator escape path and the depot upload hangs, the operator is stuck. Strategy F (GHA workflow_dispatch) provides a second path that doesn't require local upload. The runbook v2 makes Strategy F a first-class option.

L5 — Multi-root-cause incidents need fan-out in the post-mortem

The v1 post-mortem documented a single root cause and a clean recovery. V2 documents four root causes that fired in sequence. A single-cause TL;DR misreads a multi-RC incident as a regression; the v2 layout (table of RCs up front, then one section per RC, then a mitigation summary across all four) better reflects the actual incident shape and gives future operators a faster diagnostic lookup.

Action items

| ID | Action | Owner | Status |
|----|--------|-------|--------|
| AI-1 | Land Wave 18 §4 size-based skip on main. | 梅田 / Claude | DONE (commit 81922433f, PR #35) |
| AI-2 | Authorize the 5 required migrations in autonomath_boot_manifest.txt. | 梅田 / Claude | DONE (commit 82df31bd8, PR #75) |
| AI-3 | Add scripts/ops/pre_deploy_manifest_verify.py: a pre-deploy gate that asserts the manifest is a superset of the schema_guard required migrations. | Claude | DONE (Wave 41) |
| AI-4 | Extend scripts/cron/db_boot_hang_alert.py to detect the schema_guard FAIL pattern (not just hang). | Claude | DONE (Wave 41) |
| AI-5 | Expand scripts/ops/post_deploy_verify_v4.sh from 10 to 15 checks (schema_guard pass evidence + 30+ endpoint 200 sweep). | Claude | DONE (Wave 41) |
| AI-6 | Update docs/runbook/incident_response_db_boot_hang.md → v2 with the 4-RC strategy + Strategy F first-class. | Claude | DONE (Wave 41) |
| AI-7 | Update memory feedback_no_quick_check_on_huge_sqlite with the Wave 40 manifest learning. | Claude | DONE (Wave 41) |
| AI-8 | New memory feedback_pre_deploy_manifest_verify documenting the boot manifest invariant. | Claude | DONE (Wave 41) |
| AI-9 | Diagnose RC2 (depot builder upload stall): instrument the upload step, audit .dockerignore, reproduce in a clean tree. | 梅田 | OPEN — next wave |
| AI-10 | Audit remaining boot-time SQLite ops (quick_check / VACUUM / REINDEX / ANALYZE) on autonomath.db. | 梅田 | OPEN — next wave |

References

  • v1 post-mortem (RC1 only): docs/postmortem/2026-05-11_integrity_check_outage.md
  • Wave 18 §4 fix: commit 81922433f, PR #35
  • Wave 40 manifest fix: commit 82df31bd8, PR #75
  • Wave 41 SOP work: this post-mortem, scripts/ops/pre_deploy_manifest_verify.py, scripts/ops/post_deploy_verify_v4.sh, docs/runbook/incident_response_db_boot_hang_v2.md, memory feedback_no_quick_check_on_huge_sqlite (updated) + feedback_pre_deploy_manifest_verify (new)
  • entrypoint.sh §2 / §4 (current main HEAD)
  • scripts/schema_guard.py — defines AM_REQUIRED_MIGRATIONS + JPINTEL_REQUIRED_MIGRATIONS
  • scripts/migrations/autonomath_boot_manifest.txt — boot allowlist
  • CLAUDE.md SOT: §entrypoint.sh §2/§4 SIZE-BASED + §autonomath manifest allowlist
  • Memory: feedback_no_quick_check_on_huge_sqlite, feedback_pre_deploy_manifest_verify, feedback_post_deploy_smoke_propagation, feedback_deploy_yml_4_fix_pattern

Last reviewed: 2026-05-12 (Wave 41). Solo zero-touch ops — no team rotation, no PagerDuty.