# 2026-05-11 / 2026-05-12 jpcite-api 14h+ Outage Post-Mortem v2
api.jpcite.com served 5xx for ~14 hours across the 2026-05-11 →
2026-05-12 boundary, in a cascade that recurred even after the Wave 18
fix that closed the original 5h12m outage (post-mortem v1:
docs/postmortem/2026-05-11_integrity_check_outage.md). The second
incident was not a regression of Wave 18 — the size-based skip for
PRAGMA integrity_check held. Instead, four independent root causes
fired in sequence, each one masking the next:
| # | Root cause | Mitigation | Status |
|---|---|---|---|
| RC1 | `entrypoint.sh` §4 `PRAGMA integrity_check` on the 9.7 GB `autonomath.db` hangs 30+ min, exceeding the Fly 60s health grace. | Wave 18 §4 size-based skip (commit 81922433f). | RESOLVED (re-verified on Wave 36-40 images) |
| RC2 | Depot remote builder stalls on build-context push for the 12+ GB repo + image. Three consecutive `flyctl deploy --remote-only` attempts hung 60+ min each. | Switched to GHA `workflow_dispatch` `deploy.yml` (Strategy F). | MITIGATED (workaround landed; root not yet diagnosed) |
| RC3 | Local Docker Desktop build (Strategy E fallback) hangs in the apt unpack step due to Docker Desktop VM memory pressure on the engineer laptop. | Abandoned local build path; depended on GHA. | KNOWN; do not repeat |
| RC4 | `autonomath_boot_manifest.txt` ships empty on the Wave 22 baked-seed image; on a fresh `/data/autonomath.db` volume the 5 required migrations (049/075/090/115/121) are never applied; schema_guard FAILs at boot and the machine restart-loops. | Wave 40 PR #75 (commit 82df31bd8) authorizes the 5 migrations in the manifest. | RESOLVED (deploy 25699495154 in flight) |
This post-mortem documents the full timeline, the four root causes, the mitigations we applied (and the ones that failed), and the new SOP work landed in Wave 41 to prevent the next recurrence.
## TL;DR
- Wave 22 (2026-05-10) baked `jpintel.db` into the Docker image to eliminate the `flyctl ssh sftp get` hydrate dependency. On a new volume, the freshly bootstrapped `autonomath.db` has none of the legacy migrations applied — only the schema baked into the seed.
- Wave 36-40 deploys cold-booted on new volumes. schema_guard checks that `schema_migrations` records all migrations in `AM_REQUIRED_MIGRATIONS` (5 entries: 049 / 075 / 090 / 115 / 121). On a fresh volume those rows are absent, so schema_guard raises `required migrations missing from schema_migrations`, the entrypoint exits non-zero, and Fly restart-loops the machine.
- `autonomath_boot_manifest.txt` is the opt-in allowlist that gates which migrations the entrypoint is permitted to apply at boot. Until PR #75 it contained zero migration filenames (intentional, to force offline review of every schema change). So even though the 5 required migrations existed on disk in `scripts/migrations/`, the entrypoint was forbidden to apply them, leading to a deterministic boot fail.
- PR #75 authorizes those 5 specific migrations (all additive, all boot-time safe) in the manifest and unsticks the cold-boot path.
Total customer-visible downtime: ~840 minutes (~14 hours).
Continued recovery is gated on deploy 25699495154 completing the
rolling restart and the post-deploy smoke confirming healthz=200 +
size-based skip log + schema_guard pass.
## Timeline (UTC)
| Time (UTC) | Event | Source |
|---|---|---|
| 2026-05-11 11:40 | Machine 85e273f4 cold-boots on pre-Wave-18 image — `entrypoint.sh` §4 runs `PRAGMA integrity_check` on `/data/autonomath.db`. | `flyctl logs` |
| 11:40–12:18 | RC1 fires. 9.7 GB integrity_check walk hangs; no progress output. | volume IO profile |
| 12:18 | First external 5xx detected; Fly proxy returns "could not find a good candidate". | UptimeRobot |
| 13:00–16:00 | Wave 25 recovery — first attempts to push the Wave 18 §4 size-skip fix via GHA `deploy.yml`. Multiple sub-step failures (Checkout / hydrate sftp / Deploy / smoke). | `gh run list` |
| 16:30 | Local `flyctl deploy --remote-only` via depot builder lands new image; Wave 18 §4 fix activates. | `flyctl deploy` |
| 16:48 | First machine on new image logs `size-based integrity_check skip`. | `flyctl logs` |
| 16:52 | healthz=200 restored. End of first outage window. Wave 18 fix on main (commit 81922433f). | UptimeRobot |
| 17:00 | 5-min stability hold met; v1 post-mortem authored. | session log |
| 2026-05-11 18:00 | Wave 22 baked-seed deploy attempt 1. Goal: bake `jpintel.db` into the image to remove the sftp dependency. | session log |
| 18:00–20:00 | RC2 fires. Depot builder hangs on build-context push 3× consecutively, each ≥60 min. | `flyctl deploy` |
| 20:00–22:00 | RC3 fires. Engineer attempts local Docker build as fallback; Docker Desktop VM hangs in apt unpack step. | session log |
| 22:00 | RC2 escape: switch to GHA `workflow_dispatch` `deploy.yml` (Strategy F). | `gh workflow run` |
| 2026-05-11 23:30 | GHA Strategy F deploy completes; image pushed; rolling restart begins. | `gh run view` |
| 23:35 | New machine cold-boots on Wave 22 baked-seed image; `entrypoint.sh` §4 logs size-based skip (Wave 18 fix re-verified — RC1 stays closed). | `flyctl logs` |
| 23:36 | schema_guard runs. RC4 fires. Log line: `autonomath: required migrations missing from schema_migrations: ['049_provenance_strengthen.sql', '075_am_amendment_diff.sql', '090_law_article_body_en.sql', '115_source_manifest_view.sql', '121_jpi_programs_subsidy_rate_text_column.sql']`. Entrypoint exits non-zero; Fly restart-loops. | `flyctl logs` |
| 23:40 | External 5xx resumes. UptimeRobot fires again. Outage re-opens. | UptimeRobot |
| 2026-05-12 00:00–03:00 | Re-deploy attempts on the same broken image yield the identical failure shape. Root cause not yet isolated to the manifest. | `flyctl logs` |
| 2026-05-12 03:15 | RC4 isolated. `flyctl ssh console` → `cat scripts/migrations/autonomath_boot_manifest.txt` confirms empty allowlist; the 5 required migrations cannot be applied. | session log |
| 2026-05-12 04:00 | Wave 40 PR #75 drafted: authorize the 5 migrations in the manifest, with a comment header documenting Wave 22 baked-seed context. | session log |
| 2026-05-12 04:30 | PR #75 admin-merged to main (commit 82df31bd8). | `gh pr view 75` |
| 2026-05-12 04:35 | GHA Strategy F re-triggered (run 25699495154); build + push begins. | `gh run view` |
| 2026-05-12 05:30 | Deploy 25699495154 build complete; rolling restart begins. | `gh run view` |
| 2026-05-12 06:30 | Wave 41 work begins (this post-mortem, pre-deploy manifest verify, runbook v2, alert extension). | session log |
Total customer-visible downtime: ~840 minutes gross (12:18 UTC 2026-05-11 → ~06:30 UTC 2026-05-12, with a 68-min recovered window at 16:52–18:00). Net 5xx exposure: 840 min − 68 min recovered ≈ 772 min; we state 14h+ for conservative external-uptime accounting.
## Four root causes

### RC1 — `PRAGMA integrity_check` on 9.7 GB `autonomath.db` at boot
Status: RESOLVED by Wave 18 §4 fix (commit 81922433f).
Documented exhaustively in the v1 post-mortem (2026-05-11_integrity_check_outage.md).
The Wave 18 size-based skip held across all subsequent Wave 22-40
boots — every new image logged either `size-based integrity_check skip`
or `trusted stamp match` rather than running the pragma.
What broke: §4 was retained as a "structural correctness probe" when §2 SHA256 was size-skipped in Wave 13. CLAUDE.md SOT explicitly noted the retention. For a 9.7 GB DB, integrity_check costs 30+ min — roughly 30× the Fly 60s health grace.
How fixed: extended the `AUTONOMATH_DB_MIN_PRODUCTION_BYTES` (≥5 GB)
threshold to §4. schema_guard (metadata only, ~milliseconds) is now
the single structural probe.
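
For reference, a minimal Python sketch of the decision §4 now makes. The real logic lives in `entrypoint.sh` (shell); the default threshold value and anything beyond the log wording quoted in this post-mortem are assumptions:

```python
#!/usr/bin/env python3
"""Sketch only: the production check is implemented in entrypoint.sh §4."""
import os
import sqlite3

# DBs at or above this size skip the full boot-time integrity_check.
# The 5 GB default mirrors the prose above; the env var name comes from Wave 18.
MIN_PRODUCTION_BYTES = int(
    os.environ.get("AUTONOMATH_DB_MIN_PRODUCTION_BYTES", 5 * 1024**3)
)

def boot_integrity_probe(db_path: str) -> None:
    size = os.path.getsize(db_path)
    if size >= MIN_PRODUCTION_BYTES:
        # Production-scale DB: a full PRAGMA integrity_check takes 30+ min,
        # far past the Fly 60s health grace, so log the skip and let
        # schema_guard (metadata-only, ~ms) be the structural probe.
        print(f"size-based integrity_check skip ({size} bytes >= {MIN_PRODUCTION_BYTES})")
        return
    # Small dev/seed DB: the full walk is cheap enough to run at boot.
    with sqlite3.connect(db_path) as conn:
        (result,) = conn.execute("PRAGMA integrity_check").fetchone()
    if result != "ok":
        raise SystemExit(f"integrity_check failed: {result}")
    print("integrity_check ok")
```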
### RC2 — Depot remote builder stalls on build-context push
Status: MITIGATED via Strategy F (GHA `workflow_dispatch`); root cause not yet isolated.
What broke: three consecutive flyctl deploy --remote-only attempts
between 18:00 and 22:00 UTC hung in the depot builder build-context
upload phase, each for ≥60 min before manual cancellation. The local
working tree is 12+ GB (autonomath.db 9.7 GB, jpintel.db 1.7 GB, plus
dist/, dist.bak*/, .venv*/, etc.). .dockerignore excludes the
DBs, but the depot uploader may be streaming the full pre-ignore tree
to the remote, or the depot-side cache is missing.
What we know:

- `flyctl deploy --remote-only` worked on 2026-05-11 16:30 (Wave 25 recovery) on the same machine, same operator, same network.
- The repo size grew between 16:30 and 18:00 from the Wave 22 baked-seed prep (data dirs, sitemap regen).
- `.dockerignore` was updated in Wave 22 but not exhaustively re-tested.
Open work (next wave): instrument the depot upload step, audit
.dockerignore coverage against the du tree, and reproduce the
hang in a clean checkout to confirm whether it's a deterministic size
trip or a flaky depot infra issue.
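
As a starting point for that audit, a hedged sketch of a hypothetical helper (not an existing script) that ranks top-level repo entries by size so each can be eyeballed against `.dockerignore`. It uses naive literal matching, not real `.dockerignore` glob semantics:

```python
#!/usr/bin/env python3
"""Hypothetical RC2 follow-up helper: biggest top-level entries vs .dockerignore."""
from pathlib import Path

def tree_size(path: Path) -> int:
    # Recursive size of a directory, or the size of a single file.
    if path.is_file():
        return path.stat().st_size
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())

def main(repo_root: str = ".") -> None:
    root = Path(repo_root)
    ignore_file = root / ".dockerignore"
    patterns = set()
    if ignore_file.exists():
        patterns = {
            line.strip().rstrip("/")
            for line in ignore_file.read_text().splitlines()
            if line.strip() and not line.strip().startswith("#")
        }
    ranked = sorted(((tree_size(c), c.name) for c in root.iterdir()), reverse=True)
    for size, name in ranked[:15]:
        # Literal name match only; globs like dist.bak*/ need manual review.
        flag = "ignored" if name in patterns else "NOT IGNORED"
        print(f"{size / 1e9:8.2f} GB  {name:30s}  {flag}")

if __name__ == "__main__":
    main()
```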
### RC3 — Local Docker Desktop build hangs in apt unpack
Status: KNOWN, do not repeat.
What broke: when RC2 stalled, we tried docker build directly on the
engineer laptop. The build hung in the RUN apt-get install ...
unpack step. Docker Desktop on macOS runs a Linux VM with a fixed
memory cap (default 4 GB or 8 GB), and the engineer laptop was
simultaneously running browser sessions + Claude Code subprocesses;
the VM thrashed.
This was not a code problem — the Dockerfile is correct and builds fine
both in CI and on the depot builder. Local Docker Desktop builds of the production
image are not a supported escape path on this hardware. The runbook
v2 documents this as Option D (last-resort only, requires laptop
quiesce).
### RC4 — `autonomath_boot_manifest.txt` empty on new volume
Status: RESOLVED by Wave 40 PR #75 (commit 82df31bd8).
What broke: entrypoint.sh defaults to
AUTONOMATH_BOOT_MIGRATION_MODE=manifest, which means the entrypoint
only applies boot-time migrations from scripts/migrations/ if their
filenames are explicitly listed in scripts/migrations/autonomath_boot_manifest.txt.
Before Wave 40, that file contained zero migration filenames
(intentionally — the design forced offline review of every schema
change to prevent accidental DROP on prod).
Wave 22 baked the jpintel.db seed into the image, but
autonomath.db continued to live on the Fly volume. When Wave 36/37
machines provisioned new volumes (machine destroy + redeploy on
Strategy C/F), the fresh autonomath.db had only the schema baked
into the entrypoint bootstrap — none of the legacy migrations
(049, 075, 090, 115, 121) had ever been recorded in schema_migrations.
schema_guard (scripts/schema_guard.py) defines AM_REQUIRED_MIGRATIONS
as exactly those 5 filenames. On a fresh volume, the set difference is
all 5 — schema_guard raises, entrypoint exits non-zero, Fly
restart-loops the machine.
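
The check itself is simple set arithmetic. A minimal sketch of what schema_guard asserts, assuming a `schema_migrations` table with a `filename` column (the real implementation is `scripts/schema_guard.py`):

```python
import sqlite3

# The 5 required autonomath migrations named in this post-mortem.
AM_REQUIRED_MIGRATIONS = {
    "049_provenance_strengthen.sql",
    "075_am_amendment_diff.sql",
    "090_law_article_body_en.sql",
    "115_source_manifest_view.sql",
    "121_jpi_programs_subsidy_rate_text_column.sql",
}

def check_required_migrations(db_path: str) -> None:
    # Assumed column name; the real schema_guard query may differ.
    with sqlite3.connect(db_path) as conn:
        applied = {row[0] for row in conn.execute("SELECT filename FROM schema_migrations")}
    missing = sorted(AM_REQUIRED_MIGRATIONS - applied)
    if missing:
        # On a fresh volume the difference is all 5; the entrypoint then exits non-zero.
        raise SystemExit(
            f"autonomath: required migrations missing from schema_migrations: {missing}"
        )
```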
PR #75 expands the manifest to include those 5 migrations explicitly, with a comment header tying the change to the Wave 22 baked-seed context. All 5 are purely additive (CREATE IF NOT EXISTS / ALTER TABLE ADD COLUMN / CREATE VIEW IF NOT EXISTS) — boot-time safe.
## Why all four cascaded into one outage
Each root cause was independent — RC1 was Wave 18 territory, RC2 was infra flake, RC3 was laptop, RC4 was a Wave 22 packaging mismatch. The cascade pattern:
- RC1 closed (Wave 18 fix landed at 16:52).
- RC2 fired when we tried to ship Wave 22 (baked-seed).
- RC3 fired as a failed escape from RC2.
- Strategy F escaped RC2 and RC3.
- Wave 22 image cold-booted on a new volume; RC4 fired.
- The combination of the empty-by-default manifest design and a new volume was a latent foot-gun that Wave 22 silently armed.
The lesson: a packaging-mode change (baked seed) re-arms latent boot-time invariants. Wave 22's design review should have asked "what does a fresh volume look like under this packaging?" and discovered the manifest gap before shipping.
## Detection
External (uptime / customer-visible):
- UptimeRobot 502/504 on `/v1/healthz` (3× 60s cadence). Fired twice — first at 12:18 UTC, second at 23:40 UTC.
- Fly proxy log: "could not find a good candidate within 40 attempts at load balancing" — same pattern across both windows.
- Cloudflare Pages stayed up throughout (static `site/`, `llms.txt`, `companion.md`, OpenAPI JSON, MCP manifest) because CF Pages is decoupled from `api.jpcite.com`.
Internal (post-detection diagnosis):
- `flyctl logs -a autonomath-api -n 500` showed two distinct signatures across the windows:
  - Window 1 (12:18–16:52 UTC): `running integrity_check on /data/autonomath.db` with no follow-up `ok` line (RC1 signature).
  - Window 2 (23:40 UTC onward): `autonomath: required migrations missing from schema_migrations: [...]` followed by entrypoint exit (RC4 signature).
- `scripts/cron/db_boot_hang_alert.py` (Wave 25) caught Window 1 within 5 min of the first integrity_check log line. It did not alert on Window 2 because the boot pattern was different — schema_guard exits promptly with a non-zero return code rather than hanging. The watchdog detected hang shape, not FAIL shape. The Wave 41 alert extension closes that gap (see the sketch below).
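
A minimal sketch of how the extended watchdog can distinguish the two shapes. The signatures are the log lines quoted above; the function and pattern names are illustrative, not the actual `db_boot_hang_alert.py` API:

```python
import re

# Signatures taken from the two outage windows above.
HANG_START = re.compile(r"running integrity_check on /data/autonomath\.db")
HANG_DONE = re.compile(r"\bok\b")
GUARD_FAIL = re.compile(r"required migrations missing from schema_migrations")

def classify_boot(log_lines: list[str]) -> str:
    """Classify one boot's log tail as 'schema_guard_fail', 'hang', or 'healthy'."""
    integrity_check_open = False
    for line in log_lines:
        if GUARD_FAIL.search(line):
            return "schema_guard_fail"   # RC4 shape: prompt exit, non-zero code
        if HANG_START.search(line):
            integrity_check_open = True  # RC1 shape starts here...
        elif integrity_check_open and HANG_DONE.search(line):
            integrity_check_open = False  # ...and is cleared by a follow-up ok line
    return "hang" if integrity_check_open else "healthy"
```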
## Mitigation

### Code fixes landed
| Fix | Commit | PR | Scope |
|---|---|---|---|
| Wave 18 §4 size-based skip | 81922433f | #35 | RC1 |
| Wave 40 manifest authorize 5 migrations | 82df31bd8 | #75 | RC4 |
### Deploy escape path
- Strategy F (GHA `workflow_dispatch` on `deploy.yml`) was the only deploy path that worked through the Wave 22 cascading failure. Local `flyctl deploy --remote-only` (the Wave 25 escape) failed at the depot builder upload phase three times. Local `docker build` failed at the Docker Desktop VM apt unpack phase.
- Runbook v2 promotes Strategy F to a documented first-class option, with the trigger pattern: `gh workflow run deploy.yml --ref main`.
### Operator actions during the incident
- Cancelled three depot uploads to free the queue.
- Did not `flyctl machine destroy` between Wave 22 and Wave 40 — a destroy + redeploy would have re-armed RC4 deterministically. Volume contents are authoritative; we kept the volumes intact.
- Did not modify `BOOT_ENFORCE_INTEGRITY_CHECK` because Wave 18 was already on the image — RC1 was not the issue in Window 2.
- Did not manually run the 5 missing migrations via `flyctl ssh console` because (a) the entrypoint exits before any ssh session can persist, and (b) running migrations outside the entrypoint defeats the auditability of `schema_migrations`.
## Impact
- Customer-visible: `api.jpcite.com` 5xx for ~14h gross (12:18 UTC 2026-05-11 → 06:30 UTC 2026-05-12, with a 68-min recovered window in the middle). MCP stdio + DXT bundle surfaces unaffected (they read the bundled snapshot, not the API).
- CF Pages: healthy throughout (organic AI agent crawls hit the static surface). Bing / Perplexity / ChatGPT citation reachability uninterrupted.
- Stripe billing: zero metered events during the 5xx windows. No incorrect charges.
- Cron: the 6 weekly / 4 daily cron jobs didn't fire during the windows; `morning_briefing.py` re-armed cleanly on the Wave 40 image boot.
- Organic acquisition signal: 5/11 21:40–01:30 JST + 5/12 08:40–15:30 JST — the second window straddles Japanese morning work hours. Likely some cohort impact but no telemetry to quantify it. Solo zero-touch ops means no per-customer notification was issued.
## Lessons learned

### L1 — Packaging-mode changes re-arm latent boot invariants
Wave 22 baked-seed eliminated the sftp hydrate step but did not audit
what a fresh volume looks like under the new packaging. The
autonomath_boot_manifest.txt empty-by-default design was correct
for the legacy sftp-hydrate flow (the hydrated DB already had all
migrations recorded). On a baked-seed flow with a new volume, the
manifest must list every migration required by schema_guard. This
invariant should be a pre-deploy gate: pre_deploy_manifest_verify.py
(Wave 41 lands this).
### L2 — schema_guard required migrations list is a deploy contract
The set AM_REQUIRED_MIGRATIONS (and JPINTEL_REQUIRED_MIGRATIONS)
in scripts/schema_guard.py is the single source of truth for which
migrations must be present on disk and recorded in schema_migrations.
The boot manifest must be a superset of that contract — otherwise
the entrypoint cannot bring a fresh volume up to the required state.
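
A minimal sketch of the superset assertion the Wave 41 gate enforces. The real gate is `scripts/ops/pre_deploy_manifest_verify.py`; the import path and the manifest comment handling shown here are assumptions:

```python
#!/usr/bin/env python3
"""Pre-deploy gate sketch: fail fast if the boot manifest does not cover the
schema_guard contract. Import path for the required set is an assumption."""
from pathlib import Path

from scripts.schema_guard import AM_REQUIRED_MIGRATIONS  # single source of truth

MANIFEST = Path("scripts/migrations/autonomath_boot_manifest.txt")

def main() -> int:
    listed = {
        line.strip()
        for line in MANIFEST.read_text().splitlines()
        if line.strip() and not line.strip().startswith("#")
    }
    missing = sorted(set(AM_REQUIRED_MIGRATIONS) - listed)
    if missing:
        print(f"FAIL: boot manifest is missing required migrations: {missing}")
        return 1
    print("OK: boot manifest covers all schema_guard required migrations")
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```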
### L3 — Detection watchdog needs to cover FAIL shape, not just HANG shape
scripts/cron/db_boot_hang_alert.py (Wave 25) was scoped to the
integrity_check hang signature. RC4's signature is a clean exit with
a schema_guard FAIL message. Wave 41 extends the watchdog to detect
both shapes.
### L4 — Depot builder upload reliability is a single point of failure
When flyctl deploy --remote-only is the only operator escape path
and the depot upload hangs, the operator is stuck. Strategy F (GHA
workflow_dispatch) provides a second path that doesn't require local
upload. The runbook v2 makes Strategy F a first-class option.
### L5 — Multi-root-cause incidents need fan-out in the post-mortem
V1 post-mortem documented a single root cause and a clean recovery.
V2 documents four root causes that fired in sequence. A
single-root-cause TL;DR mistakes a multi-RC incident for a regression; the v2 layout
(table of RCs upfront, then one section per RC, then mitigation
summary across all four) better reflects the actual incident shape
and gives future operators a faster diagnostic lookup.
## Action items
| ID | Action | Owner | Status |
|---|---|---|---|
| AI-1 | Land Wave 18 §4 size-based skip on main. | 梅田 / Claude | DONE (commit 81922433f, PR #35) |
| AI-2 | Authorize the 5 required migrations in `autonomath_boot_manifest.txt`. | 梅田 / Claude | DONE (commit 82df31bd8, PR #75) |
| AI-3 | Add `scripts/ops/pre_deploy_manifest_verify.py` — pre-deploy gate that asserts the manifest is a superset of the schema_guard required migrations. | Claude | DONE (Wave 41) |
| AI-4 | Extend `scripts/cron/db_boot_hang_alert.py` to detect the schema_guard FAIL pattern (not just hang). | Claude | DONE (Wave 41) |
| AI-5 | Expand `scripts/ops/post_deploy_verify_v4.sh` from 10 to 15 checks (schema_guard pass evidence + 30+ endpoint 200 sweep). | Claude | DONE (Wave 41) |
| AI-6 | Update `docs/runbook/incident_response_db_boot_hang.md` → v2 with 4-RC strategy + Strategy F first-class. | Claude | DONE (Wave 41) |
| AI-7 | Update memory `feedback_no_quick_check_on_huge_sqlite` with the Wave 40 manifest learning. | Claude | DONE (Wave 41) |
| AI-8 | New memory `feedback_pre_deploy_manifest_verify` documenting the boot manifest invariant. | Claude | DONE (Wave 41) |
| AI-9 | Diagnose RC2 (depot builder upload stall) — instrument the upload step, audit `.dockerignore`, reproduce in a clean tree. | 梅田 | OPEN — next wave |
| AI-10 | Audit remaining boot-time SQLite ops (quick_check / VACUUM / REINDEX / ANALYZE) on `autonomath.db`. | 梅田 | OPEN — next wave |
## References
- v1 post-mortem (RC1 only): `docs/postmortem/2026-05-11_integrity_check_outage.md`
- Wave 18 §4 fix: commit `81922433f`, PR #35
- Wave 40 manifest fix: commit `82df31bd8`, PR #75
- Wave 41 SOP work: this post-mortem, `scripts/ops/pre_deploy_manifest_verify.py`, `scripts/ops/post_deploy_verify_v4.sh`, `docs/runbook/incident_response_db_boot_hang_v2.md`, memory `feedback_no_quick_check_on_huge_sqlite` (updated) + `feedback_pre_deploy_manifest_verify` (new)
- `entrypoint.sh` §2 / §4 (current `main` HEAD)
- `scripts/schema_guard.py` — defines `AM_REQUIRED_MIGRATIONS` + `JPINTEL_REQUIRED_MIGRATIONS`
- `scripts/migrations/autonomath_boot_manifest.txt` — boot allowlist
- CLAUDE.md SOT: §entrypoint.sh §2/§4 SIZE-BASED + §autonomath manifest allowlist
- Memory: `feedback_no_quick_check_on_huge_sqlite`, `feedback_pre_deploy_manifest_verify`, `feedback_post_deploy_smoke_propagation`, `feedback_deploy_yml_4_fix_pattern`
Last reviewed: 2026-05-12 (Wave 41). Solo zero-touch ops — no team rotation, no PagerDuty.