
2026-05-11 → 2026-05-12 jpcite-api 18h+ Outage Post-Mortem v3

api.jpcite.com served 5xx across a 4-day deploy chain stretching from Wave 22 (2026-05-10) through Wave 44 (2026-05-12): roughly 18 hours of customer-visible 5xx, a 68-min recovered window in the middle, and a long tail of partial degradation while the recovery chain wound through five independent root causes.

v1 documented a single RC (PRAGMA integrity_check on the 9.7 GB autonomath.db). v2 expanded that to four RCs after Wave 22's baked-seed packaging-mode shift re-armed the schema_guard manifest invariant. v3 closes the loop and adds the fifth RC — parallel-agent branch contention — which surfaced when Wave 43-44 multi-PR churn turned into merge conflicts (#97 / #100 / #102) and stalled the Wave 44 retry of Strategy F.

RC Root cause Mitigation strategy Status
RC1 entrypoint.sh §4 PRAGMA integrity_check on 9.7 GB autonomath.db hangs 30+ min, exceeds Fly 60s health-grace. Wave 18 §4 size-based skip (commit 81922433f, PR #35). RESOLVED
RC2 Depot remote builder stalls on build-context push for the 12+ GB local working tree. Three flyctl deploy --remote-only attempts each hung ≥60 min. Wave 22 baked seed eliminated the sftp hydrate dependency, and Wave 24 .dockerignore audit shrunk the upload tree. Strategy F (GHA workflow_dispatch) bypasses depot entirely. MITIGATED (workaround established)
RC3 autonomath_boot_manifest.txt ships empty on Wave 22 baked-seed image; on a fresh /data/autonomath.db volume the 5 required migrations (049/075/090/115/121) are never applied; schema_guard FAILs at boot, machine restart-loops. Wave 40 PR #75 (commit 82df31bd8) authorizes the 5 migrations in the manifest. Wave 41 added scripts/ops/pre_deploy_manifest_verify.py as a pre-deploy gate. RESOLVED
RC4 GHA workflow_run guard on subsidiary workflows mis-skipped deploys triggered by a workflow_dispatch parent, so Strategy F triggers landed but downstream verify steps short-circuited "skip" instead of running. Wave 41 Strategy F established: gh workflow run deploy.yml --ref main direct dispatch bypasses the workflow_run guard chain. Documented in runbook v2 §Option F. RESOLVED
RC5 Parallel-agent branch contention. Wave 43-44 ran 18+ subagents in parallel; multiple branches touched the same manifest / cron workflow files and produced merge conflicts on PRs #97 / #100 / #102 right when the Strategy F retry needed main to be linear. Each conflict added a 20-40 min cycle of rebase + re-CI + admin merge. Wave 44 established the worktree-isolation principle per memory feedback_dual_cli_lane_atomic: each agent claims a /tmp/jpcite-<wave>-<lane> worktree, branches from main, and git push lands a non-conflicting PR. mkdir-atomic claim + AGENT_LEDGER append-only. ESTABLISHED (Wave 45 enforces)

This post-mortem documents the full 4-day timeline, the five root causes, the mitigation matrix that maps each RC to the strategy that resolved it (A/B/C/F/G + worktree), and the multi-RC chain lesson: a fully automated deploy-infrastructure failure is almost never a single root cause — five fired in this chain, each one masking the next.

TL;DR

  • The Wave 22 packaging-mode change (bake jpintel.db into the image) removed the sftp hydrate dependency but did not audit what a fresh /data volume looks like under the new packaging. That silently armed RC3 (manifest empty → schema_guard FAIL on cold boot).
  • The Wave 36-40 cold-boot path tripped RC3 deterministically. PR #75 authorized the 5 missing migrations and unblocked schema_guard.
  • The Wave 41 Strategy F triggered, but RC4 (workflow_run guard skipping downstream verify) caused the deploy to land with smoke-test "skip" instead of green, so the operator could not confirm health. Strategy F was re-documented to use direct gh workflow run dispatch.
  • Wave 42-43 layered 17+ subagents adding features (Dim A-J, AX Resilience cells, monitoring + AMS + discoverability). Several branches touched autonomath_boot_manifest.txt, deploy.yml, and the cron workflow YAMLs simultaneously, generating merge conflicts on PRs #97 / #100 / #102 that each took 20-40 min to rebase. The Strategy F retry pipeline could not run until main was linear.
  • Wave 44 retried Strategy F after the conflict storm cleared and finally restored healthz=200. Wave 45 captures the lesson: worktree-isolation is mandatory for any wave touching deploy / manifest / cron surfaces.

Total customer-visible downtime: ~1080 minutes (~18 hours) gross, with a 68-min recovered window in the middle of Wave 25-40 and a long tail of partial degradation while Wave 41-44 wound through RC4 + RC5.

Timeline (UTC)

Time (UTC) Event Source
2026-05-10 (Wave 22) Baked-seed Dockerfile lands. jpintel.db (352 MB) is now baked into the image; sftp hydrate becomes a recovery-only path. session log
2026-05-11 11:40 Machine 85e273f4 cold-boot on pre-Wave-18 image — entrypoint.sh §4 runs PRAGMA integrity_check on /data/autonomath.db. flyctl logs
11:40–12:18 RC1 fires. 9.7 GB integrity_check walk hangs; no progress output. volume IO profile
12:18 First external 5xx detected; Fly proxy returns "could not find a good candidate". UptimeRobot
13:00–16:00 Wave 25 recovery — push the Wave 18 §4 size-skip fix via GHA deploy.yml. Multiple sub-step failures (Checkout / hydrate sftp / Deploy / smoke). gh run list
16:30 Local flyctl deploy --remote-only via depot builder lands new image; Wave 18 §4 fix activates. flyctl deploy
16:48 First machine on new image logs size-based integrity_check skip. flyctl logs
16:52 healthz=200 restored. End of first outage window. Wave 18 fix on main (commit 81922433f, PR #35). UptimeRobot
17:00–18:00 5-min stability hold met; v1 post-mortem authored. session log
18:00 Wave 22 baked-seed deploy attempt 1. Goal: bake jpintel.db into image to remove sftp dependency. session log
18:00–20:00 RC2 fires. Depot builder hangs on build-context push 3× consecutively, each ≥60 min. flyctl deploy
20:00–22:00 Engineer attempts a local Docker build as fallback; the Docker Desktop VM hangs during the apt unpack step. session log
22:00 RC2 escape: switch to GHA workflow_dispatch deploy.yml (Strategy F). gh workflow run
2026-05-11 23:30 GHA Strategy F deploy completes; image pushed; rolling restart begins. gh run view
23:35 New machine cold-boots on Wave 22 baked-seed image; entrypoint.sh §4 logs size-based skip (Wave 18 fix re-verified — RC1 stays closed). flyctl logs
23:36 schema_guard runs. RC3 fires. Log line: autonomath: required migrations missing from schema_migrations: ['049_provenance_strengthen.sql', '075_am_amendment_diff.sql', '090_law_article_body_en.sql', '115_source_manifest_view.sql', '121_jpi_programs_subsidy_rate_text_column.sql']. Entrypoint exits non-zero; Fly restart-loops. flyctl logs
23:40 External 5xx resumes. UptimeRobot fires again. Outage re-opens. UptimeRobot
2026-05-12 00:00–03:00 Re-deploy attempts on the same broken image yield identical failure shape. Root cause not yet isolated to manifest. flyctl logs
2026-05-12 03:15 RC3 isolated. flyctl ssh console → cat scripts/migrations/autonomath_boot_manifest.txt confirms an empty allowlist; the 5 required migrations cannot be applied. session log
2026-05-12 04:00 Wave 40 PR #75 drafted: authorize the 5 migrations in the manifest, with comment header documenting Wave 22 baked-seed context. session log
2026-05-12 04:30 PR #75 admin-merged to main (commit 82df31bd8). gh pr view 75
2026-05-12 04:35 GHA Strategy F re-triggered (run 25699495154); build + push begins. gh run view
2026-05-12 05:30 Deploy 25699495154 build complete; rolling restart begins. gh run view
2026-05-12 06:30 Wave 41 work begins (post-mortem v2, pre-deploy manifest verify, runbook v2, alert extension). session log
2026-05-12 07:00 RC4 surfaces. GHA workflow_run guard on verify.yml / acceptance-criteria-ci.yml chained off the deploy.yml run mis-skips the post-deploy verify; the operator sees "skipped" instead of "passed" and has to hit healthz manually to confirm 200. gh run list
2026-05-12 07:30 Wave 41 dispatches direct gh workflow run calls per workflow instead of relying on workflow_run chains; runbook v2 documents the Option F + direct dispatch pattern. session log
2026-05-12 08:00–14:00 Wave 42 + Wave 43 push 17+ subagent branches (Dim A-J, AX Resilience cells, monitoring + AMS + discoverability). PRs #97 / #100 / #102 all touch autonomath_boot_manifest.txt simultaneously. gh pr list
2026-05-12 14:00–17:00 RC5 surfaces. Merge conflicts on the boot manifest force 4 sequential rebases. Each cycle takes 20-40 min (rebase + re-CI + admin merge). The Strategy F retry pipeline could not start until main was linear. gh pr view
2026-05-12 17:00 Wave 44 worktree-isolation principle established. Each subagent now claims /tmp/jpcite-<wave>-<lane> and branches from main; PR conflicts collapse. session log
2026-05-12 18:00 Wave 44 Strategy F retry deploys clean. healthz=200 confirmed and held. UptimeRobot
2026-05-12 18:30 5-min stability hold met. End of incident. session log
2026-05-12 19:00 Wave 45 post-mortem v3 begins (this document). session log

Net customer-visible downtime: ~18 hours gross (12:18 UTC 2026-05-11 → 18:00 UTC 2026-05-12 with a 68-min recovered window at 16:52–18:00 on 2026-05-11). 5xx exposure rate during this window ranged from 100% (both Fly machines down) to ~30% (one machine healthy, rolling restart).

Five root causes

RC1 — PRAGMA integrity_check on 9.7 GB autonomath.db at boot

Status: RESOLVED by Wave 18 §4 fix (commit 81922433f, PR #35).

Documented exhaustively in v1 post-mortem (2026-05-11_integrity_check_outage.md) and v2 (2026-05-11_14h_outage_v2.md). The Wave 18 size-based skip held across all subsequent Wave 22-44 boots — every new image logged either size-based integrity_check skip or trusted stamp match rather than running the pragma.

What broke: §4 was retained as a "structural correctness probe" when §2 SHA256 was size-skipped in Wave 13. CLAUDE.md SOT explicitly noted the retention. For a 9.7 GB DB, integrity_check costs 30+ min — roughly 30× Fly's 60s health-check grace period.

How fixed: extended AUTONOMATH_DB_MIN_PRODUCTION_BYTES (≥5 GB) threshold to §4. schema_guard (metadata only, ~milliseconds) is now the single structural probe.
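
As a concrete illustration, here is a minimal sketch of what the §4 guard can look like after the fix, assuming the threshold variable keeps the name above and that GNU stat is available in the image; the real entrypoint.sh on main is the source of truth:

```bash
# Sketch only; the actual entrypoint.sh §4 logic may differ.
DB=/data/autonomath.db
MIN_BYTES="${AUTONOMATH_DB_MIN_PRODUCTION_BYTES:-5368709120}"   # 5 GB default, per the threshold above

db_size=$(stat -c%s "$DB" 2>/dev/null || echo 0)
if [ "$db_size" -ge "$MIN_BYTES" ]; then
  # Production-size DB: skip the full-scan pragma, leave structural checks to schema_guard.
  echo "entrypoint §4: size-based integrity_check skip (${db_size} bytes >= ${MIN_BYTES})"
else
  # Small (dev/seed) DB: the pragma finishes in seconds, so run it.
  sqlite3 "$DB" 'PRAGMA integrity_check;'
fi
```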

Lesson re-stated for v3: the trap was already documented in CLAUDE.md SOT and fly.toml ("fresh-volume bootstrap runs long ... brief first-deploy restart loop rather than widening this check past Fly's effective cap"), but the §4 retention bypassed the SOT warning. Treat SOT cautions as hard constraints, not optional defenses.

RC2 — Depot remote builder stalls on build-context push

Status: MITIGATED via Wave 22 (baked seed) + Wave 24 (.dockerignore audit) + Strategy F (GHA workflow_dispatch).

What broke: three consecutive flyctl deploy --remote-only attempts between 18:00 and 22:00 UTC on 2026-05-11 hung in the depot builder build-context upload phase, each for ≥60 min before manual cancellation. The local working tree was 12+ GB (autonomath.db 9.7 GB, jpintel.db 1.7 GB, plus dist/, dist.bak*/, .venv*/, etc.). .dockerignore excluded the DBs but the depot uploader may have been streaming the full pre-ignore tree.

What landed:

  • Wave 22: baked jpintel.db (352 MB) into the image so the hydrate dependency went away.
  • Wave 24: audited .dockerignore against du -sh of every top-level directory in the repo, added .venv*/, dist.bak*/, data/jpintel.db.bak-*, *.wrangler/ exclusions (a rough audit sketch follows this list).
  • Strategy F: GHA workflow_dispatch on deploy.yml runs the build inside the GHA runner (not depot), so the upload phase is irrelevant. Runbook v2 promotes this to a first-class option.
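
The Wave 24 audit loop above can be approximated in a few lines; the 1 GB threshold is arbitrary and the grep against .dockerignore is only a rough stand-in for real glob matching, so treat this as a triage helper rather than a faithful reimplementation:

```bash
# Flag top-level paths over ~1 GB that .dockerignore does not mention.
du -sm -- * .[!.]* 2>/dev/null | sort -rn | while read -r mb path; do
  if [ "$mb" -gt 1024 ] && ! grep -qF -- "$path" .dockerignore; then
    echo "NOT IGNORED: ${path} (${mb} MB) -- candidate for .dockerignore"
  fi
done
```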

Lesson re-stated for v3: "remote builder works in CI" is not the same as "remote builder works under load with a 12+ GB local tree." RC2 was a deploy infrastructure single point of failure — the fix was to give the operator a second escape path (Strategy F), not to make depot bullet-proof.

RC3 — autonomath_boot_manifest.txt empty on new volume

Status: RESOLVED by Wave 40 PR #75 (commit 82df31bd8).

What broke: entrypoint.sh defaults to AUTONOMATH_BOOT_MIGRATION_MODE=manifest, which means the entrypoint only applies boot-time migrations from scripts/migrations/ if their filenames are explicitly listed in scripts/migrations/autonomath_boot_manifest.txt. Before Wave 40, that file contained zero migration filenames (intentionally — the design forced offline review of every schema change to prevent accidental DROP on prod).

Wave 22 baked the jpintel.db seed into the image, but autonomath.db continued to live on the Fly volume. When Wave 36/37 machines provisioned new volumes (machine destroy + redeploy on Strategy C/F), the fresh autonomath.db had only the schema baked into the entrypoint bootstrap — none of the legacy migrations (049, 075, 090, 115, 121) had ever been recorded in schema_migrations.

schema_guard (scripts/schema_guard.py) defines AM_REQUIRED_MIGRATIONS as exactly those 5 filenames. On a fresh volume, the set difference is all 5 — schema_guard raises, entrypoint exits non-zero, Fly restart-loops the machine.
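
Diagnosing this by hand amounts to the same set difference schema_guard computes. A sketch, runnable from a flyctl ssh console session; the schema_migrations column name (filename) is an assumption, and scripts/schema_guard.py remains the authoritative check:

```bash
# Recompute the RC3 set difference by hand against the live volume DB.
required="049_provenance_strengthen.sql 075_am_amendment_diff.sql 090_law_article_body_en.sql \
115_source_manifest_view.sql 121_jpi_programs_subsidy_rate_text_column.sql"

applied=$(sqlite3 /data/autonomath.db "SELECT filename FROM schema_migrations;")
for m in $required; do
  printf '%s\n' "$applied" | grep -qx "$m" || echo "MISSING: $m"
done
# Any MISSING line reproduces the RC3 boot failure: schema_guard raises and the entrypoint exits non-zero.
```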

PR #75 lifts the manifest to include those 5 migrations explicitly, with a comment header tying the change to Wave 22 baked-seed context. All 5 are pure additive (CREATE IF NOT EXISTS / ALTER TABLE ADD COLUMN / CREATE VIEW IF NOT EXISTS) — boot-time safe.

Lesson re-stated for v3: this is the canonical case of "packaging-mode change re-arms latent boot invariants." The sftp-hydrate flow had a hidden invariant that "the hydrated DB already has all migrations recorded." The baked-seed flow has a different invariant: "the volume DB has only entrypoint-bootstrap schema; everything else must be applied via boot-time migrations, which require manifest authorization." Wave 22 silently flipped which invariant was in force. The pre-deploy manifest gate (Wave 41 AI-3) catches this class of drift before it reaches prod.
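
For reference, the gate's core assertion can be approximated in shell; the real implementation is scripts/ops/pre_deploy_manifest_verify.py, and the hard-coded list below simply repeats the five filenames from this document instead of importing them from schema_guard:

```bash
# Pre-deploy gate sketch: the boot manifest must be a superset of the required-migrations set.
manifest=scripts/migrations/autonomath_boot_manifest.txt
required="049_provenance_strengthen.sql
075_am_amendment_diff.sql
090_law_article_body_en.sql
115_source_manifest_view.sql
121_jpi_programs_subsidy_rate_text_column.sql"

missing=$(printf '%s\n' "$required" | grep -vxF -f <(grep -v '^#' "$manifest"))
if [ -n "$missing" ]; then
  printf 'pre-deploy gate FAIL, manifest is missing:\n%s\n' "$missing" >&2
  exit 1
fi
echo "pre-deploy gate OK: manifest covers all required migrations"
```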

RC4 — GHA workflow_run guard skips deploy verify

Status: RESOLVED by Strategy F direct-dispatch pattern (Wave 41).

What broke: several subsidiary workflows (verify.yml, acceptance-criteria-ci.yml, static-drift-and-runtime-probe.yml) were chained off deploy.yml via workflow_run triggers with a branch-name guard. When Strategy F triggered deploy.yml directly via workflow_dispatch instead of a push to main, the workflow_run evaluation cascaded into "skip" because the guard expected a push-shaped event.

The operator saw:

deploy.yml: success
verify.yml: skipped
acceptance-criteria-ci.yml: skipped

…and could not confirm the deploy was healthy from CI alone. Manual curl https://api.jpcite.com/v1/healthz was the only signal until the next push to main happened naturally.

How fixed: Wave 41 documented Strategy F as direct dispatch of every workflow that needs to run after a deploy, not chained via workflow_run. The runbook v2 §Option F covers the exact gh workflow run invocation pattern.
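
In practice the pattern is a short sequence of gh calls; the workflow filenames are those named in this post-mortem, and the sleep only papers over the lag between dispatching a run and it appearing in gh run list:

```bash
# Strategy F: dispatch deploy.yml directly, then dispatch each verify workflow explicitly
# instead of relying on workflow_run chaining.
gh workflow run deploy.yml --ref main
sleep 10   # give the dispatched run a moment to appear
gh run watch "$(gh run list --workflow=deploy.yml --limit 1 --json databaseId --jq '.[0].databaseId')"

for wf in verify.yml acceptance-criteria-ci.yml; do
  gh workflow run "$wf" --ref main
done
```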

Lesson re-stated for v3: CI guards optimized for the steady-state shape (push to main) silently misfire during emergency deploy paths. When designing a new chained workflow, add an explicit workflow_dispatch trigger AND verify the chain fires under both push and workflow_dispatch parents.

RC5 — Parallel-agent branch contention

Status: ESTABLISHED principle (Wave 44); enforced from Wave 45.

What broke: Wave 42-43 ran 18+ subagents in parallel. Each subagent landed a branch + PR per its lane (Dim A through J, AX Resilience cells 1-12, monitoring + AMS + discoverability). Several branches touched the same files concurrently:

  • scripts/migrations/autonomath_boot_manifest.txt — multiple subagents appended new migration filenames per their lane (252_law_jorei_pref, 255_enforcement_municipality, 259_court_decisions_extended, 260_vec_e5, 261_legal_chain_v2, 262_*, 263_realtime_signal, 264_personalization, 265_cross_source_agreement, 266_fdi_80country). Each append at the same EOF caused git merge to flag conflict.
  • .github/workflows/deploy.yml — Wave 41 (workflow_run guard fix) + Wave 43 (concurrency + dependabot auto-merge) modified overlapping blocks.
  • pyproject.toml / server.json — Wave 43.5 bumped 0.3.5 → 0.4.0 while other waves still referenced 0.3.5.

PRs #97 (Wave 43.1.910 enforcement_court), #100 (Wave 43.2.34 amendment_audit), and #102 (Wave 43.2.78 realtime_personalization) all showed merge conflict on the boot manifest within a 90-min window. Each rebase + re-CI + admin merge cycle took 20-40 min. The Strategy F retry pipeline could not run until main was linear, so the deploy recovery stalled even after PR #75 (Wave 40) was already merged.

How fixed: Wave 44 established the worktree-isolation principle per memory feedback_dual_cli_lane_atomic (a condensed shell sketch follows the list):

  1. Each subagent gets its own worktree at /tmp/jpcite-<wave>-<lane>.
  2. The worktree branches from main HEAD (not from another agent's branch).
  3. Lane claim is mkdir tools/offline/_inbox/<task>/<agent_id> — POSIX atomic, race-free.
  4. Output writes only into the claimed dir; never directly into scripts/migrations/autonomath_boot_manifest.txt from inside a subagent. The manifest is updated via a single integrator pass (collect from _inbox/ and apply once).
  5. Agent ledger is tools/offline/_inbox/<task>/AGENT_LEDGER.md append-only (>>), so concurrent flushes interleave but never lose writes.
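
Compressed into shell, steps 1-5 look roughly like this; the wave/lane/task/agent_id values are placeholders and the exact paths should be read from the memory entry, not from this sketch:

```bash
wave=45; lane=monitoring; task=wave45-monitoring; agent_id=agent-07   # placeholders
wt=/tmp/jpcite-${wave}-${lane}

git worktree add -b "wave${wave}-${lane}" "$wt" main      # isolated checkout, branched from main HEAD

mkdir -p "tools/offline/_inbox/${task}"
mkdir "tools/offline/_inbox/${task}/${agent_id}" \
  || { echo "lane already claimed"; exit 1; }              # mkdir without -p is the atomic claim

# ... agent writes only inside its claimed dir, never into the boot manifest directly ...

echo "$(date -u +%FT%TZ) ${agent_id} ${lane} started" \
  >> "tools/offline/_inbox/${task}/AGENT_LEDGER.md"        # append-only ledger
(cd "$wt" && git push -u origin "wave${wave}-${lane}")     # lands a non-conflicting PR branch
```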

Lesson re-stated for v3: parallel-agent speed-up is real and mandatory (memory feedback_max_parallel_subagents), but speed-up without isolation guarantees produces PR contention that compounds deploy-chain downtime. Worktree isolation is a deploy hygiene requirement, not a developer convenience.

Mitigation matrix

The five RCs map to five distinct strategy categories. The matrix below ties each RC × strategy × effect:

RC Strategy Effect Where landed
RC1 A (local flyctl deploy --remote-only) Bypass GHA, push fix from local checkout Wave 25 (16:30 UTC 2026-05-11)
RC1 B (env-injection BOOT_ENFORCE_INTEGRITY_CHECK=0) Turns the legacy integrity-check path off; only applies when the fix is already on the box runbook v2 §Option B
RC1 C (clone machine, destroy the hung one) Last resort; only if A+B fail runbook v2 §Option C
RC1 (code fix) Wave 18 §4 size-based skip permanent commit 81922433f, PR #35
RC2 F (GHA workflow_dispatch deploy.yml) Skip depot upload entirely Wave 25 + Wave 41 (operator escape)
RC2 (code fix) Wave 24 .dockerignore audit commit chain (Wave 24)
RC3 G (authorize migrations in boot manifest) Add filenames to autonomath_boot_manifest.txt + Strategy F re-deploy Wave 40 PR #75, commit 82df31bd8
RC3 (gate) Wave 41 pre_deploy_manifest_verify.py runbook v2 §Phase 2
RC3 (alert) Wave 41 watchdog extended to RC3 FAIL pattern scripts/cron/db_boot_hang_alert.py
RC4 F (direct gh workflow run dispatch) Bypass workflow_run chain entirely Wave 41 (runbook v2 §Option F)
RC5 (worktree isolation) mkdir-atomic lane claim + AGENT_LEDGER append-only Wave 44 principle, enforced from Wave 45
RC5 (memory) feedback_dual_cli_lane_atomic (2026-05-06, existing) + feedback_multi_root_cause_chain (Wave 45, new) memory store

Multi-RC chain observation: each fix above resolved exactly one RC. The deploy chain stayed broken until all five were addressed. This is the v3 lesson: don't expect a single root cause for a fully automated deploy-infrastructure failure.

Detection

External (uptime / customer-visible):

  • UptimeRobot 502/504 on /v1/healthz (3× 60s cadence). Fired multiple times across the 4-day chain — first at 12:18 UTC 2026-05-11, second at 23:40 UTC 2026-05-11, third at 14:00 UTC 2026-05-12.
  • Fly proxy log: "could not find a good candidate within 40 attempts at load balancing" — same pattern across each window.
  • Cloudflare Pages stayed up throughout (static site/, llms.txt, companion .md, OpenAPI JSON, MCP manifest) because CF Pages is decoupled from api.jpcite.com. Organic AI agent crawls hit the static surface uninterrupted.

Internal (post-detection diagnosis):

  • flyctl logs -a autonomath-api -n 500 showed three distinct signatures across the windows:
  • Window 1 (12:18-16:52 UTC 2026-05-11): running integrity_check on /data/autonomath.db with no follow-up ok line (RC1 signature).
  • Window 2 (23:40 UTC 2026-05-11 onward): autonomath: required migrations missing from schema_migrations: [...] followed by entrypoint exit (RC3 signature).
  • Window 3 (14:00 UTC 2026-05-12 onward): verify.yml: skipped in GHA + healthz manually 200 (RC4 signature — not a true outage, but customer-visible because Slack alerts mis-fired).
  • scripts/cron/db_boot_hang_alert.py (Wave 25 + Wave 41 extensions) caught Window 1 within 5 min of the first integrity_check log line and Window 2 immediately on the first RC3 FAIL line. Wave 45 extends this to all 5 RC patterns (see Alert Extension section).
  • For RC5 (parallel-agent branch contention), the signal is gh pr list --search 'is:open conflict' returning ≥2 PRs during a deploy window. Wave 45 adds this as a workflow guard.
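
A sturdier form of that guard reads the mergeable field from gh's JSON output instead of relying on a text search; a sketch, assuming a gh version with --json/--jq support:

```bash
# Pre-deploy assertion: no open PR may be in a CONFLICTING merge state during a deploy window.
conflicted=$(gh pr list --state open --json number,mergeable \
  --jq '[.[] | select(.mergeable == "CONFLICTING")] | length')
if [ "${conflicted:-0}" -ge 1 ]; then
  echo "pre-deploy gate: ${conflicted} open PR(s) with merge conflicts; resolve before the Strategy F retry" >&2
  exit 1
fi
```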

Impact

  • Customer-visible: api.jpcite.com 5xx for ~18h gross across the 4-day chain. MCP stdio + DXT bundle surfaces unaffected (they read the bundled snapshot, not the API).
  • CF Pages: healthy throughout. Bing / Perplexity / ChatGPT citation reachability uninterrupted. This is the structural resilience that "organic-only + agent-led growth" provides — the static surface answers AI agent crawls even when the API is down. Companion .md chunks for laws / enforcement / programs remained reachable.
  • Stripe billing: zero metered events during 5xx windows. No incorrect charges. Anonymous 3 req/day quota timing was correct (JST midnight reset, IP-keyed).
  • Cron: the 6 weekly / 4 daily cron jobs didn't fire during the windows; morning_briefing.py re-armed cleanly on the Wave 44 image boot.
  • Organic acquisition signal: 5/11 21:40-01:30 JST + 5/12 08:40-15:30 JST + 5/12 23:00-03:00 JST — the second window straddles Japanese morning and early-afternoon work hours; the first and third fall overnight JST. Likely some cohort impact but no telemetry to quantify. Solo zero-touch ops mean no per-customer notification was issued.
  • Trust signal during recovery: the entire incident was resolved without a single LLM API call. Solo + zero-touch + AI agent only — recovery loop ran on Claude Code Max Pro per memory feedback_no_operator_llm_api. No ¥0.5/req cost incurred against the metered floor.

Lessons learned

L1 — Multi-RC chains are the default for fully-automated deploy infra

Five independent RCs fired in one 4-day chain. Each had its own fix, each fix resolved exactly one RC. The deploy infrastructure does not have a single failure mode — it has at least 5, and likely more we haven't tripped yet. v3's multi-RC framing is the canonical post-mortem template going forward.

Memory: feedback_multi_root_cause_chain (new, Wave 45) — every deploy-chain post-mortem must explicitly enumerate RCs and map each to its mitigation, not pretend a single cause explains everything.

L2 — Packaging-mode changes re-arm latent boot invariants

(Carried forward from v2 §L1.) Wave 22 baked-seed eliminated the sftp hydrate step but did not audit what a fresh volume looks like under the new packaging. The autonomath_boot_manifest.txt empty-by-default design was correct for the legacy sftp-hydrate flow (the hydrated DB already had all migrations recorded). On a baked-seed flow with a new volume, the manifest must list every migration required by schema_guard. This invariant is now a pre-deploy gate (scripts/ops/pre_deploy_manifest_verify.py, Wave 41).

L3 — CI guards optimized for push silently misfire under workflow_dispatch

RC4 surfaced because workflow_run guards on subsidiary workflows evaluated to "skip" when the parent ran via workflow_dispatch instead of push. Emergency deploy paths must be tested under the same workflow trigger shape as the steady-state path — not just "the runner ran green once."

Action: any new workflow that chains off deploy.yml must include an explicit workflow_dispatch trigger AND a check that workflow_run guards allow both event shapes.
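
One cheap way to enforce the first half of that action is to assert that every chained workflow also declares a workflow_dispatch trigger; a sketch using gh workflow view, over the workflow filenames listed under RC4:

```bash
# Assert each post-deploy workflow can also be dispatched directly (Strategy F prerequisite).
for wf in verify.yml acceptance-criteria-ci.yml static-drift-and-runtime-probe.yml; do
  gh workflow view "$wf" --yaml | grep -q 'workflow_dispatch' \
    || echo "WARNING: $wf lacks a workflow_dispatch trigger; it cannot be direct-dispatched"
done
```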

L4 — Parallel-agent speed-up demands worktree isolation

RC5 surfaced because Wave 42-43 ran 18+ subagents in parallel without worktree isolation. Branches touched the same files (boot manifest, deploy.yml, version files) and generated merge conflicts. The worktree-isolation principle (memory feedback_dual_cli_lane_atomic) is now mandatory for any wave touching deploy / manifest / cron surfaces.

Specific anti-patterns to ban:

  • Multiple subagents appending to autonomath_boot_manifest.txt — ban; collect manifest entries in _inbox/ and apply via a single integrator pass (see the sketch after this list).
  • Multiple subagents bumping pyproject.toml / server.json version in parallel — ban; version bumps live in a separate dedicated wave.
  • Multiple subagents modifying deploy.yml — ban; deploy.yml changes serialize via a single dedicated wave.
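
A sketch of that single integrator pass follows; the per-agent file name manifest_entries.txt and the _inbox/ layout details are illustrative assumptions, not an established repo contract:

```bash
# Integrator pass: collect per-agent manifest entries, dedupe, append to the manifest once, on one branch.
manifest=scripts/migrations/autonomath_boot_manifest.txt
cat tools/offline/_inbox/*/*/manifest_entries.txt 2>/dev/null | sort -u > /tmp/manifest_new
# Append only entries the manifest does not already contain (comm needs both inputs sorted).
comm -13 <(grep -v '^#' "$manifest" | sort) /tmp/manifest_new >> "$manifest"
```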

L5 — Production-down ≠ customer-down when the static surface is decoupled

api.jpcite.com was 5xx for ~18h but jpcite.com (CF Pages static) stayed up. AI agent crawls — the primary acquisition channel under organic-only + agent-led growth — hit the static surface uninterrupted. The dual-surface architecture (API on Fly + static on CF Pages) is what makes 18h+ down survivable. Companion .md chunks (10,259 files) remained reachable; sitemap remained served.

This is structural resilience earned by the design — keep it.

L6 — SOT cautions must be treated as hard constraints

fly.toml and CLAUDE.md SOT both warned about full-scan ops on the 9.7 GB autonomath.db before RC1 fired. entrypoint.sh §4 retention bypassed the warning ("keep this for structural correctness"). The warning was correct; the retention was wrong. SOT cautions are not optional defenses — they are hard constraints. Memory feedback_no_quick_check_on_huge_sqlite carries this lesson.

Action items

ID Action Owner Status
AI-1 Land Wave 18 §4 size-based skip on main. 梅田 / Claude DONE (commit 81922433f, PR #35)
AI-2 Authorize 5 required migrations in autonomath_boot_manifest.txt. 梅田 / Claude DONE (commit 82df31bd8, PR #75)
AI-3 Add scripts/ops/pre_deploy_manifest_verify.py — pre-deploy gate that asserts manifest is a superset of schema_guard required migrations. Claude DONE (Wave 41)
AI-4 Extend scripts/cron/db_boot_hang_alert.py to detect schema_guard FAIL pattern (not just hang). Claude DONE (Wave 41)
AI-5 Expand scripts/ops/post_deploy_verify_v4.sh from 10 to 15 checks (schema_guard pass evidence + 30+ endpoint 200 sweep). Claude DONE (Wave 41)
AI-6 Update docs/runbook/incident_response_db_boot_hang.md → v2 with 4-RC strategy + Strategy F first-class. Claude DONE (Wave 41)
AI-7 Update memory feedback_no_quick_check_on_huge_sqlite with Wave 40 manifest learning. Claude DONE (Wave 41)
AI-8 New memory feedback_pre_deploy_manifest_verify documenting the boot manifest invariant. Claude DONE (Wave 41)
AI-9 Diagnose RC2 (depot builder upload stall) — instrument upload step, audit .dockerignore, reproduce in clean tree. 梅田 PARTIAL (Wave 22 baked seed + Wave 24 dockerignore audit landed; root cause not fully diagnosed)
AI-10 Audit remaining boot-time SQLite ops (quick_check / VACUUM / REINDEX / ANALYZE) on autonomath.db. 梅田 OPEN — next wave
AI-11 New v3. Author docs/runbook/incident_response_v3_multi_root_cause.md — multi-RC strategy switching flow. Claude DONE (Wave 45, this wave)
AI-12 New v3. New memory feedback_multi_root_cause_chain.md — multi-RC framing for deploy-chain post-mortems. Claude DONE (Wave 45, this wave)
AI-13 New v3. Extend scripts/cron/db_boot_hang_alert.py to detect all 5 RC patterns + GHA workflow_run skip pattern (RC4) + PR conflict open during deploy window (RC5). Claude DONE (Wave 45, this wave)
AI-14 New v3. Add gh pr list --search 'is:open conflict' as a pre-deploy assertion in the runbook v3 §Phase 1 detection chain. Claude DONE (runbook v3 §Phase 1)
AI-15 New v3. Audit every workflow chained off deploy.yml via workflow_run; replace with direct gh workflow run invocations in runbook v3 §Option F. Claude DONE (runbook v3 §Option F + §Option F-strict)

References

  • v1 post-mortem (RC1 only): docs/postmortem/2026-05-11_integrity_check_outage.md
  • v2 post-mortem (4 RCs): docs/postmortem/2026-05-11_14h_outage_v2.md
  • Wave 18 §4 fix: commit 81922433f, PR #35
  • Wave 40 manifest fix: commit 82df31bd8, PR #75
  • Wave 41 SOP work: pre_deploy_manifest_verify.py, post_deploy_verify_v4.sh, incident_response_db_boot_hang_v2.md, memory updates
  • Wave 45 (this wave): post-mortem v3, runbook v3 multi-RC, memory feedback_multi_root_cause_chain, db_boot_hang_alert.py extension to all 5 RC patterns
  • entrypoint.sh §2 / §4 (current main HEAD)
  • scripts/schema_guard.py — defines AM_REQUIRED_MIGRATIONS + JPINTEL_REQUIRED_MIGRATIONS
  • scripts/migrations/autonomath_boot_manifest.txt — boot allowlist
  • CLAUDE.md SOT: §entrypoint.sh §2/§4 SIZE-BASED + §autonomath manifest allowlist
  • Memory: feedback_no_quick_check_on_huge_sqlite, feedback_pre_deploy_manifest_verify, feedback_post_deploy_smoke_propagation, feedback_deploy_yml_4_fix_pattern, feedback_dual_cli_lane_atomic, feedback_multi_root_cause_chain (new, Wave 45)

Last reviewed: 2026-05-12 (Wave 45). Solo zero-touch ops — no team rotation, no PagerDuty. Detection chain: UptimeRobot → Telegram bot → operator phone. Recovery loop ran entirely on Claude Code Max Pro (no LLM API calls), per memory feedback_no_operator_llm_api.