
2026-05-11 → 2026-05-12 jpcite-api 18h+ Outage Post-Mortem v3

api.jpcite.com served 5xx across a 4-day deploy chain stretching from Wave 22 (2026-05-10) through Wave 44 (2026-05-12): roughly 18 hours of customer-visible 5xx, a 68-min recovered window in the middle, and a long tail of partial degradation while the recovery chain wound through five independent root causes.

v1 documented a single RC (PRAGMA integrity_check on the 9.7 GB autonomath.db). v2 expanded that to four RCs after Wave 22's baked-seed packaging-mode shift re-armed the schema_guard manifest invariant. v3 closes the loop and adds the fifth RC — parallel-agent branch contention — which surfaced when Wave 43-44 multi-PR churn turned into merge conflicts (#97 / #100 / #102) and stalled the Wave 44 retry of Strategy F.

RC Root cause Mitigation strategy Status
RC1 entrypoint.sh §4 PRAGMA integrity_check on 9.7 GB autonomath.db hangs 30+ min, exceeds Fly 60s health-grace. Wave 18 §4 size-based skip (commit 81922433f, PR #35). RESOLVED
RC2 Depot remote builder stalls on build-context push for the 12+ GB local working tree. Three flyctl deploy --remote-only attempts each hung ≥60 min. Wave 22 baked seed eliminated the sftp hydrate dependency, and Wave 24 .dockerignore audit shrunk the upload tree. Strategy F (GHA workflow_dispatch) bypasses depot entirely. MITIGATED (workaround established)
RC3 autonomath_boot_manifest.txt ships empty on Wave 22 baked-seed image; on a fresh /data/autonomath.db volume the 5 required migrations (049/075/090/115/121) are never applied; schema_guard FAILs at boot, machine restart-loops. Wave 40 PR #75 (commit 82df31bd8) authorizes the 5 migrations in the manifest. Wave 41 added scripts/ops/pre_deploy_manifest_verify.py as a pre-deploy gate. RESOLVED
RC4 GHA workflow_run guard on subsidiary workflows mis-skipped deploys triggered by a workflow_dispatch parent, so Strategy F triggers landed but downstream verify steps short-circuited "skip" instead of running. Wave 41 Strategy F established: gh workflow run deploy.yml --ref main direct dispatch bypasses the workflow_run guard chain. Documented in runbook v2 §Option F. RESOLVED
RC5 Parallel-agent branch contention. Wave 43-44 ran 18+ subagents in parallel; multiple branches touched the same manifest / cron workflow files and produced merge conflicts on PRs #97 / #100 / #102 right when the Strategy F retry needed main to be linear. Each conflict added a 20-40 min cycle of rebase + re-CI + admin merge. Wave 44 established the worktree-isolation principle per memory feedback_dual_cli_lane_atomic: each agent claims a /tmp/jpcite-<wave>-<lane> worktree, branches from main, and git push lands a non-conflicting PR. mkdir-atomic claim + AGENT_LEDGER append-only. ESTABLISHED (Wave 45 enforces)

This post-mortem documents the full 4-day timeline, the five root causes, the mitigation matrix that maps each RC to the strategy that resolved it (A/B/C/F/G + worktree), and the multi-RC chain lesson: a fully automated deploy-infrastructure failure is almost never a single root cause — five fired in this chain, each one masking the next.

TL;DR

  • The Wave 22 packaging-mode change (bake jpintel.db into the image) removed the sftp hydrate dependency but did not audit what a fresh /data volume looks like under the new packaging. That silently armed RC3 (manifest empty → schema_guard FAIL on cold boot).
  • The Wave 36-40 cold-boot path tripped RC3 deterministically. PR #75 authorized the 5 missing migrations and unblocked schema_guard.
  • The Wave 41 Strategy F triggered, but RC4 (workflow_run guard skipping downstream verify) caused the deploy to land with smoke-test "skip" instead of green, so the operator could not confirm health. Strategy F was re-documented to use direct gh workflow run dispatch.
  • Wave 42-43 layered 17+ subagents adding features (Dim A-J, AX Resilience cells, monitoring + AMS + discoverability). Several branches touched autonomath_boot_manifest.txt, deploy.yml, and the cron workflow YAMLs simultaneously, generating merge conflicts on PRs #97 / #100 / #102 that each took 20-40 min to rebase. The Strategy F retry pipeline could not run until main was linear.
  • Wave 44 retried Strategy F after the conflict storm cleared and finally restored healthz=200. Wave 45 captures the lesson: worktree-isolation is mandatory for any wave touching deploy / manifest / cron surfaces.

Total customer-visible downtime: ~1080 minutes (~18 hours) gross, with a 68-min recovered window in the middle of Wave 25-40 and a long tail of partial degradation while Wave 41-44 wound through RC4 + RC5.

Timeline (UTC)

Time (UTC) Event Source
2026-05-10 (Wave 22) Baked-seed Dockerfile lands. jpintel.db (352 MB) is now baked into the image; sftp hydrate becomes a recovery-only path. session log
2026-05-11 11:40 Machine 85e273f4 cold-boot on pre-Wave-18 image — entrypoint.sh §4 runs PRAGMA integrity_check on /data/autonomath.db. flyctl logs
11:40–12:18 RC1 fires. 9.7 GB integrity_check walk hangs; no progress output. volume IO profile
12:18 First external 5xx detected; Fly proxy returns "could not find a good candidate". UptimeRobot
13:00–16:00 Wave 25 recovery — push the Wave 18 §4 size-skip fix via GHA deploy.yml. Multiple sub-step failures (Checkout / hydrate sftp / Deploy / smoke). gh run list
16:30 Local flyctl deploy --remote-only via depot builder lands new image; Wave 18 §4 fix activates. flyctl deploy
16:48 First machine on new image logs size-based integrity_check skip. flyctl logs
16:52 healthz=200 restored. End of first outage window. Wave 18 fix on main (commit 81922433f, PR #35). UptimeRobot
17:00–18:00 5-min stability hold met; v1 post-mortem authored. session log
18:00 Wave 22 baked-seed deploy attempt 1. Goal: bake jpintel.db into image to remove sftp dependency. session log
18:00–20:00 RC2 fires. Depot builder hangs on build-context push 3× consecutively, each ≥60 min. flyctl deploy
20:00–22:00 Engineer attempts a local Docker build as fallback; the Docker Desktop VM hangs during the apt unpack step. session log
22:00 RC2 escape: switch to GHA workflow_dispatch deploy.yml (Strategy F). gh workflow run
2026-05-11 23:30 GHA Strategy F deploy completes; image pushed; rolling restart begins. gh run view
23:35 New machine cold-boots on Wave 22 baked-seed image; entrypoint.sh §4 logs size-based skip (Wave 18 fix re-verified — RC1 stays closed). flyctl logs
23:36 schema_guard runs. RC3 fires. Log line: autonomath: required migrations missing from schema_migrations: ['049_provenance_strengthen.sql', '075_am_amendment_diff.sql', '090_law_article_body_en.sql', '115_source_manifest_view.sql', '121_jpi_programs_subsidy_rate_text_column.sql']. Entrypoint exits non-zero; Fly restart-loops. flyctl logs
23:40 External 5xx resumes. UptimeRobot fires again. Outage re-opens. UptimeRobot
2026-05-12 00:00–03:00 Re-deploy attempts on the same broken image yield identical failure shape. Root cause not yet isolated to manifest. flyctl logs
2026-05-12 03:15 RC3 isolated. flyctl ssh console → cat scripts/migrations/autonomath_boot_manifest.txt confirms an empty allowlist; the 5 required migrations cannot be applied. session log
2026-05-12 04:00 Wave 40 PR #75 drafted: authorize the 5 migrations in the manifest, with comment header documenting Wave 22 baked-seed context. session log
2026-05-12 04:30 PR #75 admin-merged to main (commit 82df31bd8). gh pr view 75
2026-05-12 04:35 GHA Strategy F re-triggered (run 25699495154); build + push begins. gh run view
2026-05-12 05:30 Deploy 25699495154 build complete; rolling restart begins. gh run view
2026-05-12 06:30 Wave 41 work begins (post-mortem v2, pre-deploy manifest verify, runbook v2, alert extension). session log
2026-05-12 07:00 RC4 surfaces. GHA workflow_run guard on verify.yml / acceptance-criteria-ci.yml chained off the deploy.yml run mis-skips the post-deploy verify; the operator sees "skipped" instead of "passed" and has to hit healthz manually to confirm 200. gh run list
2026-05-12 07:30 Wave 41 dispatches direct gh workflow run calls per workflow instead of relying on workflow_run chains; runbook v2 documents the Option F + direct dispatch pattern. session log
2026-05-12 08:00–14:00 Wave 42 + Wave 43 push 17+ subagent branches (Dim A-J, AX Resilience cells, monitoring + AMS + discoverability). PRs #97 / #100 / #102 all touch autonomath_boot_manifest.txt simultaneously. gh pr list
2026-05-12 14:00–17:00 RC5 surfaces. Merge conflicts on the boot manifest force 4 sequential rebases. Each cycle takes 20-40 min (rebase + re-CI + admin merge). The Strategy F retry pipeline could not start until main was linear. gh pr view
2026-05-12 17:00 Wave 44 worktree-isolation principle established. Each subagent now claims /tmp/jpcite-<wave>-<lane> and branches from main; PR conflicts collapse. session log
2026-05-12 18:00 Wave 44 Strategy F retry deploys clean. healthz=200 confirmed and held. UptimeRobot
2026-05-12 18:30 5-min stability hold met. End of incident. session log
2026-05-12 19:00 Wave 45 post-mortem v3 begins (this document). session log

Net customer-visible downtime: ~18 hours gross (12:18 UTC 2026-05-11 → 18:00 UTC 2026-05-12 with a 68-min recovered window at 16:52–18:00 on 2026-05-11). 5xx exposure rate during this window ranged from 100% (both Fly machines down) to ~30% (one machine healthy, rolling restart).

Five root causes

RC1 — PRAGMA integrity_check on 9.7 GB autonomath.db at boot

Status: RESOLVED by Wave 18 §4 fix (commit 81922433f, PR #35).

Documented exhaustively in v1 post-mortem (2026-05-11_integrity_check_outage.md) and v2 (2026-05-11_14h_outage_v2.md). The Wave 18 size-based skip held across all subsequent Wave 22-44 boots — every new image logged either size-based integrity_check skip or trusted stamp match rather than running the pragma.

What broke: §4 was retained as a "structural correctness probe" when §2 SHA256 was size-skipped in Wave 13. CLAUDE.md SOT explicitly noted the retention. For a 9.7 GB DB, integrity_check costs 30+ min — roughly 30× Fly's 60s health-check grace period.

How fixed: extended AUTONOMATH_DB_MIN_PRODUCTION_BYTES (≥5 GB) threshold to §4. schema_guard (metadata only, ~milliseconds) is now the single structural probe.
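
As a concrete illustration, here is a minimal sketch of what the §4 guard can look like after the fix, assuming the threshold variable keeps the name above and that GNU stat is available in the image; the real entrypoint.sh on main is the source of truth:

```bash
# Sketch only; the actual entrypoint.sh §4 logic may differ.
DB=/data/autonomath.db
MIN_BYTES="${AUTONOMATH_DB_MIN_PRODUCTION_BYTES:-5368709120}"   # 5 GB default, per the threshold above

db_size=$(stat -c%s "$DB" 2>/dev/null || echo 0)
if [ "$db_size" -ge "$MIN_BYTES" ]; then
  # Production-size DB: skip the full-scan pragma, leave structural checks to schema_guard.
  echo "entrypoint §4: size-based integrity_check skip (${db_size} bytes >= ${MIN_BYTES})"
else
  # Small (dev/seed) DB: the pragma finishes in seconds, so run it.
  sqlite3 "$DB" 'PRAGMA integrity_check;'
fi
```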

Lesson re-stated for v3: the trap was already documented in CLAUDE.md SOT and fly.toml ("fresh-volume bootstrap runs long ... brief first-deploy restart loop rather than widening this check past Fly's effective cap"), but the §4 retention bypassed the SOT warning. Treat SOT cautions as hard constraints, not optional defenses.

RC2 — Depot remote builder stalls on build-context push

Status: MITIGATED via Wave 22 (baked seed) + Wave 24 (.dockerignore audit) + Strategy F (GHA workflow_dispatch).

What broke: three consecutive flyctl deploy --remote-only attempts between 18:00 and 22:00 UTC on 2026-05-11 hung in the depot builder build-context upload phase, each for ≥60 min before manual cancellation. The local working tree was 12+ GB (autonomath.db 9.7 GB, jpintel.db 1.7 GB, plus dist/, dist.bak*/, .venv*/, etc.). .dockerignore excluded the DBs but the depot uploader may have been streaming the full pre-ignore tree.

What landed:

  • Wave 22: baked jpintel.db (352 MB) into the image so the hydrate dependency went away.
  • Wave 24: audited .dockerignore against du -sh of every top-level directory in the repo, added .venv*/, dist.bak*/, data/jpintel.db.bak-*, *.wrangler/ exclusions (a rough audit sketch follows this list).
  • Strategy F: GHA workflow_dispatch on deploy.yml runs the build inside the GHA runner (not depot), so the upload phase is irrelevant. Runbook v2 promotes this to a first-class option.
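
The Wave 24 audit loop above can be approximated in a few lines; the 1 GB threshold is arbitrary and the grep against .dockerignore is only a rough stand-in for real glob matching, so treat this as a triage helper rather than a faithful reimplementation:

```bash
# Flag top-level paths over ~1 GB that .dockerignore does not mention.
du -sm -- * .[!.]* 2>/dev/null | sort -rn | while read -r mb path; do
  if [ "$mb" -gt 1024 ] && ! grep -qF -- "$path" .dockerignore; then
    echo "NOT IGNORED: ${path} (${mb} MB) -- candidate for .dockerignore"
  fi
done
```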

Lesson re-stated for v3: "remote builder works in CI" is not the same as "remote builder works under load with a 12+ GB local tree." RC2 was a deploy infrastructure single point of failure — the fix was to give the operator a second escape path (Strategy F), not to make depot bullet-proof.

RC3 — autonomath_boot_manifest.txt empty on new volume

Status: RESOLVED by Wave 40 PR #75 (commit 82df31bd8).

What broke: entrypoint.sh defaults to AUTONOMATH_BOOT_MIGRATION_MODE=manifest, which means the entrypoint only applies boot-time migrations from scripts/migrations/ if their filenames are explicitly listed in scripts/migrations/autonomath_boot_manifest.txt. Before Wave 40, that file contained zero migration filenames (intentionally — the design forced offline review of every schema change to prevent accidental DROP on prod).

Wave 22 baked the jpintel.db seed into the image, but autonomath.db continued to live on the Fly volume. When Wave 36/37 machines provisioned new volumes (machine destroy + redeploy on Strategy C/F), the fresh autonomath.db had only the schema baked into the entrypoint bootstrap — none of the legacy migrations (049, 075, 090, 115, 121) had ever been recorded in schema_migrations.

schema_guard (scripts/schema_guard.py) defines AM_REQUIRED_MIGRATIONS as exactly those 5 filenames. On a fresh volume, the set difference is all 5 — schema_guard raises, entrypoint exits non-zero, Fly restart-loops the machine.
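
Diagnosing this by hand amounts to the same set difference schema_guard computes. A sketch, runnable from a flyctl ssh console session; the schema_migrations column name (filename) is an assumption, and scripts/schema_guard.py remains the authoritative check:

```bash
# Recompute the RC3 set difference by hand against the live volume DB.
required="049_provenance_strengthen.sql 075_am_amendment_diff.sql 090_law_article_body_en.sql \
115_source_manifest_view.sql 121_jpi_programs_subsidy_rate_text_column.sql"

applied=$(sqlite3 /data/autonomath.db "SELECT filename FROM schema_migrations;")
for m in $required; do
  printf '%s\n' "$applied" | grep -qx "$m" || echo "MISSING: $m"
done
# Any MISSING line reproduces the RC3 boot failure: schema_guard raises and the entrypoint exits non-zero.
```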

PR #75 lifts the manifest to include those 5 migrations explicitly, with a comment header tying the change to Wave 22 baked-seed context. All 5 are pure additive (CREATE IF NOT EXISTS / ALTER TABLE ADD COLUMN / CREATE VIEW IF NOT EXISTS) — boot-time safe.

Lesson re-stated for v3: this is the canonical case of "packaging-mode change re-arms latent boot invariants." The sftp-hydrate flow had a hidden invariant that "the hydrated DB already has all migrations recorded." The baked-seed flow has a different invariant: "the volume DB has only entrypoint-bootstrap schema; everything else must be applied via boot-time migrations, which require manifest authorization." Wave 22 silently flipped which invariant was in force. The pre-deploy manifest gate (Wave 41 AI-3) catches this class of drift before it reaches prod.
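
For reference, the gate's core assertion can be approximated in shell; the real implementation is scripts/ops/pre_deploy_manifest_verify.py, and the hard-coded list below simply repeats the five filenames from this document instead of importing them from schema_guard:

```bash
# Pre-deploy gate sketch: the boot manifest must be a superset of the required-migrations set.
manifest=scripts/migrations/autonomath_boot_manifest.txt
required="049_provenance_strengthen.sql
075_am_amendment_diff.sql
090_law_article_body_en.sql
115_source_manifest_view.sql
121_jpi_programs_subsidy_rate_text_column.sql"

missing=$(printf '%s\n' "$required" | grep -vxF -f <(grep -v '^#' "$manifest"))
if [ -n "$missing" ]; then
  printf 'pre-deploy gate FAIL, manifest is missing:\n%s\n' "$missing" >&2
  exit 1
fi
echo "pre-deploy gate OK: manifest covers all required migrations"
```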

RC4 — GHA workflow_run guard skips deploy verify

Status: RESOLVED by Strategy F direct-dispatch pattern (Wave 41).

What broke: several subsidiary workflows (verify.yml, acceptance-criteria-ci.yml, static-drift-and-runtime-probe.yml) were chained off deploy.yml via workflow_run triggers with a branch-name guard. When Strategy F triggered deploy.yml directly via workflow_dispatch instead of a push to main, the workflow_run evaluation cascaded into "skip" because the guard expected a push-shaped event.

The operator saw:

deploy.yml: success
verify.yml: skipped
acceptance-criteria-ci.yml: skipped

…and could not confirm the deploy was healthy from CI alone. Manual curl https://api.jpcite.com/v1/healthz was the only signal until the next push to main happened naturally.

How fixed: Wave 41 documented Strategy F as direct dispatch of every workflow that needs to run after a deploy, not chained via workflow_run. The runbook v2 §Option F covers the exact gh workflow run invocation pattern.
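
In practice the pattern is a short sequence of gh calls; the workflow filenames are those named in this post-mortem, and the sleep only papers over the lag between dispatching a run and it appearing in gh run list:

```bash
# Strategy F: dispatch deploy.yml directly, then dispatch each verify workflow explicitly
# instead of relying on workflow_run chaining.
gh workflow run deploy.yml --ref main
sleep 10   # give the dispatched run a moment to appear
gh run watch "$(gh run list --workflow=deploy.yml --limit 1 --json databaseId --jq '.[0].databaseId')"

for wf in verify.yml acceptance-criteria-ci.yml; do
  gh workflow run "$wf" --ref main
done
```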

Lesson re-stated for v3: CI guards optimized for the steady-state shape (push to main) silently misfire during emergency deploy paths. When designing a new chained workflow, add an explicit workflow_dispatch trigger AND verify the chain fires under both push and workflow_dispatch parents.

RC5 — Parallel-agent branch contention

Status: ESTABLISHED principle (Wave 44); enforced from Wave 45.

What broke: Wave 42-43 ran 18+ subagents in parallel. Each subagent landed a branch + PR per its lane (Dim A through J, AX Resilience cells 1-12, monitoring + AMS + discoverability). Several branches touched the same files concurrently:

  • scripts/migrations/autonomath_boot_manifest.txt — multiple subagents appended new migration filenames per their lane (252_law_jorei_pref, 255_enforcement_municipality, 259_court_decisions_extended, 260_vec_e5, 261_legal_chain_v2, 262_*, 263_realtime_signal, 264_personalization, 265_cross_source_agreement, 266_fdi_80country). Each append at the same EOF caused git merge to flag conflict.
  • .github/workflows/deploy.yml — Wave 41 (workflow_run guard fix) + Wave 43 (concurrency + dependabot auto-merge) modified overlapping blocks.
  • pyproject.toml / server.json — Wave 43.5 bumped 0.3.5 → 0.4.0 while other waves still referenced 0.3.5.

PRs #97 (Wave 43.1.910 enforcement_court), #100 (Wave 43.2.34 amendment_audit), and #102 (Wave 43.2.78 realtime_personalization) all showed merge conflict on the boot manifest within a 90-min window. Each rebase + re-CI + admin merge cycle took 20-40 min. The Strategy F retry pipeline could not run until main was linear, so the deploy recovery stalled even after PR #75 (Wave 40) was already merged.

How fixed: Wave 44 established the worktree-isolation principle per memory feedback_dual_cli_lane_atomic (a condensed shell sketch follows the list):

  1. Each subagent gets its own worktree at /tmp/jpcite-<wave>-<lane>.
  2. The worktree branches from main HEAD (not from another agent's branch).
  3. Lane claim is mkdir tools/offline/_inbox/<task>/<agent_id> — POSIX atomic, race-free.
  4. Output writes only into the claimed dir; never directly into scripts/migrations/autonomath_boot_manifest.txt from inside a subagent. The manifest is updated via a single integrator pass (collect from _inbox/ and apply once).
  5. Agent ledger is tools/offline/_inbox/<task>/AGENT_LEDGER.md append-only (>>), so concurrent flushes interleave but never lose writes.
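
Compressed into shell, steps 1-5 look roughly like this; the wave/lane/task/agent_id values are placeholders and the exact paths should be read from the memory entry, not from this sketch:

```bash
wave=45; lane=monitoring; task=wave45-monitoring; agent_id=agent-07   # placeholders
wt=/tmp/jpcite-${wave}-${lane}

git worktree add -b "wave${wave}-${lane}" "$wt" main      # isolated checkout, branched from main HEAD

mkdir -p "tools/offline/_inbox/${task}"
mkdir "tools/offline/_inbox/${task}/${agent_id}" \
  || { echo "lane already claimed"; exit 1; }              # mkdir without -p is the atomic claim

# ... agent writes only inside its claimed dir, never into the boot manifest directly ...

echo "$(date -u +%FT%TZ) ${agent_id} ${lane} started" \
  >> "tools/offline/_inbox/${task}/AGENT_LEDGER.md"        # append-only ledger
(cd "$wt" && git push -u origin "wave${wave}-${lane}")     # lands a non-conflicting PR branch
```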

Lesson re-stated for v3: parallel-agent speed-up is real and mandatory (memory feedback_max_parallel_subagents), but speed-up without isolation guarantees produces PR contention that compounds deploy-chain downtime. Worktree isolation is a deploy hygiene requirement, not a developer convenience.

Mitigation matrix

The five RCs map to five distinct strategy categories. The matrix below ties each RC × strategy × effect:

RC Strategy Effect Where landed
RC1 A (local flyctl deploy --remote-only) Bypass GHA, push fix from local checkout Wave 25 (16:30 UTC 2026-05-11)
RC1 B (env-injection BOOT_ENFORCE_INTEGRITY_CHECK=0) Turns the legacy integrity-check path off; only applies when the fix is already on the box runbook v2 §Option B
RC1 C (clone machine, destroy the hung one) Last resort; only if A+B fail runbook v2 §Option C
RC1 (code fix) Wave 18 §4 size-based skip permanent commit 81922433f, PR #35
RC2 F (GHA workflow_dispatch deploy.yml) Skip depot upload entirely Wave 25 + Wave 41 (operator escape)
RC2 (code fix) Wave 24 .dockerignore audit commit chain (Wave 24)
RC3 G (authorize migrations in boot manifest) Add filenames to autonomath_boot_manifest.txt + Strategy F re-deploy Wave 40 PR #75, commit 82df31bd8
RC3 (gate) Wave 41 pre_deploy_manifest_verify.py runbook v2 §Phase 2
RC3 (alert) Wave 41 watchdog extended to RC3 FAIL pattern scripts/cron/db_boot_hang_alert.py
RC4 F (direct gh workflow run dispatch) Bypass workflow_run chain entirely Wave 41 (runbook v2 §Option F)
RC5 (worktree isolation) mkdir-atomic lane claim + AGENT_LEDGER append-only Wave 44 principle, enforced from Wave 45
RC5 (memory) feedback_dual_cli_lane_atomic (2026-05-06, existing) + feedback_multi_root_cause_chain (Wave 45, new) memory store

Multi-RC chain observation: each fix above resolved exactly one RC. The deploy chain stayed broken until all five were addressed. This is the v3 lesson: don't expect a single root cause for a fully automated deploy-infrastructure failure.

Detection

External (uptime / customer-visible):

  • UptimeRobot 502/504 on /v1/healthz (3× 60s cadence). Fired multiple times across the 4-day chain — first at 12:18 UTC 2026-05-11, second at 23:40 UTC 2026-05-11, third at 14:00 UTC 2026-05-12.
  • Fly proxy log: "could not find a good candidate within 40 attempts at load balancing" — same pattern across each window.
  • Cloudflare Pages stayed up throughout (static site/, llms.txt, companion .md, OpenAPI JSON, MCP manifest) because CF Pages is decoupled from api.jpcite.com. Organic AI agent crawls hit the static surface uninterrupted.

Internal (post-detection diagnosis):

  • flyctl logs -a autonomath-api -n 500 showed three distinct signatures across the windows:
  • Window 1 (12:18-16:52 UTC 2026-05-11): running integrity_check on /data/autonomath.db with no follow-up ok line (RC1 signature).
  • Window 2 (23:40 UTC 2026-05-11 onward): autonomath: required migrations missing from schema_migrations: [...] followed by entrypoint exit (RC3 signature).
  • Window 3 (14:00 UTC 2026-05-12 onward): verify.yml: skipped in GHA + healthz manually 200 (RC4 signature — not a true outage, but customer-visible because Slack alerts mis-fired).
  • scripts/cron/db_boot_hang_alert.py (Wave 25 + Wave 41 extensions) caught Window 1 within 5 min of the first integrity_check log line and Window 2 immediately on the first RC3 FAIL line. Wave 45 extends this to all 5 RC patterns (see Alert Extension section).
  • For RC5 (parallel-agent branch contention), the signal is gh pr list --search 'is:open conflict' returning ≥2 PRs during a deploy window. Wave 45 adds this as a workflow guard.
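
A sturdier form of that guard reads the mergeable field from gh's JSON output instead of relying on a text search; a sketch, assuming a gh version with --json/--jq support:

```bash
# Pre-deploy assertion: no open PR may be in a CONFLICTING merge state during a deploy window.
conflicted=$(gh pr list --state open --json number,mergeable \
  --jq '[.[] | select(.mergeable == "CONFLICTING")] | length')
if [ "${conflicted:-0}" -ge 1 ]; then
  echo "pre-deploy gate: ${conflicted} open PR(s) with merge conflicts; resolve before the Strategy F retry" >&2
  exit 1
fi
```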

Impact

  • Customer-visible: api.jpcite.com 5xx for ~18h gross across the 4-day chain. MCP stdio + DXT bundle surfaces unaffected (they read the bundled snapshot, not the API).
  • CF Pages: healthy throughout. Bing / Perplexity / ChatGPT citation reachability uninterrupted. This is the structural resilience that "organic-only + agent-led growth" provides — the static surface answers AI agent crawls even when the API is down. Companion .md chunks for laws / enforcement / programs remained reachable.
  • Stripe billing: zero metered events during 5xx windows. No incorrect charges. Anonymous 3 req/day quota timing was correct (JST midnight reset, IP-keyed).
  • Cron: the 6 weekly / 4 daily cron jobs didn't fire during the windows; morning_briefing.py re-armed cleanly on the Wave 44 image boot.
  • Organic acquisition signal: 5/11 21:40-01:30 JST + 5/12 08:40-15:30 JST + 5/12 23:00-03:00 JST — the second window straddles Japanese morning and early-afternoon work hours; the first and third fall overnight JST. Likely some cohort impact but no telemetry to quantify. Solo zero-touch ops mean no per-customer notification was issued.
  • Trust signal during recovery: the entire incident was resolved without a single LLM API call. Solo + zero-touch + AI agent only — recovery loop ran on Claude Code Max Pro per memory feedback_no_operator_llm_api. No ¥0.5/req cost incurred against the metered floor.

Lessons learned

L1 — Multi-RC chains are the default for fully-automated deploy infra

Five independent RCs fired in one 4-day chain. Each had its own fix, each fix resolved exactly one RC. The deploy infrastructure does not have a single failure mode — it has at least 5, and likely more we haven't tripped yet. v3's multi-RC framing is the canonical post-mortem template going forward.

Memory: feedback_multi_root_cause_chain (new, Wave 45) — every deploy-chain post-mortem must explicitly enumerate RCs and map each to its mitigation, not pretend a single cause explains everything.

L2 — Packaging-mode changes re-arm latent boot invariants

(Carried forward from v2 §L1.) Wave 22 baked-seed eliminated the sftp hydrate step but did not audit what a fresh volume looks like under the new packaging. The autonomath_boot_manifest.txt empty-by-default design was correct for the legacy sftp-hydrate flow (the hydrated DB already had all migrations recorded). On a baked-seed flow with a new volume, the manifest must list every migration required by schema_guard. This invariant is now a pre-deploy gate (scripts/ops/pre_deploy_manifest_verify.py, Wave 41).

L3 — CI guards optimized for push silently misfire under workflow_dispatch

RC4 surfaced because workflow_run guards on subsidiary workflows evaluated to "skip" when the parent ran via workflow_dispatch instead of push. Emergency deploy paths must be tested under the same workflow trigger shape as the steady-state path — not just "the runner ran green once."

Action: any new workflow that chains off deploy.yml must include an explicit workflow_dispatch trigger AND a check that workflow_run guards allow both event shapes.
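
One cheap way to enforce the first half of that action is to assert that every chained workflow also declares a workflow_dispatch trigger; a sketch using gh workflow view, over the workflow filenames listed under RC4:

```bash
# Assert each post-deploy workflow can also be dispatched directly (Strategy F prerequisite).
for wf in verify.yml acceptance-criteria-ci.yml static-drift-and-runtime-probe.yml; do
  gh workflow view "$wf" --yaml | grep -q 'workflow_dispatch' \
    || echo "WARNING: $wf lacks a workflow_dispatch trigger; it cannot be direct-dispatched"
done
```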

L4 — Parallel-agent speed-up demands worktree isolation

RC5 surfaced because Wave 42-43 ran 18+ subagents in parallel without worktree isolation. Branches touched the same files (boot manifest, deploy.yml, version files) and generated merge conflicts. The worktree-isolation principle (memory feedback_dual_cli_lane_atomic) is now mandatory for any wave touching deploy / manifest / cron surfaces.

Specific anti-patterns to ban:

  • Multiple subagents appending to autonomath_boot_manifest.txt — ban; collect manifest entries in _inbox/ and apply via a single integrator pass (see the sketch after this list).
  • Multiple subagents bumping pyproject.toml / server.json version in parallel — ban; version bumps live in a separate dedicated wave.
  • Multiple subagents modifying deploy.yml — ban; deploy.yml changes serialize via a single dedicated wave.
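
A sketch of that single integrator pass follows; the per-agent file name manifest_entries.txt and the _inbox/ layout details are illustrative assumptions, not an established repo contract:

```bash
# Integrator pass: collect per-agent manifest entries, dedupe, append to the manifest once, on one branch.
manifest=scripts/migrations/autonomath_boot_manifest.txt
cat tools/offline/_inbox/*/*/manifest_entries.txt 2>/dev/null | sort -u > /tmp/manifest_new
# Append only entries the manifest does not already contain (comm needs both inputs sorted).
comm -13 <(grep -v '^#' "$manifest" | sort) /tmp/manifest_new >> "$manifest"
```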

L5 — Production-down ≠ customer-down when the static surface is decoupled

api.jpcite.com was 5xx for ~18h but jpcite.com (CF Pages static) stayed up. AI agent crawls — the primary acquisition channel under organic-only + agent-led growth — hit the static surface uninterrupted. The dual-surface architecture (API on Fly + static on CF Pages) is what makes 18h+ down survivable. Companion .md chunks (10,259 files) remained reachable; sitemap remained served.

This is structural resilience earned by the design — keep it.

L6 — SOT cautions must be treated as hard constraints

fly.toml and CLAUDE.md SOT both warned about full-scan ops on the 9.7 GB autonomath.db before RC1 fired. entrypoint.sh §4 retention bypassed the warning ("keep this for structural correctness"). The warning was correct; the retention was wrong. SOT cautions are not optional defenses — they are hard constraints. Memory feedback_no_quick_check_on_huge_sqlite carries this lesson.

Action items

ID Action Owner Status
AI-1 Land Wave 18 §4 size-based skip on main. 梅田 / Claude DONE (commit 81922433f, PR #35)
AI-2 Authorize 5 required migrations in autonomath_boot_manifest.txt. 梅田 / Claude DONE (commit 82df31bd8, PR #75)
AI-3 Add scripts/ops/pre_deploy_manifest_verify.py — pre-deploy gate that asserts manifest is a superset of schema_guard required migrations. Claude DONE (Wave 41)
AI-4 Extend scripts/cron/db_boot_hang_alert.py to detect schema_guard FAIL pattern (not just hang). Claude DONE (Wave 41)
AI-5 Expand scripts/ops/post_deploy_verify_v4.sh from 10 to 15 checks (schema_guard pass evidence + 30+ endpoint 200 sweep). Claude DONE (Wave 41)
AI-6 Update docs/runbook/incident_response_db_boot_hang.md → v2 with 4-RC strategy + Strategy F first-class. Claude DONE (Wave 41)
AI-7 Update memory feedback_no_quick_check_on_huge_sqlite with Wave 40 manifest learning. Claude DONE (Wave 41)
AI-8 New memory feedback_pre_deploy_manifest_verify documenting the boot manifest invariant. Claude DONE (Wave 41)
AI-9 Diagnose RC2 (depot builder upload stall) — instrument upload step, audit .dockerignore, reproduce in clean tree. 梅田 PARTIAL (Wave 22 baked seed + Wave 24 dockerignore audit landed; root cause not fully diagnosed)
AI-10 Audit remaining boot-time SQLite ops (quick_check / VACUUM / REINDEX / ANALYZE) on autonomath.db. 梅田 OPEN — next wave
AI-11 New v3. Author docs/runbook/incident_response_v3_multi_root_cause.md — multi-RC strategy switching flow. Claude DONE (Wave 45, this wave)
AI-12 New v3. New memory feedback_multi_root_cause_chain.md — multi-RC framing for deploy-chain post-mortems. Claude DONE (Wave 45, this wave)
AI-13 New v3. Extend scripts/cron/db_boot_hang_alert.py to detect all 5 RC patterns + GHA workflow_run skip pattern (RC4) + PR conflict open during deploy window (RC5). Claude DONE (Wave 45, this wave)
AI-14 New v3. Add gh pr list --search 'is:open conflict' as a pre-deploy assertion in the runbook v3 §Phase 1 detection chain. Claude DONE (runbook v3 §Phase 1)
AI-15 New v3. Audit every workflow chained off deploy.yml via workflow_run; replace with direct gh workflow run invocations in runbook v3 §Option F. Claude DONE (runbook v3 §Option F + §Option F-strict)

References

  • v1 post-mortem (RC1 only): docs/postmortem/2026-05-11_integrity_check_outage.md
  • v2 post-mortem (4 RCs): docs/postmortem/2026-05-11_14h_outage_v2.md
  • Wave 18 §4 fix: commit 81922433f, PR #35
  • Wave 40 manifest fix: commit 82df31bd8, PR #75
  • Wave 41 SOP work: pre_deploy_manifest_verify.py, post_deploy_verify_v4.sh, incident_response_db_boot_hang_v2.md, memory updates
  • Wave 45 (this wave): post-mortem v3, runbook v3 multi-RC, memory feedback_multi_root_cause_chain, db_boot_hang_alert.py extension to all 5 RC patterns
  • entrypoint.sh §2 / §4 (current main HEAD)
  • scripts/schema_guard.py — defines AM_REQUIRED_MIGRATIONS + JPINTEL_REQUIRED_MIGRATIONS
  • scripts/migrations/autonomath_boot_manifest.txt — boot allowlist
  • CLAUDE.md SOT: §entrypoint.sh §2/§4 SIZE-BASED + §autonomath manifest allowlist
  • Memory: feedback_no_quick_check_on_huge_sqlite, feedback_pre_deploy_manifest_verify, feedback_post_deploy_smoke_propagation, feedback_deploy_yml_4_fix_pattern, feedback_dual_cli_lane_atomic, feedback_multi_root_cause_chain (new, Wave 45)

Last reviewed: 2026-05-12 (Wave 45). Solo zero-touch ops — no team rotation, no PagerDuty. Detection chain: UptimeRobot → Telegram bot → operator phone. Recovery loop ran entirely on Claude Code Max Pro (no LLM API calls), per memory feedback_no_operator_llm_api.