2026-05-11 → 2026-05-12 jpcite-api 18h+ Outage Post-Mortem v3¶
api.jpcite.com served 5xx across a 4-day deploy chain stretching
from Wave 22 (2026-05-10) through Wave 44 (2026-05-12), totaling
~18 hours of customer-visible 5xx with a 68-min recovered window
in the middle and a long tail of partial-degradation while the
recovery chain wound through five independent root causes.
v1 documented a single RC (PRAGMA integrity_check on 9.7 GB
autonomath.db). v2 lifted to four RCs after Wave 22's baked-seed
packaging-mode shift re-armed the schema_guard manifest invariant.
v3 closes the loop and adds the fifth RC — parallel agent
branch contention — that surfaced as Wave 43-44 multi-PR churn turned
into merge conflicts (#97 / #100 / #102) and stalled the Wave 44 retry
of Strategy F.
| # | Root cause | Mitigation strategy | Status |
|---|---|---|---|
| RC1 | entrypoint.sh §4 PRAGMA integrity_check on 9.7 GB autonomath.db hangs 30+ min, exceeds Fly 60s health-grace. |
Wave 18 §4 size-based skip (commit 81922433f, PR #35). |
RESOLVED |
| RC2 | Depot remote builder stalls on build-context push for the 12+ GB local working tree. Three flyctl deploy --remote-only attempts each hung ≥60 min. |
Wave 22 baked seed eliminated the sftp hydrate dependency, and Wave 24 .dockerignore audit shrunk the upload tree. Strategy F (GHA workflow_dispatch) bypasses depot entirely. |
MITIGATED (workaround established) |
| RC3 | autonomath_boot_manifest.txt ships empty on Wave 22 baked-seed image; on a fresh /data/autonomath.db volume the 5 required migrations (049/075/090/115/121) are never applied; schema_guard FAILs at boot, machine restart-loops. |
Wave 40 PR #75 (commit 82df31bd8) authorizes the 5 migrations in the manifest. Wave 41 added scripts/ops/pre_deploy_manifest_verify.py as a pre-deploy gate. |
RESOLVED |
| RC4 | GHA workflow_run guard on subsidiary workflows mis-skipped deploys triggered by a workflow_dispatch parent, so Strategy F triggers landed but downstream verify steps short-circuited "skip" instead of running. |
Wave 41 Strategy F established: gh workflow run deploy.yml --ref main direct dispatch bypasses the workflow_run guard chain. Documented in runbook v2 §Option F. |
RESOLVED |
| RC5 | Parallel-agent branch contention. Wave 43-44 ran 18+ subagents in parallel; multiple branches touched the same manifest / cron workflow files and produced merge conflicts on PRs #97 / #100 / #102 right when the Strategy F retry needed main to be linear. Each conflict added a 20-40 min cycle of rebase + re-CI + admin merge. |
Wave 44 established the worktree-isolation principle per memory feedback_dual_cli_lane_atomic: each agent claims a /tmp/jpcite-<wave>-<lane> worktree, branches from main, and git push lands a non-conflicting PR. mkdir-atomic claim + AGENT_LEDGER append-only. |
ESTABLISHED (Wave 45 enforces) |
This post-mortem documents the full 4-day timeline, the five root causes, the mitigation matrix that maps each RC to the strategy that resolved it (A/B/C/F/G + worktree), and the multi-RC chain lesson: completely-automated deploy infrastructure failure is almost never a single root cause — five fired in this chain, each one masking the next.
TL;DR¶
- The Wave 22 packaging-mode change (bake
jpintel.dbinto the image) removed the sftp hydrate dependency but did not audit what a fresh/datavolume looks like under the new packaging. That silently armed RC3 (manifest empty → schema_guard FAIL on cold boot). - The Wave 36-40 cold-boot path tripped RC3 deterministically. PR #75 authorized the 5 missing migrations and unblocked schema_guard.
- The Wave 41 Strategy F triggered, but RC4 (workflow_run guard
skipping downstream verify) caused the deploy to land with
smoke-test "skip" instead of green, so the operator could not
confirm health. Strategy F was re-documented to use direct
gh workflow rundispatch. - Wave 42-43 layered 17+ subagents adding features (Dim A-J, AX
Resilience cells, monitoring + AMS + discoverability). Several
branches touched
autonomath_boot_manifest.txt,deploy.yml, and the cron workflow YAMLs simultaneously, generating merge conflicts on PRs #97 / #100 / #102 that each took 20-40 min to rebase. The Strategy F retry pipeline could not run untilmainwas linear. - Wave 44 retried Strategy F after the conflict storm cleared and
finally restored
healthz=200. Wave 45 captures the lesson: worktree-isolation is mandatory for any wave touching deploy / manifest / cron surfaces.
Total customer-visible downtime: ~1080 minutes (~18 hours) gross, with a 68-min recovered window in the middle of Wave 25-40 and a long tail of partial-degradation while Wave 41-44 wound through RC4 + RC5.
Timeline (UTC)¶
| Time (UTC) | Event | Source |
|---|---|---|
| 2026-05-10 (Wave 22) | Baked-seed Dockerfile lands. jpintel.db (352 MB) now baked into image; sftp hydrate is now a recovery-only path. |
session log |
| 2026-05-11 11:40 | Machine 85e273f4 cold-boot on pre-Wave-18 image — entrypoint.sh §4 runs PRAGMA integrity_check on /data/autonomath.db. |
flyctl logs |
| 11:40–12:18 | RC1 fires. 9.7 GB integrity_check walk hangs; no progress output. | volume IO profile |
| 12:18 | First external 5xx detected; Fly proxy returns "could not find a good candidate". | UptimeRobot |
| 13:00–16:00 | Wave 25 recovery — push the Wave 18 §4 size-skip fix via GHA deploy.yml. Multiple sub-step failures (Checkout / hydrate sftp / Deploy / smoke). |
gh run list |
| 16:30 | Local flyctl deploy --remote-only via depot builder lands new image; Wave 18 §4 fix activates. |
flyctl deploy |
| 16:48 | First machine on new image logs size-based integrity_check skip. |
flyctl logs |
| 16:52 | healthz=200 restored. End of first outage window. Wave 18 fix on main (commit 81922433f, PR #35). |
UptimeRobot |
| 17:00–18:00 | 5-min stability hold met; v1 post-mortem authored. | session log |
| 18:00 | Wave 22 baked-seed deploy attempt 1. Goal: bake jpintel.db into image to remove sftp dependency. |
session log |
| 18:00–20:00 | RC2 fires. Depot builder hangs on build-context push 3× consecutively, each ≥60 min. | flyctl deploy |
| 20:00–22:00 | Engineer attempts local Docker build as fallback; Docker Desktop VM hangs in apt unpack step. | session log |
| 22:00 | RC2 escape: switch to GHA workflow_dispatch deploy.yml (Strategy F). |
gh workflow run |
| 2026-05-11 23:30 | GHA Strategy F deploy completes; image pushed; rolling restart begins. | gh run view |
| 23:35 | New machine cold-boots on Wave 22 baked-seed image; entrypoint.sh §4 logs size-based skip (Wave 18 fix re-verified — RC1 stays closed). |
flyctl logs |
| 23:36 | schema_guard runs. RC3 fires. Log line: autonomath: required migrations missing from schema_migrations: ['049_provenance_strengthen.sql', '075_am_amendment_diff.sql', '090_law_article_body_en.sql', '115_source_manifest_view.sql', '121_jpi_programs_subsidy_rate_text_column.sql']. Entrypoint exits non-zero; Fly restart-loops. |
flyctl logs |
| 23:40 | External 5xx resumes. UptimeRobot fires again. Outage re-opens. | UptimeRobot |
| 2026-05-12 00:00–03:00 | Re-deploy attempts on the same broken image yield identical failure shape. Root cause not yet isolated to manifest. | flyctl logs |
| 2026-05-12 03:15 | RC3 isolated. flyctl ssh console → cat scripts/migrations/autonomath_boot_manifest.txt confirms empty allowlist; the 5 required migrations cannot be applied. |
session log |
| 2026-05-12 04:00 | Wave 40 PR #75 drafted: authorize the 5 migrations in the manifest, with comment header documenting Wave 22 baked-seed context. | session log |
| 2026-05-12 04:30 | PR #75 admin-merged to main (commit 82df31bd8). |
gh pr view 75 |
| 2026-05-12 04:35 | GHA Strategy F re-triggered (run 25699495154); build + push begins. | gh run view |
| 2026-05-12 05:30 | Deploy 25699495154 build complete; rolling restart begins. | gh run view |
| 2026-05-12 06:30 | Wave 41 work begins (post-mortem v2, pre-deploy manifest verify, runbook v2, alert extension). | session log |
| 2026-05-12 07:00 | RC4 surface. GHA workflow_run guard on verify.yml / acceptance-criteria-ci.yml chained off the deploy.yml run mis-skips the post-deploy verify; operator sees "skipped" instead of "passed", reaches healthz manually to confirm 200. |
gh run list |
| 2026-05-12 07:30 | Wave 41 dispatches direct gh workflow run calls per workflow instead of relying on workflow_run chains; runbook v2 documents the Option F + direct dispatch pattern. |
session log |
| 2026-05-12 08:00–14:00 | Wave 42 + Wave 43 push 17+ subagent branches (Dim A-J, AX Resilience cells, monitoring + AMS + discoverability). PRs #97 / #100 / #102 all touch autonomath_boot_manifest.txt simultaneously. |
gh pr list |
| 2026-05-12 14:00–17:00 | RC5 surface. Merge conflicts on the boot manifest force 4 sequential rebases. Each cycle takes 20-40 min (rebase + re-CI + admin merge). The Strategy F retry pipeline could not start until main was linear. |
gh pr view |
| 2026-05-12 17:00 | Wave 44 worktree-isolation principle established. Each subagent now claims /tmp/jpcite-<wave>-<lane> and branches from main; PR conflicts collapse. |
session log |
| 2026-05-12 18:00 | Wave 44 Strategy F retry deploys clean. healthz=200 confirmed and held. |
UptimeRobot |
| 2026-05-12 18:30 | 5-min stability hold met. End of incident. | session log |
| 2026-05-12 19:00 | Wave 45 post-mortem v3 begins (this document). | session log |
Net customer-visible downtime: ~18 hours gross (12:18 UTC 2026-05-11 → 18:00 UTC 2026-05-12 with a 68-min recovered window at 16:52–18:00 on 2026-05-11). 5xx exposure rate during this window ranged from 100% (both Fly machines down) to ~30% (one machine healthy, rolling restart).
Five root causes¶
RC1 — PRAGMA integrity_check on 9.7 GB autonomath.db at boot¶
Status: RESOLVED by Wave 18 §4 fix (commit 81922433f, PR #35).
Documented exhaustively in v1 post-mortem (2026-05-11_integrity_check_outage.md)
and v2 (2026-05-11_14h_outage_v2.md). The Wave 18 size-based skip
held across all subsequent Wave 22-44 boots — every new image logged
either size-based integrity_check skip or trusted stamp match
rather than running the pragma.
What broke: §4 was retained as a "structural correctness probe" when §2 SHA256 was size-skipped in Wave 13. CLAUDE.md SOT explicitly noted the retention. For a 9.7 GB DB, integrity_check costs 30+ min — orders of magnitude over the Fly 60s health-grace.
How fixed: extended AUTONOMATH_DB_MIN_PRODUCTION_BYTES (≥5 GB)
threshold to §4. schema_guard (metadata only, ~milliseconds) is now
the single structural probe.
Lesson re-stated for v3: the trap was already documented in
CLAUDE.md SOT and fly.toml ("fresh-volume bootstrap runs long ...
brief first-deploy restart loop rather than widening this check past
Fly's effective cap"), but the §4 retention bypassed the SOT warning.
Treat SOT cautions as hard constraints, not optional defenses.
RC2 — Depot remote builder stalls on build-context push¶
Status: MITIGATED via Wave 22 (baked seed) + Wave 24
(.dockerignore audit) + Strategy F (GHA workflow_dispatch).
What broke: three consecutive flyctl deploy --remote-only attempts
between 18:00 and 22:00 UTC on 2026-05-11 hung in the depot builder
build-context upload phase, each for ≥60 min before manual
cancellation. The local working tree was 12+ GB (autonomath.db
9.7 GB, jpintel.db 1.7 GB, plus dist/, dist.bak*/, .venv*/,
etc.). .dockerignore excluded the DBs but the depot uploader may
have been streaming the full pre-ignore tree.
What landed:
- Wave 22: baked
jpintel.db(352 MB) into the image so the hydrate dependency went away. - Wave 24: audited
.dockerignoreagainstdu -shof every top-level directory in the repo, added.venv*/,dist.bak*/,data/jpintel.db.bak-*,*.wrangler/exclusions. - Strategy F: GHA
workflow_dispatchondeploy.ymlruns the build inside the GHA runner (not depot), so the upload phase is irrelevant. Runbook v2 promotes this to a first-class option.
Lesson re-stated for v3: "remote builder works in CI" is not the same as "remote builder works under load with a 12+ GB local tree." RC2 was a deploy infrastructure single point of failure — the fix was to give the operator a second escape path (Strategy F), not to make depot bullet-proof.
RC3 — autonomath_boot_manifest.txt empty on new volume¶
Status: RESOLVED by Wave 40 PR #75 (commit 82df31bd8).
What broke: entrypoint.sh defaults to
AUTONOMATH_BOOT_MIGRATION_MODE=manifest, which means the entrypoint
only applies boot-time migrations from scripts/migrations/ if their
filenames are explicitly listed in
scripts/migrations/autonomath_boot_manifest.txt. Before Wave 40,
that file contained zero migration filenames (intentionally — the
design forced offline review of every schema change to prevent
accidental DROP on prod).
Wave 22 baked the jpintel.db seed into the image, but
autonomath.db continued to live on the Fly volume. When Wave 36/37
machines provisioned new volumes (machine destroy + redeploy on
Strategy C/F), the fresh autonomath.db had only the schema baked
into the entrypoint bootstrap — none of the legacy migrations
(049, 075, 090, 115, 121) had ever been recorded in schema_migrations.
schema_guard (scripts/schema_guard.py) defines
AM_REQUIRED_MIGRATIONS as exactly those 5 filenames. On a fresh
volume, the set difference is all 5 — schema_guard raises,
entrypoint exits non-zero, Fly restart-loops the machine.
PR #75 lifts the manifest to include those 5 migrations explicitly, with a comment header tying the change to Wave 22 baked-seed context. All 5 are pure additive (CREATE IF NOT EXISTS / ALTER TABLE ADD COLUMN / CREATE VIEW IF NOT EXISTS) — boot-time safe.
Lesson re-stated for v3: this is the canonical case of "packaging-mode change re-arms latent boot invariants." The sftp-hydrate flow had a hidden invariant that "the hydrated DB already has all migrations recorded." The baked-seed flow has a different invariant: "the volume DB has only entrypoint-bootstrap schema; everything else must be applied via boot-time migrations, which require manifest authorization." Wave 22 silently flipped which invariant was in force. The pre-deploy manifest gate (Wave 41 AI-3) catches this class of drift before it reaches prod.
RC4 — GHA workflow_run guard skips deploy verify¶
Status: RESOLVED by Strategy F direct-dispatch pattern (Wave 41).
What broke: several subsidiary workflows (verify.yml,
acceptance-criteria-ci.yml, static-drift-and-runtime-probe.yml)
were chained off deploy.yml via workflow_run triggers with a
branch-name guard. When Strategy F triggered deploy.yml directly
via workflow_dispatch instead of a push to main, the
workflow_run evaluation cascaded into "skip" because the guard
expected a push-shaped event.
The operator saw:
…and could not confirm the deploy was healthy from CI alone. Manual
curl https://api.jpcite.com/v1/healthz was the only signal until
the next push to main happened naturally.
How fixed: Wave 41 documented Strategy F as direct dispatch of
every workflow that needs to run after a deploy, not chained via
workflow_run. The runbook v2 §Option F covers the exact gh
workflow run invocation pattern.
Lesson re-stated for v3: CI guards optimized for the steady-state
shape (push to main) silently misfire during emergency deploy
paths. When designing a new chained workflow, add an explicit
workflow_dispatch trigger AND verify the chain fires under both
push and workflow_dispatch parents.
RC5 — Parallel-agent branch contention¶
Status: ESTABLISHED principle (Wave 44); enforced from Wave 45.
What broke: Wave 42-43 ran 18+ subagents in parallel. Each subagent landed a branch + PR per its lane (Dim A through J, AX Resilience cells 1-12, monitoring + AMS + discoverability). Several branches touched the same files concurrently:
scripts/migrations/autonomath_boot_manifest.txt— multiple subagents appended new migration filenames per their lane (252_law_jorei_pref,255_enforcement_municipality,259_court_decisions_extended,260_vec_e5,261_legal_chain_v2,262_*,263_realtime_signal,264_personalization,265_cross_source_agreement,266_fdi_80country). Each append at the same EOF caused git merge to flag conflict..github/workflows/deploy.yml— Wave 41 (workflow_run guard fix) + Wave 43 (concurrency + dependabot auto-merge) modified overlapping blocks.pyproject.toml/server.json— Wave 43.5 bumped 0.3.5 → 0.4.0 while other waves still referenced 0.3.5.
PRs #97 (Wave 43.1.910 enforcement_court), #100 (Wave 43.2.34
amendment_audit), and #102 (Wave 43.2.78 realtime_personalization) all
showed merge conflict on the boot manifest within a 90-min window.
Each rebase + re-CI + admin merge cycle took 20-40 min. The Strategy
F retry pipeline could not run until main was linear, so the deploy
recovery stalled even after PR #75 (Wave 40) was already merged.
How fixed: Wave 44 established the worktree-isolation principle
per memory feedback_dual_cli_lane_atomic:
- Each subagent gets its own worktree at
/tmp/jpcite-<wave>-<lane>. - The worktree branches from
mainHEAD (not from another agent's branch). - Lane claim is
mkdir tools/offline/_inbox/<task>/<agent_id>— POSIX atomic, race-free. - Output writes only into the claimed dir; never directly into
scripts/migrations/autonomath_boot_manifest.txtfrom inside a subagent. The manifest is updated via a single integrator pass (collect from_inbox/and apply once). - Agent ledger is
tools/offline/_inbox/<task>/AGENT_LEDGER.mdappend-only (>>), so concurrent flushes interleave but never lose writes.
Lesson re-stated for v3: parallel-agent speed-up is real and
mandatory (memory feedback_max_parallel_subagents), but speed-up
without isolation guarantees produces PR contention that compounds
deploy-chain downtime. Worktree isolation is a deploy hygiene
requirement, not a developer convenience.
Mitigation matrix¶
The five RCs map to five distinct strategy categories. The matrix below ties each RC × strategy × effect:
| RC | Strategy | Effect | Where landed |
|---|---|---|---|
| RC1 | A (local flyctl deploy --remote-only) |
Bypass GHA, push fix from local checkout | Wave 25 (16:30 UTC 2026-05-11) |
| RC1 | B (env-injection BOOT_ENFORCE_INTEGRITY_CHECK=0) |
Re-arm legacy path off; only when fix already on box | runbook v2 §Option B |
| RC1 | C (machine clone + destroy hung) | Last resort; only if A+B fail | runbook v2 §Option C |
| RC1 | (code fix) | Wave 18 §4 size-based skip permanent | commit 81922433f, PR #35 |
| RC2 | F (GHA workflow_dispatch deploy.yml) | Skip depot upload entirely | Wave 25 + Wave 41 (operator escape) |
| RC2 | (code fix) | Wave 24 .dockerignore audit |
commit chain (Wave 24) |
| RC3 | G (authorize migrations in boot manifest) | Add filenames to autonomath_boot_manifest.txt + Strategy F re-deploy |
Wave 40 PR #75, commit 82df31bd8 |
| RC3 | (gate) | Wave 41 pre_deploy_manifest_verify.py |
runbook v2 §Phase 2 |
| RC3 | (alert) | Wave 41 watchdog extended to RC3 FAIL pattern | scripts/cron/db_boot_hang_alert.py |
| RC4 | F (direct gh workflow run dispatch) |
Bypass workflow_run chain entirely |
Wave 41 (runbook v2 §Option F) |
| RC5 | (worktree isolation) | mkdir-atomic lane claim + AGENT_LEDGER append-only | Wave 44 principle, enforced from Wave 45 |
| RC5 | (memory) | feedback_dual_cli_lane_atomic (2026-05-06, existing) + feedback_multi_root_cause_chain (Wave 45, new) |
memory store |
Multi-RC chain observation: each fix above resolved exactly one RC. The deploy chain stayed broken until all five were addressed. This is the v3 lesson: don't expect a single root cause for a completely-automated deploy infrastructure failure.
Detection¶
External (uptime / customer-visible):
- UptimeRobot 502/504 on
/v1/healthz(3× 60s cadence). Fired multiple times across the 4-day chain — first at 12:18 UTC 2026-05-11, second at 23:40 UTC 2026-05-11, third at 14:00 UTC 2026-05-12. - Fly proxy log: "could not find a good candidate within 40 attempts at load balancing" — same pattern across each window.
- Cloudflare Pages stayed up throughout (static
site/,llms.txt, companion.md, OpenAPI JSON, MCP manifest) because CF Pages is decoupled fromapi.jpcite.com. Organic AI agent crawls hit the static surface uninterrupted.
Internal (post-detection diagnosis):
flyctl logs -a autonomath-api -n 500showed three distinct signatures across the windows:- Window 1 (12:18-16:52 UTC 2026-05-11):
running integrity_check on /data/autonomath.dbwith no follow-upokline (RC1 signature). - Window 2 (23:40 UTC 2026-05-11 onward):
autonomath: required migrations missing from schema_migrations: [...]followed by entrypoint exit (RC3 signature). - Window 3 (14:00 UTC 2026-05-12 onward):
verify.yml: skippedin GHA + healthz manually 200 (RC4 signature — not a true outage, but customer-visible because Slack alerts mis-fired). scripts/cron/db_boot_hang_alert.py(Wave 25 + Wave 41 extensions) caught Window 1 within 5 min of the first integrity_check log line and Window 2 immediately on the first RC3 FAIL line. Wave 45 extends this to all 5 RC patterns (see Alert Extension section).- For RC5 (parallel-agent branch contention), the signal is
gh pr list --search 'is:open conflict'returning ≥2 PRs during a deploy window. Wave 45 adds this as a workflow guard.
Impact¶
- Customer-visible:
api.jpcite.com5xx for ~18h gross across the 4-day chain. MCP stdio + DXT bundle surfaces unaffected (they read the bundled snapshot, not the API). - CF Pages: healthy throughout. Bing / Perplexity / ChatGPT
citation reachability uninterrupted. This is the structural
resilience that "organic-only + agent-led growth" provides —
the static surface answers AI agent crawls even when the API is
down. Companion
.mdchunks for laws / enforcement / programs remained reachable. - Stripe billing: zero metered events during 5xx windows. No incorrect charges. Anonymous 3 req/day quota timing was correct (JST midnight reset, IP-keyed).
- Cron: 6 weekly / 4 daily cron didn't fire during the windows;
morning_briefing.pyre-armed cleanly on Wave 44 image boot. - Organic acquisition signal: 5/11 21:40-01:30 JST + 5/12 08:40-15:30 JST + 5/12 23:00-03:00 JST — the second and third windows straddle Japanese morning + early-afternoon work hours. Likely some cohort impact but no telemetry to quantify. Solo zero-touch ops mean no per-customer notification was issued.
- Trust signal during recovery: the entire incident was
resolved without a single LLM API call. Solo + zero-touch + AI
agent only — recovery loop ran on Claude Code Max Pro per memory
feedback_no_operator_llm_api. No ¥0.5/req cost incurred against the metered floor.
Lessons learned¶
L1 — Multi-RC chains are the default for fully-automated deploy infra¶
Five independent RCs fired in one 4-day chain. Each had its own fix, each fix resolved exactly one RC. The deploy infrastructure does not have a single failure mode — it has at least 5, and likely more we haven't tripped yet. v3's multi-RC framing is the canonical post-mortem template going forward.
Memory: feedback_multi_root_cause_chain (new, Wave 45) — every
deploy-chain post-mortem must explicitly enumerate RCs and map each
to its mitigation, not pretend a single cause explains everything.
L2 — Packaging-mode changes re-arm latent boot invariants¶
(Carried forward from v2 §L1.) Wave 22 baked-seed eliminated the
sftp hydrate step but did not audit what a fresh volume looks like
under the new packaging. The autonomath_boot_manifest.txt
empty-by-default design was correct for the legacy sftp-hydrate flow
(the hydrated DB already had all migrations recorded). On a
baked-seed flow with a new volume, the manifest must list every
migration required by schema_guard. This invariant is now a
pre-deploy gate (scripts/ops/pre_deploy_manifest_verify.py,
Wave 41).
L3 — CI guards optimized for push silently misfire under workflow_dispatch¶
RC4 surfaced because workflow_run guards on subsidiary workflows
evaluated to "skip" when the parent ran via workflow_dispatch
instead of push. Emergency deploy paths must be tested under the
same workflow trigger shape as the steady-state path — not just
"the runner ran green once."
Action: any new workflow that chains off deploy.yml must include
an explicit workflow_dispatch trigger AND a check that workflow_run
guards allow both event shapes.
L4 — Parallel-agent speed-up demands worktree isolation¶
RC5 surfaced because Wave 42-43 ran 18+ subagents in parallel without
worktree isolation. Branches touched the same files (boot manifest,
deploy.yml, version files) and generated merge conflicts. The
worktree-isolation principle (memory feedback_dual_cli_lane_atomic)
is now mandatory for any wave touching deploy / manifest / cron
surfaces.
Specific anti-patterns to ban:
- Multiple subagents appending to
autonomath_boot_manifest.txt— ban; collect manifest entries in_inbox/and apply via a single integrator pass. - Multiple subagents bumping
pyproject.toml/server.jsonversion in parallel — ban; version bumps live in a separate dedicated wave. - Multiple subagents modifying
deploy.yml— ban; deploy.yml changes serialize via a single dedicated wave.
L5 — production-down ≠ customer-down when static surface is decoupled¶
api.jpcite.com was 5xx for ~18h but jpcite.com (CF Pages static)
stayed up. AI agent crawls — the primary acquisition channel under
organic-only + agent-led growth — hit the static surface
uninterrupted. The dual-surface architecture (API on Fly + static
on CF Pages) is what makes 18h+ down survivable. Companion .md
chunks (10,259 files) remained reachable; sitemap remained served.
This is structural resilience earned by the design — keep it.
L6 — SOT cautions must be treated as hard constraints¶
fly.toml and CLAUDE.md SOT both warned about full-scan ops on the
9.7 GB autonomath.db before RC1 fired. entrypoint.sh §4 retention
bypassed the warning ("keep this for structural correctness"). The
warning was correct; the retention was wrong. SOT cautions are
not optional defenses — they are hard constraints. Memory
feedback_no_quick_check_on_huge_sqlite carries this lesson.
Action items¶
| ID | Action | Owner | Status |
|---|---|---|---|
| AI-1 | Land Wave 18 §4 size-based skip on main. |
梅田 / Claude | DONE (commit 81922433f, PR #35) |
| AI-2 | Authorize 5 required migrations in autonomath_boot_manifest.txt. |
梅田 / Claude | DONE (commit 82df31bd8, PR #75) |
| AI-3 | Add scripts/ops/pre_deploy_manifest_verify.py — pre-deploy gate that asserts manifest is a superset of schema_guard required migrations. |
Claude | DONE (Wave 41) |
| AI-4 | Extend scripts/cron/db_boot_hang_alert.py to detect schema_guard FAIL pattern (not just hang). |
Claude | DONE (Wave 41) |
| AI-5 | Expand scripts/ops/post_deploy_verify_v4.sh from 10 to 15 checks (schema_guard pass evidence + 30+ endpoint 200 sweep). |
Claude | DONE (Wave 41) |
| AI-6 | Update docs/runbook/incident_response_db_boot_hang.md → v2 with 4-RC strategy + Strategy F first-class. |
Claude | DONE (Wave 41) |
| AI-7 | Update memory feedback_no_quick_check_on_huge_sqlite with Wave 40 manifest learning. |
Claude | DONE (Wave 41) |
| AI-8 | New memory feedback_pre_deploy_manifest_verify documenting the boot manifest invariant. |
Claude | DONE (Wave 41) |
| AI-9 | Diagnose RC2 (depot builder upload stall) — instrument upload step, audit .dockerignore, reproduce in clean tree. |
梅田 | PARTIAL (Wave 22 baked seed + Wave 24 dockerignore audit landed; root not fully diagnosed) |
| AI-10 | Audit remaining boot-time SQLite ops (quick_check / VACUUM / REINDEX / ANALYZE) on autonomath.db. |
梅田 | OPEN — next wave |
| AI-11 | New v3. Author docs/runbook/incident_response_v3_multi_root_cause.md — multi-RC strategy switching flow. |
Claude | DONE (Wave 45, this wave) |
| AI-12 | New v3. New memory feedback_multi_root_cause_chain.md — multi-RC framing for deploy-chain post-mortems. |
Claude | DONE (Wave 45, this wave) |
| AI-13 | New v3. Extend scripts/cron/db_boot_hang_alert.py to detect all 5 RC patterns + GHA workflow_run skip pattern (RC4) + PR conflict open during deploy window (RC5). |
Claude | DONE (Wave 45, this wave) |
| AI-14 | New v3. Add gh pr list --search 'is:open conflict' as a pre-deploy assertion in the runbook v3 §Phase 1 detection chain. |
Claude | DONE (runbook v3 §Phase 1) |
| AI-15 | New v3. Audit every workflow chained off deploy.yml via workflow_run; replace with direct gh workflow run invocations in runbook v3 §Option F. |
Claude | DONE (runbook v3 §Option F + §Option F-strict) |
References¶
- v1 post-mortem (RC1 only):
docs/postmortem/2026-05-11_integrity_check_outage.md - v2 post-mortem (4 RCs):
docs/postmortem/2026-05-11_14h_outage_v2.md - Wave 18 §4 fix: commit
81922433f, PR #35 - Wave 40 manifest fix: commit
82df31bd8, PR #75 - Wave 41 SOP work: pre_deploy_manifest_verify.py, post_deploy_verify_v4.sh, incident_response_db_boot_hang_v2.md, memory updates
- Wave 45 (this wave): post-mortem v3, runbook v3 multi-RC, memory
feedback_multi_root_cause_chain, db_boot_hang_alert.py extension to all 5 RC patterns entrypoint.sh§2 / §4 (currentmainHEAD)scripts/schema_guard.py— definesAM_REQUIRED_MIGRATIONS+JPINTEL_REQUIRED_MIGRATIONSscripts/migrations/autonomath_boot_manifest.txt— boot allowlist- CLAUDE.md SOT: §entrypoint.sh §2/§4 SIZE-BASED + §autonomath manifest allowlist
- Memory:
feedback_no_quick_check_on_huge_sqlite,feedback_pre_deploy_manifest_verify,feedback_post_deploy_smoke_propagation,feedback_deploy_yml_4_fix_pattern,feedback_dual_cli_lane_atomic,feedback_multi_root_cause_chain(new, Wave 45)
Last reviewed: 2026-05-12 (Wave 45). Solo zero-touch ops — no team
rotation, no PagerDuty. Detection chain: UptimeRobot → Telegram bot →
operator phone. Recovery loop ran entirely on Claude Code Max Pro
(no LLM API calls), per memory feedback_no_operator_llm_api.