Chaos engineering runbook (Wave 18 E3)¶

The jpcite API is exercised weekly by a Toxiproxy-driven fault-injection suite (tests/chaos/) that validates resilience to latency, connection resets, TCP timeouts, and bandwidth caps. This runbook covers the failure modes the suite exercises, the expected recovery behavior, and the on-call response when the weekly score drops below target.

Cron: .github/workflows/chaos-weekly.yml — Sat 04:00 UTC (Sat 13:00 JST). Target resilience score: ≥ 4.5 / 5.

Failure modes exercised¶

Scenario	Toxic	Expected behavior
500 ms upstream latency	`latency=500`	`/healthz` succeeds in ≤ 8 s; `x-request-id` round-trips intact
1 s upstream latency	`latency=1000`	`/healthz` succeeds in ≤ 12 s
3 s upstream latency	`latency=3000`	`/healthz` succeeds in ≤ 30 s
1 KB/s bandwidth cap	`bandwidth=1`	JSON body fully delivered; no truncation
TCP RST mid-flight	`reset_peer=100`	Client raises `httpx.HTTPError` within 5 s — no hang
Held connection 5 s	`timeout=5000`	Client-side 1 s read-timeout fires within ~2 s
32-byte data cap	`limit_data=32`	Client surfaces error OR receives truncated body
Toxic removal	(clear all)	Baseline returns; `/healthz` < 2 s

Recovery patterns¶

Pattern A — latency injection¶

The API uses uvicorn's default keep-alive + workers; slow upstream adds wall-clock cost but does NOT exhaust the worker pool unless the toxic exceeds the worker count × per-request budget. Recovery is automatic on toxic removal.

On-call action: none. If the latency scenario fails, root-cause is either (a) the API process has hung, (b) Toxiproxy never wired the toxic, or (c) the upstream port (CHAOS_UPSTREAM=127.0.0.1:8080) is not where the test API is listening. Re-check the workflow logs for "Start API under test" passing the curl probe.

Pattern B — connection reset¶

httpx surfaces a RemoteProtocolError / ConnectError / ReadError on RST. Application code that catches httpx.HTTPError (the parent class) handles all three; code that catches only httpx.HTTPStatusError is wrong and will let RSTs escape as 500s.

On-call action: if test_reset_peer_raises_connection_error fails, audit the call site for narrow exception types. The chaos suite is the canary for too-narrow exception handlers.

Pattern C — bandwidth cap¶

A 1 KB/s pipe should not crash the client. If test_bandwidth_cap fails, the JSON body is being parsed incrementally and the parser is unhappy with the stalled stream — check httpx version and any custom content-encoding middleware.

Pattern D — request-id propagation¶

_RequestContextMiddleware echoes the client-supplied x-request-id back on the response. If test_request_id_propagation_under_latency fails under load, the middleware order has been disturbed (the context middleware must run inside SecurityHeadersMiddleware so the response headers are set before the security middleware seals them).

Pattern E — bandwidth + RST + recovery interplay¶

Toxiproxy clears all toxics on session teardown via toxiproxy_client.reset(). If a test leaves a toxic behind (because the proxy fixture crashed mid-teardown), subsequent tests will see spurious latency. The test_recovery_after_toxic_clear guard catches this — its failure means the proxy teardown is leaking state.

Triage when the weekly score drops¶

Pull chaos-results.xml from the workflow run's artifacts.
Identify which scenario(s) failed — each test is a separate <testcase> element.
Match against the failure-mode table above; pick the corresponding pattern.

Reproduce locally:

docker run -d --name toxiproxy -p 8474:8474 -p 18001:18001 \
    ghcr.io/shopify/toxiproxy:2.9.0
.venv/bin/uvicorn jpintel_mcp.api.main:app --port 8080 &
pytest tests/chaos/ -v -k <scenario>

Fix or file an issue. Aim to restore the weekly score to 5/5 before the next Saturday run.

Skipping the suite¶

The suite is opt-in. Running pytest on a developer machine without Toxiproxy hits the _toxiproxy_available check in conftest.py and returns "skipped" for every test in the package. This is by design — mandating Toxiproxy for every local pytest run would slow the inner dev loop without buying real coverage (the production cron catches regressions before they ship).

Force-skip in CI by removing the cron + workflow_dispatch trigger from .github/workflows/chaos-weekly.yml — but do this only with a recorded sign-off, since the suite is the production gate for upstream-fault resilience.