Wave 47 — Dim A (semantic_search legacy v1) migration PR state
Generated 2026-05-12 (Wave 47 Phase 2 perpetual-loop tick #5).
PR scope
Closes the Dim A (semantic_search legacy v1) storage gap left after PR #144, which landed the REST wrap for the v1 hash-fallback path but without a disk-backed substrate. Migration 260 already covers the canonical v2 layer (e5-small 384-dim sqlite-vec + cross-encoder reranker); this PR adds the parallel legacy v1 layer so the v1 wrap becomes 100% boot-survivable instead of in-memory only.
The v1 layer is retained for two reasons:
- Graceful fallback when sqlite-vec extension fails to load (Wave 47 wires the v1 cache as the read-only fallback so cold-start traffic is not lost).
- Long-tail warm cache — top 100 canonical queries baked offline so the ¥3/req micropayment still returns a useful top-K hit when the vec0 table is cold or absent.
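The first fallback reason above can be sketched as a small layer selector. This is a minimal sketch under stated assumptions, not the Wave 47 wiring: `pick_search_layer` and its path argument are hypothetical names, and the only real API used is the standard `sqlite3` extension-loading interface.

```python
import sqlite3


def pick_search_layer(conn: sqlite3.Connection, vec_extension_path: str) -> str:
    """Return "v2" if the vector extension loads, else "v1" (hypothetical helper)."""
    try:
        conn.enable_load_extension(True)
        conn.load_extension(vec_extension_path)
        return "v2"
    except (AttributeError, sqlite3.Error):
        # AttributeError: this Python build lacks extension-loading support.
        # sqlite3.Error: the shared library is missing or failed to initialise.
        return "v1"
    finally:
        try:
            conn.enable_load_extension(False)
        except (AttributeError, sqlite3.Error):
            pass


conn = sqlite3.connect(":memory:")
# The path below is deliberately invalid so the fallback branch is exercised.
layer = pick_search_layer(conn, "/nonexistent/path/to/vec0")
print(layer)
```

A caller would then read from the v1 cache table whenever the selector reports `"v1"`, leaving the v2 path untouched.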
Wave 47 (Dim A) deliverables
| File | Kind | LOC |
|---|---|---|
| `scripts/migrations/284_semantic_search_v1.sql` | Migration | ~85 |
| `scripts/migrations/284_semantic_search_v1_rollback.sql` | Rollback | ~20 |
| `scripts/etl/build_semantic_search_v1_cache.py` | ETL prebuild | ~180 |
| `tests/test_dim_a_semantic_v1.py` | Tests (13) | ~230 |
| `scripts/migrations/autonomath_boot_manifest.txt` | Manifest | +14 |
| `scripts/migrations/jpcite_boot_manifest.txt` | Manifest | +14 |
| `docs/research/wave46/STATE_w47_dim_a_pr.md` | State doc | ~150 |
Total: ~105 LOC across the migration pair (284 + rollback, ~85 + ~20
per the table above) and ~700 LOC overall. Hard constraints honored:
NO LLM API, NO main worktree (used dual-CLI lane
mkdir /tmp/jpcite-w47-dim-a-mig.lane per
feedback_dual_cli_lane_atomic), NO rm/mv (additive-only), NO legacy
brand names (税務会計AI / zeimu-kaikei.ai / AutonoMath agri), NO mig 260
overwrite (DDL stays disjoint per test_no_overlap_with_mig_260).
Storage schema
am_semantic_search_v1_cache
- `cache_id` TEXT PRIMARY KEY (sha256 of normalized query)
- `query_text` TEXT: raw query, baked offline only
- `embedding` BLOB NOT NULL: float32 packed bytes (1536 = 384 × 4)
- `embedding_dim` INTEGER default 384 with CHECK > 0
- `top_k_results` TEXT NOT NULL: JSON array of `{entity_id, score}`, CHECK length > 0
- `top_k` INTEGER default 10 with CHECK 1..100
- `model_name` default `hash-fallback-e5-small-v1`
- `cached_at` ISO-8601 UTC
- 1 index: `(cached_at)` for sweep
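The column list above can be approximated as DDL. The following is an illustrative reconstruction, not the text of migration 284: the index name, the `cached_at` default expression, and the exact CHECK spellings are assumptions. The demo inserts one row relying on the defaults and confirms that `top_k = 0` is rejected.

```python
import sqlite3

# Reconstructed from the column list; the real migration may differ in detail.
CACHE_DDL = """
CREATE TABLE IF NOT EXISTS am_semantic_search_v1_cache (
    cache_id       TEXT PRIMARY KEY,
    query_text     TEXT,
    embedding      BLOB NOT NULL,
    embedding_dim  INTEGER DEFAULT 384 CHECK (embedding_dim > 0),
    top_k_results  TEXT NOT NULL CHECK (length(top_k_results) > 0),
    top_k          INTEGER DEFAULT 10 CHECK (top_k BETWEEN 1 AND 100),
    model_name     TEXT DEFAULT 'hash-fallback-e5-small-v1',
    cached_at      TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now'))
);
-- Hypothetical index name; the source only says one index on (cached_at).
CREATE INDEX IF NOT EXISTS idx_v1_cache_cached_at
    ON am_semantic_search_v1_cache (cached_at);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(CACHE_DDL)

# Insert a row that relies on the column defaults (zero-filled placeholder embedding).
conn.execute(
    "INSERT INTO am_semantic_search_v1_cache (cache_id, embedding, top_k_results) "
    "VALUES (?, ?, ?)",
    ("demo", bytes(384 * 4), "[]"),
)
row = conn.execute(
    "SELECT embedding_dim, top_k, model_name, length(embedding) "
    "FROM am_semantic_search_v1_cache WHERE cache_id = 'demo'"
).fetchone()

# top_k outside 1..100 should violate the CHECK constraint.
try:
    conn.execute(
        "INSERT INTO am_semantic_search_v1_cache "
        "(cache_id, embedding, top_k_results, top_k) VALUES (?, ?, ?, ?)",
        ("bad", bytes(384 * 4), "[]", 0),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```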
am_semantic_search_v1_log
- `search_id` PK AUTOINCREMENT
- `query_hash` TEXT NOT NULL: raw query NEVER stored on log path
- `latency_ms` INTEGER CHECK >= 0
- `hit_count` INTEGER CHECK >= 0
- `cache_hit` INTEGER CHECK IN (0, 1)
- `searched_at` ISO-8601 UTC
- 2 indices: `(query_hash)` and `(searched_at)`, for hit-rate KPI + TTFP
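A sketch of how the log path might keep raw queries out of storage while still supporting the hit-rate KPI. Everything here is an assumption made for illustration: the DDL is reconstructed from the column list, `log_search` is a hypothetical helper, and hashing with SHA-256 over a trimmed, lowercased query is a guess (the source only states that the raw query is never stored).

```python
import hashlib
import sqlite3

# Reconstructed from the column list; index names are hypothetical.
LOG_DDL = """
CREATE TABLE IF NOT EXISTS am_semantic_search_v1_log (
    search_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    query_hash  TEXT NOT NULL,
    latency_ms  INTEGER CHECK (latency_ms >= 0),
    hit_count   INTEGER CHECK (hit_count >= 0),
    cache_hit   INTEGER CHECK (cache_hit IN (0, 1)),
    searched_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now'))
);
CREATE INDEX IF NOT EXISTS idx_v1_log_query_hash
    ON am_semantic_search_v1_log (query_hash);
CREATE INDEX IF NOT EXISTS idx_v1_log_searched_at
    ON am_semantic_search_v1_log (searched_at);
"""


def log_search(conn, raw_query, latency_ms, hit_count, cache_hit):
    """Store only a hash of the query (assumed normalisation: strip + lower)."""
    normalized = raw_query.strip().lower()
    query_hash = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    cur = conn.execute(
        "INSERT INTO am_semantic_search_v1_log "
        "(query_hash, latency_ms, hit_count, cache_hit) VALUES (?, ?, ?, ?)",
        (query_hash, latency_ms, hit_count, int(cache_hit)),
    )
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.executescript(LOG_DDL)

# Made-up sample traffic: two cache hits and one miss.
ids = [
    log_search(conn, "subsidy application", 3, 10, True),
    log_search(conn, "unknown query", 12, 0, False),
    log_search(conn, "Subsidy Application ", 2, 10, True),
]

# Hit-rate KPI: the mean of the 0/1 cache_hit column.
hit_rate = conn.execute(
    "SELECT AVG(cache_hit) FROM am_semantic_search_v1_log"
).fetchone()[0]
stored_hash = conn.execute(
    "SELECT query_hash FROM am_semantic_search_v1_log WHERE search_id = 1"
).fetchone()[0]
```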
ETL semantics
scripts/etl/build_semantic_search_v1_cache.py pre-warms the top-N
(default 100) canonical queries deterministically:
- Embedding: `hashlib.sha512` chain expanded to 1536 bytes, then unpacked as 384 float32 values in [-1, 1]. NO LLM, NO network, NO sentence-transformers (the v1 layer is hash-fallback by design; the v2 layer keeps sentence-transformers).
- Top-K results: pure SQL `LIKE` scan over `am_entities`, with a graceful empty list if the table is absent (test fixtures).
- Idempotent: a `cache_id` PK collision is skipped, not aborted; re-running the ETL after partial completion writes only the missing rows.
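The embedding and top-K steps above could look roughly like the following. This is a sketch under assumptions, not the ETL's actual code: the function names are hypothetical, the mapping from hash bytes to [-1, 1] (unsigned 32-bit integers scaled linearly) is one plausible choice, and the `am_entities` column names `entity_id` and `name` plus the constant score are placeholders.

```python
import hashlib
import json
import sqlite3
import struct

EMBEDDING_DIM = 384


def hash_fallback_embedding(query: str, dim: int = EMBEDDING_DIM) -> bytes:
    """Deterministic pseudo-embedding: SHA-512 chain -> dim packed float32 values."""
    stream = b""
    block = query.encode("utf-8")
    # Chain digests until there are 4 bytes available per dimension.
    while len(stream) < dim * 4:
        block = hashlib.sha512(block).digest()
        stream += block
    # Assumption: read each 4-byte chunk as an unsigned int and scale it,
    # since reinterpreting raw hash bytes as float32 could yield NaN or inf.
    ints = struct.unpack(f"<{dim}I", stream[: dim * 4])
    floats = [i / 0xFFFFFFFF * 2.0 - 1.0 for i in ints]
    return struct.pack(f"<{dim}f", *floats)


def like_top_k(conn: sqlite3.Connection, query: str, top_k: int = 10) -> str:
    """LIKE scan over am_entities; returns a JSON array, empty if the scan fails."""
    try:
        rows = conn.execute(
            "SELECT entity_id FROM am_entities WHERE name LIKE ? LIMIT ?",
            (f"%{query}%", top_k),
        ).fetchall()
    except sqlite3.OperationalError:
        # Raised when the table does not exist, e.g. in a bare test fixture.
        return "[]"
    # Placeholder score: a LIKE match carries no ranking signal of its own.
    return json.dumps([{"entity_id": r[0], "score": 1.0} for r in rows])


blob = hash_fallback_embedding("subsidy application")
values = struct.unpack(f"<{EMBEDDING_DIM}f", blob)
empty = like_top_k(sqlite3.connect(":memory:"), "subsidy")
```

Because the embedding depends only on the query bytes, re-running the sketch on the same input yields byte-identical output, which is the property that makes the prebuild deterministic.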
Smoke verify (Python 3.13, temp DB):
- 20 seed queries × dedup → 10 cache rows written, 10 skipped.
- Embedding integrity: length(embedding) == embedding_dim * 4 = 1536.
- ETL elapsed: 9 ms / 20 queries (hash-fallback is cheap).
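The written/skipped counts and the integrity check above can be reproduced with a small idempotent loader. This is a hedged sketch: `cache_id_for`, `prewarm`, the normalisation rule, the simplified table, and the seed queries are all made up for the demonstration, and `INSERT OR IGNORE` is one way to skip primary-key collisions rather than the ETL's confirmed mechanism.

```python
import hashlib
import sqlite3


def cache_id_for(query: str) -> str:
    # Assumed normalisation: trim, lowercase, collapse internal whitespace.
    normalized = " ".join(query.strip().lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def prewarm(conn: sqlite3.Connection, queries: list) -> tuple:
    """Insert one cache row per distinct query; return (written, skipped)."""
    written = skipped = 0
    for q in queries:
        cur = conn.execute(
            "INSERT OR IGNORE INTO am_semantic_search_v1_cache "
            "(cache_id, query_text, embedding, embedding_dim, top_k_results) "
            "VALUES (?, ?, ?, ?, ?)",
            (cache_id_for(q), q, bytes(384 * 4), 384, "[]"),
        )
        # rowcount is 1 for a new row and 0 when the PK already existed.
        if cur.rowcount == 1:
            written += 1
        else:
            skipped += 1
    conn.commit()
    return written, skipped


conn = sqlite3.connect(":memory:")
# Simplified stand-in for the cache table (constraints omitted for brevity).
conn.execute(
    "CREATE TABLE am_semantic_search_v1_cache ("
    "cache_id TEXT PRIMARY KEY, query_text TEXT, embedding BLOB NOT NULL, "
    "embedding_dim INTEGER DEFAULT 384, top_k_results TEXT NOT NULL)"
)

# 20 made-up seeds that normalise to 10 distinct queries.
seeds = [f"query {i}" for i in range(10)] + [f"  QUERY {i} " for i in range(10)]
first_run = prewarm(conn, seeds)
second_run = prewarm(conn, seeds)

# Integrity check: every embedding must be exactly embedding_dim * 4 bytes.
bad_rows = conn.execute(
    "SELECT COUNT(*) FROM am_semantic_search_v1_cache "
    "WHERE length(embedding) != embedding_dim * 4"
).fetchone()[0]
```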
Test matrix (13/13 green)
| Test | Asserts |
|---|---|
| `test_mig_284_creates_two_tables` | exactly cache + log, no extras |
| `test_mig_284_creates_three_indexes` | cache×1 + log×2 by name |
| `test_cache_id_is_primary_key` | duplicate insert → IntegrityError |
| `test_top_k_check_constraint` | top_k = 0 → IntegrityError |
| `test_log_cache_hit_boolean` | cache_hit = 2 → IntegrityError |
| `test_log_latency_nonneg` | latency_ms = -1 → IntegrityError |
| `test_log_autoincrement` | search_id = 1, 2, 3 in order |
| `test_embedding_dim_check` | embedding_dim = 0 → IntegrityError |
| `test_etl_idempotent` | 2× run = 1× rows + ≥10 skipped, embedding bytes = dim × 4 |
| `test_etl_top_k_results_is_valid_json` | JSON parses to list of `{entity_id, score}` |
| `test_rollback_drops_everything` | mig + rollback → 0 tables, 0 indexes |
| `test_no_llm_api_import_in_etl` | 0 anthropic / openai / google.generativeai / claude_agent_sdk |
| `test_no_overlap_with_mig_260` | DDL of mig 284 ≠ DDL of mig 260 (comments allowed) |
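As an illustration of what an import guard like `test_no_llm_api_import_in_etl` might check, the sketch below scans Python source with the standard `ast` module. The helper name and the AST-based approach are assumptions; the real test could just as well be a plain text search.

```python
import ast

FORBIDDEN = ("anthropic", "openai", "google.generativeai", "claude_agent_sdk")


def forbidden_imports(source: str) -> list:
    """Return every imported module name that falls under a forbidden package."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            # Cover both "from openai import x" and "from google import generativeai".
            names = [module] + [f"{module}.{alias.name}" for alias in node.names]
        else:
            continue
        for name in names:
            if any(name == pkg or name.startswith(pkg + ".") for pkg in FORBIDDEN):
                found.append(name)
    return found


clean = forbidden_imports("import hashlib\nimport sqlite3\n")
dirty = forbidden_imports("from google import generativeai\nimport openai.types\n")
```

In a test, the assertion would be that the list is empty for the ETL script's source text.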
Constraint compliance summary
- `feedback_no_operator_llm_api`: 0 LLM imports in `scripts/etl/` + `tests/` + the SQL migration itself. Verified by `test_no_llm_api_import_in_etl`.
- `feedback_dual_cli_lane_atomic`: lane claimed via `mkdir /tmp/jpcite-w47-dim-a-mig.lane` (atomic) + worktree on separate branch; main worktree untouched.
- `feedback_destruction_free_organization`: no `rm` / `mv` / file rename. Pure additive across SQL, ETL, tests, manifest.
- `feedback_completion_gate_minimal`: PR-gate is the 13/13 pytest + SQL syntax verify, not a 40-item checklist. The v1 layer is opt-in for runtime callers; the canonical v2 path remains unchanged.
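The atomic lane claim relies on directory creation either succeeding for exactly one caller or failing because the directory already exists. A minimal Python equivalent of the `mkdir` claim, using a hypothetical `claim_lane` helper and a throwaway temp directory instead of the real lane path:

```python
import os
import tempfile


def claim_lane(lane_dir: str) -> bool:
    """Return True if this caller created the lane directory, False if it is held."""
    try:
        os.mkdir(lane_dir)  # creation fails if the directory already exists
        return True
    except FileExistsError:
        return False


# Demo lane under a fresh temp directory (not the real /tmp lane path).
demo_root = tempfile.mkdtemp()
lane = os.path.join(demo_root, "demo.lane")
first_claim = claim_lane(lane)
second_claim = claim_lane(lane)
```

The second call returns `False` without modifying anything, which is what lets two CLI sessions coordinate without a separate lock file.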
PR
PR # will be filled by gh pr create immediately after the worktree
push. Branch feat/jpcite_2026_05_12_wave47_dim_a_migration against
main.