Wave 49 tick#2 — provenance backfill ETL fact_id → id 1-axis fix¶
Context¶
The Wave 49 tick#3 PR landed the daily cron workflow for
scripts/etl/provenance_backfill_6M_facts_v2.py. The workflow LIVE-fired
the script via --dry-run against the prod 9.7 GB autonomath.db and
the cron actually executed, but the non-dry-run path crashed on the
very first batch SELECT with::
sqlite3.OperationalError: no such column: fact_id
Root cause: the walker cursor SELECT and 3 derive-helpers
referenced am_entity_facts.fact_id — but the canonical prod schema PK
column is am_entity_facts.id (confirmed by migration 049 doc string,
migration 265 inline comment "pointer to am_entity_facts.id", migration
069 view definition "fact_id — am_entity_facts.id", and the canonical
CREATE TABLE in tests/test_evidence_packet.py:71-82). prod write was
0 (dry-run gate held) so no rows corrupted, but Dim O backfill made 0
progress beyond row 0.
1-axis fix (6 SQL replacements, single conceptual edit)¶
Only the column NAME am_entity_facts.fact_id was changed to id
in 5 SQL statements (6 total token replacements — WHERE/ORDER BY/
SELECT clauses on the same column). The Python variable names
fact_id and the am_fact_metadata.fact_id / am_fact_attestation_log.fact_id
column references were kept (those columns are correctly named per
migration 275).
| File | Line | Before | After |
|---|---|---|---|
scripts/etl/provenance_backfill_6M_facts_v2.py |
143 | SELECT source_id FROM am_entity_facts WHERE fact_id = ? LIMIT 1 |
SELECT source_id FROM am_entity_facts WHERE id = ? LIMIT 1 |
scripts/etl/provenance_backfill_6M_facts_v2.py |
163 | SELECT confidence FROM am_entity_facts WHERE fact_id = ? LIMIT 1 |
SELECT confidence FROM am_entity_facts WHERE id = ? LIMIT 1 |
scripts/etl/provenance_backfill_6M_facts_v2.py |
183 | SELECT created_at FROM am_entity_facts WHERE fact_id = ? LIMIT 1 |
SELECT created_at FROM am_entity_facts WHERE id = ? LIMIT 1 |
scripts/etl/provenance_backfill_6M_facts_v2.py |
295-296 | SELECT fact_id ... ORDER BY fact_id ASC |
SELECT id ... ORDER BY id ASC |
scripts/etl/provenance_backfill_6M_facts_v2.py |
301-302 | SELECT fact_id ... WHERE fact_id > ? ORDER BY fact_id ASC |
SELECT id ... WHERE id > ? ORDER BY id ASC |
Companion test stub correction¶
tests/test_dim_o_provenance_attach.py:55-60 declared the stub
am_entity_facts table with fact_id TEXT PRIMARY KEY — i.e. it
was testing against a non-canonical schema that masked the prod bug.
The stub is now renamed to id TEXT PRIMARY KEY (TEXT kept for
seed ergonomics — prod is INTEGER PK AUTOINCREMENT, but the walker
SELECTs by column NAME and passes the value through verbatim, so type
is irrelevant for the code path under test).
New regression test¶
tests/test_provenance_etl_id_schema.py (~210 LOC) — 5 tests:
test_am_entity_facts_pk_is_id_not_fact_id— schema-shape locktest_am_fact_metadata_has_fact_id_column— am_fact_metadata mig 275 column kepttest_etl_script_uses_id_for_am_entity_facts— forbidden-pattern source-greptest_walker_dry_run_does_not_write— --dry-run gate verifytest_walker_non_dry_run_writes_with_placeholder_sig— idempotency + placeholder sig
grep result¶
Verified with grep -n 'fact_id\|am_entity_facts'::
143: "SELECT source_id FROM am_entity_facts WHERE id = ? LIMIT 1",
163: "SELECT confidence FROM am_entity_facts WHERE id = ? LIMIT 1",
183: "SELECT created_at FROM am_entity_facts WHERE id = ? LIMIT 1",
295: "SELECT id FROM am_entity_facts "
301: "SELECT id FROM am_entity_facts "
am_entity_facts.fact_id references: 0 (was 6 before fix).
am_fact_metadata.fact_id references: preserved as-is (canonical
column name per migration 275).
Test verdict¶
tests/test_dim_o_provenance_attach.py — 18 passed (was 1 fail)
tests/test_provenance_etl_id_schema.py — 5 passed (new)
tests/test_provenance_backfill_workflow.py — 3 passed
Total: 26 passed, 0 failed
ruff: All checks passed
Hard constraints honored¶
- No
PRAGMA quick_check(memory:feedback_no_quick_check_on_huge_sqlite). - No LLM SDK import (memory:
feedback_no_operator_llm_api). - No destructive
rm/mv(memory:feedback_destruction_free_organization). - Lane: dedicated worktree
/tmp/jpcite-w49-prov-etl-fix, branchfeat/jpcite_2026_05_12_wave49_prov_etl_fact_id_fix(memory:feedback_dual_cli_lane_atomic).
Out of scope (separate ticks)¶
_derive_source_docline 150 referencesam_source.url(column issource_url) andWHERE source_id = ?(am_source PK isid) — also broken but distinct: a separate adjacent-schema-drift bug. Filed as follow-up tick. The current PR is laser-focused on thefact_id→idblocker; fixingam_sourcereferences is additive and orthogonal.