Bench results template

Operators copy this file to docs/bench_results_YYYY-MM-DD.md, fill in the numbers, and publish under that dated filename. The template itself stays empty: DO NOT paste real numbers here.

Bench results YYYY-MM-DD

Model: claude-sonnet-4-6 (or gpt-4o-2024-11-20, gemini-2.0-pro, etc.)
Queries: 30 (see tools/offline/bench_queries_2026_04_30.csv)
Run by: <operator>
Date: 2026-MM-DD
jpcite corpus_snapshot_id (jpcite_packet arm): <from packet>
LLM provider input price (¥ per 1M tokens): <rate>
LLM provider output price (¥ per 1M tokens): <rate>
JPY/USD rate used: <rate>

Per-arm medians

Metric	direct_web (median, p25–p75)	jpcite_packet (median, p25–p75)	Δ% (median)
input_tokens	... (... – ...)	... (... – ...)	...%
output_tokens	... (... – ...)	... (... – ...)	...%
reasoning_tokens	... (... – ...)	... (... – ...)	...%
web_searches	... (... – ...)	0 (0 – 0)	...%
jpcite_requests	0 (0 – 0)	... (... – ...)	n/a
yen_cost_per_answer	¥... (¥... – ¥...)	¥... (¥... – ¥...)	...%
latency_seconds	... (... – ...)	... (... – ...)	...%
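
As a rough illustration of how one row of this table could be filled in, the sketch below computes the median, the p25–p75 range, and a Δ% for a single metric. It assumes Δ% is the relative change of the jpcite_packet median against the direct_web median, and it assumes inclusive quartile interpolation; docs/bench_methodology.md remains the authority on the actual definitions. The helper names `summarize` and `delta_pct` and the sample values are hypothetical.

```python
from statistics import median, quantiles


def summarize(values):
    """Return (median, p25, p75) for one metric in one arm."""
    # method="inclusive" treats the sample as the whole population.
    # This is an assumption; the harness may interpolate differently.
    p25, _, p75 = quantiles(values, n=4, method="inclusive")
    return median(values), p25, p75


def delta_pct(baseline_median, packet_median):
    """Relative change of packet vs. baseline in percent; None if baseline is 0."""
    if baseline_median == 0:
        return None  # shown as n/a in the table
    return (packet_median - baseline_median) / baseline_median * 100.0


# Made-up placeholder values, NOT bench results.
direct_web = [1000, 1200, 1400, 1600, 1800]
jpcite_packet = [500, 600, 700, 800, 900]

dw = summarize(direct_web)
jp = summarize(jpcite_packet)
print(dw, jp, delta_pct(dw[0], jp[0]))
```

Returning `None` for a zero baseline mirrors the n/a cell in the jpcite_requests row, where a percentage change is undefined.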

Per-arm rates

Metric	direct_web (mean)	jpcite_packet (mean)
citation_rate	...	...
hallucination_rate	...	...
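
A minimal sketch of how a per-arm rate could be derived, assuming each answer carries a 0/1 flag (for example "has at least one citation" or "contains a hallucination") and the rate is the mean of those flags. The flag semantics and the helper name `rate` are assumptions; the metric definitions in docs/bench_methodology.md take precedence.

```python
def rate(flags):
    """Mean of per-answer 0/1 flags; None when there are no answers."""
    flags = list(flags)
    if not flags:
        return None
    return sum(1 for flag in flags if flag) / len(flags)


# Placeholder flags for four answers, NOT bench results.
print(rate([True, False, True, True]))
```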

Cost-per-answer distribution

Sorted full distribution (¥), N=30 each arm:

direct_web:    [..., ..., ..., ...]
jpcite_packet: [..., ..., ..., ...]
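
The per-answer yen figures in these lists could be derived from token counts and the provider prices recorded in the header (¥ per 1M tokens). The sketch below is one plausible reading, not the harness's formula: it assumes cost is input tokens times input price plus output tokens times output price, and it lumps any per-call fees (web searches, jpcite requests) into a hypothetical `extra_yen` parameter. How reasoning tokens are billed is left out because it varies by provider.

```python
def yen_cost(input_tokens, output_tokens, input_price, output_price, extra_yen=0.0):
    """Yen cost of one answer; prices are yen per 1M tokens.

    extra_yen is a hypothetical catch-all for per-call fees; whether and
    how such fees are billed is an assumption, not taken from the harness.
    """
    token_cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return token_cost + extra_yen


# Placeholder (input_tokens, output_tokens) pairs and prices, NOT bench results.
answers = [(12_000, 800), (4_000, 600), (9_000, 1_000)]
costs = sorted(
    yen_cost(i, o, input_price=500.0, output_price=2_000.0) for i, o in answers
)
print([round(c, 2) for c in costs])
```

If the provider quotes prices in USD, the JPY/USD rate from the header would be applied to the prices before calling the function.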

Caveats (REQUIRED: do not delete)

The bench was run on N=30 queries stratified across 5 domains (10 補助金 [subsidies] / 5 法人 [corporations] / 5 法令 [laws and regulations] / 5 税制 [tax system] / 5 行政処分 [administrative dispositions]) on 2026-MM-DD with model <model>.
Results vary by model, prompt, query distribution, customer environment, provider free tiers, and provider-side caching.
Free LLM tiers and provider-side prompt caching are not modeled in the yen_cost_per_answer numbers above.
The Δ% column is the median delta, not a guarantee. Customers will observe different numbers on their own query distributions.
Phrasing rules from docs/bench_methodology.md §6 apply: no 「必ずX%削減」 ("guaranteed X% reduction"), 「業界最安」 ("cheapest in the industry"), or 「ChatGPTより正確」 ("more accurate than ChatGPT") phrasing in any derivative collateral citing this file.

Replication

# 1. Generate bench instructions (no LLM call from this repo)
python tools/offline/bench_harness.py \
    --queries-csv tools/offline/bench_queries_2026_04_30.csv \
    --mode emit \
    --model <your-model> \
    > bench_instructions.jsonl

# 2. Operator runs each instruction line manually against their LLM
#    provider, writes bench_results.csv with the columns listed in
#    docs/bench_methodology.md §3.

# 3. Aggregate
python tools/offline/bench_harness.py \
    --results-csv bench_results.csv \
    --mode aggregate \
    > bench_summary.json

See docs/bench_methodology.md for the full procedure, including the two-arm contract, metric definitions, and disclosure block.
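
Before running the aggregate step, an operator may want to confirm that bench_results.csv holds the expected number of rows per arm. The sketch below is a hedged example of such a check and is not part of bench_harness.py. The column names `arm` and `query_id` are hypothetical; the real column list is in docs/bench_methodology.md §3.

```python
import csv
import io
from collections import Counter

EXPECTED_ARMS = ("direct_web", "jpcite_packet")


def rows_per_arm(rows, arm_column="arm"):
    """Count result rows per arm; arm_column is a hypothetical column name."""
    return Counter(row[arm_column] for row in rows)


def check_counts(counts, expected_n=30):
    """Return a list of problems; an empty list means the counts look right."""
    problems = []
    for arm in EXPECTED_ARMS:
        found = counts.get(arm, 0)
        if found != expected_n:
            problems.append(f"{arm}: expected {expected_n} rows, found {found}")
    for arm in counts:
        if arm not in EXPECTED_ARMS:
            problems.append(f"unexpected arm: {arm}")
    return problems


# Tiny in-memory stand-in for bench_results.csv (2 queries instead of 30).
sample = io.StringIO("query_id,arm\nq1,direct_web\nq1,jpcite_packet\nq2,direct_web\n")
counts = rows_per_arm(csv.DictReader(sample))
print(check_counts(counts, expected_n=2))
```

For a real file, the `io.StringIO` stand-in would be replaced with `open("bench_results.csv", newline="", encoding="utf-8")` and `expected_n` left at 30.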