# Bench results template

Operators copy this file to `docs/bench_results_YYYY-MM-DD.md`, fill in the numbers, and publish under that dated filename. The template itself stays empty — do NOT paste real numbers here.
# Bench results YYYY-MM-DD

- Model: claude-sonnet-4-6 (or gpt-4o-2024-11-20, gemini-2.0-pro, etc.)
- Queries: 30 (see `tools/offline/bench_queries_2026_04_30.csv`)
- Run by: `<operator>`
- Date: 2026-MM-DD
- jpcite `corpus_snapshot_id` (jpcite_packet arm): `<from packet>`
- LLM provider input price (¥ per 1M tokens): `<rate>`
- LLM provider output price (¥ per 1M tokens): `<rate>`
- JPY/USD rate used: `<rate>`
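The `yen_cost_per_answer` metric reported below presumably follows from the per-query token counts and the prices recorded above. A minimal sketch; the function name is illustrative (not part of the harness), and it assumes the prices are quoted directly in ¥ per 1M tokens:

```python
def yen_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """¥ cost of one answer, given provider prices quoted in ¥ per 1M tokens."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000

# e.g. 12,000 input + 800 output tokens at ¥450 / ¥1,800 per 1M tokens
print(round(yen_cost(12_000, 800, 450, 1800), 2))  # 6.84
```

If the provider quotes prices in USD instead, convert with the JPY/USD rate recorded above before applying this formula.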
## Per-arm medians
| Metric | direct_web (median, p25–p75) | jpcite_packet (median, p25–p75) | Δ% (median) |
|---|---|---|---|
| input_tokens | ... (... – ...) | ... (... – ...) | ...% |
| output_tokens | ... (... – ...) | ... (... – ...) | ...% |
| reasoning_tokens | ... (... – ...) | ... (... – ...) | ...% |
| web_searches | ... (... – ...) | 0 (0 – 0) | ...% |
| jpcite_requests | 0 (0 – 0) | ... (... – ...) | n/a |
| yen_cost_per_answer | ¥... (¥... – ¥...) | ¥... (¥... – ¥...) | ...% |
| latency_seconds | ... (... – ...) | ... (... – ...) | ...% |
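To fill the table above, the per-arm quartiles and the Δ% column can be computed along these lines. A standard-library sketch with illustrative numbers (the helper names and sample values are mine, not the harness's):

```python
import statistics

def summarize(values):
    """Return (p25, median, p75) for one metric in one arm."""
    q = statistics.quantiles(sorted(values), n=4, method="inclusive")
    return q[0], q[1], q[2]

def delta_pct(direct_median, packet_median):
    """Δ% (median): change of the jpcite_packet arm relative to direct_web."""
    return 100.0 * (packet_median - direct_median) / direct_median

# Illustrative per-query input_tokens for the two arms (not real numbers)
direct_web = [1100, 1200, 1300, 1500, 1800]
jpcite_packet = [350, 400, 480, 500, 620]

p25, med, p75 = summarize(direct_web)
print(p25, med, p75)                                           # 1200.0 1300.0 1500.0
print(round(delta_pct(med, summarize(jpcite_packet)[1]), 1))   # -63.1
```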
## Per-arm rates
| Metric | direct_web (mean) | jpcite_packet (mean) |
|---|---|---|
| citation_rate | ... | ... |
| hallucination_rate | ... | ... |
## Cost-per-answer distribution
Sorted full distribution (¥), N=30 each arm:
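The rate and distribution rows are simpler: each rate is a plain mean of per-query 0/1 flags, and the distribution is the per-query ¥ costs sorted. A sketch, assuming boolean flags per query (the exact flag semantics come from `docs/bench_methodology.md`, not from this template):

```python
def rate(flags):
    """Mean of per-query 0/1 flags (e.g. 1 = answer carried at least one citation)."""
    return sum(flags) / len(flags)

def sorted_distribution(yen_costs):
    """Render the sorted full cost distribution for one arm."""
    return ", ".join(f"¥{c:.2f}" for c in sorted(yen_costs))

print(rate([1, 1, 0, 1]))                     # 0.75
print(sorted_distribution([4.1, 2.05, 3.5]))  # ¥2.05, ¥3.50, ¥4.10
```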
## Caveats (REQUIRED — do not delete)
- Bench was run on N=30 queries stratified across 5 domains (10 補助金 / 5 法人 / 5 法令 / 5 税制 / 5 行政処分; i.e. subsidies, corporations, statutes, tax law, administrative dispositions) on 2026-MM-DD with model `<model>`.
- Results vary by model, prompt, query distribution, customer environment, provider free tiers, and provider-side caching.
- Free LLM tiers and provider-side prompt caching are not modeled in the ¥cost numbers above.
- The Δ% column is the median delta, not a guarantee. Customers will observe different numbers on their own query distributions.
- Phrasing rules from `docs/bench_methodology.md` §6 apply: no 「必ずX%削減」 ("guaranteed X% reduction"), 「業界最安」 ("cheapest in the industry"), or 「ChatGPTより正確」 ("more accurate than ChatGPT") phrasing in any derivative collateral citing this file.
## Replication
```shell
# 1. Generate bench instructions (no LLM call from this repo)
python tools/offline/bench_harness.py \
  --queries-csv tools/offline/bench_queries_2026_04_30.csv \
  --mode emit \
  --model <your-model> \
  > bench_instructions.jsonl

# 2. Operator runs each instruction line manually against their LLM
#    provider, writes bench_results.csv with the columns listed in
#    docs/bench_methodology.md §3.

# 3. Aggregate
python tools/offline/bench_harness.py \
  --results-csv bench_results.csv \
  --mode aggregate \
  > bench_summary.json
```
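Before step 3, it can help to fail fast if `bench_results.csv` does not actually contain both arms required by the two-arm contract. A sketch; the `arm` column name and its values are assumptions here, and the canonical column list lives in `docs/bench_methodology.md` §3:

```python
import csv

REQUIRED_ARMS = {"direct_web", "jpcite_packet"}

def check_arms(path):
    """Exit with an error if either bench arm is missing from the results CSV."""
    with open(path, newline="") as f:
        arms = {row["arm"] for row in csv.DictReader(f)}  # "arm" column name is assumed
    missing = REQUIRED_ARMS - arms
    if missing:
        raise SystemExit(f"bench_results.csv missing arm(s): {sorted(missing)}")
```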
See `docs/bench_methodology.md` for the full procedure, including the two-arm contract, metric definitions, and disclosure block.