Benchmarks

All benchmarks run on a single Linux machine (x86_64). HTTP latency measured end-to-end including server overhead. Library latency measured per resolve() call.

Results summary

Dataset	Intents	Cold start	After learning	Latency
CLINC150	150	74.8% exact / 86.5% partial	90.3% exact / 98.4% partial	~2.8ms HTTP
SaaS support	10	58% exact / 70.7% F1	71% exact / 88.6% F1	~640µs
MCP tool routing	100	44.7% exact / 89.4% recall	48.9% exact / 94.7% recall	~870µs
Multilingual	5 × 5 langs	68% exact / 86% recall	cold only	~950µs

CLINC150

Dataset: 150 intents, 4,500 test queries. Standard academic intent classification benchmark.

Method: No LLM. Pure correction-based learning from test query results.

Round	Exact match	Partial match	Corrections applied
Cold start	74.8%	86.5%	—
Round 1	88.2%	96.8%	+607
Round 2	90.0%	98.2%	+714
Round 3	90.3%	98.4%	+726

Takeaway: 74% → 90% accuracy with zero LLM cost. Each correction round takes under 30 seconds.

SaaS support classification

Dataset: 10 intents (billing, auth, API limits, etc.), 69 queries. One LLM auto-learn pass using Claude Haiku.

Stage	Exact	Recall	Precision	F1	Top-K
Cold start	58%	72.7%	68.7%	70.7%	71%
After learn	71%	90.6%	86.7%	88.6%	89.9%

LLM cost: ~$0.004 (20 learned phrases, Claude Haiku)

MCP tool routing

Dataset: 100 MCP tool intents generated from a real tool registry, ~500 queries.

Stage	Exact	Recall	F1
Cold start	44.7%	89.4%	74.0%
After learn	48.9%	94.7%	80.3%

Note: Exact match is lower at 100 intents because many tools share terminology. Recall (94.7%) is high — the correct intent is in the top results. Use threshold / gap tuning or a secondary LLM pass for disambiguation at this scale.

Multilingual

Dataset: 5 intents × 5 languages (English, Japanese, Korean, Tamil, Chinese), 34 queries. Cold start only.

Language	Exact	Recall	Top-K	Latency
English	75%	75%	75%	~995µs
Japanese	71%	93%	86%	~1041µs
Korean	67%	94%	100%	~952µs
Tamil	67%	100%	100%	~925µs
Chinese	57%	71%	71%	~869µs
Overall	68%	86%	85%	~950µs

Key finding: Recall is high across all languages. Lower exact match for CJK is caused by multi-intent splits on complex queries — auto-learn reduces this significantly.

Reproducing results

Benchmark scripts are in benchmarks/scripts/:

# CLINC150
python benchmarks/scripts/bench_clinc150.py

# SaaS support
python benchmarks/scripts/bench_saas_support.py

# MCP tool routing
python benchmarks/scripts/bench_mcp_routing.py

# Multilingual
python benchmarks/scripts/bench_multilingual.py