Skip to content

Benchmarks

All benchmarks run on a single Linux machine (x86_64). HTTP latency measured end-to-end including server overhead. Library latency measured per resolve() call.

Results summary

DatasetIntentsCold startAfter learningLatency
CLINC15015074.8% exact / 86.5% partial90.3% exact / 98.4% partial~2.8ms HTTP
SaaS support1058% exact / 70.7% F171% exact / 88.6% F1~640µs
MCP tool routing10044.7% exact / 89.4% recall48.9% exact / 94.7% recall~870µs
Multilingual5 × 5 langs68% exact / 86% recallcold only~950µs

CLINC150

Dataset: 150 intents, 4,500 test queries. Standard academic intent classification benchmark.

Method: No LLM. Pure correction-based learning from test query results.

RoundExact matchPartial matchCorrections applied
Cold start74.8%86.5%
Round 188.2%96.8%+607
Round 290.0%98.2%+714
Round 390.3%98.4%+726

Takeaway: 74% → 90% accuracy with zero LLM cost. Each correction round takes under 30 seconds.


SaaS support classification

Dataset: 10 intents (billing, auth, API limits, etc.), 69 queries. One LLM auto-learn pass using Claude Haiku.

StageExactRecallPrecisionF1Top-K
Cold start58%72.7%68.7%70.7%71%
After learn71%90.6%86.7%88.6%89.9%

LLM cost: ~$0.004 (20 learned phrases, Claude Haiku)


MCP tool routing

Dataset: 100 MCP tool intents generated from a real tool registry, ~500 queries.

StageExactRecallF1
Cold start44.7%89.4%74.0%
After learn48.9%94.7%80.3%

Note: Exact match is lower at 100 intents because many tools share terminology. Recall (94.7%) is high — the correct intent is in the top results. Use threshold / gap tuning or a secondary LLM pass for disambiguation at this scale.


Multilingual

Dataset: 5 intents × 5 languages (English, Japanese, Korean, Tamil, Chinese), 34 queries. Cold start only.

LanguageExactRecallTop-KLatency
English75%75%75%~995µs
Japanese71%93%86%~1041µs
Korean67%94%100%~952µs
Tamil67%100%100%~925µs
Chinese57%71%71%~869µs
Overall68%86%85%~950µs

Key finding: Recall is high across all languages. Lower exact match for CJK is caused by multi-intent splits on complex queries — auto-learn reduces this significantly.


Reproducing results

Benchmark scripts are in benchmarks/scripts/:

Terminal window
# CLINC150
python benchmarks/scripts/bench_clinc150.py
# SaaS support
python benchmarks/scripts/bench_saas_support.py
# MCP tool routing
python benchmarks/scripts/bench_mcp_routing.py
# Multilingual
python benchmarks/scripts/bench_multilingual.py