# Benchmarking provider performance with CLEAR+ across Ollama and Anthropic

Audience: scientists and ML researchers who want to compare AI providers on the same workload using FCC's CLEAR+ benchmark suite.
## What CLEAR+ measures

The CLEAR+ evaluation framework (in `src/fcc/evaluation/`) scores any AI workflow on seven dimensions:
| Dimension | What it captures |
|---|---|
| Cost | Total token spend (estimated USD or local-compute equivalent) |
| Latency | Wall-clock time end-to-end |
| Efficacy | Output quality vs reference / persona expectations |
| Assurance | Output consistency across re-runs |
| Reliability | Frequency of provider errors / retries |
| Coverage | Number of personas exercised |
| Explainability | Trace completeness |
As of v1.1.0 you can run the same benchmark suite against `ollama`, `litellm`, `anthropic`, `openai`, and other supported providers: the engine builds a fresh `AIClient` for each provider and reports CLEAR+ scores per run.
## Run the baseline benchmark suite
The default suite is mock-based for fast CI runs.
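The snippet below is a minimal sketch of such a run, reusing the `BenchmarkSpec`, `BenchmarkRunner`, and `AIClient` API from the cross-provider example later on this page; the `suite_id` and run counts are illustrative rather than fixed names:

```python
from fcc.evaluation.runner import BenchmarkRunner
from fcc.evaluation.benchmark import BenchmarkSpec
from fcc.simulation.ai_client import AIClient

# Mock provider: no API keys and no local model server required.
spec = BenchmarkSpec(
    suite_id="baseline",   # illustrative suite name
    provider_id="mock",
    n_runs=3,
    max_steps=20,
)
runner = BenchmarkRunner(spec=spec, ai_client=AIClient(provider="mock"))
baseline_result = runner.run()
print(baseline_result.efficacy)
```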
This produces a `BenchmarkResult` with all seven dimensions filled in for the mock provider as a baseline.
## Cross-provider comparison
Run the same suite against multiple providers and compare:
```python
from fcc.evaluation.runner import BenchmarkRunner
from fcc.evaluation.benchmark import BenchmarkSpec
from fcc.simulation.ai_client import AIClient

PROVIDERS_TO_TEST = ["mock", "ollama", "anthropic"]
results = {}

for provider_id in PROVIDERS_TO_TEST:
    # Skip providers that are not set up in this environment
    # (no API key, no Ollama server running).
    client = AIClient(provider=provider_id)
    if not client.is_available():
        print(f"Skipping {provider_id} (not available in this environment)")
        continue

    spec = BenchmarkSpec(
        suite_id="baseline-cross-provider",
        provider_id=provider_id,
        n_runs=5,        # multiple runs feed the assurance/reliability scores
        max_steps=20,
    )
    runner = BenchmarkRunner(spec=spec, ai_client=client)
    results[provider_id] = runner.run()

# Compare CLEAR+ dimensions across providers.
for pid, result in results.items():
    print(f"\n{pid}:")
    print(f"  Cost:           {result.cost:.4f}")
    print(f"  Latency (ms):   {result.latency_ms:.0f}")
    print(f"  Efficacy:       {result.efficacy:.3f}")
    print(f"  Assurance:      {result.assurance:.3f}")
    print(f"  Reliability:    {result.reliability:.3f}")
    print(f"  Coverage:       {result.coverage}")
    print(f"  Explainability: {result.explainability:.3f}")
```
## Statistical significance
For publication-ready comparisons, run more iterations and report confidence intervals:
```python
spec = BenchmarkSpec(
    suite_id="paper-comparison",
    provider_id="ollama",
    n_runs=30,       # ≥30 runs before leaning on normal-distribution assumptions
    max_steps=20,
)
```
The result object exposes per-run scores in `result.runs`, so you can compute means, standard deviations, and t-tests with `scipy.stats`.
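For example, a 95% confidence interval and a Welch t-test between two providers could be computed as in the sketch below. It reuses the `results` dict from the cross-provider example and assumes each entry in `result.runs` exposes a per-run `efficacy` attribute; adapt the attribute access to the actual run schema:

```python
import numpy as np
from scipy import stats

def efficacy_scores(result):
    # Assumes each per-run record has an `efficacy` attribute;
    # adjust the field name if the run objects differ.
    return np.array([run.efficacy for run in result.runs])

ollama_scores = efficacy_scores(results["ollama"])
anthropic_scores = efficacy_scores(results["anthropic"])

# 95% confidence interval for the Ollama mean (t-distribution).
mean = ollama_scores.mean()
ci = stats.t.interval(
    0.95,
    len(ollama_scores) - 1,
    loc=mean,
    scale=stats.sem(ollama_scores),
)
print(f"Ollama efficacy: {mean:.3f} (95% CI {ci[0]:.3f}-{ci[1]:.3f})")

# Welch's t-test: does efficacy differ between the two providers?
t_stat, p_value = stats.ttest_ind(ollama_scores, anthropic_scores, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```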
## Environment setup for fair comparison
```bash
# Pin the Ollama model
export OLLAMA_BASE_URL=http://localhost:11434/v1
export OLLAMA_DEFAULT_MODEL=llama3.1:8b

# Pin the Anthropic model
export ANTHROPIC_API_KEY=sk-ant-...
export ANTHROPIC_DEFAULT_MODEL=claude-3-5-sonnet-20241022

# Pin temperature for a deterministic comparison
# (set it in the scenario ai_config, not the environment, so it stays traceable)
```
## Reporting cost in fair units
CLEAR+ reports cost as a USD equivalent for hosted providers. For local providers (Ollama, vLLM), cost is reported as estimated GPU-seconds derived from token count and hardware characteristics. You can override the cost model:
```python
from fcc.evaluation.metrics import CLEARPlusMetrics

# Custom cost per 1k input tokens for your specific deployment.
custom_costs = {
    "mock": 0.0,
    "ollama": 0.0,        # no marginal cost (the GPU is already paid for)
    "anthropic": 0.003,   # Sonnet input price as of v1.1.0
    "openai": 0.0025,
}
metrics = CLEARPlusMetrics(cost_per_1k_input=custom_costs)
```
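As a sanity check on units, the sketch below shows how a linear per-1k-input-token model would turn a token count into a Cost score; the 42,000-token figure is hypothetical, and real accounting may also include output-token pricing:

```python
# Illustrative only: assumes the Cost dimension scales linearly with
# cost_per_1k_input and ignores output tokens.
input_tokens = 42_000                                    # hypothetical run
cost_usd = input_tokens / 1_000 * custom_costs["anthropic"]
print(f"{cost_usd:.3f} USD-equivalent")                  # 0.126
```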
## Scenario-level pinning

For benchmarks that should always run on a specific provider regardless of the operator's environment, pin the provider via `ai_config` in the scenario JSON:
```json
{
  "id": "BENCH-OLLAMA-001",
  "name": "Ollama llama3.1 8B baseline",
  "type": "ai",
  "setup": {
    "ai_config": {
      "provider": "ollama",
      "model": "llama3.1:8b",
      "temperature": 0.0
    }
  }
}
```
The runner uses the scenario's `ai_config` rather than the global default, which is useful for committing reproducible benchmarks alongside the data.
## See also
- Provider matrix
- Reproducible research stack (academic version)
- CLEAR+ evaluation framework — source
- Notebook 19: evaluation and compliance