
Benchmarking provider performance with CLEAR+ across Ollama and Anthropic

Audience: Scientists and ML researchers who want to compare AI providers on the same workload using FCC's CLEAR+ benchmark suite.

What CLEAR+ measures

The CLEAR+ evaluation framework (in src/fcc/evaluation/) scores any AI workflow on 7 dimensions:

Dimension        What it captures
Cost             Total token spend (estimated USD or local-compute equivalent)
Latency          Wall-clock time, end-to-end
Efficacy         Output quality vs reference / persona expectations
Assurance        Output consistency across re-runs
Reliability      Frequency of provider errors / retries
Coverage         Number of personas exercised
Explainability   Trace completeness

As of v1.1.0 you can run the same benchmark suite against ollama, litellm, anthropic, openai, and other providers: the engine builds a fresh AIClient for each provider and reports CLEAR+ scores per run.

Run the baseline benchmark suite

The default suite is mock-based for fast CI runs:

make benchmark
# or
fcc benchmark run --suite baseline --mock

This produces a BenchmarkResult with all 7 dimensions filled in for the mock provider as a baseline.

Cross-provider comparison

Run the same suite against multiple providers and compare:

from fcc.evaluation.runner import BenchmarkRunner
from fcc.evaluation.benchmark import BenchmarkSpec
from fcc.simulation.ai_client import AIClient

PROVIDERS_TO_TEST = ["mock", "ollama", "anthropic"]
results = {}

for provider_id in PROVIDERS_TO_TEST:
    # Skip providers that aren't usable in this environment
    # (missing API key, no Ollama server running, etc.)
    client = AIClient(provider=provider_id)
    if not client.is_available():
        print(f"Skipping {provider_id} (not available in this environment)")
        continue

    spec = BenchmarkSpec(
        suite_id="baseline-cross-provider",
        provider_id=provider_id,
        n_runs=5,         # Multiple runs for assurance/reliability
        max_steps=20,
    )
    runner = BenchmarkRunner(spec=spec, ai_client=client)
    results[provider_id] = runner.run()

# Compare CLEAR+ dimensions
for pid, result in results.items():
    print(f"\n{pid}:")
    print(f"  Cost:           {result.cost:.4f}")
    print(f"  Latency (ms):   {result.latency_ms:.0f}")
    print(f"  Efficacy:       {result.efficacy:.3f}")
    print(f"  Assurance:      {result.assurance:.3f}")
    print(f"  Reliability:    {result.reliability:.3f}")
    print(f"  Coverage:       {result.coverage}")
    print(f"  Explainability: {result.explainability:.3f}")

Statistical significance

For publication-ready comparisons, run more iterations and report confidence intervals:

spec = BenchmarkSpec(
    suite_id="paper-comparison",
    provider_id="ollama",
    n_runs=30,            # n ≥ 30 is a common rule of thumb for CLT-based intervals
    max_steps=20,
)

The result object exposes per-run scores in result.runs, so you can compute means, standard deviations, and t-tests with scipy.stats.
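
A minimal sketch of such a comparison, assuming each entry in result.runs carries the same per-dimension fields as the aggregate result (the efficacy attribute below mirrors the printout earlier; check the actual field names in your version):

import numpy as np
from scipy import stats

def efficacy_scores(result):
    # Assumes each run object carries an `efficacy` field, mirroring
    # the aggregate BenchmarkResult fields printed earlier
    return np.array([run.efficacy for run in result.runs])

a = efficacy_scores(results["ollama"])
b = efficacy_scores(results["anthropic"])

for name, scores in (("ollama", a), ("anthropic", b)):
    mean = scores.mean()
    # 95% confidence interval from the t-distribution
    ci_lo, ci_hi = stats.t.interval(
        0.95, df=len(scores) - 1, loc=mean, scale=stats.sem(scores)
    )
    print(f"{name}: mean={mean:.3f}, 95% CI=({ci_lo:.3f}, {ci_hi:.3f})")

# Welch's t-test: is the efficacy difference statistically significant?
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(f"t={t_stat:.3f}, p={p_value:.4f}")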

Environment setup for fair comparison

# Pin Ollama model
export OLLAMA_BASE_URL=http://localhost:11434/v1
export OLLAMA_DEFAULT_MODEL=llama3.1:8b

# Pin Anthropic model
export ANTHROPIC_API_KEY=sk-ant-...
export ANTHROPIC_DEFAULT_MODEL=claude-3-5-sonnet-20241022

# Pin temperature for deterministic comparison
# (set in scenario ai_config, not env, to make it traceable)
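
As a small extra guard, a benchmark script can fail fast when the pinned models are missing; this sketch uses only the environment variables shown above:

import os

# Abort early if the pinned models aren't set, so runs stay comparable
for var in ("OLLAMA_DEFAULT_MODEL", "ANTHROPIC_DEFAULT_MODEL"):
    if not os.environ.get(var):
        raise RuntimeError(f"{var} must be set for a fair cross-provider comparison")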

Reporting cost in fair units

CLEAR+ reports cost as USD-equivalent for hosted providers. For local providers (Ollama, vLLM) the cost is reported as estimated GPU seconds based on token count and hardware characteristics. You can override the cost model:

from fcc.evaluation.metrics import CLEARPlusMetrics

# Custom cost-per-1k-tokens for your specific deployment
custom_costs = {
    "mock": 0.0,
    "ollama": 0.0,         # No marginal cost (already paid for the GPU)
    "anthropic": 0.003,    # Sonnet input price as of v1.1.0
    "openai": 0.0025,
}
metrics = CLEARPlusMetrics(cost_per_1k_input=custom_costs)
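
As a back-of-the-envelope check of the units (the token count below is invented; the rate comes from the custom_costs table above), a run's USD-equivalent spend is simply tokens divided by 1,000 times the per-1k rate:

# Hypothetical run: 12,000 input tokens against Anthropic
input_tokens = 12_000
cost_usd = (input_tokens / 1000) * custom_costs["anthropic"]
print(f"${cost_usd:.4f}")  # -> $0.0360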

Scenario-level pinning

For benchmarks that should ALWAYS run on a specific provider regardless of the operator's environment, pin the provider via ai_config in the scenario JSON:

{
  "id": "BENCH-OLLAMA-001",
  "name": "Ollama llama3.1 8B baseline",
  "type": "ai",
  "setup": {
    "ai_config": {
      "provider": "ollama",
      "model": "llama3.1:8b",
      "temperature": 0.0
    }
  }
}

The runner will use the scenario's ai_config rather than the global default — useful for committing reproducible benchmarks alongside the data.

See also