# CLEAR+ Benchmark Demo

This page walks through the CLEAR+ benchmark demo, an interactive demonstration that runs evaluation benchmark suites, visualises results across all seven CLEAR+ dimensions, and shows regression detection in action.
## Table of Contents
- Introduction and Prerequisites
- Launching the Demo
- Suite Selection
- Running Benchmarks
- Dimension Breakdown
- Regression Comparison
- Exporting Results
## Introduction and Prerequisites

### System Requirements

- Python 3.10+ with FCC installed (`pip install -e ".[dev]"`)
- No API key required (the demo uses `MockAIClient`)

### What This Demo Shows
The CLEAR+ Benchmark Demo demonstrates the full evaluation lifecycle: running benchmark suites against the simulation engine, inspecting per-dimension metrics, comparing results against baselines, and exporting results for CI integration.
## Launching the Demo

Run the demo script directly with Python. The invocation below is illustrative; the script's actual path depends on your checkout:
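```bash
# Illustrative launch command -- the script path is an assumption,
# not a confirmed repository location.
python examples/clear_benchmark_demo.py
```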
The demo prints a step-by-step walkthrough to the console.
## Suite Selection

The demo loads the baseline benchmark suite from `src/fcc/data/evaluation/baseline_benchmarks.yaml` and displays the suite name, description, and number of specs.
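A minimal sketch of that step, assuming the suite file is plain YAML with `name`, `description`, and `specs` keys (key names inferred from the fields the demo prints, not confirmed):

```python
# Sketch: inspect the suite file directly. The key names below are
# assumptions inferred from what the demo displays.
import yaml

with open("src/fcc/data/evaluation/baseline_benchmarks.yaml") as fh:
    suite = yaml.safe_load(fh)

print(f"Suite: {suite['name']}")
print(f"Description: {suite['description']}")
print(f"Specs: {len(suite['specs'])}")
```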
## Running Benchmarks

Each spec is executed through the `BenchmarkRunner` with `mock=True`. The demo shows a progress indicator for each spec:
```text
[1/3] Running: core-workflow-baseline ...
PASS | Cost: 0 | Latency: 0ms | Efficacy: 100% | Coverage: 100%
[2/3] Running: extended-workflow-stress ...
PASS | Cost: 0 | Latency: 0ms | Efficacy: 100% | Coverage: 100%
[3/3] Running: governance-audit-flow ...
PASS | Cost: 0 | Latency: 0ms | Efficacy: 100% | Coverage: 100%
```
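In code, the loop looks roughly like the sketch below. Only `BenchmarkRunner` and `mock=True` come from the text above; the import path, the `run()` method, and the spec fields are assumptions:

```python
# Sketch of the execution loop. BenchmarkRunner and mock=True are named
# in the docs; the import path, run(), and spec["name"] are assumed.
from fcc.evaluation import BenchmarkRunner  # hypothetical import path

runner = BenchmarkRunner(mock=True)  # MockAIClient under the hood, no API key
results = []
specs = suite["specs"]
for i, spec in enumerate(specs, start=1):
    print(f"[{i}/{len(specs)}] Running: {spec['name']} ...")
    results.append(runner.run(spec))  # run() is an assumed method name
```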
## Dimension Breakdown
After all specs complete, the demo displays a per-dimension summary table:
| Spec | Cost (tokens) | Latency (ms) | Efficacy | Assurance | Reliability | Coverage | Explainability |
|---|---|---|---|---|---|---|---|
| core-workflow-baseline | 0 | 0 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| extended-workflow-stress | 0 | 0 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| governance-audit-flow | 0 | 0 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
In mock mode, all dimensions show ideal values. With a real AI provider, values reflect actual token consumption, latency, and quality measures.
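For reference, a sketch of how the summary table could be assembled from the result objects; the `dimensions` mapping and `spec_name` attribute are assumed names, not the project's confirmed API:

```python
# Sketch: render the seven-dimension summary. The dimension keys mirror
# the table above; the result attribute names are assumptions.
DIMENSIONS = ("cost", "latency", "efficacy", "assurance",
              "reliability", "coverage", "explainability")

for result in results:
    row = " | ".join(f"{result.dimensions[d]:.2f}" for d in DIMENSIONS)
    print(f"{result.spec_name} | {row}")
```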
## Regression Comparison
The demo demonstrates regression detection by running the suite twice with different parameters and comparing results:
```text
Comparing baseline vs. candidate (threshold: 5%)
Regressions: 0
Improvements: 0
Unchanged: 3
Result: No regressions detected.
```
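The comparison amounts to a per-spec delta check against the threshold. A minimal sketch of that idea (the `compare` function and the flat score dictionaries are illustrative, not the project's actual API):

```python
# Sketch of threshold-based regression detection. Scores are assumed to
# be higher-is-better values in [0, 1], keyed by spec name.
def compare(baseline: dict[str, float],
            candidate: dict[str, float],
            threshold: float = 0.05):
    regressions, improvements, unchanged = [], [], []
    for spec, base in baseline.items():
        delta = candidate[spec] - base
        if delta < -threshold:
            regressions.append(spec)
        elif delta > threshold:
            improvements.append(spec)
        else:
            unchanged.append(spec)
    return regressions, improvements, unchanged
```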
## Exporting Results
The demo saves results to a temporary YAML file and shows how to load them for later comparison:
```python
runner.serialize_results(results, "demo_results.yaml")
loaded = BenchmarkRunner.load_results("demo_results.yaml")
```
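Loaded results can feed the same comparison step in CI. A sketch under the assumptions above, where `compare` is the illustrative function from the previous section and `scores_by_spec` is a hypothetical helper that reduces each result to a single per-spec score:

```python
# Illustrative CI gate: compare a fresh run against the stored baseline
# and fail loudly on regressions. scores_by_spec() is hypothetical.
baseline_scores = scores_by_spec(loaded)
candidate_scores = scores_by_spec(results)

regressions, _, _ = compare(baseline_scores, candidate_scores)
if regressions:
    raise SystemExit(f"Regressions detected: {regressions}")
```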
## Tips

- Run with `--verbose` for detailed per-step simulation output
- Use the Streamlit benchmark app (`apps/streamlit/benchmark_explorer.py`) for interactive visualisation
- Modify `src/fcc/data/evaluation/baseline_benchmarks.yaml` to add custom specs