# CLEAR+ Benchmark Demo

This page walks through the CLEAR+ benchmark demo, an interactive demonstration that runs evaluation benchmark suites, visualises results across all seven CLEAR+ dimensions, and shows regression detection in action.
## Table of Contents
- Introduction and Prerequisites
- Launching the Demo
- Suite Selection
- Running Benchmarks
- Dimension Breakdown
- Regression Comparison
- Exporting Results
## Introduction and Prerequisites

### System Requirements

- Python 3.10+ with FCC installed (`pip install -e ".[dev]"`)
- No API key required (the demo uses `MockAIClient`)

### What This Demo Shows
The CLEAR+ Benchmark Demo demonstrates the full evaluation lifecycle: running benchmark suites against the simulation engine, inspecting per-dimension metrics, comparing results against baselines, and exporting results for CI integration.
## Launching the Demo

Run the demo script directly with Python. The invocation below is illustrative; the script's actual path depends on your checkout:
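```bash
# Illustrative launch command -- the script path is an assumption,
# not a confirmed repository location.
python examples/clear_benchmark_demo.py
```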
The demo prints a step-by-step walkthrough to the console.
## Suite Selection

The demo loads the baseline benchmark suite from `src/fcc/data/evaluation/baseline_benchmarks.yaml` and displays the suite name, description, and number of specs.
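A minimal sketch of that step, assuming the suite file is plain YAML with `name`, `description`, and `specs` keys (key names inferred from the fields the demo prints, not confirmed):

```python
# Sketch: inspect the suite file directly. The key names below are
# assumptions inferred from what the demo displays.
import yaml

with open("src/fcc/data/evaluation/baseline_benchmarks.yaml") as fh:
    suite = yaml.safe_load(fh)

print(f"Suite: {suite['name']}")
print(f"Description: {suite['description']}")
print(f"Specs: {len(suite['specs'])}")
```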
## Running Benchmarks

Each spec is executed through the `BenchmarkRunner` with `mock=True`. The demo shows a progress indicator for each spec:
```text
[1/3] Running: core-workflow-baseline ...
PASS | Cost: 0 | Latency: 0ms | Efficacy: 100% | Coverage: 100%
[2/3] Running: extended-workflow-stress ...
PASS | Cost: 0 | Latency: 0ms | Efficacy: 100% | Coverage: 100%
[3/3] Running: governance-audit-flow ...
PASS | Cost: 0 | Latency: 0ms | Efficacy: 100% | Coverage: 100%
```
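In code, the loop looks roughly like the sketch below. Only `BenchmarkRunner` and `mock=True` come from the text above; the import path, the `run()` method, and the spec fields are assumptions:

```python
# Sketch of the execution loop. BenchmarkRunner and mock=True are named
# in the docs; the import path, run(), and spec["name"] are assumed.
from fcc.evaluation import BenchmarkRunner  # hypothetical import path

runner = BenchmarkRunner(mock=True)  # MockAIClient under the hood, no API key
results = []
specs = suite["specs"]
for i, spec in enumerate(specs, start=1):
    print(f"[{i}/{len(specs)}] Running: {spec['name']} ...")
    results.append(runner.run(spec))  # run() is an assumed method name
```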
## Dimension Breakdown
After all specs complete, the demo displays a per-dimension summary table:
| Spec | Cost (tokens) | Latency (ms) | Efficacy | Assurance | Reliability | Coverage | Explainability |
|---|---|---|---|---|---|---|---|
| core-workflow-baseline | 0 | 0 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| extended-workflow-stress | 0 | 0 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| governance-audit-flow | 0 | 0 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
In mock mode, all dimensions show ideal values. With a real AI provider, values reflect actual token consumption, latency, and quality measures.
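For reference, a sketch of how the summary table could be assembled from the result objects; the `dimensions` mapping and `spec_name` attribute are assumed names, not the project's confirmed API:

```python
# Sketch: render the seven-dimension summary. The dimension keys mirror
# the table above; the result attribute names are assumptions.
DIMENSIONS = ("cost", "latency", "efficacy", "assurance",
              "reliability", "coverage", "explainability")

for result in results:
    row = " | ".join(f"{result.dimensions[d]:.2f}" for d in DIMENSIONS)
    print(f"{result.spec_name} | {row}")
```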
## Regression Comparison
The demo demonstrates regression detection by running the suite twice with different parameters and comparing results:
```text
Comparing baseline vs. candidate (threshold: 5%)
Regressions: 0
Improvements: 0
Unchanged: 3
Result: No regressions detected.
```
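The comparison amounts to a per-spec delta check against the threshold. A minimal sketch of that idea (the `compare` function and the flat score dictionaries are illustrative, not the project's actual API):

```python
# Sketch of threshold-based regression detection. Scores are assumed to
# be higher-is-better values in [0, 1], keyed by spec name.
def compare(baseline: dict[str, float],
            candidate: dict[str, float],
            threshold: float = 0.05):
    regressions, improvements, unchanged = [], [], []
    for spec, base in baseline.items():
        delta = candidate[spec] - base
        if delta < -threshold:
            regressions.append(spec)
        elif delta > threshold:
            improvements.append(spec)
        else:
            unchanged.append(spec)
    return regressions, improvements, unchanged
```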
## Exporting Results
The demo saves results to a temporary YAML file and shows how to load them for later comparison:
```python
runner.serialize_results(results, "demo_results.yaml")
loaded = BenchmarkRunner.load_results("demo_results.yaml")
```
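Loaded results can feed the same comparison step in CI. A sketch under the assumptions above, where `compare` is the illustrative function from the previous section and `scores_by_spec` is a hypothetical helper that reduces each result to a single per-spec score:

```python
# Illustrative CI gate: compare a fresh run against the stored baseline
# and fail loudly on regressions. scores_by_spec() is hypothetical.
baseline_scores = scores_by_spec(loaded)
candidate_scores = scores_by_spec(results)

regressions, _, _ = compare(baseline_scores, candidate_scores)
if regressions:
    raise SystemExit(f"Regressions detected: {regressions}")
```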
## Tips

- Run with `--verbose` for detailed per-step simulation output
- Use the Streamlit benchmark app (`apps/streamlit/benchmark_explorer.py`) for interactive visualisation
- Modify `src/fcc/data/evaluation/baseline_benchmarks.yaml` to add custom specs