CLEAR+ Benchmark Demo

This guide walks through the CLEAR+ benchmark demo: an interactive demonstration that runs evaluation benchmarks, visualises results across all seven CLEAR+ dimensions, and shows regression detection in action.


Table of Contents

  1. Introduction and Prerequisites
  2. Launching the Demo
  3. Suite Selection
  4. Running Benchmarks
  5. Dimension Breakdown
  6. Regression Comparison
  7. Exporting Results

Introduction and Prerequisites

System Requirements

  • Python 3.10+ with FCC installed (pip install -e ".[dev]")
  • No API key required (demo uses MockAIClient)

What This Demo Shows

The CLEAR+ Benchmark Demo demonstrates the full evaluation lifecycle: running benchmark suites against the simulation engine, inspecting per-dimension metrics, comparing results against baselines, and exporting results for CI integration.


Launching the Demo

fcc demo run clear-benchmark

Or run directly:

from fcc.demos.clear_benchmark_demo import run_demo
run_demo()

The demo prints a step-by-step walkthrough to the console.


Suite Selection

The demo loads the baseline benchmark suite from src/fcc/data/evaluation/baseline_benchmarks.yaml. It displays the suite name, description, and number of specs:

Suite: baseline
Description: Baseline CLEAR+ benchmarks for core workflows
Specs: 3
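
To inspect the suite outside the demo, you can read the YAML file directly. The snippet below is a minimal sketch using PyYAML; the top-level keys it reads (name, description, specs) are assumptions about the file's schema, so check baseline_benchmarks.yaml itself for the real keys.

# Sketch: read the suite file directly. The keys used here
# ('name', 'description', 'specs') are assumed, not documented.
from pathlib import Path
import yaml

suite_path = Path("src/fcc/data/evaluation/baseline_benchmarks.yaml")
suite = yaml.safe_load(suite_path.read_text())

print(f"Suite: {suite.get('name')}")
print(f"Description: {suite.get('description')}")
print(f"Specs: {len(suite.get('specs', []))}")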

Running Benchmarks

Each spec is executed through the BenchmarkRunner with mock=True. The demo shows a progress indicator for each spec:

[1/3] Running: core-workflow-baseline ...
  PASS | Cost: 0 | Latency: 0ms | Efficacy: 100% | Coverage: 100%
[2/3] Running: extended-workflow-stress ...
  PASS | Cost: 0 | Latency: 0ms | Efficacy: 100% | Coverage: 100%
[3/3] Running: governance-audit-flow ...
  PASS | Cost: 0 | Latency: 0ms | Efficacy: 100% | Coverage: 100%
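
If you want to drive the runner yourself rather than through the demo, the sketch below shows the general shape. Only BenchmarkRunner and the mock=True flag are named in this walkthrough; the import path, the run() call, and the result attributes are assumptions chosen to illustrate the flow.

# Illustrative sketch -- the import path, run() signature, and result
# attributes are assumptions; only BenchmarkRunner and mock=True are
# taken from this walkthrough.
from fcc.evaluation import BenchmarkRunner  # hypothetical module path

runner = BenchmarkRunner(mock=True)  # MockAIClient under the hood, no API key needed

results = []
for i, spec in enumerate(specs, start=1):  # specs: spec objects from the loaded suite (shape assumed)
    result = runner.run(spec)              # hypothetical run() method
    results.append(result)
    status = "PASS" if result.passed else "FAIL"
    print(f"[{i}/{len(specs)}] Running: {spec.name} ... {status}")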

Dimension Breakdown

After all specs complete, the demo displays a per-dimension summary table:

Spec                       Cost  Latency  Efficacy  Assurance  Reliability  Coverage  Explain.
core-workflow-baseline        0        0      1.00       1.00         1.00      1.00      1.00
extended-workflow-stress      0        0      1.00       1.00         1.00      1.00      1.00
governance-audit-flow         0        0      1.00       1.00         1.00      1.00      1.00

In mock mode, all dimensions show ideal values. With a real AI provider, values reflect actual token consumption, latency, and quality measures.
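
A table like the one above can be rendered from the results with plain string formatting. This sketch assumes each result exposes a spec name and a mapping of dimension names to scores; those attribute names are illustrative, not the library's actual API.

# Sketch: render a per-dimension table. 'spec_name' and 'dimension_scores'
# are hypothetical attribute names chosen for illustration.
DIMENSIONS = ["cost", "latency", "efficacy", "assurance",
              "reliability", "coverage", "explainability"]

print(f"{'Spec':<26}" + "".join(f"{d.title():>13}" for d in DIMENSIONS))
for result in results:  # results collected from the run step above
    scores = result.dimension_scores
    print(f"{result.spec_name:<26}"
          + "".join(f"{scores[d]:>13.2f}" for d in DIMENSIONS))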


Regression Comparison

Next, the demo shows regression detection in action: it runs the suite a second time with different parameters and compares the two result sets against a tolerance threshold:

Comparing baseline vs. candidate (threshold: 5%)
  Regressions: 0
  Improvements: 0
  Unchanged: 3
  Result: No regressions detected.
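
Conceptually, the comparison is a per-dimension delta check against a relative threshold. The function below is a simplified sketch of that logic, not the BenchmarkRunner's actual comparison API; for brevity it treats higher scores as better for every dimension, which a real implementation would invert for cost and latency.

# Simplified sketch of threshold-based regression detection. Treats higher
# as better for every dimension; a real comparison would invert that for
# cost and latency.
def compare(baseline, candidate, threshold=0.05):
    regressions, improvements, unchanged = [], [], 0
    for spec, base_scores in baseline.items():
        for dim, base in base_scores.items():
            new = candidate[spec][dim]
            delta = (new - base) / base if base else new - base
            if delta < -threshold:
                regressions.append((spec, dim, delta))
            elif delta > threshold:
                improvements.append((spec, dim, delta))
            else:
                unchanged += 1
    return regressions, improvements, unchanged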

Exporting Results

The demo saves results to a temporary YAML file and shows how to load them for later comparison:

# Persist this run's results so a later job can compare against them.
runner.serialize_results(results, "demo_results.yaml")

# Reload them later, e.g. from a CI pipeline.
loaded = BenchmarkRunner.load_results("demo_results.yaml")
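
For CI integration, the exported file can serve as a stored baseline that later runs are checked against. The sketch below is a hypothetical gate built on the compare() helper above; only serialize_results and load_results come from the demo itself, and the shape of the loaded data is assumed.

# Hypothetical CI gate: fail the job when a stored baseline regresses.
# load_results() is from the demo; compare() is the sketch above; the
# loaded-result shape is assumed.
import sys

baseline = BenchmarkRunner.load_results("baseline_results.yaml")
candidate = BenchmarkRunner.load_results("demo_results.yaml")

regressions, improvements, unchanged = compare(baseline, candidate)
if regressions:
    for spec, dim, delta in regressions:
        print(f"REGRESSION {spec}/{dim}: {delta:+.1%}")
    sys.exit(1)  # nonzero exit marks the CI step as failed
print(f"No regressions ({unchanged} unchanged, {len(improvements)} improved).")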

Tips

  • Run with --verbose for detailed per-step simulation output
  • Use the Streamlit benchmark app (apps/streamlit/benchmark_explorer.py) for interactive visualisation
  • Modify src/fcc/data/evaluation/baseline_benchmarks.yaml to add custom specs