CLEAR+ Benchmarking Guide¶
Duration: 60 minutes
Level: Advanced
Module: fcc.evaluation
This tutorial walks you through configuring, running, and interpreting CLEAR+ benchmarks for FCC personas and workflows. You will define benchmark specs, execute suites, compare results across releases, and integrate benchmarks into CI pipelines.
Prerequisites¶
- Completed beginner tutorials and at least one simulation run
- Familiarity with `PersonaRegistry`, `WorkflowGraph`, and `SimulationEngine`
- FCC installed with dev dependencies (`pip install -e ".[dev]"`)
Understanding the Seven Dimensions¶
CLEAR+ measures agent system quality across seven dimensions:
| Dimension | Direction | What It Captures |
|---|---|---|
| Cost | Lower is better | Token consumption or API call count |
| Latency | Lower is better | Wall-clock execution time in milliseconds |
| Efficacy | Higher is better | How well the task was completed (0--1) |
| Assurance | Higher is better | Safety score from constitution validation (0--1) |
| Reliability | Higher is better | Consistency of results across repeated runs (0--1) |
| Coverage | Higher is better | What fraction of target personas were activated (0--1) |
| Explainability | Higher is better | What fraction of events carry meaningful payloads (0--1) |
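The direction matters when you set thresholds: cost and latency act as ceilings, while the other five dimensions act as floors. The sketch below illustrates that convention using the `CLEARPlusMetrics` fields shown in Step 1; it is not the runner's internal pass/fail logic, just a way to see how the directions combine.

```python
from fcc.evaluation.metrics import CLEARPlusMetrics

def meets_thresholds(observed: CLEARPlusMetrics, limits: CLEARPlusMetrics) -> bool:
    """Illustrative only: cost/latency are upper bounds, the rest are lower bounds."""
    return (
        observed.cost <= limits.cost
        and observed.latency_ms <= limits.latency_ms
        and observed.efficacy >= limits.efficacy
        and observed.assurance >= limits.assurance
        and observed.reliability >= limits.reliability
        and observed.coverage >= limits.coverage
        and observed.explainability >= limits.explainability
    )
```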
Step 1: Create a Benchmark Spec¶
A `BenchmarkSpec` defines a single evaluation scenario:
```python
from fcc.evaluation.benchmark import BenchmarkSpec
from fcc.evaluation.metrics import CLEARPlusMetrics

spec = BenchmarkSpec(
    name="governance-workflow-test",
    description="Test governance personas in extended workflow",
    scenario="GEN-001",
    personas=("DGS", "RAE", "EAG", "PIE"),
    expected_outcomes=("governance_report", "compliance_summary"),
    clear_thresholds=CLEARPlusMetrics(
        cost=3000.0,
        latency_ms=1500.0,
        efficacy=0.70,
        assurance=0.80,
        reliability=0.90,
        coverage=0.75,
        explainability=0.50,
    ),
    tags=("governance", "phase-14"),
    workflow_graph="base_sequence",
)
```
Step 2: Build a Benchmark Suite¶
Group specs into a suite:
```python
from fcc.evaluation.benchmark import BenchmarkSuite

suite = BenchmarkSuite(
    name="phase-14-governance",
    description="Phase 14 governance benchmarks",
    specs=(spec,),
    version="1.0",
)
```
Or load from YAML:
```python
suite = BenchmarkSuite.from_yaml("src/fcc/data/evaluation/baseline_benchmarks.yaml")
print(f"Suite: {suite.name} ({len(suite.specs)} specs)")
```
Step 3: Run the Suite¶
```python
from fcc.evaluation.runner import BenchmarkRunner

runner = BenchmarkRunner(mock=True, max_steps=50)
results = runner.run_suite(suite)

for result in results:
    status = "PASS" if result.passed else "FAIL"
    print(f"[{status}] {result.spec.name}")
    print(f"  Cost: {result.metrics.cost:.0f}")
    print(f"  Latency: {result.metrics.latency_ms:.0f} ms")
    print(f"  Efficacy: {result.metrics.efficacy:.2%}")
    print(f"  Assurance: {result.metrics.assurance:.2%}")
    print(f"  Reliability: {result.metrics.reliability:.2%}")
    print(f"  Coverage: {result.metrics.coverage:.2%}")
    print(f"  Explainability: {result.metrics.explainability:.2%}")
```
Step 4: Persist and Compare Results¶
Save results for historical tracking:
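A minimal sketch, assuming the runner exposes a `save_results` counterpart to the `load_results` method used below; check your FCC version for the exact method name and signature:

```python
# Hypothetical call: assumes a save_results() counterpart to load_results().
runner.save_results(results, "benchmarks/candidate.yaml")
```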
Compare against a previous baseline:
```python
baseline = BenchmarkRunner.load_results("benchmarks/baseline_v0.8.yaml")
comparison = runner.compare(baseline, results, threshold=0.05)

if comparison.has_regressions:
    print("REGRESSIONS:")
    for r in comparison.regressions:
        print(f"  {r}")
else:
    print("No regressions detected.")

print(f"Improvements: {len(comparison.improvements)}")
print(f"Unchanged: {len(comparison.unchanged)}")
```
Step 5: Event Bus Integration¶
Connect the runner to the event bus for real-time monitoring:
```python
from fcc.messaging.bus import EventBus

bus = EventBus()
runner = BenchmarkRunner(mock=True, event_bus=bus)

# Subscribe to benchmark events
bus.subscribe_all(lambda e: print(f"  EVENT: {e.event_type.value}"))

results = runner.run_suite(suite)
```
Events emitted: `benchmark.started`, `benchmark.completed`, and `benchmark.regression`.
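If you only want to react to regressions, filter inside the callback; a small sketch, assuming the event type values match the names listed above:

```python
def on_benchmark_event(event):
    # Ignore started/completed events and surface only regressions.
    if event.event_type.value == "benchmark.regression":
        print(f"Regression event: {event}")

bus.subscribe_all(on_benchmark_event)
```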
Step 6: CI Integration¶
Add to your `.github/workflows/benchmark.yml`:

```yaml
- name: Run CLEAR+ benchmarks
  run: fcc benchmark run --suite baseline --output results.yaml

- name: Check for regressions
  run: fcc benchmark compare --baseline benchmarks/baseline.yaml --candidate results.yaml
  continue-on-error: true
```

Use `--strict` to fail the build on regressions.
Interpreting Results¶
When Efficacy Drops¶
Low efficacy usually means the simulation did not complete enough steps.
Check `max_steps` and workflow graph complexity.
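One quick check is to rerun the suite with a larger step budget and see whether efficacy recovers, using the same constructor arguments as Step 3:

```python
# Rerun with a higher step cap to see whether the budget was the bottleneck.
runner = BenchmarkRunner(mock=True, max_steps=200)
results = runner.run_suite(suite)
```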
When Coverage Drops¶
Low coverage means some target personas were not activated during the simulation. Verify that the workflow graph includes paths to all target personas.
When Assurance Drops¶
Low assurance indicates constitution validation issues. Review hard-stop rules for the affected personas.
Summary¶
In this tutorial you learned how to:
- Define benchmark specs with CLEAR+ thresholds
- Group specs into suites and load from YAML
- Run benchmarks with mock or real AI providers
- Persist results and compare across releases
- Integrate benchmarks into CI pipelines
- Interpret dimension-level results
Next Steps¶
- Model Card Generation -- Generate structured documentation from benchmarks
- EU AI Act Compliance -- Map benchmarks to regulatory requirements
- Benchmark Interpretation Guide -- Deep dive into result analysis