
CLEAR+ Benchmarking Guide

Duration: 60 minutes | Level: Advanced | Module: fcc.evaluation

This tutorial walks you through configuring, running, and interpreting CLEAR+ benchmarks for FCC personas and workflows. You will define benchmark specs, execute suites, compare results across releases, and integrate benchmarks into CI pipelines.

Prerequisites

  • Completed beginner tutorials and at least one simulation run
  • Familiarity with PersonaRegistry, WorkflowGraph, and SimulationEngine
  • FCC installed with dev dependencies (pip install -e ".[dev]")

Understanding the Seven Dimensions

CLEAR+ measures agent system quality across seven dimensions:

Dimension       Direction         What It Captures
Cost            Lower is better   Token consumption or API call count
Latency         Lower is better   Wall-clock execution time in milliseconds
Efficacy        Higher is better  How well the task was completed (0-1)
Assurance       Higher is better  Safety score from constitution validation (0-1)
Reliability     Higher is better  Consistency of results across repeated runs (0-1)
Coverage        Higher is better  What fraction of target personas were activated (0-1)
Explainability  Higher is better  What fraction of events carry meaningful payloads (0-1)
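
These directions matter when checking observed metrics against thresholds: cost and latency_ms act as upper bounds, while the other five dimensions act as lower bounds. The helper below is a minimal sketch of that logic; meets_thresholds is our own illustration, not part of the FCC API, and it assumes only the CLEARPlusMetrics fields used later in this tutorial.

from fcc.evaluation.metrics import CLEARPlusMetrics

def meets_thresholds(observed: CLEARPlusMetrics, thresholds: CLEARPlusMetrics) -> bool:
    """Return True when observed metrics satisfy every CLEAR+ threshold."""
    # Cost and latency are "lower is better": observed values must not exceed the thresholds.
    if observed.cost > thresholds.cost or observed.latency_ms > thresholds.latency_ms:
        return False
    # The remaining dimensions are "higher is better": observed values must reach the thresholds.
    return all(
        getattr(observed, name) >= getattr(thresholds, name)
        for name in ("efficacy", "assurance", "reliability", "coverage", "explainability")
    )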

Step 1: Create a Benchmark Spec

A BenchmarkSpec defines a single evaluation scenario:

from fcc.evaluation.benchmark import BenchmarkSpec
from fcc.evaluation.metrics import CLEARPlusMetrics

spec = BenchmarkSpec(
    name="governance-workflow-test",
    description="Test governance personas in extended workflow",
    scenario="GEN-001",
    personas=("DGS", "RAE", "EAG", "PIE"),
    expected_outcomes=("governance_report", "compliance_summary"),
    clear_thresholds=CLEARPlusMetrics(
        cost=3000.0,
        latency_ms=1500.0,
        efficacy=0.70,
        assurance=0.80,
        reliability=0.90,
        coverage=0.75,
        explainability=0.50,
    ),
    tags=("governance", "phase-14"),
    workflow_graph="base_sequence",
)

Step 2: Build a Benchmark Suite

Group specs into a suite:

from fcc.evaluation.benchmark import BenchmarkSuite

suite = BenchmarkSuite(
    name="phase-14-governance",
    description="Phase 14 governance benchmarks",
    specs=(spec,),
    version="1.0",
)

Or load from YAML:

suite = BenchmarkSuite.from_yaml("src/fcc/data/evaluation/baseline_benchmarks.yaml")
print(f"Suite: {suite.name} ({len(suite.specs)} specs)")

Step 3: Run the Suite

from fcc.evaluation.runner import BenchmarkRunner

# mock=True exercises the mock provider instead of a real AI backend; max_steps caps each simulation run
runner = BenchmarkRunner(mock=True, max_steps=50)
results = runner.run_suite(suite)

for result in results:
    status = "PASS" if result.passed else "FAIL"
    print(f"[{status}] {result.spec.name}")
    print(f"  Cost: {result.metrics.cost:.0f}")
    print(f"  Latency: {result.metrics.latency_ms:.0f} ms")
    print(f"  Efficacy: {result.metrics.efficacy:.2%}")
    print(f"  Assurance: {result.metrics.assurance:.2%}")
    print(f"  Reliability: {result.metrics.reliability:.2%}")
    print(f"  Coverage: {result.metrics.coverage:.2%}")
    print(f"  Explainability: {result.metrics.explainability:.2%}")

Step 4: Persist and Compare Results

Save results for historical tracking:

runner.serialize_results(results, "benchmarks/phase14_v1.yaml")

Compare against a previous baseline:

baseline = BenchmarkRunner.load_results("benchmarks/baseline_v0.8.yaml")
comparison = runner.compare(baseline, results, threshold=0.05)

if comparison.has_regressions:
    print("REGRESSIONS:")
    for r in comparison.regressions:
        print(f"  {r}")
else:
    print("No regressions detected.")

print(f"Improvements: {len(comparison.improvements)}")
print(f"Unchanged: {len(comparison.unchanged)}")

Step 5: Event Bus Integration

Connect the runner to the event bus for real-time monitoring:

from fcc.messaging.bus import EventBus

bus = EventBus()
runner = BenchmarkRunner(mock=True, event_bus=bus)

# Subscribe to benchmark events
bus.subscribe_all(lambda e: print(f"  EVENT: {e.event_type.value}"))

results = runner.run_suite(suite)

Events emitted: benchmark.started, benchmark.completed, benchmark.regression.
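
If you only care about some of those events, filter inside the handler instead of printing everything. This sketch reuses the subscribe_all API and the event_type.value attribute shown above; the event type string is the one listed in this step.

regression_events = []

def on_event(event):
    # Collect only regression events; ignore benchmark.started and benchmark.completed.
    if event.event_type.value == "benchmark.regression":
        regression_events.append(event)

bus.subscribe_all(on_event)
results = runner.run_suite(suite)
print(f"Regression events observed: {len(regression_events)}")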

Step 6: CI Integration

Add to your .github/workflows/benchmark.yml:

- name: Run CLEAR+ benchmarks
  run: fcc benchmark run --suite baseline --output results.yaml

- name: Check for regressions
  run: fcc benchmark compare --baseline benchmarks/baseline.yaml --candidate results.yaml
  continue-on-error: true

Add --strict to the compare command to fail the build when regressions are detected; in that case, remove continue-on-error: true so the failing step actually fails the job.

Interpreting Results

When Efficacy Drops

Low efficacy usually means the simulation did not complete enough steps. Check max_steps and workflow graph complexity.

When Coverage Drops

Low coverage means some target personas were not activated during the simulation. Verify that the workflow graph includes paths to all target personas.

When Assurance Drops

Low assurance indicates constitution validation issues. Review hard-stop rules for the affected personas.
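
To see at a glance which dimensions caused a failure, compare each failing result's metrics against its spec's thresholds. This diagnostic loop is our own sketch; it relies only on the result.spec, result.metrics, and clear_thresholds attributes used earlier in this tutorial.

# Report, per failing benchmark, which CLEAR+ dimensions missed their thresholds.
HIGHER_IS_BETTER = ("efficacy", "assurance", "reliability", "coverage", "explainability")
LOWER_IS_BETTER = ("cost", "latency_ms")

for result in results:
    if result.passed:
        continue
    thresholds = result.spec.clear_thresholds
    missed = [
        name for name in HIGHER_IS_BETTER
        if getattr(result.metrics, name) < getattr(thresholds, name)
    ] + [
        name for name in LOWER_IS_BETTER
        if getattr(result.metrics, name) > getattr(thresholds, name)
    ]
    print(f"{result.spec.name}: missed {', '.join(missed) if missed else 'none individually'}")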

Summary

In this tutorial you learned how to:

  • Define benchmark specs with CLEAR+ thresholds
  • Group specs into suites and load from YAML
  • Run benchmarks with mock or real AI providers
  • Persist results and compare across releases
  • Integrate benchmarks into CI pipelines
  • Interpret dimension-level results

Next Steps