CLEAR+ Benchmarking Guide¶
Duration: 60 minutes
Level: Advanced
Module: fcc.evaluation
This tutorial walks you through configuring, running, and interpreting CLEAR+ benchmarks for FCC personas and workflows. You will define benchmark specs, execute suites, compare results across releases, and integrate benchmarks into CI pipelines.
Prerequisites¶
- Completed beginner tutorials and at least one simulation run
- Familiarity with `PersonaRegistry`, `WorkflowGraph`, and `SimulationEngine`
- FCC installed with dev dependencies (`pip install -e ".[dev]"`)
Understanding the Seven Dimensions¶
CLEAR+ measures agent system quality across seven dimensions:
| Dimension | Direction | What It Captures |
|---|---|---|
| Cost | Lower is better | Token consumption or API call count |
| Latency | Lower is better | Wall-clock execution time in milliseconds |
| Efficacy | Higher is better | How well the task was completed (0--1) |
| Assurance | Higher is better | Safety score from constitution validation (0--1) |
| Reliability | Higher is better | Consistency of results across repeated runs (0--1) |
| Coverage | Higher is better | What fraction of target personas were activated (0--1) |
| Explainability | Higher is better | What fraction of events carry meaningful payloads (0--1) |
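The direction matters when you set thresholds: cost and latency act as ceilings, while the other five dimensions act as floors. The sketch below illustrates that convention using the `CLEARPlusMetrics` fields shown in Step 1; it is not the runner's internal pass/fail logic, just a way to see how the directions combine.

```python
from fcc.evaluation.metrics import CLEARPlusMetrics

def meets_thresholds(observed: CLEARPlusMetrics, limits: CLEARPlusMetrics) -> bool:
    """Illustrative only: cost/latency are upper bounds, the rest are lower bounds."""
    return (
        observed.cost <= limits.cost
        and observed.latency_ms <= limits.latency_ms
        and observed.efficacy >= limits.efficacy
        and observed.assurance >= limits.assurance
        and observed.reliability >= limits.reliability
        and observed.coverage >= limits.coverage
        and observed.explainability >= limits.explainability
    )
```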
Step 1: Create a Benchmark Spec¶
A `BenchmarkSpec` defines a single evaluation scenario:
```python
from fcc.evaluation.benchmark import BenchmarkSpec
from fcc.evaluation.metrics import CLEARPlusMetrics

spec = BenchmarkSpec(
    name="governance-workflow-test",
    description="Test governance personas in extended workflow",
    scenario="GEN-001",
    personas=("DGS", "RAE", "EAG", "PIE"),
    expected_outcomes=("governance_report", "compliance_summary"),
    clear_thresholds=CLEARPlusMetrics(
        cost=3000.0,
        latency_ms=1500.0,
        efficacy=0.70,
        assurance=0.80,
        reliability=0.90,
        coverage=0.75,
        explainability=0.50,
    ),
    tags=("governance", "phase-14"),
    workflow_graph="base_sequence",
)
```
Step 2: Build a Benchmark Suite¶
Group specs into a suite:
```python
from fcc.evaluation.benchmark import BenchmarkSuite

suite = BenchmarkSuite(
    name="phase-14-governance",
    description="Phase 14 governance benchmarks",
    specs=(spec,),
    version="1.0",
)
```
Or load from YAML:
```python
suite = BenchmarkSuite.from_yaml("src/fcc/data/evaluation/baseline_benchmarks.yaml")
print(f"Suite: {suite.name} ({len(suite.specs)} specs)")
```
Step 3: Run the Suite¶
```python
from fcc.evaluation.runner import BenchmarkRunner

runner = BenchmarkRunner(mock=True, max_steps=50)
results = runner.run_suite(suite)

for result in results:
    status = "PASS" if result.passed else "FAIL"
    print(f"[{status}] {result.spec.name}")
    print(f"  Cost: {result.metrics.cost:.0f}")
    print(f"  Latency: {result.metrics.latency_ms:.0f} ms")
    print(f"  Efficacy: {result.metrics.efficacy:.2%}")
    print(f"  Assurance: {result.metrics.assurance:.2%}")
    print(f"  Reliability: {result.metrics.reliability:.2%}")
    print(f"  Coverage: {result.metrics.coverage:.2%}")
    print(f"  Explainability: {result.metrics.explainability:.2%}")
```
Step 4: Persist and Compare Results¶
Save results for historical tracking:
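A minimal sketch, assuming the runner exposes a `save_results` counterpart to the `load_results` method used below; check your FCC version for the exact method name and signature:

```python
# Hypothetical call: assumes a save_results() counterpart to load_results().
runner.save_results(results, "benchmarks/candidate.yaml")
```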
Compare against a previous baseline:
```python
baseline = BenchmarkRunner.load_results("benchmarks/baseline_v0.8.yaml")
comparison = runner.compare(baseline, results, threshold=0.05)

if comparison.has_regressions:
    print("REGRESSIONS:")
    for r in comparison.regressions:
        print(f"  {r}")
else:
    print("No regressions detected.")

print(f"Improvements: {len(comparison.improvements)}")
print(f"Unchanged: {len(comparison.unchanged)}")
```
Step 5: Event Bus Integration¶
Connect the runner to the event bus for real-time monitoring:
```python
from fcc.messaging.bus import EventBus

bus = EventBus()
runner = BenchmarkRunner(mock=True, event_bus=bus)

# Subscribe to benchmark events
bus.subscribe_all(lambda e: print(f"  EVENT: {e.event_type.value}"))

results = runner.run_suite(suite)
```
Events emitted: `benchmark.started`, `benchmark.completed`, and `benchmark.regression`.
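If you only want to react to regressions, filter inside the callback; a small sketch, assuming the event type values match the names listed above:

```python
def on_benchmark_event(event):
    # Ignore started/completed events and surface only regressions.
    if event.event_type.value == "benchmark.regression":
        print(f"Regression event: {event}")

bus.subscribe_all(on_benchmark_event)
```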
Step 6: CI Integration¶
Add to your `.github/workflows/benchmark.yml`:

```yaml
- name: Run CLEAR+ benchmarks
  run: fcc benchmark run --suite baseline --output results.yaml

- name: Check for regressions
  run: fcc benchmark compare --baseline benchmarks/baseline.yaml --candidate results.yaml
  continue-on-error: true
```

Use `--strict` to fail the build on regressions.
Interpreting Results¶
When Efficacy Drops¶
Low efficacy usually means the simulation did not complete enough steps.
Check `max_steps` and workflow graph complexity.
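One quick check is to rerun the suite with a larger step budget and see whether efficacy recovers, using the same constructor arguments as Step 3:

```python
# Rerun with a higher step cap to see whether the budget was the bottleneck.
runner = BenchmarkRunner(mock=True, max_steps=200)
results = runner.run_suite(suite)
```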
When Coverage Drops¶
Low coverage means some target personas were not activated during the simulation. Verify that the workflow graph includes paths to all target personas.
When Assurance Drops¶
Low assurance indicates constitution validation issues. Review hard-stop rules for the affected personas.
Summary¶
In this tutorial you learned how to:
- Define benchmark specs with CLEAR+ thresholds
- Group specs into suites and load from YAML
- Run benchmarks with mock or real AI providers
- Persist results and compare across releases
- Integrate benchmarks into CI pipelines
- Interpret dimension-level results
Next Steps¶
- Model Card Generation -- Generate structured documentation from benchmarks
- EU AI Act Compliance -- Map benchmarks to regulatory requirements
- Benchmark Interpretation Guide -- Deep dive into result analysis