
Chapter 19: Evaluation & Benchmarking

Evaluation is what separates an experimental agent framework from a production-grade one. Without structured measurement, claims about persona quality, workflow efficiency, and governance compliance remain anecdotal. This chapter introduces the CLEAR+ evaluation methodology, the benchmark runner, model cards, datasheets, and the CI integration that ties everything together.

The flowchart below lays out the CLEAR+ evaluation pipeline: a BenchmarkSpec drives the runner and simulation, produces a seven-dimension metrics report, gates pass/fail against thresholds, and fans out to model cards and regression comparison.

flowchart TB
    subgraph CLEAR["CLEAR+ 7 Dimensions"]
        direction LR
        C[Cost<br/>Lower is better]
        L[Latency<br/>Lower is better]
        E[Efficacy<br/>Higher is better]
        A[Assurance<br/>Higher is better]
        R[Reliability<br/>Higher is better]
        COV[+Coverage<br/>Higher is better]
        EXP[+Explainability<br/>Higher is better]
    end

    SPEC[BenchmarkSpec] --> RUNNER[BenchmarkRunner]
    RUNNER --> SIM[Simulation Engine]
    SIM --> METRICS[CLEARPlusMetrics]
    METRICS --> THRESH{Meets Thresholds?}
    THRESH -->|Yes| PASS[PASS]
    THRESH -->|No| FAIL[FAIL + Violations]
    METRICS --> CARD[ModelCard Generation]
    METRICS --> COMP[BenchmarkComparison]
    COMP --> REG{Regressions?}

    style PASS fill:#4CAF50,color:#fff
    style FAIL fill:#d32f2f,color:#fff

Treating the seven dimensions separately is what lets teams set different thresholds for latency-sensitive versus safety-critical personas.

CLEAR+ Methodology

CLEAR+ is a seven-dimension evaluation framework purpose-built for multi-persona agent systems. The name is a mnemonic: CLEAR spells the five core dimensions, and the plus marks the two multi-agent extensions:

| Dimension | Direction | Description |
| --- | --- | --- |
| Cost | Lower is better | Total tokens or API calls consumed |
| Latency | Lower is better | End-to-end execution time (ms) |
| Efficacy | Higher is better | Task completion quality (0.0–1.0) |
| Assurance | Higher is better | Safety and confidence score (0.0–1.0) |
| Reliability | Higher is better | Consistency across repeated runs (0.0–1.0) |
| +Coverage | Higher is better | Persona / workflow activation ratio (0.0–1.0) |
| +Explainability | Higher is better | Trace and rationale completeness (0.0–1.0) |

The first five dimensions follow the original CLEAR model. The two "plus" dimensions, Coverage and Explainability, extend it to capture concerns specific to multi-agent systems, where incomplete persona activation and opaque reasoning are common failure modes.

CLEARPlusMetrics

The CLEARPlusMetrics frozen dataclass holds all seven dimensions:

from fcc.evaluation.metrics import CLEARPlusMetrics

metrics = CLEARPlusMetrics(
    cost=1200.0,
    latency_ms=450.0,
    efficacy=0.92,
    assurance=0.88,
    reliability=0.95,
    coverage=0.85,
    explainability=0.78,
)

# Inspect all dimension names
print(metrics.dimension_names)
# ('cost', 'latency_ms', 'efficacy', 'assurance', 'reliability', 'coverage', 'explainability')

# Convert to a numeric vector for downstream analysis
print(metrics.as_vector())
# (1200.0, 450.0, 0.92, 0.88, 0.95, 0.85, 0.78)

Threshold Checking

CLEARPlusMetrics includes a meets_thresholds method that compares actual measurements against expected bounds:

thresholds = CLEARPlusMetrics(
    cost=2000.0,       # max cost
    latency_ms=1000.0, # max latency
    efficacy=0.80,     # min efficacy
    assurance=0.70,    # min assurance
    reliability=0.90,  # min reliability
    coverage=0.75,     # min coverage
    explainability=0.60, # min explainability
)

passed, violations = metrics.meets_thresholds(thresholds)
print(f"Passed: {passed}")  # True
print(f"Violations: {violations}")  # []

For cost and latency (lower-is-better), the threshold is an upper bound. For the remaining five dimensions (higher-is-better), the threshold is a lower bound. Any violation produces a human-readable description in the violations list.
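
The comparison itself is mechanical. A minimal sketch of the semantics described above (not the library's implementation), using the dimension_names and as_vector helpers shown earlier:

# Sketch of the threshold semantics; not the actual CLEARPlusMetrics source.
LOWER_IS_BETTER = ("cost", "latency_ms")

def check_thresholds(actual, thresholds):
    """Return (passed, violations) following the CLEAR+ direction rules."""
    violations = []
    for name, value, bound in zip(
        actual.dimension_names, actual.as_vector(), thresholds.as_vector()
    ):
        if name in LOWER_IS_BETTER:
            if value > bound:  # threshold is an upper bound
                violations.append(f"{name}: {value} exceeds max {bound}")
        elif value < bound:  # threshold is a lower bound
            violations.append(f"{name}: {value} below min {bound}")
    return (not violations, violations)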

Building Metrics from Simulation Results

The factory method CLEARPlusMetrics.from_simulation_result extracts cost and latency from an AISimulationResult and accepts keyword arguments for the remaining dimensions:

metrics = CLEARPlusMetrics.from_simulation_result(
    simulation_result,
    efficacy=0.90,
    assurance=0.85,
    coverage=0.80,
    explainability=0.75,
)

Running CLEAR+ Benchmarks

Benchmark Specifications

A BenchmarkSpec defines what to benchmark:

from fcc.evaluation.benchmark import BenchmarkSpec
from fcc.evaluation.metrics import CLEARPlusMetrics

spec = BenchmarkSpec(
    name="core-workflow-baseline",
    description="Baseline benchmark for core 5-persona workflow",
    scenario="GEN-001",
    personas=("RC", "BC", "QR", "IG", "PM"),
    expected_outcomes=("research_brief", "blueprint", "quality_report"),
    clear_thresholds=CLEARPlusMetrics(
        cost=5000.0,
        latency_ms=2000.0,
        efficacy=0.75,
        assurance=0.70,
        reliability=0.85,
        coverage=0.80,
        explainability=0.60,
    ),
    tags=("core", "baseline"),
    workflow_graph="base_sequence",
)

Benchmark Suites

Group related specs into a BenchmarkSuite:

from fcc.evaluation.benchmark import BenchmarkSuite

suite = BenchmarkSuite(
    name="baseline",
    description="Baseline benchmarks for all core workflows",
    specs=(spec,),
    version="1.0",
)

Suites can also be loaded from YAML files:

suite = BenchmarkSuite.from_yaml("src/fcc/data/evaluation/baseline_benchmarks.yaml")
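
The on-disk schema is defined by the library; as a rough illustration, a suite file plausibly looks like this, assuming the YAML keys mirror the BenchmarkSuite and BenchmarkSpec fields shown above:

# Hypothetical suite file; key names assumed to mirror the dataclasses.
name: baseline
description: Baseline benchmarks for all core workflows
version: "1.0"
specs:
  - name: core-workflow-baseline
    description: Baseline benchmark for core 5-persona workflow
    scenario: GEN-001
    personas: [RC, BC, QR, IG, PM]
    expected_outcomes: [research_brief, blueprint, quality_report]
    clear_thresholds:
      cost: 5000.0
      latency_ms: 2000.0
      efficacy: 0.75
      assurance: 0.70
      reliability: 0.85
      coverage: 0.80
      explainability: 0.60
    tags: [core, baseline]
    workflow_graph: base_sequence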

BenchmarkRunner

The BenchmarkRunner executes specs through the simulation engine and measures all seven CLEAR+ dimensions:

from fcc.evaluation.runner import BenchmarkRunner

runner = BenchmarkRunner(mock=True, max_steps=50)

# Run a full suite
results = runner.run_suite(suite)

for result in results:
    print(f"{result.spec.name}: {'PASS' if result.passed else 'FAIL'}")
    print(f"  Cost: {result.metrics.cost:.0f} tokens")
    print(f"  Latency: {result.metrics.latency_ms:.0f} ms")
    print(f"  Efficacy: {result.metrics.efficacy:.2f}")
    print(f"  Coverage: {result.metrics.coverage:.2f}")
    if result.details:
        for detail in result.details:
            print(f"  Violation: {detail}")

The runner computes each dimension as follows:

| Dimension | How It Is Measured |
| --- | --- |
| Cost | result.total_tokens from the simulation engine |
| Latency | result.total_latency_ms from AI client calls |
| Efficacy | Step-completion ratio (total_steps / max_steps) |
| Assurance | Constitution validation (hard-stop rule compliance) |
| Reliability | 1.0 for single runs; averaged over repeated runs |
| Coverage | activated_personas / target_personas ratio |
| Explainability | Events-with-payloads / total-events ratio |
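
The two "plus" dimensions are plain ratios. A back-of-envelope illustration of the arithmetic (the variable names are illustrative, not the runner's internals):

# Illustrative arithmetic for the two "plus" dimensions.
activated = {"RC", "BC", "QR", "IG"}        # personas that actually ran
targeted = {"RC", "BC", "QR", "IG", "PM"}   # personas the spec targeted
coverage = len(activated) / len(targeted)   # 0.80

events_with_payloads = 39  # events carrying a rationale or trace payload
total_events = 50
explainability = events_with_payloads / total_events  # 0.78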

Persisting Results

Save and load benchmark results for historical comparison:

# Save results
runner.serialize_results(results, "benchmarks/baseline_v0.9.yaml", fmt="yaml")
runner.serialize_results(results, "benchmarks/baseline_v0.9.json", fmt="json")

# Load results
loaded = BenchmarkRunner.load_results("benchmarks/baseline_v0.9.yaml")

Interpreting Results

Benchmark Comparison

The BenchmarkComparison model captures regressions, improvements, and unchanged benchmarks between two runs:

baseline_results = BenchmarkRunner.load_results("benchmarks/baseline_v0.8.yaml")
candidate_results = runner.run_suite(suite)

comparison = runner.compare(baseline_results, candidate_results, threshold=0.05)

print(f"Regressions: {comparison.regressions}")
print(f"Improvements: {comparison.improvements}")
print(f"Unchanged: {comparison.unchanged}")
print(f"Has regressions: {comparison.has_regressions}")

A regression is flagged in either of two cases (see the code sketch after the list):

  • A higher-is-better dimension drops by more than the threshold (default 5%)
  • A lower-is-better dimension increases by more than the threshold
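
Expressed as code, the rule reduces to a relative-change comparison per dimension. A minimal sketch, assuming change is computed relative to the baseline value:

# Sketch of the regression rule; assumes relative change against the baseline.
LOWER_IS_BETTER = ("cost", "latency_ms")

def is_regression(name, baseline, candidate, threshold=0.05):
    """Flag a dimension that moved the wrong way by more than threshold."""
    if baseline == 0:
        return False  # no baseline signal; avoid division by zero
    change = (candidate - baseline) / baseline
    if name in LOWER_IS_BETTER:
        return change > threshold   # got more expensive or slower
    return change < -threshold      # a quality dimension dropped

print(is_regression("efficacy", 0.90, 0.84))  # True: dropped ~6.7%
print(is_regression("cost", 1000.0, 1030.0))  # False: +3% is within 5%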

Regression Detection

Use detect_regressions for a quick pass/fail check:

regressions = runner.detect_regressions(baseline_results, candidate_results)
if regressions:
    print("REGRESSIONS DETECTED:")
    for r in regressions:
        print(f"  {r}")
else:
    print("No regressions detected.")

Reading Dimension Results

When reviewing CLEAR+ results, focus on the dimensions most relevant to your deployment context:

| Context | Priority Dimensions |
| --- | --- |
| Cost-sensitive deployments | Cost, Latency |
| Safety-critical systems | Assurance, Reliability |
| Audit / compliance | Coverage, Explainability |
| General quality | Efficacy, Coverage |
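
To make this concrete, here are two illustrative threshold profiles built with CLEARPlusMetrics (the values are examples, not recommendations):

from fcc.evaluation.metrics import CLEARPlusMetrics

# Example profiles only; tune the numbers to your own deployment.
latency_sensitive = CLEARPlusMetrics(
    cost=3000.0, latency_ms=500.0,    # tight upper bounds on spend and speed
    efficacy=0.75, assurance=0.70, reliability=0.85,
    coverage=0.70, explainability=0.50,
)

safety_critical = CLEARPlusMetrics(
    cost=10000.0, latency_ms=5000.0,  # generous budgets
    efficacy=0.85, assurance=0.95, reliability=0.98,  # strict quality floors
    coverage=0.90, explainability=0.85,
)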

CI Integration (Advisory Mode)

GitHub Actions Workflow

The benchmark runner integrates into CI pipelines in advisory mode, where regressions produce warnings but do not block merges:

# .github/workflows/benchmark.yml
name: CLEAR+ Benchmarks
on: [pull_request]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e ".[dev]"

      - name: Run CLEAR+ benchmarks
        run: |
          fcc benchmark run --suite baseline --output results.yaml

      - name: Compare against baseline
        run: |
          fcc benchmark compare \
            --baseline benchmarks/baseline.yaml \
            --candidate results.yaml \
            --threshold 0.05
        continue-on-error: true

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: results.yaml

CLI Commands

The benchmark CLI provides three primary commands:

| Command | Description |
| --- | --- |
| fcc benchmark run --suite <name> | Execute a benchmark suite |
| fcc benchmark compare --baseline <file> --candidate <file> | Compare two result files |
| fcc benchmark report --results <file> | Generate a human-readable report |

Advisory vs. Blocking Mode

In advisory mode (the default), benchmark regressions appear as PR comments but do not fail the CI build. To enable blocking mode, add --strict to the compare command:

fcc benchmark compare --baseline baseline.yaml --candidate results.yaml --strict

In strict mode, the command exits with code 1 if any regressions are detected, causing the CI job to fail.

Benchmark Configuration

CI benchmark configuration is stored in src/fcc/data/evaluation/ci_benchmark_config.yaml. It defines which suites to run, threshold overrides, and notification settings.
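
The exact schema lives in that file; as a rough illustration of the described contents, a config might look like this (all keys here are assumptions, not the shipped schema):

# Hypothetical shape only; keys are assumptions inferred from the description.
suites:
  - baseline
threshold_overrides:
  regression: 0.10           # loosen the default 5% threshold in CI
notifications:
  pr_comment: true
  fail_on_regression: false  # advisory mode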

Model Cards (Mitchell et al. 2019)

Model cards provide structured documentation for ML models. FCC extends the concept to personas, workflows, and persona categories.

ModelCard Structure

The ModelCard frozen dataclass captures 18 fields:

| Field | Description |
| --- | --- |
| model_name | Display name of the persona, workflow, or category |
| model_version | Version string |
| model_type | "persona", "workflow", or "category" |
| owner | Owning organisation |
| intended_use | What the model is intended for |
| out_of_scope_uses | Explicitly unsupported uses |
| ethical_considerations | Known ethical concerns |
| limitations | Known limitations |
| metrics | CLEAR+ metrics (if benchmarked) |
| training_data_summary | Data source summary |
| evaluation_results | Benchmark pass/fail summary |
| risk_category | EU AI Act risk classification |
| discernment_summary | Discernment Matrix trait summary |
| dimension_profile_summary | Persona dimension profile summary |
| riscear_summary | R.I.S.C.E.A.R. specification summary |
| collaborators | Persona IDs this model collaborates with |
| constitution_summary | Constitution tier summary |
| last_updated | ISO 8601 timestamp |

Generating Model Cards

The ModelCardGenerator produces cards from FCC registries:

from fcc.evaluation.card_generator import ModelCardGenerator
from fcc.personas.registry import PersonaRegistry

registry = PersonaRegistry.from_data_dir()
generator = ModelCardGenerator(version="0.9.0")

# Single persona card
spec = registry.get("RC")
card = generator.from_persona(spec)
print(f"Card: {card.model_name}")
print(f"Intended use: {card.intended_use}")
print(f"Risk category: {card.risk_category}")

# All cards (102 persona + 7 workflow + 20 category)
all_cards = generator.from_registry(registry)
print(f"Total cards: {len(all_cards)}")

# Render as Markdown
md = generator.render_markdown(card)
print(md)

# Batch write to disk
count = generator.batch_render(all_cards, "output/model_cards", fmt="markdown")
print(f"Written {count} model card files")

Risk Category Mapping

The generator maps constitution tiers to EU AI Act risk categories:

| Constitution Profile | Risk Category |
| --- | --- |
| 3+ hard-stop rules | HIGH |
| Any mandatory patterns | LIMITED |
| Only preferred patterns | MINIMAL |
| No constitution | None |
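
The table reads as a precedence list, with the first matching row winning. A sketch of that logic (the function and field names are hypothetical, not the library API):

# Sketch of the tier-to-risk mapping; names are hypothetical.
def risk_category(constitution):
    """Map a constitution profile to an EU AI Act risk category."""
    if constitution is None:
        return None
    if len(constitution.hard_stop_rules) >= 3:
        return "HIGH"
    if constitution.mandatory_patterns:
        return "LIMITED"
    return "MINIMAL"  # only preferred patterns remain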

Datasheets (Gebru et al. 2021)

Datasheets document the data sources that configure personas. The Datasheet frozen dataclass follows the 12-section structure from Gebru et al. (summarised in the table after the example):

from fcc.evaluation.card_generator import ModelCardGenerator
from fcc.personas.registry import PersonaRegistry

registry = PersonaRegistry.from_data_dir()
generator = ModelCardGenerator()

datasheet = generator.generate_datasheet("FCC Persona Dataset", registry)
print(f"Dataset: {datasheet.dataset_name}")
print(f"Composition: {datasheet.composition}")
print(f"Purpose: {datasheet.purpose}")
print(f"Limitations: {datasheet.limitations}")

| Section | Description |
| --- | --- |
| dataset_name | Name of the dataset |
| description | High-level overview |
| purpose | Why the dataset exists |
| creator | Who created it |
| composition | What instances it contains |
| collection_process | How data was gathered |
| preprocessing | Transformations applied |
| uses | Intended uses |
| distribution | How it is distributed |
| maintenance | Update and support plans |
| ethical_considerations | Known ethical issues |
| limitations | Known limitations |

Best Practices for Evaluation

1. Establish Baselines Early

Run baseline benchmarks before making changes. Store results in version control alongside the code:

benchmarks/
  baseline_v0.8.yaml
  baseline_v0.9.yaml
  stress_v0.9.yaml

2. Use Mock Mode for CI

The BenchmarkRunner defaults to mock=True, which uses the MockAIClient for deterministic, zero-cost execution. This ensures CI benchmarks are fast and repeatable without API key dependencies.

3. Track Trends, Not Absolute Values

Individual CLEAR+ dimension values depend on the simulation engine mode, so absolute numbers matter less than their direction over time. Focus on trends (a sketch for extracting them follows this list):

  • Is efficacy improving release over release?
  • Is cost staying within budget constraints?
  • Is coverage increasing as new personas are added?
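
A minimal sketch for pulling one dimension's trend out of archived result files, using the load_results API shown earlier:

# Sketch: print an efficacy trend across archived benchmark results.
from fcc.evaluation.runner import BenchmarkRunner

history = ["benchmarks/baseline_v0.8.yaml", "benchmarks/baseline_v0.9.yaml"]
for path in history:
    for result in BenchmarkRunner.load_results(path):
        print(f"{path} {result.spec.name}: efficacy={result.metrics.efficacy:.2f}")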

4. Generate Model Cards for Every Release

Include model card generation in your release pipeline:

fcc benchmark run --suite baseline --output results.yaml
fcc model-cards generate --benchmarks results.yaml --output docs/model_cards/

5. Pair Quantitative with Qualitative

CLEAR+ provides quantitative measurement. Pair it with:

  • Human review of simulation traces
  • Collaboration session scoring
  • Cross-reference matrix analysis

6. Document Limitations Honestly

Model cards must document limitations candidly. The limitations field is not optional: every persona has constraints that users need to understand before deployment.

7. Review Risk Categories Regularly

As constitutions evolve, risk categories may change. Re-run the classifier periodically:

from fcc.compliance.classifier import AIActClassifier

classifier = AIActClassifier(constitution_registry=const_reg)
for pid in registry.ids:
    spec = registry.get(pid)
    risk = classifier.classify_persona(spec)
    print(f"  {pid}: {risk.value}")

Practical Exercises

Exercise 1: Run a Baseline Benchmark

Load the baseline benchmark suite from src/fcc/data/evaluation/baseline_benchmarks.yaml, run it with the BenchmarkRunner, and inspect the CLEAR+ dimensions for each spec.

Exercise 2: Detect Regressions

Run the baseline suite twice with different max_steps values (50 and 25). Compare the results and verify that the runner correctly detects efficacy regressions.

Exercise 3: Generate Model Cards

Use the ModelCardGenerator to produce model cards for all 102 personas. Write them to disk in Markdown format and inspect the risk category assignments.

Exercise 4: CI Advisory Pipeline

Set up a GitHub Actions workflow that runs CLEAR+ benchmarks on every pull request and posts a comparison summary as a PR comment.

Summary

| Component | Purpose | Module |
| --- | --- | --- |
| CLEARPlusMetrics | 7-dimension measurement | fcc.evaluation.metrics |
| BenchmarkSpec | What to benchmark | fcc.evaluation.benchmark |
| BenchmarkSuite | Groups of specs | fcc.evaluation.benchmark |
| BenchmarkRunner | Execute and compare | fcc.evaluation.runner |
| ModelCard | Structured documentation | fcc.evaluation.cards |
| Datasheet | Data source documentation | fcc.evaluation.cards |
| ModelCardGenerator | Produce cards from registries | fcc.evaluation.card_generator |

CLEAR+ evaluation transforms FCC from a qualitative framework into a quantitatively measurable one. Model cards and datasheets add structured documentation that satisfies both internal review and external regulatory requirements.

Next Steps