
Chapter 19: Evaluation & Benchmarking

Evaluation is what separates an experimental agent framework from a production-grade one. Without structured measurement, claims about persona quality, workflow efficiency, and governance compliance remain anecdotal. This chapter introduces the CLEAR+ evaluation methodology, the benchmark runner, model cards, datasheets, and the CI integration that ties everything together.

The flowchart below lays out the CLEAR+ evaluation pipeline: a BenchmarkSpec drives the runner and simulation, produces a seven-dimension metrics report, gates pass/fail against thresholds, and fans out to model cards and regression comparison.

flowchart TB
    subgraph CLEAR["CLEAR+ 7 Dimensions"]
        direction LR
        C[Cost<br/>Lower is better]
        L[Latency<br/>Lower is better]
        E[Efficacy<br/>Higher is better]
        A[Assurance<br/>Higher is better]
        R[Reliability<br/>Higher is better]
        COV[+Coverage<br/>Higher is better]
        EXP[+Explainability<br/>Higher is better]
    end

    SPEC[BenchmarkSpec] --> RUNNER[BenchmarkRunner]
    RUNNER --> SIM[Simulation Engine]
    SIM --> METRICS[CLEARPlusMetrics]
    METRICS --> THRESH{Meets Thresholds?}
    THRESH -->|Yes| PASS[PASS]
    THRESH -->|No| FAIL[FAIL + Violations]
    METRICS --> CARD[ModelCard Generation]
    METRICS --> COMP[BenchmarkComparison]
    COMP --> REG{Regressions?}

    style PASS fill:#4CAF50,color:#fff
    style FAIL fill:#d32f2f,color:#fff

Treating the seven dimensions separately is what lets teams set different thresholds for latency-sensitive versus safety-critical personas.

CLEAR+ Methodology

CLEAR+ is a seven-dimension evaluation framework purpose-built for multi-persona agent systems. The name is a mnemonic: CLEAR spells the five core dimensions, and the plus marks the two multi-agent extensions:

| Dimension | Direction | Description |
| --- | --- | --- |
| Cost | Lower is better | Total tokens or API calls consumed |
| Latency | Lower is better | End-to-end execution time (ms) |
| Efficacy | Higher is better | Task completion quality (0.0–1.0) |
| Assurance | Higher is better | Safety and confidence score (0.0–1.0) |
| Reliability | Higher is better | Consistency across repeated runs (0.0–1.0) |
| +Coverage | Higher is better | Persona / workflow activation ratio (0.0–1.0) |
| +Explainability | Higher is better | Trace and rationale completeness (0.0–1.0) |

The first five dimensions follow the original CLEAR model. The two "plus" dimensions, Coverage and Explainability, extend it to capture concerns specific to multi-agent systems, where incomplete persona activation and opaque reasoning are common failure modes.

CLEARPlusMetrics

The CLEARPlusMetrics frozen dataclass holds all seven dimensions:

from fcc.evaluation.metrics import CLEARPlusMetrics

metrics = CLEARPlusMetrics(
    cost=1200.0,
    latency_ms=450.0,
    efficacy=0.92,
    assurance=0.88,
    reliability=0.95,
    coverage=0.85,
    explainability=0.78,
)

# Inspect all dimension names
print(metrics.dimension_names)
# ('cost', 'latency_ms', 'efficacy', 'assurance', 'reliability', 'coverage', 'explainability')

# Convert to a numeric vector for downstream analysis
print(metrics.as_vector())
# (1200.0, 450.0, 0.92, 0.88, 0.95, 0.85, 0.78)

Threshold Checking

CLEARPlusMetrics includes a meets_thresholds method that compares actual measurements against expected bounds:

thresholds = CLEARPlusMetrics(
    cost=2000.0,       # max cost
    latency_ms=1000.0, # max latency
    efficacy=0.80,     # min efficacy
    assurance=0.70,    # min assurance
    reliability=0.90,  # min reliability
    coverage=0.75,     # min coverage
    explainability=0.60, # min explainability
)

passed, violations = metrics.meets_thresholds(thresholds)
print(f"Passed: {passed}")  # True
print(f"Violations: {violations}")  # []

For cost and latency (lower-is-better), the threshold is an upper bound. For the remaining five dimensions (higher-is-better), the threshold is a lower bound. Any violation produces a human-readable description in the violations list.
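
The comparison itself is mechanical. A minimal sketch of the semantics described above (not the library's implementation), using the dimension_names and as_vector helpers shown earlier:

# Sketch of the threshold semantics; not the actual CLEARPlusMetrics source.
LOWER_IS_BETTER = ("cost", "latency_ms")

def check_thresholds(actual, thresholds):
    """Return (passed, violations) following the CLEAR+ direction rules."""
    violations = []
    for name, value, bound in zip(
        actual.dimension_names, actual.as_vector(), thresholds.as_vector()
    ):
        if name in LOWER_IS_BETTER:
            if value > bound:  # threshold is an upper bound
                violations.append(f"{name}: {value} exceeds max {bound}")
        elif value < bound:  # threshold is a lower bound
            violations.append(f"{name}: {value} below min {bound}")
    return (not violations, violations)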

Building Metrics from Simulation Results

The factory method CLEARPlusMetrics.from_simulation_result extracts cost and latency from an AISimulationResult and accepts keyword arguments for the remaining dimensions:

metrics = CLEARPlusMetrics.from_simulation_result(
    simulation_result,
    efficacy=0.90,
    assurance=0.85,
    coverage=0.80,
    explainability=0.75,
)

Running CLEAR+ Benchmarks

Benchmark Specifications

A BenchmarkSpec defines what to benchmark:

from fcc.evaluation.benchmark import BenchmarkSpec
from fcc.evaluation.metrics import CLEARPlusMetrics

spec = BenchmarkSpec(
    name="core-workflow-baseline",
    description="Baseline benchmark for core 5-persona workflow",
    scenario="GEN-001",
    personas=("RC", "BC", "QR", "IG", "PM"),
    expected_outcomes=("research_brief", "blueprint", "quality_report"),
    clear_thresholds=CLEARPlusMetrics(
        cost=5000.0,
        latency_ms=2000.0,
        efficacy=0.75,
        assurance=0.70,
        reliability=0.85,
        coverage=0.80,
        explainability=0.60,
    ),
    tags=("core", "baseline"),
    workflow_graph="base_sequence",
)

Benchmark Suites

Group related specs into a BenchmarkSuite:

from fcc.evaluation.benchmark import BenchmarkSuite

suite = BenchmarkSuite(
    name="baseline",
    description="Baseline benchmarks for all core workflows",
    specs=(spec,),
    version="1.0",
)

Suites can also be loaded from YAML files:

suite = BenchmarkSuite.from_yaml("src/fcc/data/evaluation/baseline_benchmarks.yaml")
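
The on-disk schema is defined by the library; as a rough illustration, a suite file plausibly looks like this, assuming the YAML keys mirror the BenchmarkSuite and BenchmarkSpec fields shown above:

# Hypothetical suite file; key names assumed to mirror the dataclasses.
name: baseline
description: Baseline benchmarks for all core workflows
version: "1.0"
specs:
  - name: core-workflow-baseline
    description: Baseline benchmark for core 5-persona workflow
    scenario: GEN-001
    personas: [RC, BC, QR, IG, PM]
    expected_outcomes: [research_brief, blueprint, quality_report]
    clear_thresholds:
      cost: 5000.0
      latency_ms: 2000.0
      efficacy: 0.75
      assurance: 0.70
      reliability: 0.85
      coverage: 0.80
      explainability: 0.60
    tags: [core, baseline]
    workflow_graph: base_sequence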

BenchmarkRunner

The BenchmarkRunner executes specs through the simulation engine and measures all seven CLEAR+ dimensions:

from fcc.evaluation.runner import BenchmarkRunner

runner = BenchmarkRunner(mock=True, max_steps=50)

# Run a full suite
results = runner.run_suite(suite)

for result in results:
    print(f"{result.spec.name}: {'PASS' if result.passed else 'FAIL'}")
    print(f"  Cost: {result.metrics.cost:.0f} tokens")
    print(f"  Latency: {result.metrics.latency_ms:.0f} ms")
    print(f"  Efficacy: {result.metrics.efficacy:.2f}")
    print(f"  Coverage: {result.metrics.coverage:.2f}")
    if result.details:
        for detail in result.details:
            print(f"  Violation: {detail}")

The runner computes each dimension as follows:

| Dimension | How It Is Measured |
| --- | --- |
| Cost | result.total_tokens from the simulation engine |
| Latency | result.total_latency_ms from AI client calls |
| Efficacy | Step-completion ratio (total_steps / max_steps) |
| Assurance | Constitution validation (hard-stop rule compliance) |
| Reliability | 1.0 for single runs; averaged over repeated runs |
| Coverage | activated_personas / target_personas ratio |
| Explainability | Events-with-payloads / total-events ratio |
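
The two "plus" dimensions are plain ratios. A back-of-envelope illustration of the arithmetic (the variable names are illustrative, not the runner's internals):

# Illustrative arithmetic for the two "plus" dimensions.
activated = {"RC", "BC", "QR", "IG"}        # personas that actually ran
targeted = {"RC", "BC", "QR", "IG", "PM"}   # personas the spec targeted
coverage = len(activated) / len(targeted)   # 0.80

events_with_payloads = 39  # events carrying a rationale or trace payload
total_events = 50
explainability = events_with_payloads / total_events  # 0.78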

Persisting Results

Save and load benchmark results for historical comparison:

# Save results
runner.serialize_results(results, "benchmarks/baseline_v0.9.yaml", fmt="yaml")
runner.serialize_results(results, "benchmarks/baseline_v0.9.json", fmt="json")

# Load results
loaded = BenchmarkRunner.load_results("benchmarks/baseline_v0.9.yaml")

Interpreting Results

Benchmark Comparison

The BenchmarkComparison model captures regressions, improvements, and unchanged benchmarks between two runs:

baseline_results = BenchmarkRunner.load_results("benchmarks/baseline_v0.8.yaml")
candidate_results = runner.run_suite(suite)

comparison = runner.compare(baseline_results, candidate_results, threshold=0.05)

print(f"Regressions: {comparison.regressions}")
print(f"Improvements: {comparison.improvements}")
print(f"Unchanged: {comparison.unchanged}")
print(f"Has regressions: {comparison.has_regressions}")

A regression is flagged in either of two cases (see the code sketch after the list):

  • A higher-is-better dimension drops by more than the threshold (default 5%)
  • A lower-is-better dimension increases by more than the threshold
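
Expressed as code, the rule reduces to a relative-change comparison per dimension. A minimal sketch, assuming change is computed relative to the baseline value:

# Sketch of the regression rule; assumes relative change against the baseline.
LOWER_IS_BETTER = ("cost", "latency_ms")

def is_regression(name, baseline, candidate, threshold=0.05):
    """Flag a dimension that moved the wrong way by more than threshold."""
    if baseline == 0:
        return False  # no baseline signal; avoid division by zero
    change = (candidate - baseline) / baseline
    if name in LOWER_IS_BETTER:
        return change > threshold   # got more expensive or slower
    return change < -threshold      # a quality dimension dropped

print(is_regression("efficacy", 0.90, 0.84))  # True: dropped ~6.7%
print(is_regression("cost", 1000.0, 1030.0))  # False: +3% is within 5%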

Regression Detection

Use detect_regressions for a quick pass/fail check:

regressions = runner.detect_regressions(baseline_results, candidate_results)
if regressions:
    print("REGRESSIONS DETECTED:")
    for r in regressions:
        print(f"  {r}")
else:
    print("No regressions detected.")

Reading Dimension Results

When reviewing CLEAR+ results, focus on the dimensions most relevant to your deployment context:

| Context | Priority Dimensions |
| --- | --- |
| Cost-sensitive deployments | Cost, Latency |
| Safety-critical systems | Assurance, Reliability |
| Audit / compliance | Coverage, Explainability |
| General quality | Efficacy, Coverage |
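
To make this concrete, here are two illustrative threshold profiles built with CLEARPlusMetrics (the values are examples, not recommendations):

from fcc.evaluation.metrics import CLEARPlusMetrics

# Example profiles only; tune the numbers to your own deployment.
latency_sensitive = CLEARPlusMetrics(
    cost=3000.0, latency_ms=500.0,    # tight upper bounds on spend and speed
    efficacy=0.75, assurance=0.70, reliability=0.85,
    coverage=0.70, explainability=0.50,
)

safety_critical = CLEARPlusMetrics(
    cost=10000.0, latency_ms=5000.0,  # generous budgets
    efficacy=0.85, assurance=0.95, reliability=0.98,  # strict quality floors
    coverage=0.90, explainability=0.85,
)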

CI Integration (Advisory Mode)

GitHub Actions Workflow

The benchmark runner integrates into CI pipelines in advisory mode, where regressions produce warnings but do not block merges:

# .github/workflows/benchmark.yml
name: CLEAR+ Benchmarks
on: [pull_request]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e ".[dev]"

      - name: Run CLEAR+ benchmarks
        run: |
          fcc benchmark run --suite baseline --output results.yaml

      - name: Compare against baseline
        run: |
          fcc benchmark compare \
            --baseline benchmarks/baseline.yaml \
            --candidate results.yaml \
            --threshold 0.05
        continue-on-error: true

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: results.yaml

CLI Commands

The benchmark CLI provides three primary commands:

| Command | Description |
| --- | --- |
| fcc benchmark run --suite <name> | Execute a benchmark suite |
| fcc benchmark compare --baseline <file> --candidate <file> | Compare two result files |
| fcc benchmark report --results <file> | Generate a human-readable report |

Advisory vs. Blocking Mode

In advisory mode (the default), benchmark regressions appear as PR comments but do not fail the CI build. To enable blocking mode, add --strict to the compare command:

fcc benchmark compare --baseline baseline.yaml --candidate results.yaml --strict

In strict mode, the command exits with code 1 if any regressions are detected, causing the CI job to fail.

Benchmark Configuration

CI benchmark configuration is stored in src/fcc/data/evaluation/ci_benchmark_config.yaml. It defines which suites to run, threshold overrides, and notification settings.
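
The exact schema lives in that file; as a rough illustration of the described contents, a config might look like this (all keys here are assumptions, not the shipped schema):

# Hypothetical shape only; keys are assumptions inferred from the description.
suites:
  - baseline
threshold_overrides:
  regression: 0.10           # loosen the default 5% threshold in CI
notifications:
  pr_comment: true
  fail_on_regression: false  # advisory mode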

Model Cards (Mitchell et al. 2019)

Model cards provide structured documentation for ML models. FCC extends the concept to personas, workflows, and persona categories.

ModelCard Structure

The ModelCard frozen dataclass captures 18 fields:

| Field | Description |
| --- | --- |
| model_name | Display name of the persona, workflow, or category |
| model_version | Version string |
| model_type | "persona", "workflow", or "category" |
| owner | Owning organisation |
| intended_use | What the model is intended for |
| out_of_scope_uses | Explicitly unsupported uses |
| ethical_considerations | Known ethical concerns |
| limitations | Known limitations |
| metrics | CLEAR+ metrics (if benchmarked) |
| training_data_summary | Data source summary |
| evaluation_results | Benchmark pass/fail summary |
| risk_category | EU AI Act risk classification |
| discernment_summary | Discernment Matrix trait summary |
| dimension_profile_summary | Persona dimension profile summary |
| riscear_summary | R.I.S.C.E.A.R. specification summary |
| collaborators | Persona IDs this model collaborates with |
| constitution_summary | Constitution tier summary |
| last_updated | ISO 8601 timestamp |

Generating Model Cards

The ModelCardGenerator produces cards from FCC registries:

from fcc.evaluation.card_generator import ModelCardGenerator
from fcc.personas.registry import PersonaRegistry

registry = PersonaRegistry.from_data_dir()
generator = ModelCardGenerator(version="0.9.0")

# Single persona card
spec = registry.get("RC")
card = generator.from_persona(spec)
print(f"Card: {card.model_name}")
print(f"Intended use: {card.intended_use}")
print(f"Risk category: {card.risk_category}")

# All cards (102 persona + 7 workflow + 20 category)
all_cards = generator.from_registry(registry)
print(f"Total cards: {len(all_cards)}")

# Render as Markdown
md = generator.render_markdown(card)
print(md)

# Batch write to disk
count = generator.batch_render(all_cards, "output/model_cards", fmt="markdown")
print(f"Written {count} model card files")

Risk Category Mapping

The generator maps constitution tiers to EU AI Act risk categories:

| Constitution Profile | Risk Category |
| --- | --- |
| 3+ hard-stop rules | HIGH |
| Any mandatory patterns | LIMITED |
| Only preferred patterns | MINIMAL |
| No constitution | None |
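
The table reads as a precedence list, with the first matching row winning. A sketch of that logic (the function and field names are hypothetical, not the library API):

# Sketch of the tier-to-risk mapping; names are hypothetical.
def risk_category(constitution):
    """Map a constitution profile to an EU AI Act risk category."""
    if constitution is None:
        return None
    if len(constitution.hard_stop_rules) >= 3:
        return "HIGH"
    if constitution.mandatory_patterns:
        return "LIMITED"
    return "MINIMAL"  # only preferred patterns remain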

Datasheets (Gebru et al. 2021)

Datasheets document the data sources that configure personas. The Datasheet frozen dataclass follows the 12-section structure from Gebru et al. (summarised in the table after the example):

from fcc.evaluation.card_generator import ModelCardGenerator
from fcc.personas.registry import PersonaRegistry

registry = PersonaRegistry.from_data_dir()
generator = ModelCardGenerator()

datasheet = generator.generate_datasheet("FCC Persona Dataset", registry)
print(f"Dataset: {datasheet.dataset_name}")
print(f"Composition: {datasheet.composition}")
print(f"Purpose: {datasheet.purpose}")
print(f"Limitations: {datasheet.limitations}")

| Section | Description |
| --- | --- |
| dataset_name | Name of the dataset |
| description | High-level overview |
| purpose | Why the dataset exists |
| creator | Who created it |
| composition | What instances it contains |
| collection_process | How data was gathered |
| preprocessing | Transformations applied |
| uses | Intended uses |
| distribution | How it is distributed |
| maintenance | Update and support plans |
| ethical_considerations | Known ethical issues |
| limitations | Known limitations |

Best Practices for Evaluation

1. Establish Baselines Early

Run baseline benchmarks before making changes. Store results in version control alongside the code:

benchmarks/
  baseline_v0.8.yaml
  baseline_v0.9.yaml
  stress_v0.9.yaml

2. Use Mock Mode for CI

The BenchmarkRunner defaults to mock=True, which uses the MockAIClient for deterministic, zero-cost execution. This ensures CI benchmarks are fast and repeatable without API key dependencies.

3. Track Trends, Not Absolute Values

Individual CLEAR+ dimension values depend on the simulation engine mode, so absolute numbers matter less than their direction over time. Focus on trends (a sketch for extracting them follows this list):

  • Is efficacy improving release over release?
  • Is cost staying within budget constraints?
  • Is coverage increasing as new personas are added?
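
A minimal sketch for pulling one dimension's trend out of archived result files, using the load_results API shown earlier:

# Sketch: print an efficacy trend across archived benchmark results.
from fcc.evaluation.runner import BenchmarkRunner

history = ["benchmarks/baseline_v0.8.yaml", "benchmarks/baseline_v0.9.yaml"]
for path in history:
    for result in BenchmarkRunner.load_results(path):
        print(f"{path} {result.spec.name}: efficacy={result.metrics.efficacy:.2f}")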

4. Generate Model Cards for Every Release

Include model card generation in your release pipeline:

fcc benchmark run --suite baseline --output results.yaml
fcc model-cards generate --benchmarks results.yaml --output docs/model_cards/

5. Pair Quantitative with Qualitative

CLEAR+ provides quantitative measurement. Pair it with:

  • Human review of simulation traces
  • Collaboration session scoring
  • Cross-reference matrix analysis

6. Document Limitations Honestly

Model cards must document limitations candidly. The limitations field is not optional: every persona has constraints that users need to understand before deployment.

7. Review Risk Categories Regularly

As constitutions evolve, risk categories may change. Re-run the classifier periodically:

from fcc.compliance.classifier import AIActClassifier

classifier = AIActClassifier(constitution_registry=const_reg)
for pid in registry.ids:
    spec = registry.get(pid)
    risk = classifier.classify_persona(spec)
    print(f"  {pid}: {risk.value}")

Practical Exercises

Exercise 1: Run a Baseline Benchmark

Load the baseline benchmark suite from src/fcc/data/evaluation/baseline_benchmarks.yaml, run it with the BenchmarkRunner, and inspect the CLEAR+ dimensions for each spec.

Exercise 2: Detect Regressions

Run the baseline suite twice with different max_steps values (50 and 25). Compare the results and verify that the runner correctly detects efficacy regressions.

Exercise 3: Generate Model Cards

Use the ModelCardGenerator to produce model cards for all 102 personas. Write them to disk in Markdown format and inspect the risk category assignments.

Exercise 4: CI Advisory Pipeline

Set up a GitHub Actions workflow that runs CLEAR+ benchmarks on every pull request and posts a comparison summary as a PR comment.

Summary

| Component | Purpose | Module |
| --- | --- | --- |
| CLEARPlusMetrics | 7-dimension measurement | fcc.evaluation.metrics |
| BenchmarkSpec | What to benchmark | fcc.evaluation.benchmark |
| BenchmarkSuite | Groups of specs | fcc.evaluation.benchmark |
| BenchmarkRunner | Execute and compare | fcc.evaluation.runner |
| ModelCard | Structured documentation | fcc.evaluation.cards |
| Datasheet | Data source documentation | fcc.evaluation.cards |
| ModelCardGenerator | Produce cards from registries | fcc.evaluation.card_generator |

CLEAR+ evaluation transforms FCC from a qualitative framework into a quantitatively measurable one. Model cards and datasheets add structured documentation that satisfies both internal review and external regulatory requirements.

Next Steps