Chapter 19: Evaluation & Benchmarking¶
Evaluation is what separates an experimental agent framework from a production-grade one. Without structured measurement, claims about persona quality, workflow efficiency, and governance compliance remain anecdotal. This chapter introduces the CLEAR+ evaluation methodology, the benchmark runner, model cards, datasheets, and the CI integration that ties everything together.
The flowchart below lays out the CLEAR+ evaluation pipeline: a BenchmarkSpec drives the runner and simulation, produces a seven-dimension metrics report, gates pass/fail against thresholds, and fans out to model cards and regression comparison.
flowchart TB
subgraph CLEAR["CLEAR+ 7 Dimensions"]
direction LR
C[Cost<br/>Lower is better]
L[Latency<br/>Lower is better]
E[Efficacy<br/>Higher is better]
A[Assurance<br/>Higher is better]
R[Reliability<br/>Higher is better]
COV[+Coverage<br/>Higher is better]
EXP[+Explainability<br/>Higher is better]
end
SPEC[BenchmarkSpec] --> RUNNER[BenchmarkRunner]
RUNNER --> SIM[Simulation Engine]
SIM --> METRICS[CLEARPlusMetrics]
METRICS --> THRESH{Meets Thresholds?}
THRESH -->|Yes| PASS[PASS]
THRESH -->|No| FAIL[FAIL + Violations]
METRICS --> CARD[ModelCard Generation]
METRICS --> COMP[BenchmarkComparison]
COMP --> REG{Regressions?}
style PASS fill:#4CAF50,color:#fff
style FAIL fill:#d32f2f,color:#fff
Treating the seven dimensions separately is what lets teams set different thresholds for latency-sensitive versus safety-critical personas.
CLEAR+ Methodology¶
CLEAR+ is a 7-dimension evaluation framework purpose-built for multi-persona agent systems. The name is a mnemonic for the seven dimensions:
| Dimension | Direction | Description |
|---|---|---|
| Cost | Lower is better | Total tokens or API calls consumed |
| Latency | Lower is better | End-to-end execution time (ms) |
| Efficacy | Higher is better | Task completion quality (0.0--1.0) |
| Assurance | Higher is better | Safety and confidence score (0.0--1.0) |
| Reliability | Higher is better | Consistency across repeated runs (0.0--1.0) |
| +Coverage | Higher is better | Persona / workflow activation ratio (0.0--1.0) |
| +Explainability | Higher is better | Trace and rationale completeness (0.0--1.0) |
The first five dimensions follow the original CLEAR model. The two "plus" dimensions -- Coverage and Explainability -- extend it to capture concerns specific to multi-agent systems, where incomplete persona activation and opaque reasoning are common failure modes.
CLEARPlusMetrics¶
The CLEARPlusMetrics frozen dataclass holds all seven dimensions:
from fcc.evaluation.metrics import CLEARPlusMetrics
metrics = CLEARPlusMetrics(
cost=1200.0,
latency_ms=450.0,
efficacy=0.92,
assurance=0.88,
reliability=0.95,
coverage=0.85,
explainability=0.78,
)
# Inspect all dimension names
print(metrics.dimension_names)
# ('cost', 'latency_ms', 'efficacy', 'assurance', 'reliability', 'coverage', 'explainability')
# Convert to a numeric vector for downstream analysis
print(metrics.as_vector())
# (1200.0, 450.0, 0.92, 0.88, 0.95, 0.85, 0.78)
Threshold Checking¶
CLEARPlusMetrics includes a meets_thresholds method that compares
actual measurements against expected bounds:
thresholds = CLEARPlusMetrics(
cost=2000.0, # max cost
latency_ms=1000.0, # max latency
efficacy=0.80, # min efficacy
assurance=0.70, # min assurance
reliability=0.90, # min reliability
coverage=0.75, # min coverage
explainability=0.60, # min explainability
)
passed, violations = metrics.meets_thresholds(thresholds)
print(f"Passed: {passed}") # True
print(f"Violations: {violations}") # []
For cost and latency (lower-is-better), the threshold is an upper bound. For the remaining five dimensions (higher-is-better), the threshold is a lower bound. Any violation produces a human-readable description in the violations list.
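To make the failure path concrete, here is a sketch of a check that does not pass. It reuses the thresholds object defined above; the metric values are invented for illustration, and the exact wording of the violation strings comes from meets_thresholds itself.

```python
from fcc.evaluation.metrics import CLEARPlusMetrics

# Hypothetical measurements that miss two of the thresholds defined above.
failing_metrics = CLEARPlusMetrics(
    cost=2600.0,          # above the 2000.0 upper bound
    latency_ms=450.0,
    efficacy=0.92,
    assurance=0.88,
    reliability=0.95,
    coverage=0.85,
    explainability=0.55,  # below the 0.60 lower bound
)

passed, violations = failing_metrics.meets_thresholds(thresholds)
print(passed)         # False
for violation in violations:
    print(violation)  # one human-readable description per violated dimension
```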
Building Metrics from Simulation Results¶
The factory method CLEARPlusMetrics.from_simulation_result extracts
cost and latency from an AISimulationResult and accepts keyword
arguments for the remaining dimensions:
metrics = CLEARPlusMetrics.from_simulation_result(
simulation_result,
efficacy=0.90,
assurance=0.85,
coverage=0.80,
explainability=0.75,
)
Running CLEAR+ Benchmarks¶
Benchmark Specifications¶
A BenchmarkSpec defines what to benchmark:
from fcc.evaluation.benchmark import BenchmarkSpec
from fcc.evaluation.metrics import CLEARPlusMetrics
spec = BenchmarkSpec(
name="core-workflow-baseline",
description="Baseline benchmark for core 5-persona workflow",
scenario="GEN-001",
personas=("RC", "BC", "QR", "IG", "PM"),
expected_outcomes=("research_brief", "blueprint", "quality_report"),
clear_thresholds=CLEARPlusMetrics(
cost=5000.0,
latency_ms=2000.0,
efficacy=0.75,
assurance=0.70,
reliability=0.85,
coverage=0.80,
explainability=0.60,
),
tags=("core", "baseline"),
workflow_graph="base_sequence",
)
Benchmark Suites¶
Group related specs into a BenchmarkSuite:
from fcc.evaluation.benchmark import BenchmarkSuite
suite = BenchmarkSuite(
name="baseline",
description="Baseline benchmarks for all core workflows",
specs=(spec,),
version="1.0",
)
Suites can also be loaded from YAML files; for example, the baseline suite ships as src/fcc/data/evaluation/baseline_benchmarks.yaml.
BenchmarkRunner¶
The BenchmarkRunner executes specs through the simulation engine and
measures all seven CLEAR+ dimensions:
from fcc.evaluation.runner import BenchmarkRunner
runner = BenchmarkRunner(mock=True, max_steps=50)
# Run a full suite
results = runner.run_suite(suite)
for result in results:
    print(f"{result.spec.name}: {'PASS' if result.passed else 'FAIL'}")
    print(f" Cost: {result.metrics.cost:.0f} tokens")
    print(f" Latency: {result.metrics.latency_ms:.0f} ms")
    print(f" Efficacy: {result.metrics.efficacy:.2f}")
    print(f" Coverage: {result.metrics.coverage:.2f}")
    if result.details:
        for detail in result.details:
            print(f" Violation: {detail}")
The runner computes each dimension as follows; a sketch of the ratio-based dimensions appears after the table:
| Dimension | How It Is Measured |
|---|---|
| Cost | result.total_tokens from the simulation engine |
| Latency | result.total_latency_ms from AI client calls |
| Efficacy | Step-completion ratio (total_steps / max_steps) |
| Assurance | Constitution validation (hard-stop rule compliance) |
| Reliability | 1.0 for single runs; averaged over repeated runs |
| Coverage | activated_personas / target_personas ratio |
| Explainability | Events-with-payloads / total-events ratio |
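The ratio-based rows in this table reduce to simple divisions. The sketch below only illustrates their shape; the counts are invented, and the BenchmarkRunner computes these values internally.

```python
# Illustrative only -- not the BenchmarkRunner's internal code.
def ratio(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else 0.0

total_steps, max_steps = 42, 50                # step-completion ratio
activated_personas, target_personas = 4, 5     # persona activation ratio
events_with_payloads, total_events = 39, 50    # trace completeness ratio

efficacy = ratio(total_steps, max_steps)                    # 0.84
coverage = ratio(activated_personas, target_personas)       # 0.8
explainability = ratio(events_with_payloads, total_events)  # 0.78
print(efficacy, coverage, explainability)
```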
Persisting Results¶
Save and load benchmark results for historical comparison:
# Save results
runner.serialize_results(results, "benchmarks/baseline_v0.9.yaml", fmt="yaml")
runner.serialize_results(results, "benchmarks/baseline_v0.9.json", fmt="json")
# Load results
loaded = BenchmarkRunner.load_results("benchmarks/baseline_v0.9.yaml")
Interpreting Results¶
Benchmark Comparison¶
The BenchmarkComparison model captures regressions, improvements, and
unchanged benchmarks between two runs:
baseline_results = BenchmarkRunner.load_results("benchmarks/baseline_v0.8.yaml")
candidate_results = runner.run_suite(suite)
comparison = runner.compare(baseline_results, candidate_results, threshold=0.05)
print(f"Regressions: {comparison.regressions}")
print(f"Improvements: {comparison.improvements}")
print(f"Unchanged: {comparison.unchanged}")
print(f"Has regressions: {comparison.has_regressions}")
A regression is flagged when either of the following holds (see the sketch after this list):
- A higher-is-better dimension drops by more than the threshold (default 5%)
- A lower-is-better dimension increases by more than the threshold
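A minimal sketch of that rule, assuming plain floats per dimension rather than the real result objects:

```python
# Illustrative regression rule -- not the BenchmarkComparison implementation.
LOWER_IS_BETTER = {"cost", "latency_ms"}

def is_regression(dimension: str, baseline: float, candidate: float,
                  threshold: float = 0.05) -> bool:
    if baseline == 0:
        return False  # guard against division by zero; real handling may differ
    change = (candidate - baseline) / baseline
    if dimension in LOWER_IS_BETTER:
        return change > threshold   # cost or latency grew by more than 5%
    return change < -threshold      # a higher-is-better dimension dropped by more than 5%

print(is_regression("latency_ms", 450.0, 495.0))  # True  (+10%)
print(is_regression("efficacy", 0.92, 0.90))      # False (about -2%)
```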
Regression Detection¶
Use detect_regressions for a quick pass/fail check:
regressions = runner.detect_regressions(baseline_results, candidate_results)
if regressions:
    print("REGRESSIONS DETECTED:")
    for r in regressions:
        print(f" {r}")
else:
    print("No regressions detected.")
Reading Dimension Results¶
When reviewing CLEAR+ results, focus on the dimensions most relevant to your deployment context:
| Context | Priority Dimensions |
|---|---|
| Cost-sensitive deployments | Cost, Latency |
| Safety-critical systems | Assurance, Reliability |
| Audit / compliance | Coverage, Explainability |
| General quality | Efficacy, Coverage |
CI Integration (Advisory Mode)¶
GitHub Actions Workflow¶
The benchmark runner integrates into CI pipelines in advisory mode -- regressions produce warnings but do not block merges:
# .github/workflows/benchmark.yml
name: CLEAR+ Benchmarks
on: [pull_request]
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e ".[dev]"
      - name: Run CLEAR+ benchmarks
        run: |
          fcc benchmark run --suite baseline --output results.yaml
      - name: Compare against baseline
        run: |
          fcc benchmark compare \
            --baseline benchmarks/baseline.yaml \
            --candidate results.yaml \
            --threshold 0.05
        continue-on-error: true
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: results.yaml
CLI Commands¶
The benchmark CLI provides three primary commands:
| Command | Description |
|---|---|
| `fcc benchmark run --suite <name>` | Execute a benchmark suite |
| `fcc benchmark compare --baseline <file> --candidate <file>` | Compare two result files |
| `fcc benchmark report --results <file>` | Generate a human-readable report |
Advisory vs. Blocking Mode¶
In advisory mode (the default), benchmark regressions appear as PR comments but do not fail the CI build. To enable blocking mode, add --strict to the compare command, for example: fcc benchmark compare --baseline benchmarks/baseline.yaml --candidate results.yaml --strict.
In strict mode, the command exits with code 1 if any regressions are detected, causing the CI job to fail.
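If you would rather enforce the gate from a script than via the CLI flag, the same behaviour can be approximated with the runner API shown earlier. This is a sketch only, and the file paths are examples rather than fixed conventions:

```python
import sys

from fcc.evaluation.runner import BenchmarkRunner

# Sketch of a blocking gate built on detect_regressions; paths are examples.
baseline_results = BenchmarkRunner.load_results("benchmarks/baseline.yaml")
candidate_results = BenchmarkRunner.load_results("results.yaml")

runner = BenchmarkRunner(mock=True)
regressions = runner.detect_regressions(baseline_results, candidate_results)
if regressions:
    for r in regressions:
        print(f"REGRESSION: {r}", file=sys.stderr)
    sys.exit(1)  # non-zero exit fails the CI job, mirroring --strict
```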
Benchmark Configuration¶
CI benchmark configuration is stored in
src/fcc/data/evaluation/ci_benchmark_config.yaml. It defines which
suites to run, threshold overrides, and notification settings.
Model Cards (Mitchell et al. 2019)¶
Model cards provide structured documentation for ML models. FCC extends the concept to personas, workflows, and persona categories.
ModelCard Structure¶
The ModelCard frozen dataclass captures 18 fields:
| Field | Description |
|---|---|
| model_name | Display name of the persona, workflow, or category |
| model_version | Version string |
| model_type | "persona", "workflow", or "category" |
| owner | Owning organisation |
| intended_use | What the model is intended for |
| out_of_scope_uses | Explicitly unsupported uses |
| ethical_considerations | Known ethical concerns |
| limitations | Known limitations |
| metrics | CLEAR+ metrics (if benchmarked) |
| training_data_summary | Data source summary |
| evaluation_results | Benchmark pass/fail summary |
| risk_category | EU AI Act risk classification |
| discernment_summary | Discernment Matrix trait summary |
| dimension_profile_summary | Persona dimension profile summary |
| riscear_summary | R.I.S.C.E.A.R. specification summary |
| collaborators | Persona IDs this model collaborates with |
| constitution_summary | Constitution tier summary |
| last_updated | ISO 8601 timestamp |
Generating Model Cards¶
The ModelCardGenerator produces cards from FCC registries:
from fcc.evaluation.card_generator import ModelCardGenerator
from fcc.personas.registry import PersonaRegistry
registry = PersonaRegistry.from_data_dir()
generator = ModelCardGenerator(version="0.9.0")
# Single persona card
spec = registry.get("RC")
card = generator.from_persona(spec)
print(f"Card: {card.model_name}")
print(f"Intended use: {card.intended_use}")
print(f"Risk category: {card.risk_category}")
# All cards (102 persona + 7 workflow + 20 category)
all_cards = generator.from_registry(registry)
print(f"Total cards: {len(all_cards)}")
# Render as Markdown
md = generator.render_markdown(card)
print(md)
# Batch write to disk
count = generator.batch_render(all_cards, "output/model_cards", fmt="markdown")
print(f"Written {count} model card files")
Risk Category Mapping¶
The generator maps constitution tiers to EU AI Act risk categories (a simplified sketch follows the table):
| Constitution Profile | Risk Category |
|---|---|
| 3+ hard-stop rules | HIGH |
| Any mandatory patterns | LIMITED |
| Only preferred patterns | MINIMAL |
| No constitution | None |
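As a simplified illustration of that mapping, the snippet below mirrors the table; the ConstitutionSummary shape is hypothetical and does not reflect the FCC data model or the generator's actual logic.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ConstitutionSummary:
    # Hypothetical shape, for illustration only.
    hard_stop_rules: int = 0
    mandatory_patterns: int = 0
    preferred_patterns: int = 0

def risk_category(summary: Optional[ConstitutionSummary]) -> Optional[str]:
    """Mirrors the mapping table above; not the actual generator code."""
    if summary is None:
        return None                  # no constitution -> no risk category
    if summary.hard_stop_rules >= 3:
        return "HIGH"
    if summary.mandatory_patterns:
        return "LIMITED"
    return "MINIMAL"

print(risk_category(ConstitutionSummary(hard_stop_rules=4)))     # HIGH
print(risk_category(ConstitutionSummary(preferred_patterns=2)))  # MINIMAL
```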
Datasheets (Gebru et al. 2021)¶
Datasheets document the data sources that configure personas. The
Datasheet frozen dataclass follows the 12-section structure from Gebru
et al.:
from fcc.evaluation.card_generator import ModelCardGenerator
from fcc.personas.registry import PersonaRegistry
registry = PersonaRegistry.from_data_dir()
generator = ModelCardGenerator()
datasheet = generator.generate_datasheet("FCC Persona Dataset", registry)
print(f"Dataset: {datasheet.dataset_name}")
print(f"Composition: {datasheet.composition}")
print(f"Purpose: {datasheet.purpose}")
print(f"Limitations: {datasheet.limitations}")
| Section | Description |
|---|---|
| dataset_name | Name of the dataset |
| description | High-level overview |
| purpose | Why the dataset exists |
| creator | Who created it |
| composition | What instances it contains |
| collection_process | How data was gathered |
| preprocessing | Transformations applied |
| uses | Intended uses |
| distribution | How it is distributed |
| maintenance | Update and support plans |
| ethical_considerations | Known ethical issues |
| limitations | Known limitations |
Best Practices for Evaluation¶
1. Establish Baselines Early¶
Run baseline benchmarks before making changes, and store the results in version control alongside the code -- for example, as a committed benchmarks/baseline.yaml that later CI runs compare against.
2. Use Mock Mode for CI¶
The BenchmarkRunner defaults to mock=True, which uses the
MockAIClient for deterministic, zero-cost execution. This ensures CI
benchmarks are fast and repeatable without API key dependencies.
3. Track Trends, Not Absolutes¶
Individual CLEAR+ dimension values depend on the simulation engine mode. Focus on trends over time (a sketch for tracking them follows the list):
- Is efficacy improving release over release?
- Is cost staying within budget constraints?
- Is coverage increasing as new personas are added?
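One lightweight way to watch those trends is to load a series of committed result files and print a dimension per release. This is a sketch; the file names below are examples only.

```python
from fcc.evaluation.runner import BenchmarkRunner

# Example history of committed result files (names are illustrative).
history = [
    "benchmarks/baseline_v0.8.yaml",
    "benchmarks/baseline_v0.9.yaml",
]

for path in history:
    results = BenchmarkRunner.load_results(path)
    mean_efficacy = sum(r.metrics.efficacy for r in results) / len(results)
    print(f"{path}: mean efficacy {mean_efficacy:.2f}")
```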
4. Generate Model Cards for Every Release¶
Include model card generation in your release pipeline:
fcc benchmark run --suite baseline --output results.yaml
fcc model-cards generate --benchmarks results.yaml --output docs/model_cards/
5. Pair Quantitative with Qualitative¶
CLEAR+ provides quantitative measurement. Pair it with:
- Human review of simulation traces
- Collaboration session scoring
- Cross-reference matrix analysis
6. Document Limitations Honestly¶
Model cards should honestly document limitations. The limitations
field is not optional -- every persona has constraints that users need
to understand before deployment.
7. Review Risk Categories Regularly¶
As constitutions evolve, risk categories may change. Re-run the classifier periodically:
from fcc.compliance.classifier import AIActClassifier

# registry and const_reg are assumed to exist from earlier setup, e.g. the
# PersonaRegistry.from_data_dir() call shown above and a corresponding
# constitution registry.
classifier = AIActClassifier(constitution_registry=const_reg)
for pid in registry.ids:
    spec = registry.get(pid)
    risk = classifier.classify_persona(spec)
    print(f" {pid}: {risk.value}")
Practical Exercises¶
Exercise 1: Run a Baseline Benchmark¶
Load the baseline benchmark suite from
src/fcc/data/evaluation/baseline_benchmarks.yaml, run it with the
BenchmarkRunner, and inspect the CLEAR+ dimensions for each spec.
Exercise 2: Detect Regressions¶
Run the baseline suite twice with different max_steps values (50 and
25). Compare the results and verify that the runner correctly detects
efficacy regressions.
Exercise 3: Generate Model Cards¶
Use the ModelCardGenerator to produce model cards for all 102 personas.
Write them to disk in Markdown format and inspect the risk category
assignments.
Exercise 4: CI Advisory Pipeline¶
Set up a GitHub Actions workflow that runs CLEAR+ benchmarks on every pull request and posts a comparison summary as a PR comment.
Summary¶
| Component | Purpose | Module |
|---|---|---|
| CLEARPlusMetrics | 7-dimension measurement | fcc.evaluation.metrics |
| BenchmarkSpec | What to benchmark | fcc.evaluation.benchmark |
| BenchmarkSuite | Groups of specs | fcc.evaluation.benchmark |
| BenchmarkRunner | Execute and compare | fcc.evaluation.runner |
| ModelCard | Structured documentation | fcc.evaluation.cards |
| Datasheet | Data source documentation | fcc.evaluation.cards |
| ModelCardGenerator | Produce cards from registries | fcc.evaluation.card_generator |
CLEAR+ evaluation transforms FCC from a qualitative framework into a quantitatively measurable one. Model cards and datasheets add structured documentation that satisfies both internal review and external regulatory requirements.
Next Steps
- Read Chapter 20 for automated compliance auditing
- Explore the CLEAR+ Benchmarking Guide for hands-on practice
- See the Model Card Generation Tutorial for step-by-step card creation