Reproducible Benchmarks¶
How to use the FCC CLEAR+ evaluation framework for reproducible, publishable benchmarking of multi-agent systems in research contexts.
Why Reproducibility Matters¶
Benchmarking claims in AI research are only meaningful when other researchers can reproduce them. FCC addresses the three pillars of benchmark reproducibility:
| Pillar | FCC Implementation |
|---|---|
| Deterministic execution | MockAIClient provides identical outputs across runs |
| Versioned specifications | BenchmarkSpec + BenchmarkSuite stored as YAML alongside code |
| Structured reporting | CLEARPlusMetrics with serialization to YAML/JSON |
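Determinism is easy to verify directly: run the same suite twice under the mock client and compare the reported metrics. A minimal sketch, assuming (as in the statistical example below) that `run_suite` returns one result per spec with a `metrics` attribute:

```python
from fcc.evaluation.runner import BenchmarkRunner
from fcc.evaluation.benchmark import BenchmarkSuite

# Same suite, same mock client, two runs: per-spec metrics should match exactly.
suite = BenchmarkSuite.from_yaml("src/fcc/data/evaluation/baseline_benchmarks.yaml")
runner = BenchmarkRunner(mock=True)

first = runner.run_suite(suite)
second = runner.run_suite(suite)

for a, b in zip(first, second):
    assert a.metrics.efficacy == b.metrics.efficacy
    assert a.metrics.coverage == b.metrics.coverage
```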
CLEAR+ for Research¶
CLEAR+ provides seven standardised dimensions for evaluating multi-agent systems. This consistency enables cross-study comparison, addressing a persistent gap in current agent evaluation literature.
Dimension Reference¶
| Dimension | Range | Interpretation |
|---|---|---|
| Cost | 0+ (tokens) | Resource consumption; lower is better |
| Latency | 0+ (ms) | Wall-clock time; lower is better |
| Efficacy | 0.0–1.0 | Task completion quality |
| Assurance | 0.0–1.0 | Safety and constraint compliance |
| Reliability | 0.0–1.0 | Run-to-run consistency |
| Coverage | 0.0–1.0 | Persona activation ratio |
| Explainability | 0.0–1.0 | Trace and rationale completeness |
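These dimensions appear as fields on `CLEARPlusMetrics` when results are reported. The sketch below constructs a metrics object by hand purely for illustration; the field names are inferred from the threshold keys used later in this guide, and the exact constructor signature should be checked against `fcc.evaluation.metrics`:

```python
from fcc.evaluation.metrics import CLEARPlusMetrics

# Illustrative only: field names mirror the CLEAR+ dimensions listed above;
# verify the actual constructor against the fcc.evaluation.metrics module.
metrics = CLEARPlusMetrics(
    cost=4200.0,          # tokens consumed
    latency_ms=1350.0,    # wall-clock milliseconds
    efficacy=0.82,
    assurance=0.91,
    reliability=0.88,
    coverage=0.80,
    explainability=0.64,
)
```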
Statistical Reporting¶
For publishable results, run each benchmark multiple times and report summary statistics:
```python
from fcc.evaluation.runner import BenchmarkRunner
from fcc.evaluation.benchmark import BenchmarkSuite
import statistics

suite = BenchmarkSuite.from_yaml("src/fcc/data/evaluation/baseline_benchmarks.yaml")
runner = BenchmarkRunner(mock=True)

# 10 independent runs so a mean and standard deviation can be reported
all_results = [runner.run_suite(suite) for _ in range(10)]

# Aggregate per dimension, per benchmark spec
for spec_idx, spec in enumerate(suite.specs):
    efficacies = [run[spec_idx].metrics.efficacy for run in all_results]
    coverages = [run[spec_idx].metrics.coverage for run in all_results]
    print(f"{spec.name}:")
    print(f"  Efficacy: {statistics.mean(efficacies):.3f} +/- {statistics.stdev(efficacies):.3f}")
    print(f"  Coverage: {statistics.mean(coverages):.3f} +/- {statistics.stdev(coverages):.3f}")
```
Benchmark Suite Design¶
Baseline Suites¶
Establish baseline suites that remain stable across versions:
```yaml
# benchmarks/baseline_v1.yaml
name: baseline-v1
description: Stable baseline for cross-version comparison
version: "1.0"
specs:
  - name: core-5-persona
    scenario: GEN-001
    personas: [RC, BC, QR, IG, PM]
    workflow_graph: base_sequence
    clear_thresholds:
      cost: 5000.0
      latency_ms: 2000.0
      efficacy: 0.70
      assurance: 0.70
      reliability: 0.85
      coverage: 0.75
      explainability: 0.50
```
Stress Suites¶
Create stress suites to test scalability:
```python
from fcc.evaluation.benchmark import BenchmarkSpec, BenchmarkSuite

# One spec per step budget, tagged so stress results can be filtered later
specs = []
for step_count in [10, 25, 50, 100]:
    specs.append(BenchmarkSpec(
        name=f"stress-{step_count}-steps",
        description=f"Stress test with {step_count} max steps",
        scenario="GEN-001",
        personas=("RC", "BC", "QR"),
        workflow_graph="base_sequence",
        tags=("stress",),
    ))

stress_suite = BenchmarkSuite(
    name="stress-tests",
    specs=tuple(specs),
)
```
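Running and persisting a stress suite uses the same calls as the baseline workflow; a short usage sketch (the output path is illustrative):

```python
# Execute the stress suite with the deterministic mock client and save the results
runner = BenchmarkRunner(mock=True)
stress_results = runner.run_suite(stress_suite)
runner.serialize_results(stress_results, "benchmarks/stress_results.yaml", fmt="yaml")
```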
Regression Detection¶
Compare experimental conditions against baselines:
```python
# Compare a stored baseline against a freshly run candidate condition
baseline = BenchmarkRunner.load_results("benchmarks/baseline_results.yaml")
candidate = runner.run_suite(suite)

comparison = runner.compare(baseline, candidate, threshold=0.05)
print(f"Regressions: {len(comparison.regressions)}")
print(f"Improvements: {len(comparison.improvements)}")
```
A regression is flagged when a higher-is-better dimension drops by more than the threshold, or a lower-is-better dimension increases by more than the threshold.
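In other words, the rule reduces to a per-dimension sign check. The sketch below is an illustrative reimplementation of that rule, not the code inside `runner.compare`, and it treats the threshold as an absolute delta:

```python
# Dimensions where larger values are better; cost and latency are the inverse.
HIGHER_IS_BETTER = {"efficacy", "assurance", "reliability", "coverage", "explainability"}
LOWER_IS_BETTER = {"cost", "latency_ms"}

def is_regression(dimension: str, baseline: float, candidate: float, threshold: float = 0.05) -> bool:
    """Flag a regression when the candidate moves in the wrong direction by more than threshold."""
    if dimension in HIGHER_IS_BETTER:
        return (baseline - candidate) > threshold   # quality dropped
    if dimension in LOWER_IS_BETTER:
        return (candidate - baseline) > threshold   # cost or latency grew
    raise ValueError(f"unknown dimension: {dimension}")

# Efficacy falling from 0.80 to 0.70 exceeds a 0.05 threshold, so it is flagged.
assert is_regression("efficacy", baseline=0.80, candidate=0.70)
```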
Persisting Results for Publication¶
Save results in both YAML (human-readable) and JSON (machine-parseable):
```python
# `results` is the list returned by a prior runner.run_suite(...) call
runner.serialize_results(results, "experiments/exp1_results.yaml", fmt="yaml")
runner.serialize_results(results, "experiments/exp1_results.json", fmt="json")
```
Include benchmark YAML files and result files in your paper's supplementary materials.
Model Cards for Transparency¶
Generate model cards to satisfy research transparency norms:
```python
from fcc.evaluation.card_generator import ModelCardGenerator

generator = ModelCardGenerator()
cards = generator.from_registry(registry)  # `registry`: the registry populated during experiment setup

# Write Markdown cards for supplementary materials
generator.batch_render(cards, "supplementary/model_cards/", fmt="markdown")
```
Each card documents intended use, limitations, ethical considerations, CLEAR+ metrics, and risk categories, in line with the model card recommendations of Mitchell et al. (2019) that major ML venues have adopted.
Datasheets for Dataset Documentation¶
Datasheets follow Gebru et al. (2021), covering composition, collection process, intended uses, and limitations. Include them when publishing persona-based research.
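A datasheet can be kept as a small structured file alongside the benchmark results. The sketch below writes an illustrative skeleton with the four sections named above; the schema and file path are examples, not an FCC requirement:

```python
import yaml

# Illustrative skeleton only: section names mirror Gebru et al. (2021).
datasheet = {
    "dataset": "persona-scenarios-v1",  # hypothetical dataset name
    "composition": "What the instances are, how many, and known gaps or noise.",
    "collection_process": "How scenarios and personas were authored or sourced, and by whom.",
    "intended_uses": "Evaluation of multi-agent workflows; note out-of-scope uses.",
    "limitations": "Known biases, coverage gaps, and conditions under which results may not transfer.",
}

with open("supplementary/datasheet_persona_v1.yaml", "w") as fh:
    yaml.safe_dump(datasheet, fh, sort_keys=False)
```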
FAIR Alignment¶
CLEAR+ benchmarks support FAIR principles:
| FAIR Principle | Implementation |
|---|---|
| Findable | Benchmark suites have unique names and versions |
| Accessible | Results serialised to YAML/JSON |
| Interoperable | CLEARPlusMetrics uses standard dimension names |
| Reusable | BenchmarkSpec captures full configuration for re-execution |
Related Resources¶
- Reproducibility Guide -- General FCC reproducibility
- FAIR Workflow -- FAIR-compliant research workflows
- CLEAR+ Benchmarking Guide -- Hands-on tutorial
- Guidebook Chapter 19 -- Full evaluation reference