Reproducible Benchmarks¶
How to use the FCC CLEAR+ evaluation framework for reproducible, publishable benchmarking of multi-agent systems in research contexts.
Why Reproducibility Matters¶
Benchmarking claims in AI research are only meaningful when other researchers can reproduce them. FCC addresses the three pillars of benchmark reproducibility:
| Pillar | FCC Implementation |
|---|---|
| Deterministic execution | MockAIClient provides identical outputs across runs |
| Versioned specifications | BenchmarkSpec + BenchmarkSuite stored as YAML alongside code |
| Structured reporting | CLEARPlusMetrics with serialization to YAML/JSON |
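Determinism is easy to verify directly: run the same suite twice under the mock client and compare the reported metrics. A minimal sketch, assuming (as in the statistical example below) that `run_suite` returns one result per spec with a `metrics` attribute:

```python
from fcc.evaluation.runner import BenchmarkRunner
from fcc.evaluation.benchmark import BenchmarkSuite

# Same suite, same mock client, two runs: per-spec metrics should match exactly.
suite = BenchmarkSuite.from_yaml("src/fcc/data/evaluation/baseline_benchmarks.yaml")
runner = BenchmarkRunner(mock=True)

first = runner.run_suite(suite)
second = runner.run_suite(suite)

for a, b in zip(first, second):
    assert a.metrics.efficacy == b.metrics.efficacy
    assert a.metrics.coverage == b.metrics.coverage
```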
CLEAR+ for Research¶
CLEAR+ provides seven standardised dimensions for evaluating multi-agent systems. This consistency enables cross-study comparison, addressing a persistent gap in current agent evaluation literature.
Dimension Reference¶
| Dimension | Range | Interpretation |
|---|---|---|
| Cost | 0+ (tokens) | Resource consumption; lower is better |
| Latency | 0+ (ms) | Wall-clock time; lower is better |
| Efficacy | 0.0–1.0 | Task completion quality |
| Assurance | 0.0–1.0 | Safety and constraint compliance |
| Reliability | 0.0–1.0 | Run-to-run consistency |
| Coverage | 0.0–1.0 | Persona activation ratio |
| Explainability | 0.0–1.0 | Trace and rationale completeness |
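These dimensions appear as fields on `CLEARPlusMetrics` when results are reported. The sketch below constructs a metrics object by hand purely for illustration; the field names are inferred from the threshold keys used later in this guide, and the exact constructor signature should be checked against `fcc.evaluation.metrics`:

```python
from fcc.evaluation.metrics import CLEARPlusMetrics

# Illustrative only: field names mirror the CLEAR+ dimensions listed above;
# verify the actual constructor against the fcc.evaluation.metrics module.
metrics = CLEARPlusMetrics(
    cost=4200.0,          # tokens consumed
    latency_ms=1350.0,    # wall-clock milliseconds
    efficacy=0.82,
    assurance=0.91,
    reliability=0.88,
    coverage=0.80,
    explainability=0.64,
)
```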
Statistical Reporting¶
For publishable results, run each benchmark multiple times and report summary statistics:
```python
from fcc.evaluation.runner import BenchmarkRunner
from fcc.evaluation.benchmark import BenchmarkSuite
import statistics

suite = BenchmarkSuite.from_yaml("src/fcc/data/evaluation/baseline_benchmarks.yaml")
runner = BenchmarkRunner(mock=True)

# 10 independent runs so a mean and standard deviation can be reported
all_results = [runner.run_suite(suite) for _ in range(10)]

# Aggregate per dimension, per benchmark spec
for spec_idx, spec in enumerate(suite.specs):
    efficacies = [run[spec_idx].metrics.efficacy for run in all_results]
    coverages = [run[spec_idx].metrics.coverage for run in all_results]
    print(f"{spec.name}:")
    print(f"  Efficacy: {statistics.mean(efficacies):.3f} +/- {statistics.stdev(efficacies):.3f}")
    print(f"  Coverage: {statistics.mean(coverages):.3f} +/- {statistics.stdev(coverages):.3f}")
```
Benchmark Suite Design¶
Baseline Suites¶
Establish baseline suites that remain stable across versions:
```yaml
# benchmarks/baseline_v1.yaml
name: baseline-v1
description: Stable baseline for cross-version comparison
version: "1.0"
specs:
  - name: core-5-persona
    scenario: GEN-001
    personas: [RC, BC, QR, IG, PM]
    workflow_graph: base_sequence
    clear_thresholds:
      cost: 5000.0
      latency_ms: 2000.0
      efficacy: 0.70
      assurance: 0.70
      reliability: 0.85
      coverage: 0.75
      explainability: 0.50
```
Stress Suites¶
Create stress suites to test scalability:
```python
from fcc.evaluation.benchmark import BenchmarkSpec, BenchmarkSuite

# One spec per step budget, tagged so stress results can be filtered later
specs = []
for step_count in [10, 25, 50, 100]:
    specs.append(BenchmarkSpec(
        name=f"stress-{step_count}-steps",
        description=f"Stress test with {step_count} max steps",
        scenario="GEN-001",
        personas=("RC", "BC", "QR"),
        workflow_graph="base_sequence",
        tags=("stress",),
    ))

stress_suite = BenchmarkSuite(
    name="stress-tests",
    specs=tuple(specs),
)
```
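Running and persisting a stress suite uses the same calls as the baseline workflow; a short usage sketch (the output path is illustrative):

```python
# Execute the stress suite with the deterministic mock client and save the results
runner = BenchmarkRunner(mock=True)
stress_results = runner.run_suite(stress_suite)
runner.serialize_results(stress_results, "benchmarks/stress_results.yaml", fmt="yaml")
```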
Regression Detection¶
Compare experimental conditions against baselines:
```python
# Compare a stored baseline against a freshly run candidate condition
baseline = BenchmarkRunner.load_results("benchmarks/baseline_results.yaml")
candidate = runner.run_suite(suite)

comparison = runner.compare(baseline, candidate, threshold=0.05)
print(f"Regressions: {len(comparison.regressions)}")
print(f"Improvements: {len(comparison.improvements)}")
```
A regression is flagged when a higher-is-better dimension drops by more than the threshold, or a lower-is-better dimension increases by more than the threshold.
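In other words, the rule reduces to a per-dimension sign check. The sketch below is an illustrative reimplementation of that rule, not the code inside `runner.compare`, and it treats the threshold as an absolute delta:

```python
# Dimensions where larger values are better; cost and latency are the inverse.
HIGHER_IS_BETTER = {"efficacy", "assurance", "reliability", "coverage", "explainability"}
LOWER_IS_BETTER = {"cost", "latency_ms"}

def is_regression(dimension: str, baseline: float, candidate: float, threshold: float = 0.05) -> bool:
    """Flag a regression when the candidate moves in the wrong direction by more than threshold."""
    if dimension in HIGHER_IS_BETTER:
        return (baseline - candidate) > threshold   # quality dropped
    if dimension in LOWER_IS_BETTER:
        return (candidate - baseline) > threshold   # cost or latency grew
    raise ValueError(f"unknown dimension: {dimension}")

# Efficacy falling from 0.80 to 0.70 exceeds a 0.05 threshold, so it is flagged.
assert is_regression("efficacy", baseline=0.80, candidate=0.70)
```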
Persisting Results for Publication¶
Save results in both YAML (human-readable) and JSON (machine-parseable):
```python
# `results` is the list returned by a prior runner.run_suite(...) call
runner.serialize_results(results, "experiments/exp1_results.yaml", fmt="yaml")
runner.serialize_results(results, "experiments/exp1_results.json", fmt="json")
```
Include benchmark YAML files and result files in your paper's supplementary materials.
Model Cards for Transparency¶
Generate model cards to satisfy research transparency norms:
```python
from fcc.evaluation.card_generator import ModelCardGenerator

generator = ModelCardGenerator()
cards = generator.from_registry(registry)  # `registry`: the registry populated during experiment setup

# Write Markdown cards for supplementary materials
generator.batch_render(cards, "supplementary/model_cards/", fmt="markdown")
```
Each card documents intended use, limitations, ethical considerations, CLEAR+ metrics, and risk categories, in line with the model card recommendations of Mitchell et al. (2019) that major ML venues have adopted.
Datasheets for Dataset Documentation¶
Datasheets follow Gebru et al. (2021), covering composition, collection process, intended uses, and limitations. Include them when publishing persona-based research.
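A datasheet can be kept as a small structured file alongside the benchmark results. The sketch below writes an illustrative skeleton with the four sections named above; the schema and file path are examples, not an FCC requirement:

```python
import yaml

# Illustrative skeleton only: section names mirror Gebru et al. (2021).
datasheet = {
    "dataset": "persona-scenarios-v1",  # hypothetical dataset name
    "composition": "What the instances are, how many, and known gaps or noise.",
    "collection_process": "How scenarios and personas were authored or sourced, and by whom.",
    "intended_uses": "Evaluation of multi-agent workflows; note out-of-scope uses.",
    "limitations": "Known biases, coverage gaps, and conditions under which results may not transfer.",
}

with open("supplementary/datasheet_persona_v1.yaml", "w") as fh:
    yaml.safe_dump(datasheet, fh, sort_keys=False)
```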
FAIR Alignment¶
CLEAR+ benchmarks support FAIR principles:
| FAIR Principle | Implementation |
|---|---|
| Findable | Benchmark suites have unique names and versions |
| Accessible | Results serialised to YAML/JSON |
| Interoperable | CLEARPlusMetrics uses standard dimension names |
| Reusable | BenchmarkSpec captures full configuration for re-execution |
Related Resources¶
- Reproducibility Guide -- General FCC reproducibility
- FAIR Workflow -- FAIR-compliant research workflows
- CLEAR+ Benchmarking Guide -- Hands-on tutorial
- Guidebook Chapter 19 -- Full evaluation reference