Open Science -- Phase 14 Addendum

This addendum extends the Open Science Demo with Phase 14 evaluation and compliance features that support reproducible, transparent scientific research.


New Research Capabilities

CLEAR+ Benchmarks for Research Reproducibility

Phase 14 adds structured benchmarking to the open science workflow. Researchers can now include CLEAR+ benchmark results alongside their simulation outputs, providing quantitative evidence of agent system quality.

from fcc.evaluation.runner import BenchmarkRunner
from fcc.evaluation.benchmark import BenchmarkSuite

# Run benchmarks as part of the research protocol
suite = BenchmarkSuite.from_yaml("src/fcc/data/evaluation/baseline_benchmarks.yaml")
runner = BenchmarkRunner(mock=True)
results = runner.run_suite(suite)

# Serialise for supplementary materials
runner.serialize_results(results, "supplementary/benchmarks.yaml")

Model Cards for Research Transparency

Model cards satisfy the transparency requirements of major ML publication venues. Generate cards as part of your research documentation pipeline:

from fcc.evaluation.card_generator import ModelCardGenerator

generator = ModelCardGenerator()
cards = generator.from_registry(registry, benchmarks=benchmark_map)
generator.batch_render(cards, "supplementary/model_cards/", fmt="markdown")

Each card includes:

  • Intended use and out-of-scope uses
  • R.I.S.C.E.A.R. specification summary
  • CLEAR+ evaluation metrics
  • Ethical considerations and limitations
  • EU AI Act risk category


FAIR-Aligned Evaluation

Findable

Benchmark suites have unique names and versions. Results are serialised with ISO 8601 timestamps for temporal ordering.
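A useful property of ISO 8601 timestamps is that they sort lexicographically in chronological order, so serialised result files can be ordered with a plain string sort. A minimal illustration in plain Python (not part of the fcc API):

```python
from datetime import datetime, timezone

# ISO 8601 strings compare lexicographically in chronological order,
# so result records stamped this way sort correctly as plain strings.
stamps = [
    datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc).isoformat(),
    datetime(2023, 11, 15, 9, 30, tzinfo=timezone.utc).isoformat(),
    datetime(2024, 1, 7, 18, 45, tzinfo=timezone.utc).isoformat(),
]

ordered = sorted(stamps)  # string sort doubles as temporal sort
print(ordered[0])  # earliest run first
```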

Accessible

Results are stored as YAML or JSON in standard file formats accessible by any programming language.
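To illustrate the accessibility point, the sketch below round-trips a result record through JSON using only the standard library. The record structure shown here is hypothetical; the real schema is whatever BenchmarkRunner.serialize_results emits.

```python
import json
import tempfile
import os

# Hypothetical result record -- illustrative only; the actual schema
# is produced by BenchmarkRunner.serialize_results and may differ.
results = {
    "suite": "baseline_benchmarks",
    "version": "1.0",
    "metrics": {"efficacy": 1.0, "coverage": 1.0},
}

path = os.path.join(tempfile.gettempdir(), "benchmarks.json")
with open(path, "w") as f:
    json.dump(results, f, indent=2)

# Any language with a JSON parser can read the file back.
with open(path) as f:
    loaded = json.load(f)
```

The same round-trip works for YAML with any standard YAML library.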

Interoperable

CLEARPlusMetrics uses standardised dimension names. The as_vector() method provides a numeric representation for statistical analysis.

Reusable

BenchmarkSpec captures the complete configuration needed to reproduce a benchmark run: scenario, personas, thresholds, workflow graph, and tags.


Datasheets for Dataset Documentation

The open science demo now includes datasheet generation for persona datasets, following Gebru et al. (2021), "Datasheets for Datasets":

datasheet = generator.generate_datasheet("FCC Persona Dataset", registry)

Include datasheets when publishing research that uses FCC persona configurations as experimental inputs.


Compliance for Research Ethics

IRB and Ethics Board Documentation

The compliance module generates documentation suitable for Institutional Review Board (IRB) submissions:

  • Risk classification reports show which personas involve HIGH-risk decision-making
  • Evidence graphs provide traceable audit trails
  • Model cards document ethical considerations per persona

EU AI Act Alignment for EU-Funded Research

EU-funded research projects that develop or deploy AI systems must comply with the EU AI Act. The dual-regulation audit provides structured evidence:

from fcc.compliance.auditor import ComplianceAuditor

auditor = ComplianceAuditor(
    requirement_registry=req_registry,
    classifier=classifier,
    constitution_registry=const_reg,
)
eu_report, nist_report = auditor.dual_regulation_audit(registry)

Demo Walkthrough Updates

The open science demo now includes two additional steps:

Step 5: Benchmark Documentation

  • Runs a baseline benchmark suite
  • Generates a summary table of CLEAR+ dimensions
  • Saves results for supplementary materials

Step 6: Compliance Documentation

  • Classifies all research-relevant personas by risk tier
  • Generates model cards with ethical considerations
  • Produces a datasheet for the persona dataset
  • Creates an evidence graph for ethics board review


Statistical Reporting Template

For research publications, use this template for reporting CLEAR+ results:

Table N: CLEAR+ Benchmark Results (mean +/- std, n=10 runs)

| Spec              | Cost    | Latency | Efficacy   | Coverage   |
|-------------------|---------|---------|------------|------------|
| core-5-persona    | 0 +/- 0 | 0 +/- 0 | 1.00 +/- 0 | 1.00 +/- 0 |
| extended-workflow | 0 +/- 0 | 0 +/- 0 | 1.00 +/- 0 | 1.00 +/- 0 |

Report all seven CLEAR+ dimensions (the table above shows four for brevity). Include the benchmark suite YAML in supplementary materials so readers can reproduce the runs.
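The "mean +/- std" cells can be computed with the standard library alone. The sketch below uses hypothetical per-run efficacy scores (illustrative data, not real benchmark output) and a small helper that is not part of fcc:

```python
from statistics import mean, stdev

# Hypothetical efficacy scores from n=10 runs -- illustrative only.
runs = [1.00, 0.98, 1.00, 0.99, 1.00, 1.00, 0.97, 1.00, 0.99, 1.00]

def cell(values):
    """Format a 'mean +/- std' table cell from repeated measurements."""
    return f"{mean(values):.2f} +/- {stdev(values):.2f}"

# Emit one table row in the template's format.
print(f"| core-5-persona | {cell(runs)} |")
```

Note that stdev computes the sample standard deviation (n-1 denominator), which is the usual choice when reporting over repeated runs.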


Tips

  • Run benchmarks with mock=True for deterministic, reproducible results
  • Include both YAML suite definitions and result files in supplementary materials
  • Generate model cards and datasheets as standard practice for ML research publications
  • Use the NIST crosswalk for US-funded research compliance