Open Science -- Phase 14 Addendum

This addendum extends the Open Science Demo with Phase 14 evaluation and compliance features that support reproducible, transparent scientific research.


New Research Capabilities

CLEAR+ Benchmarks for Research Reproducibility

Phase 14 adds structured benchmarking to the open science workflow. Researchers can now include CLEAR+ benchmark results alongside their simulation outputs, providing quantitative evidence of agent system quality.

from fcc.evaluation.runner import BenchmarkRunner
from fcc.evaluation.benchmark import BenchmarkSuite

# Run benchmarks as part of the research protocol
suite = BenchmarkSuite.from_yaml("src/fcc/data/evaluation/baseline_benchmarks.yaml")
runner = BenchmarkRunner(mock=True)
results = runner.run_suite(suite)

# Serialise for supplementary materials
runner.serialize_results(results, "supplementary/benchmarks.yaml")

Model Cards for Research Transparency

Model cards satisfy the transparency requirements of major ML publication venues. Generate cards as part of your research documentation pipeline:

from fcc.evaluation.card_generator import ModelCardGenerator

generator = ModelCardGenerator()
cards = generator.from_registry(registry, benchmarks=benchmark_map)
generator.batch_render(cards, "supplementary/model_cards/", fmt="markdown")

Each card includes:

  • Intended use and out-of-scope uses
  • R.I.S.C.E.A.R. specification summary
  • CLEAR+ evaluation metrics
  • Ethical considerations and limitations
  • EU AI Act risk category


FAIR-Aligned Evaluation

Findable

Benchmark suites have unique names and versions. Results are serialised with ISO 8601 timestamps for temporal ordering.
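A useful property of ISO 8601 timestamps is that they sort lexicographically in chronological order, so serialised result files can be ordered with a plain string sort. A minimal illustration in plain Python (not part of the fcc API):

```python
from datetime import datetime, timezone

# ISO 8601 strings compare lexicographically in chronological order,
# so result records stamped this way sort correctly as plain strings.
stamps = [
    datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc).isoformat(),
    datetime(2023, 11, 15, 9, 30, tzinfo=timezone.utc).isoformat(),
    datetime(2024, 1, 7, 18, 45, tzinfo=timezone.utc).isoformat(),
]

ordered = sorted(stamps)  # string sort doubles as temporal sort
print(ordered[0])  # earliest run first
```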

Accessible

Results are stored as YAML or JSON in standard file formats accessible by any programming language.
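To illustrate the accessibility point, the sketch below round-trips a result record through JSON using only the standard library. The record structure shown here is hypothetical; the real schema is whatever BenchmarkRunner.serialize_results emits.

```python
import json
import tempfile
import os

# Hypothetical result record -- illustrative only; the actual schema
# is produced by BenchmarkRunner.serialize_results and may differ.
results = {
    "suite": "baseline_benchmarks",
    "version": "1.0",
    "metrics": {"efficacy": 1.0, "coverage": 1.0},
}

path = os.path.join(tempfile.gettempdir(), "benchmarks.json")
with open(path, "w") as f:
    json.dump(results, f, indent=2)

# Any language with a JSON parser can read the file back.
with open(path) as f:
    loaded = json.load(f)
```

The same round-trip works for YAML with any standard YAML library.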

Interoperable

CLEARPlusMetrics uses standardised dimension names. The as_vector() method provides a numeric representation for statistical analysis.

Reusable

BenchmarkSpec captures the complete configuration needed to reproduce a benchmark run: scenario, personas, thresholds, workflow graph, and tags.


Datasheets for Dataset Documentation

The open science demo now includes datasheet generation for persona datasets, following Gebru et al. (2021), "Datasheets for Datasets":

datasheet = generator.generate_datasheet("FCC Persona Dataset", registry)

Include datasheets when publishing research that uses FCC persona configurations as experimental inputs.


Compliance for Research Ethics

IRB and Ethics Board Documentation

The compliance module generates documentation suitable for Institutional Review Board (IRB) submissions:

  • Risk classification reports show which personas involve HIGH-risk decision-making
  • Evidence graphs provide traceable audit trails
  • Model cards document ethical considerations per persona

EU AI Act Alignment for EU-Funded Research

EU-funded research projects that develop or deploy AI systems must comply with the EU AI Act. The dual-regulation audit provides structured evidence:

from fcc.compliance.auditor import ComplianceAuditor

auditor = ComplianceAuditor(
    requirement_registry=req_registry,
    classifier=classifier,
    constitution_registry=const_reg,
)
eu_report, nist_report = auditor.dual_regulation_audit(registry)

Demo Walkthrough Updates

The open science demo now includes two additional steps:

Step 5: Benchmark Documentation

  • Runs a baseline benchmark suite
  • Generates a summary table of CLEAR+ dimensions
  • Saves results for supplementary materials

Step 6: Compliance Documentation

  • Classifies all research-relevant personas by risk tier
  • Generates model cards with ethical considerations
  • Produces a datasheet for the persona dataset
  • Creates an evidence graph for ethics board review


Statistical Reporting Template

For research publications, use this template for reporting CLEAR+ results:

Table N: CLEAR+ Benchmark Results (mean +/- std, n=10 runs)

| Spec              | Cost    | Latency | Efficacy   | Coverage   |
|-------------------|---------|---------|------------|------------|
| core-5-persona    | 0 +/- 0 | 0 +/- 0 | 1.00 +/- 0 | 1.00 +/- 0 |
| extended-workflow | 0 +/- 0 | 0 +/- 0 | 1.00 +/- 0 | 1.00 +/- 0 |

Report all seven CLEAR+ dimensions (the table above shows four for brevity). Include the benchmark suite YAML in supplementary materials so readers can reproduce the runs.
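The "mean +/- std" cells can be computed with the standard library alone. The sketch below uses hypothetical per-run efficacy scores (illustrative data, not real benchmark output) and a small helper that is not part of fcc:

```python
from statistics import mean, stdev

# Hypothetical efficacy scores from n=10 runs -- illustrative only.
runs = [1.00, 0.98, 1.00, 0.99, 1.00, 1.00, 0.97, 1.00, 0.99, 1.00]

def cell(values):
    """Format a 'mean +/- std' table cell from repeated measurements."""
    return f"{mean(values):.2f} +/- {stdev(values):.2f}"

# Emit one table row in the template's format.
print(f"| core-5-persona | {cell(runs)} |")
```

Note that stdev computes the sample standard deviation (n-1 denominator), which is the usual choice when reporting over repeated runs.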


Tips

  • Run benchmarks with mock=True for deterministic, reproducible results
  • Include both YAML suite definitions and result files in supplementary materials
  • Generate model cards and datasheets as standard practice for ML research publications
  • Use the NIST crosswalk for US-funded research compliance