Experimental Design with FCC¶
How to design, pre-register, and execute FCC-based experiments that can be published, reproduced, and extended. This guide complements FAIR Workflow, Reproducibility, and Statistical Analysis.
Why FCC is an experimental instrument¶
FCC was built for the following experimental properties:
- Determinism on demand -- mock mode produces bit-identical traces.
- Variable isolation -- persona, scenario, workflow, and AI provider are orthogonal axes you can vary independently.
- Event replay -- every run emits a reproducible event stream you can re-analyze.
- Structured specs -- R.I.S.C.E.A.R. encodes persona behavior in a way you can version, diff, and cite.
The experimental pipeline looks like this:
Figure 1 shows the end-to-end FCC experimental pipeline, from hypothesis through pre-registration, run matrix, trace capture, and statistical analysis back to a publishable replication pack.
```mermaid
flowchart TD
    H["Hypothesis"] --> V["Variable isolation<br/>(persona / scenario / provider / workflow)"]
    V --> P["Pre-registration<br/>(hash of specs + seeds)"]
    P --> R["Run matrix<br/>(N × M × K)"]
    R --> T["Trace capture<br/>(event bus + JSON export)"]
    T --> A["Statistical analysis"]
    A --> C["Conclusions + replication pack"]
    C -.->|iterate| H
    style H fill:#e3f2fd
    style P fill:#fff3e0
    style C fill:#e8f5e9
```
The pipeline is deliberately linear up to the trace-capture stage so that every downstream statistic is derivable from the pre-registered spec hash and the persisted event stream. Pre-registration hashes the specs plus the seed list, which means any deviation from the original matrix is detectable by a single diff of two YAML files.
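The spec hash need not be anything fancy. A minimal sketch, assuming your specs live as YAML files under one directory (the `hash_preregistration` helper is illustrative, not part of the FCC API):

```python
import hashlib
from pathlib import Path

def hash_preregistration(spec_dir: str, seeds: list[int]) -> str:
    """Hash every spec file plus the seed list into one digest.

    Any change to a persona, scenario, or seed after pre-registration
    changes this value, so a single string comparison detects drift.
    """
    digest = hashlib.sha256()
    for path in sorted(Path(spec_dir).rglob("*.yaml")):
        digest.update(path.read_bytes())
    digest.update(",".join(map(str, seeds)).encode())
    return digest.hexdigest()

# e.g. record this value inside preregistration.yaml before the first run
print(hash_preregistration("scenarios/", seeds=list(range(1, 31))))
```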
Step 1: Formulate a testable hypothesis¶
Good FCC hypotheses have these properties:
- Specific treatment: "Replacing the `evidence_gatherer` persona with `empirical_auditor` changes the CLEAR+ coverage score by at least 5 points on scenarios in the `compliance` family."
- Falsifiable: you can specify a decision rule before running.
- Scoped population: which scenarios, which providers, which workflow graph variant.
Anti-patterns:
- "AI will do better than mock" (too vague -- at what? compared to what?)
- "My persona is good" (no comparison, no measurement)
Step 2: Isolate your variables¶
FCC experiments have four orthogonal variation axes:
| Axis | Example values | How to vary it |
|---|---|---|
| Persona variant | `evidence_gatherer` vs `empirical_auditor` | Swap the persona id in the scenario `personas:` list |
| Scenario variant | `basic_fcc_cycle`, `compliance_deep_dive` | Use `ScenarioLoader.from_directory()` |
| Workflow variant | 5-node, 20-node, 24-node, 55-node | Set `scenario.workflow_graph_id` |
| AI provider | mock, anthropic, openai, ollama, litellm | `SimulationEngine(..., mode=...)` |
Rule of thumb: vary one axis at a time for the main result, then run a small factorial for the ablations.
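In code, the run matrix is just a cartesian product over whichever axes you chose to vary. A sketch using the standard library (the loop body is a placeholder; engine wiring is shown in Step 5):

```python
from itertools import product

personas = ["evidence_gatherer", "empirical_auditor"]  # main comparison axis
providers = ["mock", "anthropic"]                      # ablation axis
seeds = range(1, 31)                                   # 30 runs per cell

# 2 personas x 2 providers x 30 seeds = 120 runs
# (the two-by-two row in the Step 6 table)
run_matrix = list(product(personas, providers, seeds))

for persona, provider, seed in run_matrix:
    ...  # load the scenario variant and invoke the engine (see Step 5)
```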
Step 3: Choose deterministic vs AI-powered mode¶
Reproducibility is a spectrum:
Figure 2 illustrates the three-point reproducibility spectrum available to an FCC experimenter, from bit-identical mock mode through deterministic AI mode to fully stochastic sampling.
```mermaid
flowchart LR
    M["mock mode<br/>bit-identical"] --> S["AI mode +<br/>temperature=0 +<br/>fixed seed"]
    S --> P["AI mode +<br/>temperature>0<br/>(stochastic)"]
    style M fill:#c8e6c9
    style S fill:#fff9c4
    style P fill:#ffcdd2
```
Choose the mode that matches the claim you want to make. Regression tests and methodology checks belong at the green end, where reruns must produce identical traces byte-for-byte; distributional claims belong at the red end, where you explicitly budget for N >= 30 runs per cell.
- Mock mode is the gold standard for methodology validation (does the harness emit the right events?) and for CI regression tests.
- AI mode, temperature=0 gives you functional reproducibility modulo provider-side model updates. Always log the exact model string.
- AI mode, temperature>0 is appropriate for exploratory studies. Plan for N >= 30 runs per cell to characterize the distribution.
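In config terms, the three points on the spectrum map onto the `ai_config` block from the Step 4 scenario schema. The key names below follow that example; treat them as illustrative rather than a definitive schema:

```yaml
# Green end: bit-identical, no network calls
ai_config:
  provider: mock
---
# Yellow: reproducible modulo provider-side model updates
ai_config:
  provider: anthropic
  model: claude-opus-4-6   # always log the exact model string
  temperature: 0
  seed: 42
---
# Red end: stochastic; budget N >= 30 runs per cell
ai_config:
  provider: anthropic
  model: claude-opus-4-6
  temperature: 0.7
```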
Step 4: Scenarios as experimental units¶
Treat each scenario as one experimental unit. A scenario specifies:
- Personas involved
- Workflow graph
- Input data
- Success criteria (quality gates)
A well-designed experimental scenario includes:
```yaml
id: experiment_001_baseline
title: "Baseline: default evidence_gatherer"
personas:
  - evidence_gatherer
  - solution_architect
  - critique_lead
workflow_graph_id: "basic_5_node"
setup:
  ai_config:
    provider: anthropic
    model: claude-opus-4-6
    temperature: 0
    max_tokens: 2048
    seed: 42
inputs:
  - type: prompt
    content: "Draft a compliance outline for [topic]."
quality_gates:
  - coverage_min_0_7
  - citation_ratio_min_0_3
```
Duplicate the scenario with only the persona swapped for your treatment arm. This lets you run `experiment_001_baseline` and `experiment_001_treatment` as a paired comparison.
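Under that convention, the treatment arm copies the baseline verbatim except for the id, title, and the swapped persona:

```yaml
id: experiment_001_treatment
title: "Treatment: empirical_auditor swap"
personas:
  - empirical_auditor      # the only substantive change vs the baseline
  - solution_architect
  - critique_lead
workflow_graph_id: "basic_5_node"
# setup, inputs, and quality_gates identical to experiment_001_baseline
```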
Step 5: Capture traces and events¶
The trace object is your raw data. Persist every run:
```python
from fcc.messaging.serialization import EventSerializer

# `bus` and `engine` come from your normal setup (EventBus, SimulationEngine).
# Subscribe the recorder BEFORE running so no events are lost.
recorder = EventSerializer.file_recorder(path="runs/exp_001_baseline.jsonl")
bus.subscribe(recorder)

trace = engine.run(scenario)
trace.to_json(path="runs/exp_001_baseline_trace.json")
```
The paired files (`exp_001_baseline.jsonl` and `exp_001_baseline_trace.json`) are enough to re-derive every downstream statistic.
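Re-analysis then starts from the persisted JSONL rather than a live engine. A sketch, assuming one JSON-serialized event per line with a `type` field (the exact event schema depends on your FCC version):

```python
import json
from collections import Counter
from pathlib import Path

# Replay the persisted event stream -- no engine, no network calls
events = [
    json.loads(line)
    for line in Path("runs/exp_001_baseline.jsonl").read_text().splitlines()
    if line.strip()
]

# Example re-analysis: how many events of each type did the run emit?
# (the "type" field name is an assumption about the serialized schema)
print(Counter(event.get("type", "unknown") for event in events))
```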
Step 6: Power analysis¶
Given the 147-persona catalog, full-factorial experiments explode quickly. Useful heuristics:
| Design | N cells | N runs per cell | Typical total |
|---|---|---|---|
| One persona swap | 2 | 30 | 60 |
| Two-by-two (persona x provider) | 4 | 30 | 120 |
| Workflow ablation (4 graphs) | 4 | 30 | 120 |
| Full vertical sweep (6 verticals x 5 scenarios) | 30 | 10 | 300 |
| Full catalog audit (147 personas) | 147 | 5 | 735 |
For a paired two-arm comparison on CLEAR+ scores with an expected effect size of d = 0.5, 30 runs per arm gives you roughly 80% power at alpha = 0.05 (about 34 runs per arm hits 80% exactly; an unpaired design needs roughly 64 per arm).
When running 100+ cells, budget for multiple-comparisons correction -- see statistical-analysis.md.
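The runs-per-cell heuristic is easy to verify; a sketch using statsmodels to solve for the per-arm N at 80% power:

```python
import math

from statsmodels.stats.power import TTestIndPower, TTestPower

# Paired design: effect size d is measured on the per-cell differences
n_paired = TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)

# Unpaired (independent samples) design, for comparison
n_indep = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)

print(f"paired: {math.ceil(n_paired)} runs/arm, "
      f"independent: {math.ceil(n_indep)} runs/arm")
# -> paired: 34 runs/arm, independent: 64 runs/arm
```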
Step 7: Pre-registration template¶
Before you run anything, save this block as preregistration.yaml in your
experiment directory, commit to Git, and share the commit hash:
```yaml
experiment_id: "fcc_exp_YYYY_MM_DD_author"
title: "One-line description"
fcc_version: "1.3.3"
hypothesis: |
  Specific, testable claim with operational definitions.
primary_outcome:
  metric: "CLEAR+ coverage score"
  how_computed: "Average over scenarios in `compliance` family"
  decision_rule: "Treatment wins if mean diff > 5 AND p < 0.05"
secondary_outcomes:
  - metric: "latency_ms"
  - metric: "citation_ratio"
arms:
  - id: baseline
    persona_override: evidence_gatherer
  - id: treatment
    persona_override: empirical_auditor
scenarios:
  - compliance_soc2
  - compliance_iso27001
  - compliance_hipaa
provider_matrix:
  - {provider: anthropic, model: claude-opus-4-6, temperature: 0}
  - {provider: mock}
runs_per_cell: 30
seeds: [1, 2, 3, ..., 30]
analysis_plan: |
  Paired t-test, bootstrap 95% CI, Holm-Bonferroni across 3 scenarios.
stopping_rule: "Sequential testing not used; run full matrix."
```
Step 8: The replication pack¶
For publication, ship a replication pack containing:
- `preregistration.yaml` (the pre-reg above)
- `scenarios/` (all scenario YAMLs used)
- `personas/` (any custom personas)
- `runs/` (traces + event logs)
- `analysis.ipynb` (the notebook that produced every figure)
- `environment.yaml` (exact dependency versions)
- `README.md` (one-command reproduction: `make replicate`)
The 27 publishing artifacts under `publications/_output/` show how FCC itself documents this pattern.
Worked example: comparing two champions¶
Figure 3 traces a concrete paired experiment comparing a baseline and a treatment persona across three compliance scenarios, 30 seeds per arm, and a post-hoc paired t-test with bootstrap confidence interval.
```mermaid
sequenceDiagram
    participant E as Experimenter
    participant S as ScenarioLoader
    participant En as SimulationEngine
    participant B as EventBus
    participant A as Analyzer
    E->>S: load 3 compliance scenarios
    loop for persona in [baseline, treatment]
        loop for seed in 1..30
            E->>En: run(scenario, persona, seed)
            En->>B: emit 40-60 events
            B-->>A: store to JSONL
        end
    end
    E->>A: paired t-test, bootstrap CI
    A-->>E: treatment_delta = +7.3, 95% CI [+2.1, +11.8]
```
Total run budget for this design is 180 engine invocations (3 scenarios x 2 arms x 30 seeds), matched to the ~80% power heuristic for Cohen's d = 0.5. The reported delta and CI are the only line items that belong in an abstract -- the raw event stream and the scenario manifests are what a reviewer needs to reproduce them.
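The analysis step at the bottom of the diagram is a few lines of SciPy and NumPy. A sketch, assuming you have already extracted one aligned CLEAR+ score per (scenario, seed) cell into two arrays (the CSV paths are hypothetical stand-ins for your own score extraction):

```python
import numpy as np
from scipy import stats

# One CLEAR+ score per (scenario, seed) cell, aligned by seed across arms
baseline = np.loadtxt("analysis/baseline_scores.csv")
treatment = np.loadtxt("analysis/treatment_scores.csv")

# Paired t-test on the aligned cells (apply Holm-Bonferroni across
# scenarios afterwards -- see statistical-analysis.md)
t_stat, p_value = stats.ttest_rel(treatment, baseline)

# Bootstrap 95% CI on the mean paired difference
rng = np.random.default_rng(42)
diff = treatment - baseline
boot_means = rng.choice(diff, size=(10_000, diff.size)).mean(axis=1)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"delta = {diff.mean():+.1f}, "
      f"95% CI [{ci_low:+.1f}, {ci_high:+.1f}], p = {p_value:.4f}")
```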
Checklist before hitting "run"¶
- Hypothesis is falsifiable and has a decision rule
- Only one axis varied in the main comparison
- Seeds are explicit, committed, and reproducible
- Pre-registration YAML committed to Git before first run
- Event recorder subscribed before `engine.run()`
- Scenarios are in version control
- Runs-per-cell justified by power analysis
- Multiple-comparisons plan in place
- Replication pack skeleton exists
Related resources¶
- Statistical analysis -- how to interpret the output of your experimental runs
- FAIR workflow -- making your data and outputs Findable, Accessible, Interoperable, Reusable
- Reproducibility -- environment pinning and seed management
- Research methodology -- FCC as a research instrument
- Notebook `19_evaluation_and_compliance.ipynb` -- CLEAR+ benchmarks hands-on
- `src/fcc/evaluation/benchmark.py` -- BenchmarkSpec / BenchmarkSuite APIs