Experimental Design with FCC¶
How to design, pre-register, and execute FCC-based experiments that can be published, reproduced, and extended. This guide complements FAIR Workflow, Reproducibility, and Statistical Analysis.
Why FCC is an experimental instrument¶
FCC was built for the following experimental properties:
- Determinism on demand -- mock mode produces bit-identical traces.
- Variable isolation -- persona, scenario, workflow, and AI provider are orthogonal axes you can vary independently.
- Event replay -- every run emits a reproducible event stream you can re-analyze.
- Structured specs -- R.I.S.C.E.A.R. encodes persona behavior in a way you can version, diff, and cite.
The experimental pipeline looks like this:
Figure 1 shows the end-to-end FCC experimental pipeline, from hypothesis through pre-registration, run matrix, trace capture, and statistical analysis back to a publishable replication pack.
```mermaid
flowchart TD
    H["Hypothesis"] --> V["Variable isolation<br/>(persona / scenario / provider / workflow)"]
    V --> P["Pre-registration<br/>(hash of specs + seeds)"]
    P --> R["Run matrix<br/>(N × M × K)"]
    R --> T["Trace capture<br/>(event bus + JSON export)"]
    T --> A["Statistical analysis"]
    A --> C["Conclusions + replication pack"]
    C -.->|iterate| H
    style H fill:#e3f2fd
    style P fill:#fff3e0
    style C fill:#e8f5e9
```
The pipeline is deliberately linear up to the trace-capture stage so that every downstream statistic is derivable from the pre-registered spec hash and the persisted event stream. Pre-registration hashes the specs plus the seed list, which means any deviation from the original matrix is detectable by a single diff of two YAML files.
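The spec hash need not be anything fancy. A minimal sketch, assuming your specs live as YAML files under one directory (the `hash_preregistration` helper is illustrative, not part of the FCC API):

```python
import hashlib
from pathlib import Path

def hash_preregistration(spec_dir: str, seeds: list[int]) -> str:
    """Hash every spec file plus the seed list into one digest.

    Any change to a persona, scenario, or seed after pre-registration
    changes this value, so a single string comparison detects drift.
    """
    digest = hashlib.sha256()
    for path in sorted(Path(spec_dir).rglob("*.yaml")):
        digest.update(path.read_bytes())
    digest.update(",".join(map(str, seeds)).encode())
    return digest.hexdigest()

# e.g. record this value inside preregistration.yaml before the first run
print(hash_preregistration("scenarios/", seeds=list(range(1, 31))))
```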
Step 1: Formulate a testable hypothesis¶
Good FCC hypotheses have these properties:
- Specific treatment: "Replacing the `evidence_gatherer` persona with `empirical_auditor` changes the CLEAR+ coverage score by at least 5 points on scenarios in the `compliance` family."
- Falsifiable: you can specify a decision rule before running.
- Scoped population: which scenarios, which providers, which workflow graph variant.
Anti-patterns:
- "AI will do better than mock" (too vague -- at what? compared to what?)
- "My persona is good" (no comparison, no measurement)
Step 2: Isolate your variables¶
FCC experiments have four orthogonal variation axes:
| Axis | Example values | How to vary it |
|---|---|---|
| Persona variant | `evidence_gatherer` vs `empirical_auditor` | Swap the persona id in the scenario `personas:` list |
| Scenario variant | `basic_fcc_cycle`, `compliance_deep_dive` | Use `ScenarioLoader.from_directory()` |
| Workflow variant | 5-node, 20-node, 24-node, 55-node | Set `scenario.workflow_graph_id` |
| AI provider | mock, anthropic, openai, ollama, litellm | `SimulationEngine(..., mode=...)` |
Rule of thumb: vary one axis at a time for the main result, then run a small factorial for the ablations.
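In code, the run matrix is just a cartesian product over whichever axes you chose to vary. A sketch using the standard library (the loop body is a placeholder; engine wiring is shown in Step 5):

```python
from itertools import product

personas = ["evidence_gatherer", "empirical_auditor"]  # main comparison axis
providers = ["mock", "anthropic"]                      # ablation axis
seeds = range(1, 31)                                   # 30 runs per cell

# 2 personas x 2 providers x 30 seeds = 120 runs
# (the two-by-two row in the Step 6 table)
run_matrix = list(product(personas, providers, seeds))

for persona, provider, seed in run_matrix:
    ...  # load the scenario variant and invoke the engine (see Step 5)
```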
Step 3: Choose deterministic vs AI-powered mode¶
Reproducibility is a spectrum:
Figure 2 illustrates the three-point reproducibility spectrum available to an FCC experimenter, from bit-identical mock mode through deterministic AI mode to fully stochastic sampling.
```mermaid
flowchart LR
    M["mock mode<br/>bit-identical"] --> S["AI mode +<br/>temperature=0 +<br/>fixed seed"]
    S --> P["AI mode +<br/>temperature>0<br/>(stochastic)"]
    style M fill:#c8e6c9
    style S fill:#fff9c4
    style P fill:#ffcdd2
```
Choose the mode that matches the claim you want to make. Regression tests and methodology checks belong at the green end, where reruns must produce identical traces byte-for-byte; distributional claims belong at the red end, where you explicitly budget for N >= 30 runs per cell.
- Mock mode is the gold standard for methodology validation (does the harness emit the right events?) and for CI regression tests.
- AI mode, temperature=0 gives you functional reproducibility modulo provider-side model updates. Always log the exact model string.
- AI mode, temperature>0 is appropriate for exploratory studies. Plan for N >= 30 runs per cell to characterize the distribution.
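In config terms, the three points on the spectrum map onto the `ai_config` block from the Step 4 scenario schema. The key names below follow that example; treat them as illustrative rather than a definitive schema:

```yaml
# Green end: bit-identical, no network calls
ai_config:
  provider: mock
---
# Yellow: reproducible modulo provider-side model updates
ai_config:
  provider: anthropic
  model: claude-opus-4-6   # always log the exact model string
  temperature: 0
  seed: 42
---
# Red end: stochastic; budget N >= 30 runs per cell
ai_config:
  provider: anthropic
  model: claude-opus-4-6
  temperature: 0.7
```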
Step 4: Scenarios as experimental units¶
Treat each scenario as one experimental unit. A scenario specifies:
- Personas involved
- Workflow graph
- Input data
- Success criteria (quality gates)
A well-designed experimental scenario includes:
```yaml
id: experiment_001_baseline
title: "Baseline: default evidence_gatherer"
personas:
  - evidence_gatherer
  - solution_architect
  - critique_lead
workflow_graph_id: "basic_5_node"
setup:
  ai_config:
    provider: anthropic
    model: claude-opus-4-6
    temperature: 0
    max_tokens: 2048
    seed: 42
inputs:
  - type: prompt
    content: "Draft a compliance outline for [topic]."
quality_gates:
  - coverage_min_0_7
  - citation_ratio_min_0_3
```
Duplicate the scenario with only the persona swapped for your treatment arm. This lets you run `experiment_001_baseline` and `experiment_001_treatment` as a paired comparison.
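Under that convention, the treatment arm copies the baseline verbatim except for the id, title, and the swapped persona:

```yaml
id: experiment_001_treatment
title: "Treatment: empirical_auditor swap"
personas:
  - empirical_auditor      # the only substantive change vs the baseline
  - solution_architect
  - critique_lead
workflow_graph_id: "basic_5_node"
# setup, inputs, and quality_gates identical to experiment_001_baseline
```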
Step 5: Capture traces and events¶
The trace object is your raw data. Persist every run:
```python
from fcc.messaging.serialization import EventSerializer

# `bus` and `engine` come from your normal setup (EventBus, SimulationEngine).
# Subscribe the recorder BEFORE running so no events are lost.
recorder = EventSerializer.file_recorder(path="runs/exp_001_baseline.jsonl")
bus.subscribe(recorder)

trace = engine.run(scenario)
trace.to_json(path="runs/exp_001_baseline_trace.json")
```
The paired files (`exp_001_baseline.jsonl` and `exp_001_baseline_trace.json`) are enough to re-derive every downstream statistic.
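Re-analysis then starts from the persisted JSONL rather than a live engine. A sketch, assuming one JSON-serialized event per line with a `type` field (the exact event schema depends on your FCC version):

```python
import json
from collections import Counter
from pathlib import Path

# Replay the persisted event stream -- no engine, no network calls
events = [
    json.loads(line)
    for line in Path("runs/exp_001_baseline.jsonl").read_text().splitlines()
    if line.strip()
]

# Example re-analysis: how many events of each type did the run emit?
# (the "type" field name is an assumption about the serialized schema)
print(Counter(event.get("type", "unknown") for event in events))
```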
Step 6: Power analysis¶
Given the 147-persona catalog, full-factorial experiments explode quickly. Useful heuristics:
| Design | N cells | N runs per cell | Typical total |
|---|---|---|---|
| One persona swap | 2 | 30 | 60 |
| Two-by-two (persona x provider) | 4 | 30 | 120 |
| Workflow ablation (4 graphs) | 4 | 30 | 120 |
| Full vertical sweep (6 verticals x 5 scenarios) | 30 | 10 | 300 |
| Full catalog audit (147 personas) | 147 | 5 | 735 |
For a paired two-arm comparison on CLEAR+ scores with an expected effect size of d = 0.5, 30 runs per arm gives you roughly 80% power at alpha = 0.05 (about 34 runs per arm hits 80% exactly; an unpaired design needs roughly 64 per arm).
When running 100+ cells, budget for multiple-comparisons correction -- see statistical-analysis.md.
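The runs-per-cell heuristic is easy to verify; a sketch using statsmodels to solve for the per-arm N at 80% power:

```python
import math

from statsmodels.stats.power import TTestIndPower, TTestPower

# Paired design: effect size d is measured on the per-cell differences
n_paired = TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)

# Unpaired (independent samples) design, for comparison
n_indep = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)

print(f"paired: {math.ceil(n_paired)} runs/arm, "
      f"independent: {math.ceil(n_indep)} runs/arm")
# -> paired: 34 runs/arm, independent: 64 runs/arm
```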
Step 7: Pre-registration template¶
Before you run anything, save this block as preregistration.yaml in your
experiment directory, commit to Git, and share the commit hash:
```yaml
experiment_id: "fcc_exp_YYYY_MM_DD_author"
title: "One-line description"
fcc_version: "1.3.3"
hypothesis: |
  Specific, testable claim with operational definitions.
primary_outcome:
  metric: "CLEAR+ coverage score"
  how_computed: "Average over scenarios in `compliance` family"
  decision_rule: "Treatment wins if mean diff > 5 AND p < 0.05"
secondary_outcomes:
  - metric: "latency_ms"
  - metric: "citation_ratio"
arms:
  - id: baseline
    persona_override: evidence_gatherer
  - id: treatment
    persona_override: empirical_auditor
scenarios:
  - compliance_soc2
  - compliance_iso27001
  - compliance_hipaa
provider_matrix:
  - {provider: anthropic, model: claude-opus-4-6, temperature: 0}
  - {provider: mock}
runs_per_cell: 30
seeds: [1, 2, 3, ..., 30]
analysis_plan: |
  Paired t-test, bootstrap 95% CI, Holm-Bonferroni across 3 scenarios.
stopping_rule: "Sequential testing not used; run full matrix."
```
Step 8: The replication pack¶
For publication, ship a replication pack containing:
- `preregistration.yaml` (the pre-reg above)
- `scenarios/` (all scenario YAMLs used)
- `personas/` (any custom personas)
- `runs/` (traces + event logs)
- `analysis.ipynb` (the notebook that produced every figure)
- `environment.yaml` (exact dependency versions)
- `README.md` (one-command reproduction: `make replicate`)
The 27 publishing artifacts under `publications/_output/` show how FCC itself documents this pattern.
Worked example: comparing two champions¶
Figure 3 traces a concrete paired experiment comparing a baseline and a treatment persona across three compliance scenarios, 30 seeds per arm, and a post-hoc paired t-test with bootstrap confidence interval.
```mermaid
sequenceDiagram
    participant E as Experimenter
    participant S as ScenarioLoader
    participant En as SimulationEngine
    participant B as EventBus
    participant A as Analyzer
    E->>S: load 3 compliance scenarios
    loop for persona in [baseline, treatment]
        loop for seed in 1..30
            E->>En: run(scenario, persona, seed)
            En->>B: emit 40-60 events
            B-->>A: store to JSONL
        end
    end
    E->>A: paired t-test, bootstrap CI
    A-->>E: treatment_delta = +7.3, 95% CI [+2.1, +11.8]
```
Total run budget for this design is 180 engine invocations (3 scenarios x 2 arms x 30 seeds), matched to the ~80% power heuristic for Cohen's d = 0.5. The reported delta and CI are the only line items that belong in an abstract -- the raw event stream and the scenario manifests are what a reviewer needs to reproduce them.
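The analysis step at the bottom of the diagram is a few lines of SciPy and NumPy. A sketch, assuming you have already extracted one aligned CLEAR+ score per (scenario, seed) cell into two arrays (the CSV paths are hypothetical stand-ins for your own score extraction):

```python
import numpy as np
from scipy import stats

# One CLEAR+ score per (scenario, seed) cell, aligned by seed across arms
baseline = np.loadtxt("analysis/baseline_scores.csv")
treatment = np.loadtxt("analysis/treatment_scores.csv")

# Paired t-test on the aligned cells (apply Holm-Bonferroni across
# scenarios afterwards -- see statistical-analysis.md)
t_stat, p_value = stats.ttest_rel(treatment, baseline)

# Bootstrap 95% CI on the mean paired difference
rng = np.random.default_rng(42)
diff = treatment - baseline
boot_means = rng.choice(diff, size=(10_000, diff.size)).mean(axis=1)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"delta = {diff.mean():+.1f}, "
      f"95% CI [{ci_low:+.1f}, {ci_high:+.1f}], p = {p_value:.4f}")
```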
Checklist before hitting "run"¶
- Hypothesis is falsifiable and has a decision rule
- Only one axis varied in the main comparison
- Seeds are explicit, committed, and reproducible
- Pre-registration YAML committed to Git before first run
- Event recorder subscribed before `engine.run()`
- Scenarios are in version control
- Runs-per-cell justified by power analysis
- Multiple-comparisons plan in place
- Replication pack skeleton exists
Related resources¶
- Statistical analysis -- how to interpret the output of your experimental runs
- FAIR workflow -- making your data and outputs Findable, Accessible, Interoperable, Reusable
- Reproducibility -- environment pinning and seed management
- Research methodology -- FCC as a research instrument
- Notebook `19_evaluation_and_compliance.ipynb` -- CLEAR+ benchmarks hands-on
- `src/fcc/evaluation/benchmark.py` -- BenchmarkSpec / BenchmarkSuite APIs