Reproducibility Guide

Reproducibility is foundational to any credible use of FCC -- whether in academic research, regulated enterprise deployments, or continuous integration pipelines. This guide covers the mechanisms FCC provides for producing deterministic, verifiable, and repeatable results.

Deterministic Simulation Mode

The simulation engine supports two modes: deterministic and AI-powered. Deterministic mode is the cornerstone of reproducibility.

How Deterministic Mode Works

In deterministic mode, the simulation engine uses mock response generators instead of calling an LLM. Every persona activation produces a predictable output derived from its R.I.S.C.E.A.R. specification and the input payload. The same scenario with the same registry and the same workflow graph will always produce the same trace.

from fcc._resources import get_personas_dir
from fcc.personas.registry import PersonaRegistry
from fcc.simulation.engine import SimulationEngine

registry = PersonaRegistry.from_yaml_directory(get_personas_dir())
engine = SimulationEngine(registry=registry, mode="deterministic")

# Run the same scenario twice
trace_a = engine.run_scenario("GEN-001")
trace_b = engine.run_scenario("GEN-001")

# Traces are identical
assert trace_a == trace_b

When to Use Each Mode

Mode          | Use Case                                                               | Reproducibility                     | Cost
Deterministic | Unit tests, CI/CD, baseline comparisons, controlled experiments       | Exact                               | Free
AI-powered    | Production workflows, response quality evaluation, behavioral studies | Approximate (temperature-dependent) | LLM API costs
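
The engine mode is selected at construction time. A minimal sketch of selecting each mode -- assuming the AI-powered mode string mirrors the "ai_powered" value used in the trace schema, and that LLM settings (model, temperature, prompts) are configured separately as shown later in this guide:

from fcc._resources import get_personas_dir
from fcc.personas.registry import PersonaRegistry
from fcc.simulation.engine import SimulationEngine

registry = PersonaRegistry.from_yaml_directory(get_personas_dir())

# Deterministic mode: mock responses, exact reproducibility, no API cost
det_engine = SimulationEngine(registry=registry, mode="deterministic")

# AI-powered mode: real LLM calls; the mode string is assumed to match the
# trace schema's "ai_powered" enum value
ai_engine = SimulationEngine(registry=registry, mode="ai_powered")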

Achieving Approximate Reproducibility in AI Mode

When using AI-powered mode, exact reproducibility is not guaranteed because LLM responses vary. However, you can minimize variance:

  1. Set temperature to 0. This makes the LLM's sampling nearly deterministic (some providers still introduce minor variance).
  2. Pin the model version. Use explicit model identifiers like gpt-4o-2024-08-06 instead of aliases like gpt-4o.
  3. Fix the random seed if the provider supports it. OpenAI's API accepts a seed parameter.
  4. Run multiple trials. Report mean and standard deviation across runs rather than single-run results.

For example, a prompt template configured for minimal variance:

from fcc.simulation.prompts import PromptTemplate

# Configure for maximum reproducibility
template = PromptTemplate(
    system_prompt="...",
    user_template="...",
    model="gpt-4o-2024-08-06",  # Pinned model version
    temperature=0.0,              # Minimal sampling variance
    max_tokens=1024,
)

Trace Format Documentation

Every simulation run produces a trace -- a complete record of every persona activation, message, and validation result. Traces conform to the JSON Schema at data/schemas/trace.schema.json.

Trace Schema Overview

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["scenario_id", "workflow_graph", "mode", "entries"],
  "properties": {
    "scenario_id": { "type": "string" },
    "workflow_graph": { "type": "string" },
    "mode": { "enum": ["deterministic", "ai_powered"] },
    "started_at": { "type": "string", "format": "date-time" },
    "completed_at": { "type": "string", "format": "date-time" },
    "entries": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["persona_id", "phase", "input", "output"],
        "properties": {
          "persona_id": { "type": "string" },
          "phase": { "enum": ["Find", "Create", "Critique"] },
          "input": { "type": "string" },
          "output": { "type": "string" },
          "timestamp": { "type": "string", "format": "date-time" }
        }
      }
    },
    "validation_results": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["rule", "passed"],
        "properties": {
          "rule": { "type": "string" },
          "passed": { "type": "boolean" },
          "details": { "type": "string" }
        }
      }
    }
  }
}
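
For illustration, here is a minimal trace instance that satisfies the required fields above, validated with the jsonschema library. The scenario ID, persona ID, and content values are placeholders rather than output from a real run, and the check assumes the packaged schema matches the overview shown:

import json

import jsonschema

from fcc._resources import get_schemas_dir

# Hypothetical minimal trace -- field values are illustrative only
trace = {
    "scenario_id": "GEN-001",
    "workflow_graph": "default",
    "mode": "deterministic",
    "entries": [
        {
            "persona_id": "RC",
            "phase": "Find",
            "input": "Initial scenario payload",
            "output": "Mock response derived from the persona specification",
        }
    ],
}

with open(get_schemas_dir() / "trace.schema.json") as f:
    schema = json.load(f)

jsonschema.validate(instance=trace, schema=schema)  # raises ValidationError on failure
print("Minimal example trace is valid.")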

Storing and Versioning Traces

For research reproducibility, traces should be stored alongside the code that produced them:

experiment/
  traces/
    trial_001.json
    trial_002.json
    trial_003.json
  config/
    registry_version.txt    # FCC package version
    scenario.json           # Scenario configuration
    llm_config.json         # Model, temperature, max_tokens
  analysis/
    summary.py              # Analysis script
    results.csv             # Processed results
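
The config/ files in this layout can be populated from the running environment using only the standard library. A sketch, with illustrative LLM settings:

import json
import platform
from importlib.metadata import version
from pathlib import Path

config_dir = Path("experiment/config")
config_dir.mkdir(parents=True, exist_ok=True)

# Record the installed FCC package version (persona registry, graphs, and
# schemas ship with the package) plus the Python version
pkg_version = version("fcc-agent-team-ext")
(config_dir / "registry_version.txt").write_text(
    f"fcc-agent-team-ext=={pkg_version}\npython=={platform.python_version()}\n"
)

# Record the LLM settings used for AI-powered runs (example values)
llm_config = {
    "model": "gpt-4o-2024-08-06",
    "temperature": 0.0,
    "max_tokens": 1024,
}
(config_dir / "llm_config.json").write_text(json.dumps(llm_config, indent=2))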

Version Pinning

Persona definitions, workflow graphs, quality gates, and schemas are all versioned as part of the FCC package. Pinning the package version ensures that the same persona registry is used across runs.

Pinning in Requirements Files

# requirements.txt
fcc-agent-team-ext==0.1.0

Pinning in pyproject.toml

[project]
dependencies = [
    "fcc-agent-team-ext==0.1.0",
]

Verifying the Registry

After installation, verify the registry contents programmatically:

from fcc._resources import get_personas_dir
from fcc.personas.registry import PersonaRegistry

registry = PersonaRegistry.from_yaml_directory(get_personas_dir())

# Verify expected persona count
assert len(registry) == 24, f"Expected 24 personas, got {len(registry)}"

# Verify specific personas exist
for pid in ["RC", "BC", "DE", "RB", "UG", "RCHM", "BCHM", "UGCH", "RBCH"]:
    assert registry.get(pid) is not None, f"Missing persona: {pid}"

# Verify cross-reference count
from fcc.personas.cross_reference import CrossReferenceMatrix
matrix = CrossReferenceMatrix.from_yaml(get_personas_dir() / "cross_reference.yaml")
assert len(matrix) == 106, f"Expected 106 cross-references, got {len(matrix)}"

Schema Validation for Data Integrity

FCC uses JSON Schema validation to ensure data integrity at every level.

Validating Persona Files

from fcc._resources import get_personas_dir, get_schemas_dir
from fcc.personas.registry import PersonaRegistry

registry = PersonaRegistry.from_yaml_validated(
    get_personas_dir() / "core_personas.yaml",
    get_schemas_dir() / "persona.schema.json",
)

Validating Traces

import json
import jsonschema
from fcc._resources import get_schemas_dir

with open("trace_output.json") as f:
    trace = json.load(f)

with open(get_schemas_dir() / "trace.schema.json") as f:
    schema = json.load(f)

jsonschema.validate(instance=trace, schema=schema)
print("Trace is valid.")

Validating Cross-References

import json

import jsonschema
import yaml

from fcc._resources import get_personas_dir, get_schemas_dir

with open(get_schemas_dir() / "cross_reference.schema.json") as f:
    schema = json.load(f)

with open(get_personas_dir() / "cross_reference.yaml") as f:
    data = yaml.safe_load(f)

jsonschema.validate(instance=data, schema=schema)
print("Cross-reference matrix is valid.")

Comparing Traces Across Runs

For experiments that run the same scenario multiple times (especially in AI-powered mode), you need a systematic way to compare the resulting traces.

Structural Comparison

Check that the same personas were activated in the same order:

import json

def load_trace(path):
    with open(path) as f:
        return json.load(f)

def compare_structure(trace_a, trace_b):
    """Compare activation sequence and phase coverage."""
    seq_a = [(e["persona_id"], e["phase"]) for e in trace_a["entries"]]
    seq_b = [(e["persona_id"], e["phase"]) for e in trace_b["entries"]]
    return seq_a == seq_b

trace_1 = load_trace("trace_001.json")
trace_2 = load_trace("trace_002.json")

if compare_structure(trace_1, trace_2):
    print("Activation sequences match.")
else:
    print("Activation sequences differ -- investigate.")

Content Similarity

For AI-powered traces, where an exact content match is unlikely, compare outputs with a similarity measure rather than strict equality. The helper below uses difflib's sequence matching as a lightweight lexical proxy; an embedding-based semantic metric can be substituted if needed:

def content_similarity(trace_a, trace_b):
    """Compute average content overlap between matching entries."""
    from difflib import SequenceMatcher

    similarities = []
    for ea, eb in zip(trace_a["entries"], trace_b["entries"]):
        if ea["persona_id"] == eb["persona_id"]:
            ratio = SequenceMatcher(None, ea["output"], eb["output"]).ratio()
            similarities.append(ratio)

    if similarities:
        avg = sum(similarities) / len(similarities)
        print(f"Average content similarity: {avg:.1%}")
        return avg
    return 0.0
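
Reusing the traces loaded for the structural comparison above (the file names remain placeholders, and the threshold below is illustrative rather than prescribed):

avg_similarity = content_similarity(trace_1, trace_2)
if avg_similarity < 0.8:  # illustrative threshold, not a prescribed cutoff
    print("Content drift between runs is larger than expected -- investigate.")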

Quality Gate Consistency

Compare whether the same quality gates pass across runs:

def compare_gates(trace_a, trace_b):
    """Compare quality gate results between two traces."""
    gates_a = {r["rule"]: r["passed"] for r in trace_a.get("validation_results", [])}
    gates_b = {r["rule"]: r["passed"] for r in trace_b.get("validation_results", [])}

    all_rules = sorted(set(gates_a) | set(gates_b))
    for rule in all_rules:
        a_pass = gates_a.get(rule, "N/A")
        b_pass = gates_b.get(rule, "N/A")
        match = "MATCH" if a_pass == b_pass else "DIFFER"
        print(f"  {rule}: {a_pass} vs {b_pass} [{match}]")

Reproducibility Checklist

Use this checklist when publishing results based on FCC:

  • FCC package version recorded (e.g., fcc-agent-team-ext==0.1.0)
  • Python version recorded (e.g., Python 3.11.5)
  • LLM provider and model version recorded (if AI-powered mode)
  • Temperature and max_tokens recorded
  • Scenario ID and workflow graph identified
  • Raw traces archived alongside analysis code
  • Schema validation passed on all input data
  • Schema validation passed on all output traces
  • Deterministic baseline run completed for structural comparison
  • Multiple trials conducted with variance reported