Reproducibility Guide

Reproducibility is foundational to any credible use of FCC -- whether in academic research, regulated enterprise deployments, or continuous integration pipelines. This guide covers the mechanisms FCC provides for producing deterministic, verifiable, and repeatable results.

Deterministic Simulation Mode

The simulation engine supports two modes: deterministic and AI-powered. Deterministic mode is the cornerstone of reproducibility.

How Deterministic Mode Works

In deterministic mode, the simulation engine uses mock response generators instead of calling an LLM. Every persona activation produces a predictable output derived from its R.I.S.C.E.A.R. specification and the input payload. The same scenario with the same registry and the same workflow graph will always produce the same trace.

from fcc._resources import get_personas_dir
from fcc.personas.registry import PersonaRegistry
from fcc.simulation.engine import SimulationEngine

registry = PersonaRegistry.from_yaml_directory(get_personas_dir())
engine = SimulationEngine(registry=registry, mode="deterministic")

# Run the same scenario twice
trace_a = engine.run_scenario("GEN-001")
trace_b = engine.run_scenario("GEN-001")

# Traces are identical
assert trace_a == trace_b

When to Use Each Mode

Mode          | Use Case                                                               | Reproducibility                     | Cost
Deterministic | Unit tests, CI/CD, baseline comparisons, controlled experiments       | Exact                               | Free
AI-powered    | Production workflows, response quality evaluation, behavioral studies | Approximate (temperature-dependent) | LLM API costs
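
The engine mode is selected at construction time. A minimal sketch of selecting each mode -- assuming the AI-powered mode string mirrors the "ai_powered" value used in the trace schema, and that LLM settings (model, temperature, prompts) are configured separately as shown later in this guide:

from fcc._resources import get_personas_dir
from fcc.personas.registry import PersonaRegistry
from fcc.simulation.engine import SimulationEngine

registry = PersonaRegistry.from_yaml_directory(get_personas_dir())

# Deterministic mode: mock responses, exact reproducibility, no API cost
det_engine = SimulationEngine(registry=registry, mode="deterministic")

# AI-powered mode: real LLM calls; the mode string is assumed to match the
# trace schema's "ai_powered" enum value
ai_engine = SimulationEngine(registry=registry, mode="ai_powered")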

Achieving Approximate Reproducibility in AI Mode

When using AI-powered mode, exact reproducibility is not guaranteed because LLM responses vary. However, you can minimize variance:

  1. Set temperature to 0. This makes the LLM's sampling nearly deterministic (some providers still introduce minor variance).
  2. Pin the model version. Use explicit model identifiers like gpt-4o-2024-08-06 instead of aliases like gpt-4o.
  3. Fix the random seed if the provider supports it. OpenAI's API accepts a seed parameter.
  4. Run multiple trials. Report mean and standard deviation across runs rather than single-run results.

For example, a prompt template configured for minimal variance:

from fcc.simulation.prompts import PromptTemplate

# Configure for maximum reproducibility
template = PromptTemplate(
    system_prompt="...",
    user_template="...",
    model="gpt-4o-2024-08-06",  # Pinned model version
    temperature=0.0,              # Minimal sampling variance
    max_tokens=1024,
)

Trace Format Documentation

Every simulation run produces a trace -- a complete record of every persona activation, message, and validation result. Traces conform to the JSON Schema at data/schemas/trace.schema.json.

Trace Schema Overview

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["scenario_id", "workflow_graph", "mode", "entries"],
  "properties": {
    "scenario_id": { "type": "string" },
    "workflow_graph": { "type": "string" },
    "mode": { "enum": ["deterministic", "ai_powered"] },
    "started_at": { "type": "string", "format": "date-time" },
    "completed_at": { "type": "string", "format": "date-time" },
    "entries": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["persona_id", "phase", "input", "output"],
        "properties": {
          "persona_id": { "type": "string" },
          "phase": { "enum": ["Find", "Create", "Critique"] },
          "input": { "type": "string" },
          "output": { "type": "string" },
          "timestamp": { "type": "string", "format": "date-time" }
        }
      }
    },
    "validation_results": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["rule", "passed"],
        "properties": {
          "rule": { "type": "string" },
          "passed": { "type": "boolean" },
          "details": { "type": "string" }
        }
      }
    }
  }
}
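
For illustration, here is a minimal trace instance that satisfies the required fields above, validated with the jsonschema library. The scenario ID, persona ID, and content values are placeholders rather than output from a real run, and the check assumes the packaged schema matches the overview shown:

import json

import jsonschema

from fcc._resources import get_schemas_dir

# Hypothetical minimal trace -- field values are illustrative only
trace = {
    "scenario_id": "GEN-001",
    "workflow_graph": "default",
    "mode": "deterministic",
    "entries": [
        {
            "persona_id": "RC",
            "phase": "Find",
            "input": "Initial scenario payload",
            "output": "Mock response derived from the persona specification",
        }
    ],
}

with open(get_schemas_dir() / "trace.schema.json") as f:
    schema = json.load(f)

jsonschema.validate(instance=trace, schema=schema)  # raises ValidationError on failure
print("Minimal example trace is valid.")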

Storing and Versioning Traces

For research reproducibility, traces should be stored alongside the code that produced them:

experiment/
  traces/
    trial_001.json
    trial_002.json
    trial_003.json
  config/
    registry_version.txt    # FCC package version
    scenario.json           # Scenario configuration
    llm_config.json         # Model, temperature, max_tokens
  analysis/
    summary.py              # Analysis script
    results.csv             # Processed results
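
The config/ files in this layout can be populated from the running environment using only the standard library. A sketch, with illustrative LLM settings:

import json
import platform
from importlib.metadata import version
from pathlib import Path

config_dir = Path("experiment/config")
config_dir.mkdir(parents=True, exist_ok=True)

# Record the installed FCC package version (persona registry, graphs, and
# schemas ship with the package) plus the Python version
pkg_version = version("fcc-agent-team-ext")
(config_dir / "registry_version.txt").write_text(
    f"fcc-agent-team-ext=={pkg_version}\npython=={platform.python_version()}\n"
)

# Record the LLM settings used for AI-powered runs (example values)
llm_config = {
    "model": "gpt-4o-2024-08-06",
    "temperature": 0.0,
    "max_tokens": 1024,
}
(config_dir / "llm_config.json").write_text(json.dumps(llm_config, indent=2))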

Version Pinning

Persona definitions, workflow graphs, quality gates, and schemas are all versioned as part of the FCC package. Pinning the package version ensures that the same persona registry is used across runs.

Pinning in Requirements Files

# requirements.txt
fcc-agent-team-ext==0.1.0

Pinning in pyproject.toml

[project]
dependencies = [
    "fcc-agent-team-ext==0.1.0",
]

Verifying the Registry

After installation, verify the registry contents programmatically:

from fcc._resources import get_personas_dir
from fcc.personas.registry import PersonaRegistry

registry = PersonaRegistry.from_yaml_directory(get_personas_dir())

# Verify expected persona count
assert len(registry) == 24, f"Expected 24 personas, got {len(registry)}"

# Verify specific personas exist
for pid in ["RC", "BC", "DE", "RB", "UG", "RCHM", "BCHM", "UGCH", "RBCH"]:
    assert registry.get(pid) is not None, f"Missing persona: {pid}"

# Verify cross-reference count
from fcc.personas.cross_reference import CrossReferenceMatrix
matrix = CrossReferenceMatrix.from_yaml(get_personas_dir() / "cross_reference.yaml")
assert len(matrix) == 106, f"Expected 106 cross-references, got {len(matrix)}"

Schema Validation for Data Integrity

FCC uses JSON Schema validation to ensure data integrity at every level.

Validating Persona Files

from fcc._resources import get_personas_dir, get_schemas_dir
from fcc.personas.registry import PersonaRegistry

registry = PersonaRegistry.from_yaml_validated(
    get_personas_dir() / "core_personas.yaml",
    get_schemas_dir() / "persona.schema.json",
)

Validating Traces

import json
import jsonschema
from fcc._resources import get_schemas_dir

with open("trace_output.json") as f:
    trace = json.load(f)

with open(get_schemas_dir() / "trace.schema.json") as f:
    schema = json.load(f)

jsonschema.validate(instance=trace, schema=schema)
print("Trace is valid.")

Validating Cross-References

import json

import jsonschema
import yaml

from fcc._resources import get_personas_dir, get_schemas_dir

with open(get_schemas_dir() / "cross_reference.schema.json") as f:
    schema = json.load(f)

with open(get_personas_dir() / "cross_reference.yaml") as f:
    data = yaml.safe_load(f)

jsonschema.validate(instance=data, schema=schema)
print("Cross-reference matrix is valid.")

Comparing Traces Across Runs

For experiments that run the same scenario multiple times (especially in AI-powered mode), you need a systematic way to compare the resulting traces.

Structural Comparison

Check that the same personas were activated in the same order:

import json

def load_trace(path):
    with open(path) as f:
        return json.load(f)

def compare_structure(trace_a, trace_b):
    """Compare activation sequence and phase coverage."""
    seq_a = [(e["persona_id"], e["phase"]) for e in trace_a["entries"]]
    seq_b = [(e["persona_id"], e["phase"]) for e in trace_b["entries"]]
    return seq_a == seq_b

trace_1 = load_trace("trace_001.json")
trace_2 = load_trace("trace_002.json")

if compare_structure(trace_1, trace_2):
    print("Activation sequences match.")
else:
    print("Activation sequences differ -- investigate.")

Content Similarity

For AI-powered traces, where an exact content match is unlikely, compare outputs with a similarity measure rather than strict equality. The helper below uses difflib's sequence matching as a lightweight lexical proxy; an embedding-based semantic metric can be substituted if needed:

def content_similarity(trace_a, trace_b):
    """Compute average content overlap between matching entries."""
    from difflib import SequenceMatcher

    similarities = []
    for ea, eb in zip(trace_a["entries"], trace_b["entries"]):
        if ea["persona_id"] == eb["persona_id"]:
            ratio = SequenceMatcher(None, ea["output"], eb["output"]).ratio()
            similarities.append(ratio)

    if similarities:
        avg = sum(similarities) / len(similarities)
        print(f"Average content similarity: {avg:.1%}")
        return avg
    return 0.0
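
Reusing the traces loaded for the structural comparison above (the file names remain placeholders, and the threshold below is illustrative rather than prescribed):

avg_similarity = content_similarity(trace_1, trace_2)
if avg_similarity < 0.8:  # illustrative threshold, not a prescribed cutoff
    print("Content drift between runs is larger than expected -- investigate.")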

Quality Gate Consistency

Compare whether the same quality gates pass across runs:

def compare_gates(trace_a, trace_b):
    """Compare quality gate results between two traces."""
    gates_a = {r["rule"]: r["passed"] for r in trace_a.get("validation_results", [])}
    gates_b = {r["rule"]: r["passed"] for r in trace_b.get("validation_results", [])}

    all_rules = sorted(set(gates_a) | set(gates_b))
    for rule in all_rules:
        a_pass = gates_a.get(rule, "N/A")
        b_pass = gates_b.get(rule, "N/A")
        match = "MATCH" if a_pass == b_pass else "DIFFER"
        print(f"  {rule}: {a_pass} vs {b_pass} [{match}]")

Reproducibility Checklist

Use this checklist when publishing results based on FCC:

  • FCC package version recorded (e.g., fcc-agent-team-ext==0.1.0)
  • Python version recorded (e.g., Python 3.11.5)
  • LLM provider and model version recorded (if AI-powered mode)
  • Temperature and max_tokens recorded
  • Scenario ID and workflow graph identified
  • Raw traces archived alongside analysis code
  • Schema validation passed on all input data
  • Schema validation passed on all output traces
  • Deterministic baseline run completed for structural comparison
  • Multiple trials conducted with variance reported