Reproducibility Guide¶
Reproducibility is foundational to any credible use of FCC -- whether in academic research, regulated enterprise deployments, or continuous integration pipelines. This guide covers the mechanisms FCC provides for producing deterministic, verifiable, and repeatable results.
Deterministic Simulation Mode¶
The simulation engine supports two modes: deterministic and AI-powered. Deterministic mode is the cornerstone of reproducibility.
How Deterministic Mode Works¶
In deterministic mode, the simulation engine uses mock response generators instead of calling an LLM. Every persona activation produces a predictable output derived from its R.I.S.C.E.A.R. specification and the input payload. The same scenario with the same registry and the same workflow graph will always produce the same trace.
from fcc._resources import get_personas_dir
from fcc.personas.registry import PersonaRegistry
from fcc.simulation.engine import SimulationEngine
registry = PersonaRegistry.from_yaml_directory(get_personas_dir())
engine = SimulationEngine(registry=registry, mode="deterministic")
# Run the same scenario twice
trace_a = engine.run_scenario("GEN-001")
trace_b = engine.run_scenario("GEN-001")
# Traces are identical
assert trace_a == trace_b
When to Use Each Mode¶
| Mode | Use Case | Reproducibility | Cost |
|---|---|---|---|
| Deterministic | Unit tests, CI/CD, baseline comparisons, controlled experiments | Exact | Free |
| AI-powered | Production workflows, response quality evaluation, behavioral studies | Approximate (temperature-dependent) | LLM API costs |
Achieving Approximate Reproducibility in AI Mode¶
When using AI-powered mode, exact reproducibility is not guaranteed because LLM responses vary. However, you can minimize variance:
- Set temperature to 0. This makes the LLM's sampling nearly deterministic (some providers still introduce minor variance).
- Pin the model version. Use explicit model identifiers like `gpt-4o-2024-08-06` instead of aliases like `gpt-4o`.
- Fix the random seed if the provider supports it. OpenAI's API accepts a `seed` parameter.
- Run multiple trials. Report mean and standard deviation across runs rather than single-run results.
from fcc.simulation.prompts import PromptTemplate
# Configure for maximum reproducibility
template = PromptTemplate(
    system_prompt="...",
    user_template="...",
    model="gpt-4o-2024-08-06",  # Pinned model version
    temperature=0.0,            # Minimal sampling variance
    max_tokens=1024,
)
Trace Format Documentation¶
Every simulation run produces a trace -- a complete record of every persona activation, message, and validation result. Traces conform to the JSON Schema at data/schemas/trace.schema.json.
Trace Schema Overview¶
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["scenario_id", "workflow_graph", "mode", "entries"],
  "properties": {
    "scenario_id": { "type": "string" },
    "workflow_graph": { "type": "string" },
    "mode": { "enum": ["deterministic", "ai_powered"] },
    "started_at": { "type": "string", "format": "date-time" },
    "completed_at": { "type": "string", "format": "date-time" },
    "entries": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["persona_id", "phase", "input", "output"],
        "properties": {
          "persona_id": { "type": "string" },
          "phase": { "enum": ["Find", "Create", "Critique"] },
          "input": { "type": "string" },
          "output": { "type": "string" },
          "timestamp": { "type": "string", "format": "date-time" }
        }
      }
    },
    "validation_results": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["rule", "passed"],
        "properties": {
          "rule": { "type": "string" },
          "passed": { "type": "boolean" },
          "details": { "type": "string" }
        }
      }
    }
  }
}
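To make the schema concrete, here is a minimal sketch of a trace that satisfies the required fields. The field values (graph name, input and output strings) are illustrative placeholders, not output from a real run:

```python
import json

# Illustrative minimal trace; the specific values are hypothetical.
minimal_trace = {
    "scenario_id": "GEN-001",
    "workflow_graph": "example_graph",
    "mode": "deterministic",
    "entries": [
        {
            "persona_id": "RC",
            "phase": "Find",
            "input": "Initial scenario payload",
            "output": "Findings produced by the RC persona",
        }
    ],
}

# Spot-check the top-level and per-entry required fields from the schema
assert {"scenario_id", "workflow_graph", "mode", "entries"} <= minimal_trace.keys()
for entry in minimal_trace["entries"]:
    assert {"persona_id", "phase", "input", "output"} <= entry.keys()

print(json.dumps(minimal_trace, indent=2))
```

The optional `started_at`, `completed_at`, and `timestamp` fields are omitted here; a full validation should go through `jsonschema` as shown later in this guide.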
Storing and Versioning Traces¶
For research reproducibility, traces should be stored alongside the code that produced them:
experiment/
    traces/
        trial_001.json
        trial_002.json
        trial_003.json
    config/
        registry_version.txt   # FCC package version
        scenario.json          # Scenario configuration
        llm_config.json        # Model, temperature, max_tokens
    analysis/
        summary.py             # Analysis script
        results.csv            # Processed results
Version Pinning¶
Persona definitions, workflow graphs, quality gates, and schemas are all versioned as part of the FCC package. Locking the package version ensures that the same persona registry is used across runs.
Pinning in Requirements Files¶
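Assuming the package is published as `fcc-agent-team-ext` (the name used in the checklist at the end of this guide), an exact pin in `requirements.txt` looks like:

```text
# requirements.txt -- pin the exact FCC release used for the experiment
fcc-agent-team-ext==0.1.0
```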
Pinning in pyproject.toml¶
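The equivalent pin in `pyproject.toml` (again assuming the `fcc-agent-team-ext` package name) is a sketch like the following; use `~=0.1.0` instead if you want compatible-release rather than exact pinning:

```toml
# pyproject.toml -- exact pin for reproducible experiments
[project]
dependencies = [
    "fcc-agent-team-ext==0.1.0",
]
```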
Verifying the Registry¶
After installation, verify the registry contents programmatically:
from fcc._resources import get_personas_dir
from fcc.personas.registry import PersonaRegistry
registry = PersonaRegistry.from_yaml_directory(get_personas_dir())
# Verify expected persona count
assert len(registry) == 24, f"Expected 24 personas, got {len(registry)}"
# Verify specific personas exist
for pid in ["RC", "BC", "DE", "RB", "UG", "RCHM", "BCHM", "UGCH", "RBCH"]:
    assert registry.get(pid) is not None, f"Missing persona: {pid}"
# Verify cross-reference count
from fcc.personas.cross_reference import CrossReferenceMatrix
matrix = CrossReferenceMatrix.from_yaml(get_personas_dir() / "cross_reference.yaml")
assert len(matrix) == 106, f"Expected 106 cross-references, got {len(matrix)}"
Schema Validation for Data Integrity¶
FCC uses JSON Schema validation to ensure data integrity at every level.
Validating Persona Files¶
from fcc._resources import get_personas_dir, get_schemas_dir
from fcc.personas.registry import PersonaRegistry
registry = PersonaRegistry.from_yaml_validated(
    get_personas_dir() / "core_personas.yaml",
    get_schemas_dir() / "persona.schema.json",
)
Validating Traces¶
import json
import jsonschema
from fcc._resources import get_schemas_dir
with open("trace_output.json") as f:
    trace = json.load(f)
with open(get_schemas_dir() / "trace.schema.json") as f:
    schema = json.load(f)
jsonschema.validate(instance=trace, schema=schema)
print("Trace is valid.")
Validating Cross-References¶
import json
import yaml
import jsonschema
from fcc._resources import get_personas_dir, get_schemas_dir

with open(get_schemas_dir() / "cross_reference.schema.json") as f:
    schema = json.load(f)
with open(get_personas_dir() / "cross_reference.yaml") as f:
    data = yaml.safe_load(f)
jsonschema.validate(instance=data, schema=schema)
print("Cross-reference matrix is valid.")
Comparing Traces Across Runs¶
For experiments that run the same scenario multiple times (especially in AI-powered mode), you need systematic comparison.
Structural Comparison¶
Check that the same personas were activated in the same order:
import json
def load_trace(path):
    with open(path) as f:
        return json.load(f)

def compare_structure(trace_a, trace_b):
    """Compare activation sequence and phase coverage."""
    seq_a = [(e["persona_id"], e["phase"]) for e in trace_a["entries"]]
    seq_b = [(e["persona_id"], e["phase"]) for e in trace_b["entries"]]
    return seq_a == seq_b

trace_1 = load_trace("trace_001.json")
trace_2 = load_trace("trace_002.json")
if compare_structure(trace_1, trace_2):
    print("Activation sequences match.")
else:
    print("Activation sequences differ -- investigate.")
Content Similarity¶
For AI-powered traces where exact content match is unlikely, use semantic similarity:
from difflib import SequenceMatcher

def content_similarity(trace_a, trace_b):
    """Compute average content overlap between matching entries."""
    similarities = []
    for ea, eb in zip(trace_a["entries"], trace_b["entries"]):
        if ea["persona_id"] == eb["persona_id"]:
            ratio = SequenceMatcher(None, ea["output"], eb["output"]).ratio()
            similarities.append(ratio)
    if similarities:
        avg = sum(similarities) / len(similarities)
        print(f"Average content similarity: {avg:.1%}")
        return avg
    return 0.0
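When running more than two trials, the per-pair ratios can be aggregated into the mean and standard deviation that the reproducibility checklist asks for. A minimal sketch, using toy strings that stand in for real trace outputs:

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean, stdev

def pairwise_similarities(outputs):
    """SequenceMatcher ratio for every pair of trial outputs."""
    return [
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(outputs, 2)
    ]

# Toy stand-ins for the final output of three trial traces
outputs = ["draft v1 final", "draft v2 final", "draft v1 final"]
scores = pairwise_similarities(outputs)
print(f"mean={mean(scores):.3f} sd={stdev(scores):.3f}")
```

In a real experiment, `outputs` would be drawn from the trace entries (e.g., the `output` field of matching `persona_id` entries across trials) rather than literal strings.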
Quality Gate Consistency¶
Compare whether the same quality gates pass across runs:
def compare_gates(trace_a, trace_b):
    """Compare quality gate results between two traces."""
    gates_a = {r["rule"]: r["passed"] for r in trace_a.get("validation_results", [])}
    gates_b = {r["rule"]: r["passed"] for r in trace_b.get("validation_results", [])}
    all_rules = sorted(set(gates_a) | set(gates_b))
    for rule in all_rules:
        a_pass = gates_a.get(rule, "N/A")
        b_pass = gates_b.get(rule, "N/A")
        match = "MATCH" if a_pass == b_pass else "DIFFER"
        print(f"  {rule}: {a_pass} vs {b_pass} [{match}]")
Reproducibility Checklist¶
Use this checklist when publishing results based on FCC:
- FCC package version recorded (e.g., `fcc-agent-team-ext==0.1.0`)
- Python version recorded (e.g., Python 3.11.5)
- LLM provider and model version recorded (if AI-powered mode)
- Temperature and max_tokens recorded
- Scenario ID and workflow graph identified
- Raw traces archived alongside analysis code
- Schema validation passed on all input data
- Schema validation passed on all output traces
- Deterministic baseline run completed for structural comparison
- Multiple trials conducted with variance reported
Related Resources¶
- Research Methodology -- Using FCC as a research instrument
- Quality Gates -- Validation specifications
- Workflow Graphs -- Graph definitions