Statistical Analysis of FCC Experiments¶
This guide shows how to analyze FCC trace data and CLEAR+ benchmark outputs using standard statistical tools. It assumes you have followed experimental-design.md and have one or more JSONL event logs plus benchmark reports on disk.
The analysis pipeline¶
Figure 1 shows the canonical fan-in/fan-out analysis pipeline: JSONL event logs and CLEAR+ benchmark reports converge on a single pandas DataFrame, which then fans out to hypothesis tests, bootstrap confidence intervals, and inter-rater agreement before being aggregated into the final report.
flowchart LR
E["Event logs<br/>(JSONL)"] --> D["DataFrame<br/>(pandas)"]
B["Benchmark<br/>reports"] --> D
D --> T["Hypothesis<br/>tests"]
D --> C["Bootstrap<br/>CIs"]
D --> K["Inter-rater<br/>agreement"]
T --> R["Report"]
C --> R
K --> R
style E fill:#e3f2fd
style B fill:#e3f2fd
style R fill:#e8f5e9
Keeping ingestion and analysis on opposite sides of the DataFrame makes it easy to swap providers or rerun a single test without re-parsing the raw logs. Each branch in the fan-out corresponds to one of the three claim types a published FCC result typically makes: point estimates, uncertainty bounds, and rater reliability.
Loading traces into pandas¶
Every FCC run produces an event log. Flatten to a DataFrame:
import json
import pandas as pd
from pathlib import Path
def load_runs(run_dir: str) -> pd.DataFrame:
    records = []
    for path in Path(run_dir).glob("*.jsonl"):
        for line in path.read_text().splitlines():
            ev = json.loads(line)
            records.append({
                "run_id": path.stem,
                "event_type": ev["event_type"],
                "persona_id": ev["payload"].get("persona_id"),
                "phase": ev["payload"].get("phase"),
                "score": ev["payload"].get("score"),
                "latency_ms": ev["payload"].get("latency_ms"),
                "timestamp": ev["timestamp"],
            })
    return pd.DataFrame(records)
df = load_runs("runs/")
For CLEAR+ benchmark reports, use the BenchmarkSuite API:
from fcc.evaluation.benchmark import BenchmarkSuite
suite = BenchmarkSuite.load("runs/benchmark_report.json")
report_df = suite.to_dataframe()
# columns: scenario, persona, provider, cost, latency, efficacy,
# assurance, reliability, coverage, explainability
Comparing two personas (paired)¶
When the same scenarios are run with two different personas, the runs are paired (scenario is the blocking variable).
from scipy import stats
baseline = report_df[report_df.persona == "evidence_gatherer"]
treatment = report_df[report_df.persona == "empirical_auditor"]
paired = baseline.merge(treatment, on="scenario", suffixes=("_b", "_t"))
# Paired t-test
t_stat, p_value = stats.ttest_rel(paired.coverage_t, paired.coverage_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# Wilcoxon signed-rank if coverage is skewed
w_stat, w_p = stats.wilcoxon(paired.coverage_t, paired.coverage_b)
print(f"Wilcoxon W = {w_stat}, p = {w_p:.4f}")
Report effect size:
diff = paired.coverage_t - paired.coverage_b
cohens_d = diff.mean() / diff.std(ddof=1)
print(f"Cohen's d = {cohens_d:.2f}")
Comparing N AI providers (ANOVA-style)¶
When you run the same persona-scenario pair across multiple providers
(mock, anthropic, openai, ollama, litellm):
providers = ["mock", "anthropic", "openai", "ollama", "litellm"]
groups = [report_df[report_df.provider == p].efficacy for p in providers]
f_stat, p_value = stats.f_oneway(*groups)
print(f"One-way ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")
# If the overall test is significant, run Tukey HSD for pairwise comparisons
from statsmodels.stats.multicomp import pairwise_tukeyhsd
flat = report_df[report_df.provider.isin(providers)]
tukey = pairwise_tukeyhsd(flat.efficacy, flat.provider, alpha=0.05)
print(tukey)
Kruskal-Wallis is the nonparametric alternative when the residuals aren't normal.
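Reusing the groups list from the ANOVA snippet above, the swap is one line:
h_stat, kw_p = stats.kruskal(*groups)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {kw_p:.4f}")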
A/B on workflow variants¶
When varying the workflow graph (5-node vs 20-node vs 55-node), treat each variant as a group and apply the same ANOVA recipe. If the outcome is binary (pass/fail on quality gate), switch to a chi-square test:
contingency = pd.crosstab(report_df.workflow_graph_id, report_df.passed)
chi2, p, dof, exp = stats.chi2_contingency(contingency)
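In line with the effect-size advice later in this guide, one common companion to the chi-square statistic is Cramér's V. A minimal sketch from the same contingency table (the column names workflow_graph_id and passed are assumed from the snippet above):
import numpy as np

# Cramér's V: effect size for an r x c contingency table
n_obs = contingency.to_numpy().sum()
min_dim = min(contingency.shape) - 1
cramers_v = np.sqrt(chi2 / (n_obs * min_dim))
print(f"Cramér's V = {cramers_v:.2f}")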
Bootstrap confidence intervals on CLEAR+ scores¶
CLEAR+ scores are bounded [0, 100] and often skewed. Bootstrap CIs are robust:
import numpy as np
def bootstrap_ci(x, n_boot=10_000, ci=0.95):
    rng = np.random.default_rng(seed=42)
    means = [rng.choice(x, size=len(x), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.quantile(means, [(1 - ci) / 2, 1 - (1 - ci) / 2])
    return np.mean(x), lo, hi
mean, lo, hi = bootstrap_ci(report_df.coverage.values)
print(f"Coverage: {mean:.2f} [95% CI: {lo:.2f}, {hi:.2f}]")
For paired bootstrap on a treatment-vs-baseline delta:
delta = (paired.coverage_t - paired.coverage_b).values
mean, lo, hi = bootstrap_ci(delta)
print(f"Treatment delta: {mean:+.2f} [95% CI: {lo:+.2f}, {hi:+.2f}]")
Inter-persona agreement (Cohen's kappa)¶
When two critique personas rate the same deliverables, use Cohen's kappa to quantify agreement beyond chance:
from sklearn.metrics import cohen_kappa_score
# df_ratings is assumed to hold one row per deliverable, with each
# critique persona's binary pass/fail verdict in its own column
kappa = cohen_kappa_score(
    df_ratings.critique_lead_verdict,
    df_ratings.compliance_guardian_verdict,
)
print(f"Cohen's kappa = {kappa:.3f}")
Interpretation thresholds (Landis & Koch, 1977):
| Kappa | Agreement |
|---|---|
| < 0.00 | Poor |
| 0.00-0.20 | Slight |
| 0.21-0.40 | Fair |
| 0.41-0.60 | Moderate |
| 0.61-0.80 | Substantial |
| 0.81-1.00 | Almost perfect |
For more than two raters, use Fleiss' kappa (statsmodels.stats.inter_rater).
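A minimal sketch, assuming a ratings matrix with one row per deliverable and one column per rater (the verdict data here is hypothetical):
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical binary verdicts: rows = deliverables, columns = raters
ratings = np.array([
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
])
table, _ = aggregate_raters(ratings)  # counts per category per deliverable
print(f"Fleiss' kappa = {fleiss_kappa(table):.3f}")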
Multiple comparisons¶
When you audit all 147 personas (see ADR-0003 for context), you run the risk of false discoveries. Apply a correction:
from statsmodels.stats.multitest import multipletests
p_values = [...] # one per persona
reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method="holm")
n_significant = reject.sum()
print(f"{n_significant} of {len(p_values)} survive Holm-Bonferroni")
For exploratory work, Benjamini-Hochberg (FDR control) is less conservative:
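# Benjamini-Hochberg: controls the false discovery rate instead of the
# family-wise error rate; same multipletests call, different method
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(f"{reject_bh.sum()} of {len(p_values)} survive Benjamini-Hochberg")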
Interpreting a benchmark report¶
Example snippet from a BenchmarkReport.to_json():
{
  "suite": "baseline_v1_3_3",
  "results": [
    {
      "scenario": "compliance_soc2",
      "persona": "compliance_guardian",
      "provider": "anthropic",
      "metrics": {
        "cost": 0.042,
        "latency": 1820,
        "efficacy": 0.87,
        "assurance": 0.91,
        "reliability": 0.94,
        "coverage": 0.78,
        "explainability": 0.82
      }
    }
  ]
}
How to read it:
- Cost: USD per run. Lower is better; use it to compare providers at matched settings.
- Latency: ms per run. Driven by provider, max_tokens, and workflow size.
- Efficacy: 0-1, whether the deliverable achieved the stated goal. Computed by quality gates.
- Assurance: 0-1, confidence in traceability and auditability.
- Reliability: 0-1, stability across repeated runs, computed as 1 minus the coefficient of variation (see the sketch after this list).
- Coverage: 0-1, fraction of required topics addressed.
- Explainability: 0-1, presence of reasoning traces, citations, and rationale.
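To make the reliability definition concrete, here is a minimal sketch. It assumes repeated runs of the same (scenario, persona, provider) cell appear as separate rows in report_df; the filter values are illustrative:
# Illustrative only: reliability = 1 - coefficient of variation,
# computed over repeated-run efficacy scores for a single cell
cell = report_df[(report_df.scenario == "compliance_soc2")
                 & (report_df.persona == "compliance_guardian")
                 & (report_df.provider == "anthropic")]
reliability = 1 - cell.efficacy.std(ddof=1) / cell.efficacy.mean()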
When comparing arms, always report mean, 95% CI, and effect size -- not just the p-value.
Reporting template¶
A publishable results paragraph should answer all of these:
- What was compared? (treatment vs baseline)
- What N? (runs per cell x cells)
- What test? (paired t, Wilcoxon, ANOVA, chi2)
- What effect? (mean diff + 95% CI + effect size)
- What correction? (Holm, BH, none)
- What software? ("FCC 1.3.3, scipy 1.11, pandas 2.1")
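As a sketch, the quantities computed earlier in this guide can fill the template directly (variable names follow the paired-comparison and bootstrap snippets above):
# Assembles a results sentence from earlier snippets: paired (merge),
# t_stat/p_value (paired t-test), mean/lo/hi (paired bootstrap delta),
# cohens_d (effect size)
print(
    f"empirical_auditor vs evidence_gatherer, paired by scenario "
    f"(N = {len(paired)}): t = {t_stat:.2f}, p = {p_value:.4f}; "
    f"coverage delta {mean:+.2f} [95% CI {lo:+.2f}, {hi:+.2f}]; "
    f"Cohen's d = {cohens_d:.2f}; no multiplicity correction. "
    f"FCC 1.3.3, scipy 1.11, pandas 2.1."
)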
Related resources¶
- Experimental design -- upstream planning
- Reproducibility -- seed and environment pinning
- Reproducible benchmarks -- CLEAR+ CI setup
- Notebook 19_evaluation_and_compliance.ipynb -- hands-on CLEAR+
- src/fcc/evaluation/benchmark.py -- BenchmarkSuite, BenchmarkComparison
- src/fcc/evaluation/metrics.py -- CLEARPlusMetrics implementation