Statistical Analysis of FCC Experiments¶
This guide shows how to analyze FCC trace data and CLEAR+ benchmark outputs using standard statistical tools. It assumes you have followed experimental-design.md and have one or more JSONL event logs plus benchmark reports on disk.
The analysis pipeline¶
Figure 1 shows the canonical fan-in/fan-out analysis pipeline: JSONL event logs and CLEAR+ benchmark reports converge on a single pandas DataFrame, which then fans out to hypothesis tests, bootstrap confidence intervals, and inter-rater agreement before being aggregated into the final report.
flowchart LR
E["Event logs<br/>(JSONL)"] --> D["DataFrame<br/>(pandas)"]
B["Benchmark<br/>reports"] --> D
D --> T["Hypothesis<br/>tests"]
D --> C["Bootstrap<br/>CIs"]
D --> K["Inter-rater<br/>agreement"]
T --> R["Report"]
C --> R
K --> R
style E fill:#e3f2fd
style B fill:#e3f2fd
style R fill:#e8f5e9
Keeping ingestion and analysis on opposite sides of the DataFrame makes it easy to swap providers or rerun a single test without re-parsing the raw logs. Each branch in the fan-out corresponds to one of the three claim types a published FCC result typically makes: point estimates, uncertainty bounds, and rater reliability.
Loading traces into pandas¶
Every FCC run produces an event log. Flatten to a DataFrame:
import json
import pandas as pd
from pathlib import Path
def load_runs(run_dir: str) -> pd.DataFrame:
    records = []
    for path in Path(run_dir).glob("*.jsonl"):
        for line in path.read_text().splitlines():
            ev = json.loads(line)
            records.append({
                "run_id": path.stem,
                "event_type": ev["event_type"],
                "persona_id": ev["payload"].get("persona_id"),
                "phase": ev["payload"].get("phase"),
                "score": ev["payload"].get("score"),
                "latency_ms": ev["payload"].get("latency_ms"),
                "timestamp": ev["timestamp"],
            })
    return pd.DataFrame(records)
df = load_runs("runs/")
For CLEAR+ benchmark reports, use the BenchmarkSuite API:
from fcc.evaluation.benchmark import BenchmarkSuite
suite = BenchmarkSuite.load("runs/benchmark_report.json")
report_df = suite.to_dataframe()
# columns: scenario, persona, provider, cost, latency, efficacy,
# assurance, reliability, coverage, explainability
Comparing two personas (paired)¶
When the same scenarios are run with two different personas, the runs are paired (scenario is the blocking variable).
from scipy import stats
baseline = report_df[report_df.persona == "evidence_gatherer"]
treatment = report_df[report_df.persona == "empirical_auditor"]
paired = baseline.merge(treatment, on="scenario", suffixes=("_b", "_t"))
# Paired t-test
t_stat, p_value = stats.ttest_rel(paired.coverage_t, paired.coverage_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# Wilcoxon signed-rank if coverage is skewed
w_stat, w_p = stats.wilcoxon(paired.coverage_t, paired.coverage_b)
print(f"Wilcoxon W = {w_stat}, p = {w_p:.4f}")
Report effect size:
diff = paired.coverage_t - paired.coverage_b
cohens_d = diff.mean() / diff.std(ddof=1)
print(f"Cohen's d = {cohens_d:.2f}")
Comparing N AI providers (ANOVA-style)¶
When you run the same persona-scenario pair across multiple providers
(mock, anthropic, openai, ollama, litellm):
providers = ["mock", "anthropic", "openai", "ollama", "litellm"]
groups = [report_df[report_df.provider == p].efficacy for p in providers]
f_stat, p_value = stats.f_oneway(*groups)
print(f"One-way ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")
# If the overall test is significant, run Tukey HSD for pairwise comparisons
from statsmodels.stats.multicomp import pairwise_tukeyhsd
flat = report_df[report_df.provider.isin(providers)]
tukey = pairwise_tukeyhsd(flat.efficacy, flat.provider, alpha=0.05)
print(tukey)
Kruskal-Wallis is the nonparametric alternative when the residuals aren't normal.
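Reusing the groups list from the ANOVA snippet above, the swap is one line:
h_stat, kw_p = stats.kruskal(*groups)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {kw_p:.4f}")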
A/B on workflow variants¶
When varying the workflow graph (5-node vs 20-node vs 55-node), treat each variant as a group and apply the same ANOVA recipe. If the outcome is binary (pass/fail on quality gate), switch to a chi-square test:
contingency = pd.crosstab(report_df.workflow_graph_id, report_df.passed)
chi2, p, dof, exp = stats.chi2_contingency(contingency)
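In line with the effect-size advice later in this guide, one common companion to the chi-square statistic is Cramér's V. A minimal sketch from the same contingency table (the column names workflow_graph_id and passed are assumed from the snippet above):
import numpy as np

# Cramér's V: effect size for an r x c contingency table
n_obs = contingency.to_numpy().sum()
min_dim = min(contingency.shape) - 1
cramers_v = np.sqrt(chi2 / (n_obs * min_dim))
print(f"Cramér's V = {cramers_v:.2f}")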
Bootstrap confidence intervals on CLEAR+ scores¶
CLEAR+ scores are bounded [0, 100] and often skewed. Bootstrap CIs are robust:
import numpy as np
def bootstrap_ci(x, n_boot=10_000, ci=0.95):
    rng = np.random.default_rng(seed=42)
    means = [rng.choice(x, size=len(x), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.quantile(means, [(1 - ci) / 2, 1 - (1 - ci) / 2])
    return np.mean(x), lo, hi
mean, lo, hi = bootstrap_ci(report_df.coverage.values)
print(f"Coverage: {mean:.2f} [95% CI: {lo:.2f}, {hi:.2f}]")
For paired bootstrap on a treatment-vs-baseline delta:
delta = (paired.coverage_t - paired.coverage_b).values
mean, lo, hi = bootstrap_ci(delta)
print(f"Treatment delta: {mean:+.2f} [95% CI: {lo:+.2f}, {hi:+.2f}]")
Inter-persona agreement (Cohen's kappa)¶
When two critique personas rate the same deliverables, use Cohen's kappa to quantify agreement beyond chance:
from sklearn.metrics import cohen_kappa_score
# df_ratings is assumed to hold one row per deliverable, with each
# critique persona's binary pass/fail verdict in its own column
kappa = cohen_kappa_score(
    df_ratings.critique_lead_verdict,
    df_ratings.compliance_guardian_verdict,
)
print(f"Cohen's kappa = {kappa:.3f}")
Interpretation thresholds (Landis & Koch, 1977):
| Kappa | Agreement |
|---|---|
| < 0.00 | Poor |
| 0.00-0.20 | Slight |
| 0.21-0.40 | Fair |
| 0.41-0.60 | Moderate |
| 0.61-0.80 | Substantial |
| 0.81-1.00 | Almost perfect |
For more than two raters, use Fleiss' kappa (statsmodels.stats.inter_rater).
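A minimal sketch, assuming a ratings matrix with one row per deliverable and one column per rater (the verdict data here is hypothetical):
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical binary verdicts: rows = deliverables, columns = raters
ratings = np.array([
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
])
table, _ = aggregate_raters(ratings)  # counts per category per deliverable
print(f"Fleiss' kappa = {fleiss_kappa(table):.3f}")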
Multiple comparisons¶
When you audit all 147 personas (see ADR-0003 for context), you run the risk of false discoveries. Apply a correction:
from statsmodels.stats.multitest import multipletests
p_values = [...] # one per persona
reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method="holm")
n_significant = reject.sum()
print(f"{n_significant} of {len(p_values)} survive Holm-Bonferroni")
For exploratory work, Benjamini-Hochberg (FDR control) is less conservative:
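# Benjamini-Hochberg: controls the false discovery rate instead of the
# family-wise error rate; same multipletests call, different method
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(f"{reject_bh.sum()} of {len(p_values)} survive Benjamini-Hochberg")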
Interpreting a benchmark report¶
Example snippet from a BenchmarkReport.to_json():
{
  "suite": "baseline_v1_3_3",
  "results": [
    {
      "scenario": "compliance_soc2",
      "persona": "compliance_guardian",
      "provider": "anthropic",
      "metrics": {
        "cost": 0.042,
        "latency": 1820,
        "efficacy": 0.87,
        "assurance": 0.91,
        "reliability": 0.94,
        "coverage": 0.78,
        "explainability": 0.82
      }
    }
  ]
}
How to read it:
- Cost: USD per run. Lower is better; use it to compare providers at matched settings.
- Latency: ms per run. Driven by provider, max_tokens, and workflow size.
- Efficacy: 0-1, whether the deliverable achieved the stated goal. Computed by quality gates.
- Assurance: 0-1, confidence in traceability and auditability.
- Reliability: 0-1, stability across repeated runs, computed as 1 minus the coefficient of variation (see the sketch after this list).
- Coverage: 0-1, fraction of required topics addressed.
- Explainability: 0-1, presence of reasoning traces, citations, and rationale.
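To make the reliability definition concrete, here is a minimal sketch. It assumes repeated runs of the same (scenario, persona, provider) cell appear as separate rows in report_df; the filter values are illustrative:
# Illustrative only: reliability = 1 - coefficient of variation,
# computed over repeated-run efficacy scores for a single cell
cell = report_df[(report_df.scenario == "compliance_soc2")
                 & (report_df.persona == "compliance_guardian")
                 & (report_df.provider == "anthropic")]
reliability = 1 - cell.efficacy.std(ddof=1) / cell.efficacy.mean()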
When comparing arms, always report mean, 95% CI, and effect size -- not just the p-value.
Reporting template¶
A publishable results paragraph should answer all of these:
- What was compared? (treatment vs baseline)
- What N? (runs per cell x cells)
- What test? (paired t, Wilcoxon, ANOVA, chi2)
- What effect? (mean diff + 95% CI + effect size)
- What correction? (Holm, BH, none)
- What software? ("FCC 1.3.3, scipy 1.11, pandas 2.1")
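As a sketch, the quantities computed earlier in this guide can fill the template directly (variable names follow the paired-comparison and bootstrap snippets above):
# Assembles a results sentence from earlier snippets: paired (merge),
# t_stat/p_value (paired t-test), mean/lo/hi (paired bootstrap delta),
# cohens_d (effect size)
print(
    f"empirical_auditor vs evidence_gatherer, paired by scenario "
    f"(N = {len(paired)}): t = {t_stat:.2f}, p = {p_value:.4f}; "
    f"coverage delta {mean:+.2f} [95% CI {lo:+.2f}, {hi:+.2f}]; "
    f"Cohen's d = {cohens_d:.2f}; no multiplicity correction. "
    f"FCC 1.3.3, scipy 1.11, pandas 2.1."
)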
Related resources¶
- Experimental design -- upstream planning
- Reproducibility -- seed and environment pinning
- Reproducible benchmarks -- CLEAR+ CI setup
- Notebook 19_evaluation_and_compliance.ipynb -- hands-on CLEAR+
- src/fcc/evaluation/benchmark.py -- BenchmarkSuite, BenchmarkComparison
- src/fcc/evaluation/metrics.py -- CLEARPlusMetrics implementation