# Evaluation Overview Demo

## Overview
This demo provides a comprehensive overview of the FCC evaluation ecosystem, including CLEAR+ benchmarking, model cards, compliance auditing, and the evaluation dashboard.
## What You'll Learn
- The 7 CLEAR+ evaluation dimensions (Cost, Latency, Efficacy, Assurance, Reliability, Coverage, Explainability)
- How to run benchmarks and generate model cards
- How the evaluation pipeline integrates with compliance
## Prerequisites

- FCC framework installed (`pip install -e .`)
- Basic understanding of personas and workflows
## The FCC Evaluation Ecosystem

```mermaid
flowchart TD
    subgraph Evaluation["Evaluation Framework"]
        CLEAR["CLEAR+ Metrics\n7 dimensions"]
        Bench["BenchmarkRunner\nMock + AI modes"]
        Cards["ModelCardGenerator\n128 cards"]
    end
    subgraph Compliance["Compliance Framework"]
        EUAI["EU AI Act\n256+ requirements"]
        NIST["NIST AI RMF\n29 subcategories"]
        Audit["ComplianceAuditor"]
    end
    CLEAR --> Bench
    Bench --> Cards
    Cards --> Audit
    EUAI --> Audit
    NIST --> Audit
    style Evaluation fill:#E3F2FD,stroke:#1565C0
    style Compliance fill:#FFF3E0,stroke:#E65100
```
## Quick Start
### Run a Benchmark
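The original snippet for this step is not shown, so here is a minimal sketch of what a benchmark run might look like. The import path `fcc.evaluation`, the `mode` parameter, and the `run()` signature are assumptions built around the `BenchmarkRunner` component and the "Mock + AI modes" named in the diagram above.

```python
# Hypothetical sketch -- import path and API are assumptions, not the
# framework's confirmed interface.
from fcc.evaluation import BenchmarkRunner  # assumed import path

# The diagram names "Mock + AI modes"; "mock" would avoid real API calls.
runner = BenchmarkRunner(mode="mock")  # assumed constructor parameter

# Run the benchmark for a persona and collect CLEAR+ scores.
results = runner.run(persona="support_agent")  # assumed method and args

for dimension, score in results.items():
    print(f"{dimension}: {score}")
```

Mock mode is the natural starting point for a demo, since it exercises the pipeline without incurring the cost or latency of live model calls.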
### Generate Model Cards
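Again hedging on the exact API: the sketch below assumes `ModelCardGenerator` lives in `fcc.evaluation` and exposes `generate_all()` and `write()` methods, none of which are confirmed by this page; only the class name comes from the diagram above.

```python
# Hypothetical sketch -- class location and method names are assumptions.
from fcc.evaluation import ModelCardGenerator  # assumed import path

generator = ModelCardGenerator()

# Produce a card per persona (the diagram mentions 128 cards) from the
# benchmark results gathered in the previous step.
cards = generator.generate_all(results)            # assumed method
generator.write(cards, output_dir="model_cards/")  # assumed method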
### Run a Full Evaluation
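A full evaluation chains the three components the diagram connects: benchmark, model cards, compliance audit. Every import and call in this end-to-end sketch is an assumption layered on the component names (`BenchmarkRunner`, `ModelCardGenerator`, `ComplianceAuditor`) and frameworks (EU AI Act, NIST AI RMF) shown above.

```python
# Hypothetical end-to-end sketch -- all APIs here are assumptions.
from fcc.evaluation import BenchmarkRunner, ModelCardGenerator
from fcc.compliance import ComplianceAuditor  # assumed import path

# 1. Benchmark the persona across the seven CLEAR+ dimensions.
results = BenchmarkRunner(mode="ai").run(persona="support_agent")

# 2. Turn the benchmark results into a model card.
card = ModelCardGenerator().generate(results)  # assumed method

# 3. Audit the card against the two compliance frameworks in the diagram.
auditor = ComplianceAuditor(frameworks=["eu_ai_act", "nist_ai_rmf"])
report = auditor.audit(card)   # assumed method
print(report.summary())       # assumed accessor
```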
## CLEAR+ Dimensions
| Dimension | Description | Example Metrics |
|---|---|---|
| Cost | Resource consumption per evaluation | Tokens, API calls |
| Latency | Response time under load | p50, p95, p99 |
| Efficacy | Quality of persona outputs | Score (0-5) |
| Assurance | Compliance and safety guarantees | Pass rate (%) |
| Reliability | Consistency across runs | Standard deviation |
| Coverage | Breadth of scenario coverage | % of scenarios |
| Explainability | Transparency of decisions | Trace completeness |
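For concreteness, one plausible shape for a CLEAR+ result is a record with one field per dimension. The `ClearPlusResult` dataclass below is purely illustrative, mirroring the table above; it is not a type defined by the FCC framework.

```python
from dataclasses import dataclass


@dataclass
class ClearPlusResult:
    """Illustrative container for the seven CLEAR+ dimensions.

    Field names follow the table above; the class itself is a sketch,
    not part of the framework's public API.
    """

    cost_tokens: int            # Cost: tokens / API calls consumed
    latency_p95_ms: float       # Latency: e.g. 95th-percentile response time
    efficacy_score: float       # Efficacy: output quality, 0-5
    assurance_pass_rate: float  # Assurance: compliance/safety pass rate (%)
    reliability_stddev: float   # Reliability: std. deviation across runs
    coverage_pct: float         # Coverage: % of scenarios exercised
    explainability: float       # Explainability: trace completeness
```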