CLEAR+ Evaluation and Benchmarking Prompts¶
Six prompts that exercise FCC's CLEAR+ evaluation framework (Cost, Latency, Efficacy, Assurance, Reliability, Coverage, Explainability). They assume familiarity with the fcc benchmark CLI, BenchmarkSpec, and the ModelCardGenerator. The default runner always simulates (mock mode), so experiments stay deterministic and cheap.
Personas Used¶
| Persona ID | Full Name | Category | Role in Prompts |
|---|---|---|---|
| BenchmarkRunner | (system role) | evaluation | Executes suites and collects metrics |
| EvaluationAnalyst | (champion-style) | evaluation | Interprets CLEAR+ scores, calls regressions |
| ModelCardGenerator | (system role) | evaluation | Produces Mitchell-format model cards |
Prompt 1: Defining a Benchmark Spec¶
Audience: Scientific · Difficulty: intermediate · Personas: EvaluationAnalyst
Context¶
A team wants to benchmark a new "legal contract clause extractor" persona.
Prompt¶
Draft a complete BenchmarkSpec YAML for a contract-clause-extraction
persona. Include:
- `id`, `version`, `task` = "extraction"
- `dataset`: 120 synthetic NDAs with gold-standard clause spans
- CLEAR+ weights favoring Assurance (0.25), Explainability (0.20),
Efficacy (0.20), Reliability (0.15), Coverage (0.10), Cost (0.05),
Latency (0.05). Justify each weight in two sentences.
- Pass/fail thresholds per dimension
- Three stress variants (adversarial phrasing, OCR noise, long docs)
Conclude with a critique listing which weight is most fragile to
stakeholder renegotiation and why.
Expected Output¶
- BenchmarkSpec YAML
- Per-weight justification
- Fragility critique
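A minimal sketch of what the requested spec could look like before thresholds and justifications are filled in. The field names (`clear_plus_weights`, `stress_variants`, and so on) are assumptions for illustration, not the canonical BenchmarkSpec schema; check `src/fcc/evaluation/` for the authoritative layout.

```python
# Hypothetical BenchmarkSpec skeleton; field names are illustrative only.
import yaml

spec = {
    "id": "contract-clause-extraction-v1",
    "version": "1.0.0",
    "task": "extraction",
    "dataset": {
        "name": "synthetic-ndas",
        "size": 120,
        "gold_labels": "clause_spans",
    },
    "clear_plus_weights": {
        "assurance": 0.25,
        "explainability": 0.20,
        "efficacy": 0.20,
        "reliability": 0.15,
        "coverage": 0.10,
        "cost": 0.05,
        "latency": 0.05,
    },
    "stress_variants": ["adversarial-phrasing", "ocr-noise", "long-documents"],
}

# The weights must form a convex combination (sum to 1.0) to be a valid CLEAR+ weighting.
assert abs(sum(spec["clear_plus_weights"].values()) - 1.0) < 1e-9
print(yaml.safe_dump(spec, sort_keys=False))
```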
Prompt 2: Interpreting CLEAR+ Dimensions¶
Audience: Scientific · Difficulty: beginner · Personas: EvaluationAnalyst
Prompt¶
Given this benchmark result:
- Cost: 0.82, Latency: 0.71, Efficacy: 0.88, Assurance: 0.54,
Reliability: 0.79, Coverage: 0.66, Explainability: 0.41
Produce a one-page executive brief that (a) translates each
dimension into plain English, (b) identifies the two dimensions a
risk officer would escalate first, (c) recommends three concrete
persona-level or workflow-level changes to address them.
Expected Output¶
- Executive brief
- Prioritized remediation list
Tips¶
- Keep "Assurance" framed as evidence-of-correctness rather than confidence.
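An illustrative helper for preparing the brief: rank the dimensions so the weakest two surface first and compute a headline composite. The scores are copied from the prompt; the unweighted mean is an assumption, not the suite's default aggregation.

```python
# Rank CLEAR+ dimensions for escalation; composite is an unweighted mean (assumption).
scores = {
    "Cost": 0.82, "Latency": 0.71, "Efficacy": 0.88, "Assurance": 0.54,
    "Reliability": 0.79, "Coverage": 0.66, "Explainability": 0.41,
}

composite = sum(scores.values()) / len(scores)        # ~0.69
escalate_first = sorted(scores, key=scores.get)[:2]   # lowest two: Explainability, Assurance

print(f"Composite CLEAR+ score: {composite:.2f}")
print("Escalate first:", ", ".join(escalate_first))
```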
Prompt 3: Comparing Two Models¶
Audience: Scientific · Difficulty: intermediate · Personas: EvaluationAnalyst, BenchmarkRunner
Prompt¶
Compare Model A (local 8B via ollama) and Model B (hosted 70B via
LiteLLM) on the `persona-triage-suite-v2` benchmark.
Find: load both BenchmarkResult JSON files and produce a side-by-side
exhibit with per-dimension deltas.
Create: write a BenchmarkComparison with the winner per dimension,
a composite CLEAR+ score, and a normalized cost-per-correct-answer.
Critique: identify the single dimension where Model A's "win" might
be an artifact of the mock runner rather than true capability.
Expected Output¶
- Comparison exhibit
- Composite score calculation
- Artifact-risk critique
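A sketch of the "Find" and "Create" steps, assuming each BenchmarkResult JSON exposes per-dimension scores plus run-level cost and correct-answer counts. The file paths and key names are hypothetical.

```python
# Side-by-side comparison of two hypothetical BenchmarkResult JSON files.
import json

with open("model_a_8b.json") as fa, open("model_b_70b.json") as fb:
    a, b = json.load(fa), json.load(fb)

DIMENSIONS = ["cost", "latency", "efficacy", "assurance",
              "reliability", "coverage", "explainability"]

for dim in DIMENSIONS:
    delta = b["scores"][dim] - a["scores"][dim]
    winner = "B" if delta > 0 else "A"
    print(f"{dim:15s} A={a['scores'][dim]:.2f}  B={b['scores'][dim]:.2f}  "
          f"delta={delta:+.2f}  winner={winner}")

# Normalized cost-per-correct-answer: total spend divided by correct outputs.
for label, result in (("A", a), ("B", b)):
    cpca = result["total_cost_usd"] / max(result["correct_answers"], 1)
    print(f"Model {label}: ${cpca:.4f} per correct answer")
```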
Prompt 4: Regression Detection in CI¶
Audience: Scientific · Difficulty: intermediate · Personas: BenchmarkRunner, EvaluationAnalyst
Prompt¶
Inspect the last 10 CI benchmark runs for the `sky-parlour-champion`
persona. Determine:
- Rolling mean and std for each CLEAR+ dimension
- Any point where a dimension crossed 2-sigma
- Whether the crossing coincided with a persona YAML change
Produce a regression report suitable for a GitHub PR comment with
a "block merge / request review / informational" recommendation.
Expected Output¶
- Rolling stats table
- Crossing timeline
- Merge recommendation
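A sketch of the 2-sigma check for a single dimension over the last 10 runs. In practice `history` would be read from stored BenchmarkResult artifacts; the values below are placeholders chosen to show a crossing.

```python
# 2-sigma regression check against the trailing nine runs (placeholder data).
from statistics import mean, stdev

history = [0.81, 0.80, 0.82, 0.79, 0.83, 0.81, 0.80, 0.82, 0.78, 0.61]  # e.g. Assurance

baseline, latest = history[:-1], history[-1]
mu, sigma = mean(baseline), stdev(baseline)

if sigma and abs(latest - mu) > 2 * sigma:
    print(f"2-sigma crossing: latest={latest:.2f}, mean={mu:.2f}, std={sigma:.2f}")
    print("Recommendation: block merge pending review of the persona YAML diff")
else:
    print("Within the 2-sigma band: informational only")
```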
Prompt 5: Authoring a Mitchell-Format Model Card¶
Audience: Scientific · Difficulty: intermediate · Personas: ModelCardGenerator
Prompt¶
Generate a Mitchell-format (Model Cards for Model Reporting, 2019)
card for a new "credit-risk-assistant" persona. Include all nine
sections, a Datasheet (Gebru et al. 2018) stub for the training data,
and a risk-category banner sourced from the EU AI Act classifier.
Ensure every quantitative claim in Intended Use, Evaluation Data, and
Quantitative Analysis is traceable to a benchmark run ID.
Expected Output¶
- Complete model card
- Datasheet stub
- Risk-category banner
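A skeleton of the nine section headings from Mitchell et al. (2019), with a placeholder traceability field on the sections that must cite a benchmark run ID. The dict layout is illustrative, not the ModelCardGenerator's actual output format.

```python
# Nine Mitchell et al. (2019) model-card sections; layout is illustrative only.
MITCHELL_SECTIONS = [
    "Model Details", "Intended Use", "Factors", "Metrics",
    "Evaluation Data", "Training Data", "Quantitative Analyses",
    "Ethical Considerations", "Caveats and Recommendations",
]

card = {section: {"content": None} for section in MITCHELL_SECTIONS}
for section in ("Intended Use", "Evaluation Data", "Quantitative Analyses"):
    card[section]["benchmark_run_id"] = None  # must point to a real benchmark run ID

print(list(card))
```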
Prompt 6: Stress Benchmark Design¶
Audience: Scientific · Difficulty: advanced · Personas: EvaluationAnalyst
Prompt¶
Design three stress tests for a multilingual-summarizer persona that
specifically probe Reliability and Coverage. For each stress test
describe the generator strategy, the metric that would detect
failure, and an explicit abort criterion.
Expected Output¶
- Three stress specs
- Abort criteria
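An illustrative shape for one of the three stress specs. The field names and example thresholds are assumptions; the point is that each spec pairs a generator strategy with a detection metric and an explicit abort criterion.

```python
# Hypothetical stress-spec structure; names and thresholds are assumptions.
stress_spec = {
    "name": "code-switched-inputs",
    "targets": ["reliability", "coverage"],
    "generator": "interleave sentences from two languages within one source document",
    "failure_metric": "per-language ROUGE-L drop versus the monolingual baseline",
    "abort_criterion": "stop the run if >20% of outputs are empty or single-language only",
}
```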
See Also¶
- src/fcc/evaluation/ and src/fcc/data/evaluation/
- Guidebook Chapter 14 (Evaluation and Benchmarking)
- Notebook 17 (CLEAR+ walkthrough)