CLEAR+ Evaluation and Benchmarking Prompts

Six prompts that exercise FCC's CLEAR+ evaluation framework: Cost, Latency, Efficacy, Assurance, Reliability, plus Coverage and Explainability. The prompts assume familiarity with the `fcc benchmark` CLI, BenchmarkSpec, and the ModelCardGenerator. The default runner always runs in simulate/mock mode, so experiments stay deterministic and cheap.

Personas Used

Persona ID         | Full Name        | Category   | Role in Prompts
BenchmarkRunner    | (system role)    | evaluation | Executes suites and collects metrics
EvaluationAnalyst  | (champion-style) | evaluation | Interprets CLEAR+ scores, calls regressions
ModelCardGenerator | (system role)    | evaluation | Produces Mitchell-format model cards

Prompt 1: Defining a Benchmark Spec

Audience: Scientific | Difficulty: intermediate | Personas: EvaluationAnalyst

Context

A team wants to benchmark a new "legal contract clause extractor" persona.

Prompt

Draft a complete BenchmarkSpec YAML for a contract-clause-extraction
persona. Include:
- `id`, `version`, `task` = "extraction"
- `dataset`: 120 synthetic NDAs with gold-standard clause spans
- CLEAR+ weights favoring Assurance (0.25), Explainability (0.20),
  Efficacy (0.20), Reliability (0.15), Coverage (0.10), Cost (0.05),
  Latency (0.05). Justify each weight in two sentences.
- Pass/fail thresholds per dimension
- Three stress variants (adversarial phrasing, OCR noise, long docs)

Conclude with a critique identifying which weight is most fragile to
stakeholder renegotiation and why.

Expected Output

  • BenchmarkSpec YAML
  • Per-weight justification
  • Fragility critique
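
As a sanity check before drafting the full YAML, the weight vector above can be validated in a few lines of plain Python. The dictionary layout and the flat 0.60 threshold below are illustrative assumptions, not the actual BenchmarkSpec schema.

```python
# CLEAR+ weights as given in the prompt; the dict layout and the flat
# pass threshold are placeholders, not the real BenchmarkSpec schema.
CLEAR_PLUS_WEIGHTS = {
    "assurance": 0.25,
    "explainability": 0.20,
    "efficacy": 0.20,
    "reliability": 0.15,
    "coverage": 0.10,
    "cost": 0.05,
    "latency": 0.05,
}

# Hypothetical per-dimension pass/fail thresholds.
PASS_THRESHOLDS = {dim: 0.60 for dim in CLEAR_PLUS_WEIGHTS}

# The weights must form a convex combination (sum to 1.0).
assert abs(sum(CLEAR_PLUS_WEIGHTS.values()) - 1.0) < 1e-9
```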

Prompt 2: Interpreting CLEAR+ Dimensions

Audience: Scientific | Difficulty: beginner | Personas: EvaluationAnalyst

Prompt

Given this benchmark result:
- Cost: 0.82, Latency: 0.71, Efficacy: 0.88, Assurance: 0.54,
  Reliability: 0.79, Coverage: 0.66, Explainability: 0.41

Produce a one-page executive brief that (a) translates each
dimension into plain English, (b) identifies the two dimensions a
risk officer would escalate first, (c) recommends three concrete
persona-level or workflow-level changes to address them.

Expected Output

  • Executive brief
  • Prioritized remediation list

Tips

  • Keep "Assurance" framed as evidence-of-correctness rather than confidence.
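
Part (b) can be pre-checked by simply ranking the reported scores. The snippet below uses the numbers from the prompt and nothing from fcc itself.

```python
# Rank the CLEAR+ scores from the prompt; the two lowest are the natural
# escalation candidates for part (b).
scores = {
    "cost": 0.82, "latency": 0.71, "efficacy": 0.88, "assurance": 0.54,
    "reliability": 0.79, "coverage": 0.66, "explainability": 0.41,
}
weakest_two = sorted(scores, key=scores.get)[:2]
print(weakest_two)  # ['explainability', 'assurance']
```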

Prompt 3: Comparing Two Models

Audience: Scientific | Difficulty: intermediate | Personas: EvaluationAnalyst, BenchmarkRunner

Prompt

Compare Model A (local 8B via ollama) and Model B (hosted 70B via
LiteLLM) on the `persona-triage-suite-v2` benchmark.

Find: load both BenchmarkResult JSON files and produce a side-by-side
exhibit with per-dimension deltas.

Create: write a BenchmarkComparison with the winner per dimension,
a composite CLEAR+ score, and a normalized cost-per-correct-answer.

Critique: identify the single dimension where Model A's "win" might
be an artifact of the mock runner rather than true capability.

Expected Output

  • Comparison exhibit
  • Composite score calculation
  • Artifact-risk critique
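
The composite and normalization steps the prompt asks for are straightforward arithmetic. The sketch below reuses the CLEAR+ weights from Prompt 1; every score, cost figure, and field name is a placeholder rather than a real BenchmarkResult field.

```python
# Weighted composite CLEAR+ score and cost-per-correct-answer; all input
# values below are made up for illustration.
def composite_score(scores: dict, weights: dict) -> float:
    """Weighted sum over CLEAR+ dimensions; weights are assumed to sum to 1."""
    return sum(weights[dim] * scores[dim] for dim in weights)

def cost_per_correct(total_cost_usd: float, n_correct: int) -> float:
    """Normalize total spend by the number of correct answers in the suite."""
    return total_cost_usd / max(n_correct, 1)

weights = {"assurance": 0.25, "explainability": 0.20, "efficacy": 0.20,
           "reliability": 0.15, "coverage": 0.10, "cost": 0.05, "latency": 0.05}
model_a = {"assurance": 0.61, "explainability": 0.66, "efficacy": 0.74,
           "reliability": 0.70, "coverage": 0.58, "cost": 0.90, "latency": 0.85}

print(round(composite_score(model_a, weights), 3))   # ~0.683
print(round(cost_per_correct(3.20, n_correct=96), 4))  # ~0.0333
```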

Prompt 4: Regression Detection in CI

Audience: Scientific | Difficulty: intermediate | Personas: BenchmarkRunner, EvaluationAnalyst

Prompt

Inspect the last 10 CI benchmark runs for the `sky-parlour-champion`
persona. Determine:
- Rolling mean and std for each CLEAR+ dimension
- Any point where a dimension crossed 2-sigma
- Whether the crossing coincided with a persona YAML change

Produce a regression report suitable for a GitHub PR comment with
a "block merge / request review / informational" recommendation.

Expected Output

  • Rolling stats table
  • Crossing timeline
  • Merge recommendation
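
The statistical core of this check fits in a few lines of standard-library Python. The sketch below assumes a 5-run rolling window; the dimension name and scores are made up.

```python
# Rolling mean/std per CLEAR+ dimension with a 2-sigma crossing check.
# Window size, dimension name, and scores are illustrative placeholders.
from statistics import mean, stdev

WINDOW = 5  # assumed rolling-window size

history = {
    "assurance": [0.71, 0.70, 0.72, 0.69, 0.73, 0.71, 0.70, 0.55, 0.56, 0.57],
}

for dim, series in history.items():
    for i in range(WINDOW, len(series)):
        window = series[i - WINDOW:i]          # the preceding WINDOW runs
        mu, sigma = mean(window), stdev(window)
        if sigma and abs(series[i] - mu) > 2 * sigma:
            print(f"{dim}: run {i} = {series[i]:.2f} crossed 2-sigma "
                  f"(rolling mean {mu:.2f}, std {sigma:.2f})")
```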

Prompt 5: Authoring a Mitchell-Format Model Card

Audience: Scientific | Difficulty: intermediate | Personas: ModelCardGenerator

Prompt

Generate a Mitchell-format (Model Cards for Model Reporting, 2019)
card for a new "credit-risk-assistant" persona. Include all nine
sections, a Datasheet (Gebru et al. 2018) stub for the training data,
and a risk-category banner sourced from the EU AI Act classifier.
Ensure every quantitative claim in Intended Use, Evaluation Data, and
Quantitative Analysis is traceable to a benchmark run ID.

Expected Output

  • Complete model card
  • Datasheet stub
  • Risk-category banner
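
For convenience, the nine section headings from Mitchell et al. (2019) are listed in the sketch below; the surrounding dict, including the benchmark-run-ID field for traceability, is a hypothetical layout rather than the ModelCardGenerator output format.

```python
# The nine section headings from Mitchell et al. (2019). The card skeleton
# and its benchmark_run_ids field are assumptions, not fcc's schema.
MITCHELL_SECTIONS = [
    "Model Details",
    "Intended Use",
    "Factors",
    "Metrics",
    "Evaluation Data",
    "Training Data",
    "Quantitative Analyses",
    "Ethical Considerations",
    "Caveats and Recommendations",
]

card = {section: {"text": "", "benchmark_run_ids": []} for section in MITCHELL_SECTIONS}
```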

Prompt 6: Stress Benchmark Design

Audience: Scientific | Difficulty: advanced | Personas: EvaluationAnalyst

Prompt

Design three stress tests for a multilingual-summarizer persona that
specifically probe Reliability and Coverage. For each stress test
describe the generator strategy, the metric that would detect
failure, and an explicit abort criterion.

Expected Output

  • Three stress specs
  • Abort criteria
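
One compact way to keep the three requested fields together per stress test is a small record type. The sketch below, including the example values, is hypothetical and not part of any fcc schema.

```python
# Hypothetical record for a stress-test spec; field names mirror what the
# prompt asks for and the example values are invented.
from dataclasses import dataclass

@dataclass
class StressSpec:
    name: str
    generator_strategy: str   # how adversarial inputs are produced
    failure_metric: str       # the metric expected to detect the failure
    abort_criterion: str      # explicit condition for stopping the run early

example = StressSpec(
    name="code-switching-paragraphs",
    generator_strategy="interleave sentences from two source languages per paragraph",
    failure_metric="per-language ROUGE-L drop versus a monolingual baseline",
    abort_criterion="abort if more than 20% of summaries are empty or off-language",
)
```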

See Also

  • src/fcc/evaluation/ and src/fcc/data/evaluation/
  • Guidebook Chapter 14 (Evaluation and Benchmarking)
  • Notebook 17 (CLEAR+ walkthrough)