CLEAR+ Evaluation and Benchmarking Prompts

Six prompts that exercise FCC's CLEAR+ evaluation framework: Cost, Latency, Efficacy, Assurance, Reliability, plus Coverage and Explainability. The prompts assume familiarity with the `fcc benchmark` CLI, BenchmarkSpec, and the ModelCardGenerator. The default runner always runs in simulate/mock mode, so experiments stay deterministic and cheap.

Personas Used

Persona ID         | Full Name        | Category   | Role in Prompts
BenchmarkRunner    | (system role)    | evaluation | Executes suites and collects metrics
EvaluationAnalyst  | (champion-style) | evaluation | Interprets CLEAR+ scores, calls regressions
ModelCardGenerator | (system role)    | evaluation | Produces Mitchell-format model cards

Prompt 1: Defining a Benchmark Spec

Audience: Scientific | Difficulty: intermediate | Personas: EvaluationAnalyst

Context

A team wants to benchmark a new "legal contract clause extractor" persona.

Prompt

Draft a complete BenchmarkSpec YAML for a contract-clause-extraction
persona. Include:
- `id`, `version`, `task` = "extraction"
- `dataset`: 120 synthetic NDAs with gold-standard clause spans
- CLEAR+ weights favoring Assurance (0.25), Explainability (0.20),
  Efficacy (0.20), Reliability (0.15), Coverage (0.10), Cost (0.05),
  Latency (0.05). Justify each weight in two sentences.
- Pass/fail thresholds per dimension
- Three stress variants (adversarial phrasing, OCR noise, long docs)

Conclude with a critique identifying which weight is most fragile to
stakeholder renegotiation and why.

Expected Output

  • BenchmarkSpec YAML
  • Per-weight justification
  • Fragility critique
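
As a sanity check before drafting the full YAML, the weight vector above can be validated in a few lines of plain Python. The dictionary layout and the flat 0.60 threshold below are illustrative assumptions, not the actual BenchmarkSpec schema.

```python
# CLEAR+ weights as given in the prompt; the dict layout and the flat
# pass threshold are placeholders, not the real BenchmarkSpec schema.
CLEAR_PLUS_WEIGHTS = {
    "assurance": 0.25,
    "explainability": 0.20,
    "efficacy": 0.20,
    "reliability": 0.15,
    "coverage": 0.10,
    "cost": 0.05,
    "latency": 0.05,
}

# Hypothetical per-dimension pass/fail thresholds.
PASS_THRESHOLDS = {dim: 0.60 for dim in CLEAR_PLUS_WEIGHTS}

# The weights must form a convex combination (sum to 1.0).
assert abs(sum(CLEAR_PLUS_WEIGHTS.values()) - 1.0) < 1e-9
```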

Prompt 2: Interpreting CLEAR+ Dimensions

Audience: Scientific | Difficulty: beginner | Personas: EvaluationAnalyst

Prompt

Given this benchmark result:
- Cost: 0.82, Latency: 0.71, Efficacy: 0.88, Assurance: 0.54,
  Reliability: 0.79, Coverage: 0.66, Explainability: 0.41

Produce a one-page executive brief that (a) translates each
dimension into plain English, (b) identifies the two dimensions a
risk officer would escalate first, (c) recommends three concrete
persona-level or workflow-level changes to address them.

Expected Output

  • Executive brief
  • Prioritized remediation list

Tips

  • Keep "Assurance" framed as evidence-of-correctness rather than confidence.
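
Part (b) can be pre-checked by simply ranking the reported scores. The snippet below uses the numbers from the prompt and nothing from fcc itself.

```python
# Rank the CLEAR+ scores from the prompt; the two lowest are the natural
# escalation candidates for part (b).
scores = {
    "cost": 0.82, "latency": 0.71, "efficacy": 0.88, "assurance": 0.54,
    "reliability": 0.79, "coverage": 0.66, "explainability": 0.41,
}
weakest_two = sorted(scores, key=scores.get)[:2]
print(weakest_two)  # ['explainability', 'assurance']
```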

Prompt 3: Comparing Two Models

Audience: Scientific | Difficulty: intermediate | Personas: EvaluationAnalyst, BenchmarkRunner

Prompt

Compare Model A (local 8B via ollama) and Model B (hosted 70B via
LiteLLM) on the `persona-triage-suite-v2` benchmark.

Find: load both BenchmarkResult JSON files and produce a side-by-side
exhibit with per-dimension deltas.

Create: write a BenchmarkComparison with the winner per dimension,
a composite CLEAR+ score, and a normalized cost-per-correct-answer.

Critique: identify the single dimension where Model A's "win" might
be an artifact of the mock runner rather than true capability.

Expected Output

  • Comparison exhibit
  • Composite score calculation
  • Artifact-risk critique
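
The composite and normalization steps the prompt asks for are straightforward arithmetic. The sketch below reuses the CLEAR+ weights from Prompt 1; every score, cost figure, and field name is a placeholder rather than a real BenchmarkResult field.

```python
# Weighted composite CLEAR+ score and cost-per-correct-answer; all input
# values below are made up for illustration.
def composite_score(scores: dict, weights: dict) -> float:
    """Weighted sum over CLEAR+ dimensions; weights are assumed to sum to 1."""
    return sum(weights[dim] * scores[dim] for dim in weights)

def cost_per_correct(total_cost_usd: float, n_correct: int) -> float:
    """Normalize total spend by the number of correct answers in the suite."""
    return total_cost_usd / max(n_correct, 1)

weights = {"assurance": 0.25, "explainability": 0.20, "efficacy": 0.20,
           "reliability": 0.15, "coverage": 0.10, "cost": 0.05, "latency": 0.05}
model_a = {"assurance": 0.61, "explainability": 0.66, "efficacy": 0.74,
           "reliability": 0.70, "coverage": 0.58, "cost": 0.90, "latency": 0.85}

print(round(composite_score(model_a, weights), 3))   # ~0.683
print(round(cost_per_correct(3.20, n_correct=96), 4))  # ~0.0333
```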

Prompt 4: Regression Detection in CI

Audience: Scientific | Difficulty: intermediate | Personas: BenchmarkRunner, EvaluationAnalyst

Prompt

Inspect the last 10 CI benchmark runs for the `sky-parlour-champion`
persona. Determine:
- Rolling mean and std for each CLEAR+ dimension
- Any point where a dimension crossed 2-sigma
- Whether the crossing coincided with a persona YAML change

Produce a regression report suitable for a GitHub PR comment with
a "block merge / request review / informational" recommendation.

Expected Output

  • Rolling stats table
  • Crossing timeline
  • Merge recommendation
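
The statistical core of this check fits in a few lines of standard-library Python. The sketch below assumes a 5-run rolling window; the dimension name and scores are made up.

```python
# Rolling mean/std per CLEAR+ dimension with a 2-sigma crossing check.
# Window size, dimension name, and scores are illustrative placeholders.
from statistics import mean, stdev

WINDOW = 5  # assumed rolling-window size

history = {
    "assurance": [0.71, 0.70, 0.72, 0.69, 0.73, 0.71, 0.70, 0.55, 0.56, 0.57],
}

for dim, series in history.items():
    for i in range(WINDOW, len(series)):
        window = series[i - WINDOW:i]          # the preceding WINDOW runs
        mu, sigma = mean(window), stdev(window)
        if sigma and abs(series[i] - mu) > 2 * sigma:
            print(f"{dim}: run {i} = {series[i]:.2f} crossed 2-sigma "
                  f"(rolling mean {mu:.2f}, std {sigma:.2f})")
```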

Prompt 5: Authoring a Mitchell-Format Model Card

Audience: Scientific | Difficulty: intermediate | Personas: ModelCardGenerator

Prompt

Generate a Mitchell-format (Model Cards for Model Reporting, 2019)
card for a new "credit-risk-assistant" persona. Include all nine
sections, a Datasheet (Gebru et al. 2018) stub for the training data,
and a risk-category banner sourced from the EU AI Act classifier.
Ensure every quantitative claim in Intended Use, Evaluation Data, and
Quantitative Analysis is traceable to a benchmark run ID.

Expected Output

  • Complete model card
  • Datasheet stub
  • Risk-category banner
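
For convenience, the nine section headings from Mitchell et al. (2019) are listed in the sketch below; the surrounding dict, including the benchmark-run-ID field for traceability, is a hypothetical layout rather than the ModelCardGenerator output format.

```python
# The nine section headings from Mitchell et al. (2019). The card skeleton
# and its benchmark_run_ids field are assumptions, not fcc's schema.
MITCHELL_SECTIONS = [
    "Model Details",
    "Intended Use",
    "Factors",
    "Metrics",
    "Evaluation Data",
    "Training Data",
    "Quantitative Analyses",
    "Ethical Considerations",
    "Caveats and Recommendations",
]

card = {section: {"text": "", "benchmark_run_ids": []} for section in MITCHELL_SECTIONS}
```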

Prompt 6: Stress Benchmark Design

Audience: Scientific | Difficulty: advanced | Personas: EvaluationAnalyst

Prompt

Design three stress tests for a multilingual-summarizer persona that
specifically probe Reliability and Coverage. For each stress test
describe the generator strategy, the metric that would detect
failure, and an explicit abort criterion.

Expected Output

  • Three stress specs
  • Abort criteria
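
One compact way to keep the three requested fields together per stress test is a small record type. The sketch below, including the example values, is hypothetical and not part of any fcc schema.

```python
# Hypothetical record for a stress-test spec; field names mirror what the
# prompt asks for and the example values are invented.
from dataclasses import dataclass

@dataclass
class StressSpec:
    name: str
    generator_strategy: str   # how adversarial inputs are produced
    failure_metric: str       # the metric expected to detect the failure
    abort_criterion: str      # explicit condition for stopping the run early

example = StressSpec(
    name="code-switching-paragraphs",
    generator_strategy="interleave sentences from two source languages per paragraph",
    failure_metric="per-language ROUGE-L drop versus a monolingual baseline",
    abort_criterion="abort if more than 20% of summaries are empty or off-language",
)
```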

See Also

  • src/fcc/evaluation/ and src/fcc/data/evaluation/
  • Guidebook Chapter 14 (Evaluation and Benchmarking)
  • Notebook 17 (CLEAR+ walkthrough)