Plugin Development: Benchmark Scorer Prompts

This file collects six end-to-end prompts for building a new Scorer plugin that participates in the CLEAR+ evaluation framework (7 dimensions: Cost, Latency, Efficacy, Assurance, Reliability, Coverage, Explainability). Scorers are one of the 11 plugin types and hook into BenchmarkRunner and BenchmarkSuite via src/fcc/evaluation/. Each prompt pins personas, R.I.S.C.E.A.R. slots, and the deliverable shape.

Table of Contents

  1. Choose a CLEAR+ Dimension
  2. Skeleton the Scorer
  3. Integrate with BenchmarkRunner
  4. Calibrate Against Baseline Benchmarks
  5. Test the Scorer
  6. Ship with CI Integration

1. Choose a CLEAR+ Dimension

Personas/subsystems invoked. dal, ra. R.I.S.C.E.A.R. slot: Role + Input.

You are the Data Analyst Lead (dal). Risk Analyst (ra) reviews.

TASK: Propose a new scorer that deepens one of the 7 CLEAR+ dimensions.
For the chosen dimension:
1. Argue why existing metrics under CLEARPlusMetrics are insufficient.
2. Define the new sub-metric (formula or evaluation procedure).
3. State the value range and sign convention (higher-is-better or lower-is-better).
4. Map to at least 1 of the 256+ EU AI Act requirements if relevant.

CONSTRAINTS:
- Do not propose an 8th CLEAR+ dimension. Stay within the existing 7.
- Must be computable from data already emitted by SimulationEngine +
  EventBus.

Deliverable: 4-section Markdown, 250 to 400 words.

Expected output notes. Stays within 7 dimensions; sign convention stated; optional AI Act mapping.
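To make the deliverable concrete, here is a minimal sketch of how a proposed sub-metric might be specified. All identifiers are illustrative placeholders, not names from CLEARPlusMetrics; the point is only that the formula, value range, and sign convention are stated explicitly:

    from dataclasses import dataclass


    @dataclass(frozen=True)
    class SubMetricSpec:
        """Hypothetical shape for a proposed CLEAR+ sub-metric definition."""
        dimension: str                      # one of the existing 7 dimensions
        name: str                           # e.g. "latency_tail_stability" (illustrative)
        higher_is_better: bool              # sign convention, stated explicitly
        value_range: tuple[float, float]    # e.g. (0.0, 1.0)


    def latency_tail_stability(p50_ms: float, p99_ms: float) -> float:
        """Illustrative formula: ratio of median to tail latency.

        Values near 1.0 mean a tight tail; values near 0.0 mean the p99
        tail dwarfs the median. Higher is better, bounded in (0.0, 1.0].
        """
        if p50_ms <= 0 or p99_ms <= 0:
            raise ValueError("latencies must be positive")
        return min(p50_ms / p99_ms, 1.0)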


2. Skeleton the Scorer

Personas/subsystems invoked. tr. R.I.S.C.E.A.R. slot: Expected Output.

You are the Technical Reviewer (tr).

TASK: Produce the Python skeleton for the scorer at
src/fcc/plugins/<scorer_name>.py.

Requirements:
1. Subclass the Scorer plugin ABC.
2. Expose score(benchmark_result) -> float and metadata().
3. Use a dataclass for config.

CONSTRAINTS:
- No Pydantic.
- Pure function semantics; no I/O inside score().

Deliverable: a .py file body, 50 to 100 lines with docstrings.

Expected output notes. Dataclass config, pure score(), no I/O in score.
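One possible shape for that skeleton, assuming the Scorer ABC is importable from a base module and that the result object exposes the two latency fields read below (both are assumptions for illustration, not confirmed API):

    from dataclasses import dataclass
    from typing import Any

    from fcc.plugins.base import Scorer  # assumed import path for the Scorer ABC


    @dataclass(frozen=True)
    class TailStabilityConfig:
        """Plain dataclass config (no Pydantic), per the constraints."""
        warn_threshold: float = 0.5
        weight: float = 1.0


    class TailStabilityScorer(Scorer):
        """Illustrative scorer; the name and result fields are placeholders."""

        def __init__(self, config: TailStabilityConfig | None = None) -> None:
            self.config = config or TailStabilityConfig()

        def score(self, benchmark_result: Any) -> float:
            """Pure function of the result object: no I/O, no side effects."""
            p50 = benchmark_result.latency_p50_ms  # assumed field name
            p99 = benchmark_result.latency_p99_ms  # assumed field name
            return min(p50 / p99, 1.0) * self.config.weight

        def metadata(self) -> dict[str, Any]:
            return {
                "name": "latency_tail_stability",
                "dimension": "Latency",
                "higher_is_better": True,
                "value_range": (0.0, 1.0),
            }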


3. Integrate with BenchmarkRunner

Personas/subsystems invoked. tr, dal. R.I.S.C.E.A.R. slot: Role Collaborators.

You are the Technical Reviewer (tr). Data Analyst Lead (dal) reviews.

TASK: Show how the scorer plugs into BenchmarkRunner
(src/fcc/evaluation/runner.py) so that:
1. BenchmarkSpec.dimensions can request the new sub-metric.
2. BenchmarkResult carries the score with name + value + rationale.
3. BenchmarkComparison diffs pre/post runs.

CONSTRAINTS:
- BenchmarkRunner default stays mock-first ("always-simulate").
- No changes to BenchmarkSpec public API that would break existing
  benchmarks.

Deliverable: a diff-style patch plus a 3-bullet summary of
backward-compat guarantees.

Expected output notes. Backward compatibility preserved; comparison wired; 3 guarantees listed.
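For orientation only, the wiring this prompt asks for could look roughly like the sketch below. Every attribute used here (spec.dimensions, runner.registry.scorers_for, result.scores) is an assumption about the runner's internals, not the confirmed API of src/fcc/evaluation/runner.py:

    def apply_scorers(runner, spec, result):
        """Sketch: attach requested sub-metric scores to a result object."""
        for dimension in getattr(spec, "dimensions", []):
            # Assumed registry hook; the real plugin lookup may differ.
            for scorer in runner.registry.scorers_for(dimension):
                result.scores.append(
                    {
                        "name": scorer.metadata()["name"],
                        "value": scorer.score(result),
                        "rationale": f"{dimension} sub-metric from {type(scorer).__name__}",
                    }
                )
        return result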


4. Calibrate Against Baseline Benchmarks

Personas/subsystems invoked. dal, ra. R.I.S.C.E.A.R. slot: Constraints.

You are the Data Analyst Lead (dal). Risk Analyst (ra) reviews.

TASK: Calibrate the scorer against the packaged baseline benchmarks at
src/fcc/data/evaluation/baseline_benchmarks.yaml and the stress set
stress_benchmarks.yaml.

Produce:
1. A 5-row table: benchmark name, expected score band, rationale.
2. Thresholds for PASS/WARN/FAIL used by CI.
3. A note on cross-dimension interactions (e.g., improving Coverage
   but degrading Latency).

CONSTRAINTS:
- Do not modify the baseline YAMLs; add config in a separate file.
- Respect ci_benchmark_config.yaml structure for CI wiring.

Deliverable: a YAML snippet for the new config plus the 5-row table.

Expected output notes. Baselines unchanged, YAML follows existing structure, 5 rows.
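The PASS/WARN/FAIL thresholds reduce to a small classification step. A sketch for a higher-is-better score, with placeholder cut-offs that the calibration table from this prompt should actually set:

    def classify(score: float, pass_at: float = 0.8, warn_at: float = 0.6) -> str:
        """Map a higher-is-better score onto CI outcomes; cut-offs are placeholders."""
        if score >= pass_at:
            return "PASS"
        if score >= warn_at:
            return "WARN"
        return "FAIL"


    assert classify(0.91) == "PASS"
    assert classify(0.65) == "WARN"
    assert classify(0.20) == "FAIL"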


5. Test the Scorer

Personas/subsystems invoked. tr. R.I.S.C.E.A.R. slot: Expected Output.

You are the Technical Reviewer (tr).

TASK: Produce pytest tests at tests/plugins/test_<scorer_name>.py.

Cases:
1. Score is deterministic for a fixed BenchmarkResult.
2. Score is monotonic with respect to the defined axis (construct two
   results where only one variable differs).
3. Score is bounded within the stated value range.
4. Integration: BenchmarkRunner picks up the plugin via the plugin
   registry.

CONSTRAINTS:
- pytest fixtures.
- >= 98% line and >= 80% branch coverage.
- No AI provider calls; use the mock provider.

Deliverable: pytest file body, 80 to 150 lines.

Expected output notes. 4 cases; monotonicity test present; mock provider only.
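A condensed sketch of the first three cases, assuming the TailStabilityScorer placeholder from prompt 2 lives at a hypothetical module path; the integration case is omitted here because it depends on plugin-registry APIs not shown in this file:

    import pytest

    # Hypothetical module path for the prompt-2 sketch; adjust to the real one.
    from fcc.plugins.tail_stability import TailStabilityScorer


    class FakeResult:
        """Minimal stand-in exposing only the fields the scorer reads."""
        def __init__(self, p50: float, p99: float) -> None:
            self.latency_p50_ms = p50
            self.latency_p99_ms = p99


    @pytest.fixture
    def scorer() -> TailStabilityScorer:
        return TailStabilityScorer()


    def test_score_is_deterministic(scorer):
        result = FakeResult(p50=120.0, p99=480.0)
        assert scorer.score(result) == scorer.score(result)


    def test_score_is_monotonic_in_tail(scorer):
        tight = FakeResult(p50=120.0, p99=240.0)
        loose = FakeResult(p50=120.0, p99=960.0)
        assert scorer.score(tight) > scorer.score(loose)


    def test_score_is_bounded(scorer):
        result = FakeResult(p50=500.0, p99=500.0)
        assert 0.0 <= scorer.score(result) <= 1.0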


6. Ship with CI Integration

Personas/subsystems invoked. sre, ra. R.I.S.C.E.A.R. slot: Role Adoption Checklist.

You are the Site Reliability Engineer (sre). Risk Analyst (ra) reviews.

TASK: Produce the CI + governance bundle:
1. Update to .github/workflows for the CI benchmark job to include the
   new scorer (summary, not full YAML).
2. CHANGELOG entry (MINOR).
3. Model-card appendix describing the new sub-metric (2 paragraphs).
4. ConstitutionRegistry entry at the preferred tier.
5. A 6-item Role Adoption Checklist.

CONSTRAINTS:
- CI must remain green with default configuration (new scorer
  optional).
- Do not raise existing CLEAR+ thresholds.

Deliverable: Markdown with 5 sections, total 300 to 500 words.

Expected output notes. CI stays green by default; MINOR bump; 6-item checklist present.