Plugin Development: Benchmark Scorer Prompts¶
This file collects six end-to-end prompts for building a new Scorer plugin that participates in the CLEAR+ evaluation framework (7 dimensions: Cost, Latency, Efficacy, Assurance, Reliability, Coverage, Explainability). Scorers are one of the 11 plugin types and hook into BenchmarkRunner and BenchmarkSuite via src/fcc/evaluation/. Each prompt pins personas, R.I.S.C.E.A.R. slots, and the deliverable shape.
Table of Contents¶
- Choose a CLEAR+ Dimension
- Skeleton the Scorer
- Integrate with BenchmarkRunner
- Calibrate Against Baseline Benchmarks
- Test the Scorer
- Ship with CI Integration
1. Choose a CLEAR+ Dimension¶
Personas/subsystems invoked. dal, ra. R.I.S.C.E.A.R. slot: Role + Input.
You are the Data Analyst Lead (dal). Risk Analyst (ra) reviews.
TASK: Propose a new scorer that deepens one of the 7 CLEAR+ dimensions.
For the chosen dimension:
1. Argue why existing metrics under CLEARPlusMetrics are insufficient.
2. Define the new sub-metric (formula or evaluation procedure).
3. State the value range and sign convention (higher-is-better or lower-is-better).
4. Map to at least one of the 256+ EU AI Act requirements, if relevant.
CONSTRAINTS:
- Do not propose an 8th CLEAR+ dimension. Stay within the existing 7.
- Must be computable from data already emitted by SimulationEngine +
EventBus.
Deliverable: 4-section Markdown, 250 to 400 words.
Expected output notes. Stays within 7 dimensions; sign convention stated; optional AI Act mapping.
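As a hedged illustration of steps 2–3, here is one shape a sub-metric definition could take. Everything below is hypothetical: `retry_recovery_ratio`, its inputs, and its formula are illustrative assumptions, not part of CLEARPlusMetrics — the point is that the metric is a pure function of counts the event stream already carries, with an explicit range and sign convention.

```python
# Hypothetical Reliability sub-metric -- all names are illustrative, not
# part of the framework. Computable from event counts that a simulated
# run already emits (no new instrumentation required).

def retry_recovery_ratio(retries_attempted: int, retries_succeeded: int) -> float:
    """Fraction of retried operations that eventually succeeded.

    Range [0.0, 1.0], higher-is-better. Returns 1.0 when no retries
    occurred, since nothing needed recovering.
    """
    if retries_attempted == 0:
        return 1.0
    return retries_succeeded / retries_attempted
```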
2. Skeleton the Scorer¶
Personas/subsystems invoked. tr. R.I.S.C.E.A.R. slot: Expected Output.
You are the Technical Reviewer (tr).
TASK: Produce the Python skeleton for the scorer at
src/fcc/plugins/<scorer_name>.py.
Requirements:
1. Subclass the Scorer plugin ABC.
2. Expose score(benchmark_result) -> float and metadata().
3. Use a dataclass for config.
CONSTRAINTS:
- No Pydantic.
- Pure function semantics; no I/O inside score().
Deliverable: a .py file body, 50 to 100 lines with docstrings.
Expected output notes. Dataclass config, pure score(), no I/O in score.
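A minimal sketch of the requested skeleton, continuing a hypothetical retry-recovery sub-metric as the worked example. The real `Scorer` ABC and result type live inside the `fcc` package; stand-ins are defined locally here so the snippet is self-contained, and every name is an assumption rather than the actual plugin API:

```python
# Sketch only: Scorer and the result shape are stand-ins for the real
# fcc plugin ABC, defined here so the example runs on its own.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Mapping, Optional


class Scorer(ABC):  # stand-in for the real plugin ABC
    @abstractmethod
    def score(self, benchmark_result: Mapping[str, Any]) -> float: ...

    @abstractmethod
    def metadata(self) -> Mapping[str, Any]: ...


@dataclass(frozen=True)
class RetryRecoveryConfig:
    """Plain dataclass config -- no Pydantic, per the constraints."""
    neutral_score: float = 1.0  # score emitted when no retries occurred


class RetryRecoveryScorer(Scorer):
    """Hypothetical Reliability sub-metric: fraction of retries recovered."""

    def __init__(self, config: Optional[RetryRecoveryConfig] = None) -> None:
        self.config = config or RetryRecoveryConfig()

    def score(self, benchmark_result: Mapping[str, Any]) -> float:
        # Pure function of its input: no I/O, no hidden state.
        attempted = benchmark_result.get("retries_attempted", 0)
        succeeded = benchmark_result.get("retries_succeeded", 0)
        if attempted == 0:
            return self.config.neutral_score
        return succeeded / attempted

    def metadata(self) -> Mapping[str, Any]:
        return {
            "name": "retry_recovery_ratio",
            "dimension": "Reliability",
            "range": (0.0, 1.0),
            "higher_is_better": True,
        }
```

The frozen dataclass keeps the config hashable and immutable, which reinforces the pure-function constraint on `score()`.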
3. Integrate with BenchmarkRunner¶
Personas/subsystems invoked. tr, dal. R.I.S.C.E.A.R. slot: Role Collaborators.
You are the Technical Reviewer (tr). Data Analyst Lead (dal) reviews.
TASK: Show how the scorer plugs into BenchmarkRunner
(src/fcc/evaluation/runner.py) so that:
1. BenchmarkSpec.dimensions can request the new sub-metric.
2. BenchmarkResult carries the score with name + value + rationale.
3. BenchmarkComparison diffs pre/post runs.
CONSTRAINTS:
- BenchmarkRunner default stays mock-first ("always-simulate").
- No changes to BenchmarkSpec public API that would break existing
benchmarks.
Deliverable: a diff-style patch plus a 3-bullet summary of
backward-compat guarantees.
Expected output notes. Backward compatibility preserved; comparison wired; 3 guarantees listed.
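To make the three integration points concrete, here is an illustrative sketch of the intended wiring. The actual `BenchmarkRunner`, `BenchmarkSpec`, and plugin registry live in `src/fcc/evaluation/runner.py` and will differ; the stand-ins below only demonstrate the backward-compat pattern — the spec requests a sub-metric by name, the runner looks it up in a registry, unknown names are skipped rather than raising, and each score travels as name + value + rationale:

```python
# Illustrative stand-ins, not the real fcc API. Shows: registry lookup,
# additive-only spec fields, and graceful skipping of unknown metrics.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

SCORER_REGISTRY: Dict[str, Callable[[Dict[str, Any]], float]] = {}


def register_scorer(name: str):
    """Decorator that adds a scorer to the (stand-in) plugin registry."""
    def deco(fn):
        SCORER_REGISTRY[name] = fn
        return fn
    return deco


@register_scorer("retry_recovery_ratio")
def retry_recovery_ratio(result: Dict[str, Any]) -> float:
    attempted = result.get("retries_attempted", 0)
    if attempted == 0:
        return 1.0
    return result.get("retries_succeeded", 0) / attempted


@dataclass
class BenchmarkSpec:  # stand-in; new fields are additive with defaults
    name: str
    dimensions: List[str] = field(default_factory=list)


def run(spec: BenchmarkSpec, raw: Dict[str, Any]) -> Dict[str, dict]:
    """Score a (mock-first) run for each sub-metric the spec requests."""
    scores: Dict[str, dict] = {}
    for dim in spec.dimensions:
        scorer = SCORER_REGISTRY.get(dim)
        if scorer is None:
            continue  # unknown sub-metric: skip, don't break old benchmarks
        scores[dim] = {
            "name": dim,
            "value": scorer(raw),
            "rationale": f"{dim} computed from simulated run data",
        }
    return scores
```

Because unrequested and unknown sub-metrics are ignored, existing benchmark specs run unchanged — which is exactly the backward-compat guarantee the deliverable must spell out.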
4. Calibrate Against Baseline Benchmarks¶
Personas/subsystems invoked. dal, ra. R.I.S.C.E.A.R. slot: Constraints.
You are the Data Analyst Lead (dal). Risk Analyst (ra) reviews.
TASK: Calibrate the scorer against the packaged baseline benchmarks at
src/fcc/data/evaluation/baseline_benchmarks.yaml and the stress set
stress_benchmarks.yaml.
Produce:
1. A 5-row table: benchmark name, expected score band, rationale.
2. Thresholds for PASS/WARN/FAIL used by CI.
3. A note on cross-dimension interactions (e.g., improving Coverage
but degrading Latency).
CONSTRAINTS:
- Do not modify the baseline YAMLs; add config in a separate file.
- Respect ci_benchmark_config.yaml structure for CI wiring.
Deliverable: a YAML snippet for the new config plus the 5-row table.
Expected output notes. Baselines unchanged, YAML follows existing structure, 5 rows.
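One possible shape for the separate config file, shown only as a sketch: the key names below are assumptions, and the real `ci_benchmark_config.yaml` schema should be followed instead wherever it differs. The important properties are that the baseline YAMLs stay untouched and the new scorer ships disabled by default:

```yaml
# Illustrative structure only -- mirror the real ci_benchmark_config.yaml
# schema; these keys are hypothetical.
scorers:
  retry_recovery_ratio:     # hypothetical scorer name
    enabled: false          # off by default so existing CI stays green
    thresholds:
      pass: 0.90            # PASS at or above this score
      warn: 0.75            # WARN between warn and pass
                            # below warn -> FAIL
```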
5. Test the Scorer¶
Personas/subsystems invoked. tr. R.I.S.C.E.A.R. slot: Expected Output.
You are the Technical Reviewer (tr).
TASK: Produce pytest tests at tests/plugins/test_<scorer_name>.py.
Cases:
1. Score is deterministic for a fixed BenchmarkResult.
2. Score is monotonic with respect to the defined axis (construct two
   results where only one variable differs).
3. Score is bounded within the stated value range.
4. Integration: BenchmarkRunner picks up the plugin via the plugin
registry.
CONSTRAINTS:
- pytest fixtures.
- >= 98% line and >= 80% branch coverage.
- No AI provider calls; use the mock provider.
Deliverable: pytest file body, 80 to 150 lines.
Expected output notes. 4 cases; monotonicity test present; mock provider only.
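The first three cases can be sketched as below, using a hypothetical retry-recovery scorer inlined as a stand-in so the example runs on its own. In the real test file the scorer would be imported from `src/fcc/plugins/<scorer_name>.py`, shared inputs would be pytest fixtures, and a fourth case would exercise the plugin registry (omitted here because it depends on fcc internals):

```python
# Sketch of the determinism, monotonicity, and boundedness cases.
# The scorer is a local stand-in; all names are illustrative.

def retry_recovery_ratio(result: dict) -> float:
    """Stand-in for the scorer under test."""
    attempted = result.get("retries_attempted", 0)
    if attempted == 0:
        return 1.0
    return result.get("retries_succeeded", 0) / attempted


# In the real suite this would be a pytest fixture.
FIXED_RESULT = {"retries_attempted": 10, "retries_succeeded": 7}


def test_deterministic():
    # Same input twice -> identical score.
    assert retry_recovery_ratio(FIXED_RESULT) == retry_recovery_ratio(FIXED_RESULT)


def test_monotonic():
    # Two results where only retries_succeeded differs: more recoveries
    # must never lower the score.
    better = dict(FIXED_RESULT, retries_succeeded=9)
    assert retry_recovery_ratio(better) >= retry_recovery_ratio(FIXED_RESULT)


def test_bounded():
    # Score stays inside the stated [0.0, 1.0] range.
    assert 0.0 <= retry_recovery_ratio(FIXED_RESULT) <= 1.0
```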
6. Ship with CI Integration¶
Personas/subsystems invoked. sre, ra. R.I.S.C.E.A.R. slot: Role Adoption Checklist.
You are the Site Reliability Engineer (sre). Risk Analyst (ra) reviews.
TASK: Produce the CI + governance bundle:
1. Update to .github/workflows for the CI benchmark job to include the
new scorer (summary, not full YAML).
2. CHANGELOG entry (MINOR).
3. Model-card appendix describing the new sub-metric (2 paragraphs).
4. ConstitutionRegistry entry at the preferred tier.
5. A 6-item Role Adoption Checklist.
CONSTRAINTS:
- CI must remain green with default configuration (new scorer
optional).
- Do not raise existing CLEAR+ thresholds.
Deliverable: Markdown with 5 sections, total 300 to 500 words.
Expected output notes. CI stays green by default; MINOR bump; 6-item checklist present.