Dataset Card
A Dataset Card (a.k.a. datasheet) records a dataset's origin, composition, intended use, collection procedure, preprocessing steps, distribution channel, and maintenance plan. Datasheets surface the assumptions and biases baked into training and evaluation corpora so downstream consumers — model trainers, regulators, Forensic Auditors — can assess fitness-for-purpose before committing. Produce this artifact during the Critique phase whenever a dataset is created, acquired, substantially transformed, or published to another project.
Template
Section 1: Dataset Overview
Instructions: Give the dataset a stable name and version, name the steward, record the creation and update dates, and declare license / IP status up front. The FAIR² compliance field should link to the FAIR self-assessment (OPEN-SCI-002).
| Field | Value |
|---|---|
| Dataset name | [FILL] |
| Version | [FILL] |
| Date created / updated | [FILL] |
| Creator / steward | [FILL] |
| License / IP status | [FILL] |
| FAIR² compliance | [None / Partial / Full — link to OPEN-SCI-002] |
| Summary (1-2 sentences) | [FILL] |
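
The overview fields can also be kept as a small typed record and rendered into the table above, which makes a missing field a construction error rather than a silent gap. The following is a minimal sketch under stated assumptions: the `DatasetOverview` class, the `to_markdown_table` helper, and all sample values are hypothetical and are not the Datasheet model referenced elsewhere in this document.

```python
from dataclasses import dataclass, field, fields


def _labelled(label):
    # Attach the table label to the field so rendering matches the template.
    return field(metadata={"label": label})


@dataclass
class DatasetOverview:
    # Hypothetical record mirroring the Section 1 table; not a project API.
    name: str = _labelled("Dataset name")
    version: str = _labelled("Version")
    dates: str = _labelled("Date created / updated")
    steward: str = _labelled("Creator / steward")
    license: str = _labelled("License / IP status")
    fair2: str = _labelled("FAIR² compliance")
    summary: str = _labelled("Summary (1-2 sentences)")


def to_markdown_table(overview):
    """Render the overview as the two-column table used in Section 1."""
    lines = ["| Field | Value |", "|---|---|"]
    for f in fields(overview):
        lines.append(f"| {f.metadata['label']} | {getattr(overview, f.name)} |")
    return "\n".join(lines)


# Made-up sample values, for illustration only.
overview = DatasetOverview(
    name="reviews-en",
    version="1.2.0",
    dates="2025-01-10 / 2025-03-02",
    steward="Data team",
    license="CC BY 4.0",
    fair2="Partial (see OPEN-SCI-002)",
    summary="Short product reviews with sentiment labels.",
)
table = to_markdown_table(overview)
print(table)
```

Because every field is required, omitting one raises a `TypeError` when the record is built.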
Section 2: Motivation
Instructions: Explain why this dataset exists: who funded it, what tasks it was built for, and, just as important, what it is explicitly not recommended for.
- Purpose of creation: [FILL]
- Funding / sponsorship: [FILL]
- Intended tasks: [FILL]
- Not recommended for: [FILL]
Section 3: Composition
Instructions: Describe the structure: instance count, instance type, feature schema, label distribution, missing-data handling, and PII presence. Every sensitive column must be flagged.
- Instance count / type / format: [FILL]
- Feature schema (with sensitivity flags): [FILL]
- Label distribution (if labelled): [FILL]
- Missing-data handling: [FILL]
- PII present? handling: [FILL]
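
Most of the composition fields can be computed rather than typed by hand. The sketch below assumes the dataset is available as a list of dict records; the `summarize_composition` helper, the `SENSITIVE_COLUMNS` set, and the sample rows are hypothetical and only illustrate the kind of figures this section asks for.

```python
from collections import Counter

# Hypothetical set of column names treated as sensitive; adjust per dataset.
SENSITIVE_COLUMNS = {"email", "date_of_birth"}


def summarize_composition(rows, label_key=None):
    """Summarize a list of dict records for the Composition section."""
    columns = sorted({key for row in rows for key in row})
    # A value counts as missing when the key is absent, None, or empty.
    missing = {
        col: sum(1 for row in rows if row.get(col) in (None, ""))
        for col in columns
    }
    labels = Counter(row.get(label_key) for row in rows) if label_key else Counter()
    return {
        "instance_count": len(rows),
        "columns": columns,
        "sensitive_columns": sorted(SENSITIVE_COLUMNS.intersection(columns)),
        "missing_per_column": missing,
        "label_distribution": dict(labels),
    }


# Made-up sample rows, for illustration only.
rows = [
    {"text": "good", "label": "pos", "email": "a@example.org"},
    {"text": "bad", "label": "neg", "email": None},
    {"text": "fine", "label": "pos"},
]
summary = summarize_composition(rows, label_key="label")
print(summary)
```

A name-based sensitivity check like this only catches columns that are already known to be sensitive; free-text fields can still carry PII and need a separate review.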
Section 4: Collection & Preprocessing
Instructions: Document the source, method, time period, sampling strategy, and ethical / consent posture. Preprocessing must be reproducible from a committed script.
- Source(s): [FILL]
- Collection method: [FILL]
- Time period + geography: [FILL]
- Sampling strategy: [FILL]
- Consent / ethical review: [FILL]
- Preprocessing steps + script: [FILL]
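
One way to make the "reproducible from a committed script" requirement checkable is to record a content hash of both the script and the file it produced. This is a minimal sketch, not a prescribed procedure: the `preprocessing_record` helper and the file names `preprocess.py` and `clean.csv` are hypothetical, and the demo writes throwaway files to a temporary directory.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preprocessing_record(script_path, output_path):
    """Pair the script with the output it produced (hypothetical helper)."""
    return {
        "script": Path(script_path).name,
        "script_sha256": sha256_of_file(script_path),
        "output_sha256": sha256_of_file(output_path),
    }


# Demo with throwaway files; the names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "preprocess.py"
    output = Path(tmp) / "clean.csv"
    script.write_bytes(b"print('clean')\n")
    output.write_bytes(b"id,text\n1,hello\n")
    record = preprocessing_record(script, output)

print(record)
```

Re-running the committed script and comparing the new output hash with the recorded one shows whether the preprocessing still reproduces the published data.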
Section 5: Distribution & Uses
Instructions: How is the dataset accessed (filesystem, MCP resource URI, A2A skill, registry)? Enumerate known consumers with purpose so future auditors can detect drift.
- Access method / endpoint: [FILL]
- Export formats: [FILL]
- Known uses: [FILL]
- Previous versions: [FILL]
Section 6: Maintenance & Ethics
Instructions: Name the data steward, update frequency, versioning policy, retention period, deprecation plan, and issue-reporting channel. Ethical considerations must cover potential-for-harm and bias sources.
- Data steward + update frequency: [FILL]
- Versioning + retention + deprecation: [FILL]
- Issue reporting: [FILL]
- Potential for harm / bias sources / mitigations: [FILL]
Adoption Checklist
- All required sections completed
- Artifact peer-reviewed by at least one R.I.S.C.E.A.R. peer
- Stored in the project's designated docs location
- Linked from README or equivalent index
- Versioned + date-stamped alongside the dataset it describes
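
The first checklist item can be partly automated by scanning a card for leftover `[FILL]` placeholders. The sketch below is one possible check, assuming the card is plain text that uses the `[FILL]` marker from this template; the `unfilled_fields` helper and the sample card content are hypothetical.

```python
import re

# The placeholder marker used throughout this template.
PLACEHOLDER = re.compile(r"\[FILL\]")


def unfilled_fields(card_text):
    """Return (line_number, line) pairs that still contain [FILL]."""
    return [
        (number, line.strip())
        for number, line in enumerate(card_text.splitlines(), start=1)
        if PLACEHOLDER.search(line)
    ]


# Made-up card excerpt, for illustration only.
card = """| Dataset name | reviews-en |
| Version | [FILL] |
- Purpose of creation: sentiment benchmarking
- Funding / sponsorship: [FILL]
"""
remaining = unfilled_fields(card)
for number, line in remaining:
    print(f"line {number}: {line}")
```

An empty result only shows that every placeholder was replaced; whether the answers are adequate is still a matter for peer review.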
References
- PHOENIX v4.0.0, docs/resources/templates/open-science/dataset-card.md
- Gebru, T. et al. (2021). Datasheets for Datasets. CACM 64(12)
- Pushkarna, M. et al. (2022). Data Cards. ACM FAccT
- FAIR² Open Specification (October 2025)
- Hutchinson, B. et al. (2023). Open Datasheets. arXiv
FCC integration
This template is referenced from the Forensic Auditor persona
(src/fcc/data/personas/forensic_auditor.yaml) as part of the
Critique-phase evidence set. Every dataset under audit must have a
current Dataset Card; the auditor cross-links the card to the DMP
(OPEN-SCI-010) and the FAIR self-assessment (OPEN-SCI-002). The Dataset
Card is also the canonical surface for src/fcc/evaluation/cards.py's
Datasheet model. See also
src/fcc/data/governance/open_science_gates.yaml.