Dataset Card

A Dataset Card (also known as a datasheet) records a dataset's origin, composition, intended use, collection procedure, preprocessing steps, distribution channel, and maintenance plan. Datasheets surface the assumptions and biases baked into training and evaluation corpora so that downstream consumers (model trainers, regulators, Forensic Auditors) can assess fitness for purpose before adopting the dataset. Produce this artifact during the Critique phase whenever a dataset is created, acquired, substantially transformed, or published to another project.

Template

Section 1: Dataset Overview

Instructions: Give the dataset a stable name and version, name the steward, record the creation and update dates, and declare license / IP status up front. The FAIR² compliance field should link to the FAIR self-assessment (OPEN-SCI-002).

| Field | Value |
| --- | --- |
| Dataset name | `[FILL]` |
| Version | `[FILL]` |
| Date created / updated | `[FILL]` |
| Creator / steward | `[FILL]` |
| License / IP status | `[FILL]` |
| FAIR² compliance | `[None / Partial / Full; link to OPEN-SCI-002]` |
| Summary (1-2 sentences) | `[FILL]` |
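
The overview fields map naturally onto a small typed record. The following is a minimal sketch, assuming a hypothetical `DatasetOverview` dataclass and `FairCompliance` enum; it is not the `Datasheet` model in src/fcc/evaluation/cards.py, whose actual fields are not shown in this template. All example values are invented.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class FairCompliance(Enum):
    """The three levels offered by the FAIR² compliance field."""
    NONE = "None"
    PARTIAL = "Partial"
    FULL = "Full"


@dataclass(frozen=True)
class DatasetOverview:
    """Hypothetical record mirroring the Section 1 table."""
    name: str
    version: str
    created: date
    updated: date
    steward: str
    license: str
    fair_compliance: FairCompliance
    summary: str

    def __post_init__(self) -> None:
        # Reject cards with an empty name/version or inconsistent dates.
        if not self.name.strip() or not self.version.strip():
            raise ValueError("name and version are required")
        if self.updated < self.created:
            raise ValueError("updated date precedes created date")


# Illustrative values only.
overview = DatasetOverview(
    name="example-corpus",
    version="1.2.0",
    created=date(2025, 1, 15),
    updated=date(2025, 3, 1),
    steward="Data Platform Team",
    license="CC-BY-4.0",
    fair_compliance=FairCompliance.PARTIAL,
    summary="Illustrative values only.",
)
print(overview.fair_compliance.value)
```

Freezing the dataclass is a design choice for this sketch: a card describes one dataset version, so a change to the dataset would produce a new record instead of mutating the old one.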

Section 2: Motivation

Instructions: State why this dataset exists: who funded it, which tasks it was built for, and, just as important, which uses it is explicitly not recommended for.

Purpose of creation: [FILL]
Funding / sponsorship: [FILL]
Intended tasks: [FILL]
Not recommended for: [FILL]

Section 3: Composition

Instructions: Describe the structure: instance count, instance type, feature schema, label distribution, missing-data handling, and PII presence. Every sensitive column must be flagged.

Instance count / type / format: [FILL]
Feature schema (with sensitivity flags): [FILL]
Label distribution (if labelled): [FILL]
Missing-data handling: [FILL]
PII presence + handling: [FILL]
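
Most of the composition fields can be computed instead of typed by hand. The sketch below assumes the dataset fits in memory as a list of dictionaries; the `composition_summary` helper, the column names, and the sample rows are all hypothetical. A value counts as missing when the key is absent or set to `None`.

```python
from collections import Counter


def composition_summary(rows, label_key, sensitive_columns):
    """Summarise instance count, label shares, missing values, and flagged columns."""
    # Count only rows that actually carry a label.
    label_counts = Counter(
        row[label_key] for row in rows if row.get(label_key) is not None
    )
    total_labelled = sum(label_counts.values())
    # The schema is the union of keys seen across all rows.
    columns = sorted({key for row in rows for key in row})
    missing = {
        col: sum(1 for row in rows if row.get(col) is None) for col in columns
    }
    return {
        "instance_count": len(rows),
        "label_distribution": {
            label: count / total_labelled for label, count in label_counts.items()
        },
        "missing_per_column": missing,
        "sensitive_columns_present": sorted(set(columns) & set(sensitive_columns)),
    }


# Invented sample rows; "email" stands in for a PII column.
rows = [
    {"text": "a", "label": "pos", "email": "x@example.org"},
    {"text": "b", "label": "neg", "email": None},
    {"text": None, "label": "pos", "email": None},
    {"text": "d", "label": "pos"},
]
summary = composition_summary(rows, "label", {"email"})
print(summary)
```

The result feeds the instance count, label distribution, missing-data, and PII fields; the handling policy for each flagged column still has to be written by the steward.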

Section 4: Collection & Preprocessing

Instructions: Document the source, method, time period, sampling strategy, and ethical / consent posture. Preprocessing must be reproducible from a committed script.

Source(s): [FILL]
Collection method: [FILL]
Time period + geography: [FILL]
Sampling strategy: [FILL]
Consent / ethical review: [FILL]
Preprocessing steps + script: [FILL]
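
One way to make the "reproducible from a committed script" requirement checkable is to record a content hash of the script next to the listed steps. This is a sketch under that assumption, not a prescribed mechanism: `script_fingerprint`, `preprocessing_entry`, and the temporary `preprocess.py` file are hypothetical names used for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def script_fingerprint(script_path):
    """Return the SHA-256 hex digest of the script's bytes."""
    return hashlib.sha256(Path(script_path).read_bytes()).hexdigest()


def preprocessing_entry(script_path, steps):
    """Build the value for the 'Preprocessing steps + script' field."""
    return {
        "script": str(script_path),
        "sha256": script_fingerprint(script_path),
        "steps": list(steps),
    }


# Demonstration with a throwaway script; a real card would point at the
# committed file in the repository instead.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "preprocess.py"
    script.write_bytes(b"print('deduplicate, then lowercase')\n")
    entry = preprocessing_entry(script, ["deduplicate", "lowercase text"])

print(entry["sha256"])
```

An auditor can recompute the digest from the committed script and compare it with the value on the card; a mismatch means the card and the script have diverged.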

Section 5: Distribution & Uses

Instructions: How is the dataset accessed (filesystem, MCP resource URI, A2A skill, registry)? List each known consumer with its purpose so future auditors can detect drift between documented and actual use.

Access method / endpoint: [FILL]
Export formats: [FILL]
Known uses: [FILL]
Previous versions: [FILL]
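
The drift check mentioned in the instructions can be as simple as a set comparison between the consumers listed under "Known uses" and the consumers actually observed. The sketch below assumes observed consumer names are available from somewhere such as access logs; `consumer_drift` and every consumer name are hypothetical.

```python
def consumer_drift(registered, observed):
    """Compare consumers listed on the card with consumers actually seen.

    registered: dict mapping consumer name -> stated purpose ("Known uses").
    observed: iterable of consumer names, e.g. extracted from access logs.
    """
    registered_names = set(registered)
    observed_names = set(observed)
    return {
        # Using the dataset without being documented on the card.
        "unregistered": sorted(observed_names - registered_names),
        # Documented on the card but not seen in the observation window.
        "inactive": sorted(registered_names - observed_names),
    }


# Invented example data.
known_uses = {
    "fraud-model-training": "supervised training",
    "quarterly-audit": "fairness evaluation",
}
seen = ["fraud-model-training", "marketing-segmentation"]
report = consumer_drift(known_uses, seen)
print(report)
```

Unregistered consumers are the more important signal, since they may be using the dataset for a task listed under "Not recommended for".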

Section 6: Maintenance & Ethics

Instructions: Name the data steward, update frequency, versioning policy, retention period, deprecation plan, and issue-reporting channel. Ethical considerations must cover potential-for-harm and bias sources.

Data steward + update frequency: [FILL]
Versioning + retention + deprecation: [FILL]
Issue reporting: [FILL]
Potential for harm / bias sources / mitigations: [FILL]

Adoption Checklist

All required sections completed
Artifact peer-reviewed by at least one R.I.S.C.E.A.R. peer
Stored in the project's designated docs location
Linked from README or equivalent index
Versioned + date-stamped alongside the dataset it describes
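
The first checklist item can be partly automated by scanning a card for placeholders that were never filled in. This is a minimal sketch, assuming the card is available as plain text and that `[FILL]` is the only placeholder of interest; `unfilled_fields` is a hypothetical helper, and the sample card is invented.

```python
import re

# Matches the literal placeholder used throughout the template.
PLACEHOLDER = re.compile(r"\[FILL\]")


def unfilled_fields(card_text):
    """Return (line_number, line) pairs for lines that still contain [FILL]."""
    return [
        (number, line.strip())
        for number, line in enumerate(card_text.splitlines(), start=1)
        if PLACEHOLDER.search(line)
    ]


# Invented sample card with two fields left blank.
card = """Dataset name: example-corpus
Version: [FILL]
Steward: Data Platform Team
License / IP status: [FILL]
"""
remaining = unfilled_fields(card)
for number, line in remaining:
    print(f"line {number}: {line}")
```

An empty result only shows that every placeholder was replaced; whether the entries are accurate is still a matter for the peer review in the next checklist item.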

References

PHOENIX v4.0.0 — docs/resources/templates/open-science/dataset-card.md
Gebru, T. et al. (2021) — Datasheets for Datasets, CACM 64(12)
Pushkarna, M. et al. (2022) — Data Cards, ACM FAccT
FAIR² Open Specification (October 2025)
Roman, A. C. et al. (2023) — Open Datasheets, arXiv

FCC integration

This template is referenced from the Forensic Auditor persona (src/fcc/data/personas/forensic_auditor.yaml) as part of the Critique-phase evidence set. Every dataset under audit must have a current Dataset Card; the auditor cross-links the card to the DMP (OPEN-SCI-010) and the FAIR self-assessment (OPEN-SCI-002). The Dataset Card is also the canonical surface for the Datasheet model in src/fcc/evaluation/cards.py. See also src/fcc/data/governance/open_science_gates.yaml.