Dataset Card

A Dataset Card (a.k.a. datasheet) records a dataset's origin, composition, intended use, collection procedure, preprocessing steps, distribution channel, and maintenance plan. Datasheets surface the assumptions and biases baked into training and evaluation corpora so downstream consumers — model trainers, regulators, Forensic Auditors — can assess fitness-for-purpose before committing. Produce this artifact during the Critique phase whenever a dataset is created, acquired, substantially transformed, or published to another project.

Template

Section 1: Dataset Overview

Instructions: Give the dataset a stable name and version, name the steward, record the creation and update dates, and declare license / IP status up front. The FAIR² compliance field should link to the FAIR self-assessment (OPEN-SCI-002).

| Field | Value |
|---|---|
| Dataset name | [FILL] |
| Version | [FILL] |
| Date created / updated | [FILL] |
| Creator / steward | [FILL] |
| License / IP status | [FILL] |
| FAIR² compliance | [None / Partial / Full; link to OPEN-SCI-002] |
| Summary (1-2 sentences) | [FILL] |
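
The overview fields can also be captured in a typed record so that tooling rejects a card with a missing field or an invalid FAIR² level. The following is a minimal sketch only: `DatasetOverview`, `FairCompliance`, and the sample values are hypothetical names chosen for illustration, and it is not the Datasheet model in src/fcc/evaluation/cards.py.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class FairCompliance(Enum):
    """The three levels offered by the FAIR² compliance field."""
    NONE = "None"
    PARTIAL = "Partial"
    FULL = "Full"


@dataclass(frozen=True)
class DatasetOverview:
    """Hypothetical record mirroring the Section 1 fields."""
    name: str
    version: str
    created: date
    updated: date
    steward: str
    license: str
    fair_compliance: FairCompliance
    summary: str

    def identifier(self) -> str:
        """Combine the stable name and version into one cross-link key."""
        return f"{self.name}@{self.version}"


# Illustrative values only; none of these come from a real dataset.
overview = DatasetOverview(
    name="example-corpus",
    version="1.2.0",
    created=date(2025, 1, 15),
    updated=date(2025, 3, 1),
    steward="data-team@example.org",
    license="CC-BY-4.0",
    fair_compliance=FairCompliance.PARTIAL,
    summary="Short product reviews labelled for sentiment.",
)
print(overview.identifier())
```

Freezing the dataclass is a design choice in this sketch: a card describes one dataset version, so a new version gets a new record rather than an in-place edit.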

Section 2: Motivation

Instructions: State why this dataset exists, who funded it, which tasks it was built for, and, just as important, which uses it is explicitly not recommended for.

  • Purpose of creation: [FILL]
  • Funding / sponsorship: [FILL]
  • Intended tasks: [FILL]
  • Not recommended for: [FILL]

Section 3: Composition

Instructions: Describe the structure: instance count, instance type, feature schema, label distribution, missing-data handling, and PII presence. Every sensitive column must be flagged.

  • Instance count / type / format: [FILL]
  • Feature schema (with sensitivity flags): [FILL]
  • Label distribution (if labelled): [FILL]
  • Missing-data handling: [FILL]
  • PII present? If so, handling: [FILL]
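
Most of the composition fields above can be derived from the data rather than typed by hand. The sketch below assumes the dataset fits in memory as a list of dicts and that the schema is a plain mapping from column name to a sensitivity flag. `SCHEMA`, `ROWS`, and `summarize_composition` are hypothetical names, and the rows are made-up examples.

```python
from collections import Counter

# Hypothetical schema: column name -> True if the column is sensitive / PII.
SCHEMA = {"age": True, "zip_code": True, "text": False, "label": False}

# Made-up rows for illustration; None marks a missing value.
ROWS = [
    {"age": 34, "zip_code": "02139", "text": "good", "label": "pos"},
    {"age": None, "zip_code": "94110", "text": "bad", "label": "neg"},
    {"age": 51, "zip_code": None, "text": "fine", "label": "pos"},
    {"age": 29, "zip_code": "10001", "text": None, "label": "pos"},
]


def summarize_composition(rows, schema, label_column="label"):
    """Derive instance count, label counts, missing counts, and flagged columns."""
    label_counts = Counter(row[label_column] for row in rows)
    return {
        "instance_count": len(rows),
        "label_distribution": dict(sorted(label_counts.items())),
        "missing_per_column": {
            column: sum(1 for row in rows if row.get(column) is None)
            for column in schema
        },
        "sensitive_columns": sorted(
            column for column, flagged in schema.items() if flagged
        ),
    }


composition = summarize_composition(ROWS, SCHEMA)
print(composition)
```

Generating these values from the data keeps the card in step with the dataset it describes; the PII handling description itself still has to be written by the steward.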

Section 4: Collection & Preprocessing

Instructions: Document the source, method, time period, sampling strategy, and ethical / consent posture. Preprocessing must be reproducible from a committed script.

  • Source(s): [FILL]
  • Collection method: [FILL]
  • Time period + geography: [FILL]
  • Sampling strategy: [FILL]
  • Consent / ethical review: [FILL]
  • Preprocessing steps + script: [FILL]
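
One way to make the "committed script" requirement checkable is to record a content hash of the preprocessing script next to its path in the card. This is a sketch under that assumption; the template itself does not prescribe a hashing scheme, and `script_fingerprint` and the demo file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def script_fingerprint(script_path: Path) -> str:
    """Return the SHA-256 hex digest of a script, read in chunks."""
    digest = hashlib.sha256()
    with script_path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo with a throwaway file standing in for the real preprocessing script.
with tempfile.TemporaryDirectory() as tmp:
    demo_script = Path(tmp) / "preprocess.py"
    demo_script.write_bytes(b"print('cleaning')\n")
    fingerprint = script_fingerprint(demo_script)

print(fingerprint)
```

An auditor can recompute the digest from the repository and compare it with the value in the card. A mismatch means the script changed after the card was written.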

Section 5: Distribution & Uses

Instructions: How is the dataset accessed (filesystem, MCP resource URI, A2A skill, registry)? Enumerate known consumers with purpose so future auditors can detect drift.

  • Access method / endpoint: [FILL]
  • Export formats: [FILL]
  • Known uses: [FILL]
  • Previous versions: [FILL]

Section 6: Maintenance & Ethics

Instructions: Name the data steward, update frequency, versioning policy, retention period, deprecation plan, and issue-reporting channel. Ethical considerations must cover potential-for-harm and bias sources.

  • Data steward + update frequency: [FILL]
  • Versioning + retention + deprecation: [FILL]
  • Issue reporting: [FILL]
  • Potential for harm / bias sources / mitigations: [FILL]

Adoption Checklist

  • All required sections completed
  • Artifact peer-reviewed by at least one R.I.S.C.E.A.R. peer
  • Stored in the project's designated docs location
  • Linked from README or equivalent index
  • Versioned + date-stamped alongside the dataset it describes
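
The first checklist item can be partly automated by scanning a card for leftover `[FILL]` placeholders. The sketch below assumes the card is available as plain text; `unfilled_fields` and `SAMPLE_CARD` are hypothetical names, and the check covers only placeholders, not the quality of what was filled in.

```python
def unfilled_fields(card_text: str) -> list[str]:
    """Return the lines of a card that still contain the [FILL] placeholder."""
    return [
        line.strip()
        for line in card_text.splitlines()
        if "[FILL]" in line
    ]


# Made-up card fragment for illustration.
SAMPLE_CARD = """\
Dataset name: example-corpus
Version: [FILL]
Purpose of creation: benchmark sentiment models
Issue reporting: [FILL]
"""

missing = unfilled_fields(SAMPLE_CARD)
for line in missing:
    print(f"unfilled: {line}")
```

A check like this could run in continuous integration and fail while the returned list is non-empty. Peer review remains a manual step.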

References

  • PHOENIX v4.0.0 — docs/resources/templates/open-science/dataset-card.md
  • Gebru, T. et al. (2021) — Datasheets for Datasets, CACM 64(12)
  • Pushkarna, M. et al. (2022) — Data Cards, ACM FAccT
  • FAIR² Open Specification (October 2025)
  • Hutchinson, B. et al. (2023) — Open Datasheets, arXiv

FCC integration

This template is referenced from the Forensic Auditor persona (src/fcc/data/personas/forensic_auditor.yaml) as part of the Critique-phase evidence set. Every dataset under audit must have a current Dataset Card; the auditor cross-links the card to the DMP (OPEN-SCI-010) and the FAIR self-assessment (OPEN-SCI-002). The Dataset Card is also the canonical surface for the Datasheet model in src/fcc/evaluation/cards.py. See also src/fcc/data/governance/open_science_gates.yaml.