Data Scientist Learning Path

A structured path for data scientists who want to use the FCC framework for ML lifecycle management, model evaluation, and data pipeline orchestration. This path emphasizes the ML-focused personas, semantic search, knowledge graphs, and the RAG pipeline.

Estimated time: 12-16 hours

Prerequisites: Python 3.10+, familiarity with ML workflows (training, evaluation, deployment), comfort with Jupyter notebooks and pandas/numpy.


Quick-Start Checklist

# | Activity | Resource | Time
1 | Install FCC and run your first simulation | Quickstart | 15 min
2 | Explore ML Lifecycle personas | Notebook 01: FCC Fundamentals | 20 min
3 | Understand persona dimensions and profiling | Notebook 03: Persona Exploration | 25 min
4 | Study semantic search and embedding indexes | Notebook 15: Semantic Search | 45 min
5 | Build and query knowledge graphs | Notebook 16: Knowledge Graphs | 45 min
6 | Implement a RAG pipeline with persona-aware retrieval | Notebook 17: RAG Pipeline | 45 min
7 | Complete Lab 11: ML Lifecycle Integration | Guidebook Ch. 12: Labs | 60 min
8 | Complete Lab 12: Model Evaluation Pipeline | Guidebook Ch. 12: Labs | 60 min
9 | Operate the Collaboration Dashboard | Streamlit: collaboration_dashboard.py | 30 min
10 | Run a simulation replay to analyze agent decisions | Streamlit: simulation_replay.py | 30 min

Key Personas for Data Scientists

The FCC framework defines 20 ML-focused personas across two categories. These are the ones most relevant to your daily work.

ML Lifecycle (9 personas)

ID | Name | Phase | Why It Matters
DSS | Data Sourcing Specialist | Find | Discovers and evaluates data sources with provenance tracking
ENA | Exploratory Notebook Analyst | Find | Structures EDA workflows with reproducible notebooks
FAR | Feature Architect | Create | Designs feature stores and transformation pipelines
MAR | Model Architect | Create | Selects architectures and defines training strategies
ESC | Experiment Scientist | Create | Manages experiment tracking and hyperparameter sweeps
IOR | Inference Optimizer | Build | Optimizes models for production inference (quantization, distillation)
IRE | Interpretability Researcher | Critique | Explains model decisions with SHAP, LIME, and attention maps
IAN | Impact Analyst | Critique | Measures real-world model impact and drift
MOS | Model Operations Specialist | Ops | Manages model registries, A/B tests, and rollback policies

ML Models (selected)

ID | Name | Phase | Why It Matters
NNS | Neural Network Specialist | Build | Deep learning architectures, training, compression
GBT | Gradient Boosting Trainer | Build | XGBoost/LightGBM training and tuning
RFS | Random Forest Specialist | Build | Ensemble methods, feature importance analysis
POR | Pipeline Orchestrator | Build | End-to-end pipeline DAGs (from data_engineering)

Skill Progression

Stage 1: Foundation (2-3 hours)

Goal: Understand how FCC models the ML lifecycle as a persona-driven workflow.

  • Complete the Quickstart and Notebook 01
  • Browse ML Lifecycle and ML Models personas in the learn_personas.py Streamlit app
  • Read Guidebook Chapter 3 (R.I.S.C.E.A.R. Specification) to understand how each persona is formally defined

Milestone: You can explain how DSS, FAR, MAR, ESC, and MOS map to your existing ML workflow stages.
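
One way to practice this mapping is to write the ML Lifecycle table down as data and group it by phase. The sketch below is illustrative only: the `Persona` record and `group_by_phase` helper are hypothetical names, not part of the FCC library, and the rows are copied from the ML Lifecycle table above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    """Illustrative record; not the FCC library's own persona class."""
    persona_id: str
    name: str
    phase: str


# IDs, names, and phases copied from the ML Lifecycle table.
ML_LIFECYCLE = [
    Persona("DSS", "Data Sourcing Specialist", "Find"),
    Persona("ENA", "Exploratory Notebook Analyst", "Find"),
    Persona("FAR", "Feature Architect", "Create"),
    Persona("MAR", "Model Architect", "Create"),
    Persona("ESC", "Experiment Scientist", "Create"),
    Persona("IOR", "Inference Optimizer", "Build"),
    Persona("IRE", "Interpretability Researcher", "Critique"),
    Persona("IAN", "Impact Analyst", "Critique"),
    Persona("MOS", "Model Operations Specialist", "Ops"),
]


def group_by_phase(personas):
    """Return {phase: [persona_id, ...]}, preserving table order."""
    grouped = defaultdict(list)
    for persona in personas:
        grouped[persona.phase].append(persona.persona_id)
    return dict(grouped)


phases = group_by_phase(ML_LIFECYCLE)
for phase, ids in phases.items():
    print(f"{phase}: {', '.join(ids)}")
```

Replacing the phase names with the stage names your team already uses (for example "ingest", "train", "serve") is a quick check that every stage has an owning persona.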

Stage 2: Search and Knowledge (3-4 hours)

Goal: Use semantic search to find relevant personas and actions, and build knowledge graphs of your ML artifacts.

  • Complete Notebook 15 (Semantic Search): build a PersonaSearchIndex, query by description
  • Complete Notebook 16 (Knowledge Graphs): construct a KG with persona nodes, action edges, and artifact nodes
  • Export a knowledge graph to RDF or JSON-LD for integration with existing tools
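
The graph structure described above (persona nodes, action edges, artifact nodes) and a JSON-LD export can be sketched with the standard library alone. This is a minimal illustration, not the FCC knowledge-graph API: the `https://example.org/fcc#` namespace, the `produces`/`consumes` action names, and both helper functions are assumptions made for the example.

```python
import json

# Hypothetical namespace for illustration; FCC's real vocabulary may differ.
FCC_NS = "https://example.org/fcc#"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def build_graph():
    """Return (nodes, edges) for a tiny persona/artifact graph."""
    nodes = {
        "FAR": {"type": "Persona", "label": "Feature Architect"},
        "MAR": {"type": "Persona", "label": "Model Architect"},
        "feature_store": {"type": "Artifact", "label": "Feature store schema"},
    }
    # Each edge is (source, action, target); the action names are made up.
    edges = [
        ("FAR", "produces", "feature_store"),
        ("MAR", "consumes", "feature_store"),
    ]
    return nodes, edges


def to_jsonld(nodes, edges):
    """Serialize the graph as a JSON-LD document with an @graph array."""
    context = {"fcc": FCC_NS, "label": RDFS_LABEL}
    for _, action, _ in edges:
        # Declare each action as a property whose values are node IRIs.
        context[action] = {"@id": f"fcc:{action}", "@type": "@id"}

    graph = {
        node_id: {
            "@id": f"fcc:{node_id}",
            "@type": f"fcc:{attrs['type']}",
            "label": attrs["label"],
        }
        for node_id, attrs in nodes.items()
    }
    for source, action, target in edges:
        graph[source].setdefault(action, []).append(f"fcc:{target}")

    return {"@context": context, "@graph": list(graph.values())}


document = to_jsonld(*build_graph())
print(json.dumps(document, indent=2))
```

Because the output is ordinary JSON-LD, any JSON-LD-aware tool should be able to expand it into RDF triples; the notebook's own export helpers may produce a richer vocabulary.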

Milestone: You have built a searchable index of all 102 personas and can find the right persona for a given task in under 5 seconds.
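
To build intuition for what a persona search index does, the sketch below ranks personas by cosine similarity between word-count vectors. It is a deliberately simple stand-in: a real semantic index would use learned embeddings, and `TinyPersonaIndex` is a hypothetical class, not the `PersonaSearchIndex` from Notebook 15. The descriptions are taken from the ML Lifecycle table.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyPersonaIndex:
    """Bag-of-words index; a toy substitute for an embedding index."""

    def __init__(self):
        self._vectors = {}

    def add(self, persona_id, description):
        self._vectors[persona_id] = Counter(tokenize(description))

    def search(self, query, top_k=3):
        """Return up to top_k (persona_id, score) pairs, best first."""
        query_vec = Counter(tokenize(query))
        scored = [
            (self._cosine(query_vec, vec), pid)
            for pid, vec in self._vectors.items()
        ]
        # Sort by descending score, breaking ties by persona ID.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(pid, round(score, 3)) for score, pid in scored[:top_k] if score > 0]

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
        norm_a = math.sqrt(sum(count * count for count in a.values()))
        norm_b = math.sqrt(sum(count * count for count in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


index = TinyPersonaIndex()
index.add("ESC", "Manages experiment tracking, hyperparameter sweeps")
index.add("IRE", "Explains model decisions with SHAP, LIME, attention maps")
index.add("IAN", "Measures real-world model impact and drift")
index.add("MOS", "Manages model registries, A/B tests, rollback policies")

results = index.search("hyperparameter sweeps and experiment tracking")
print(results)
```

Word overlap fails on paraphrases (a query about "tuning" would miss "hyperparameter sweeps"), which is the gap that embedding-based search is meant to close.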

Stage 3: RAG Pipeline Integration (3-4 hours)

Goal: Build a retrieval-augmented generation pipeline that uses FCC personas as domain experts.

  • Complete Notebook 17 (RAG Pipeline): chunk documents, build a retrieval index, run persona-aware queries
  • Experiment with different chunking strategies (sentence, paragraph, sliding window, semantic)
  • Configure the SemanticRetriever to weight results by persona relevance
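
Three of the chunking strategies listed above can be sketched in a few lines each. These functions are illustrative and do not reproduce the chunkers used in Notebook 17; the function names and default sizes are assumptions. Semantic chunking is omitted because it depends on an embedding model.

```python
import re


def sentence_chunks(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def paragraph_chunks(text):
    """Split on blank lines."""
    parts = re.split(r"\n\s*\n", text.strip())
    return [part.strip() for part in parts if part.strip()]


def sliding_window_chunks(text, size=5, overlap=2):
    """Return word windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


sample = "Drift was detected. Retrain the model!\n\nThen compare both versions."
print(sentence_chunks(sample))
print(paragraph_chunks(sample))
print(sliding_window_chunks(sample, size=4, overlap=2))
```

Smaller chunks tend to retrieve more precisely but carry less context, so it is worth comparing retrieval quality across strategies on your own documents rather than assuming one default.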

Milestone: You can ask a question and receive an answer grounded in both your documents and the relevant FCC persona's expertise.
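
Weighting results by persona relevance can be pictured as a re-ranking step that blends two scores. The sketch below assumes a simple linear blend; the `Hit` record, the `rerank` function, and the `persona_weight` parameter are hypothetical and may not match how `SemanticRetriever` is actually configured.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One retrieved chunk (illustrative structure)."""
    chunk_id: str
    similarity: float  # query-to-chunk similarity, assumed to lie in [0, 1]
    persona_id: str    # persona the chunk is associated with


def rerank(hits, persona_relevance, persona_weight=0.3):
    """Order hits by a linear blend of similarity and persona relevance.

    persona_relevance maps persona IDs to a relevance score in [0, 1];
    personas missing from the mapping count as 0.
    """
    if not 0.0 <= persona_weight <= 1.0:
        raise ValueError("persona_weight must be between 0 and 1")

    def blended(hit):
        relevance = persona_relevance.get(hit.persona_id, 0.0)
        return (1 - persona_weight) * hit.similarity + persona_weight * relevance

    return sorted(hits, key=blended, reverse=True)


hits = [Hit("c1", 0.80, "MOS"), Hit("c2", 0.75, "IRE")]
relevance = {"IRE": 0.9, "MOS": 0.1}  # e.g. the question is about explainability

ranked = rerank(hits, relevance, persona_weight=0.3)
print([hit.chunk_id for hit in ranked])
```

Setting `persona_weight` to 0 recovers plain similarity ranking, which makes it easy to measure how much the persona signal changes the results.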

Stage 4: Hands-On Labs (4-5 hours)

Goal: Apply everything in structured lab exercises.

  • Lab 11: Wire up a complete ML lifecycle workflow using DSS -> FAR -> MAR -> ESC -> MOS
  • Lab 12: Build an evaluation pipeline that uses IRE and IAN to critique model outputs
  • Run simulations with the fcc simulate CLI and analyze traces

Milestone: You have a working end-to-end ML workflow orchestrated by FCC personas with quality gates at each phase transition.
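
The idea of a persona chain with a quality gate at each transition can be sketched as a list of (persona, action, gate) steps that stops at the first failed gate. Everything here is a simplified assumption for illustration: `run_workflow`, `build_steps`, the state keys, and the 0.8 accuracy threshold are invented for the example and are not the Lab 11 solution or the FCC workflow API.

```python
from typing import Callable

# (persona_id, action, gate): the action updates the state, the gate checks it.
Step = tuple[str, Callable[[dict], dict], Callable[[dict], bool]]


def run_workflow(steps, state=None):
    """Run steps in order; stop at the first gate that fails.

    Returns the final state and a trace of (persona_id, gate_passed).
    """
    state = dict(state or {})
    trace = []
    for persona_id, action, gate in steps:
        state = action(state)
        passed = gate(state)
        trace.append((persona_id, passed))
        if not passed:
            break
    return state, trace


def build_steps(val_accuracy: float) -> list[Step]:
    """DSS -> FAR -> MAR -> ESC -> MOS with placeholder actions."""
    return [
        ("DSS", lambda s: {**s, "rows": 1000}, lambda s: s["rows"] > 0),
        ("FAR", lambda s: {**s, "features": ["age", "tenure"]},
         lambda s: len(s["features"]) > 0),
        ("MAR", lambda s: {**s, "model": "gradient_boosting"},
         lambda s: "model" in s),
        ("ESC", lambda s: {**s, "val_accuracy": val_accuracy},
         lambda s: s["val_accuracy"] >= 0.8),  # hypothetical threshold
        ("MOS", lambda s: {**s, "registered": True}, lambda s: s["registered"]),
    ]


final_state, trace = run_workflow(build_steps(val_accuracy=0.9))
print(trace)

blocked_state, blocked_trace = run_workflow(build_steps(val_accuracy=0.5))
print(blocked_trace)
```

In the second run the ESC gate fails, so MOS never registers the model. The trace shows which persona blocked the transition.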


Application | Purpose | When to Use
collaboration_dashboard.py | Track multi-persona collaboration sessions | During team simulations
simulation_replay.py | Replay and analyze simulation traces | After running experiments
persona_explorer.py | Browse the full persona catalog with search | When selecting personas for a workflow
ecosystem_dashboard.py | View cross-project health and dependencies | When integrating with other projects

Connections to Other Paths

  • DevOps Engineer path: After completing this path, learn how to deploy your ML pipelines.
  • Researcher path: Dive deeper into knowledge graphs and federation.
  • Product Manager path: Understand governance and stakeholder communication.

Further Reading