Data Scientist Learning Path

A structured path for data scientists who want to use the FCC framework for ML lifecycle management, model evaluation, and data pipeline orchestration. This path emphasizes the ML-focused personas, semantic search, knowledge graphs, and the RAG pipeline.

Estimated time: 12--16 hours

Prerequisites: Python 3.10+, familiarity with ML workflows (training, evaluation, deployment), comfort with Jupyter notebooks and pandas/numpy.

Quick-Start Checklist

#	Activity	Resource	Time
1	Install FCC and run your first simulation	Quickstart	15 min
2	Explore ML Lifecycle personas	Notebook 01: FCC Fundamentals	20 min
3	Understand persona dimensions and profiling	Notebook 03: Persona Exploration	25 min
4	Study semantic search and embedding indexes	Notebook 15: Semantic Search	45 min
5	Build and query knowledge graphs	Notebook 16: Knowledge Graphs	45 min
6	Implement a RAG pipeline with persona-aware retrieval	Notebook 17: RAG Pipeline	45 min
7	Complete Lab 11: ML Lifecycle Integration	Guidebook Ch. 12: Labs	60 min
8	Complete Lab 12: Model Evaluation Pipeline	Guidebook Ch. 12: Labs	60 min
9	Operate the Collaboration Dashboard	Streamlit: collaboration_dashboard.py	30 min
10	Run a simulation replay to analyze agent decisions	Streamlit: simulation_replay.py	30 min

Key Personas for Data Scientists

The FCC framework defines 20 ML-focused personas across two categories. These are the ones most relevant to your daily work.

ML Lifecycle (9 personas)

ID	Name	Phase	Why It Matters
DSS	Data Sourcing Specialist	Find	Discovers and evaluates data sources with provenance tracking
ENA	Exploratory Notebook Analyst	Find	Structures EDA workflows with reproducible notebooks
FAR	Feature Architect	Create	Designs feature stores and transformation pipelines
MAR	Model Architect	Create	Selects architectures, defines training strategies
ESC	Experiment Scientist	Create	Manages experiment tracking, hyperparameter sweeps
IOR	Inference Optimizer	Build	Optimizes models for production inference (quantization, distillation)
IRE	Interpretability Researcher	Critique	Explains model decisions with SHAP, LIME, attention maps
IAN	Impact Analyst	Critique	Measures real-world model impact and drift
MOS	Model Operations Specialist	Ops	Manages model registries, A/B tests, rollback policies

ML Models (selected)

ID	Name	Phase	Why It Matters
NNS	Neural Network Specialist	Build	Deep learning architectures, training, compression
GBT	Gradient Boosting Trainer	Build	XGBoost/LightGBM training and tuning
RFS	Random Forest Specialist	Build	Ensemble methods, feature importance analysis
POR	Pipeline Orchestrator	Build	End-to-end pipeline DAGs (from `data_engineering`)

Skill Progression

Stage 1: Foundation (2--3 hours)

Goal: Understand how FCC models the ML lifecycle as a persona-driven workflow.

Complete the Quickstart and Notebook 01
Browse ML Lifecycle and ML Models personas in the learn_personas.py Streamlit app
Read Guidebook Chapter 3 (R.I.S.C.E.A.R. Specification) to understand how each persona is formally defined

Milestone: You can explain how DSS, FAR, MAR, ESC, and MOS map to your existing ML workflow stages.

Stage 2: Search and Knowledge (3--4 hours)

Goal: Use semantic search to find relevant personas and actions, and build knowledge graphs of your ML artifacts.

Complete Notebook 15 (Semantic Search): build a PersonaSearchIndex, query by description
Complete Notebook 16 (Knowledge Graphs): construct a KG with persona nodes, action edges, and artifact nodes
Export a knowledge graph to RDF or JSON-LD for integration with existing tools
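
To make the export step concrete, here is a minimal sketch of a graph with persona nodes, action edges, and artifact nodes serialized as JSON-LD using only the standard library. The node identifiers, the action names (`designs`, `consumes`), the `to_jsonld` helper, and the `https://example.org/fcc/` vocabulary are all illustrative assumptions; they do not reflect the FCC knowledge graph API or schema, which Notebook 16 covers.

```python
import json

# Hypothetical mini knowledge graph. Node ids, types, and action names are
# illustrative assumptions, not the FCC schema.
nodes = [
    {"id": "persona/FAR", "type": "Persona", "name": "Feature Architect"},
    {"id": "persona/MAR", "type": "Persona", "name": "Model Architect"},
    {"id": "artifact/feature_store", "type": "Artifact", "name": "Feature store"},
]
# Each edge is (source id, action, target id).
edges = [
    ("persona/FAR", "designs", "artifact/feature_store"),
    ("persona/MAR", "consumes", "artifact/feature_store"),
]


def to_jsonld(nodes, edges, base="https://example.org/fcc/"):
    """Serialize nodes and action edges as a JSON-LD document (a dict)."""
    context = {"@vocab": base + "vocab#", "name": "http://schema.org/name"}
    # Declare every action as a property whose values are IRIs, not strings.
    for action in sorted({action for _, action, _ in edges}):
        context[action] = {"@id": base + "vocab#" + action, "@type": "@id"}

    graph = {
        n["id"]: {"@id": base + n["id"], "@type": n["type"], "name": n["name"]}
        for n in nodes
    }
    # Attach each edge to its source node as a list of target IRIs.
    for source, action, target in edges:
        graph[source].setdefault(action, []).append(base + target)

    return {"@context": context, "@graph": list(graph.values())}


doc = to_jsonld(nodes, edges)
print(json.dumps(doc, indent=2))
```

Because the output is plain JSON, it can be written to a file and loaded by any JSON-LD-aware tool; an RDF export would need a dedicated library and is not shown here.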

Milestone: You have built a searchable index of all 102 personas and can find the right persona for a given task in under 5 seconds.
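
The idea behind querying personas by description can be sketched with a toy index. This is not the FCC `PersonaSearchIndex`: the `ToySearchIndex` class below is hypothetical, and it swaps learned embeddings for simple word-count vectors compared by cosine similarity so that the example runs with the standard library alone. The persona descriptions are taken from the ML Lifecycle table above.

```python
import math
import re
from collections import Counter

# A few descriptions copied from the ML Lifecycle table.
PERSONAS = {
    "DSS": "Discovers and evaluates data sources with provenance tracking",
    "FAR": "Designs feature stores and transformation pipelines",
    "ESC": "Manages experiment tracking, hyperparameter sweeps",
    "IRE": "Explains model decisions with SHAP, LIME, attention maps",
    "MOS": "Manages model registries, A/B tests, rollback policies",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToySearchIndex:
    """Hypothetical stand-in for an embedding index: word counts only."""

    def __init__(self, docs):
        self.vectors = {pid: Counter(tokenize(text)) for pid, text in docs.items()}

    def query(self, text, top_k=3):
        q = Counter(tokenize(text))
        scored = [(cosine(q, vec), pid) for pid, vec in self.vectors.items()]
        # Highest score first; ties broken alphabetically by persona id.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [(pid, score) for score, pid in scored[:top_k] if score > 0]


index = ToySearchIndex(PERSONAS)
results = index.query("hyperparameter sweeps and experiment tracking")
print(results[0][0])  # best match for this query is ESC
```

Word-count vectors only match on shared vocabulary, so a query such as "tune learning rates" would miss ESC entirely; that gap is what embedding-based search is meant to close.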

Stage 3: RAG Pipeline Integration (3--4 hours)

Goal: Build a retrieval-augmented generation pipeline that uses FCC personas as domain experts.

Complete Notebook 17 (RAG Pipeline): chunk documents, build a retrieval index, run persona-aware queries
Experiment with different chunking strategies (sentence, paragraph, sliding window, semantic)
Configure the SemanticRetriever to weight results by persona relevance
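
Three of the chunking strategies listed above can be sketched in a few lines each. The function names, the regex-based sentence splitter, and the default window sizes are assumptions made for illustration, not the chunkers shipped with FCC. Semantic chunking is left out because it depends on an embedding model.

```python
import re


def sentence_chunks(text):
    """Split after sentence-ending punctuation that is followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def paragraph_chunks(text):
    """Split on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def sliding_window_chunks(text, size=8, overlap=2):
    """Windows of `size` words, where consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = "Drift was detected. Retrain the model.\n\nDeploy after review."
print(sentence_chunks(sample))
print(paragraph_chunks(sample))
print(sliding_window_chunks(sample, size=4, overlap=1))
```

Sentence chunks keep retrieval precise but lose context, paragraph chunks do the opposite, and the sliding window trades index size for continuity across chunk boundaries.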

Milestone: You can ask a question and receive an answer grounded in both your documents and the relevant FCC persona's expertise.
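
One plausible way to weight results by persona relevance is to blend each hit's similarity score with a per-persona relevance score. The sketch below assumes that scheme; the `Hit` dataclass, the `rerank` function, the `alpha` parameter, and the sample scores are hypothetical and say nothing about how `SemanticRetriever` is actually configured.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    chunk: str
    similarity: float  # base retrieval score, assumed to lie in [0, 1]
    persona_id: str    # persona tagged on the chunk (hypothetical metadata)


def rerank(hits, persona_relevance, alpha=0.7):
    """Order hits by alpha * similarity + (1 - alpha) * persona relevance."""
    def score(hit):
        relevance = persona_relevance.get(hit.persona_id, 0.0)
        return alpha * hit.similarity + (1 - alpha) * relevance

    return sorted(hits, key=score, reverse=True)


# Made-up scores for illustration only.
hits = [
    Hit("Quantization reduces inference latency.", 0.80, "IOR"),
    Hit("SHAP values attribute predictions to features.", 0.75, "IRE"),
]
# For an interpretability question, IRE is treated as the most relevant persona.
relevance = {"IRE": 1.0, "IOR": 0.2}

ranked = rerank(hits, relevance)
print([hit.persona_id for hit in ranked])
```

With `alpha=1.0` the ranking falls back to pure similarity, which is a convenient baseline when judging whether persona weighting helps on a given corpus.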

Stage 4: Hands-On Labs (4--5 hours)

Goal: Apply everything in structured lab exercises.

Lab 11: Wire up a complete ML lifecycle workflow using DSS -> FAR -> MAR -> ESC -> MOS
Lab 12: Build an evaluation pipeline that uses IRE and IAN to critique model outputs
Run simulations with the `fcc simulate` CLI and analyze traces

Milestone: You have a working end-to-end ML workflow orchestrated by FCC personas with quality gates at each phase transition.
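
The shape of a gated workflow along the DSS -> FAR -> MAR -> ESC -> MOS chain can be sketched without any framework code. Everything below is a hypothetical illustration: the `run_workflow` function, the placeholder step functions, and the gate thresholds are assumptions, and the metric values are invented. In Lab 11 the steps would be backed by FCC personas rather than lambdas.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
PIPELINE = ["DSS", "FAR", "MAR", "ESC", "MOS"]


def run_workflow(
    steps: Dict[str, Callable[[State], State]],
    gates: Dict[str, Callable[[State], bool]],
    state: State,
) -> Tuple[State, List[Tuple[str, bool]]]:
    """Run each step in PIPELINE order and stop at the first failing gate."""
    trace = []
    for persona_id in PIPELINE:
        state = steps[persona_id](state)
        passed = bool(gates[persona_id](state))
        trace.append((persona_id, passed))
        if not passed:
            break  # later personas never see state that failed a gate
    return state, trace


# Placeholder steps: each one adds a made-up result to the shared state.
steps = {
    "DSS": lambda s: {**s, "rows": 1000},
    "FAR": lambda s: {**s, "features": 12},
    "MAR": lambda s: {**s, "model": "gbt"},
    "ESC": lambda s: {**s, "val_auc": 0.91},
    "MOS": lambda s: {**s, "registered": True},
}
# Placeholder quality gates checked at each phase transition.
gates = {
    "DSS": lambda s: s["rows"] >= 500,
    "FAR": lambda s: s["features"] > 0,
    "MAR": lambda s: s["model"] is not None,
    "ESC": lambda s: s["val_auc"] >= 0.85,
    "MOS": lambda s: s["registered"],
}

final_state, trace = run_workflow(steps, gates, {})
print(trace)
```

Returning the trace alongside the final state makes it easy to see which gate stopped a run, which is the same question a simulation replay answers for full FCC sessions.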

Recommended Streamlit Apps

Application	Purpose	When to Use
`collaboration_dashboard.py`	Track multi-persona collaboration sessions	During team simulations
`simulation_replay.py`	Replay and analyze simulation traces	After running experiments
`persona_explorer.py`	Browse full persona catalog with search	When selecting personas for a workflow
`ecosystem_dashboard.py`	View cross-project health and dependencies	When integrating with other projects

Connections to Other Paths

DevOps Engineer path: After completing this path, learn how to deploy your ML pipelines.
Researcher path: Dive deeper into knowledge graphs and federation.
Product Manager path: Understand governance and stakeholder communication.