Data Scientist Learning Path
A structured path for data scientists who want to use the FCC framework for ML lifecycle management, model evaluation, and data pipeline orchestration. This path emphasizes the ML-focused personas, semantic search, knowledge graphs, and the RAG pipeline.
Estimated time: 12--16 hours
Prerequisites: Python 3.10+, familiarity with ML workflows (training, evaluation, deployment), comfort with Jupyter notebooks and pandas/numpy.
Quick-Start Checklist
| # | Activity | Resource | Time |
|---|---|---|---|
| 1 | Install FCC and run your first simulation | Quickstart | 15 min |
| 2 | Explore ML Lifecycle personas | Notebook 01: FCC Fundamentals | 20 min |
| 3 | Understand persona dimensions and profiling | Notebook 03: Persona Exploration | 25 min |
| 4 | Study semantic search and embedding indexes | Notebook 15: Semantic Search | 45 min |
| 5 | Build and query knowledge graphs | Notebook 16: Knowledge Graphs | 45 min |
| 6 | Implement a RAG pipeline with persona-aware retrieval | Notebook 17: RAG Pipeline | 45 min |
| 7 | Complete Lab 11: ML Lifecycle Integration | Guidebook Ch. 12: Labs | 60 min |
| 8 | Complete Lab 12: Model Evaluation Pipeline | Guidebook Ch. 12: Labs | 60 min |
| 9 | Operate the Collaboration Dashboard | Streamlit: collaboration_dashboard.py | 30 min |
| 10 | Run a simulation replay to analyze agent decisions | Streamlit: simulation_replay.py | 30 min |
Key Personas for Data Scientists
The FCC framework defines 20 ML-focused personas across two categories. These are the ones most relevant to your daily work.
ML Lifecycle (9 personas)
| ID | Name | Phase | Why It Matters |
|---|---|---|---|
| DSS | Data Sourcing Specialist | Find | Discovers and evaluates data sources with provenance tracking |
| ENA | Exploratory Notebook Analyst | Find | Structures EDA workflows with reproducible notebooks |
| FAR | Feature Architect | Create | Designs feature stores and transformation pipelines |
| MAR | Model Architect | Create | Selects architectures, defines training strategies |
| ESC | Experiment Scientist | Create | Manages experiment tracking, hyperparameter sweeps |
| IOR | Inference Optimizer | Build | Optimizes models for production inference (quantization, distillation) |
| IRE | Interpretability Researcher | Critique | Explains model decisions with SHAP, LIME, attention maps |
| IAN | Impact Analyst | Critique | Measures real-world model impact and drift |
| MOS | Model Operations Specialist | Ops | Manages model registries, A/B tests, rollback policies |
ML Models (selected)
| ID | Name | Phase | Why It Matters |
|---|---|---|---|
| NNS | Neural Network Specialist | Build | Deep learning architectures, training, compression |
| GBT | Gradient Boosting Trainer | Build | XGBoost/LightGBM training and tuning |
| RFS | Random Forest Specialist | Build | Ensemble methods, feature importance analysis |
| POR | Pipeline Orchestrator | Build | End-to-end pipeline DAGs (from data_engineering) |
Skill Progression
Stage 1: Foundation (2--3 hours)
Goal: Understand how FCC models the ML lifecycle as a persona-driven workflow.
- Complete the Quickstart and Notebook 01
- Browse ML Lifecycle and ML Models personas in the `learn_personas.py` Streamlit app
- Read Guidebook Chapter 3 (R.I.S.C.E.A.R. Specification) to understand how each persona is formally defined
Milestone: You can explain how DSS, FAR, MAR, ESC, and MOS map to your existing ML workflow stages.
Stage 2: Search and Knowledge (3--4 hours)
Goal: Use semantic search to find relevant personas and actions, and build knowledge graphs of your ML artifacts.
- Complete Notebook 15 (Semantic Search): build a `PersonaSearchIndex`, query by description
- Complete Notebook 16 (Knowledge Graphs): construct a KG with persona nodes, action edges, and artifact nodes
- Export a knowledge graph to RDF or JSON-LD for integration with existing tools
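To make the graph shape concrete, here is a minimal sketch of a persona graph serialized as JSON-LD with only the standard library. It does not use FCC's exporter: the `https://example.org/fcc#` vocabulary, the edge names `designs` and `tracks`, the artifact IDs, and the helper `neighbors` are all assumptions made for illustration, and the framework's real export schema may differ.

```python
import json

# Placeholder vocabulary (assumption); FCC's real export schema may differ.
CONTEXT = {
    "ex": "https://example.org/fcc#",
    "name": "ex:name",
    # Action edges: each property links a persona node to an artifact node.
    "designs": {"@id": "ex:designs", "@type": "@id"},
    "tracks": {"@id": "ex:tracks", "@type": "@id"},
}
EDGE_KEYS = ("designs", "tracks")

def build_graph():
    """Build a tiny graph of persona nodes, action edges, and artifact nodes."""
    nodes = [
        {"@id": "ex:persona/FAR", "@type": "ex:Persona",
         "name": "Feature Architect", "designs": "ex:artifact/feature_store"},
        {"@id": "ex:persona/ESC", "@type": "ex:Persona",
         "name": "Experiment Scientist", "tracks": "ex:artifact/experiment_log"},
        {"@id": "ex:artifact/feature_store", "@type": "ex:Artifact",
         "name": "Feature store"},
        {"@id": "ex:artifact/experiment_log", "@type": "ex:Artifact",
         "name": "Experiment log"},
    ]
    return {"@context": CONTEXT, "@graph": nodes}

def neighbors(doc, node_id):
    """Return the IDs that node_id points to through any action edge."""
    for node in doc["@graph"]:
        if node["@id"] == node_id:
            return [node[key] for key in EDGE_KEYS if key in node]
    return []

doc = build_graph()
jsonld_text = json.dumps(doc, indent=2)
print(jsonld_text)
print(neighbors(doc, "ex:persona/FAR"))
```

Modeling each action as a property keeps the document flat and easy to diff. A richer graph might promote actions to nodes of their own so they can carry metadata.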
Milestone: You have built a searchable index of all 102 personas and can find the right persona for a given task in under 5 seconds.
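The idea behind a persona search index can be sketched without FCC itself. The example below is a toy stand-in, not the framework's `PersonaSearchIndex`: it scores a handful of persona descriptions from the tables above against a query using bag-of-words cosine similarity, where a real index would presumably use learned embeddings. The function names `tokenize`, `cosine`, and `search` are hypothetical.

```python
import math
import re
from collections import Counter

# A few persona descriptions copied from the persona tables.
PERSONAS = {
    "DSS": "Discovers and evaluates data sources with provenance tracking",
    "FAR": "Designs feature stores and transformation pipelines",
    "ESC": "Manages experiment tracking, hyperparameter sweeps",
    "IRE": "Explains model decisions with SHAP, LIME, attention maps",
    "MOS": "Manages model registries, A/B tests, rollback policies",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, top_k=3):
    """Return the top_k (persona_id, score) pairs, best match first."""
    q = Counter(tokenize(query))
    scored = [(pid, cosine(q, Counter(tokenize(desc))))
              for pid, desc in PERSONAS.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

results = search("hyperparameter sweeps and experiment tracking")
print(results[0][0])  # ESC
```

Word-overlap scoring only matches exact tokens, so a query such as "tuning runs" would miss ESC. Closing that gap is the job of an embedding-based index.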
Stage 3: RAG Pipeline Integration (3--4 hours)
Goal: Build a retrieval-augmented generation pipeline that uses FCC personas as domain experts.
- Complete Notebook 17 (RAG Pipeline): chunk documents, build a retrieval index, run persona-aware queries
- Experiment with different chunking strategies (sentence, paragraph, sliding window, semantic)
- Configure the `SemanticRetriever` to weight results by persona relevance
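The chunking strategies listed above differ mainly in where they cut the text. The sketch below shows three of them in plain Python, independent of whatever chunkers Notebook 17 provides; the function names and default sizes are assumptions. Semantic chunking is left out because it needs an embedding model to decide boundaries.

```python
import re

def chunk_sentences(text):
    """Split after sentence-ending punctuation that is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def chunk_paragraphs(text):
    """Split on blank lines."""
    parts = re.split(r"\n\s*\n", text.strip())
    return [p.strip() for p in parts if p.strip()]

def chunk_sliding_window(text, size=5, overlap=2):
    """Emit windows of `size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

sample = "Feature stores hold features. Models read them.\n\nDrift is measured later."
print(chunk_sentences(sample))
print(chunk_paragraphs(sample))
print(chunk_sliding_window(sample, size=5, overlap=2))
```

Sentence and paragraph chunks follow the author's own boundaries, while the sliding window trades some duplication for the guarantee that no passage is split without surrounding context.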
Milestone: You can ask a question and receive an answer grounded in both your documents and the relevant FCC persona's expertise.
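Weighting results by persona relevance can be pictured as a re-ranking step. The sketch below is one possible scheme, not the behavior of FCC's `SemanticRetriever`: it blends each hit's similarity score with a per-persona weight using a linear mix. The function `rerank`, the parameter `alpha`, and all sample hits, scores, and weights are made up for illustration.

```python
def rerank(hits, persona_weights, alpha=0.7):
    """Blend similarity with a per-persona weight and sort, best first.

    hits: list of (chunk_text, persona_id, similarity) tuples.
    alpha: share of the final score taken from similarity (0..1).
    """
    rescored = []
    for text, persona_id, similarity in hits:
        weight = persona_weights.get(persona_id, 0.0)
        score = alpha * similarity + (1 - alpha) * weight
        rescored.append((text, persona_id, score))
    return sorted(rescored, key=lambda hit: hit[2], reverse=True)

# Illustrative hits; the texts and similarity values are invented.
hits = [
    ("Quantize weights to int8 before serving.", "IOR", 0.62),
    ("SHAP values show feature contributions.", "IRE", 0.58),
    ("Registry entries record rollback targets.", "MOS", 0.40),
]
# Hypothetical relevance of each persona to an interpretability question.
weights = {"IRE": 1.0, "IOR": 0.2, "MOS": 0.1}

ranked = rerank(hits, weights)
print([persona_id for _, persona_id, _ in ranked])  # ['IRE', 'IOR', 'MOS']
```

With `alpha=1.0` the ranking falls back to pure similarity, which makes it easy to compare retrieval with and without the persona signal.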
Stage 4: Hands-On Labs (4--5 hours)
Goal: Apply everything in structured lab exercises.
- Lab 11: Wire up a complete ML lifecycle workflow using DSS -> FAR -> MAR -> ESC -> MOS
- Lab 12: Build an evaluation pipeline that uses IRE and IAN to critique model outputs
- Run simulations with the `fcc simulate` CLI and analyze traces
Milestone: You have a working end-to-end ML workflow orchestrated by FCC personas with quality gates at each phase transition.
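A quality gate at each phase transition amounts to a check that must pass before the next persona runs. The sketch below illustrates that control flow for the Lab 11 persona order in plain Python. It does not call FCC: `RunState`, `run_stage`, `quality_gate`, and `run_workflow` are hypothetical names, and the stage work is a placeholder.

```python
from dataclasses import dataclass, field

# Persona order from Lab 11; everything else here is invented for illustration.
STAGES = ["DSS", "FAR", "MAR", "ESC", "MOS"]

@dataclass
class RunState:
    artifacts: dict = field(default_factory=dict)  # persona_id -> artifact
    trace: list = field(default_factory=list)      # (persona_id, "pass" | "fail")

def run_stage(persona_id, state):
    """Stand-in for a persona's work: record a placeholder artifact."""
    state.artifacts[persona_id] = f"output-of-{persona_id}"

def quality_gate(persona_id, state):
    """Pass only if the stage that just ran left an artifact behind."""
    return bool(state.artifacts.get(persona_id))

def run_workflow(stages, gate=quality_gate):
    """Run stages in order and stop at the first failed gate."""
    state = RunState()
    for persona_id in stages:
        run_stage(persona_id, state)
        passed = gate(persona_id, state)
        state.trace.append((persona_id, "pass" if passed else "fail"))
        if not passed:
            break
    return state

state = run_workflow(STAGES)
print(state.trace)

# A stricter gate that rejects the MAR stage halts the run before ESC and MOS.
blocked = run_workflow(STAGES, gate=lambda pid, st: pid != "MAR")
print(blocked.trace)
```

Passing the gate as a parameter keeps the orchestration loop unchanged when a lab swaps in a real check, such as a metric threshold from the evaluation pipeline.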
Recommended Streamlit Apps
| Application | Purpose | When to Use |
|---|---|---|
| `collaboration_dashboard.py` | Track multi-persona collaboration sessions | During team simulations |
| `simulation_replay.py` | Replay and analyze simulation traces | After running experiments |
| `persona_explorer.py` | Browse full persona catalog with search | When selecting personas for a workflow |
| `ecosystem_dashboard.py` | View cross-project health and dependencies | When integrating with other projects |
Connections to Other Paths
- DevOps Engineer path: After completing this path, learn how to deploy your ML pipelines
- Researcher path: Dive deeper into knowledge graphs and federation
- Product Manager path: Understand governance and stakeholder communication