# RAG Document Pipeline Data Flow
This page traces how a source document is chunked, embedded, indexed, retrieved, and composed into a persona-aware answer via the FCC RAG stack. It spans `rag/chunking.py`, `search/index.py`, `rag/retriever.py`, `rag/pipeline.py`, and the shared `EmbeddingProvider` protocol. The pipeline supports six chunking strategies, pluggable embedding backends, and persona-conditioned system prompts so retrieval stays compatible with R.I.S.C.E.A.R. role definitions. The final `RAGResult` carries its source chunks so downstream consumers can audit provenance.
The diagram below traces the document-to-answer pipeline.
```mermaid
flowchart LR
    subgraph Ingest
        I1[(Source document)]
        I2[DocumentChunker.chunk]
        I3{ChunkingStrategy}
        I4[DocumentChunk]
    end
    subgraph Index
        X1[EmbeddingProvider.embed]
        X2[SearchIndex.index]
        X3[(Vector store)]
    end
    subgraph Query
        Q1[Query string]
        Q2[EmbeddingProvider.embed]
        Q3[SemanticRetriever.retrieve]
        Q4[RetrievalResult]
    end
    subgraph Answer
        A1[RAGPipeline.ask]
        A2[PersonaSpec system prompt]
        A3[BaseAIClient.complete]
        A4[RAGResult]
    end
    I1 --> I2 --> I3
    I3 -- "FIXED_SIZE / PARAGRAPH / SEMANTIC / YAML_BLOCK / CODE_FUNCTION / PARENT_CHILD" --> I4
    I4 --> X1 --> X2 --> X3
    Q1 --> Q2 --> Q3
    X3 --> Q3
    Q3 --> Q4 --> A1
    A2 --> A1 --> A3 --> A4
```
At ingest time the pipeline selects one of six `ChunkingStrategy` values, producing `DocumentChunk` records whose `parent_chunk_id` preserves hierarchy for the `PARENT_CHILD` strategy. Each chunk is embedded and indexed. The `SearchIndex` is backend-agnostic: the default `MockEmbeddingProvider` returns 384-dimensional vectors so CI stays offline-friendly.
At query time, `RAGPipeline.ask(question, persona_id)` embeds the question, retrieves top-k `RetrievalResult(chunk, score, context)` tuples, and hands them with the persona system prompt to the configured `BaseAIClient`. The `RAGResult(question, answer, sources, persona_id, model)` bundles provenance so callers can audit which chunks fed which answer.
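A minimal sketch of the query half, assuming cosine-similarity ranking (a common choice; the actual scoring inside `SemanticRetriever.retrieve` is not specified here). The `retrieve` helper and the toy in-memory index are hypothetical; only the `RetrievalResult` and `RAGResult` field names come from the source.

```python
import math
from dataclasses import dataclass

@dataclass
class RetrievalResult:
    chunk_id: str
    score: float
    text: str

@dataclass
class RAGResult:
    """Provenance bundle: every answer carries the chunks that fed it."""
    question: str
    answer: str
    sources: list[RetrievalResult]
    persona_id: str
    model: str

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: list[float], index: dict, k: int = 3) -> list[RetrievalResult]:
    """Rank indexed chunks by cosine similarity and keep the top k."""
    scored = [RetrievalResult(cid, cosine(query_vec, vec), text)
              for cid, (vec, text) in index.items()]
    return sorted(scored, key=lambda r: r.score, reverse=True)[:k]

# Toy index: chunk_id -> (embedding, text)
index = {
    "c1": ([1.0, 0.0], "Chunking splits documents."),
    "c2": ([0.0, 1.0], "Embeddings map text to vectors."),
}
hits = retrieve([0.9, 0.1], index, k=1)
answer = RAGResult("How are docs split?", "(LLM answer here)", hits,
                   persona_id="researcher", model="mock-model")
print(hits[0].chunk_id, len(answer.sources))  # c1 1
```

Because `sources` rides along inside the result, a downstream auditor can replay exactly which chunks influenced the answer.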
## Data shapes
- Stage 1 - Input: source document path plus the chosen `ChunkingStrategy` enum value.
- Stage 2 - Chunk: `DocumentChunker.chunk` emits a list of `DocumentChunk` with offsets and metadata.
- Stage 3 - Embed: `EmbeddingProvider.embed(chunk.text)` produces a dense vector.
- Stage 4 - Index: `SearchIndex.index(chunk_id, embedding, metadata)` persists to the backend.
- Stage 5 - Retrieve: `SemanticRetriever.retrieve(query, k)` returns ranked `RetrievalResult` tuples.
- Stage 6 - Compose: `RAGPipeline` merges retrieved chunks with the `PersonaSpec` system prompt.
- Stage 7 - Answer: `BaseAIClient.complete` returns raw text; the pipeline wraps it in `RAGResult(question, answer, sources, persona_id, model)`.
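The backend-agnostic seam between stages 3 and 4 can be expressed as a structural protocol. This is a sketch of the assumed shape of the shared `EmbeddingProvider` protocol and a hypothetical in-memory index; the real fcc interfaces may carry additional methods.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class EmbeddingProvider(Protocol):
    """Assumed shape of the shared embedding protocol (Stage 3)."""
    def embed(self, text: str) -> list[float]: ...

class StubProvider:
    """Satisfies the protocol structurally -- no inheritance required."""
    def embed(self, text: str) -> list[float]:
        return [float(len(text))] * 384  # fixed-size stub vector

class InMemoryIndex:
    """Stage 4 stand-in for SearchIndex.index(chunk_id, embedding, metadata)."""
    def __init__(self) -> None:
        self.rows: dict[str, tuple[list[float], dict]] = {}

    def index(self, chunk_id: str, embedding: list[float], metadata: dict) -> None:
        self.rows[chunk_id] = (embedding, metadata)

provider: EmbeddingProvider = StubProvider()
store = InMemoryIndex()
store.index("c1", provider.embed("hello"), {"strategy": "FIXED_SIZE"})
print(isinstance(provider, EmbeddingProvider), len(store.rows["c1"][0]))  # True 384
```

Structural typing is what makes the index backend-agnostic: any object with a matching `embed` method slots in, whether it calls a hosted model or a deterministic mock.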
## See also
- Source: `src/fcc/rag/chunking.py:19,37`, `src/fcc/rag/retriever.py:19`, `src/fcc/rag/pipeline.py:59`
- Source: `src/fcc/search/index.py`, `src/fcc/search/embeddings.py`
- Related class diagram: `../class-diagrams/rag-pipeline.md`
- For audience tier: `docs/for-scientists/literature-review-agents.md`