RAG Document Pipeline Data Flow

This page traces how a source document is chunked, embedded, indexed, retrieved, and composed into a persona-aware answer by the FCC RAG stack. It spans rag/chunking.py, search/index.py, rag/retriever.py, rag/pipeline.py, and the shared EmbeddingProvider protocol. The pipeline supports six chunking strategies, pluggable embedding backends, and persona-conditioned system prompts, so retrieval stays compatible with R.I.S.C.E.A.R. role definitions. The final RAGResult carries its source chunks so downstream consumers can audit provenance.

The diagram below traces the document-to-answer pipeline.

```mermaid
flowchart LR
    subgraph Ingest
        I1[(Source document)]
        I2[DocumentChunker.chunk]
        I3{ChunkingStrategy}
        I4[DocumentChunk]
    end
    subgraph Index
        X1[EmbeddingProvider.embed]
        X2[SearchIndex.index]
        X3[(Vector store)]
    end
    subgraph Query
        Q1[Query string]
        Q2[EmbeddingProvider.embed]
        Q3[SemanticRetriever.retrieve]
        Q4[RetrievalResult]
    end
    subgraph Answer
        A1[RAGPipeline.ask]
        A2[PersonaSpec system prompt]
        A3[BaseAIClient.complete]
        A4[RAGResult]
    end
    I1 --> I2 --> I3
    I3 -- FIXED_SIZE / PARAGRAPH / SEMANTIC / YAML_BLOCK / CODE_FUNCTION / PARENT_CHILD --> I4
    I4 --> X1 --> X2 --> X3
    Q1 --> Q2 --> Q3
    X3 --> Q3
    Q3 --> Q4 --> A1
    A2 --> A1 --> A3 --> A4
```

At ingest time the pipeline selects one of six ChunkingStrategy values, producing DocumentChunk records whose parent_chunk_id preserves hierarchy for PARENT_CHILD strategies. Each chunk is embedded and indexed. The SearchIndex is backend-agnostic: the default MockEmbeddingProvider returns 384-dimensional vectors so CI stays offline-friendly.

At query time, RAGPipeline.ask(question, persona_id) embeds the question, retrieves top-k RetrievalResult(chunk, score, context) tuples, and hands them with the persona system prompt to the configured BaseAIClient. The RAGResult(question, answer, sources, persona_id, model) bundles provenance so callers can audit which chunks fed which answer.
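The query path can be sketched the same way. The retrieve and ask functions below approximate SemanticRetriever.retrieve and RAGPipeline.ask; the cosine scoring, the chat-message layout, and the stubbed completion are assumptions, since this page does not specify how the real BaseAIClient is invoked.

```python
import hashlib
import math
from dataclasses import dataclass


def embed(text: str, dim: int = 384) -> list[float]:
    # Same deterministic mock embedding used at ingest time.
    digest = hashlib.sha256(text.encode()).digest()
    return [digest[i % len(digest)] / 255.0 for i in range(dim)]


@dataclass
class RetrievalResult:
    chunk_id: str
    text: str
    score: float


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(query: str, store: dict[str, str], k: int = 2) -> list[RetrievalResult]:
    """Embed the query, score every indexed chunk, return the top-k."""
    q = embed(query)
    scored = [RetrievalResult(cid, text, cosine(q, embed(text)))
              for cid, text in store.items()]
    return sorted(scored, key=lambda r: r.score, reverse=True)[:k]


def ask(question: str, store: dict[str, str], persona_prompt: str) -> dict:
    """Compose the request the way RAGPipeline.ask would (client stubbed out)."""
    sources = retrieve(question, store)
    context = "\n".join(r.text for r in sources)
    messages = [
        {"role": "system", "content": persona_prompt},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    # A real BaseAIClient.complete(messages) call goes here; we echo instead.
    answer = f"[stubbed answer grounded in {len(sources)} chunks]"
    return {"question": question, "answer": answer, "sources": sources}


store = {
    "c0": "Chunking splits documents into passages.",
    "c1": "Embeddings map text to dense vectors.",
    "c2": "Personas condition the system prompt.",
}
result = ask("How are documents chunked?", store, "You are a terse RAG librarian.")
```

Returning the scored sources alongside the answer is what lets callers audit which chunks fed which response.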

Data shapes

  • Stage 1 - Input: Source document path plus chosen ChunkingStrategy enum value.
  • Stage 2 - Chunk: DocumentChunker.chunk emits a list of DocumentChunk with offsets and metadata.
  • Stage 3 - Embed: EmbeddingProvider.embed(chunk.text) produces a dense vector.
  • Stage 4 - Index: SearchIndex.index(chunk_id, embedding, metadata) persists to the backend.
  • Stage 5 - Retrieve: SemanticRetriever.retrieve(query, k) returns ranked RetrievalResult tuples.
  • Stage 6 - Compose: RAGPipeline merges retrieved chunks with the PersonaSpec system prompt.
  • Stage 7 - Answer: BaseAIClient.complete returns raw text; pipeline wraps it in RAGResult(question, answer, sources, persona_id, model).

See also

  • Source: src/fcc/rag/chunking.py:19,37, src/fcc/rag/retriever.py:19, src/fcc/rag/pipeline.py:59
  • Source: src/fcc/search/index.py, src/fcc/search/embeddings.py
  • Related class diagram: ../class-diagrams/rag-pipeline.md
  • For audience tier: docs/for-scientists/literature-review-agents.md