# RAG Persona-Aware Retrieval
This diagram traces a single persona-aware RAG query from question string to final RAGResult. The entry point is RAGPipeline.query(question, persona_id) around src/fcc/rag/pipeline.py:100, which coordinates the persona registry, chunker, semantic retriever, embedding provider, and LLM client. Developers read this trace to understand how the persona's RISCEAR block is promoted to the system prompt — this is what makes the retrieval "persona-aware" rather than a vanilla semantic-search plus LLM call. The flow is deterministic in the mock path and ready for Anthropic, OpenAI, Ollama, or LiteLLM backends in production.
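From the caller's side the whole interaction is a single method call. The sketch below is hypothetical: it assumes a no-argument constructor that wires the mock path (mock embeddings plus mock AI client) described above, and the persona ID is invented for illustration; see `pipeline.py` for the real wiring.

```python
from fcc.rag.pipeline import RAGPipeline

# Assumption: the default constructor wires the deterministic mock path.
pipeline = RAGPipeline()

result = pipeline.query(
    "Which chunking strategy does the pipeline default to?",
    persona_id="support-analyst",   # hypothetical persona ID
)

print(result.answer)
if not result.sources:              # empty sources => ungrounded answer
    print("warning: answer was not grounded in retrieved chunks")
```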
The sequence below shows one query end to end, from persona resolution through the grounded LLM completion.
```mermaid
sequenceDiagram
    participant Caller
    participant RAGPipeline
    participant PersonaRegistry
    participant DocumentChunker
    participant EmbeddingProvider
    participant SemanticRetriever
    participant SearchIndex
    participant AIClient
    participant RAGResult
    Caller->>RAGPipeline: query(question, persona_id)
    RAGPipeline->>PersonaRegistry: by_id(persona_id)
    PersonaRegistry-->>RAGPipeline: PersonaSpec
    Note over RAGPipeline: system_prompt = persona.riscear
    RAGPipeline->>DocumentChunker: chunk_document(source)
    DocumentChunker-->>RAGPipeline: list[Chunk]
    RAGPipeline->>EmbeddingProvider: embed(question)
    EmbeddingProvider-->>RAGPipeline: query_vector
    RAGPipeline->>SemanticRetriever: retrieve(query_vector, k)
    SemanticRetriever->>SearchIndex: search(query_vector, k)
    SearchIndex-->>SemanticRetriever: top_k_chunks
    SemanticRetriever-->>RAGPipeline: list[Chunk]
    RAGPipeline->>AIClient: complete(system_prompt, question + context)
    AIClient-->>RAGPipeline: answer
    RAGPipeline->>RAGResult: new(question, answer, sources, persona_id, model)
    RAGResult-->>RAGPipeline: RAGResult
    RAGPipeline-->>Caller: RAGResult
```
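Read as code, the diagram corresponds roughly to the control flow below. This is a sketch only: the collaborator attribute names (`persona_registry`, `chunker`, `embedder`, `retriever`, `ai_client`) and the chunk `.text` field are assumptions taken from the diagram, not from the actual implementation at `src/fcc/rag/pipeline.py:100`.

```python
def run_query(pipeline, source: str, question: str, persona_id: str, k: int = 5):
    """Control-flow sketch of RAGPipeline.query; names mirror the diagram,
    not necessarily the real implementation."""
    persona = pipeline.persona_registry.by_id(persona_id)     # raises KeyError if unknown
    system_prompt = persona.riscear                           # RISCEAR block, verbatim

    chunks = pipeline.chunker.chunk_document(source)          # one of six strategies
    query_vector = pipeline.embedder.embed(question)          # mock 384-d or real model
    top_chunks = pipeline.retriever.retrieve(query_vector, k) # k-NN over the index

    context = "\n\n".join(chunk.text for chunk in top_chunks)
    answer = pipeline.ai_client.complete(system_prompt, f"{question}\n\n{context}")

    # The real pipeline wraps these in a RAGResult (question, answer, sources,
    # persona_id, model); a plain dict stands in here.
    return {
        "question": question,
        "answer": answer,
        "sources": top_chunks,
        "persona_id": persona_id,
        "model": getattr(pipeline.ai_client, "model", "unknown"),
    }
```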
Failure modes come from three places. A missing persona raises KeyError from the registry and aborts the query — callers typically wrap this. An empty top_k_chunks list (the retriever finds nothing above the similarity floor) does not raise but produces a RAGResult with empty sources; the LLM is still called with just the question and the persona prompt, so the caller must inspect sources to distinguish "grounded" from "ungrounded" answers. An AIClient.complete timeout surfaces as an exception; there is no retry inside the pipeline because retry policy is a provider concern. To observe in production, wrap query in a @traced span and emit a custom metric for rag_sources_per_query — drops in this metric are the earliest signal of index degradation.
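A caller-side sketch of handling those failure modes. The metric and logging helpers here are stand-ins for whatever tracing and metrics stack the deployment actually uses (the `@traced` span mentioned above is omitted); only the KeyError, empty-sources, and no-retry behaviour come from this page.

```python
import logging

log = logging.getLogger(__name__)


def emit_metric(name: str, value: float) -> None:
    """Stand-in for the deployment's real metrics client."""
    log.info("metric %s=%s", name, value)


def answer_question(pipeline, question: str, persona_id: str):
    try:
        result = pipeline.query(question, persona_id)
    except KeyError as exc:
        # Missing persona: surface a clearer error than a bare KeyError.
        raise ValueError(f"unknown persona: {persona_id}") from exc
    # Provider timeouts propagate unchanged: retry policy is a provider concern.

    emit_metric("rag_sources_per_query", len(result.sources))
    if not result.sources:
        # Ungrounded answer: the LLM saw only the question and the persona prompt.
        log.warning("RAG answer has no sources (persona_id=%s)", persona_id)
    return result
```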
The persona's RISCEAR block becomes the LLM system prompt verbatim, which means changes to a persona YAML propagate to RAG behaviour without any code change. This is powerful but also means a malformed persona can silently degrade answer quality; the compliance pipeline catches most of these before they ship. The RAGResult preserves both the persona_id and the model identifier so that downstream evaluation (CLEAR+ benchmarking) can attribute answers to the exact configuration that produced them.
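A plausible shape for the result object, shown only to illustrate the attribution fields described above; the actual class may carry more, and the field names follow the diagram rather than the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RAGResult:
    question: str
    answer: str
    sources: list     # retrieved chunks; empty means the answer is ungrounded
    persona_id: str   # which persona's RISCEAR shaped the system prompt
    model: str        # which backend/model produced the answer
```

Carrying both `persona_id` and `model` on every result is what lets CLEAR+ benchmarking attribute an answer to the exact persona/model pair that produced it.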
## Steps in detail
- Caller to RAGPipeline: query — The caller invokes `query(question, persona_id)` with the natural-language question and a persona identifier.
- RAGPipeline to PersonaRegistry: by_id — The registry returns the `PersonaSpec` whose RISCEAR block will shape the LLM system prompt.
- RAGPipeline to DocumentChunker: chunk_document — One of six chunking strategies is applied to the source document (fixed-size, sentence, paragraph, semantic, recursive, or sliding); a strategy-dispatch sketch follows this list.
- RAGPipeline to EmbeddingProvider: embed — The question is embedded via the configured provider (mock 384-d or a sentence-transformers model).
- RAGPipeline to SemanticRetriever: retrieve — The retriever is asked for the top-k chunks matching the query vector.
- SemanticRetriever to SearchIndex: search — The index performs the k-NN lookup against stored chunk vectors; a cosine-similarity sketch follows this list.
- SemanticRetriever to RAGPipeline: list[Chunk] — The top-k chunks are returned as context material.
- RAGPipeline to AIClient: complete — The client is called with the persona RISCEAR as the system prompt and the question plus concatenated chunks as the user message.
- RAGPipeline to RAGResult: new — A `RAGResult` is constructed carrying the question, answer, sources, persona ID, and model identifier.
- RAGPipeline to Caller: RAGResult — The result is returned for rendering, scoring, or archival.
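The six strategy names suggest a simple dispatch inside the chunker. The sketch below is an assumption about how such a selector might look, not the actual `chunker.py` API; only the strategy names come from this page, and only two of the six are implemented here.

```python
from collections.abc import Callable


def _fixed_size(text: str, size: int = 500) -> list[str]:
    """Fixed-size chunking: slice the text into equal-length windows."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def _paragraph(text: str) -> list[str]:
    """Paragraph chunking: split on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


# fixed-size, sentence, paragraph, semantic, recursive, sliding (per this page);
# only two are sketched.
STRATEGIES: dict[str, Callable[[str], list[str]]] = {
    "fixed": _fixed_size,
    "paragraph": _paragraph,
}


def chunk_document(text: str, strategy: str = "paragraph") -> list[str]:
    return STRATEGIES[strategy](text)
```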
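The retrieve/search hop amounts to a top-k nearest-neighbour lookup over stored chunk vectors with a similarity floor. A minimal cosine-similarity version, assuming plain lists of floats (for example the 384-dimension mock embeddings) rather than whatever the real `SearchIndex` stores:

```python
import math


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector: list[float],
           index: list[tuple[str, list[float]]],
           k: int = 5,
           floor: float = 0.0) -> list[str]:
    """Return the texts of the k stored chunks most similar to the query,
    keeping only scores at or above the similarity floor."""
    scored = [(chunk, _cosine(query_vector, vec)) for chunk, vec in index]
    scored = [(chunk, score) for chunk, score in scored if score >= floor]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in scored[:k]]
```

If nothing clears the floor this returns an empty list, which is exactly the "empty sources" case described in the failure modes above.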
## See also

- Entry point: `src/fcc/rag/pipeline.py:100`
- Chunker strategies: `src/fcc/rag/chunker.py`
- Related class diagram: `../class-diagrams/rag-pipeline.md`
- Related event types: `src/fcc/messaging/events.py` — `EventType.RAG_QUERY`, `EventType.RAG_RETRIEVED`