
API Reference: RAG Pipeline

This document covers the FCC Retrieval-Augmented Generation (RAG) pipeline, which combines document chunking, semantic retrieval, and AI-powered answer generation with persona-aware query support.

```mermaid
flowchart LR
    D[Documents] --> C[DocumentChunker]
    C -->|6 strategies| CH[DocumentChunks]
    CH --> R[SemanticRetriever]
    R -->|embed + search| RR[RetrievalResults]
    RR --> P[RAGPipeline]
    Q[User Query] --> P
    P -->|persona-aware prompt| AI[AIClient]
    AI --> A[RAGResult / Answer]
```

DocumentChunker

fcc.rag.chunking.DocumentChunker splits documents into chunks using configurable strategies.

Chunking Strategies

| Strategy | Enum Value | Description |
|---|---|---|
| Fixed Size | `ChunkingStrategy.FIXED_SIZE` | Fixed character windows with configurable overlap |
| Paragraph | `ChunkingStrategy.PARAGRAPH` | Split on double-newline boundaries |
| Semantic | `ChunkingStrategy.SEMANTIC` | Semantic-boundary splitting |
| YAML Block | `ChunkingStrategy.YAML_BLOCK` | Split by top-level YAML keys |
| Code Function | `ChunkingStrategy.CODE_FUNCTION` | Split by Python `def`/`class` boundaries |
| Parent-Child | `ChunkingStrategy.PARENT_CHILD` | Hierarchical parent-child chunking |

Basic Usage

```python
from fcc.rag.chunking import DocumentChunker, ChunkingStrategy

# Default: paragraph strategy, 512 char chunks, 64 char overlap
chunker = DocumentChunker()

# Fixed-size with custom parameters
chunker = DocumentChunker(
    strategy=ChunkingStrategy.FIXED_SIZE,
    chunk_size=1024,
    overlap=128,
)

# Chunk plain text
chunks = chunker.chunk_text("Long document text...", source_path="doc.txt")

# Chunk specific formats
yaml_chunks = chunker.chunk_yaml(yaml_text, source_path="config.yaml")
py_chunks = chunker.chunk_python(python_source, source_path="module.py")
md_chunks = chunker.chunk_markdown(markdown_text, source_path="guide.md")
```

Directory Chunking

```python
from pathlib import Path

# Chunk all supported files in a directory
all_chunks = chunker.chunk_directory(
    Path("src/fcc/"),
    patterns=["*.py", "*.yaml", "*.yml", "*.md"],
)
```

Extension mapping: `.py` files use `chunk_python`, `.yaml`/`.yml` use `chunk_yaml`, `.md` uses `chunk_markdown`, and all other extensions fall back to `chunk_text`.
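That dispatch can be sketched as follows; the handler names are plain strings here purely for illustration (the real `DocumentChunker` dispatches to its own bound methods):

```python
from pathlib import Path

# Suffix-to-handler mapping mirroring the documented extension rules
HANDLERS = {
    ".py": "chunk_python",
    ".yaml": "chunk_yaml",
    ".yml": "chunk_yaml",
    ".md": "chunk_markdown",
}

def pick_handler(path: Path) -> str:
    # Unknown extensions fall back to plain-text chunking
    return HANDLERS.get(path.suffix.lower(), "chunk_text")

print(pick_handler(Path("module.py")))  # chunk_python
print(pick_handler(Path("notes.txt")))  # chunk_text
```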

DocumentChunk

Each chunk is a frozen dataclass with the following fields:

| Field | Type | Description |
|---|---|---|
| `chunk_id` | `str` | Deterministic hash-based ID |
| `text` | `str` | The chunk text content |
| `source_path` | `str` | Path to the source document |
| `parent_chunk_id` | `str` or `None` | Parent chunk ID (for the parent-child strategy) |
| `strategy` | `ChunkingStrategy` | Strategy that produced this chunk |
| `start_offset` | `int` | Character offset in the source document |
| `end_offset` | `int` | End character offset |
| `metadata` | `dict` | Arbitrary metadata (e.g. `{"header": "...", "level": 2}`) |

```python
# Serialize/deserialize
data = chunk.to_dict()
restored = DocumentChunk.from_dict(data)
```
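As an illustration of working with these fields, the sketch below uses a minimal stand-in dataclass with the same field names (not the real `DocumentChunk`) to group chunks by `source_path`:

```python
from collections import defaultdict
from dataclasses import dataclass

# Stand-in carrying a subset of DocumentChunk's fields, for illustration only
@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    source_path: str

def group_by_source(chunks):
    # Map each source document to the chunks produced from it
    grouped = defaultdict(list)
    for c in chunks:
        grouped[c.source_path].append(c)
    return dict(grouped)

chunks = [Chunk("a1", "...", "doc.txt"),
          Chunk("b1", "...", "guide.md"),
          Chunk("a2", "...", "doc.txt")]
print({k: len(v) for k, v in group_by_source(chunks).items()})
# {'doc.txt': 2, 'guide.md': 1}
```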

SemanticRetriever

fcc.rag.retriever.SemanticRetriever wraps a SearchIndex to convert search results back into DocumentChunk instances with relevance scores and optional parent-chunk context expansion.

Building from Chunks

```python
from fcc.rag.retriever import SemanticRetriever

# Build directly from a list of chunks
retriever = SemanticRetriever.build_from_chunks(chunks)

# Or build manually
from fcc.search.index import SearchIndex
index = SearchIndex()
retriever = SemanticRetriever(index)

for chunk in chunks:
    index.add_document(chunk.chunk_id, chunk.text, dict(chunk.metadata))
    retriever.register_chunk(chunk)
```

Retrieval

```python
# Basic retrieval
results = retriever.retrieve("What is R.I.S.C.E.A.R.?", k=5)
for rr in results:
    print(f"Score: {rr.score:.3f}, Source: {rr.chunk.source_path}")
    print(f"  Text: {rr.chunk.text[:100]}")

# Retrieval with parent-chunk context expansion
results = retriever.retrieve_with_context("persona dimensions", k=5)
for rr in results:
    print(f"Score: {rr.score:.3f}")
    if rr.context:
        print(f"  Parent context: {rr.context[:100]}")
```

RetrievalResult

A frozen dataclass with the following fields:

| Field | Type | Description |
|---|---|---|
| `chunk` | `DocumentChunk` | The retrieved chunk |
| `score` | `float` | Relevance/similarity score |
| `context` | `str` | Expanded parent-chunk context (empty if not applicable) |
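A common post-processing step is to keep only sufficiently relevant hits. The sketch below uses a minimal stand-in with the same field names (not the real `RetrievalResult`) to filter and rank by `score`:

```python
from dataclasses import dataclass

# Stand-in mirroring RetrievalResult's score/context fields, for illustration
@dataclass(frozen=True)
class Result:
    score: float
    context: str = ""

def above_threshold(results, min_score=0.5):
    # Keep results at or above the threshold, highest score first
    return sorted((r for r in results if r.score >= min_score),
                  key=lambda r: r.score, reverse=True)

hits = [Result(0.9), Result(0.3), Result(0.7)]
print([r.score for r in above_threshold(hits)])  # [0.9, 0.7]
```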

RAGPipeline

fcc.rag.pipeline.RAGPipeline orchestrates end-to-end retrieval-augmented generation queries, combining a SemanticRetriever with an AI client (real or mock).

Basic Setup

```python
from pathlib import Path

from fcc.rag.pipeline import RAGPipeline
from fcc.rag.retriever import SemanticRetriever
from fcc.rag.chunking import DocumentChunker

# Chunk documents
chunker = DocumentChunker()
chunks = chunker.chunk_directory(Path("src/fcc/data/personas/"))

# Build retriever and pipeline
retriever = SemanticRetriever.build_from_chunks(chunks)
pipeline = RAGPipeline(retriever)
```

Querying

```python
# Basic query (uses mock AI client by default)
result = pipeline.query("What is the Research Coordinator's role?", k=5)
print(result.answer)
print(f"Sources: {len(result.sources)}")

# Query with persona ID for style customization
result = pipeline.query(
    "Explain governance constitutions",
    persona_id="RC",
    k=3,
)
```

Persona-Aware Queries

```python
from fcc.personas.registry import PersonaRegistry

# get_personas_dir is a helper that returns the personas directory;
# import it from the appropriate fcc module.
registry = PersonaRegistry.from_yaml_directory(get_personas_dir())
persona = registry.get("RC")

result = pipeline.query_with_persona(
    "What are the key governance patterns?",
    persona=persona,
    k=5,
)
print(f"Answered as: {result.persona_id}")
```

The persona's `riscear.role` is injected into the system prompt to shape the answer style.

Directory Indexing Shortcut

```python
# Re-index from a directory (replaces current retriever state)
pipeline.build_index_from_directory("src/fcc/data/")
```

RAGResult

A frozen dataclass with the following fields:

| Field | Type | Description |
|---|---|---|
| `question` | `str` | The original question |
| `answer` | `str` | The generated answer |
| `sources` | `tuple[RetrievalResult, ...]` | Retrieved source chunks |
| `persona_id` | `str` or `None` | Persona that shaped the answer |
| `model` | `str` | The AI model used |
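A typical consumer renders the answer together with its sources. The sketch below uses minimal stand-ins with the same field names (not the real `RAGResult`/`RetrievalResult`) to format a numbered source list:

```python
from dataclasses import dataclass

# Stand-ins mirroring a subset of the documented fields, for illustration
@dataclass(frozen=True)
class Source:
    source_path: str
    score: float

@dataclass(frozen=True)
class Answer:
    answer: str
    sources: tuple

def render(result: Answer) -> str:
    # Answer text first, then one numbered line per retrieved source
    lines = [result.answer, "", "Sources:"]
    lines += [f"  [{i}] {s.source_path} (score {s.score:.2f})"
              for i, s in enumerate(result.sources, 1)]
    return "\n".join(lines)

r = Answer("Governance is documented in the persona specs.",
           (Source("personas/rc.yaml", 0.91),))
print(render(r))  # prints the answer followed by a numbered source list
```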

Configuration Recommendations

| Content Type | Recommended Strategy | Chunk Size |
|---|---|---|
| Persona YAML specs | `YAML_BLOCK` | N/A (split by keys) |
| Python source | `CODE_FUNCTION` | N/A (split by defs) |
| Guidebook chapters | `PARAGRAPH` | 512 |
| Large reference docs | `FIXED_SIZE` | 1024, overlap 128 |
| Hierarchical docs | `PARENT_CHILD` | N/A (auto) |
| Markdown tutorials | Use `chunk_markdown` | N/A (split by headers) |
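These recommendations can be encoded as a simple lookup; strategy names are plain strings here rather than the real `ChunkingStrategy` members, and the content-type keys are illustrative:

```python
# (strategy, chunk_size) per content type; None means the strategy
# determines its own boundaries and ignores chunk_size.
RECOMMENDED = {
    "persona_yaml": ("YAML_BLOCK", None),
    "python_source": ("CODE_FUNCTION", None),
    "guidebook": ("PARAGRAPH", 512),
    "large_reference": ("FIXED_SIZE", 1024),
}

def pick(content_type: str):
    # Default to the paragraph strategy for anything unrecognized
    return RECOMMENDED.get(content_type, ("PARAGRAPH", 512))

print(pick("large_reference"))  # ('FIXED_SIZE', 1024)
```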

Import Paths Summary

| Class | Import Path |
|---|---|
| `DocumentChunker` | `fcc.rag.chunking` |
| `DocumentChunk` | `fcc.rag.chunking` |
| `ChunkingStrategy` | `fcc.rag.chunking` |
| `SemanticRetriever` | `fcc.rag.retriever` |
| `RetrievalResult` | `fcc.rag.retriever` |
| `RAGPipeline` | `fcc.rag.pipeline` |
| `RAGResult` | `fcc.rag.pipeline` |

See Also