
Chapter 1: Semantic Search

Learning Objectives

By the end of this chapter you will be able to:

  1. Explain why keyword search is insufficient for FCC artifact retrieval.
  2. Describe the Protocol-based embedding provider design and its rationale.
  3. Implement a custom embedding provider and register it with the framework.
  4. Build and query a SearchIndex over FCC artifacts.
  5. Use the MockEmbeddingProvider for deterministic testing.

The figure below shows the semantic search data flow: documents and query text run through the same EmbeddingProvider into a shared SearchIndex, which returns cosine-similar top-K results to PersonaSearchIndex or ActionSearchIndex consumers.

flowchart LR
    Q[Query Text] --> EP[EmbeddingProvider]
    EP --> QV[Query Vector]
    QV --> SI[(SearchIndex)]

    D1[Document 1] --> EP
    D2[Document 2] --> EP
    D3[Document N] --> EP
    EP --> DV[Document Vectors]
    DV --> SI

    SI --> TOP[Top-K Results<br/>by Cosine Similarity]
    TOP --> PSI[PersonaSearchIndex<br/>or ActionSearchIndex]

    style SI fill:#2196F3,color:#fff
    style TOP fill:#4CAF50,color:#fff

Protocol-based provider dispatch means you can unit-test retrieval deterministically with MockEmbeddingProvider and only switch to a real provider for integration or production runs.

FCC workflows produce artifacts at every node: research findings, design documents, code, reviews, scores. As the number of simulations and collaboration sessions grows, so does the artifact corpus. Retrieving relevant artifacts by keyword works for simple cases ("find all documents mentioning pricing"), but it fails for semantic queries ("find all analyses that discuss market entry risk for regulated industries").

Semantic search solves this by representing artifacts as dense vectors (embeddings) in a high-dimensional space, where proximity corresponds to meaning similarity. A query like "market entry risk for regulated industries" is embedded into the same space, and the nearest artifacts are returned -- even if they use completely different vocabulary.

This capability is the foundation for two downstream features: the RAG pipeline (Chapter 3) and federated knowledge search (Chapter 4). It also powers the Sky-Parlour natural-language interface, which translates stakeholder questions into semantic queries across the artifact corpus.

The Embedding Provider Protocol

FCC uses a Protocol-based design for embedding providers, following the same pattern as the objectmodel package (see ADR-004: Embedding Provider Protocol for the full rationale).

The key design decisions:

  1. Protocol, not ABC. The EmbeddingProvider is defined as a Python typing.Protocol, not an abstract base class. This means any class that implements the required methods is a valid provider, without inheritance. This is more flexible for third-party integrations.

  2. Batch-first API. The primary method is embed_batch(texts: list[str]) -> list[list[float]]. Single-text embedding is a convenience wrapper. Batch processing is essential for efficiency -- embedding 1,000 artifacts one at a time is 10--100x slower than embedding them in a single batch.

  3. Dimension metadata. Each provider reports its embedding dimensionality (e.g., 1536 for OpenAI's text-embedding-3-small, 1024 for Voyage AI's voyage-3, the embedding provider Anthropic recommends). This allows the SearchIndex to validate consistency.

from typing import Protocol


class EmbeddingProvider(Protocol):
    """Protocol for embedding text into vector space."""

    @property
    def dimensions(self) -> int:
        """The dimensionality of the embedding vectors."""
        ...

    def embed_batch(self, texts: list[str]) -> list[list[float]]:
        """Embed a batch of texts into vectors."""
        ...

    def embed(self, text: str) -> list[float]:
        """Embed a single text. Typically implemented as embed_batch([text])[0]."""
        ...

Why Protocol Over ABC?

The Protocol approach has three advantages in the FCC context:

  1. No import coupling. A third-party embedding library does not need to import anything from FCC to be compatible. As long as it implements the right methods, it works.
  2. Duck typing. Existing embedding clients (e.g., openai.Embeddings, custom inference servers) can be wrapped with minimal adapter code.
  3. Testability. The MockEmbeddingProvider is trivially implementable and does not need to inherit from a base class.
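The duck-typing point can be demonstrated directly. The sketch below redefines the protocol locally so the example is self-contained, and marks it @runtime_checkable so isinstance can verify structural compatibility; ThirdPartyProvider and its toy vectors are purely illustrative.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class EmbeddingProvider(Protocol):
    """Local copy of the protocol, for a self-contained demo."""

    @property
    def dimensions(self) -> int: ...

    def embed_batch(self, texts: list[str]) -> list[list[float]]: ...


class ThirdPartyProvider:
    """No FCC imports, no inheritance -- structure alone makes it valid."""

    @property
    def dimensions(self) -> int:
        return 4

    def embed_batch(self, texts: list[str]) -> list[list[float]]:
        # Toy vectors; a real wrapper would call the vendor's client here.
        return [[0.0] * self.dimensions for _ in texts]


# Structural check: no shared base class, yet the instance satisfies
# the protocol.
assert isinstance(ThirdPartyProvider(), EmbeddingProvider)
```

Note that runtime_checkable isinstance checks only verify that the members exist, not their signatures; static checkers like mypy verify the full signatures.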

The MockEmbeddingProvider

For testing and development, FCC ships a MockEmbeddingProvider that generates deterministic embeddings without calling any external API:

from fcc.search.providers import MockEmbeddingProvider

provider = MockEmbeddingProvider(dimensions=128)

vectors = provider.embed_batch(["market analysis", "technical design", "code review"])
assert len(vectors) == 3
assert len(vectors[0]) == 128

# Deterministic: same input always produces same output
vectors2 = provider.embed_batch(["market analysis"])
assert vectors[0] == vectors2[0]

The mock provider uses a hash-based algorithm to generate consistent vectors. Texts with similar words produce somewhat similar vectors, enabling basic semantic search testing without the cost and nondeterminism of real embedding models.
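One plausible hash-based scheme works like this: hash each token to a bucket, accumulate counts, and normalize. Texts that share words then share buckets, which yields the "somewhat similar" vectors described above. This is an illustrative sketch, not the actual MockEmbeddingProvider internals:

```python
import hashlib
import math


def mock_embed(text: str, dimensions: int = 128) -> list[float]:
    """Deterministic hash-bucketed embedding (illustrative sketch only)."""
    vec = [0.0] * dimensions
    for token in text.lower().split():
        # Map each token to a stable bucket via its hash.
        digest = hashlib.md5(token.encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % dimensions
        vec[bucket] += 1.0
    # Normalize to unit length so cosine similarity is just a dot product.
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


# Same input always produces the same output.
assert mock_embed("market analysis") == mock_embed("market analysis")
```

Because the vectors are unit-normalized, two texts with no tokens in common score near zero, while overlapping vocabulary produces a positive cosine score.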

When to Use Mock vs. Real Providers

Scenario             Provider
--------             --------
Unit tests           Mock
CI pipeline          Mock
Development          Mock
Quality evaluation   Real
Production           Real

The rule of thumb: use mock whenever you need determinism and speed. Use real providers when you need to evaluate the actual quality of semantic search results.

The SearchIndex

The SearchIndex is the core data structure for semantic search. It stores embeddings alongside metadata and supports nearest-neighbor queries:

from fcc.search.index import SearchIndex

index = SearchIndex(provider=provider)

# Index artifacts
index.add("doc_001", text="Competitive analysis of the SaaS market", metadata={
    "type": "finding",
    "persona": "research_analyst",
    "session": "session_42",
})

index.add("doc_002", text="System architecture for the data pipeline", metadata={
    "type": "design",
    "persona": "software_architect",
    "session": "session_42",
})

index.add("doc_003", text="Security review of API authentication", metadata={
    "type": "review",
    "persona": "security_auditor",
    "session": "session_43",
})

# Query
results = index.search("market competition and pricing", top_k=2)
for result in results:
    print(f"ID: {result.id}, Score: {result.score:.3f}")
    print(f"Metadata: {result.metadata}")
    print()

Index Operations

The SearchIndex supports:

  • add(id, text, metadata): Index a single artifact.
  • add_batch(items): Index multiple artifacts efficiently.
  • search(query, top_k, filters): Find the nearest artifacts.
  • remove(id): Remove an artifact from the index.
  • save(path) / load(path): Persist and restore the index.

Filtering

Search results can be filtered by metadata:

results = index.search(
    "API security",
    top_k=5,
    filters={"type": "review", "session": "session_43"},
)

Filters are applied before the nearest-neighbor search, which both shrinks the candidate set (faster) and guarantees that every returned result matches the metadata constraints.
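To make the filter-then-rank behavior concrete, here is a minimal brute-force sketch. The cosine helper and the (id, vector, metadata) tuple shape are illustrative assumptions, not the real SearchIndex internals:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)


def search(entries, query_vec, top_k=5, filters=None):
    """entries: list of (id, vector, metadata) tuples.

    Metadata filters prune candidates first; cosine similarity then
    ranks only the survivors.
    """
    candidates = [
        e for e in entries
        if not filters or all(e[2].get(k) == v for k, v in filters.items())
    ]
    scored = sorted(
        ((cosine(vec, query_vec), id_) for id_, vec, _ in candidates),
        reverse=True,
    )
    return scored[:top_k]
```

Filtering first means an artifact from the wrong session can never displace a relevant one from the top-K, no matter how high its raw similarity score.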

Implementing a Custom Provider

To use a real embedding model, implement the Protocol:

import openai

class OpenAIEmbeddingProvider:
    """Embedding provider using OpenAI's API."""

    def __init__(self, model: str = "text-embedding-3-small"):
        self._model = model
        self._client = openai.OpenAI()

    @property
    def dimensions(self) -> int:
        return 1536  # text-embedding-3-small

    def embed_batch(self, texts: list[str]) -> list[list[float]]:
        response = self._client.embeddings.create(
            model=self._model,
            input=texts,
        )
        return [item.embedding for item in response.data]

    def embed(self, text: str) -> list[float]:
        return self.embed_batch([text])[0]

This provider works with the SearchIndex without any registration step -- it satisfies the Protocol, so it is a valid provider.

Provider Registration

For plugin-based discovery, register your provider via a packaging entry point in pyproject.toml:

[project.entry-points."fcc.plugins.providers"]
openai_embeddings = "my_package:OpenAIEmbeddingProvider"

Integration with the Simulation Engine

Semantic search integrates with the simulation engine through the shared context. During a simulation, the Find phase can use the SearchIndex to retrieve relevant artifacts from previous simulations:

# In a Find-phase persona's prompt
search_index = shared_context.get("search_index")
results = search_index.search("prior market analyses for SaaS", top_k=3)

# Inject results into the persona's input
input_context = {
    "prior_findings": [r.text for r in results],
    "scenario": current_scenario,
}

This pattern enables cross-session learning: each simulation builds on the findings of previous simulations, creating an expanding knowledge base that gets more valuable over time.

Performance Considerations

Semantic search performance depends on three factors:

  1. Embedding speed. Batch embedding is 10--100x faster than single-text embedding. Always use embed_batch when indexing.
  2. Index size. For small corpora (< 10,000 artifacts), brute-force nearest-neighbor search is fast enough. For larger corpora, consider approximate nearest-neighbor (ANN) algorithms.
  3. Query latency. In production, cache frequently used query embeddings and pre-filter by metadata to reduce the search space.

The current implementation uses brute-force search, which is sufficient for the typical FCC corpus size. ANN integration (via libraries like Faiss or Annoy) is on the roadmap for Phase 14.
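For corpora in the brute-force range, the entire scoring pass can be vectorized. The sketch below uses NumPy (assumed available; the real index's internals may differ) and argpartition to avoid fully sorting every score:

```python
import numpy as np


def top_k_cosine(doc_vecs: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k rows of doc_vecs most cosine-similar to query."""
    # Normalize rows and query so the dot product equals cosine similarity.
    doc_norm = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    scores = doc_norm @ q_norm
    # argpartition finds the top-k in O(n); only those k get sorted.
    top = np.argpartition(scores, -k)[-k:]
    return top[np.argsort(scores[top])[::-1]]
```

A single matrix-vector product over 10,000 vectors completes in milliseconds, which is why brute force suffices until the corpus outgrows memory or latency budgets.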

Key Takeaways

  • Semantic search enables meaning-based artifact retrieval, surpassing keyword search for complex queries.
  • The Protocol-based embedding provider design avoids import coupling and supports duck typing.
  • MockEmbeddingProvider provides deterministic, cost-free testing. Use real providers only when evaluating quality.
  • The SearchIndex stores embeddings with metadata and supports filtered nearest-neighbor queries.
  • Cross-session learning uses semantic search to build on findings from previous simulations.

Cross-References


← Book 3 Index | Next: Chapter 2 -- Knowledge Graphs →