Chapter 1: Semantic Search¶
Learning Objectives¶
By the end of this chapter you will be able to:
- Explain why keyword search is insufficient for FCC artifact retrieval.
- Describe the Protocol-based embedding provider design and its rationale.
- Implement a custom embedding provider and register it with the framework.
- Build and query a SearchIndex over FCC artifacts.
- Use the MockEmbeddingProvider for deterministic testing.
The figure below shows the semantic search data flow: documents and query text run through the same EmbeddingProvider into a shared SearchIndex, which returns cosine-similar top-K results to PersonaSearchIndex or ActionSearchIndex consumers.
flowchart LR
Q[Query Text] --> EP[EmbeddingProvider]
EP --> QV[Query Vector]
QV --> SI[(SearchIndex)]
D1[Document 1] --> EP
D2[Document 2] --> EP
D3[Document N] --> EP
EP --> DV[Document Vectors]
DV --> SI
SI --> TOP[Top-K Results<br/>by Cosine Similarity]
TOP --> PSI[PersonaSearchIndex<br/>or ActionSearchIndex]
style SI fill:#2196F3,color:#fff
style TOP fill:#4CAF50,color:#fff
Protocol-based provider dispatch means you can unit-test retrieval deterministically with MockEmbeddingProvider and only switch to a real provider for integration or production runs.
Why Semantic Search?¶
FCC workflows produce artifacts at every node: research findings, design documents, code, reviews, scores. As the number of simulations and collaboration sessions grows, so does the artifact corpus. Retrieving relevant artifacts by keyword works for simple cases ("find all documents mentioning pricing"), but it fails for semantic queries ("find all analyses that discuss market entry risk for regulated industries").
Semantic search solves this by representing artifacts as dense vectors (embeddings) in a high-dimensional space, where proximity corresponds to meaning similarity. A query like "market entry risk for regulated industries" is embedded into the same space, and the nearest artifacts are returned -- even if they use completely different vocabulary.
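Proximity in that embedding space is typically measured with cosine similarity. As a self-contained illustration (plain Python, independent of any FCC API), nearness is just the cosine of the angle between two vectors:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score near 1.0; orthogonal ones score 0.0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # 1.0 (up to float error)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Because the score depends only on direction, two artifacts can match strongly even when one is a short query and the other a long document.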
This capability is the foundation for two downstream features: the RAG pipeline (Chapter 3) and federated knowledge search (Chapter 4). It also powers the Sky-Parlour natural-language interface, which translates stakeholder questions into semantic queries across the artifact corpus.
The Embedding Provider Protocol¶
FCC uses a Protocol-based design for embedding providers, following the same pattern as the objectmodel package (see ADR-004: Embedding Provider Protocol for the full rationale).
The key design decisions:
- Protocol, not ABC. The EmbeddingProvider is defined as a Python typing.Protocol, not an abstract base class. Any class that implements the required methods is a valid provider, without inheritance, which is more flexible for third-party integrations.
- Batch-first API. The primary method is embed_batch(texts: list[str]) -> list[list[float]]. Single-text embedding is a convenience wrapper. Batch processing is essential for efficiency -- embedding 1,000 artifacts one at a time is 10--100x slower than embedding them in a single batch.
- Dimension metadata. Each provider reports its embedding dimensionality (e.g., 1536 for OpenAI text-embedding-3-small, 384 for the common all-MiniLM-L6-v2 sentence-transformers model). This allows the SearchIndex to validate that all indexed vectors are consistent.
from typing import Protocol

class EmbeddingProvider(Protocol):
    """Protocol for embedding text into vector space."""

    @property
    def dimensions(self) -> int:
        """The dimensionality of the embedding vectors."""
        ...

    def embed_batch(self, texts: list[str]) -> list[list[float]]:
        """Embed a batch of texts into vectors."""
        ...

    def embed(self, text: str) -> list[float]:
        """Embed a single text; the default delegates to embed_batch."""
        return self.embed_batch([text])[0]
Why Protocol Over ABC?¶
The Protocol approach has three advantages in the FCC context:
- No import coupling. A third-party embedding library does not need to import anything from FCC to be compatible. As long as it implements the right methods, it works.
- Duck typing. Existing embedding clients (e.g., openai.Embeddings, custom inference servers) can be wrapped with minimal adapter code.
- Testability. The MockEmbeddingProvider is trivially implementable and does not need to inherit from a base class.
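Structural typing can be seen directly with a runtime-checkable version of the Protocol. In this minimal sketch (the ThirdPartyEmbedder class and its four-dimensional vectors are invented for illustration), a class that never imports FCC still satisfies the Protocol:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class EmbeddingProvider(Protocol):
    @property
    def dimensions(self) -> int: ...
    def embed_batch(self, texts: list[str]) -> list[list[float]]: ...

class ThirdPartyEmbedder:
    """Knows nothing about FCC: no imports, no base class."""
    @property
    def dimensions(self) -> int:
        return 4
    def embed_batch(self, texts: list[str]) -> list[list[float]]:
        return [[0.0] * 4 for _ in texts]

# Structural check: satisfies the Protocol without inheriting from it.
print(isinstance(ThirdPartyEmbedder(), EmbeddingProvider))  # True
```

Note that runtime_checkable isinstance checks only verify that the members exist, not that their signatures match; static type checkers do the full signature comparison.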
The MockEmbeddingProvider¶
For testing and development, FCC ships a MockEmbeddingProvider that generates deterministic embeddings without calling any external API:
from fcc.search.providers import MockEmbeddingProvider
provider = MockEmbeddingProvider(dimensions=128)
vectors = provider.embed_batch(["market analysis", "technical design", "code review"])
assert len(vectors) == 3
assert len(vectors[0]) == 128
# Deterministic: same input always produces same output
vectors2 = provider.embed_batch(["market analysis"])
assert vectors[0] == vectors2[0]
The mock provider uses a hash-based algorithm to generate consistent vectors. Texts with similar words produce somewhat similar vectors, enabling basic semantic search testing without the cost and nondeterminism of real embedding models.
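The exact algorithm is internal to FCC, but the idea behind a hash-based mock can be sketched in a few lines: hash each token into a dimension bucket and count hits. This toy version (not the actual implementation) shows how determinism and rough word-overlap similarity fall out naturally:

```python
import hashlib

def mock_embed(text: str, dimensions: int = 128) -> list[float]:
    """Deterministic pseudo-embedding: hash each token to a bucket, count hits.

    Texts that share words bump the same dimensions, so they score as
    'somewhat similar' -- enough for testing, no model required.
    """
    vector = [0.0] * dimensions
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % dimensions
        vector[bucket] += 1.0
    return vector

# Deterministic: hashing has no randomness, so repeat calls agree exactly.
assert mock_embed("market analysis") == mock_embed("market analysis")
# One bump per token: a two-token text sums to 2.0.
assert sum(mock_embed("market analysis")) == 2.0
```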
When to Use Mock vs. Real Providers¶
| Scenario | Provider |
|---|---|
| Unit tests | Mock |
| CI pipeline | Mock |
| Development | Mock |
| Quality evaluation | Real |
| Production | Real |
The rule of thumb: use mock whenever you need determinism and speed. Use real providers when you need to evaluate the actual quality of semantic search results.
The SearchIndex¶
The SearchIndex is the core data structure for semantic search. It stores embeddings alongside metadata and supports nearest-neighbor queries:
from fcc.search.index import SearchIndex
index = SearchIndex(provider=provider)
# Index artifacts
index.add("doc_001", text="Competitive analysis of the SaaS market", metadata={
    "type": "finding",
    "persona": "research_analyst",
    "session": "session_42",
})
index.add("doc_002", text="System architecture for the data pipeline", metadata={
    "type": "design",
    "persona": "software_architect",
    "session": "session_42",
})
index.add("doc_003", text="Security review of API authentication", metadata={
    "type": "review",
    "persona": "security_auditor",
    "session": "session_43",
})
# Query
results = index.search("market competition and pricing", top_k=2)
for result in results:
    print(f"ID: {result.id}, Score: {result.score:.3f}")
    print(f"Metadata: {result.metadata}")
    print()
Index Operations¶
The SearchIndex supports:
- add(id, text, metadata): Index a single artifact.
- add_batch(items): Index multiple artifacts efficiently.
- search(query, top_k, filters): Find the nearest artifacts.
- remove(id): Remove an artifact from the index.
- save(path) / load(path): Persist and restore the index.
Filtering¶
Search results can be filtered by metadata:
results = index.search(
    "API security",
    top_k=5,
    filters={"type": "review", "session": "session_43"},
)
Filters are applied before the nearest-neighbor search, reducing the search space and improving relevance.
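One plausible reading of these filter semantics -- exact match on every key, combined with AND -- can be sketched independently of the FCC internals (the matches helper and the sample documents here are illustrative, not the library's code):

```python
def matches(metadata: dict, filters: dict) -> bool:
    """A document passes when every filter key matches its metadata exactly."""
    return all(metadata.get(key) == value for key, value in filters.items())

docs = {
    "doc_002": {"type": "design", "session": "session_42"},
    "doc_003": {"type": "review", "session": "session_43"},
}
filters = {"type": "review", "session": "session_43"}

# Pre-filter: only surviving candidates are scored by the vector search.
candidates = [doc_id for doc_id, meta in docs.items() if matches(meta, filters)]
print(candidates)  # ['doc_003']
```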
Implementing a Custom Provider¶
To use a real embedding model, implement the Protocol:
import openai

class OpenAIEmbeddingProvider:
    """Embedding provider using OpenAI's API."""

    def __init__(self, model: str = "text-embedding-3-small"):
        self._model = model
        self._client = openai.OpenAI()

    @property
    def dimensions(self) -> int:
        return 1536  # text-embedding-3-small

    def embed_batch(self, texts: list[str]) -> list[list[float]]:
        response = self._client.embeddings.create(
            model=self._model,
            input=texts,
        )
        return [item.embedding for item in response.data]

    def embed(self, text: str) -> list[float]:
        return self.embed_batch([text])[0]
This provider works with the SearchIndex without any registration step -- it satisfies the Protocol, so it is a valid provider.
Provider Registration¶
For plugin-based discovery, register your provider via entry points:
[project.entry-points."fcc.plugins.providers"]
openai_embeddings = "my_package:OpenAIEmbeddingProvider"
Integration with the Simulation Engine¶
Semantic search integrates with the simulation engine through the shared context. During a simulation, the Find phase can use the SearchIndex to retrieve relevant artifacts from previous simulations:
# In a Find-phase persona's prompt
search_index = shared_context.get("search_index")
results = search_index.search("prior market analyses for SaaS", top_k=3)
# Inject results into the persona's input
input_context = {
    "prior_findings": [r.text for r in results],
    "scenario": current_scenario,
}
This pattern enables cross-session learning: each simulation builds on the findings of previous simulations, creating an expanding knowledge base that gets more valuable over time.
Performance Considerations¶
Semantic search performance depends on three factors:
- Embedding speed. Batch embedding is 10--100x faster than single-text embedding. Always use embed_batch when indexing.
- Index size. For small corpora (< 10,000 artifacts), brute-force nearest-neighbor search is fast enough. For larger corpora, consider approximate nearest-neighbor (ANN) algorithms.
- Query latency. In production, cache frequently-used query embeddings and pre-filter by metadata to reduce search space.
The current implementation uses brute-force search, which is sufficient for the typical FCC corpus size. ANN integration (via libraries like Faiss or Annoy) is on the roadmap for Phase 14.
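For intuition, brute-force search is nothing more than scoring the query against every stored vector and keeping the best k -- O(N) per query, which is why ANN structures become attractive as the corpus grows. A minimal sketch, independent of the FCC classes:

```python
import math

def top_k(query: list[float], vectors: dict[str, list[float]],
          k: int) -> list[tuple[str, float]]:
    """Score every stored vector against the query; return the k best."""
    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a))
                      * math.sqrt(sum(x * x for x in b)))
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

vectors = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
print(top_k([1.0, 0.1], vectors, k=2))  # 'a' ranks first, then 'b'
```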
Key Takeaways¶
- Semantic search enables meaning-based artifact retrieval, surpassing keyword search for complex queries.
- The Protocol-based embedding provider design avoids import coupling and supports duck typing.
- MockEmbeddingProvider provides deterministic, cost-free testing. Use real providers only when evaluating quality.
- The SearchIndex stores embeddings with metadata and supports filtered nearest-neighbor queries.
- Cross-session learning uses semantic search to build on findings from previous simulations.
Cross-References¶
- Chapter 2: Knowledge Graphs -- structured knowledge representation
- Chapter 3: RAG Pipelines -- retrieval-augmented generation
- ADR-004: Embedding Provider Protocol -- design rationale
- FCC Guidebook, Chapter 17 -- knowledge federation reference
- See Notebook 15 for hands-on practice with embedding providers and search indices