# API Reference: Semantic Search
This document covers the FCC semantic search subsystem, which provides embedding-based document search with pluggable backends, persona-specific search, and action-specific search.
```mermaid
flowchart LR
    Q[Query Text] --> EP[EmbeddingProvider]
    EP -->|embed| V[Query Vector]
    V --> SI[SearchIndex]
    SI -->|cosine similarity| B[(InMemory or Numpy Backend)]
    B --> SR[SearchResult list]
    SR -->|ranked by score| Out[Top-K Results]
```
## EmbeddingProvider Protocol
fcc.search.embeddings.EmbeddingProvider is a runtime_checkable Protocol defining the contract for all embedding providers.
### Required Methods

| Method | Signature | Description |
|---|---|---|
| `embed` | `(text: str) -> tuple[float, ...]` | Embed a single text into a vector |
| `embed_batch` | `(texts: list[str]) -> list[tuple[float, ...]]` | Embed multiple texts |
| `dimension` | `() -> int` | Return the vector dimensionality |
Any class implementing these three methods satisfies the protocol via structural subtyping.
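To illustrate structural subtyping, here is a self-contained sketch: the protocol below is a local re-statement of the documented contract (the real one lives in fcc.search.embeddings), and ZeroProvider is a toy class invented for this example.

```python
from typing import Protocol, runtime_checkable

# Local re-statement of the documented contract, for illustration only.
@runtime_checkable
class EmbeddingProvider(Protocol):
    def embed(self, text: str) -> tuple[float, ...]: ...
    def embed_batch(self, texts: list[str]) -> list[tuple[float, ...]]: ...
    def dimension(self) -> int: ...

class ZeroProvider:
    """Toy provider: every text maps to an 8-dimensional zero vector."""

    def embed(self, text: str) -> tuple[float, ...]:
        return (0.0,) * 8

    def embed_batch(self, texts: list[str]) -> list[tuple[float, ...]]:
        return [self.embed(t) for t in texts]

    def dimension(self) -> int:
        return 8

# Structural subtyping: no inheritance from the protocol is needed.
assert isinstance(ZeroProvider(), EmbeddingProvider)
```

Note that runtime_checkable isinstance checks only verify that the methods exist, not that their signatures match.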
## MockEmbeddingProvider
fcc.search.embeddings.MockEmbeddingProvider produces deterministic 384-dimensional vectors using MD5 hashing, matching the dimensionality of all-MiniLM-L6-v2. It is suitable for testing without external dependencies.
```python
from fcc.search.embeddings import MockEmbeddingProvider

provider = MockEmbeddingProvider()
vec = provider.embed("research coordinator")
print(len(vec))              # 384
print(provider.dimension())  # 384

# Batch embedding
vecs = provider.embed_batch(["text 1", "text 2", "text 3"])
print(len(vecs))             # 3
```
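For intuition, one plausible way such a deterministic, hash-based embedding could be built is to expand an MD5 digest into the required number of floats. This is a sketch of the general technique, not the library's actual implementation:

```python
import hashlib

def mock_embed(text: str, dim: int = 384) -> tuple[float, ...]:
    """Deterministic pseudo-embedding: expand MD5 digests into `dim` floats."""
    values: list[float] = []
    counter = 0
    while len(values) < dim:
        # Each salted digest yields 16 bytes; keep hashing until we have enough.
        digest = hashlib.md5(f"{text}:{counter}".encode()).digest()
        values.extend(b / 255.0 for b in digest)  # map bytes into [0, 1]
        counter += 1
    return tuple(values[:dim])

vec = mock_embed("research coordinator")
assert len(vec) == 384
assert vec == mock_embed("research coordinator")  # same input, same vector
```

The key property is determinism: identical inputs always produce identical vectors, so tests are reproducible without loading a model.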
## SentenceTransformerProvider
fcc.search.embeddings.SentenceTransformerProvider wraps the sentence-transformers library for production-grade embeddings. The model is lazy-loaded on first use.
```python
from fcc.search.embeddings import SentenceTransformerProvider, st_available

if st_available():
    provider = SentenceTransformerProvider(model_name="all-MiniLM-L6-v2")
    vec = provider.embed("research coordinator")
    print(provider.dimension())  # 384
```
Raises ImportError at construction time if sentence-transformers is not installed.
## SearchIndex
fcc.search.index.SearchIndex combines an EmbeddingProvider with a SearchBackend into a high-level search API with persistence support and an embedding cache.
### Adding Documents
```python
from fcc.search.index import SearchIndex

index = SearchIndex()  # Uses MockEmbeddingProvider + InMemoryBackend
index.add_document("doc1", "Research methodology overview", {"type": "guide"})
index.add_document("doc2", "Code review best practices", {"type": "guide"})
index.add_document("doc3", "API design patterns", {"type": "reference"})
print(index.count())  # 3
```
### Searching
```python
results = index.search("research methods", k=5)
for r in results:
    print(f"{r.doc_id}: {r.score:.3f} - {r.text[:50]}")
```
Each SearchResult has doc_id, text, score, and metadata fields.
### Remove and Check
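The body of this section is missing from the source. As a self-contained sketch of the removal and membership pattern a dict-backed index typically follows (the method names `remove_document` and `contains` are assumptions for illustration, not confirmed fcc API; consult the SearchIndex source for the real names):

```python
class ToyIndex:
    """Dict-backed sketch of document removal and membership checks.

    Method names here are hypothetical, chosen to mirror add_document.
    """

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}

    def add_document(self, doc_id: str, text: str) -> None:
        self._docs[doc_id] = text

    def contains(self, doc_id: str) -> bool:
        return doc_id in self._docs

    def remove_document(self, doc_id: str) -> bool:
        # Returns True if the document existed and was removed.
        return self._docs.pop(doc_id, None) is not None

idx = ToyIndex()
idx.add_document("doc1", "Research methodology overview")
assert idx.contains("doc1")
assert idx.remove_document("doc1")
assert not idx.contains("doc1")
```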
### Persistence
```python
from pathlib import Path

# Save to JSON
index.save(Path("/tmp/my_index.json"))

# Load from JSON
restored = SearchIndex.load(Path("/tmp/my_index.json"))
```
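The on-disk format is not specified here. As a minimal sketch of the JSON round-trip pattern such persistence typically uses, assuming documents, cached vectors, and metadata serialize as plain JSON values (the layout below is illustrative, not the library's actual schema):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical serialized shape: doc_id -> text, vector, metadata.
docs = {
    "doc1": {
        "text": "Research methodology overview",
        "vector": [0.1, 0.2, 0.3],
        "metadata": {"type": "guide"},
    },
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "my_index.json"
    path.write_text(json.dumps(docs))        # save
    restored = json.loads(path.read_text())  # load
    assert restored == docs                  # lossless round trip
```

Storing the vectors alongside the text is what lets a loaded index skip re-embedding every document.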
## Backends
Two backend implementations are available:
| Backend | Import | Description |
|---|---|---|
| `InMemoryBackend` | `fcc.search.index` | Pure-Python dict-based storage with cosine similarity |
| `NumpyBackend` | `fcc.search.index` | Numpy-accelerated batch cosine similarity (requires numpy) |
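Both backends rank results by cosine similarity: the dot product of two vectors divided by the product of their norms. A pure-Python version of the metric (a sketch of the computation, not the backend's actual code) looks like this:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = dot(a, b) / (|a| * |b|); 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # guard against zero vectors
    return dot / (norm_a * norm_b)

assert cosine_similarity([1.0, 0.0], [1.0, 0.0]) == 1.0  # identical direction
assert cosine_similarity([1.0, 0.0], [0.0, 1.0]) == 0.0  # orthogonal
```

A numpy backend computes the same quantity for all stored vectors at once via a single matrix-vector product, which is why it is faster for large indexes.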
## PersonaSearchIndex
fcc.search.persona_search.PersonaSearchIndex enables natural-language discovery of personas by searching over concatenated role, archetype, and responsibility text.
```python
from fcc.search.persona_search import PersonaSearchIndex
from fcc.personas.registry import PersonaRegistry
from fcc._resources import get_personas_dir

registry = PersonaRegistry.from_yaml_directory(get_personas_dir())
psi = PersonaSearchIndex.from_registry(registry)

# Search by natural language
results = psi.search_personas("security and compliance auditing", k=5)
for r in results:
    print(f"{r.doc_id}: {r.score:.3f} ({r.metadata.get('name')})")

# Find similar personas
similar = psi.similar_personas("RC", k=3)
```
### How It Works
Each persona is indexed with:
- `doc_id`: The persona's ID (e.g., `"RC"`)
- `text`: Concatenated riscear.role + riscear.archetype + riscear.responsibilities
- `metadata`: category, name, persona_id
## ActionSearchIndex
fcc.search.action_search.ActionSearchIndex enables natural-language discovery of workflow actions by searching over description and execution steps.
```python
from fcc.search.action_search import ActionSearchIndex
from fcc.workflow.actions import WorkflowActionRegistry
from fcc._resources import get_actions_dir

action_registry = WorkflowActionRegistry.from_yaml_directory(get_actions_dir())
asi = ActionSearchIndex.from_registry(action_registry)

# Search by natural language
results = asi.search_actions("generate test coverage report", k=5)
for r in results:
    print(f"{r.doc_id}: {r.score:.3f}")
```
### How It Works
Each action is indexed with:
- `doc_id`: `"{persona_id}:{action_type}"` (e.g., `"RC:scaffold"`)
- `text`: Concatenated description + execution_steps
- `metadata`: persona_id, action_type
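The composite doc_id scheme described above can be sketched as a simple formatting helper (the function name is hypothetical; only the `"{persona_id}:{action_type}"` format comes from the source):

```python
def action_doc_id(persona_id: str, action_type: str) -> str:
    """Compose the documented '{persona_id}:{action_type}' document ID."""
    return f"{persona_id}:{action_type}"

assert action_doc_id("RC", "scaffold") == "RC:scaffold"
```

Encoding both fields into the ID keeps each persona/action pair unique in the index and lets callers recover both parts with a single `split(":")`.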
## Import Paths Summary
| Class | Import Path |
|---|---|
| `EmbeddingProvider` | `fcc.search.embeddings` |
| `MockEmbeddingProvider` | `fcc.search.embeddings` |
| `SentenceTransformerProvider` | `fcc.search.embeddings` |
| `SearchIndex` | `fcc.search.index` |
| `InMemoryBackend` | `fcc.search.index` |
| `NumpyBackend` | `fcc.search.index` |
| `PersonaSearchIndex` | `fcc.search.persona_search` |
| `ActionSearchIndex` | `fcc.search.action_search` |
| `SearchResult` | `fcc.search.models` |
## See Also
- Knowledge Graph API -- Graph construction and export
- RAG API -- Retrieval-augmented generation using search
- Personas API -- Persona models and registry