Semantic Search

Duration: 45 minutes
Level: Advanced
Module: fcc.search

This tutorial teaches you how to use the FCC semantic search module to discover personas and workflow actions using natural language queries. You will work with the EmbeddingProvider protocol, SearchIndex with pluggable backends, PersonaSearchIndex for persona discovery, and ActionSearchIndex for action discovery.

Prerequisites

Completed beginner/intermediate tutorials
Familiarity with PersonaRegistry and WorkflowActionRegistry
Basic understanding of embedding vectors and cosine similarity

EmbeddingProvider Protocol

The EmbeddingProvider protocol defines the contract for all embedding backends. Any class that implements embed, embed_batch, and dimension satisfies the protocol:

from fcc.search.embeddings import EmbeddingProvider, MockEmbeddingProvider

# The MockEmbeddingProvider produces deterministic 384-d vectors
# using hash-based generation (matches all-MiniLM-L6-v2 dimensionality)
provider = MockEmbeddingProvider()

# Embed a single text
vec = provider.embed("research methodology for data governance")
print(f"Dimension: {provider.dimension()}")  # 384
print(f"Vector length: {len(vec)}")           # 384
print(f"First 5 values: {vec[:5]}")

# Batch embedding
texts = ["persona validation", "workflow orchestration", "quality gates"]
vecs = provider.embed_batch(texts)
print(f"Batch size: {len(vecs)}")
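
To make the contract concrete, here is a minimal sketch of what a protocol with these three methods and a hash-based provider could look like. It is an illustration only: EmbeddingProviderSketch and HashEmbeddingProvider are hypothetical names, the SHA-256 scheme is an assumption, and the actual fcc.search.embeddings definitions may differ.

```python
import hashlib
import math
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class EmbeddingProviderSketch(Protocol):
    """Hypothetical stand-in for the three-method contract described above."""

    def embed(self, text: str) -> List[float]: ...

    def embed_batch(self, texts: List[str]) -> List[List[float]]: ...

    def dimension(self) -> int: ...


class HashEmbeddingProvider:
    """Deterministic hash-based vectors (assumed scheme, no semantic meaning)."""

    def __init__(self, dim: int = 384) -> None:
        self._dim = dim

    def dimension(self) -> int:
        return self._dim

    def embed(self, text: str) -> List[float]:
        values: List[float] = []
        counter = 0
        # Re-hash with a counter until enough bytes are available.
        while len(values) < self._dim:
            digest = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
            # Map each byte (0..255) into the range [-1, 1].
            values.extend(b / 127.5 - 1.0 for b in digest)
            counter += 1
        values = values[: self._dim]
        # Normalize to unit length so a dot product equals cosine similarity.
        norm = math.sqrt(sum(v * v for v in values)) or 1.0
        return [v / norm for v in values]

    def embed_batch(self, texts: List[str]) -> List[List[float]]:
        return [self.embed(t) for t in texts]


sketch_provider = HashEmbeddingProvider()
# Structural typing: no inheritance is needed to satisfy the protocol.
print(isinstance(sketch_provider, EmbeddingProviderSketch))
print(sketch_provider.embed("quality gates") == sketch_provider.embed("quality gates"))
```

Because the vectors are derived from a hash, identical text always yields an identical vector, which keeps tests reproducible. Similar text does not yield similar vectors, so a provider like this is suited to exercising the plumbing rather than judging search quality.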

Optional: SentenceTransformerProvider

When sentence-transformers is installed, you can use real neural embeddings:

from fcc.search.embeddings import SentenceTransformerProvider, st_available

if st_available():
    real_provider = SentenceTransformerProvider(
        model_name="all-MiniLM-L6-v2"
    )
    vec = real_provider.embed("data governance persona")
    print(f"Real embedding dimension: {real_provider.dimension()}")
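
An availability check like the one above can be written without importing the heavy dependency at all. The sketch below uses the standard library's importlib.util.find_spec. The helper names package_available and choose_provider_name are hypothetical, and this is not necessarily how st_available is implemented.

```python
import importlib.util


def package_available(name: str) -> bool:
    """Return True if a top-level package can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def choose_provider_name() -> str:
    """Pick a provider label based on what is installed (labels are illustrative)."""
    # "sentence_transformers" is the import name of the sentence-transformers package.
    if package_available("sentence_transformers"):
        return "sentence-transformer"
    return "mock"


print(f"Selected provider: {choose_provider_name()}")
```

Falling back to a mock provider keeps the rest of the code path identical whether or not the optional package is present.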

SearchIndex

The SearchIndex class combines an EmbeddingProvider with a storage backend to provide a high-level search API. It supports two backends: InMemoryBackend (pure Python) and NumpyBackend (NumPy-accelerated).

from fcc.search.embeddings import MockEmbeddingProvider
from fcc.search.index import SearchIndex

# Create an index with the default in-memory backend
index = SearchIndex(provider=MockEmbeddingProvider())

# Add documents
index.add_document("doc1", "Research methodology and literature review")
index.add_document("doc2", "Blueprint design and architecture patterns")
index.add_document("doc3", "Documentation quality and review processes")
index.add_document("doc4", "Data governance and compliance auditing")

print(f"Indexed documents: {index.count()}")  # 4

# Search
results = index.search("governance compliance", k=3)
for r in results:
    print(f"  [{r.score:.3f}] {r.doc_id}: {r.text[:60]}")
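
The scores printed above come from comparing the query vector with each stored vector. As a rough sketch of what a pure-Python backend might do internally, the following ranks toy vectors by cosine similarity. The function names and the two-dimensional vectors are illustrative assumptions, not the InMemoryBackend implementation.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k (doc_id, score) pairs with the highest similarity, best first."""
    scored = (
        (doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 2-d vectors for illustration; real embeddings have hundreds of dimensions.
toy_vectors = {"doc1": [1.0, 0.0], "doc2": [0.0, 1.0], "doc3": [1.0, 1.0]}
ranked = top_k([1.0, 0.1], toy_vectors, k=2)
for doc_id, score in ranked:
    print(f"  [{score:.3f}] {doc_id}")
```

A linear scan like this costs one similarity computation per stored document, which is usually acceptable for small indexes. A NumPy-backed variant would typically replace the loop with a single matrix-vector product.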

Index Persistence

Save and reload an index from JSON:

from pathlib import Path

# Save
index.save(Path("my_index.json"))

# Load
restored = SearchIndex.load(
    Path("my_index.json"),
    provider=MockEmbeddingProvider(),
)
print(f"Restored: {restored.count()} documents")

# Search the restored index
results = restored.search("architecture patterns", k=2)
for r in results:
    print(f"  [{r.score:.3f}] {r.doc_id}")
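
For intuition about JSON persistence, here is a small round trip that stores document text and vectors and reads them back. The file layout (a version field plus a documents mapping) and the helper names are assumptions made for illustration; the format that SearchIndex.save actually writes may differ.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def save_documents(path: Path, documents: Dict[str, dict]) -> None:
    """Write documents (text plus vector) to a JSON file."""
    payload = {"version": 1, "documents": documents}
    path.write_text(json.dumps(payload), encoding="utf-8")


def load_documents(path: Path) -> Dict[str, dict]:
    """Read documents back; raises KeyError if the layout is unexpected."""
    payload = json.loads(path.read_text(encoding="utf-8"))
    return payload["documents"]


# Toy data for illustration only.
documents = {
    "doc1": {"text": "Research methodology", "vector": [0.25, -0.5, 0.125]},
    "doc2": {"text": "Blueprint design", "vector": [0.0, 1.0, -0.75]},
}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "toy_index.json"
    save_documents(target, documents)
    restored_documents = load_documents(target)

print(f"Round trip preserved data: {restored_documents == documents}")
```

A file like this holds stored vectors but not the model that produced them, which is one plausible reason a provider is passed again at load time: new queries must be embedded in the same vector space as the saved documents.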

Document Management

Check whether a document is in the index, or remove it by ID:

# Check if a document exists
print(index.contains("doc1"))  # True

# Remove a document
removed = index.remove("doc1")
print(f"Removed: {removed}")   # True
print(f"Count: {index.count()}")  # 3

PersonaSearchIndex

The PersonaSearchIndex enables natural-language discovery of personas by indexing their R.I.S.C.E.A.R. role, archetype, and responsibilities:

from fcc.personas.registry import PersonaRegistry
from fcc.search.persona_search import PersonaSearchIndex

# Build from the persona registry
registry = PersonaRegistry.from_yaml_directory("src/fcc/data/personas")
persona_index = PersonaSearchIndex.from_registry(registry)

# Search for personas by natural language
results = persona_index.search_personas(
    "data governance and compliance auditing",
    k=5,
)
for r in results:
    print(f"  [{r.score:.3f}] {r.metadata['persona_id']}: "
          f"{r.metadata['name']} ({r.metadata['category']})")
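
One way to picture the indexing step is as a function that flattens a persona into a single searchable string and keeps the identifying fields as metadata. The sketch below assumes a simplified persona shape: PersonaSketch, its field names, and the sample values are hypothetical and do not reflect the actual FCC persona schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PersonaSketch:
    """Hypothetical, simplified persona record used only for this illustration."""

    persona_id: str
    name: str
    category: str
    role: str
    archetype: str
    responsibilities: List[str] = field(default_factory=list)


def persona_to_document(persona: PersonaSketch) -> Tuple[str, str, Dict[str, str]]:
    """Return (doc_id, searchable text, metadata) for one persona."""
    parts = [persona.role, persona.archetype, *persona.responsibilities]
    # Skip blank fields so they do not add empty sentences to the text.
    text = ". ".join(p.strip() for p in parts if p.strip())
    metadata = {
        "persona_id": persona.persona_id,
        "name": persona.name,
        "category": persona.category,
    }
    return persona.persona_id, text, metadata


# Sample values are invented for illustration.
example = PersonaSketch(
    persona_id="RC",
    name="Research Crafter",
    category="example-category",
    role="Gathers and synthesizes research",
    archetype="Investigator",
    responsibilities=["Review literature", "  ", "Summarize findings"],
)
doc_id, text, metadata = persona_to_document(example)
print(doc_id, "->", text)
```

Keeping the metadata separate from the embedded text means search results can report fields such as the persona name and category without those fields influencing the similarity score.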

Finding Similar Personas

Discover personas that are functionally similar to a given persona:

# Find personas similar to the Research Crafter
similar = persona_index.similar_personas("RC", k=5)
for r in similar:
    print(f"  [{r.score:.3f}] {r.metadata['persona_id']}: "
          f"{r.metadata['name']}")
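
A similarity lookup of this kind can be thought of as a search that uses an existing item's vector as the query and leaves that item out of the results. The sketch below shows the idea on toy vectors. Apart from "RC", the IDs are made up, and the function is an assumption about the general technique rather than the similar_personas implementation.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similar_items(
    item_id: str,
    vectors: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Rank all other items by similarity to item_id, best first."""
    query = vectors[item_id]  # raises KeyError for an unknown ID
    scored = [
        (other_id, cosine(query, vec))
        for other_id, vec in vectors.items()
        if other_id != item_id  # an item is trivially most similar to itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-d vectors; "P1".."P3" are made-up persona IDs.
persona_vectors = {
    "RC": [1.0, 0.0, 0.0],
    "P1": [0.9, 0.1, 0.0],
    "P2": [0.0, 1.0, 0.0],
    "P3": [0.5, 0.5, 0.0],
}
neighbours = similar_items("RC", persona_vectors, k=2)
for other_id, score in neighbours:
    print(f"  [{score:.3f}] {other_id}")
```

Excluding the query item matters because its similarity to itself is always the maximum, so it would otherwise occupy the first result slot.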

ActionSearchIndex

The ActionSearchIndex enables discovery of workflow actions by searching over action descriptions and execution steps:

from fcc.workflow.actions import WorkflowActionRegistry
from fcc.search.action_search import ActionSearchIndex

# Build from the action registry
action_registry = WorkflowActionRegistry.from_yaml_directory(
    "src/fcc/data/personas/actions"
)
action_index = ActionSearchIndex.from_registry(action_registry)

# Search for actions
results = action_index.search_actions(
    "generate documentation from source code",
    k=5,
)
for r in results:
    print(f"  [{r.score:.3f}] {r.doc_id}: {r.text[:80]}...")
    print(f"    Persona: {r.metadata['persona_id']}, "
          f"Type: {r.metadata['action_type']}")

Summary

In this tutorial you learned how to:

Use the EmbeddingProvider protocol with MockEmbeddingProvider (384-d hash-based vectors)
Build and query a SearchIndex with pluggable backends and persistence
Discover personas by natural language with PersonaSearchIndex
Find similar personas using embedding similarity
Search workflow actions with ActionSearchIndex

Next Steps

RAG Pipeline -- Combine search with generation for question answering
Knowledge Graphs -- Represent relationships as a queryable graph