# Embedding
Kura's embedding pipeline transforms text (such as conversation summaries) into high-dimensional vector representations. These embeddings are essential for downstream tasks like clustering, search, and visualization, enabling Kura to analyze and organize large volumes of conversational data.
## Overview
Embedding in Kura refers to the process of converting text into numerical vectors (embeddings) that capture semantic meaning. These vectors allow for efficient similarity search, clustering, and visualization of conversations and summaries.
- Input: A list of texts (e.g., conversation summaries, messages, or cluster descriptions)
- Output: A list of vector embeddings (`list[list[float]]`), typically one per input text
## The Embedding Model
Kura uses an `EmbeddingModel` (see `kura/embedding.py`) that implements the `BaseEmbeddingModel` interface. Multiple backends are supported:

- `OpenAIEmbeddingModel`: Uses OpenAI's API (e.g., `text-embedding-3-small`) for high-quality embeddings
- `SentenceTransformerEmbeddingModel`: Uses local models from the `sentence-transformers` library (e.g., `all-MiniLM-L6-v2`)
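For example, a backend might be instantiated like this (a hedged sketch: the constructor argument shown as `model_name` is an assumption, and the actual signatures live in `kura/embedding.py`):

```python
from kura.embedding import OpenAIEmbeddingModel, SentenceTransformerEmbeddingModel

# Hosted embeddings via OpenAI (requires an OpenAI API key in your environment).
# The keyword argument name is illustrative and may differ in the actual class.
openai_model = OpenAIEmbeddingModel(model_name="text-embedding-3-small")

# Local embeddings via sentence-transformers (no API key required).
local_model = SentenceTransformerEmbeddingModel(model_name="all-MiniLM-L6-v2")
```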
All embedding models must implement the following interface (see `kura/base_classes/embedding.py`):
```python
from abc import ABC, abstractmethod


class BaseEmbeddingModel(ABC):
    @abstractmethod
    async def embed(self, texts: list[str]) -> list[list[float]]:
        """Embed a list of texts into a list of lists of floats"""
        pass
```
## Key Features
- Batching: Texts are automatically split into batches for efficient processing
- Concurrency: Multiple batches are embedded in parallel (configurable concurrency)
- Retry Logic: Embedding requests are retried on failure for robustness
- Extensibility: New embedding backends can be added by subclassing `BaseEmbeddingModel` (see the sketch after this list)
- Checkpointing: Embeddings can be cached as part of the pipeline to avoid recomputation
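To illustrate the extensibility point above, the sketch below subclasses `BaseEmbeddingModel` with simple batching and bounded concurrency. It is not Kura's built-in implementation: the import path, `batch_size`, `max_concurrency`, and `_embed_batch` are illustrative choices, and the placeholder vectors stand in for a real backend call.

```python
import asyncio

from kura.base_classes import BaseEmbeddingModel  # import path is an assumption


class MyEmbeddingModel(BaseEmbeddingModel):
    def __init__(self, batch_size: int = 32, max_concurrency: int = 4):
        self.batch_size = batch_size
        self.semaphore = asyncio.Semaphore(max_concurrency)

    async def _embed_batch(self, batch: list[str]) -> list[list[float]]:
        # A real backend would call an API or local model here, with retry logic on failure.
        async with self.semaphore:
            return [[float(len(text))] for text in batch]  # placeholder vectors

    async def embed(self, texts: list[str]) -> list[list[float]]:
        # Split inputs into batches and embed them concurrently, bounded by the semaphore.
        batches = [texts[i : i + self.batch_size] for i in range(0, len(texts), self.batch_size)]
        results = await asyncio.gather(*(self._embed_batch(batch) for batch in batches))
        # Flatten the per-batch results back into one list, preserving input order.
        return [vector for batch in results for vector in batch]
```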
## Output: Embeddings
The result of embedding is a list of vectors, each representing an input text. Embeddings are typically attached to summaries or clusters for downstream analysis.
Example output for a batch of texts:
```python
embeddings = await embedding_model.embed([
    "Summarize the user's request.",
    "Cluster similar conversations together.",
])
# embeddings: list[list[float]]
```
When used in the pipeline, embeddings are stored in objects such as `ConversationSummary`:
```python
class ConversationSummary(BaseModel):
    chat_id: str
    summary: str
    ...
    embedding: Optional[list[float]] = None
```
- `embedding`: The vector representation of the summary (or other text)
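A pipeline step can therefore compute vectors and attach them to summaries roughly like this (a sketch based on the fields shown above; `summaries` is assumed to be a list of `ConversationSummary` objects and `embedding_model` any configured backend):

```python
vectors = await embedding_model.embed([summary.summary for summary in summaries])

# Attach each vector to the summary it was computed from (order is preserved).
for summary, vector in zip(summaries, vectors):
    summary.embedding = vector
```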
## Pipeline Integration
Embedding is a core step in Kura's analysis pipeline:
- Loading: Conversations are loaded from various sources
- Summarization: Each conversation is summarized
- Embedding: Summaries (or other texts) are embedded as vectors
- Clustering: Embeddings are grouped into clusters
- Visualization/Analysis: Clusters and embeddings are explored
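In code, the embedding step sits between summarization and clustering, roughly as follows (a purely illustrative sketch; the helper functions named here are hypothetical, not Kura's actual API):

```python
conversations = load_conversations(source)                # 1. Loading (hypothetical helper)
summaries = await summarize_conversations(conversations)  # 2. Summarization (hypothetical helper)
vectors = await embedding_model.embed(                    # 3. Embedding
    [str(summary) for summary in summaries]
)
clusters = cluster_embeddings(vectors)                    # 4. Clustering (hypothetical helper)
```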
## Embeddable Object Representations
All major objects that need to be embedded in Kura (such as `ConversationSummary`, `Cluster`, and `ProjectedCluster`) implement `__str__` methods. This ensures that each object can be converted to a meaningful text representation before embedding.
- Requirement: Any object passed to an embedding model must provide a `__str__` method that captures its semantic content.
- Examples: `ConversationSummary` uses a custom `__str__` to include summary, request, task, and other fields in a structured format. `Cluster` and `ProjectedCluster` use `__str__` to return their name and description.
This design allows embedding models to work generically with a variety of object types, as long as they implement a suitable `__str__` method.
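In practice, that means the same generic code path can embed any of these objects (a minimal sketch; the mixed list is hypothetical):

```python
# Works for ConversationSummary, Cluster, ProjectedCluster, or anything else
# with a meaningful __str__ implementation.
items = list(summaries) + list(clusters)
texts = [str(item) for item in items]
vectors = await embedding_model.embed(texts)
```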
## TODO: Additional Embedding Providers
- Support for other embedding providers (e.g., Cohere, HuggingFace Inference API, Google Vertex AI, local GPU models)
- Community contributions and suggestions are welcome!