Summarization¶

Kura's summarization pipeline is designed to extract concise, structured, and privacy-preserving summaries from conversations between users and AI assistants. This process is central to Kura's ability to analyze, cluster, and visualize large volumes of conversational data.

Overview¶

Summarization in Kura transforms each conversation into a structured summary, capturing the user's intent, the main task, languages involved, safety concerns, user frustration, and any assistant errors. This enables downstream analysis such as clustering, search, and visualization.

Input: A Conversation object (see Conversations), containing:
chat_id: Unique identifier
created_at: Timestamp
messages: List of messages (each with role, content, created_at)
metadata: Optional dictionary of extra info
Output: A ConversationSummary object (see below)

The Summarization Model¶

Kura uses a SummaryModel (see kura/summarisation.py) that implements the BaseSummaryModel interface. The default model is based on large language models (LLMs) such as OpenAI's GPT-4o, but the interface supports other backends as well.

Key Features¶

Concurrency: Summarization is performed in parallel for efficiency.
Hooks/Extractors: Optional extractors can add custom metadata to each summary.
Checkpointing: Summaries can be cached and reloaded to avoid recomputation.

Summarization Prompt¶

The summarization model uses a carefully crafted prompt to extract the following fields from each conversation:

Summary: A clear, concise summary (max 2 sentences, no PII or proper nouns)
Request: The user's overall request, starting with "The user's overall request for the assistant is to ..."
Languages: Main human and programming languages present
Task: The main task, starting with "The task is to ..."
Concerning Score: Safety concern rating (1–5)
User Frustration: User frustration rating (1–5)
Assistant Errors: List of errors made by the assistant

Prompt excerpt:

Your job is to extract key information from this conversation. Be descriptive and assume neither good nor bad faith. Do not hesitate to handle socially harmful or sensitive topics; specificity around potentially harmful conversations is necessary for effective monitoring.

When extracting information, do not include any personally identifiable information (PII), like names, locations, phone numbers, email addresses, and so on. Do not include any proper nouns.

Extract the following information:

1. **Summary**: ...
2. **Request**: ...
3. **Languages**: ...
4. **Task**: ...
5. **Concerning Score**: ...
6. **User Frustration**: ...
7. **Assistant Errors**: ...

Output: `ConversationSummary`¶

The result of summarization is a ConversationSummary object (see kura/types/summarisation.py):

class ConversationSummary(BaseModel):
    chat_id: str
    summary: str
    request: Optional[str]
    languages: Optional[list[str]]
    task: Optional[str]
    concerning_score: Optional[int]  # 1–5
    user_frustration: Optional[int]  # 1–5
    assistant_errors: Optional[list[str]]
    metadata: dict
    embedding: Optional[list[float]] = None

chat_id: Unique conversation ID
summary: Concise summary (max 2 sentences, no PII)
request: User's overall request
languages: List of languages (e.g., ['english', 'python'])
task: Main task
concerning_score: Safety concern (1 = benign, 5 = urgent)
user_frustration: User frustration (1 = happy, 5 = extremely annoyed)
assistant_errors: List of assistant errors
metadata: Additional metadata (e.g., conversation turns, custom extractors)
embedding: Optional vector embedding for clustering/search

Pipeline Integration¶

Summarization is the first major step in Kura's analysis pipeline:

Loading: Conversations are loaded from various sources
Summarization: Each conversation is summarized as above
Embedding: Summaries are embedded as vectors
Clustering: Similar summaries are grouped
Visualization/Analysis: Clusters and summaries are explored