Summarization
Kura's summarization pipeline extracts concise, structured, privacy-preserving summaries from conversations between users and AI assistants. These summaries are what allow Kura to analyze, cluster, and visualize large volumes of conversational data.
Overview
Summarization in Kura transforms each conversation into a structured summary, capturing the user's intent, the main task, languages involved, safety concerns, user frustration, and any assistant errors. This enables downstream analysis such as clustering, search, and visualization.
- Input: A Conversation object (see Conversations), containing:
  - chat_id: Unique identifier
  - created_at: Timestamp
  - messages: List of messages (each with role, content, created_at)
  - metadata: Optional dictionary of extra info
- Output: A ConversationSummary object (see below)
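To make the input shape concrete, here is a minimal sketch of a conversation with the fields listed above. The Message and Conversation dataclasses are hypothetical stand-ins written with the standard library, not Kura's actual classes, and the sample values are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class Message:
    """Hypothetical stand-in for one message in a conversation."""
    role: str            # e.g. "user" or "assistant"
    content: str
    created_at: datetime


@dataclass
class Conversation:
    """Hypothetical stand-in mirroring the input fields described above."""
    chat_id: str
    created_at: datetime
    messages: list[Message]
    metadata: Optional[dict[str, Any]] = None  # optional extra info


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
conversation = Conversation(
    chat_id="chat-001",
    created_at=now,
    messages=[
        Message(role="user", content="How do I reverse a list?", created_at=now),
        Message(role="assistant", content="Use items[::-1].", created_at=now),
    ],
    metadata={"source": "example"},
)
```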
The Summarization Model
Kura uses a SummaryModel (see kura/summarisation.py) that implements the BaseSummaryModel interface. The default model is based on large language models (LLMs) such as OpenAI's GPT-4o, but the interface supports other backends as well.
Key Features
- Concurrency: Summarization is performed in parallel for efficiency.
- Hooks/Extractors: Optional extractors can add custom metadata to each summary.
- Checkpointing: Summaries can be cached and reloaded to avoid recomputation.
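The three features above can be sketched together in a few lines. This is a minimal illustration under stated assumptions, not Kura's implementation: summarise_one stands in for an LLM call, turn_count_extractor is a hypothetical extractor, and the JSONL checkpoint format and max_concurrent parameter are choices made for this example.

```python
import asyncio
import json
import os
import tempfile


async def summarise_one(conv: dict) -> dict:
    """Placeholder for a per-conversation LLM call."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"chat_id": conv["chat_id"], "summary": conv["text"][:40], "metadata": {}}


def turn_count_extractor(conv: dict) -> dict:
    """Hypothetical extractor: returns custom metadata for a conversation."""
    return {"turns": conv["text"].count("\n") + 1}


async def summarise_all(convs, checkpoint_path, extractors=(), max_concurrent=4):
    # Checkpointing: reuse cached summaries if the file already exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            return [json.loads(line) for line in f]

    semaphore = asyncio.Semaphore(max_concurrent)  # cap in-flight requests

    async def worker(conv):
        async with semaphore:
            summary = await summarise_one(conv)
        # Hooks: merge each extractor's output into the summary metadata.
        for extractor in extractors:
            summary["metadata"].update(extractor(conv))
        return summary

    # gather returns results in input order even though tasks run concurrently.
    results = await asyncio.gather(*(worker(c) for c in convs))

    with open(checkpoint_path, "w") as f:
        for summary in results:
            f.write(json.dumps(summary) + "\n")
    return list(results)


convs = [{"chat_id": f"c{i}", "text": f"question {i}\nanswer {i}"} for i in range(3)]
path = os.path.join(tempfile.mkdtemp(), "summaries.jsonl")
first = asyncio.run(summarise_all(convs, path, extractors=[turn_count_extractor]))
second = asyncio.run(summarise_all(convs, path))  # served from the checkpoint
```

The second call never reaches summarise_one, because the checkpoint file written by the first call is loaded instead.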
Summarization Prompt
The summarization model uses a carefully crafted prompt to extract the following fields from each conversation:
- Summary: A clear, concise summary (max 2 sentences, no PII or proper nouns)
- Request: The user's overall request, starting with "The user's overall request for the assistant is to ..."
- Languages: Main human and programming languages present
- Task: The main task, starting with "The task is to ..."
- Concerning Score: Safety concern rating (1–5)
- User Frustration: User frustration rating (1–5)
- Assistant Errors: List of errors made by the assistant
Prompt excerpt:
Your job is to extract key information from this conversation. Be descriptive and assume neither good nor bad faith. Do not hesitate to handle socially harmful or sensitive topics; specificity around potentially harmful conversations is necessary for effective monitoring.
When extracting information, do not include any personally identifiable information (PII), like names, locations, phone numbers, email addresses, and so on. Do not include any proper nouns.
Extract the following information:
1. **Summary**: ...
2. **Request**: ...
3. **Languages**: ...
4. **Task**: ...
5. **Concerning Score**: ...
6. **User Frustration**: ...
7. **Assistant Errors**: ...
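Because the prompt fixes the opening words of two fields and the range of two scores, a model's output can be sanity-checked after extraction. The sketch below shows one way to do that; check_extraction and the sample dictionaries are hypothetical and not part of Kura.

```python
REQUEST_PREFIX = "The user's overall request for the assistant is to"
TASK_PREFIX = "The task is to"


def check_extraction(fields: dict) -> list[str]:
    """Return a list of constraint violations (empty if there are none)."""
    problems = []
    if not fields.get("request", "").startswith(REQUEST_PREFIX):
        problems.append("request must start with the fixed prefix")
    if not fields.get("task", "").startswith(TASK_PREFIX):
        problems.append("task must start with the fixed prefix")
    for key in ("concerning_score", "user_frustration"):
        score = fields.get(key)
        if not isinstance(score, int) or not 1 <= score <= 5:
            problems.append(f"{key} must be an integer from 1 to 5")
    return problems


good = {
    "request": "The user's overall request for the assistant is to fix a failing test.",
    "task": "The task is to debug a unit test.",
    "concerning_score": 1,
    "user_frustration": 2,
}
# Same fields, but with a task that lacks the prefix and an out-of-range score.
bad = dict(good, task="Debug a unit test.", concerning_score=7)

good_problems = check_extraction(good)
bad_problems = check_extraction(bad)
```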
Output: ConversationSummary
The result of summarization is a ConversationSummary object (see kura/types/summarisation.py):
from typing import Optional

# Assumption: BaseModel is pydantic's BaseModel.
from pydantic import BaseModel


class ConversationSummary(BaseModel):
    chat_id: str
    summary: str
    request: Optional[str]
    languages: Optional[list[str]]
    task: Optional[str]
    concerning_score: Optional[int]  # 1–5
    user_frustration: Optional[int]  # 1–5
    assistant_errors: Optional[list[str]]
    metadata: dict
    embedding: Optional[list[float]] = None
- chat_id: Unique conversation ID
- summary: Concise summary (max 2 sentences, no PII)
- request: User's overall request
- languages: List of languages (e.g., ['english', 'python'])
- task: Main task
- concerning_score: Safety concern (1 = benign, 5 = urgent)
- user_frustration: User frustration (1 = happy, 5 = extremely annoyed)
- assistant_errors: List of assistant errors
- metadata: Additional metadata (e.g., conversation turns, custom extractors)
- embedding: Optional vector embedding for clustering/search
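As an example of how the two score fields might be used downstream, the sketch below flags summaries for review. SummaryRecord is a hypothetical stand-in holding a subset of the fields above, and the threshold of 4 is an arbitrary choice for this example, not a value defined by Kura.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SummaryRecord:
    """Hypothetical subset of the ConversationSummary fields."""
    chat_id: str
    summary: str
    concerning_score: Optional[int] = None  # 1 = benign, 5 = urgent
    user_frustration: Optional[int] = None  # 1 = happy, 5 = extremely annoyed


def needs_review(record: SummaryRecord, threshold: int = 4) -> bool:
    """Flag a record when either score is present and at or above the threshold."""
    scores = (record.concerning_score, record.user_frustration)
    return any(score is not None and score >= threshold for score in scores)


records = [
    SummaryRecord("c1", "The user asks for help debugging a script.", 1, 2),
    SummaryRecord("c2", "The user repeatedly reports that the fix fails.", 1, 5),
    SummaryRecord("c3", "The user asks a question; no scores were extracted."),
]
flagged = [r.chat_id for r in records if needs_review(r)]
```

Since both scores are optional, the check treats a missing score as "not flagged" rather than raising an error.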
Pipeline Integration
Summarization is the first major analysis step in Kura's pipeline, running right after conversations are loaded. The full sequence is:
- Loading: Conversations are loaded from various sources
- Summarization: Each conversation is summarized as above
- Embedding: Summaries are embedded as vectors
- Clustering: Similar summaries are grouped
- Visualization/Analysis: Clusters and summaries are explored
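The stages above compose as a chain in which each step consumes the previous step's output. The sketch below shows that data flow with toy placeholders: every function here is hypothetical, the "embedding" is a two-keyword indicator vector, and the "clustering" simply groups identical vectors. None of it reflects Kura's actual models or API.

```python
from collections import defaultdict


def load(raw):
    """Loading: turn (id, text) pairs into conversation dicts."""
    return [{"chat_id": cid, "text": text} for cid, text in raw]


def summarise(convs):
    """Summarization placeholder: keep only the first sentence."""
    return [
        {"chat_id": c["chat_id"], "summary": c["text"].split(".")[0]} for c in convs
    ]


def embed(summaries):
    """Embedding placeholder: one indicator per keyword in a tiny vocabulary."""
    vocab = ["python", "sql"]
    for s in summaries:
        s["embedding"] = [float(word in s["summary"].lower()) for word in vocab]
    return summaries


def cluster(summaries):
    """Clustering placeholder: group chat IDs that share an identical vector."""
    groups = defaultdict(list)
    for s in summaries:
        groups[tuple(s["embedding"])].append(s["chat_id"])
    return dict(groups)


raw = [
    ("c1", "Fix a Python loop. Thanks"),
    ("c2", "Write a SQL join. Please"),
    ("c3", "Explain Python decorators. Now"),
]
clusters = cluster(embed(summarise(load(raw))))
```

Here the two Python-related conversations end up in one group and the SQL conversation in another, which can then be inspected in the visualization/analysis step.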