Meta-Clustering

Kura's meta-clustering extends the initial clustering process by organizing existing clusters into a hierarchical structure. This is essential for managing large numbers of base clusters, understanding broader thematic relationships, and enabling multi-level exploratory analysis of conversational data—from general topics down to specific insights.


Overview

Meta-Clustering (or hierarchical clustering) in Kura takes a list of Cluster objects (typically the output of the primary Clustering process) and groups them into higher-level, more generalized parent clusters. This creates a topic taxonomy, allowing users to navigate and comprehend vast amounts of clustered data more effectively.

  • Input: A list of Cluster objects.
  • Output: An updated list of Cluster objects, including newly created parent meta-clusters and the original child clusters now linked via parent_id.
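
The input/output contract above can be illustrated with a minimal sketch. The `Cluster` dataclass here is a hypothetical stand-in rather than Kura's actual class: only `parent_id` is named in this page, and the other fields (`id`, `name`, `description`, `chat_ids`) are assumptions made for illustration.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cluster:
    """Hypothetical stand-in for Kura's Cluster; field names are assumptions."""

    name: str
    description: str
    chat_ids: list[str] = field(default_factory=list)
    parent_id: Optional[str] = None  # None marks a top-level cluster
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


# Two base clusters, as the primary clustering step might produce them.
base = [
    Cluster("Debug Python import errors", "Users fix failing imports.", ["c1", "c2"]),
    Cluster("Optimize SQL queries", "Users speed up slow queries.", ["c3"]),
]

# Meta-clustering adds a parent and links each child to it via parent_id.
# Collecting the children's chat_ids on the parent is an assumption of this sketch.
parent = Cluster(
    "Troubleshoot software development issues",
    "Users seek help with code and queries.",
    chat_ids=[cid for c in base for cid in c.chat_ids],
)
for child in base:
    child.parent_id = parent.id

# The output list holds the new parent plus the updated children.
hierarchy = [parent, *base]
```

The point of the sketch is that the output is still a flat list: the hierarchy lives entirely in the `parent_id` links.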

Meta-clustering facilitates:

  • Scalable Exploration: Makes it feasible to explore datasets with hundreds or thousands of base clusters.
  • Thematic Discovery: Reveals overarching themes and connections between different groups of specific topics.
  • Granular Navigation: Allows users to drill down from broad categories to nuanced sub-topics, supporting deeper "unknown unknown" discovery.


The MetaClusterModel

The core logic for hierarchical clustering is encapsulated in the MetaClusterModel (see kura/meta_cluster.py). This model orchestrates the process of grouping existing clusters into parent clusters and defining the relationships between them.

Key Components and Process

The MetaClusterModel typically employs the following steps, often iteratively if reducing a large number of clusters or building multiple hierarchy levels:

  1. Input Clusters: Starts with a list of Cluster objects generated by ClusterModel.

  2. (Optional) Cluster Grouping with reduce_clusters:

    • If reduce_clusters is called with many input clusters, it first embeds the textual representation (name and description) of these existing clusters using the configured embedding_model.
    • It then uses a clustering_model (e.g., K-means) to group these cluster embeddings into a smaller number of neighborhoods or initial groupings.
    • The subsequent steps are then applied to each of these neighborhoods.
  3. Generating Candidate Meta-Cluster Names (generate_candidate_clusters):

    • For a given set of input clusters (or a neighborhood of clusters from step 2), an LLM is prompted to propose a list of suitable higher-level candidate names.
    • The prompt provides the names and descriptions of the input clusters and asks for broader category names that can encompass several of them, emphasizing specificity and distinctiveness. The aim is to find meaningful parent themes.
  4. Labeling Clusters (label_cluster):

    • Each individual input cluster is then presented to an LLM along with the list of candidate meta-cluster names generated in the previous step.
    • The LLM's task is to assign the cluster to the single best-fitting candidate meta-cluster name. This involves careful instruction to choose an exact match from the candidates.
    • The output is validated to ensure the chosen label is one of the provided candidates (using fuzzy matching for robustness).
  5. Renaming and Finalizing Meta-Clusters (rename_cluster_group):

    • Clusters are grouped based on the labels assigned in step 4.
    • For each group (which will become a new meta-cluster), an LLM is prompted with the names and descriptions of all its child clusters.
    • The LLM generates a final, refined name (imperative, like base cluster names) and a two-sentence summary for this new meta-cluster. This ensures the meta-cluster accurately and concisely represents its constituent child clusters.
    • A new Cluster object is created for this meta-cluster. The original child clusters in this group have their parent_id field updated to the ID of this new meta-cluster.
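
Steps 3 to 5 can be sketched roughly as follows. This is not Kura's implementation: the three LLM calls are replaced by stub functions with canned answers, the `Cluster` class is a hypothetical minimal shape, and fuzzy matching is done with the standard library's `difflib`, which may differ from what `kura/meta_cluster.py` actually uses. Step 2 (embedding and K-means grouping) is omitted.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass, field
from difflib import get_close_matches
from typing import Optional


@dataclass
class Cluster:  # hypothetical minimal shape, not Kura's actual class
    name: str
    description: str
    parent_id: Optional[str] = None
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


def propose_candidates(clusters):
    """Stub for the step 3 LLM call: returns broader category names."""
    return ["Assist with software development", "Support creative writing"]


def choose_label(cluster, candidates):
    """Stub for the step 4 LLM call; answers are deliberately imperfect."""
    canned = {
        "Debug Python import errors": "assist with software developement",
        "Optimize SQL queries": "Assist with software development",
        "Draft short story outlines": "Support creative writing.",
    }
    return canned[cluster.name]


def validate_label(raw, candidates, cutoff=0.6):
    """Map a raw answer onto an exact candidate name via fuzzy matching."""
    lowered = {c.lower(): c for c in candidates}
    match = get_close_matches(raw.strip().lower(), list(lowered), n=1, cutoff=cutoff)
    if not match:
        raise ValueError(f"{raw!r} does not match any candidate")
    return lowered[match[0]]


def summarize_group(children):
    """Stub for the step 5 LLM call: final name plus a short summary."""
    return f"Help with {len(children)} related topics", "Placeholder summary."


def meta_cluster(clusters):
    candidates = propose_candidates(clusters)  # step 3
    groups = defaultdict(list)
    for cluster in clusters:  # step 4: one validated label per cluster
        label = validate_label(choose_label(cluster, candidates), candidates)
        groups[label].append(cluster)
    parents = []
    for children in groups.values():  # step 5: one new parent per group
        name, description = summarize_group(children)
        parent = Cluster(name, description)
        for child in children:
            child.parent_id = parent.id
        parents.append(parent)
    return parents + clusters
```

Validating each label against the candidate list before grouping keeps a misspelled or slightly reworded LLM answer from silently creating an extra group.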

Prompting Strategies

Similar to base clustering, the LLM prompts used in MetaClusterModel are designed to:

  • Elicit specific and descriptive names/summaries for the meta-clusters.
  • Ensure meta-clusters are distinguishable from one another.
  • Handle potentially sensitive topics appropriately by encouraging descriptive rather than euphemistic language.
  • Maintain a consistent style (e.g., imperative sentences for names).

Output: Hierarchical Cluster List

The final output of generate_meta_clusters (or reduce_clusters) is a list containing:

  • The newly created parent meta-clusters (which have parent_id=None).
  • The original input clusters, now updated with their respective parent_id linking them to their new meta-cluster.

This structure allows for easy reconstruction and traversal of the cluster hierarchy.
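
As a rough illustration of that traversal, the sketch below rebuilds a tree from a flat list using only the parent links. The `(id, name, parent_id)` tuples are a hypothetical simplification of Cluster objects, and the helper names are invented for this example.

```python
from collections import defaultdict

# Each tuple is (id, name, parent_id): a simplified, hypothetical view of Cluster objects.
clusters = [
    ("m1", "Assist with software development", None),
    ("b1", "Debug Python import errors", "m1"),
    ("b2", "Optimize SQL queries", "m1"),
    ("m2", "Support creative writing", None),
    ("b3", "Draft short story outlines", "m2"),
]


def build_children_index(rows):
    """Group cluster ids by their parent id; roots are stored under None."""
    children = defaultdict(list)
    for cluster_id, _name, parent_id in rows:
        children[parent_id].append(cluster_id)
    return children


def render_tree(rows):
    """Return indented lines, one per cluster, in depth-first order."""
    names = {cluster_id: name for cluster_id, name, _ in rows}
    children = build_children_index(rows)
    lines = []

    def walk(cluster_id, depth):
        lines.append("  " * depth + names[cluster_id])
        for child_id in children.get(cluster_id, []):
            walk(child_id, depth + 1)

    for root_id in children[None]:
        walk(root_id, 0)
    return lines


print("\n".join(render_tree(clusters)))
```

Because the index is keyed by parent id, the same code handles deeper hierarchies produced by repeated reduction without changes.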


Configuration

  • LLM Model: The LLM used for candidate generation, labeling, and renaming is configurable (default: openai/gpt-4o-mini).
  • Embedding Model: If using reduce_clusters, the embedding_model is used to embed the input clusters themselves (default: OpenAIEmbeddingModel).
  • Clustering Method: If using reduce_clusters, the clustering_model is used to group the cluster embeddings (default: KmeansClusteringMethod).
  • Concurrency: max_concurrent_requests controls parallelism for LLM calls.
  • Max Clusters per Level (Implicit): The max_clusters parameter in MetaClusterModel (and logic within generate_candidate_clusters) influences how many meta-clusters are aimed for at each level of reduction, guiding the granularity of the hierarchy.
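
To make the concurrency setting concrete, here is one common way a limit like max_concurrent_requests can be enforced, using an `asyncio.Semaphore`. Whether MetaClusterModel does exactly this is an assumption; `fake_llm_call` and `label_all` are hypothetical names, and the sleep stands in for a real network request.

```python
import asyncio


async def fake_llm_call(cluster_name, tracker):
    """Placeholder for an LLM request; records how many calls run at once."""
    tracker["active"] += 1
    tracker["peak"] = max(tracker["peak"], tracker["active"])
    await asyncio.sleep(0.01)  # simulated network latency
    tracker["active"] -= 1
    return f"label for {cluster_name}"


async def label_all(cluster_names, max_concurrent_requests=3):
    """Label every cluster while keeping at most N calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)
    tracker = {"active": 0, "peak": 0}

    async def bounded(name):
        async with semaphore:  # waits here once the limit is reached
            return await fake_llm_call(name, tracker)

    labels = await asyncio.gather(*(bounded(n) for n in cluster_names))
    return labels, tracker["peak"]


labels, peak = asyncio.run(label_all([f"cluster-{i}" for i in range(10)]))
```

A lower limit reduces the risk of hitting provider rate limits at the cost of longer wall-clock time for the labeling step.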

Pipeline Integration

Meta-clustering typically follows the initial clustering step performed by ClusterModel:

  1. Loading: Conversations are loaded.
  2. Summarization: Conversations are summarized (ConversationSummary).
  3. Embedding: Summaries are embedded.
  4. Clustering: Summaries are grouped into base Cluster objects.
  5. Meta-Clustering: Base clusters are organized hierarchically by MetaClusterModel.
  6. Visualization/Analysis: The full hierarchy of clusters and summaries can be explored.
