Cluster View in Probably

Cluster View is a feature in Probably that lets you explore vector-space embeddings of your text data. It is particularly useful for uncovering semantic patterns, identifying groups of similar items, and visualizing high-dimensional data in 2D.

Understanding Cluster View

Cluster View uses dimensionality reduction techniques (primarily UMAP, Uniform Manifold Approximation and Projection) to represent high-dimensional data in a 2D plot. Each point in the plot represents one data point (e.g., a piece of text), and points that lie close together are similar to each other.
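To make "similarity" concrete, here is a minimal sketch of how similarity between embedding vectors is often measured before any projection to 2D. This page does not say which measure Probably uses; cosine similarity is assumed here because it is a common choice for text embeddings, and the vectors are made-up toy values.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 4-dimensional "embeddings"; real ones have hundreds of dimensions.
embeddings = {
    "refund took too long": [0.9, 0.1, 0.0, 0.2],
    "slow refund process": [0.8, 0.2, 0.1, 0.3],
    "love the new design": [0.1, 0.9, 0.7, 0.0],
}

query = embeddings["refund took too long"]
for text, vector in embeddings.items():
    # Higher scores mean the two texts point in a more similar direction.
    print(f"{cosine_similarity(query, vector):.3f}  {text}")
```

The two refund-related texts score much higher against each other than against the design comment, which is the relationship the 2D plot tries to preserve as proximity.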

Accessing Cluster View

To access Cluster View:

  1. Select a continuous text variable as your X-axis.
  2. Probably will automatically switch to Cluster View when it detects a text variable suitable for embedding.

[Figure: Cluster View example]

Key Features of Cluster View

1. Zooming and Panning

  • Use the mouse wheel to zoom in and out.
  • Click and drag to pan across the plot.
  • Use the "Zoom Out" button to view the entire vector space.

2. Adjusting Cluster Centroids

  • Use the "Clusters" input to change the number of plotted cluster centroids.
  • Probably will recalculate and display the new cluster centers in real time.
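To illustrate what changing the number of centroids does, the sketch below runs plain k-means on a handful of 2D points for two different values of k. This is an assumption for illustration only: this page does not say which clustering algorithm Probably uses, and the points and the `kmeans` helper are hypothetical.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means on 2D points; returns (centroids, labels)."""
    # Deterministic start: use the first k points as initial centroids.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, labels

# Made-up plot coordinates forming two visually obvious groups.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.3), (4.9, 5.0)]
for k in (2, 3):
    centroids, labels = kmeans(points, k)
    print(k, labels)
```

With k set to 2 the centroids settle on the two groups; raising k forces one group to be split, which is why a higher "Clusters" value tends to reveal finer-grained themes.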

3. Custom Clusters

  • Enter text in the "Custom Clusters" input to add your own cluster centers.
  • This feature allows you to "search" the vector space visually by plotting the semantic position of your input text.
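Once a custom cluster center is plotted, the points nearest to it are the ones most related to your input text. A minimal sketch of that lookup, assuming you already have 2D coordinates for each text; the coordinates, the position of the custom center, and the `nearest_points` helper are all hypothetical.

```python
import math

# Hypothetical 2D plot coordinates for a few texts (made-up values).
plotted = [
    ("refund took too long", (1.0, 1.2)),
    ("slow refund process", (1.3, 0.9)),
    ("love the new design", (6.0, 5.5)),
    ("checkout page is confusing", (5.1, 4.8)),
]

def nearest_points(center, plotted, n=2):
    """Return the n plotted texts closest to a custom cluster center."""
    ranked = sorted(plotted, key=lambda item: math.dist(center, item[1]))
    return [text for text, _ in ranked[:n]]

# Assumed position where the custom text "refund delays" landed in the plot.
custom_center = (1.0, 1.0)
print(nearest_points(custom_center, plotted))
```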

4. Summarization

  • Click "Summarize All" to generate summaries for each cluster center.
  • Hover over a cluster center and press "S" to summarize just that cluster.

5. Exploring Data Points

  • Hover over any point to see a tooltip with the raw text data.
  • Right-click on a point to view the full data record associated with that point.

Interpreting Cluster View

  1. Proximity: Points that are close together in the plot represent data points that are similar in the high-dimensional space.

  2. Clusters: Dense areas of points often represent distinct themes or topics in your data.

  3. Outliers: Points that are far from any cluster may represent unique or unusual data points.

  4. Gradients: Sometimes you'll see gradual changes across the plot, representing a spectrum of related concepts.

Advanced Usage

Combining with Filters

Use filters to focus on specific subsets of your data:

  1. Apply filters using the sidebar.
  2. The Cluster View will update to show only the points that meet your filter criteria.
  3. This can help you understand how different segments of your data are distributed in the semantic space.
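Conceptually, filtering keeps the plot coordinates fixed and hides the points that fail the criteria. The sketch below shows that idea on a few made-up records; the field names and the `apply_filters` helper are assumptions, not part of Probably.

```python
# Hypothetical records: plot coordinates plus other variables.
records = [
    {"text": "refund took too long", "x": 1.0, "y": 1.2, "region": "EU"},
    {"text": "slow refund process", "x": 1.3, "y": 0.9, "region": "US"},
    {"text": "love the new design", "x": 6.0, "y": 5.5, "region": "EU"},
]

def apply_filters(records, **criteria):
    """Keep records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(r.get(field) == value for field, value in criteria.items())
    ]

eu_points = apply_filters(records, region="EU")
for r in eu_points:
    # Only the surviving points would be drawn, at their original positions.
    print(r["x"], r["y"], r["text"])
```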

Using Color Coding

You can color-code points based on another variable:

  1. Select a variable for the Z-axis.
  2. Points in the Cluster View will be colored based on their value for this variable.
  3. This can reveal patterns in how other variables relate to the semantic structure of your text data.
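As a rough illustration of color coding, the sketch below maps a numeric variable onto a blue-to-red gradient. It assumes a numeric Z variable and a simple linear scale; the actual palette and scaling used by Probably are not described on this page, and `value_to_hex` is a hypothetical helper.

```python
def value_to_hex(value, low, high):
    """Map a number in [low, high] to a blue-to-red hex color."""
    if high == low:
        t = 0.5  # all values identical: use the midpoint color
    else:
        t = (value - low) / (high - low)
    t = min(1.0, max(0.0, t))  # clamp values outside the range
    red = round(255 * t)
    blue = round(255 * (1 - t))
    return f"#{red:02x}00{blue:02x}"

# Made-up ratings standing in for the Z-axis variable.
ratings = [1, 3, 5]
colors = [value_to_hex(r, min(ratings), max(ratings)) for r in ratings]
print(colors)
```

If low-rated points then cluster in one region of the plot, that suggests the theme in that region is associated with low ratings.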

Exporting Cluster Data

To further analyze your clusters:

  1. Click the "Export" button.
  2. Choose "Cluster Data" as the export type.
  3. You'll receive a CSV file with each data point's cluster assignment and coordinates in the 2D space.
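Once you have the CSV, a few lines of Python are enough to summarize each cluster. The sketch below assumes columns named `text`, `cluster`, `x`, and `y`; these names are hypothetical, so check the header row of your own export. An in-memory string stands in for the file.

```python
import csv
import io
from collections import defaultdict

# Stand-in for the exported file; the column names are assumptions.
exported = io.StringIO(
    "text,cluster,x,y\n"
    "refund took too long,0,1.0,1.2\n"
    "slow refund process,0,1.3,0.9\n"
    "love the new design,1,6.0,5.5\n"
)

def summarize_clusters(csv_file):
    """Return {cluster: (count, mean_x, mean_y)} from exported rows."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["cluster"]].append((float(row["x"]), float(row["y"])))
    return {
        cluster: (
            len(coords),
            sum(x for x, _ in coords) / len(coords),
            sum(y for _, y in coords) / len(coords),
        )
        for cluster, coords in groups.items()
    }

summary = summarize_clusters(exported)
for cluster, (count, mean_x, mean_y) in summary.items():
    print(cluster, count, round(mean_x, 2), round(mean_y, 2))
```

To read a real export, replace the `io.StringIO` object with `open("your_export.csv", newline="")`.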

Best Practices

  1. Start with an appropriate number of clusters: Begin with a moderate number (e.g., 5-10) and adjust based on the complexity of your data.

  2. Use custom clusters for guided exploration: If you have specific themes you're interested in, use custom clusters to see where they fall in the space.

  3. Combine with other views: Use insights from Cluster View to inform your X vs Y plotting, and vice versa.

  4. Consider preprocessing: For text data, consider preprocessing (e.g., removing stop words, stemming) before embedding to focus on meaningful content.

  5. Validate findings: Always cross-reference patterns you see in Cluster View with your domain knowledge and other analyses.
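For the preprocessing suggestion above, here is a minimal sketch of stop-word removal using only the standard library. The stop-word list is a tiny illustrative sample, and whether this kind of cleanup helps depends on the embedding model, so treat it as something to test rather than a required step.

```python
import re

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "was", "to", "and", "of", "it"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(preprocess("The refund was slow, and it took a week to arrive."))
```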

Limitations and Considerations

  • Cluster View is a 2D approximation of a high-dimensional space. Some nuances of the data relationships may be lost in this projection.
  • The quality of clusters depends on the quality and relevance of your text data. Garbage in, garbage out!
  • For very large datasets, you may need to sample your data to maintain performance.
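If you need to sample before loading data, a uniform random sample with a fixed seed keeps the result reproducible. A minimal sketch, assuming your rows fit in a Python list; the row contents, the sample size of 500, and the `sample_rows` helper are illustrative only.

```python
import random

def sample_rows(rows, max_rows, seed=0):
    """Return at most max_rows rows, chosen uniformly at random."""
    if len(rows) <= max_rows:
        return list(rows)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(rows, max_rows)

# Made-up dataset of 10,000 short comments.
rows = [f"comment {i}" for i in range(10_000)]
subset = sample_rows(rows, 500)
print(len(subset))
```

Uniform sampling preserves the overall mix of themes on average, but very rare themes may drop out of a small sample.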

Next Steps

Now that you're familiar with Cluster View, remember that it is a tool for exploratory data analysis. Don't be afraid to experiment and look at your data from different angles!