# FastThresholdClustering

`FastThresholdClustering` is an efficient vector clustering algorithm built on FAISS, particularly suited to large-scale vector clustering tasks. It uses cosine similarity as its distance metric and supports GPU acceleration.

## Features
- GPU acceleration support
- Automatic parameter optimization
- Memory usage optimization
- Performance monitoring and logging
- Batch processing for large-scale data
- Noise point detection
## Quick Start

```python
from fast_clustering import fast_cluster_embeddings

# Use the convenience function for clustering
labels = fast_cluster_embeddings(
    embeddings,  # your (n_samples, n_features) vector array
    similarity_threshold=0.8,
    min_samples=5,
    use_gpu=True
)
```
## Parameters

### similarity_threshold

| Property | Description |
|---|---|
| Type | float |
| Default | 0.8 |
| Range | [0, 1] |
| Function | Similarity threshold for deciding whether two vectors belong to the same cluster |
Detailed description:
- Higher values produce stricter clustering and more clusters
- Lower values produce looser clustering and fewer clusters
- Recommended values (see the comparison sketch after this list):
  - 0.7-0.8: general text vectors
  - 0.8-0.9: high-precision matching scenarios
  - Above 0.9: extremely strict matching requirements
- Performance impact: higher thresholds compute faster
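A sketch of the trade-off using the `fast_cluster_embeddings` helper from the quick start (random data here, so actual cluster counts will vary with your vectors):

```python
import numpy as np
from fast_clustering import fast_cluster_embeddings

embeddings = np.random.random((2000, 768)).astype('float32')

# Stricter threshold: typically more, smaller clusters (and more noise)
labels_strict = fast_cluster_embeddings(embeddings, similarity_threshold=0.9)
# Looser threshold: typically fewer, larger clusters
labels_loose = fast_cluster_embeddings(embeddings, similarity_threshold=0.7)

for name, labels in [("0.9", labels_strict), ("0.7", labels_loose)]:
    n_clusters = len(set(labels) - {-1})  # ignore the noise label
    print(f"threshold {name}: {n_clusters} clusters")
```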
### min_samples

| Property | Description |
|---|---|
| Type | int |
| Default | 5 |
| Range | >= 2 |
| Function | Minimum number of samples required to form a valid cluster |
Detailed description:
- Clusters with fewer samples are marked as noise points (label -1)
- Recommended settings (a heuristic sketch follows this list):
  - Small datasets (<1000): 2-5
  - Medium datasets (1000-10000): 5-10
  - Large datasets (>10000): 10-20
- This is the key parameter governing noise-point determination
- Higher values produce more noise points but higher cluster quality
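A hypothetical helper encoding those recommendations (`pick_min_samples` is illustrative, not part of the library):

```python
def pick_min_samples(n_samples: int) -> int:
    # Midpoints of the recommended ranges above (hypothetical heuristic)
    if n_samples < 1000:
        return 3
    if n_samples <= 10000:
        return 8
    return 15
```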
### use_gpu

| Property | Description |
|---|---|
| Type | bool |
| Default | True |
| Function | Whether to use GPU acceleration |
Detailed description:
- True: use GPU acceleration
- False: use CPU computation
- Performance impact:
  - GPU mode: suited to large-scale data (>100k entries)
  - CPU mode: suited to small-scale data (<100k entries)
- Memory usage:
  - GPU mode is limited by VRAM
  - CPU mode is limited by RAM
- Recommendation: prefer GPU mode when available (a detection sketch follows this list)
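One way to set the flag from the environment, assuming FAISS is installed (CPU-only FAISS builds do not export `get_num_gpus`, hence the guarded lookup):

```python
import faiss
from fast_clustering import FastThresholdClustering

# Returns 0 on machines without visible GPUs; missing entirely in faiss-cpu builds
use_gpu = getattr(faiss, "get_num_gpus", lambda: 0)() > 0
clusterer = FastThresholdClustering(use_gpu=use_gpu)
```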
### nprobe

| Property | Description |
|---|---|
| Type | int |
| Default | 8 |
| Range | [1, nlist] |
| Function | Number of cluster units (inverted lists) visited during FAISS index search |
Detailed description:
- Balances search accuracy against speed
- Recommended values:
  - Small datasets (<10k): 4-8
  - Medium datasets (10k-100k): 8-16
  - Large datasets (>100k): 16-32
- Higher values:
  - Pros: more accurate search results
  - Cons: slower searches
- Automatically adjusted to the data scale (an illustration of what nprobe controls follows this list)
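To illustrate what `nprobe` controls at the FAISS level (a standalone sketch, independent of this library):

```python
import numpy as np
import faiss

d = 64
xb = np.random.random((10000, d)).astype('float32')
faiss.normalize_L2(xb)  # normalized vectors: inner product == cosine similarity

nlist = 100                          # number of cluster units (inverted lists)
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(xb)
index.add(xb)

index.nprobe = 8                     # visit 8 of the 100 lists per query
D, I = index.search(xb[:5], 10)      # higher nprobe: more accurate, slower
```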
### batch_size

| Property | Description |
|---|---|
| Type | int |
| Default | 1000 |
| Range | [100, dataset size] |
| Function | Batch size, affecting memory usage and computation speed |
Detailed description:
- Recommended values:
  - GPU mode: 500-2000
  - CPU mode: 200-1000
- Memory impact:
  - Larger values use more memory
  - Smaller values increase computation time
- Auto-adjustment:
  - Small datasets: smaller batch_size
  - Large datasets: larger batch_size
- Adjust according to available memory (a heuristic sketch follows this list)
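A hypothetical heuristic reflecting those ranges (`pick_batch_size` is illustrative; passing `batch_size=None` lets the library auto-adjust instead):

```python
def pick_batch_size(n_samples: int, use_gpu: bool) -> int:
    # Upper ends of the recommended ranges above, clamped to the valid range
    upper = 2000 if use_gpu else 1000
    return max(100, min(upper, n_samples))
```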
### n_workers

| Property | Description |
|---|---|
| Type | int |
| Default | None |
| Range | [1, CPU cores] |
| Function | Number of parallel worker threads |
Detailed description:
- When None, automatically set to min(CPU cores, 8) (reproduced explicitly below)
- Recommended values:
  - Small datasets: 2-4 threads
  - Medium datasets: 4-8 threads
  - Large datasets: 8-16 threads
- Considerations:
  - Too many threads can cause resource contention
  - Leave headroom for other processes on the system
  - Has less impact in GPU mode
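The documented default can be reproduced explicitly when you want the value in hand:

```python
import os
from fast_clustering import FastThresholdClustering

# Mirrors the documented default of min(CPU cores, 8)
n_workers = min(os.cpu_count() or 1, 8)
clusterer = FastThresholdClustering(n_workers=n_workers)
```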
## Configuration Examples

```python
# Small datasets on CPU: looser threshold, small batches
FastThresholdClustering(
    similarity_threshold=0.75,
    min_samples=3,
    use_gpu=False,
    nprobe=4,
    batch_size=500,
    n_workers=4
)

# Large datasets on GPU: wider index search, large batches
FastThresholdClustering(
    similarity_threshold=0.85,
    min_samples=10,
    use_gpu=True,
    nprobe=32,
    batch_size=2000,
    n_workers=8
)

# High-precision matching: strict threshold, wide search
FastThresholdClustering(
    similarity_threshold=0.9,
    min_samples=5,
    use_gpu=True,
    nprobe=64,
    batch_size=1000,
    n_workers=8
)
```
## API Reference

### FastThresholdClustering.fit

```python
def fit(self, embeddings: np.ndarray) -> "FastThresholdClustering":
    """Perform clustering on the input vectors.

    Parameters:
        embeddings: numpy array of shape (n_samples, n_features)

    Returns:
        self: the fitted clustering instance
    """
```
### fast_cluster_embeddings

```python
def fast_cluster_embeddings(
    embeddings: np.ndarray,
    similarity_threshold: float = 0.8,
    min_samples: int = 5,
    use_gpu: bool = True,
    nprobe: int = None,
    batch_size: int = None,
    n_workers: int = None,
) -> np.ndarray:
    """Convenience clustering interface.

    Parameters:
        embeddings: input vectors of shape (n_samples, n_features)
        similarity_threshold: clustering similarity threshold
        min_samples: minimum samples per cluster
        use_gpu: whether to use GPU acceleration
        nprobe: FAISS index nprobe parameter (optional)
        batch_size: batch size (optional)
        n_workers: number of worker threads (optional)

    Returns:
        labels: cluster label array of shape (n_samples,)
    """
```
## Return Values

- Clustering results are stored in the `labels_` attribute
- Label -1 marks noise points
- All other labels are numbered consecutively from 0
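For example, the labels from a fitted clusterer can be split into clusters and noise like this:

```python
import numpy as np

labels = clusterer.labels_                     # from a fitted FastThresholdClustering
noise_mask = labels == -1                      # boolean mask of noise points
cluster_ids = np.unique(labels[~noise_mask])   # consecutive ids starting at 0
print(f"{int(noise_mask.sum())} noise points, {len(cluster_ids)} clusters")
```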
## Performance Monitoring

The algorithm includes a built-in performance monitoring system that automatically records:
- Time spent in each phase
- Memory usage
- Clustering progress
- Final clustering statistics
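The exact log format depends on the implementation. To capture your own coarse timing around a run, a minimal sketch (using `clusterer` and `embeddings` as in the complete example below):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("clustering")

start = time.perf_counter()
clusterer.fit(embeddings)
log.info("clustering finished in %.2f s", time.perf_counter() - start)
```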
## Complete Example

```python
import numpy as np
from fast_clustering import FastThresholdClustering

# Prepare data (float32, as FAISS expects)
embeddings = np.random.random((10000, 768)).astype('float32')

# Create the clusterer
clusterer = FastThresholdClustering(
    similarity_threshold=0.8,
    min_samples=5,
    use_gpu=True
)

# Perform clustering
clusterer.fit(embeddings)

# Get the cluster labels
labels = clusterer.labels_
```
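Continuing the example, a quick summary of the result (pure NumPy, no extra library API assumed):

```python
# Cluster ids and sizes, excluding the noise label
ids, sizes = np.unique(labels[labels != -1], return_counts=True)
print(f"{len(ids)} clusters, {(labels == -1).sum()} noise points")
if len(ids) > 0:
    print(f"largest cluster has {sizes.max()} members")
```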
## Notes

- Input vectors are automatically L2-normalized (the equivalent manual step is sketched below)
- GPU acceleration is recommended for large-scale datasets
- Parameters are automatically tuned to the data scale
- Memory usage grows with the data scale
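Since the library normalizes inputs itself, you rarely need this step, but FAISS's own in-place helper performs the equivalent operation:

```python
import numpy as np
import faiss

vecs = np.random.random((100, 768)).astype('float32')  # FAISS requires float32
faiss.normalize_L2(vecs)  # in-place, row-wise L2 normalization
# After normalization, inner product equals cosine similarity
```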