- What is Unsupervised Learning & Goals of Unsupervised Learning
- Types of Unsupervised Learning: 1. Clustering, 2. Association Rules & 3. Dimensionality Reduction
- Definition and Application of Clustering
- 4 methods: 1. K-Means, 2. Hierarchical, 3. DBSCAN & 4. Gaussian Mixture
- If two points are near each other, chances are they are similar
- Distance Measure between two points
- Euclidean Distance: square root of the sum of squared differences between two points
- Manhattan Distance: sum of absolute differences between two points (see the sketch below)
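A minimal sketch of the two distance measures using NumPy; the point coordinates are made-up examples:

```python
import numpy as np

p = np.array([2.0, 3.0])   # example point 1 (made-up values)
q = np.array([5.0, 7.0])   # example point 2 (made-up values)

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((p - q) ** 2))   # -> 5.0

# Manhattan distance: sum of absolute differences
manhattan = np.sum(np.abs(p - q))           # -> 7.0

print(euclidean, manhattan)
```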
- How the algorithm works (step-wise calculation)
- Pre-processing required for K-Means
- Determining the optimal number of clusters (K): 1. Profiling Approach & 2. Elbow Method
- Working of Elbow Method with Example
- 3 concepts: 1. Total Error, 2. Variance/Total Squared Error & 3. Within-Cluster Sum of Squares (WCSS)
- Define the number of clusters, pick centroids and measure distances
- Euclidean Distance : Measure distance between points
- Number of Clusters defined by Elbow Method
- Elbow Method : WCSS vs Number of Clusters (see the K-Means sketch below)
- Silhouette Score : Goodness of Clustering
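A minimal sketch of K-Means with the elbow method (WCSS is `inertia_` in scikit-learn) and the silhouette score, assuming scikit-learn is available; the blob data below is a toy stand-in for a real, pre-processed dataset:

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data standing in for a real, scaled feature matrix
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)   # scaling matters for distance-based methods

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss = km.inertia_                      # within-cluster sum of squares
    sil = silhouette_score(X, km.labels_)   # goodness of clustering
    print(k, round(wcss, 1), round(sil, 3))

# Pick K at the "elbow" of the WCSS curve; the silhouette score usually peaks nearby
```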
- Two Approaches: 1. Agglomerative (Bottom-Up) & 2. Divisive (Top-Down)
- Types of Linkages:
- Single Linkage - Nearest Neighbour (Minimal intercluster dissimilarity)
- Complete Linkage - Farthest Neighbour (Maximal intercluster dissimilarity)
- Average Linkage - Average Distance (Mean intercluster dissimilarity)
- Steps in Agglomerative Hierarchical Clustering with Single Linkage
- Determining the optimal number of clusters: Dendrogram
- Hierarchical relationship between objects
- Optimal number of Clusters for Hierarchical Clustering
- Types of HC
- Agglomerative : Bottom Up approach
- Divisive : Top Down approach
- Number of Clusters defined by Dendrogram
- Dendrogram : Joining data points based on distance & creating clusters
- Linkage : How to calculate the distance between two clusters based on their points
- Single linkage : Minimum Distance between two clusters
- Complete linkage : Maximum Distance between two clusters
- Average linkage : Average Distance between two clusters (see the hierarchical clustering sketch below)
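A minimal sketch of agglomerative (bottom-up) clustering with a dendrogram, assuming SciPy and matplotlib; the `method` argument takes the linkages listed above ("single", "complete", "average"):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Bottom-up clustering with single linkage (minimum distance between clusters)
Z = linkage(X, method="single", metric="euclidean")

# Dendrogram: points are joined by distance; cut it to choose the number of clusters
dendrogram(Z)
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")   # e.g. cut the tree into 3 clusters
print(labels[:10])
```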
- Density Based Clustering
- K-Means & Hierarchical are good for compact & well-separated data
- Both are sensitive to Outliers & Noise
- DBSCAN overcomes these issues & works well with outliers
- 2 important parameters -
- eps: if the distance between 2 points is lower than or equal to eps, they are neighbours
- MinPts: minimum number of neighbours/data points within the eps radius
- No need to pre-define the number of clusters
- Distance metric is Euclidean Distance
- Need to give 2 parameters
- eps : Radius of the circle
- min_samples : minimum number of data points required to form a cluster (see the DBSCAN sketch below)
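A minimal sketch of DBSCAN in scikit-learn; the eps and min_samples values below are illustrative, not tuned:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Non-spherical data with noise, where K-Means and hierarchical clustering struggle
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

# eps = radius of the neighbourhood, min_samples = points needed to form a dense region
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

# No pre-defined number of clusters; noise/outlier points get the label -1
print(set(db.labels_))
```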
- Weaknesses of K-Means
- Expectation-Maximization (EM) method
- Probabilistic Model
- Uses Expectation-Maximization (EM) steps:
- E Step : probability of each data point belonging to each cluster
- M Step : for each cluster, revise the parameters based on those probabilities (see the GMM sketch below)
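A minimal sketch of a Gaussian Mixture Model in scikit-learn, which is fitted with EM; `predict_proba` exposes the E-step style soft assignments:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

# EM under the hood: E step = soft assignment probabilities,
# M step = update the means, covariances and mixture weights
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=7).fit(X)

probs = gmm.predict_proba(X)   # probability of each data point under each cluster
labels = gmm.predict(X)        # hard assignment = most probable cluster

print(probs[:3].round(3))
print(gmm.means_.round(2))     # revised parameters after the M steps
```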
- 2 Steps we normally do for Cluster Adjustment
- Quality of Clustering (Cardinality & Magnitude)
- Performance of Similarity Measure (Euclidean Distance)
- Clusters are well apart from each other when the silhouette score is closer to 1
- It is a metric used to calculate the goodness of a clustering technique
- Its value ranges from -1 to 1.
- 1: Means clusters are well apart from each other and clearly distinguished
- 0: Means clusters are indifferent, or distance between clusters is not significant
- -1: Means clusters are assigned in the wrong way (see the sketch below)
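A minimal sketch illustrating the range: a good clustering on well-separated data scores close to 1, while randomly assigned labels score near 0 or below:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=1)

good = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
print(silhouette_score(X, good))    # close to 1: clusters well apart

rng = np.random.default_rng(1)
bad = rng.integers(0, 3, size=len(X))   # labels assigned at random
print(silhouette_score(X, bad))     # near 0 (or negative): poor clustering
```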
- Disadvantages of each clustering technique
- Based on the data, which is the right clustering method
- Short Description of Each Clustering Algorithm
- Advantage, Disadvantage
- When to use what
- Commonly asked question on Clustering
- For categorical variable clustering, use K-Modes
- It uses the dissimilarities (total mismatches) between data points
- The lesser the dissimilarities, the closer the data points are
- It uses the mode (most frequent value) in each column as the cluster centroid
- K-Modes code in Python
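A minimal sketch using the third-party kmodes package (pip install kmodes); the categorical toy data below is made up for illustration:

```python
import numpy as np
from kmodes.kmodes import KModes

# Toy categorical data: each row is an observation, each column a categorical feature
X = np.array([
    ["red",   "small", "yes"],
    ["red",   "small", "no"],
    ["blue",  "large", "yes"],
    ["blue",  "large", "yes"],
    ["green", "small", "no"],
    ["blue",  "large", "no"],
])

# K-Modes: dissimilarity = number of mismatched categories;
# each cluster centroid is the column-wise mode (most frequent value)
km = KModes(n_clusters=2, init="Huang", n_init=5, verbose=0)
labels = km.fit_predict(X)

print(labels)
print(km.cluster_centroids_)   # modal value per column for each cluster
```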