This project analyses three clustering methods on three different datasets:
- Cleaning and preprocessing of datasets: plotted pairplots, heatmaps, and histograms for the three datasets to explore the data and identify which variables to exclude
- Used the elbow method to choose the number of clusters for the K-means algorithm on each of the three datasets
- Plotted a dendrogram to choose the number of clusters for the Agglomerative Hierarchical algorithm on each of the three datasets
- Plotted a k-distance graph to choose the neighbourhood radius (eps) for the DBSCAN algorithm on each of the three datasets
- Compared and evaluated the three methods to identify the advantages and disadvantages of using each clustering method
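
As an illustration of the k-distance step, here is a minimal pure-Python sketch. It is not the project's code: the toy points, the function names `k_distances` and `suggest_eps`, and the "largest jump" rule for reading the knee are all assumptions made for this example. In practice the knee is usually read off the plotted curve.

```python
import math


def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbour, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


def suggest_eps(sorted_kd):
    """Crude knee heuristic (an assumption): the value just before the largest jump."""
    gaps = [b - a for a, b in zip(sorted_kd, sorted_kd[1:])]
    return sorted_kd[gaps.index(max(gaps))]


# Hypothetical toy data: two tight groups of four points and one isolated point.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
    (30.0, 30.0),
]

kd = k_distances(points, k=2)
eps_candidate = suggest_eps(kd)
print(kd)
print("candidate eps:", eps_candidate)
```

Points inside a dense group have small k-distances, while the isolated point has a much larger one. The sharp rise in the sorted curve is what the k-distance graph makes visible.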
All documents can be found under the Documentation folder
- Detailed report of the project: A Technical Review of Clustering.pdf
All datasets can be found under the Datasets folder
- Frequent Flyer Program: This dataset contains information about the behaviour of NZ Airline’s FFP customers. We have dropped two variables: PartnerTrans and FlightTrans. The models are built with the following variables: AwardMiles, EliteMiles, PartnerMiles, FlyingReturnsMiles, and EnrollDuration.
- Mall Customer: This dataset contains basic information about the customers. No pair of attributes is strongly correlated, so we used all the numeric variables when building the clustering models.
- Wine: This dataset contains information about different types of wines. Total Phenols, Flavanoids, Hue, OD280, and Proline show a strong negative correlation with the class label. Ash_Alcanity has a positive correlation with Ash. Therefore, we dropped the variables Ash_Alcanity, OD280, and Proanthocyanins when we built the clustering models.
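
The correlation screening described for the datasets above can be sketched in plain Python. This is only an illustration: the column names and values are made up, the helper names `pearson` and `correlated_pairs` are hypothetical, and the 0.8 cutoff is an assumption (the project read correlations off heatmaps and does not state a threshold).

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def correlated_pairs(columns, threshold=0.8):
    """List column pairs whose absolute correlation reaches the (assumed) threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Made-up values for illustration; these are NOT taken from the project's datasets.
columns = {
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature_b": [2.0, 4.0, 6.0, 8.0, 10.0],  # exactly 2 * feature_a
    "feature_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}

flagged = correlated_pairs(columns)
print(flagged)
```

A flagged pair carries largely redundant information, so one of the two variables is a candidate for exclusion before clustering.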
- Ang Shu Hui
- Bachhas Nikita
- Srinivas Shruthi
- Unnikrishnan Malavika