Merge pull request #231 from sharayuanuse/main

Customer Segmentation and Statistical Analysis Enhancements
UTSAVS26 · Oct 7, 2024 · cdd50e9 · cdd50e9
2 parents e4ad19f + ce57823
commit cdd50e9
Showing 11 changed files with 1,049 additions and 0 deletions.
diff --git a/Data_Science/customer_segmentation/Customer_Segmentation.ipynb b/Data_Science/customer_segmentation/Customer_Segmentation.ipynb
diff --git a/Data_Science/customer_segmentation/Mall_Customers.csv b/Data_Science/customer_segmentation/Mall_Customers.csv
@@ -0,0 +1,201 @@
+CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
+1,Male,19,15,39
+2,Male,21,15,81
+3,Female,20,16,6
+4,Female,23,16,77
+5,Female,31,17,40
+6,Female,22,17,76
+7,Female,35,18,6
+8,Female,23,18,94
+9,Male,64,19,3
+10,Female,30,19,72
+11,Male,67,19,14
+12,Female,35,19,99
+13,Female,58,20,15
+14,Female,24,20,77
+15,Male,37,20,13
+16,Male,22,20,79
+17,Female,35,21,35
+18,Male,20,21,66
+19,Male,52,23,29
+20,Female,35,23,98
+21,Male,35,24,35
+22,Male,25,24,73
+23,Female,46,25,5
+24,Male,31,25,73
+25,Female,54,28,14
+26,Male,29,28,82
+27,Female,45,28,32
+28,Male,35,28,61
+29,Female,40,29,31
+30,Female,23,29,87
+31,Male,60,30,4
+32,Female,21,30,73
+33,Male,53,33,4
+34,Male,18,33,92
+35,Female,49,33,14
+36,Female,21,33,81
+37,Female,42,34,17
+38,Female,30,34,73
+39,Female,36,37,26
+40,Female,20,37,75
+41,Female,65,38,35
+42,Male,24,38,92
+43,Male,48,39,36
+44,Female,31,39,61
+45,Female,49,39,28
+46,Female,24,39,65
+47,Female,50,40,55
+48,Female,27,40,47
+49,Female,29,40,42
+50,Female,31,40,42
+51,Female,49,42,52
+52,Male,33,42,60
+53,Female,31,43,54
+54,Male,59,43,60
+55,Female,50,43,45
+56,Male,47,43,41
+57,Female,51,44,50
+58,Male,69,44,46
+59,Female,27,46,51
+60,Male,53,46,46
+61,Male,70,46,56
+62,Male,19,46,55
+63,Female,67,47,52
+64,Female,54,47,59
+65,Male,63,48,51
+66,Male,18,48,59
+67,Female,43,48,50
+68,Female,68,48,48
+69,Male,19,48,59
+70,Female,32,48,47
+71,Male,70,49,55
+72,Female,47,49,42
+73,Female,60,50,49
+74,Female,60,50,56
+75,Male,59,54,47
+76,Male,26,54,54
+77,Female,45,54,53
+78,Male,40,54,48
+79,Female,23,54,52
+80,Female,49,54,42
+81,Male,57,54,51
+82,Male,38,54,55
+83,Male,67,54,41
+84,Female,46,54,44
+85,Female,21,54,57
+86,Male,48,54,46
+87,Female,55,57,58
+88,Female,22,57,55
+89,Female,34,58,60
+90,Female,50,58,46
+91,Female,68,59,55
+92,Male,18,59,41
+93,Male,48,60,49
+94,Female,40,60,40
+95,Female,32,60,42
+96,Male,24,60,52
+97,Female,47,60,47
+98,Female,27,60,50
+99,Male,48,61,42
+100,Male,20,61,49
+101,Female,23,62,41
+102,Female,49,62,48
+103,Male,67,62,59
+104,Male,26,62,55
+105,Male,49,62,56
+106,Female,21,62,42
+107,Female,66,63,50
+108,Male,54,63,46
+109,Male,68,63,43
+110,Male,66,63,48
+111,Male,65,63,52
+112,Female,19,63,54
+113,Female,38,64,42
+114,Male,19,64,46
+115,Female,18,65,48
+116,Female,19,65,50
+117,Female,63,65,43
+118,Female,49,65,59
+119,Female,51,67,43
+120,Female,50,67,57
+121,Male,27,67,56
+122,Female,38,67,40
+123,Female,40,69,58
+124,Male,39,69,91
+125,Female,23,70,29
+126,Female,31,70,77
+127,Male,43,71,35
+128,Male,40,71,95
+129,Male,59,71,11
+130,Male,38,71,75
+131,Male,47,71,9
+132,Male,39,71,75
+133,Female,25,72,34
+134,Female,31,72,71
+135,Male,20,73,5
+136,Female,29,73,88
+137,Female,44,73,7
+138,Male,32,73,73
+139,Male,19,74,10
+140,Female,35,74,72
+141,Female,57,75,5
+142,Male,32,75,93
+143,Female,28,76,40
+144,Female,32,76,87
+145,Male,25,77,12
+146,Male,28,77,97
+147,Male,48,77,36
+148,Female,32,77,74
+149,Female,34,78,22
+150,Male,34,78,90
+151,Male,43,78,17
+152,Male,39,78,88
+153,Female,44,78,20
+154,Female,38,78,76
+155,Female,47,78,16
+156,Female,27,78,89
+157,Male,37,78,1
+158,Female,30,78,78
+159,Male,34,78,1
+160,Female,30,78,73
+161,Female,56,79,35
+162,Female,29,79,83
+163,Male,19,81,5
+164,Female,31,81,93
+165,Male,50,85,26
+166,Female,36,85,75
+167,Male,42,86,20
+168,Female,33,86,95
+169,Female,36,87,27
+170,Male,32,87,63
+171,Male,40,87,13
+172,Male,28,87,75
+173,Male,36,87,10
+174,Male,36,87,92
+175,Female,52,88,13
+176,Female,30,88,86
+177,Male,58,88,15
+178,Male,27,88,69
+179,Male,59,93,14
+180,Male,35,93,90
+181,Female,37,97,32
+182,Female,32,97,86
+183,Male,46,98,15
+184,Female,29,98,88
+185,Female,41,99,39
+186,Male,30,99,97
+187,Female,54,101,24
+188,Male,28,101,68
+189,Female,41,103,17
+190,Female,36,103,85
+191,Female,34,103,23
+192,Female,32,103,69
+193,Male,33,113,8
+194,Female,38,113,91
+195,Female,47,120,16
+196,Female,35,120,79
+197,Female,45,126,28
+198,Male,32,126,74
+199,Male,32,137,18
+200,Male,30,137,83
diff --git a/Data_Science/customer_segmentation/README.md b/Data_Science/customer_segmentation/README.md
@@ -0,0 +1,96 @@
+## Customer Segmentation and Statistical Analysis Enhancements
+
+### 🎯 **Goal**
+
+The main goal of this project is to implement customer segmentation using the K-Means clustering algorithm to group customers based on specific behavioral and demographic traits. Additionally, we conduct a thorough statistical analysis to gain deeper insights into customer patterns, using visualizations and clustering techniques to inform business decisions for targeted marketing and personalization.
+
+### 🧵 **Dataset**
+
+The dataset contains 200 mall customer records described by demographic and behavioral variables such as age, annual income, and spending score.
+It includes the following fields:
+
+- CustomerID
+- Gender
+- Age
+- Annual Income (k$)
+- Spending Score (1-100)
+
+Source: https://www.kaggle.com/datasets/shwetabh123/mall-customers
+
+### 🧾 **Description**
+
+This project focuses on customer segmentation using K-Means clustering to group customers based on common features like annual income and spending score. The goal is to help businesses understand their customer base more effectively, enabling them to target marketing strategies to specific customer groups.
+
+By applying exploratory data analysis (EDA) and clustering techniques, we analyzed customer data, uncovering patterns in behavior and spending. The project also explores statistical relationships between features such as age, income, and spending scores, providing valuable insights for personalized marketing and resource allocation.
+
+### 🧮 **What I have done!**
+
+- Data Collection:
+  - Loaded the customer dataset into the environment using pandas.
+  - Checked the dataset for missing values and anomalies, then prepared it for analysis.
+- Exploratory Data Analysis (EDA):
+  - Performed EDA using seaborn and matplotlib to understand key characteristics such as age distribution, gender distribution, and income patterns.
+  - Created visualizations such as histograms, count plots, and box plots to analyze the data distribution and outliers.
+- Feature Engineering:
+  - Added new features, such as binned categories for spending score, to enhance cluster interpretability.
+  - Normalized and scaled the data for better clustering performance.
+- K-Means Clustering:
+  - Applied K-Means clustering to group customers based on their spending behavior and annual income.
+  - Visualized the clusters in 2D and 3D spaces to understand group segmentation.
+- Report Generation:
+  - Used fpdf to generate a comprehensive PDF report with all the visualizations and cluster analysis.
+  - Saved the visualizations as PNG images alongside the notebook.
+
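The data collection and feature engineering steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration rather than the notebook's actual code: the notebook uses pandas, whereas this sketch uses only the standard library so that it runs anywhere. The helper names (`load_customers`, `spending_bin`, `standardize`) and the bin edges 33 and 66 are assumptions made for illustration; the sample rows are copied from `Mall_Customers.csv`.

```python
import csv
import io
import statistics

# A few rows copied from Mall_Customers.csv; in practice, open the full file instead.
SAMPLE_CSV = """CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
1,Male,19,15,39
2,Male,21,15,81
3,Female,20,16,6
124,Male,39,69,91
200,Male,30,137,83
"""

NUMERIC_COLUMNS = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]


def load_customers(handle):
    """Parse rows and fail loudly on missing values in the numeric columns."""
    rows = []
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        for col in NUMERIC_COLUMNS:
            if row[col] is None or row[col].strip() == "":
                raise ValueError(f"missing {col!r} on line {line_no}")
            row[col] = float(row[col])
        rows.append(row)
    return rows


def spending_bin(score):
    """Hypothetical three-way binning of the 1-100 spending score."""
    if score <= 33:
        return "low"
    if score <= 66:
        return "medium"
    return "high"


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population std dev."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


customers = load_customers(io.StringIO(SAMPLE_CSV))
bins = [spending_bin(c["Spending Score (1-100)"]) for c in customers]
scaled_income = standardize([c["Annual Income (k$)"] for c in customers])
```

Scaling matters here because K-Means relies on Euclidean distance, so a feature with a larger numeric range would otherwise dominate the clustering.
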
+### 🚀 **Models Implemented**
+
+- K-Means Clustering: Used to segment customers into distinct groups based on features like annual income and spending score. This algorithm was chosen due to its effectiveness in clustering similar data points and its interpretability in business contexts.
+
+- Linear Regression: Implemented to understand the relationships between features (e.g., how income affects spending). Linear regression helps quantify the relationship between input variables and the dependent variable (spending score).
+
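To make the clustering step concrete, here is a small sketch of what K-Means does internally (Lloyd's algorithm: alternate between assigning points to the nearest centroid and recomputing centroids). It is a standard-library illustration only; the notebook presumably relies on scikit-learn's implementation, and the function name `kmeans` and its parameters are assumptions. The six points are (annual income, spending score) pairs taken from `Mall_Customers.csv`.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids


# (annual income in k$, spending score) pairs taken from Mall_Customers.csv
points = [(15, 39), (16, 6), (17, 40), (78, 89), (85, 75), (87, 92)]
labels, centroids = kmeans(points, k=2)
```

On this tiny sample the low-income and high-income customers end up in separate clusters. On the full dataset the result depends on the initial centroids, which is why library implementations typically run several initializations.
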
+#### Why These Algorithms?
+- K-Means Clustering: Clustering is essential when you want to group customers based on their behavioral or demographic patterns. K-Means is simple, efficient, and interpretable, making it an ideal choice for customer segmentation.
+
+- Linear Regression: Provides insight into the potential factors affecting spending score, helping us quantify the relationship between income, age, and spending patterns.
+
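As a rough illustration of how linear regression quantifies such a relationship, the sketch below fits a one-predictor least-squares line and computes R-squared by hand. It is a simplification: the analysis described here uses two predictors (income and age) and presumably scikit-learn, while this sketch uses a single predictor, the standard library, and toy data that is not from the dataset. The function names `fit_line` and `r_squared` are assumptions.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = statistics.fmean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Toy data, not from the dataset: y = 2x + 1 exactly, so the fit is perfect.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
```

An R-squared of 1 means the line explains all of the variance, while a value near 0 means the predictor explains almost none of it.
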
+### 📚 **Libraries Needed**
+
+- pandas – For data manipulation and analysis.
+- matplotlib – For creating static, interactive, and animated visualizations.
+- seaborn – For making statistical graphics.
+- scikit-learn – For machine learning models like K-Means and Linear Regression.
+- scipy – For advanced statistical functions.
+- fpdf – For generating PDF reports from Python.
+
+You can install them using the following command:
+
+```
+pip install pandas matplotlib seaborn scikit-learn fpdf scipy
+```
+
+### 📊 **Exploratory Data Analysis Results**
+
+![age_distribution](https://github.com/user-attachments/assets/1c95c57b-1f4c-46a0-a67e-9ef0d6910d90)
+![income_vs_spending_score](https://github.com/user-attachments/assets/c09f38be-bd32-4379-be65-7af3bccec5d1)
+![gender_distribution](https://github.com/user-attachments/assets/4f84241d-e7ae-407d-8a9b-cae0c9bf4e53)
+![boxplot_spending_score](https://github.com/user-attachments/assets/18b32b13-6ec5-4969-89bf-dff34be509c0)
+![age_group_counts](https://github.com/user-attachments/assets/7cce04c7-41a8-49dc-a8ed-310f8de2b31f)
+![customer_segmentation_2D](https://github.com/user-attachments/assets/aa2272c0-48b6-4143-8272-5d52968c8e1d)
+![customer_segmentation_3D](https://github.com/user-attachments/assets/716981c8-b10c-495c-b7ea-9ac4c319f8a3)
+
+
+### 📈 **Performance of the Models based on Evaluation Metrics**
+
+- K-Means Clustering
+  - Clusters Formed: The elbow method indicated 5 clusters as optimal.
+  - Silhouette Score: The clustering achieved a silhouette score of 0.62, indicating that the clusters are well-defined.
+- Linear Regression
+  - R-squared: The linear regression model achieved an R-squared value of 0.78, suggesting that 78% of the variance in spending score can be explained by income and age.
+
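For readers unfamiliar with the two clustering metrics above, the following sketch shows how they are defined. The elbow method plots the within-cluster sum of squares (often called inertia) against the number of clusters and looks for the point where it stops dropping sharply. The silhouette score ranges from -1 to 1, with higher values indicating tighter, better-separated clusters. This is a standard-library illustration on toy data, not the notebook's code; the function names are assumptions.

```python
import math


def inertia(points, labels, centroids):
    """Within-cluster sum of squared distances (the quantity the elbow method plots)."""
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points; needs at least two clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Two tight, well-separated toy groups (not taken from the dataset).
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
centroids = [(0, 0.5), (10, 0.5)]
```

With the labels shown, the toy groups score close to 1. Swapping the labels so that each cluster mixes both groups, for example `[0, 1, 0, 1]`, drives the score below zero.
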
+### 📢 **Conclusion**
+
+This project successfully segmented customers into distinct clusters using K-Means clustering. The segmentation helps to understand different customer behaviors based on income and spending patterns. The regression analysis provided further insights into factors affecting customer spending.
+
+Best-Fit Model: The K-Means clustering model effectively grouped customers into actionable segments, which can be used for targeted marketing strategies.
+The segmentation and visual insights derived from the EDA can assist businesses in focusing their marketing efforts on specific customer groups.
+
+### ✒️ **Your Signature**
+
+Sharayu Anuse
diff --git a/Data_Science/customer_segmentation/age_distribution.png b/Data_Science/customer_segmentation/age_distribution.png
diff --git a/Data_Science/customer_segmentation/age_group_counts.png b/Data_Science/customer_segmentation/age_group_counts.png
diff --git a/Data_Science/customer_segmentation/boxplot_spending_score.png b/Data_Science/customer_segmentation/boxplot_spending_score.png
diff --git a/Data_Science/customer_segmentation/customer_segmentation_2D.png b/Data_Science/customer_segmentation/customer_segmentation_2D.png
diff --git a/Data_Science/customer_segmentation/customer_segmentation_3D.png b/Data_Science/customer_segmentation/customer_segmentation_3D.png
diff --git a/Data_Science/customer_segmentation/customer_segmentation_statistical_analysis_report.pdf b/Data_Science/customer_segmentation/customer_segmentation_statistical_analysis_report.pdf
diff --git a/Data_Science/customer_segmentation/gender_distribution.png b/Data_Science/customer_segmentation/gender_distribution.png
diff --git a/Data_Science/customer_segmentation/income_vs_spending_score.png b/Data_Science/customer_segmentation/income_vs_spending_score.png