Merge pull request #709 from tanuj437/main
Health Insurance Cross Sell Prediction
Showing 24 changed files with 355 additions and 0 deletions.
@@ -0,0 +1,29 @@ | ||
# Health Insurance Cross Sell Prediction Dataset | ||
## Overview | ||
This dataset contains information related to health insurance policyholders, focusing on predicting the probability of a policyholder responding to a cross-selling campaign for vehicle insurance. | ||
|
||
## Files
- train.csv: Contains the training data with features and the target variable (Response). | ||
- test.csv: Contains the test data for which predictions need to be made. | ||
|
||
## Column Description | ||
|
||
- **id**: Unique identifier for each entry. | ||
- **Gender**: Gender of the policyholder. | ||
- **Age**: Age of the policyholder. | ||
- **Driving_License**: Whether the policyholder has a valid driving license (0 - No, 1 - Yes). | ||
- **Region_Code**: Code for the region of the policyholder. | ||
- **Previously_Insured**: Whether the policyholder already has vehicle insurance (0 - No, 1 - Yes). | ||
- **Vehicle_Age**: Age of the vehicle, as a category (< 1 Year, 1-2 Year, > 2 Years).
- **Vehicle_Damage**: Whether the vehicle has been damaged in the past (No, Yes).
- **Annual_Premium**: Amount of the annual premium. | ||
- **Policy_Sales_Channel**: Code for the channel through which the policy was purchased. | ||
- **Vintage**: Number of days the policyholder has been associated with the company. | ||
- **Response**: Whether the policyholder responded positively to the cross-selling campaign (0 - No, 1 - Yes). | ||
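As a rough illustration of the schema above, the sketch below parses rows with Python's standard `csv` module and coerces each column to a plausible type. The column types are assumptions inferred from the descriptions, and the two sample rows are made up for the example; neither is taken from the actual `train.csv`.

```python
import csv
import io

# Assumed column types, inferred from the column descriptions (not verified against train.csv).
COLUMN_TYPES = {
    "id": int,
    "Gender": str,
    "Age": int,
    "Driving_License": int,
    "Region_Code": float,
    "Previously_Insured": int,
    "Vehicle_Age": str,
    "Vehicle_Damage": str,
    "Annual_Premium": float,
    "Policy_Sales_Channel": float,
    "Vintage": int,
    "Response": int,
}

# Hypothetical sample rows, made up for this example.
SAMPLE_CSV = """id,Gender,Age,Driving_License,Region_Code,Previously_Insured,Vehicle_Age,Vehicle_Damage,Annual_Premium,Policy_Sales_Channel,Vintage,Response
0,Male,21,1,35.0,0,1-2 Year,Yes,65101.0,124.0,187,0
1,Female,43,1,28.0,0,> 2 Years,Yes,58911.0,26.0,288,1
"""


def parse_rows(text):
    """Parse CSV text into a list of dicts with typed values."""
    reader = csv.DictReader(io.StringIO(text))
    missing = set(COLUMN_TYPES) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [
        {name: cast(row[name]) for name, cast in COLUMN_TYPES.items()}
        for row in reader
    ]


rows = parse_rows(SAMPLE_CSV)
```

Checking for missing columns up front gives a clearer error than a `KeyError` raised midway through a large file.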
|
||
## Summary | ||
- File Size: 662 MB | ||
- Number of Records: 11.5 Million | ||
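With a file of this size, it may be more practical to stream rows than to load everything into memory at once. The sketch below counts the values of the `Response` column one row at a time using only the standard library. The in-memory sample stands in for the real file and its rows are made up.

```python
import csv
import io
from collections import Counter


def count_responses(lines):
    """Count values of the Response column while reading one row at a time."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row["Response"]] += 1
    return counts


# Hypothetical stand-in for open("train.csv", newline=""); the rows are made up.
sample = io.StringIO("id,Response\n0,0\n1,1\n2,0\n3,0\n")
counts = count_responses(sample)
```

Because `csv.DictReader` accepts any iterable of lines, the same function should work unchanged on a file object opened from disk.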
|
||
## Dataset | ||
This dataset can be accessed from [Kaggle](https://www.kaggle.com/competitions/playground-series-s4e7/data?select=train.csv) |
@@ -0,0 +1,80 @@ | ||
# Health Insurance Cross Sell Prediction-Model | ||
|
||
## 📝 Description | ||
This folder contains the pre-trained machine learning models and scripts used for predicting whether a customer will buy a vehicle insurance policy. The goal is to automatically categorize customers based on the likelihood of purchasing the insurance, helping to target potential buyers effectively. | ||
|
||
## 📂 Contents | ||
- Health_Insurance_Cross_Sell_Prediction.ipynb: Jupyter Notebook containing the complete process of data preprocessing, model training, evaluation, and visualization. | ||
- model.pkl: Pre-trained model used for prediction. | ||
- scaler.pkl: Pre-fitted scaler applied to input features before prediction.
- README.md: This document. | ||
|
||
## 🎯 Goal | ||
The goal of this prediction project is to enhance understanding of customer behavior by organizing and analyzing data on various features. By automatically classifying customers based on their likelihood of purchasing insurance, the project aims to provide insights into potential buyers and improve targeting strategies. | ||
|
||
## 🧮 What I Did | ||
In this prediction project, various models were evaluated to find the most effective one for classifying customers. The models evaluated include: | ||
|
||
- Logistic Regression
```
Description: A linear model used for binary classification. It estimates probabilities using a logistic function.
Performance: Achieved an accuracy of 88%.
```

- XGBoost
```
Description: An implementation of gradient boosted decision trees designed for speed and performance.
Performance: Achieved an accuracy of 88%.
```

- Naive Bayes
```
Description: A probabilistic classifier based on Bayes' theorem with strong independence assumptions.
Performance: Achieved an accuracy of 64%.
```

- LightGBM
```
Description: A Light Gradient Boosting Machine (LightGBM) was trained for high performance and efficiency with large datasets.
Performance: Achieved an accuracy of 88%.
```

- Neural Network
```
Description: A basic feedforward neural network used for classification tasks.
Performance: Achieved an accuracy of 88%.
```

- Ridge Classifier
```
Description: A linear classifier with L2 regularization to avoid overfitting.
Performance: Achieved an accuracy of 88%.
```

- Stochastic Gradient Descent (SGD) Classifier
```
Description: A linear classifier optimized using stochastic gradient descent.
Performance: Achieved an accuracy of 88%.
```
|
||
## Data Preprocessing and Feature Engineering | ||
- Data Cleaning: Normalized data, removed missing values, and handled duplicates. | ||
- Feature Engineering: Created new features such as interaction terms, and performed encoding for categorical variables. | ||
- Data Scaling: Standardized numerical features to ensure consistent scaling. | ||
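The encoding and scaling steps can be sketched in plain Python. The integer codes below are copied from the mapping in the web app's `app.py`; everything else is an assumption for illustration: the sample records are made up, and the scaler is assumed to standardize the way scikit-learn's `StandardScaler` does (zero mean, unit variance, population standard deviation). The notebook itself may do this differently.

```python
from statistics import fmean, pstdev

# Encodings copied from the mapping used in the web app's app.py.
GENDER = {"Male": 1, "Female": 0}
VEHICLE_AGE = {"< 1 Year": 1, "1-2 Year": 0, "> 2 Years": 2}
VEHICLE_DAMAGE = {"Yes": 1, "No": 0}


def encode(record):
    """Replace categorical strings with their integer codes."""
    encoded = dict(record)
    encoded["Gender"] = GENDER[record["Gender"]]
    encoded["Vehicle_Age"] = VEHICLE_AGE[record["Vehicle_Age"]]
    encoded["Vehicle_Damage"] = VEHICLE_DAMAGE[record["Vehicle_Damage"]]
    return encoded


def standardize(values):
    """Scale values to zero mean and unit variance (population standard deviation)."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical records, made up for this example.
records = [
    {"Gender": "Male", "Vehicle_Age": "1-2 Year", "Vehicle_Damage": "Yes", "Age": 20},
    {"Gender": "Female", "Vehicle_Age": "> 2 Years", "Vehicle_Damage": "No", "Age": 40},
    {"Gender": "Male", "Vehicle_Age": "< 1 Year", "Vehicle_Damage": "No", "Age": 60},
]
encoded = [encode(r) for r in records]
scaled_age = standardize([r["Age"] for r in encoded])
```

At prediction time the mean and standard deviation should come from the training data rather than from the new inputs, which is the reason the fitted scaler is saved alongside the model.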
|
||
|
||
## Model Performance Analysis | ||
- Training and Validation: Evaluated models based on accuracy, precision, recall, and F1 score to select the best-performing model. | ||
- Best Model: LightGBM was selected as the best-performing model and saved as model.pkl, ready for deployment.
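To make the four metrics concrete, here is a minimal sketch that computes them from binary labels. The labels and predictions are made up for illustration and are not results from the notebook.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels and predictions, made up for this example.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

In this toy example accuracy is 0.7 while recall is only about 0.33, which shows why precision, recall, and F1 are worth reading alongside the accuracy figures.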
|
||
## 📈 Performance of the Models Based on Accuracy Scores | ||
- Logistic Regression: Accuracy: 88% | ||
- XGBoost: Accuracy: 88% | ||
- Naive Bayes: Accuracy: 64% | ||
- LightGBM: Accuracy: 88% | ||
- Neural Network: Accuracy: 88% | ||
- Ridge Classifier: Accuracy: 88% | ||
- SGD Classifier: Accuracy: 88% | ||
|
||
## 📢 Conclusion | ||
The Health Insurance Cross Sell Prediction project demonstrates the effectiveness of machine learning models, particularly LightGBM, in accurately predicting customer behavior. The models help in organizing and prioritizing customer data, providing valuable insights for stakeholders. | ||
|
||
## ✒️ Signature
Tanuj Saxena [LinkedIn](https://www.linkedin.com/in/tanuj-saxena-970271252/) |
1 change: 1 addition & 0 deletions
1
Health Insurance Cross Sell Prediction/Model/health-insurance-cross-sell-prediction.ipynb
Large diffs are not rendered by default.
@@ -0,0 +1,92 @@ | ||
# Health Insurance Cross Sell Prediction | ||
Predict which health insurance customers are likely to purchase vehicle insurance, based on their profiles. This project uses several machine learning models to analyze customer data and provide insights that can help insurance companies tailor their marketing strategies and improve customer engagement.
<img width="922" alt="webapp" src="https://github.com/user-attachments/assets/f5de12fa-0144-4f3f-92dd-94c652f6e49c"> | ||
|
||
## 📝 Abstract | ||
Health Insurance Cross Sell Prediction uses machine learning algorithms to predict whether an existing health insurance customer will also purchase vehicle insurance. This analysis provides insights into customer behavior and helps insurance companies make informed decisions to optimize their marketing and sales strategies.
|
||
## 🔍 Methodology | ||
**Importing Libraries** | ||
|
||
- Libraries such as NumPy, Pandas, Scikit-Learn, LightGBM, XGBoost, and others are imported for data manipulation, visualization, and machine learning model building. | ||
|
||
**Loading the Dataset** | ||
|
||
- The dataset contains customer information with various features such as Gender, Age, Driving_License, Region_Code, Previously_Insured, Vehicle_Age, Vehicle_Damage, Annual_Premium, Policy_Sales_Channel, Vintage, and Response. | ||
|
||
**Data Preprocessing** | ||
|
||
- Prepare the data for analysis: handle missing values, encode categorical variables, engineer features, split the data into training and test sets, and standardize numerical features so that the data is in a suitable format for the machine learning algorithms.
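The train-test split step can be sketched as follows. This is only an illustration under stated assumptions: the README does not give the split ratio or say whether the split is stratified, so the 80/20 ratio, the stratification, the seed, and the sample labels are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices), keeping class proportions similar."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical labels: 80 negative and 20 positive responses.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
```

Splitting per class keeps the share of positive responses roughly equal in both sets, which matters when one class is much rarer than the other.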
|
||
**Training the Models** | ||
|
||
- Each model is trained on the training dataset and evaluated using metrics such as accuracy, precision, recall, and F1 score. The models used include:
  - Logistic Regression
  - XGBoost
  - Naive Bayes
  - LightGBM
  - Neural Network
  - Ridge Classifier
  - SGD Classifier
|
||
**Model Performance Analysis** | ||
|
||
- Training and validation loss and accuracy are plotted to visualize the models' performance. | ||
|
||
**Model Prediction** | ||
|
||
- The model is given a test dataset to check the accuracy and precision of the predictions. | ||
|
||
**Deploy** | ||
|
||
- Using the Streamlit library, the model is deployed for real-time prediction of customer cross-sell potential. | ||
|
||
## Project Directory Structure | ||
```bash
Health Insurance Cross Sell Prediction
|- Dataset
   |- dataset_view.csv
   |- dataset_review.csv
   |- README.md
|- Model
   |- Health_Insurance_Cross_Sell_Prediction.ipynb
   |- README.md
   |- model.pkl
   |- scaler.pkl
|- Web App
   |- app.py
   |- README.md
|- Images
   |- correlation.png
   |- dis_age.png
   |- distribution_response.png
   |- f1_cmp.png
   |- gender_response.png
   |- gender.png
   |- histogram.png
   |- model_cmp.png
   |- precision.png
   |- recall_cmp.png
   |- README.md
|- webapp_run.mp4
|- requirements.txt
|- README.md
```
### How to Use | ||
**Requirements** | ||
- Ensure you have the necessary libraries and dependencies installed. You can find the list of required packages in the `requirements.txt` file. | ||
|
||
**Download Data** | ||
- Download the train.csv and test.csv datasets from [Kaggle](https://www.kaggle.com/competitions/playground-series-s4e7/data), as described in the dataset section of the project.
|
||
**Run the Jupyter Notebook** | ||
- Open the provided Jupyter Notebook file and run each cell sequentially. Make sure to update any file paths or configurations as needed for your environment. | ||
|
||
**Training and Evaluation** | ||
- Train the models using the provided data and evaluate their performance using metrics such as accuracy, precision, recall, and F1 score. | ||
|
||
**Interpret Results** | ||
- Analyze the model's performance using the visualizations and metrics provided in the notebook. | ||
|
||
## Connect with Me | ||
Tanuj Saxena [LinkedIn](https://www.linkedin.com/in/tanuj-saxena-970271252/) |
@@ -0,0 +1,39 @@ | ||
|
||
https://github.com/user-attachments/assets/d0554627-f3f2-4f1f-a375-e3ba23c46db3 | ||
|
||
# Health Insurance Cross Sell Prediction Web App | ||
## Goal 🎯 | ||
The goal of this prediction web application is to accurately predict whether a customer will buy a vehicle insurance policy. By analyzing customer data, the app helps in organizing and prioritizing potential buyers, detecting trends, and ensuring targeted marketing efforts. It streamlines the process of understanding customer behavior and provides valuable insights for stakeholders. 📈🚗 | ||
|
||
## Model(s) Used for the Web App 🧮 | ||
The web app uses a pre-trained LightGBM classifier (saved as model.pkl) that was trained to predict customer responses. The model takes features such as Age, Gender, Vehicle_Age, and others and predicts whether the customer is likely to purchase insurance; the model README reports an accuracy of 88% for it.
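The prediction flow can be sketched as: arrange the inputs in a fixed order, scale them, then ask the model for a class. The feature order below follows the array built in `app.py`. The `IdentityScaler` and `ToyModel` classes are hypothetical stand-ins for the objects stored in `scaler.pkl` and `model.pkl`, and the toy decision rule is invented purely so the example runs without those files.

```python
# Feature order taken from the array built in app.py.
FEATURE_ORDER = [
    "Gender", "Age", "Driving_License", "Region_Code", "Previously_Insured",
    "Vehicle_Age", "Vehicle_Damage", "Annual_Premium", "Policy_Sales_Channel", "Vintage",
]


def predict_response(model, scaler, features):
    """Order the encoded features, scale the single row, and return the predicted class."""
    row = [features[name] for name in FEATURE_ORDER]
    scaled = scaler.transform([row])
    return int(model.predict(scaled)[0])


class IdentityScaler:
    """Hypothetical stand-in for the fitted scaler; returns rows unchanged."""

    def transform(self, rows):
        return rows


class ToyModel:
    """Hypothetical stand-in for the trained model, using an invented rule."""

    def predict(self, rows):
        # Index 4 is Previously_Insured and index 6 is Vehicle_Damage.
        return [1 if row[4] == 0 and row[6] == 1 else 0 for row in rows]


# Hypothetical, already encoded inputs.
features = {
    "Gender": 1, "Age": 40, "Driving_License": 1, "Region_Code": 25.0,
    "Previously_Insured": 0, "Vehicle_Age": 0, "Vehicle_Damage": 1,
    "Annual_Premium": 2630.0, "Policy_Sales_Channel": 1, "Vintage": 150,
}
prediction = predict_response(ToyModel(), IdentityScaler(), features)
```

Passing the model and scaler in as arguments, rather than loading them inside the function, makes the flow easy to test with stand-ins like these.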
|
||
## Video Demonstration 🎥 | ||
|
||
|
||
https://github.com/user-attachments/assets/d39b0cd7-9b31-49c9-939e-060409edc74b | ||
|
||
|
||
|
||
## How to Run the Web App | ||
### Requirements
Ensure you have the necessary libraries and dependencies installed. You can find the list of required packages in the requirements.txt file. | ||
|
||
### Installation | ||
**Clone the repository:**
```bash
gh repo clone tanuj437/Health-Insurance-Cross-Sell-Prediction
cd Health-Insurance-Cross-Sell-Prediction/WebApp
```
**Install the Dependencies:**
```bash
pip install -r requirements.txt
```
**Run the Streamlit app:**
```bash
streamlit run app.py
```
### Signature ✒️ | ||
Tanuj Saxena | ||
|
||
[![LinkedIn](https://img.shields.io/badge/LinkedIn-%230077B5.svg?logo=linkedin&logoColor=white)](https://www.linkedin.com/in/tanuj-saxena-970271252/) |
@@ -0,0 +1,56 @@ | ||
import joblib
import numpy as np
import streamlit as st

# Load the fitted scaler and the trained model.
# Both paths are relative to the directory from which the app is launched.
scaler = joblib.load('Model/scaler.pkl')
model = joblib.load('Model/model.pkl')


def predict_response(gender, age, driving_license, region_code, previously_insured,
                     vehicle_age, vehicle_damage, annual_premium, policy_sales_channel, vintage):
    """Encode the raw inputs, scale them, and return the predicted class."""
    # Encode categorical inputs as integers
    gender = 1 if gender == 'Male' else 0
    vehicle_age = {'< 1 Year': 1, '1-2 Year': 0, '> 2 Years': 2}[vehicle_age]
    vehicle_damage = 1 if vehicle_damage == 'Yes' else 0

    # Build a single-row array; the column order must match the order used to fit the scaler
    data = np.array([[gender, age, driving_license, region_code, previously_insured,
                      vehicle_age, vehicle_damage, annual_premium, policy_sales_channel, vintage]])

    # Transform the data using the loaded scaler
    data_scaled = scaler.transform(data)

    # Predict the class for the single row
    return model.predict(data_scaled)[0]


def main():
    st.title('Insurance Response Prediction App')
    st.sidebar.title('Input Parameters')

    # Input fields
    gender = st.sidebar.radio('Gender', ['Male', 'Female'])
    age = st.sidebar.slider('Age', 20, 85, 40)
    driving_license = st.sidebar.selectbox('Driving License', [0, 1])
    region_code = st.sidebar.number_input('Region Code', min_value=0.0, max_value=52.0, value=25.0)
    previously_insured = st.sidebar.selectbox('Previously Insured', [0, 1])
    vehicle_age = st.sidebar.selectbox('Vehicle Age', ['< 1 Year', '1-2 Year', '> 2 Years'])
    vehicle_damage = st.sidebar.selectbox('Vehicle Damage', ['No', 'Yes'])
    annual_premium = st.sidebar.number_input('Annual Premium', min_value=2630.0, max_value=540165.0, value=2630.0)
    policy_sales_channel = st.sidebar.number_input('Policy Sales Channel', min_value=1, max_value=163, value=1)
    vintage = st.sidebar.slider('Vintage', 10, 299, 150)

    # Run the prediction when the button is clicked
    if st.button("Predict"):
        prediction = predict_response(gender, age, driving_license, region_code, previously_insured,
                                      vehicle_age, vehicle_damage, annual_premium,
                                      policy_sales_channel, vintage)
        response = 'True' if prediction == 1 else 'False'
        st.success(f"The predicted response is: {response}")


if __name__ == '__main__':
    main()
@@ -0,0 +1,47 @@ | ||
# Image Folder Overview | ||
This folder contains various visualizations that represent different aspects of the dataset and model performance. Below is a detailed description of each visualization. | ||
|
||
## Visualizations | ||
- **Correlation Matrix:** | ||
This matrix shows the correlation between different features in the dataset, providing insight into how variables are related to each other. | ||
<img width="566" alt="correlation" src="https://github.com/user-attachments/assets/aee01c65-aa02-4dd2-9c77-eb5451df1dc8"> | ||
|
||
|
||
- **Distribution of Response:** | ||
This chart displays the distribution of the target variable Response. It shows the frequency of positive and negative responses in the dataset. | ||
<img width="397" alt="distribution_response" src="https://github.com/user-attachments/assets/1dbf1549-6a83-4bcb-b285-eb5499df880f"> | ||
|
||
- **Distribution Based on Gender:**
Description: This visualization shows the count of responses by gender, illustrating how responses differ between male and female customers.
<img width="404" alt="gender_response" src="https://github.com/user-attachments/assets/ca61e4c1-3430-4594-82e8-abb56285fea4">

- **Gender Distribution:**
Description: This chart shows the distribution of the Gender variable, illustrating the proportion of male and female customers in the dataset.
<img width="230" alt="gender" src="https://github.com/user-attachments/assets/4dc42fba-9bd3-47b0-a49e-ef938858b447">
|
||
- **Distribution of Age:** | ||
Description: This chart displays the distribution of the Age variable, showing the age range and frequency of customers in the dataset. | ||
<img width="518" alt="dis_age" src="https://github.com/user-attachments/assets/6bbb4aff-436b-4b43-a2c8-db16b176fe2c"> | ||
|
||
- **Histograms of Selected Columns:** | ||
Description: These histograms show the distribution of values for selected columns such as Annual_Premium, Policy_Sales_Channel, and Vintage. They provide insight into the data distribution for these features. | ||
<img width="775" alt="histogram" src="https://github.com/user-attachments/assets/851989ee-bcaa-4ab6-88d0-2f87c56a961f"> | ||
|
||
- **F1 Score Comparison:** | ||
Description: This bar chart compares the F1 scores of different models used in the analysis. The F1 score is the harmonic mean of precision and recall, so it is high only when both are high.
<img width="587" alt="f1_cmp" src="https://github.com/user-attachments/assets/51c4ddf5-e032-4cbc-af2d-66cfda1be209"> | ||
|
||
- **Recall Comparison:** | ||
Description: This bar chart compares the recall scores of different models used in the analysis. Recall measures the ability of a model to identify all relevant instances in the dataset. | ||
<img width="572" alt="recall_cmp" src="https://github.com/user-attachments/assets/a4329cb8-f658-42d3-8dd6-952b52535656"> | ||
|
||
- **Precision Comparison:** | ||
Description: This bar chart compares the precision scores of different models used in the analysis. Precision measures the accuracy of the positive predictions made by the model. | ||
<img width="584" alt="precision_cmp" src="https://github.com/user-attachments/assets/4bb39a4c-1aec-4f12-9319-a67e481e111c"> | ||
|
||
- **Accuracy Comparison:** | ||
Description: This bar chart compares the accuracy scores of different models used in the analysis. Accuracy measures the overall correctness of the model's predictions. | ||
<img width="482" alt="model_cmp" src="https://github.com/user-attachments/assets/c63dd77d-0e85-4acb-a847-df8aa6e20fe0"> | ||
|
||
### Usage | ||
These visualizations provide a broad view of the dataset's characteristics and the performance of the models used for cross-sell prediction. They can be used to gain insights into customer demographics, feature distributions, and model effectiveness, and to identify areas for improvement in the analysis.
Binary file added
BIN
+24.1 KB
Health Insurance Cross Sell Prediction/images/distribution_response.png
@@ -0,0 +1,11 @@ | ||
pandas==1.3.3 | ||
numpy==1.21.2 | ||
scikit-learn==0.24.2 | ||
joblib==1.0.1 | ||
xgboost==1.4.2 | ||
lightgbm==3.2.1 | ||
tensorflow==2.6.0 | ||
streamlit==0.86.0 | ||
matplotlib==3.4.3 | ||
seaborn==0.11.2 | ||
|