Health Insurance Cross Sell Prediction #709

Merged
merged 5 commits into from
Jul 20, 2024
29 changes: 29 additions & 0 deletions Health Insurance Cross Sell Prediction/Dataset/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
# Health Insurance Cross Sell Prediction Dataset
## Overview
This dataset contains information related to health insurance policyholders, focusing on predicting the probability of a policyholder responding to a cross-selling campaign for vehicle insurance.

## Files
- train.csv: Contains the training data with features and the target variable (Response).
- test.csv: Contains the test data for which predictions need to be made.

## Column Description

- **id**: Unique identifier for each entry.
- **Gender**: Gender of the policyholder.
- **Age**: Age of the policyholder.
- **Driving_License**: Whether the policyholder has a valid driving license (0 - No, 1 - Yes).
- **Region_Code**: Code for the region of the policyholder.
- **Previously_Insured**: Whether the policyholder already has vehicle insurance (0 - No, 1 - Yes).
- **Vehicle_Age**: Age of the vehicle, as a category (`< 1 Year`, `1-2 Year`, `> 2 Years`).
- **Vehicle_Damage**: Whether the vehicle has been damaged in the past (No, Yes).
- **Annual_Premium**: Amount of the annual premium.
- **Policy_Sales_Channel**: Code for the channel through which the policy was purchased.
- **Vintage**: Number of days the policyholder has been associated with the company.
- **Response**: Whether the policyholder responded positively to the cross-selling campaign (0 - No, 1 - Yes).
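
As a rough illustration of the schema above, the following sketch reads rows with Python's standard `csv` module and converts each column to a plausible type. The type choices (for example, treating `Region_Code` and `Policy_Sales_Channel` as floats) and the helper names `parse_row` and `read_rows` are assumptions made for illustration; they are not taken from the notebook.

```python
import csv

# Assumed column types; adjust if the actual CSV differs.
COLUMN_TYPES = {
    "id": int,
    "Gender": str,
    "Age": int,
    "Driving_License": int,
    "Region_Code": float,
    "Previously_Insured": int,
    "Vehicle_Age": str,
    "Vehicle_Damage": str,
    "Annual_Premium": float,
    "Policy_Sales_Channel": float,
    "Vintage": int,
    "Response": int,
}


def parse_row(raw):
    """Convert one csv.DictReader row (all strings) to typed values.

    Columns missing from the row (e.g. Response in test.csv) are skipped.
    """
    return {
        name: cast(raw[name])
        for name, cast in COLUMN_TYPES.items()
        if name in raw
    }


def read_rows(handle):
    """Yield typed rows from an open CSV file handle, one at a time."""
    for raw in csv.DictReader(handle):
        yield parse_row(raw)
```

Because `read_rows` is a generator, it can be used on an open `train.csv` handle without holding the whole file in memory.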

## Summary
- File Size: 662 MB
- Number of Records: 11.5 Million
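
With a file of this size, it can help to compute simple statistics in a streaming fashion rather than loading everything at once. The sketch below counts how often `Response` equals 1; the function name `positive_rate` is hypothetical, and the approach is only one option among several.

```python
import csv


def positive_rate(handle, target="Response"):
    """Stream a CSV file and return the fraction of rows where target is 1."""
    total = 0
    positives = 0
    for row in csv.DictReader(handle):
        total += 1
        positives += int(row[target])
    if total == 0:
        raise ValueError("no rows found")
    return positives / total
```

A typical call would be `positive_rate(open("train.csv", newline=""))`, assuming the file has been downloaded to the working directory.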

## Dataset
This dataset can be accessed from [Kaggle](https://www.kaggle.com/competitions/playground-series-s4e7/data?select=train.csv).
80 changes: 80 additions & 0 deletions Health Insurance Cross Sell Prediction/Model/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
# Health Insurance Cross Sell Prediction-Model

## 📝 Description
This folder contains the pre-trained machine learning models and scripts used for predicting whether a customer will buy a vehicle insurance policy. The goal is to automatically categorize customers based on the likelihood of purchasing the insurance, helping to target potential buyers effectively.

## 📂 Contents
- Health_Insurance_Cross_Sell_Prediction.ipynb: Jupyter Notebook containing the complete process of data preprocessing, model training, evaluation, and visualization.
- model.pkl: Pre-trained model used for prediction.
- scaler.pkl: Pre-fitted scaler used to standardize input features before prediction.
- README.md: This document.

## 🎯 Goal
The goal of this prediction project is to enhance understanding of customer behavior by organizing and analyzing data on various features. By automatically classifying customers based on their likelihood of purchasing insurance, the project aims to provide insights into potential buyers and improve targeting strategies.

## 🧮 What I Did
In this prediction project, various models were evaluated to find the most effective one for classifying customers. The models evaluated include:

- Logistic Regression
```
Description: A linear model used for binary classification. It estimates probabilities using a logistic function.
Performance: Achieved an accuracy of 88%.
```

- XGBoost
```
Description: An implementation of gradient boosted decision trees designed for speed and performance.
Performance: Achieved an accuracy of 88%.
```

- Naive Bayes
```
Description: A probabilistic classifier based on Bayes' theorem with strong independence assumptions.
Performance: Achieved an accuracy of 64%.
```
- LightGBM
```
Description: A Light Gradient Boosting Machine (LightGBM) was trained for high performance and efficiency with large datasets.
Performance: Achieved an accuracy of 88%.
```
- Neural Network
```
Description: A basic feedforward neural network used for classification tasks.
Performance: Achieved an accuracy of 88%.
```
- Ridge Classifier
```
Description: A linear classifier with L2 regularization to avoid overfitting.
Performance: Achieved an accuracy of 88%.
```
- Stochastic Gradient Descent (SGD) Classifier
```
Description: A linear classifier optimized using stochastic gradient descent.
Performance: Achieved an accuracy of 88%.
```

## Data Preprocessing and Feature Engineering
- Data Cleaning: Normalized data, removed missing values, and handled duplicates.
- Feature Engineering: Created new features such as interaction terms, and performed encoding for categorical variables.
- Data Scaling: Standardized numerical features to ensure consistent scaling.
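
The encoding and scaling steps can be sketched roughly as follows. The category-to-integer mappings mirror the ones hard-coded in `Webapp/app.py`; whether the notebook produces them the same way is an assumption. `standardize` uses the population standard deviation, which is the convention of scikit-learn's `StandardScaler`. The helper names are illustrative only.

```python
from statistics import fmean, pstdev

# Mappings taken from Webapp/app.py; assumed to match the training pipeline.
GENDER = {"Female": 0, "Male": 1}
VEHICLE_AGE = {"1-2 Year": 0, "< 1 Year": 1, "> 2 Years": 2}
VEHICLE_DAMAGE = {"No": 0, "Yes": 1}


def encode(row):
    """Return a copy of row with the three categorical columns as integers."""
    out = dict(row)
    out["Gender"] = GENDER[row["Gender"]]
    out["Vehicle_Age"] = VEHICLE_AGE[row["Vehicle_Age"]]
    out["Vehicle_Damage"] = VEHICLE_DAMAGE[row["Vehicle_Damage"]]
    return out


def standardize(values):
    """Scale values to zero mean and unit variance (population std)."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]
```

In practice the scaler should be fitted on the training split only and then reused for validation, test, and live inputs, which is why a fitted scaler is saved alongside the model.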


## Model Performance Analysis
- Training and Validation: Evaluated models based on accuracy, precision, recall, and F1 score to select the best-performing model.
- Best Model: The best-performing model, LightGBM, has been saved as model.pkl and is ready for deployment.

## 📈 Performance of the Models Based on Accuracy Scores
- Logistic Regression: Accuracy: 88%
- XGBoost: Accuracy: 88%
- Naive Bayes: Accuracy: 64%
- LightGBM: Accuracy: 88%
- Neural Network: Accuracy: 88%
- Ridge Classifier: Accuracy: 88%
- SGD Classifier: Accuracy: 88%
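
Accuracy figures are easier to interpret next to a trivial reference point. The sketch below computes the accuracy of always predicting the most frequent class. If most policyholders in the data do not respond (an assumption worth checking against `train.csv`), this baseline can already be high, so precision, recall, and F1 are worth reading alongside accuracy. The function name and the example labels are made up for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    if not labels:
        raise ValueError("labels must not be empty")
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)
```

For a made-up label list with nine zeros and one one, the function returns 0.9, even though such a "model" never identifies a single positive case.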

## 📢 Conclusion
The Health Insurance Cross Sell Prediction project demonstrates the effectiveness of machine learning models, particularly LightGBM, in accurately predicting customer behavior. The models help in organizing and prioritizing customer data, providing valuable insights for stakeholders.

## ✒️ Signature
Tanuj Saxena [LinkedIn](https://www.linkedin.com/in/tanuj-saxena-970271252/)

Large diffs are not rendered by default.

Binary file not shown.
Binary file not shown.
92 changes: 92 additions & 0 deletions Health Insurance Cross Sell Prediction/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,92 @@
# Health Insurance Cross Sell Prediction
Dive into the intricacies of predicting which health insurance customers are likely to purchase vehicle insurance based on their profiles. This project utilizes various machine learning models to analyze customer data and provide insights that can help insurance companies tailor their marketing strategies and improve customer engagement.
<img width="922" alt="webapp" src="https://github.com/user-attachments/assets/f5de12fa-0144-4f3f-92dd-94c652f6e49c">

## 📝 Abstract
Health Insurance Cross Sell Prediction involves using machine learning algorithms to predict whether an existing health insurance customer will also purchase vehicle insurance. This analysis provides valuable insights into customer behavior and helps insurance companies make informed decisions to optimize their marketing and sales strategies.

## 🔍 Methodology
**Importing Libraries**

- Libraries such as NumPy, Pandas, Scikit-Learn, LightGBM, XGBoost, and others are imported for data manipulation, visualization, and machine learning model building.

**Loading the Dataset**

- The dataset contains customer information with various features such as Gender, Age, Driving_License, Region_Code, Previously_Insured, Vehicle_Age, Vehicle_Damage, Annual_Premium, Policy_Sales_Channel, Vintage, and Response.

**Data Preprocessing**

- Prepare the data for analysis: handle missing values, encode categorical variables, engineer features, split the data into training and test sets, and standardize numerical features so that the data is in a suitable format for the machine learning algorithms.
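
For the train-test split step, one option is a stratified split, which keeps the share of each `Response` class roughly equal in both parts. The sketch below is a minimal standard-library version; whether the notebook stratifies its split is not stated, so treat this as an assumption. The function name, the 20% test fraction, and the seed are illustrative.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), keeping label proportions roughly equal."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)
```

With scikit-learn, the same idea is available through `train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)`.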

**Training the Models**

- Each model is trained on the training dataset and evaluated using metrics such as accuracy, precision, recall, and F1 score. The models used include:
  - Logistic Regression
  - XGBoost
  - Naive Bayes
  - LightGBM
  - Neural Network
  - Ridge Classifier
  - SGD Classifier

**Model Performance Analysis**

- Training and validation loss and accuracy are plotted to visualize the models' performance.

**Model Prediction**

- The model is given a test dataset to check the accuracy and precision of the predictions.

**Deploy**

- Using the Streamlit library, the model is deployed for real-time prediction of customer cross-sell potential.

## Project Directory Structure
```bash
Health Insurance Cross Sell Prediction
|- Dataset
   |- dataset_view.csv
   |- dataset_review.csv
   |- README.md
|- Model
   |- Health_Insurance_Cross_Sell_Prediction.ipynb
   |- README.md
   |- model.pkl
   |- scaler.pkl
|- Web App
   |- app.py
   |- README.md
|- Images
   |- correlation.png
   |- dis_age.png
   |- distribution_response.png
   |- f1_cmp.png
   |- gender_response.png
   |- gender.png
   |- histogram.png
   |- model_cmp.png
   |- precision.png
   |- recall_cmp.png
   |- README.md
   |- webapp_run.mp4
|- requirements.txt
|- README.md
```
### How to Use
**Requirements**
- Ensure you have the necessary libraries and dependencies installed. You can find the list of required packages in the `requirements.txt` file.

**Download Data**
- Download the train.csv and test.csv datasets from [Kaggle](https://www.kaggle.com/competitions/playground-series-s4e7/data), as described in the Dataset section of the project.

**Run the Jupyter Notebook**
- Open the provided Jupyter Notebook file and run each cell sequentially. Make sure to update any file paths or configurations as needed for your environment.

**Training and Evaluation**
- Train the models using the provided data and evaluate their performance using metrics such as accuracy, precision, recall, and F1 score.

**Interpret Results**
- Analyze the model's performance using the visualizations and metrics provided in the notebook.

## Connect with Me
Tanuj Saxena [LinkedIn](https://www.linkedin.com/in/tanuj-saxena-970271252/)
39 changes: 39 additions & 0 deletions Health Insurance Cross Sell Prediction/Webapp/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@

https://github.com/user-attachments/assets/d0554627-f3f2-4f1f-a375-e3ba23c46db3

# Health Insurance Cross Sell Prediction Web App
## Goal 🎯
The goal of this prediction web application is to accurately predict whether a customer will buy a vehicle insurance policy. By analyzing customer data, the app helps in organizing and prioritizing potential buyers, detecting trends, and ensuring targeted marketing efforts. It streamlines the process of understanding customer behavior and provides valuable insights for stakeholders. 📈🚗

## Model(s) Used for the Web App 🧮
The model used in this web app is a pre-trained LightGBM classifier, which has been fine-tuned for predicting customer responses. The model analyzes various features such as Age, Gender, Vehicle_Age, and others to predict the likelihood of purchasing insurance with high accuracy.
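
Since the stated aim is to predict a likelihood, it may be useful to expose the class probability rather than only a 0/1 label. The sketch below assumes the loaded model offers a scikit-learn-style `predict_proba` method (as LightGBM's `LGBMClassifier` does) and that the scaler offers `transform`; whether `model.pkl` actually holds such an object is an assumption. The function name and the `threshold` parameter are hypothetical.

```python
def predict_with_probability(model, scaler, features, threshold=0.5):
    """Return (label, probability of class 1) for one feature vector.

    Assumes scaler.transform and model.predict_proba follow the
    scikit-learn conventions; features is a flat list of numeric values
    in the same column order used during training.
    """
    scaled = scaler.transform([features])
    probability = float(model.predict_proba(scaled)[0][1])
    label = 1 if probability >= threshold else 0
    return label, probability
```

Lowering `threshold` trades precision for recall, which can matter when positive responses are rare.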

## Video Demonstration 🎥


https://github.com/user-attachments/assets/d39b0cd7-9b31-49c9-939e-060409edc74b



## How to Run the Web App
### Requirements
Ensure you have the necessary libraries and dependencies installed. You can find the list of required packages in the requirements.txt file.

### Installation
**Clone the repository:**
```bash
gh repo clone tanuj437/Health-Insurance-Cross-Sell-Prediction
cd Health-Insurance-Cross-Sell-Prediction/WebApp
```
**Install the Dependencies:**
```bash
pip install -r requirements.txt
```
**Run the Streamlit app:**
```bash
streamlit run app.py
```
### Signature ✒️
Tanuj Saxena

[![LinkedIn](https://img.shields.io/badge/LinkedIn-%230077B5.svg?logo=linkedin&logoColor=white)](https://www.linkedin.com/in/tanuj-saxena-970271252/)
56 changes: 56 additions & 0 deletions Health Insurance Cross Sell Prediction/Webapp/app.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
import joblib
import numpy as np
import streamlit as st

# Load the fitted scaler
scaler = joblib.load('Model/scaler.pkl')

# Load your trained model
model = joblib.load('Model/model.pkl')


# Function to preprocess and predict the response
def predict_response(gender, age, driving_license, region_code, previously_insured,
                     vehicle_age, vehicle_damage, annual_premium, policy_sales_channel, vintage):
    """Encode the raw inputs, scale them, and return the predicted class (0 or 1)."""
    # Preprocess input data: encode categorical inputs as integers
    gender = 1 if gender == 'Male' else 0
    vehicle_age = {'< 1 Year': 1, '1-2 Year': 0, '> 2 Years': 2}[vehicle_age]
    vehicle_damage = 1 if vehicle_damage == 'Yes' else 0

    # Create a single-row numpy array with the input data
    data = np.array([[gender, age, driving_license, region_code, previously_insured,
                      vehicle_age, vehicle_damage, annual_premium, policy_sales_channel, vintage]])

    # Transform the data using the loaded scaler
    data_scaled = scaler.transform(data)

    # Make predictions
    prediction = model.predict(data_scaled)[0]

    return prediction


def main():
    st.title('Insurance Response Prediction App')
    st.sidebar.title('Input Parameters')

    # Input fields
    gender = st.sidebar.radio('Gender', ['Male', 'Female'])
    age = st.sidebar.slider('Age', 20, 85, 40)
    driving_license = st.sidebar.selectbox('Driving License', [0, 1])
    region_code = st.sidebar.number_input('Region Code', min_value=0.0, max_value=52.0, value=25.0)
    previously_insured = st.sidebar.selectbox('Previously Insured', [0, 1])
    vehicle_age = st.sidebar.selectbox('Vehicle Age', ['< 1 Year', '1-2 Year', '> 2 Years'])
    vehicle_damage = st.sidebar.selectbox('Vehicle Damage', ['No', 'Yes'])
    annual_premium = st.sidebar.number_input('Annual Premium', min_value=2630.0, max_value=540165.0, value=2630.0)
    policy_sales_channel = st.sidebar.number_input('Policy Sales Channel', min_value=1, max_value=163, value=1)
    vintage = st.sidebar.slider('Vintage', 10, 299, 150)

    # Run the prediction when the button is clicked
    if st.button("Predict"):
        prediction = predict_response(gender, age, driving_license, region_code, previously_insured,
                                      vehicle_age, vehicle_damage, annual_premium,
                                      policy_sales_channel, vintage)
        response = 'True' if prediction == 1 else 'False'
        st.success(f"The predicted response is: {response}")


if __name__ == '__main__':
    main()
47 changes: 47 additions & 0 deletions Health Insurance Cross Sell Prediction/images/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
# Image Folder Overview
This folder contains various visualizations that represent different aspects of the dataset and model performance. Below is a detailed description of each visualization.

## Visualizations
- **Correlation Matrix:**
This matrix shows the correlation between different features in the dataset, providing insight into how variables are related to each other.
<img width="566" alt="correlation" src="https://github.com/user-attachments/assets/aee01c65-aa02-4dd2-9c77-eb5451df1dc8">


- **Distribution of Response:**
This chart displays the distribution of the target variable Response. It shows the frequency of positive and negative responses in the dataset.
<img width="397" alt="distribution_response" src="https://github.com/user-attachments/assets/1dbf1549-6a83-4bcb-b285-eb5499df880f">

- **Distribution Based on Gender:**
Description: This chart shows the count of responses broken down by gender, illustrating how the target variable differs between male and female customers.
<img width="404" alt="gender_response" src="https://github.com/user-attachments/assets/ca61e4c1-3430-4594-82e8-abb56285fea4">

- **Gender Distribution:**
Description: This chart shows the distribution of the Gender variable, illustrating the proportion of male and female customers and helping to understand the gender demographics of the dataset.
<img width="230" alt="gender" src="https://github.com/user-attachments/assets/4dc42fba-9bd3-47b0-a49e-ef938858b447">

- **Distribution of Age:**
Description: This chart displays the distribution of the Age variable, showing the age range and frequency of customers in the dataset.
<img width="518" alt="dis_age" src="https://github.com/user-attachments/assets/6bbb4aff-436b-4b43-a2c8-db16b176fe2c">

- **Histograms of Selected Columns:**
Description: These histograms show the distribution of values for selected columns such as Annual_Premium, Policy_Sales_Channel, and Vintage. They provide insight into the data distribution for these features.
<img width="775" alt="histogram" src="https://github.com/user-attachments/assets/851989ee-bcaa-4ab6-88d0-2f87c56a961f">

- **F1 Score Comparison:**
Description: This bar chart compares the F1 scores of different models used in the analysis. The F1 score is the harmonic mean of precision and recall, so it is high only when both are high.
<img width="587" alt="f1_cmp" src="https://github.com/user-attachments/assets/51c4ddf5-e032-4cbc-af2d-66cfda1be209">

- **Recall Comparison:**
Description: This bar chart compares the recall scores of different models used in the analysis. Recall measures the ability of a model to identify all relevant instances in the dataset.
<img width="572" alt="recall_cmp" src="https://github.com/user-attachments/assets/a4329cb8-f658-42d3-8dd6-952b52535656">

- **Precision Comparison:**
Description: This bar chart compares the precision scores of different models used in the analysis. Precision measures the accuracy of the positive predictions made by the model.
<img width="584" alt="precision_cmp" src="https://github.com/user-attachments/assets/4bb39a4c-1aec-4f12-9319-a67e481e111c">

- **Accuracy Comparison:**
Description: This bar chart compares the accuracy scores of different models used in the analysis. Accuracy measures the overall correctness of the model's predictions.
<img width="482" alt="model_cmp" src="https://github.com/user-attachments/assets/c63dd77d-0e85-4acb-a847-df8aa6e20fe0">
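
To make the metrics behind the comparison charts concrete, the sketch below computes precision, recall, and F1 from true and predicted labels using only the standard library. It is a minimal illustration of the definitions, not the code that produced the charts; the function name is hypothetical, and undefined ratios are returned as 0.0 by choice.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Ratios with an empty denominator are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A model that predicts the negative class for every row gets a recall and F1 of 0.0 under this definition, regardless of its accuracy.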

### Usage
These visualizations provide a comprehensive view of the dataset's characteristics and the performance of the various models used for cross-sell prediction. They can be used to gain insights into customer demographics, feature distributions, and model effectiveness, aiding in identifying areas for improvement in the analysis.
Binary file not shown.
11 changes: 11 additions & 0 deletions Health Insurance Cross Sell Prediction/requirement.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
pandas==1.3.3
numpy==1.21.2
scikit-learn==0.24.2
joblib==1.0.1
xgboost==1.4.2
lightgbm==3.2.1
tensorflow==2.6.0
streamlit==0.86.0
matplotlib==3.4.3
seaborn==0.11.2
