Report - Fraud Analytics: Credit Card Transaction Fraud Detection

(There is no dataset provided. I only uploaded the code I used as well as the result. Please look at the attached PDF for a full report.)

Executive summary

We utilized credit card transaction data from 2010 to create a credit card transaction fraud detection model that incorporated data from both credit card companies and merchants. Through out-of-time validation, we determined that the optimal threshold for fraud detection is a score cutoff of 4% of the total population. By adopting this threshold, we expect to achieve an estimated overall savings of $20,148,000, which considers the costs and benefits of fraud detection. Specifically, we anticipate a gain of $400 for every instance of fraud successfully detected, as well as a loss of $20 for each false positive.

Overview

The report included in this repository has 8 sections, as follows:

Section 1: Description of Data

This is transaction data containing credit card transactions in 2010. There are 96,753 transactions with 10 fields as follows:

Date - Transaction date
Amount - Transaction amount
Recnum - A record number. Ordinal unique positive integer for each record.
Cardnum - Credit card number
Merchnum - Merchants' ID
Merch description - Merchants' name
Merch state - Merchant's two-letter U.S. state or territory abbreviation
Merch zip - Merchant's 5-digit ZIP code
Transtype - Transaction type (purchase, authorization, etc.)
Fraud - (This is our target variable) A binary field indicating whether the transaction is fraudulent

Section 2: Data Cleaning

The data were filtered to keep only records with Transtype == 'P' before missing-value imputation. Three variables have missing values: Merchnum, Merch state, and Merch zip.
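The filtering step can be sketched as follows. This is a minimal, dependency-free illustration rather than the code used in the report: the function names and the sample records are hypothetical, and the imputation rule itself is not shown because it is not described here.

```python
def keep_purchases(records):
    """Keep only records whose Transtype is 'P'."""
    return [r for r in records if r.get("Transtype") == "P"]


def count_missing(records, fields=("Merchnum", "Merch state", "Merch zip")):
    """Count, per field, the records where the value is None or an empty string."""
    missing = {f: 0 for f in fields}
    for r in records:
        for f in fields:
            if r.get(f) in (None, ""):
                missing[f] += 1
    return missing


# Hypothetical sample records, for illustration only.
sample = [
    {"Recnum": 1, "Transtype": "P", "Merchnum": "123", "Merch state": "TN", "Merch zip": "38118"},
    {"Recnum": 2, "Transtype": "A", "Merchnum": "", "Merch state": "", "Merch zip": ""},
    {"Recnum": 3, "Transtype": "P", "Merchnum": "", "Merch state": "MA", "Merch zip": None},
]

purchases = keep_purchases(sample)
missing_counts = count_missing(purchases)
```

Counting missing values after the filter shows how many records each imputation rule has to handle.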

Section 3: Feature Engineering

A total of 3,792 variables were created from transaction information. Credit card transaction fraud occurs when a credit card is used by someone other than its owner. The supervised algorithm looks for characteristics of fraud cases based on variables created from account and merchant information, for example whether a card has been used an abnormal number of times or for abnormal amounts.

The variables were created from transaction information as follows:

Day of the Week Risk Variable: Fraud case percentage for each day of the week (with statistical smoothing toward the overall average percentage)
Benford’s Law Variables: The ratio of amounts whose first digit is 3-9 to amounts whose first digit is 1-2, normalized by 1.096 according to Benford’s Law (with statistical smoothing toward 1)
Days Since Variables: # days since a transaction with that entity was last seen. The entities list is given below
Frequency Variables: # records with the same entity over the last {0,1,3,7,14,30} days
Amount Variables: Aggregation of "transaction amounts" with the same entity over the last {0,1,3,7,14,30} days, including average, max, total, standard deviation, actual/avg, actual/max, and actual/total
Amount Difference Variables: Aggregation of "differences between recent transaction amount and previous amounts" with the same entity over the last {0,1,3,7,14,30} days, including average, min, max, total, and standard deviation
Relative Velocity Variables: The ratio between {# records, total amounts} with the same entity over the past {0,1} days and the average of {# records, total amounts} with the same entity over the past {3,7,14,30} days
Relative Velocity per Day Since Variables: The ratio between each Relative Velocity variable and the Days Since variable for the same entity
Counts by Entities: For each entity over the past {0,1,3,7,14,30,60} days, # unique values of the other entities
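As an illustration of the Days Since, Frequency, and Amount variables above, the sketch below computes them for one entity at the time of one transaction. It is not the report's code. The function name, the `history` format, and the choice to include the current transaction in each window are assumptions; the output keys only loosely follow the `total_14` / `max_30` naming seen in the selected variables.

```python
from datetime import date, timedelta


def window_features(history, current_date, current_amount,
                    windows=(0, 1, 3, 7, 14, 30)):
    """Compute per-entity features for one transaction.

    history: list of (date, amount) pairs for EARLIER transactions of the
    same entity. The current transaction is counted in every window
    (an assumption of this sketch).
    """
    feats = {}
    past_dates = [d for d, _ in history]
    # Days since the entity was last seen; None if never seen before.
    feats["days_since"] = (
        (current_date - max(past_dates)).days if past_dates else None
    )
    for w in windows:
        start = current_date - timedelta(days=w)
        amounts = [a for d, a in history if start <= d <= current_date]
        amounts.append(current_amount)
        avg = sum(amounts) / len(amounts)
        feats[f"count_{w}"] = len(amounts)
        feats[f"total_{w}"] = sum(amounts)
        feats[f"avg_{w}"] = avg
        feats[f"max_{w}"] = max(amounts)
        # Ratio of the current amount to the window average (guard against 0).
        feats[f"actual_over_avg_{w}"] = current_amount / avg if avg else None
    return feats


# Hypothetical history for one card, for illustration only.
history = [(date(2010, 1, 1), 100.0), (date(2010, 1, 9), 50.0)]
feats = window_features(history, date(2010, 1, 10), 150.0)
```

Repeating this for every entity in the entities list and every window is what multiplies a handful of definitions into thousands of variables.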

Entities list: ['Cardnum', 'Merchnum', 'Merch_description', 'Merch_num_des', 'Merch_num_state', 'Merch_num_zip', 'Merch_des_state', 'Merch_des_zip', 'Merch_state_zip', 'Merch_num_des_state', 'Merch_num_des_zip', 'Merch_num_state_zip', 'Merch_des_state_zip', 'Merch_all_info', 'card_Merchnum', 'card_Merch_description', 'card_Merch_state', 'card_Merch_zip', 'card_Merch_num_des', 'card_Merch_num_state', 'card_Merch_num_zip', 'card_Merch_des_state', 'card_Merch_des_zip', 'card_Merch_state_zip', 'card_Merch_num_des_state', 'card_Merch_num_des_zip', 'card_Merch_num_state_zip', 'card_Merch_des_state_zip', 'card_Merch_all_info']
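The combined entity names suggest that each one is formed by joining several raw fields. A minimal sketch under that assumption is shown below. The field-to-entity mapping, the separator, and the function name are guesses made for illustration and may differ from how the report builds these keys.

```python
# Assumed mapping from a few entity names to the raw fields they combine.
ENTITY_FIELDS = {
    "Merch_num_des": ("Merchnum", "Merch description"),
    "card_Merchnum": ("Cardnum", "Merchnum"),
    "card_Merch_state_zip": ("Cardnum", "Merch state", "Merch zip"),
}


def add_composite_entities(record, entity_fields=ENTITY_FIELDS, sep="|"):
    """Return a copy of `record` with composite entity keys added."""
    out = dict(record)
    for name, fields in entity_fields.items():
        out[name] = sep.join(str(record[f]) for f in fields)
    return out


# Hypothetical record, for illustration only.
record = {
    "Cardnum": "5142190439",
    "Merchnum": "5509006296254",
    "Merch description": "FEDEX SHP",
    "Merch state": "TN",
    "Merch zip": "38118",
}
enriched = add_composite_entities(record)
```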

Section 4: Feature Selection

The first step filters variables independently, based on a univariate measure of how well each one separates fraud from non-fraud. The Kolmogorov-Smirnov (KS) statistic is used for this step. The first step keeps 600 variables.
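A univariate KS filter of this kind can be sketched as below. This is an illustrative pure-Python version, not the report's implementation: the function names are hypothetical, and it assumes the label is 1 for fraud and 0 otherwise. In practice a library routine such as `scipy.stats.ks_2samp` would typically be used to get the same statistic.

```python
from bisect import bisect_right


def ks_statistic(values, labels):
    """Two-sample KS statistic between fraud (1) and non-fraud (0) values."""
    goods = sorted(v for v, y in zip(values, labels) if y == 0)
    bads = sorted(v for v, y in zip(values, labels) if y == 1)
    if not goods or not bads:
        raise ValueError("both classes must be present")
    ks = 0.0
    # Largest gap between the two empirical CDFs over all observed values.
    for x in sorted(set(values)):
        cdf_good = bisect_right(goods, x) / len(goods)
        cdf_bad = bisect_right(bads, x) / len(bads)
        ks = max(ks, abs(cdf_good - cdf_bad))
    return ks


def top_k_by_ks(features, labels, k):
    """Rank features (name -> list of values) by KS and keep the top k names."""
    ranked = sorted(features, key=lambda name: ks_statistic(features[name], labels),
                    reverse=True)
    return ranked[:k]


# Toy data, for illustration only.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "separating": [1, 2, 3, 4, 10, 11, 12, 13],
    "uninformative": [1, 2, 1, 2, 1, 2, 1, 2],
}
kept = top_k_by_ks(features, labels, k=1)
```

A KS value near 1 means the fraud and non-fraud distributions barely overlap; a value near 0 means the variable carries little signal on its own.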

The second step uses a wrapper model with forward selection. LightGBM with n_estimators = 30 and num_leaves = 4 is used for this step. The second step keeps the following 25 variables:

card_Merch_num_state_total_14
card_Merchnum_max_30
card_Merch_des_zip_total_1
Merch_description_total_0
card_Merch_state_zip_total_14
card_Merch_des_state_max_30
card_Merch_des_zip_max_30
card_Merch_des_state_zip_max_30
card_Merch_num_des_zip_total_0
card_Merch_des_state_zip_total_30
card_Merch_num_state_zip_total_0
card_Merch_num_state_zip_total_30
card_Merch_des_zip_total_30
card_Merch_num_zip_total_0
card_Merch_des_zip_total_0
card_Merch_state_zip_total_0
card_Merch_num_state_total_30
Merch_state_zip_variability_avg_14
card_Merch_num_des_total_0
card_Merch_des_state_total_30
card_Merch_des_state_total_0
card_Merch_description_total_0
card_Merch_zip_total_0
card_Merch_num_des_zip_total_14
card_Merch_description_total_30
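Forward selection with a wrapper model can be sketched generically as below. In the report the scoring step would train LightGBM (n_estimators = 30, num_leaves = 4) on the candidate subset; here `score_fn` is a hypothetical stand-in so that the example runs without extra libraries, and the stopping rule (stop when the score no longer improves) is an assumption.

```python
def forward_selection(candidates, score_fn, max_features):
    """Greedy forward selection.

    At each step, add the candidate that gives the best score when combined
    with the already-selected features. Stop at `max_features` or when no
    candidate improves the score.
    """
    selected = []
    remaining = list(candidates)
    best_score = float("-inf")
    while remaining and len(selected) < max_features:
        scored = [(score_fn(selected + [c]), c) for c in remaining]
        top_score, top_feat = max(scored, key=lambda t: t[0])
        if top_score <= best_score:
            break
        selected.append(top_feat)
        remaining.remove(top_feat)
        best_score = top_score
    return selected


# Toy scoring function standing in for "train the wrapper model and evaluate":
# each feature has a made-up value, and every added feature costs 0.5.
toy_value = {"a": 3.0, "b": 2.0, "c": 0.1}


def toy_score(subset):
    return sum(toy_value[f] for f in subset) - 0.5 * len(subset)


chosen = forward_selection(["c", "b", "a"], toy_score, max_features=25)
```

Because each step retrains the model once per remaining candidate, a small, fast model keeps the search over 600 candidates affordable.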

Section 5: Preliminary Models Exploration

Variables from the feature selection process are used to build several supervised models. Logistic regression models serve as a baseline for comparison, and non-linear models are expected to perform better than them. Each model is run 5 times, and models are compared by their average fraud detection rate at 3% of the population on the test dataset.
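The fraud detection rate at 3% of the population can be read as the share of all frauds that fall within the top 3% of records when ranked by model score. A minimal sketch under that reading is shown below; the function name and the rounding of the cutoff count are assumptions.

```python
def fraud_detection_rate(scores, labels, top_fraction=0.03):
    """Share of all frauds captured in the top `top_fraction` of records by score."""
    total_fraud = sum(labels)
    if total_fraud == 0:
        raise ValueError("no fraud cases in labels")
    n_top = max(1, round(top_fraction * len(scores)))
    # Rank records from highest to lowest score and keep the top slice.
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    caught = sum(y for _, y in ranked[:n_top])
    return caught / total_fraud


# Toy data: 100 records whose score increases with the index,
# with frauds placed at indices 99, 98, 50, and 10.
scores = [i / 100 for i in range(100)]
labels = [1 if i in (99, 98, 50, 10) else 0 for i in range(100)]
fdr_at_3 = fraud_detection_rate(scores, labels)
```

Averaging this value over the 5 runs of each model gives the figure used to compare models.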

Section 6: Summary of Results

RandomForestClassifier (n_estimators = 20, min_impurity_decrease = 0.00001, min_samples_leaf = 30, min_samples_split = 60) with 10 variables has been chosen as the final model.

Section 7: Recommended Fraud Score Cutoff

Based on out-of-time validation, we recommend ranking transactions based on the fraud likelihood and classifying the top 4% of the total population as fraudulent, which yields an expected overall savings of $20,148,000. This estimation is generated by considering the costs and benefits of detecting fraud, assuming a gain of $400 for every fraud caught and a loss of $20 for every false positive.
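The cost-benefit calculation behind the cutoff can be sketched as follows, using the $400 gain per fraud caught and $20 loss per false positive stated above. The function names and toy data are hypothetical, and the sketch does not reproduce the $20,148,000 figure, which depends on the out-of-time data and any scaling applied in the full report.

```python
def expected_savings(scores, labels, cutoff_fraction,
                     gain_per_fraud=400.0, loss_per_false_positive=20.0):
    """Savings from flagging the top `cutoff_fraction` of records by score."""
    n_flagged = round(cutoff_fraction * len(scores))
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    caught = sum(y for _, y in ranked[:n_flagged])
    false_positives = n_flagged - caught
    return gain_per_fraud * caught - loss_per_false_positive * false_positives


def best_cutoff(scores, labels, fractions):
    """Pick the cutoff fraction with the highest expected savings."""
    return max(fractions, key=lambda f: expected_savings(scores, labels, f))


# Toy data: 100 records whose score increases with the index,
# with frauds placed at indices 99, 98, and 50.
scores = [i / 100 for i in range(100)]
labels = [1 if i in (99, 98, 50) else 0 for i in range(100)]

savings_at_4 = expected_savings(scores, labels, 0.04)
chosen_cutoff = best_cutoff(scores, labels, [0.02, 0.04, 0.5])
```

Sweeping the cutoff this way shows the trade-off: a wider cutoff catches more fraud but accumulates false-positive costs, so savings peak at an intermediate value.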

Section 8: Summary

We utilized credit card transaction data from 2010 to create a credit card transaction fraud detection model that incorporated data from both credit card companies and merchants. Through out-of-time validation, we determined that the optimal threshold for fraud detection is a score cutoff of 4% of the total population. By adopting this threshold, we expect to achieve an estimated savings of $20,148,000, considering the costs and benefits of fraud detection. Specifically, we anticipate a gain of $400 for every instance of fraud successfully detected, as well as a loss of $20 for each false positive.


kevin-titi/Fraud_Detection_Credit_Card_Transaction
