Ethereum-Fraud-Detection

Detecting Fraudulent Blockchain Accounts on Ethereum with Supervised Machine Learning


Introduction

Since 2021, more than 46,000 people have lost over $1 billion to cryptocurrency scams, nearly 60 times more than in 2018 [1]. The Federal Trade Commission (FTC) found that the top cryptocurrencies used to pay scammers were Bitcoin (70%), Tether (10%) and Ethereum (9%) [1]. With the recent collapse of FTX, a crypto exchange that misused more than $1 billion of clients' funds, it has become ever more important to stay vigilant when navigating the cryptocurrency world [2]. To help deter fraudulent scams, we used supervised machine learning techniques, namely Logistic Regression, Naive Bayes, SVM, XGBoost, LightGBM, MLP, TabNet and Stacking, to detect and predict fraudulent Ethereum accounts. This adds business value by strengthening the fraud-detection features of crypto exchanges and wallets, letting people navigate the cryptocurrency world with confidence and safeguard their personal assets. We set an objective of achieving an F1 score above 90% for our machine learning models in predicting fraudulent accounts on the Ethereum blockchain.


Data

There are two data sources: Kaggle and Etherscan.

Kaggle

The Kaggle dataset was downloaded from https://www.kaggle.com/datasets/vagifa/ethereum-frauddetection-dataset and can be found in ./Data/address_data_k.csv

Etherscan

Data were mined from Etherscan's phishing/hack label page at https://etherscan.io/accounts/label/phish-hack (the data has since been taken off Etherscan, but we saved our copy) and can be found in ./Data/address_data_e.csv

Combined without Time Series

Data from Kaggle and Etherscan are combined and can be found in ./Data/address_data_combined.csv
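For illustration, here is a minimal sketch of how the two labelled CSVs could be concatenated and de-duplicated with pandas. The 'Address' and 'FLAG' column names are assumptions about the file headers, not confirmed from the repo.

```python
import pandas as pd

# Load the two labelled account datasets shipped in this repo.
kaggle = pd.read_csv("./Data/address_data_k.csv")
etherscan = pd.read_csv("./Data/address_data_e.csv")

# Stack both sources and drop any account that appears twice, keeping the
# first occurrence. "Address" (account identifier) and "FLAG" (fraud label)
# are assumed column names and may differ in the actual files.
combined = pd.concat([kaggle, etherscan], ignore_index=True)
combined = combined.drop_duplicates(subset="Address", keep="first")

combined.to_csv("./Data/address_data_combined.csv", index=False)

# Quick class-balance check on the fraud label.
print(combined["FLAG"].value_counts(normalize=True))
```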

Time-Series

One key aspect of the dataset that we realised was missing was the time-series element. Although each observation in our data is a user account, the data was generated by aggregating individual transactions, so valuable information could have been "flattened out". The flow of Ethereum transactions is intrinsically time-series data, carrying signals such as the seasonality of transactions that could be used in our model. This information was extracted using the 'tsfresh' library; the raw transaction data can be found in ./Data/Transaction_data and the newly extracted features in ./Data/new_ts_features_only.csv.
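A minimal sketch of this kind of tsfresh extraction, assuming a long-format transaction table; the file name and the 'address', 'timestamp' and 'value' column names are illustrative, and the notebooks may configure tsfresh differently.

```python
import pandas as pd
from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import impute

# One row per transaction: owning account, time, and Ether value moved.
# File name and column names here are illustrative placeholders.
tx = pd.read_csv("./Data/Transaction_data/transactions.csv")

# tsfresh expects long format: an id column (the account), a sort column
# (time) and a value column; it returns one feature row per account.
features = extract_features(
    tx[["address", "timestamp", "value"]],
    column_id="address",
    column_sort="timestamp",
    column_value="value",
)

# Replace the NaN/inf values tsfresh produces for very short series.
impute(features)

features.to_csv("./Data/new_ts_features_only.csv")
```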

Combined with Time Series

Data from Kaggle and Etherscan including time series can be found in ./Data/address_data_combined_ts.csv

Data Description

We started with a Kaggle dataset of 9841 observations. Each observation is a unique Ethereum account, with each variable being an aggregate statistic over all transactions performed by that unique account, such as total Ether value received or average time between transactions. The data also distinguishes between account-to-account transactions and account-to-smart contract transactions. However, the dataset was highly imbalanced, with only 2179 out of 9841 (22.14%) being marked as fraud. To address the imbalance, we leveraged an API provided by Etherscan, a “Block Explorer and Analytics Platform for Ethereum”. This allowed us to retrieve transactions made by any given account address on the Ethereum blockchain. As a result, the number of fraudulent accounts in our dataset climbed to 4339 observations, making the combined dataset less imbalanced (45.97% fraud).
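As a rough sketch, an address's transaction history can be pulled through Etherscan's documented `txlist` endpoint as below; the per-account feature aggregation is omitted, and `YOUR_API_KEY` is a placeholder.

```python
import requests

ETHERSCAN_API = "https://api.etherscan.io/api"
API_KEY = "YOUR_API_KEY"  # placeholder; free keys are issued by Etherscan

def fetch_transactions(address: str) -> list:
    """Fetch the normal-transaction history of one Ethereum address."""
    params = {
        "module": "account",
        "action": "txlist",      # documented Etherscan account endpoint
        "address": address,
        "startblock": 0,
        "endblock": 99999999,
        "sort": "asc",
        "apikey": API_KEY,
    }
    resp = requests.get(ETHERSCAN_API, params=params, timeout=30)
    resp.raise_for_status()
    payload = resp.json()
    # Etherscan returns status "1" on success; "0" covers both errors and
    # addresses with no transactions.
    return payload["result"] if payload.get("status") == "1" else []
```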


Machine Learning Models

Random Forest

./Models/Random_Forest_Model.ipynb

Logistic Regression

./Models/Logistic_Regression_Model.ipynb

Naive Bayes

./Models/Naive_Bayes_Model.ipynb

Support Vector Machine (SVM)

./Models/Support_Vector_Machine_Model.ipynb

Multi-Layer Perceptron (MLP)

./Models/Multi_Layer_Perceptron_Model.ipynb

eXtreme Gradient Boosting (XGBoost)

./Models/XGBoost_Model.ipynb

TabNet

./Models/TabNet_Model.ipynb

LightGBM

./Models/LightGBM_Model.ipynb

Stacked Ensemble Model without Time Series

./Models/Final_Stacking_Model.ipynb

Stacked Ensemble Model with Time Series

./Models/Final_Stacking_Model_w_ts.ipynb
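The stacking notebooks combine the tuned base learners listed in the tables below. A simplified sketch follows; the notebooks' exact pipeline, preprocessing and meta-learner are not shown here, and TabNet and the MLP are omitted because they need sklearn-compatible wrappers.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Three of the tuned base learners reported below (parameters taken from
# the performance tables); TabNet and the MLP are left out of this sketch.
base_learners = [
    ("svm", SVC(C=1000, gamma=1, probability=True)),
    ("xgb", XGBClassifier(learning_rate=0.05, max_depth=8, n_estimators=1000)),
    ("lgbm", LGBMClassifier(learning_rate=0.2, max_depth=6, num_leaves=20)),
]

# Out-of-fold predictions from the base learners feed a meta-learner;
# LogisticRegression is an assumed choice for the final estimator.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
# Usage: stack.fit(X_train, y_train); stack.predict(X_test)
```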


Model Performance without Time Series

| Model | F1 | Recall | Precision | Accuracy | Time taken (s) | ROC-AUC | Optimal Parameters |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Logistic Regression | 0.8360 | 0.8420 | 0.8301 | 0.8479 | 84.85 | 0.8475 | 'C': 1000, 'penalty': 'l1', 'solver': 'liblinear' |
| Naive Bayes | 0.7797 | 0.8241 | 0.7398 | 0.7855 | 1.54 | 0.7883 | 'var_smoothing': 0.0533669923120631 |
| SVM | 0.9243 | 0.9149 | 0.9340 | 0.9427 | 3.28 | 0.9374 | 'C': 1000, 'gamma': 1 |
| XGBoost | 0.9358 | 0.9177 | 0.9546 | 0.9519 | 2.79 | 0.9453 | 'learning_rate': 0.05, 'max_depth': 8, 'n_estimators': 1000 |
| MLP | 0.8505 | 0.8346 | 0.8670 | 0.8879 | 1.91 | 0.8777 | 'input_dim': 12, 'H': 60, 'activation': 'relu', 'dropout_probability': 0.2, 'num_epochs': 75, 'num_layers': 10 |
| TabNet | 0.9147 | 0.8903 | 0.9405 | 0.9366 | 56.93 | 0.9277 | 'gamma': 1.0, 'lambda_sparse': 0, 'momentum': 0.4, 'n_steps': 8, 'optimizer_params': {'lr': 0.025} |
| LightGBM | 0.9376 | 0.9198 | 0.9561 | 0.9532 | 0.06 | 0.9468 | 'bagging_fraction': 0.95, 'bagging_freq': 1, 'feature_fraction': 0.95, 'learning_rate': 0.2, 'max_bin': 300, 'max_depth': 6, 'min_gain_to_split': 0, 'num_leaves': 20 |
| Stacking | 0.9371 | 0.9226 | 0.9521 | 0.9527 | 198.35 | 0.9469 | SVM, XGBoost, MLP, TabNet and LightGBM |
| Stacking (excluding MLP) | 0.9379 | 0.9240 | 0.9521 | 0.9532 | 185.72 | 0.9476 | SVM, XGBoost, TabNet and LightGBM |

Model Performance with Time Series

| Model | F1 | Recall | Precision | Accuracy | Time taken (s) | ROC-AUC | Optimal Parameters |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SVM | 0.9208 | 0.9220 | 0.9196 | 0.9284 | 5.43 | 0.9278 | 'C': 1000, 'gamma': 1 |
| XGBoost | 0.9323 | 0.9310 | 0.9335 | 0.9389 | 7.74 | 0.9383 | 'learning_rate': 0.05, 'max_depth': 8, 'n_estimators': 1000 |
| MLP | 0.8364 | 0.8079 | 0.8668 | 0.8573 | 2.43 | 0.8529 | 'input_dim': 12, 'H': 60, 'activation': 'relu', 'dropout_probability': 0.2, 'num_epochs': 75, 'num_layers': 10 |
| TabNet | 0.8968 | 0.8578 | 0.9396 | 0.9109 | 89.65 | 0.9062 | 'gamma': 1.0, 'lambda_sparse': 0, 'momentum': 0.4, 'n_steps': 8, 'optimizer_params': {'lr': 0.025} |
| LightGBM | 0.9314 | 0.9326 | 0.9302 | 0.9379 | 0.012 | 0.9375 | 'bagging_fraction': 0.95, 'bagging_freq': 1, 'feature_fraction': 0.95, 'learning_rate': 0.2, 'max_bin': 300, 'max_depth': 6, 'min_gain_to_split': 0, 'num_leaves': 20 |
| Stacking | 0.9323 | 0.9347 | 0.9298 | 0.9387 | 311.96 | 0.9383 | SVM, XGBoost, MLP, TabNet and LightGBM |
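For reference, a minimal sketch of how the metric columns in both tables can be computed with scikit-learn; whether the notebooks score ROC-AUC on probabilities or on hard labels is an assumption here.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def score_model(y_true, y_pred, y_score):
    """Return the metric columns reported in the tables above.

    y_pred  -- hard 0/1 predictions, e.g. model.predict(X_test)
    y_score -- positive-class scores, e.g. model.predict_proba(X_test)[:, 1]
    """
    return {
        "F1": f1_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Accuracy": accuracy_score(y_true, y_pred),
        "ROC-AUC": roc_auc_score(y_true, y_score),
    }
```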