Detecting Fraudulent Blockchain Accounts on Ethereum with Supervised Machine Learning
Since 2021, more than 46,000 people have lost over $1 billion to cryptocurrency scams, nearly 60 times more than in 2018 [1]. The Federal Trade Commission (FTC) found that the top cryptocurrencies used to pay scammers were Bitcoin (70%), Tether (10%) and Ethereum (9%) [1]. Especially after the recent collapse of FTX, a crypto exchange that misused more than $1 billion of clients' funds, it has become ever more important to stay vigilant when navigating the cryptocurrency world [2]. To help deter such fraud, we used supervised machine learning techniques, namely Logistic Regression, Naive Bayes, SVM, XGBoost, LightGBM, MLP, TabNet and Stacking, to detect and predict fraudulent Ethereum accounts. This adds business value by strengthening the fraud-detection features of crypto exchanges and wallets, enabling people to navigate the cryptocurrency world with confidence and safeguard their personal assets. Our objective was an F1 score above 90% for machine learning models predicting fraudulent accounts on the Ethereum blockchain.
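For reference, the F1 score is the harmonic mean of precision and recall, so reaching 90% requires both few false alarms and few missed fraudulent accounts. A quick sanity check with scikit-learn (the labels below are made up):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels: 1 marks a fraudulent account.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
# F1 is the harmonic mean of precision and recall: 2pr / (p + r).
assert abs(f1_score(y_true, y_pred) - 2 * p * r / (p + r)) < 1e-12
```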
There are two data sources: Kaggle and Etherscan.

- The Kaggle dataset was downloaded from https://www.kaggle.com/datasets/vagifa/ethereum-frauddetection-dataset and can be found in `./Data/address_data_k.csv`.
- The Etherscan data was mined from https://etherscan.io/accounts/label/phish-hack (this data has since been taken off Etherscan, but we saved a copy) and can be found in `./Data/address_data_e.csv`.
- The combined Kaggle and Etherscan data can be found in `./Data/address_data_combined.csv`.
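The combination step amounts to concatenating the two CSVs and de-duplicating accounts. A minimal sketch; the Etherscan file is assumed to share the Kaggle schema, and the `Address` column name follows the Kaggle dataset's layout:

```python
import pandas as pd

# Load both sources (schema alignment between the two files is assumed).
kaggle = pd.read_csv("./Data/address_data_k.csv")
etherscan = pd.read_csv("./Data/address_data_e.csv")

# Stack the two sources and keep one row per account address.
combined = (
    pd.concat([kaggle, etherscan], ignore_index=True)
      .drop_duplicates(subset="Address", keep="first")
)
combined.to_csv("./Data/address_data_combined.csv", index=False)
```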
One key aspect we realised was missing from the dataset was the time series element. Although each observation in our data is a user account, the data was generated by aggregating individual transactions, so valuable information could have been "flattened out". The flow of Ethereum transactions is intrinsically time series data carrying signals, such as the seasonality of transactions, that could be used in our model. This information was extracted using the `tsfresh` library; the raw transaction data can be found in `./Data/Transaction_data` and the newly extracted features in `./Data/new_ts_features_only.csv`.
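The extraction follows tsfresh's standard long-format API. A minimal sketch with made-up transactions; the column names are illustrative, and the real schema lives in `./Data/Transaction_data`:

```python
import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters

# Long-format transaction history: one row per transaction, keyed by account.
transactions = pd.DataFrame({
    "address":   ["0xabc", "0xabc", "0xabc", "0xdef", "0xdef"],
    "timestamp": [1, 2, 3, 1, 2],
    "value":     [0.5, 1.2, 0.3, 3.0, 0.1],
})

# Collapse each account's transaction series into one fixed-length feature row.
ts_features = extract_features(
    transactions,
    column_id="address",       # one output row per account
    column_sort="timestamp",   # process transactions in chronological order
    default_fc_parameters=MinimalFCParameters(),  # small, fast feature set
)
print(ts_features.head())
```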
The combined Kaggle and Etherscan data, including the time series features, can be found in `./Data/address_data_combined_ts.csv`.
We started with a Kaggle dataset of 9,841 observations. Each observation is a unique Ethereum account, with each variable being an aggregate statistic over all transactions performed by that unique account, such as total Ether value received or average time between transactions. The data also distinguishes between account-to-account transactions and account-to-smart contract transactions. However, the dataset was highly imbalanced, with only 2,179 out of 9,841 (22.14%) being marked as fraud. To address the imbalance, we leveraged an API provided by Etherscan, a "Block Explorer and Analytics Platform for Ethereum". This allowed us to retrieve the transactions made by any given account address on the Ethereum blockchain. As a result, the number of fraudulent accounts in our dataset climbed to 4,339 observations, making the combined dataset less imbalanced (45.97% fraud).
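The retrieval step looks roughly like the following sketch against Etherscan's public `account`/`txlist` endpoint; the helper name and field handling are ours, and a free API key from etherscan.io is assumed:

```python
import requests

def fetch_transactions(address: str, api_key: str) -> list[dict]:
    """Return the normal transactions recorded for an Ethereum address."""
    resp = requests.get(
        "https://api.etherscan.io/api",
        params={
            "module": "account",   # account endpoints
            "action": "txlist",    # list of normal transactions
            "address": address,
            "startblock": 0,
            "endblock": 99999999,
            "sort": "asc",
            "apikey": api_key,     # free key from etherscan.io (assumed)
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("result", [])

# Example call (hypothetical address):
# txs = fetch_transactions("0x0000000000000000000000000000000000000000", "YOUR_KEY")
```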
The model notebooks can be found under `./Models/`:

- `./Models/Random_Forest_Model.ipynb`
- `./Models/Naive_Bayes_Model.ipynb`
- `./Models/Support_Vector_Machine_Model.ipynb`
- `./Models/Multi_Layer_Perceptron_Model.ipynb`
- `./Models/XGBoost_Model.ipynb`
- `./Models/TabNet_Model.ipynb`
- `./Models/LightGBM_Model.ipynb`
- `./Models/Final_Stacking_Model.ipynb`
- `./Models/Final_Stacking_Model_w_ts.ipynb`
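Each notebook tunes its model toward the "Optimal Parameters" reported in the tables below. A minimal sketch of the kind of cross-validated search involved; the SVM grid shown is illustrative, and `GridSearchCV` scored on F1 is an assumption about the notebooks' exact setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data; the notebooks use features from address_data_combined.csv.
X, y = make_classification(n_samples=500, n_features=12, random_state=0)

# Small illustrative grid around the SVM's tuned values reported below.
param_grid = {"C": [1, 10, 100, 1000], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, scoring="f1", cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```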
Results on the combined dataset (without the time series features):

Model | F1 | Recall | Precision | Accuracy | Time taken (s) | ROC-AUC | Optimal Parameters |
---|---|---|---|---|---|---|---|
Logistic Regression | 0.8360 | 0.8420 | 0.8301 | 0.8479 | 84.85 | 0.8475 | 'C':1000, 'penalty':'l1', 'solver':'liblinear' |
Naive Bayes | 0.7797 | 0.8241 | 0.7398 | 0.7855 | 1.54 | 0.7883 | 'var_smoothing':0.0533669923120631 |
SVM | 0.9243 | 0.9149 | 0.9340 | 0.9427 | 3.28 | 0.9374 | 'C':1000, 'gamma':1 |
XGBoost | 0.9358 | 0.9177 | 0.9546 | 0.9519 | 2.79 | 0.9453 | 'learning_rate':0.05, 'max_depth':8, 'n_estimators':1000 |
MLP | 0.8505 | 0.8346 | 0.8670 | 0.8879 | 1.91 | 0.8777 | 'input_dim':12, 'H':60, 'activation':'relu', 'dropout_probability':0.2, 'num_epochs':75, 'num_layers':10 |
TabNet | 0.9147 | 0.8903 | 0.9405 | 0.9366 | 56.93 | 0.9277 | 'gamma':1.0, 'lambda_sparse':0, 'momentum':0.4, 'n_steps':8, 'optimizer_params':{'lr':0.025} |
LightGBM | 0.9376 | 0.9198 | 0.9561 | 0.9532 | 0.06 | 0.9468 | 'bagging_fraction':0.95, 'bagging_freq':1, 'feature_fraction':0.95, 'learning_rate':0.2, 'max_bin':300, 'max_depth':6, 'min_gain_to_split':0, 'num_leaves':20 |
Stacking | 0.9371 | 0.9226 | 0.9521 | 0.9527 | 198.35 | 0.9469 | SVM, XGBoost, MLP, TabNet and LightGBM |
Stacking (excluding MLP) | 0.9379 | 0.9240 | 0.9521 | 0.9532 | 185.72 | 0.9476 | SVM, XGBoost, TabNet and LightGBM |
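For illustration, an ensemble like the best-performing "Stacking (excluding MLP)" configuration above can be assembled with scikit-learn's `StackingClassifier`. This is a minimal sketch under assumptions: TabNet is omitted because pytorch-tabnet's wrapper is not fully scikit-learn compatible, and the logistic-regression meta-learner is a placeholder rather than the notebooks' exact choice:

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Base learners configured with the tuned parameters from the table above.
base_learners = [
    ("svm", SVC(C=1000, gamma=1, probability=True)),
    ("xgb", XGBClassifier(learning_rate=0.05, max_depth=8, n_estimators=1000)),
    ("lgbm", LGBMClassifier(learning_rate=0.2, max_depth=6, num_leaves=20)),
]
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),  # meta-learner choice is an assumption
    cv=5,  # out-of-fold base predictions feed the meta-learner
)

# Stand-in data; the notebooks fit on the combined address dataset instead.
X, y = make_classification(n_samples=300, n_features=12, random_state=0)
stack.fit(X, y)
```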
Results on the combined dataset including the time series features:

Model | F1 | Recall | Precision | Accuracy | Time taken (s) | ROC-AUC | Optimal Parameters |
---|---|---|---|---|---|---|---|
SVM | 0.9208 | 0.9220 | 0.9196 | 0.9284 | 5.43 | 0.9278 | 'C':1000, 'gamma':1 |
XGBoost | 0.9323 | 0.9310 | 0.9335 | 0.9389 | 7.74 | 0.9383 | 'learning_rate':0.05, 'max_depth':8, 'n_estimators':1000 |
MLP | 0.8364 | 0.8079 | 0.8668 | 0.8573 | 2.43 | 0.8529 | 'input_dim':12, 'H':60, 'activation':'relu', 'dropout_probability':0.2, 'num_epochs':75, 'num_layers':10 |
TabNet | 0.8968 | 0.8578 | 0.9396 | 0.9109 | 89.65 | 0.9062 | 'gamma':1.0, 'lambda_sparse':0, 'momentum':0.4, 'n_steps':8, 'optimizer_params':{'lr':0.025} |
LightGBM | 0.9314 | 0.9326 | 0.9302 | 0.9379 | 0.012 | 0.9375 | 'bagging_fraction':0.95, 'bagging_freq':1, 'feature_fraction':0.95, 'learning_rate':0.2, 'max_bin':300, 'max_depth':6, 'min_gain_to_split':0, 'num_leaves':20 |
Stacking | 0.9323 | 0.9347 | 0.9298 | 0.9387 | 311.96 | 0.9383 | SVM, XGBoost, MLP, TabNet and LightGBM |