cereniyim/ethereum-fraud-detection-app

Ethereum Fraud Detection App

Ethereum Mainnet Anomalous Transactions (Fraudulent) Detection App

App Summary & Main Purpose

This app detects fraudulent transactions on Ethereum Mainnet for a given block range or time interval. The terms "fraudulent" and "anomalous" are used interchangeably throughout this README and the codebase. The current scope is external ERC20 token transfers.

In this problem context, an anomaly (outlier) is a transaction that exchanges a potentially untrusted token. The app identifies such transactions so that users can avoid interacting with untrusted tokens.

app_flow

You can interact with the app through the Swagger UI (see the How to Install and Use section) or with curl requests from the command line.

App Inputs and Interpretation of the Predictions

The app accepts 4 parameters:

  • start_block
  • end_block
  • time_interval_in_seconds: defaults to 0
  • use_pre_trained_model: defaults to false

To load data for a block range, set start_block and end_block. start_block must be smaller than or equal to end_block, and both boundaries are inclusive.

To load data for a time interval, set time_interval_in_seconds to an integer greater than 0. This overrides the block range and loads the latest blocks that fall approximately within the specified number of seconds.

To use a pre-trained model, set use_pre_trained_model to true. This loads the latest model from the model registry, so a model must have been trained at least once after the app starts. When use_pre_trained_model is false, the loaded data is used for both training and anomaly detection.
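The precedence between the block range and the time interval can be sketched as follows. This is only an illustration of the rules described above, not the app's actual code: the helper name `resolve_block_range`, the `latest_block` argument, and the 12-second block time used to convert seconds into blocks are assumptions.

```python
import math

# Assumption: one block roughly every 12 seconds on Ethereum Mainnet.
ASSUMED_SECONDS_PER_BLOCK = 12


def resolve_block_range(start_block, end_block,
                        time_interval_in_seconds=0, latest_block=None):
    """Return an inclusive (start, end) block range (hypothetical helper)."""
    if time_interval_in_seconds > 0:
        # A positive time interval overrides the explicit block range.
        if latest_block is None:
            raise ValueError("latest_block is required for a time interval")
        n_blocks = math.ceil(time_interval_in_seconds / ASSUMED_SECONDS_PER_BLOCK)
        return max(latest_block - n_blocks + 1, 0), latest_block
    if start_block > end_block:
        raise ValueError("start_block must be smaller than or equal to end_block")
    return start_block, end_block


# Explicit block range: boundaries are returned unchanged (inclusive).
block_range = resolve_block_range(18183000, 18183050)

# Time interval: the block range arguments are ignored.
recent_range = resolve_block_range(0, 0, time_interval_in_seconds=60,
                                   latest_block=18370788)
```

Because the conversion from seconds to blocks relies on an average block time, the resulting range only approximates the requested interval.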

After loading the data, the model is trained with 2 features:

  • transaction value per token
  • gas cost in ETH

The app returns a list of dictionaries. An example output is as follows:

[
  {
    "transaction_hash": "0x3c73140e51879e17902c7eb3845a8990ea63df0637e3cea4c5a6453508eadfda",
    "value": "2,103,429,269,563,549.75",
    "token": "BUGATTI",
    "gas_cost": "0.00087476",
    "anomaly_score": "0.8432",
    "etherscan_link": "https://etherscan.io/tx/0x3c73140e51879e17902c7eb3845a8990ea63df0637e3cea4c5a6453508eadfda"
  }
]
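All numeric fields in the example above are returned as strings, and value uses thousands separators. A client that wants to sort or filter the results may therefore need to convert them first. The following is a minimal sketch of such a conversion; the helper name `parse_prediction` is hypothetical and not part of the app.

```python
import json


def parse_prediction(item):
    """Convert the string fields of one output item into numbers."""
    return {
        "transaction_hash": item["transaction_hash"],
        "token": item["token"],
        # Strip thousands separators before converting to float.
        "value": float(item["value"].replace(",", "")),
        "gas_cost": float(item["gas_cost"]),
        "anomaly_score": float(item["anomaly_score"]),
        "etherscan_link": item["etherscan_link"],
    }


response_body = """[
  {
    "transaction_hash": "0x3c73140e51879e17902c7eb3845a8990ea63df0637e3cea4c5a6453508eadfda",
    "value": "2,103,429,269,563,549.75",
    "token": "BUGATTI",
    "gas_cost": "0.00087476",
    "anomaly_score": "0.8432",
    "etherscan_link": "https://etherscan.io/tx/0x3c73140e51879e17902c7eb3845a8990ea63df0637e3cea4c5a6453508eadfda"
  }
]"""

predictions = [parse_prediction(item) for item in json.loads(response_body)]
# Most anomalous transactions first.
predictions.sort(key=lambda p: p["anomaly_score"], reverse=True)
```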

To illustrate how the fraud detection app works, I used

  • 18183000-18183050 block range as training dataset
  • 18370728-18370788 block range as test dataset

The distributions of the features from the test dataset are shown below in log scale, because both features are highly positively skewed. The red circle shows where most of the data are centered. Axes show the original data range.

inference_features_distribution.png

Some outlier samples and regions are visible:

  • low value txs (left hand side of the circle)
  • high gas_cost txs (above the circle)
  • high value txs (right hand side of the circle)

After model training and prediction, the app flags transactions that exchange tokens in extremely high values as anomalies, shown as dark blue points below.

predictions.png

When we filter by the anomaly-labeled transactions, we get transactions with potentially untrusted tokens like

where there are only a few holders of those tokens and they are usually worth nothing.

Etherscan also labels tokens as trusted or untrusted at a more granular level. This model is able to identify tokens with "UNKNOWN" reputation.

Since datasets are indexed by unique transaction_hash and token, the model only detects a certain leg of the transaction as anomalous, even though trusted tokens are part of the transaction. For instance, this transaction consists of trusted tokens and the POKEMON 2.0 token: USDT -> WETH -> POKEMON 2.0

Nevertheless, we can infer that if a transaction goes through an untrusted token, it can be flagged as an anomaly. All in all, the aim of the app is to identify transactions with untrusted tokens so that users can avoid exchanging them.

anomalous_txs.png

My Approach on the Project and Key Architectural Decisions

While working on the project, I focused on building a reasonably well-working fraud detection MVP with readable, high-quality code.

Querying for every transaction on Ethereum Mainnet seemed suboptimal since a transaction can be any contract method call (mint, burn etc.). So, I started by narrowing down the problem scope to use ERC20 token transfers only.

To keep it simple I only included 2 features: the value of the transaction per token and gas cost in ETH.

Alchemy as the source data provider

I explored several data source providers (Alchemy and Etherscan) to get token transfer transactions. I chose Alchemy API because it offers endpoints for efficient querying and filtering of Ethereum transactions.

I used the getAssetTransfers endpoint to get ERC20 token transfers. I also excluded internal transactions so that I only get transactions initiated by users.
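As a rough illustration, a JSON-RPC request body for this endpoint could be assembled as shown below. The helper name `build_asset_transfers_payload` is hypothetical, and the exact parameters the app sends are not stated in this README; the sketch assumes that requesting only the "erc20" category (and therefore not the "internal" category) matches the filtering described above.

```python
def build_asset_transfers_payload(start_block, end_block, page_key=None):
    """Build a JSON-RPC body for alchemy_getAssetTransfers (hypothetical helper)."""
    params = {
        # Block numbers are passed as hex strings in JSON-RPC.
        "fromBlock": hex(start_block),
        "toBlock": hex(end_block),
        # Assumption: only ERC20 transfers are requested; "internal" is left out.
        "category": ["erc20"],
    }
    if page_key is not None:
        # Results are paginated; pass the key returned by the previous page.
        params["pageKey"] = page_key
    return {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "alchemy_getAssetTransfers",
        "params": [params],
    }


payload = build_asset_transfers_payload(18183000, 18183050)
```

The body would then be sent as a POST request to an Alchemy Mainnet URL that includes your API key.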

To get the gas spent by each transaction, I used the getTransactionReceipts endpoint. From its response, I took gasUsed and effectiveGasPrice. After loading the transactions, I derived the gas_cost_in_eth feature by multiplying the two values and converting the result from wei to ETH.
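The gas cost calculation can be sketched as follows. The function name is hypothetical, and the sketch assumes the receipt fields arrive as hex-encoded strings, which is the usual encoding for quantities in Ethereum JSON-RPC responses.

```python
WEI_PER_ETH = 10**18


def gas_cost_in_eth(receipt):
    """Compute the gas cost of one transaction receipt in ETH."""
    gas_used = int(receipt["gasUsed"], 16)
    effective_gas_price = int(receipt["effectiveGasPrice"], 16)  # wei per gas unit
    # Multiply as integers first, then convert from wei to ETH.
    return gas_used * effective_gas_price / WEI_PER_ETH


# Illustrative receipt: 21,000 gas at 20 gwei per gas unit.
example_receipt = {"gasUsed": "0x5208", "effectiveGasPrice": "0x4a817c800"}
cost = gas_cost_in_eth(example_receipt)
```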

Isolation forest as the underlying algorithm

I first researched the anomaly detection problem and the most common statistical approaches to it. Given the above features, I decided to treat this as an unsupervised machine learning problem. I chose the Isolation Forest ensemble model because it is decision-tree-based, non-parametric and easy to understand.

I used this blog post and read the original paper to understand how the algorithm works.

The algorithm focuses on detecting and isolating anomalous samples in a decision tree as early as possible. Each tree is constructed from a subset of the dataset. The algorithm starts by randomly selecting a feature from the dataset. It then chooses a random value within the range of that feature. This value serves as a threshold.

Then, the algorithm uses this threshold to split samples. Samples on one side of the threshold are grouped together, and samples on the other side are grouped separately.

The above steps are repeated until the maximum tree depth is reached (default is 8). This process is repeated for each tree in the forest. In the end, you have a collection of trees in which common samples are grouped with other normal samples in the deeper nodes, while anomalous samples are isolated in the shallower nodes. Here is a visualization of a random single tree from the anomaly detection app. Notice that there are 2 obvious anomalous samples (the top white and orange leaf nodes):
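The tree construction described above can be sketched in plain Python. This is a simplified illustration of the technique, not the implementation used by the app: the function names are hypothetical, the data is synthetic, and a node becomes a leaf as soon as the randomly chosen feature has a single value.

```python
import random


def build_tree(samples, rng, depth=0, max_depth=8):
    """Recursively split samples on a random feature and random threshold."""
    if depth >= max_depth or len(samples) <= 1:
        return {"size": len(samples)}  # leaf node
    feature = rng.randrange(len(samples[0]))
    low = min(s[feature] for s in samples)
    high = max(s[feature] for s in samples)
    if low == high:
        return {"size": len(samples)}  # simplification: stop, nothing to split
    threshold = rng.uniform(low, high)
    left = [s for s in samples if s[feature] < threshold]
    right = [s for s in samples if s[feature] >= threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(left, rng, depth + 1, max_depth),
        "right": build_tree(right, rng, depth + 1, max_depth),
    }


def path_length(tree, sample, depth=0):
    """Number of splits a sample passes before it reaches a leaf."""
    if "size" in tree:
        return depth
    side = "left" if sample[tree["feature"]] < tree["threshold"] else "right"
    return path_length(tree[side], sample, depth + 1)


def mean_path_length(trees, sample):
    return sum(path_length(tree, sample) for tree in trees) / len(trees)


rng = random.Random(0)
# Synthetic data: 255 "normal" points around the origin plus one far outlier.
normal_samples = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(255)]
outlier = (50.0, 50.0)
dataset = normal_samples + [outlier]

# Each tree is built from a random subset of the dataset.
trees = [build_tree(rng.sample(dataset, 128), rng) for _ in range(100)]

outlier_depth = mean_path_length(trees, outlier)
typical_depth = sum(mean_path_length(trees, s) for s in normal_samples) / len(normal_samples)
print(f"outlier: {outlier_depth:.2f}, typical normal sample: {typical_depth:.2f}")
```

In this setup the outlier should end up in much shallower nodes on average than the normal samples, which is the property the forest exploits.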

img.png

In the end, if a sample is isolated by many trees in the forest very quickly, it's considered as an anomaly. Conversely, if it takes many iterations to isolate a sample, it's more likely to be a normal, non-anomalous sample.

The algorithm assigns anomaly scores to all samples in the dataset. It is a normalized score between 0 and 1. It measures how quickly a sample gets isolated among all trees. If it's isolated very quickly, it's given a high score, suggesting it's an anomaly. If it takes a long time to get isolated, it's considered normal. The closer the score is to 1, the more likely it's an anomaly, and the closer it is to 0, the more normal it is.
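For reference, the original Isolation Forest paper defines this score as s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the mean path length of sample x over all trees and c(n) is the average path length of an unsuccessful search in a binary search tree built from n samples. The sketch below follows that definition; the function names are hypothetical, and the library used by the app may report the score with a different sign or offset.

```python
import math

EULER_MASCHERONI = 0.5772156649


def average_path_length(n):
    """c(n): average path length of an unsuccessful BST search over n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_MASCHERONI  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """Normalized score in (0, 1]; values close to 1 indicate anomalies."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


n_samples = 256
quickly_isolated = anomaly_score(1.5, n_samples)  # short path, high score
slowly_isolated = anomaly_score(12.0, n_samples)  # long path, low score
```

A sample whose mean path length equals c(n) receives a score of exactly 0.5, which is why scores well above 0.5 are read as anomalous.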

You can also set a threshold to decide what level of anomaly you want to detect. Samples with scores above the threshold are considered anomalies, while those below it are considered normal. The threshold is controlled by the contamination parameter, which I set to 0.001 by intuition. In simple terms, this parameter is the proportion of anomalies expected in the data for the problem domain.
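One way to picture the effect of contamination is as a cut-off on the ranked scores. The following sketch is a simplification under that reading and not the exact rule applied by the app or its library: the function names are hypothetical, the scores are synthetic, and the expected number of anomalies is rounded to a whole sample with a minimum of one.

```python
def contamination_threshold(scores, contamination):
    """Score of the last sample that still counts as an anomaly."""
    if not 0 < contamination <= 0.5:
        raise ValueError("contamination must be in (0, 0.5]")
    ranked = sorted(scores, reverse=True)
    # Expected number of anomalies, rounded, and at least one sample.
    n_expected = max(1, round(contamination * len(ranked)))
    return ranked[n_expected - 1]


def flag_anomalies(scores, contamination=0.001):
    """Return one boolean per score: True where the sample is flagged."""
    threshold = contamination_threshold(scores, contamination)
    return [score >= threshold for score in scores]


# Synthetic scores: 999 unremarkable samples and one clear outlier.
scores = [0.40 + 0.0001 * i for i in range(999)] + [0.84]
flags = flag_anomalies(scores, contamination=0.001)
```

With 1,000 samples and a contamination of 0.001, only the single highest-scoring sample is flagged; a larger contamination value lowers the threshold and flags more samples.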

Local filesystem as a model registry

I wanted users to be able to predict with the latest pre-trained model, so I used the local filesystem as the model registry. Each trained model is indexed by timestamp and saved by AnomalyDetector.fit_and_save_model. When a user requests the pre-trained model, the latest model is loaded from the local filesystem and predictions are generated.
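A timestamp-indexed registry on the local filesystem can be sketched as below. This is not the code behind AnomalyDetector.fit_and_save_model: the function names, the `model_<timestamp>.pkl` file naming, and the use of pickle for serialization are all assumptions made for illustration.

```python
import pickle
import time
from pathlib import Path


def save_model(model, registry_dir, timestamp=None):
    """Save a model under a timestamped file name and return its path."""
    registry = Path(registry_dir)
    registry.mkdir(parents=True, exist_ok=True)
    if timestamp is None:
        timestamp = time.time_ns()
    path = registry / f"model_{timestamp}.pkl"
    with path.open("wb") as file:
        pickle.dump(model, file)
    return path


def load_latest_model(registry_dir):
    """Load the model with the highest timestamp in the registry."""
    candidates = list(Path(registry_dir).glob("model_*.pkl"))
    if not candidates:
        # Nothing has been trained yet, so there is no pre-trained model.
        raise FileNotFoundError("no trained model found in the registry")
    latest = max(candidates, key=lambda p: int(p.stem.split("_")[1]))
    with latest.open("rb") as file:
        return pickle.load(file)
```

Sorting by the numeric timestamp rather than by file modification time keeps the choice of the latest model independent of filesystem metadata.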

I chose the local filesystem to show how a model registry could be integrated into the app to satisfy this requirement.

Making a POST endpoint and containerization with Docker

I wrapped the core data loading and model training & predictions process in an API endpoint.

I preferred an API endpoint because the OpenAPI Specification provides self-explanatory documentation. I used Docker to create the required environment and run the app so that it can be installed on any local or virtual machine.

How to Install and Use

Pre-requisites: Docker engine running locally. You can find the instructions here to install Docker on your local machine.

Build Docker image: It will create the environment to run the app

docker build -t anomaly-detection-app:0.0.1 .

Spawn container: It will start the app

docker run -d -p 8000:8000 anomaly-detection-app:0.0.1

You can access the app and its documentation on Swagger at http://0.0.0.0:8000/docs or interact with it from the command line.

Interact from command line

curl -X 'POST' \
  'http://0.0.0.0:8000/anomaly_detection/' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "start_block": 0,
  "end_block": 0,
  "time_interval_in_seconds":0,
  "use_pre_trained_model": false 
}'

Each item in the list is unique per transaction_hash and token, so the same transaction hash may appear more than once in the results.
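The same request can be sent from Python with the standard library. The sketch below mirrors the curl command above; the helper name `build_anomaly_detection_request` is hypothetical, and the call that actually sends the request is left commented out because it needs the container to be running.

```python
import json
import urllib.request


def build_anomaly_detection_request(start_block, end_block,
                                    time_interval_in_seconds=0,
                                    use_pre_trained_model=False,
                                    base_url="http://0.0.0.0:8000"):
    """Build the POST request for the anomaly_detection endpoint."""
    body = {
        "start_block": start_block,
        "end_block": end_block,
        "time_interval_in_seconds": time_interval_in_seconds,
        "use_pre_trained_model": use_pre_trained_model,
    }
    return urllib.request.Request(
        url=f"{base_url}/anomaly_detection/",
        data=json.dumps(body).encode("utf-8"),
        headers={"accept": "application/json",
                 "Content-Type": "application/json"},
        method="POST",
    )


request = build_anomaly_detection_request(18183000, 18183050)

# Uncomment once the app is running locally:
# with urllib.request.urlopen(request) as response:
#     predictions = json.loads(response.read())
```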

To stop the app (note that this command stops every container on the machine, not only this app)

docker stop $(docker ps -a -q)

For Developers

Project organization

├── README.md                         <- The top-level README explaining the project
├── anomaly_detection                 <- Anomaly detection core implementation
│   ├── anomaly_detector.py           <- Data processing and anomaly detection class
│   ├── transaction_loader.py         <- Transaction loading class
├── app                               <- fastAPI app
│   ├── data_models.py                <- Input & output model classes for the endpoint
│   ├── main.py                       <- Endpoint implementation
├── images                            <- Images used in the README
├── Dockerfile                        <- Dockerfile to create anomaly-detection-app image
├── notebooks                         <- Model exploration notebooks
├── tests                             <- Unit tests
├── .gitignore                        <- Ignored files by git
├── requirements.txt                  <- Required Python packages for the environment creation

Setup local environment & run unit tests

Change directory to your local repository

cd <path-to-your-local-repository>

Create conda environment

conda create --name anomaly-detection-env python=3.9.18

Activate environment

conda activate anomaly-detection-env

Install requirements

pip install -r requirements.txt

Add repository path to PYTHONPATH

export PYTHONPATH=<path-to-your-repo-root>

Run unit tests

py.test tests

Update environment to run the notebook

To run the notebook, update the environment with the following commands

pip install jupyter
pip install plotly
pip install seaborn
pip install pydotplus
conda install python-graphviz
