Ethereum Mainnet Anomalous Transactions (Fraudulent) Detection App
This app detects fraudulent transactions on Ethereum Mainnet for a given block range or time interval. ("Anomalous" and "fraudulent" are used interchangeably throughout this README and the codebase.) The current scope is external ERC20 token transfers.
In this context, an anomaly (outlier) is a transaction that exchanges a potentially untrusted token. The app identifies such transactions so that users can avoid interacting with untrusted tokens.
You can interact with the app through Swagger (please see the How to Install and Use section) or with curl requests from the command line.
The app accepts 4 parameters:
- start_block
- end_block
- time_interval_in_seconds: defaults to 0
- use_pre_trained_model: defaults to false
To get the data for a block range, set start_block and end_block. start_block must be smaller than or equal to end_block, and both boundaries are inclusive.
To get the data within a time interval, set time_interval_in_seconds to an integer greater than 0. This overrides the block range, and the app fetches the latest blocks that fall approximately within the specified number of seconds.
To use a pre-trained model, set use_pre_trained_model to true. This loads the latest model from the model registry.
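The parameter rules above can be sketched as a small request object. This is only an illustration: the `DetectionRequest` class, its `block_range` method and the 12-second block time are assumptions for the example, not the app's actual implementation (the real input models live in app/data_models.py).

```python
from dataclasses import dataclass

# Assumption: roughly one block every 12 seconds on post-Merge Ethereum.
SECONDS_PER_BLOCK = 12


@dataclass
class DetectionRequest:
    """Hypothetical container mirroring the four request parameters."""

    start_block: int
    end_block: int
    time_interval_in_seconds: int = 0
    use_pre_trained_model: bool = False

    def __post_init__(self):
        if self.time_interval_in_seconds < 0:
            raise ValueError("time_interval_in_seconds must be >= 0")
        # The block range only matters when no time interval is given.
        if self.time_interval_in_seconds == 0 and self.start_block > self.end_block:
            raise ValueError("start_block must be <= end_block")

    def block_range(self, latest_block):
        """Return the inclusive (start, end) block range to load."""
        if self.time_interval_in_seconds > 0:
            # A time interval overrides the explicit block range.
            n_blocks = max(1, self.time_interval_in_seconds // SECONDS_PER_BLOCK)
            return latest_block - n_blocks + 1, latest_block
        return self.start_block, self.end_block


by_range = DetectionRequest(18183000, 18183050)
by_time = DetectionRequest(0, 0, time_interval_in_seconds=120)
print(by_range.block_range(latest_block=18400000))
print(by_time.block_range(latest_block=18400000))
```

Converting seconds to a block count is approximate by nature, which matches the "approximately" in the description above.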
A model has to be trained when the app is first started. When use_pre_trained_model is false, the loaded data is used for both training and anomaly detection.
After loading the data, the model is trained with 2 features:
- transaction value per token
- gas cost in ETH
The app returns a list of dictionaries as output. An example is as follows:
[
{
"transaction_hash": "0x3c73140e51879e17902c7eb3845a8990ea63df0637e3cea4c5a6453508eadfda",
"value": "2,103,429,269,563,549.75",
"token": "BUGATTI",
"gas_cost": "0.00087476",
"anomaly_score": "0.8432",
"etherscan_link": "https://etherscan.io/tx/0x3c73140e51879e17902c7eb3845a8990ea63df0637e3cea4c5a6453508eadfda"
}
]
To illustrate how the fraud detection app works, I used:
- 18183000-18183050 block range as training dataset
- 18370728-18370788 block range as test dataset
The distribution of the features from the test dataset is shown below on a log scale, as both features have highly positively skewed distributions. The red circle shows where most of the data is centered. The axes show the original data range.
Some outlier samples and regions are visible:
- low value txs (left hand side of the circle)
- high gas_cost txs (above the circle)
- high value txs (right hand side of the circle)
After model training and prediction, the app flags transactions that exchange tokens in extremely high values as anomalies, represented as dark blue points below.
When we filter by the anomaly-labeled transactions, we get transactions with potentially untrusted tokens. These tokens have only a few holders and are usually worth nothing.
Etherscan also labels tokens as trusted or untrusted at a more granular level. This model is able to identify tokens with an "UNKNOWN" reputation.
Since datasets are indexed by unique transaction_hash and token, the model only detects a certain leg of the transaction as anomalous, even though trusted tokens are part of the transaction. For instance, this transaction consists of trusted tokens and the POKEMON 2.0 token: USDT -> WETH -> POKEMON 2.0
Nevertheless, we can infer that if a transaction goes through an untrusted token, it can be flagged as an anomaly. All in all, the aim of the app is to identify transactions with untrusted tokens so that users can avoid exchanging them.
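The inference above (one anomalous leg flags the whole transaction) can be sketched as a grouping step. This is a minimal illustration: the `is_anomaly` field, the `anomalous_transactions` helper and the shortened hash are hypothetical and not part of the app's output format.

```python
def anomalous_transactions(legs):
    """Map each flagged transaction hash to the tokens that triggered the flag.

    `legs` is a list of dicts, one per (transaction_hash, token) pair.
    A transaction is flagged as soon as any one of its legs is anomalous.
    """
    flagged = {}
    for leg in legs:
        if leg["is_anomaly"]:
            flagged.setdefault(leg["transaction_hash"], []).append(leg["token"])
    return flagged


# Hypothetical legs of a USDT -> WETH -> POKEMON 2.0 swap.
legs = [
    {"transaction_hash": "0xabc", "token": "USDT", "is_anomaly": False},
    {"transaction_hash": "0xabc", "token": "WETH", "is_anomaly": False},
    {"transaction_hash": "0xabc", "token": "POKEMON 2.0", "is_anomaly": True},
    {"transaction_hash": "0xdef", "token": "USDT", "is_anomaly": False},
]
print(anomalous_transactions(legs))
```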
While working on the project, I focused on building a reasonably well-working fraud detection MVP with readable, high-quality code.
Querying for every transaction on Ethereum Mainnet seemed suboptimal since a transaction can be any contract method call (mint, burn etc.). So, I started by narrowing down the problem scope to use ERC20 token transfers only.
To keep it simple I only included 2 features: the value of the transaction per token and gas cost in ETH.
I explored several data source providers (Alchemy and Etherscan) to get token transfer transactions. I chose the Alchemy API because it offers endpoints for efficient querying and filtering of Ethereum transactions.
I used the getAssetTransfers endpoint to get ERC20 token transfers. I also excluded internal transactions so that I only get transactions initiated by users.
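As a rough sketch, a JSON-RPC request body for the getAssetTransfers endpoint could be built as follows. The `build_asset_transfers_payload` helper is hypothetical, and the exact parameters the app sends are an assumption; here only the `erc20` category is requested, so internal transfers are left out.

```python
def build_asset_transfers_payload(start_block, end_block, page_key=None):
    """Build a JSON-RPC body for Alchemy's alchemy_getAssetTransfers method."""
    params = {
        "fromBlock": hex(start_block),   # block numbers are sent as hex strings
        "toBlock": hex(end_block),
        "category": ["erc20"],           # assumption: ERC20 transfers only
        "excludeZeroValue": True,
        "withMetadata": False,
    }
    if page_key is not None:
        # Results are paginated; pass the key from the previous response.
        params["pageKey"] = page_key
    return {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "alchemy_getAssetTransfers",
        "params": [params],
    }


payload = build_asset_transfers_payload(18183000, 18183050)
print(payload["params"][0]["fromBlock"], payload["params"][0]["toBlock"])
```

The body would then be sent as a POST request to an Alchemy endpoint URL that includes your own API key.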
To get the gas spent for each transaction, I used the getTransactionReceipts endpoint. From that endpoint, I used gasUsed and effectiveGasPrice to calculate the gas cost. After loading the transactions, I extracted the gas_cost_in_eth feature by multiplying the two and converting the result from wei to ETH.
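The gas cost calculation can be sketched as below. It assumes the receipt fields arrive as hex strings, as is usual for Ethereum JSON-RPC responses; the `gas_cost_in_eth` function name is illustrative, and the sample receipt is made up.

```python
from decimal import Decimal

WEI_PER_ETH = Decimal(10) ** 18


def gas_cost_in_eth(receipt):
    """Gas cost of one transaction in ETH: gasUsed * effectiveGasPrice, in wei, over 1e18."""
    gas_used = int(receipt["gasUsed"], 16)
    gas_price_wei = int(receipt["effectiveGasPrice"], 16)
    return Decimal(gas_used * gas_price_wei) / WEI_PER_ETH


# Made-up receipt: 21,000 gas at 20 gwei.
receipt = {"gasUsed": "0x5208", "effectiveGasPrice": "0x4a817c800"}
print(gas_cost_in_eth(receipt))
```

Decimal is used here to avoid floating-point rounding on 18-decimal values; converting to float afterwards is fine for model features.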
I first researched the anomaly detection problem and the most common statistical approaches to it. Given the above features, I decided to treat it as an unsupervised machine learning problem. I chose the Isolation Forest ensemble model because it is decision-tree-based, non-parametric and easy to understand.
I used this blog post and read the original paper to understand how the algorithm works.
The algorithm focuses on detecting and isolating anomalous samples in a decision tree as early as possible. Each tree is constructed from a subset of the dataset. The algorithm starts by randomly selecting a feature from the dataset. It then chooses a random value within the range of that feature. This value serves as a threshold.
Then, the algorithm uses this threshold to split samples. Samples on one side of the threshold are grouped together, and samples on the other side are grouped separately.
The above steps are repeated until the maximum tree depth is reached (default is 8). This process is repeated for each tree in the algorithm. In the end, you have a collection of trees, where more common samples are grouped with other normal samples in the deeper nodes and anomalous samples are isolated in the shallower nodes. Here is a visualization of a single random tree from the anomaly detection app. Notice that there are 2 obvious anomaly samples (the top white and orange leaf nodes):
In the end, if a sample is isolated by many trees in the forest very quickly, it's considered as an anomaly. Conversely, if it takes many iterations to isolate a sample, it's more likely to be a normal, non-anomalous sample.
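The splitting procedure described above can be sketched in pure Python. This is a simplified illustration, not the scikit-learn implementation: it skips subsampling and only follows the branch that contains the sample of interest, which is enough to measure how quickly that sample gets isolated. All names and the synthetic data are made up for the example.

```python
import random


def isolation_depth(point, sample, rng, depth=0, max_depth=8):
    """Number of random splits needed to isolate `point` within `sample`."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))           # pick a random feature
    values = [row[feature] for row in sample]
    low, high = min(values), max(values)
    if low == high:                               # this feature cannot split further
        return depth
    threshold = rng.uniform(low, high)            # pick a random threshold
    # Keep only the side of the split that contains `point`.
    if point[feature] < threshold:
        side = [row for row in sample if row[feature] < threshold]
    else:
        side = [row for row in sample if row[feature] >= threshold]
    return isolation_depth(point, side, rng, depth + 1, max_depth)


def mean_isolation_depth(point, sample, n_trees=50, seed=0):
    """Average isolation depth of `point` over several random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, sample, rng) for _ in range(n_trees)) / n_trees


# Synthetic (value, gas_cost) pairs: a tight cluster plus one extreme value.
data_rng = random.Random(42)
normal = [(data_rng.gauss(100.0, 10.0), data_rng.gauss(0.001, 0.0001)) for _ in range(100)]
outlier = (1e15, 0.001)
data = normal + [outlier]

outlier_depth = mean_isolation_depth(outlier, data)
typical_depth = sum(mean_isolation_depth(p, data) for p in normal) / len(normal)
print(outlier_depth, typical_depth)
```

The extreme-value sample is separated after far fewer splits than a typical sample, which is the signal the forest relies on.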
The algorithm assigns anomaly scores to all samples in the dataset. It is a normalized score between 0 and 1. It measures how quickly a sample gets isolated among all trees. If it's isolated very quickly, it's given a high score, suggesting it's an anomaly. If it takes a long time to get isolated, it's considered normal. The closer the score is to 1, the more likely it's an anomaly, and the closer it is to 0, the more normal it is.
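For reference, the normalized score from the original Isolation Forest paper is s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the mean isolation depth of a sample and c(n) is the average path length for n samples. A minimal sketch, assuming n is greater than 1 (the function names are illustrative):

```python
import math

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """c(n): average path length used to normalize depths for n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA      # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_depth, n):
    """Score in (0, 1]: close to 1 for quickly isolated samples (requires n > 1)."""
    return 2.0 ** (-mean_depth / average_path_length(n))


n = 256
print(anomaly_score(2.0, n))                      # isolated quickly
print(anomaly_score(average_path_length(n), n))   # average depth
```

A sample with exactly average depth scores 0.5. Note that scikit-learn's `score_samples` reports the negative of this score, so lower values there mean more anomalous.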
You can also set a threshold to decide what level of anomaly you want to detect. Samples with scores above the threshold are considered anomalies, while those below it are considered normal. The threshold is controlled by the contamination parameter, which I set to 0.001 by intuition. In simple terms, this parameter controls the proportion of anomalies expected in the data.
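To make the link between contamination and the threshold concrete, here is a simplified sketch that picks the score cut-off so that roughly the given fraction of samples is flagged. The `contamination_threshold` helper and the synthetic scores are hypothetical; scikit-learn derives its threshold from a percentile of the training scores in a similar spirit.

```python
def contamination_threshold(scores, contamination):
    """Score cut-off such that about `contamination` of the samples are flagged.

    Assumption for this sketch: at least one sample is always flagged.
    """
    ranked = sorted(scores, reverse=True)
    n_flagged = max(1, round(contamination * len(ranked)))
    return ranked[n_flagged - 1]


# 999 ordinary scores and one clearly anomalous score.
scores = [0.4 + 0.0001 * i for i in range(999)] + [0.84]
threshold = contamination_threshold(scores, contamination=0.001)
anomalies = [s for s in scores if s >= threshold]
print(threshold, len(anomalies))
```

With 1,000 samples and a contamination of 0.001, about one sample ends up above the threshold.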
I wanted users to be able to predict with the latest pre-trained model, so I used the local filesystem as the model registry. Each trained model is indexed by timestamp and saved by AnomalyDetector.fit_and_save_model.
When users request a pre-trained model, the latest model is loaded from the local filesystem and predictions are generated.
I chose the local filesystem to show how a model registry could be integrated into the app to satisfy this requirement.
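A filesystem-backed registry of this kind can be sketched as follows. The `save_model` and `load_latest_model` helpers, the file naming scheme and the use of pickle are assumptions for illustration, not the behaviour of AnomalyDetector.fit_and_save_model.

```python
import pickle
from datetime import datetime, timezone
from pathlib import Path


def save_model(model, registry_dir, timestamp=None):
    """Pickle `model` into the registry under a timestamped file name."""
    registry = Path(registry_dir)
    registry.mkdir(parents=True, exist_ok=True)
    stamp = (timestamp or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%S%f")
    path = registry / f"model_{stamp}.pkl"
    with path.open("wb") as file:
        pickle.dump(model, file)
    return path


def load_latest_model(registry_dir):
    """Load the most recent model; file names sort chronologically."""
    candidates = sorted(Path(registry_dir).glob("model_*.pkl"))
    if not candidates:
        raise FileNotFoundError("no trained model found in the registry")
    with candidates[-1].open("rb") as file:
        return pickle.load(file)
```

Because the timestamp format is fixed-width, sorting the file names alphabetically is the same as sorting them by time. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.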
I wrapped the core data loading and model training & predictions process in an API endpoint.
I preferred an API endpoint because the OpenAPI Specification provides self-explanatory documentation. I used Docker to create the required environment and run the app so that it can be installed on any local or virtual machine.
Pre-requisites: Docker engine running locally. You can find the instructions here to install Docker on your local machine.
Build Docker image: It will create the environment to run the app
docker build -t anomaly-detection-app:0.0.1 .
Spawn container: It will start the app
docker run -d -p 8000:8000 anomaly-detection-app:0.0.1
You can access the app and its documentation on Swagger http://0.0.0.0:8000/docs or interact with it from command line.
Interact from command line
curl -X 'POST' \
'http://0.0.0.0:8000/anomaly_detection/' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"start_block": 0,
"end_block": 0,
"time_interval_in_seconds":0,
"use_pre_trained_model": false
}'
Each item in the list is unique per transaction_hash and token, so the same transaction may appear more than once in the results.
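The same request as the curl command above can be sent from Python with the standard library. This is a sketch under the assumption that the container is running on port 8000; the `build_request` and `detect_anomalies` helpers are illustrative names, and the block range is the test range mentioned earlier.

```python
import json
import urllib.request


def build_request(start_block, end_block, time_interval_in_seconds=0,
                  use_pre_trained_model=False, base_url="http://0.0.0.0:8000"):
    """Build the POST request for the /anomaly_detection/ endpoint."""
    body = {
        "start_block": start_block,
        "end_block": end_block,
        "time_interval_in_seconds": time_interval_in_seconds,
        "use_pre_trained_model": use_pre_trained_model,
    }
    return urllib.request.Request(
        f"{base_url}/anomaly_detection/",
        data=json.dumps(body).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def detect_anomalies(request):
    """Send the request and return the decoded list of anomalous transactions."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_request(18370728, 18370788)
# anomalies = detect_anomalies(request)  # requires the app to be running
```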
To stop the app (note that this command stops every container on the machine, not only this app)
docker stop $(docker ps -a -q)
├── README.md <- The top-level README explaining the project
├── anomaly_detection <- Anomaly detection core implementation
│ ├── anomaly_detector.py <- Data processing and anomaly detection class
│ ├── transaction_loader.py <- Transaction loading class
├── app <- fastAPI app
│ ├── data_models.py <- Input & output model classes for the endpoint
│ ├── main.py <- Endpoint implementation
├── images <- Images used in the README
├── Dockerfile <- Dockerfile to create anomaly-detection-app image
├── notebooks <- Model exploration notebooks
├── tests <- Unit tests
├── .gitignore <- Ignored files by git
├── requirements.txt <- Required Python packages for the environment creation
Change directory to your local repository
cd <path-to-your-local-repository>
Create conda environment
conda create --name anomaly-detection-env python=3.9.18
Activate environment
conda activate anomaly-detection-env
Install requirements
pip install -r requirements.txt
Add repository path to PYTHONPATH
export PYTHONPATH=<path-to-your-repo-root>
Run unit tests
py.test tests
To run the notebooks, update the environment with the following commands
pip install jupyter
pip install plotly
pip install seaborn
pip install pydotplus
conda install python-graphviz