👉 Read our data story online using the following link 🚀
The paper "Cyclic Arbitrage in Decentralized Exchange Markets" [1] showed that cyclic arbitrage occurs far more often in decentralised exchanges than in centralised ones. Their work mainly focuses on analysing these cycles in terms of length, token distribution, daily patterns, and profitability.
However, the factors driving their appearance have not been studied yet. To this end, we propose to extend the work of [1] on Uniswap data. Moreover, we plan to study the predictive power of these factors in a binary classification setting: predicting whether a cycle can actually be implemented and generate a positive return, a capability with inherent market value.
The goal of this project is to study exploited cyclic arbitrages in decentralised exchanges. We already have access to the Cyclic transaction dataset, which contains cyclic arbitrages that were exploited. We intend to extract features from events (trade rates, trade volumes, liquidity) preceding the arbitrages.
These features can be high-dimensional (depending on the length of the time series), so we will need dimensionality reduction techniques to create an embedding, i.e. a relevant set of features for our machine learning models.
Then, we will cluster the arbitrages based on the computed features. Ideally, we would like to observe meaningful clusterings: profitable cycles grouped together, cycles of similar duration (how long the cycle stays profitable) ending up in the same cluster, and so on. Once meaningful clusters are obtained, it becomes interesting to use the same features in a prediction model with the profitability of the arbitrage as the target.
1. Data preprocessing:
- Keep only cycles of length 3.
- Filter out illiquid tokens.
- Log-transform heavy-tailed features.
- Apply token-based standard scaling.
- Zero-pad the time series to standardise their length.
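A minimal sketch of these preprocessing steps, assuming a flat dataframe with hypothetical `cycle_length`, `cycle_id`, `token`, and `volume` columns (the actual pipeline lives in `data_processing`):

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, max_len: int = 60) -> np.ndarray:
    """Illustrative preprocessing: filter, log-transform, token-based scaling, zero padding."""
    # Keep only cycles of length 3 (hypothetical `cycle_length` column).
    df = df[df["cycle_length"] == 3].copy()

    # Log-transform heavy-tailed features such as trade volume.
    df["log_volume"] = np.log1p(df["volume"])

    # Token-based standard scaling: standardise each feature per token.
    grouped = df.groupby("token")["log_volume"]
    df["volume_scaled"] = (df["log_volume"] - grouped.transform("mean")) / grouped.transform("std")

    # Zero-pad each cycle's time series to a common length.
    series = [g["volume_scaled"].to_numpy()[:max_len] for _, g in df.groupby("cycle_id")]
    return np.stack([np.pad(s, (0, max_len - len(s))) for s in series])
```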
2. Cycles embedding:
- After preprocessing, an autoencoder is built.
- Multiple architectures are tested (linear, multilayer densely connected, convolutional).
- Their performance is compared to a classical PCA approach.
- In part 4 (Cycle profitability prediction), the different embedding techniques are evaluated through the accuracy of the prediction task.
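A minimal Keras sketch of the densely connected variant, with illustrative layer sizes rather than the tuned architectures (see `models/embedding` for the real ones):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(input_dim, latent_dim=8):
    """Densely connected autoencoder; the encoder output is the cycle embedding."""
    inputs = keras.Input(shape=(input_dim,))
    encoded = layers.Dense(64, activation="relu")(inputs)
    encoded = layers.Dense(latent_dim, name="embedding")(encoded)
    decoded = layers.Dense(64, activation="relu")(encoded)
    decoded = layers.Dense(input_dim)(decoded)

    autoencoder = keras.Model(inputs, decoded)
    encoder = keras.Model(inputs, encoded)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder

# Usage: X is the padded feature matrix from the preprocessing step.
# autoencoder, encoder = build_autoencoder(X.shape[1])
# autoencoder.fit(X, X, epochs=50, batch_size=32, validation_split=0.1)
# embeddings = encoder.predict(X)
```

Fitting `sklearn.decomposition.PCA` with the same number of components on the same matrix gives the baseline the reconstruction error is compared against.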
3. Cycles clustering:
- Using the embedding, a K-means clustering is constructed.
- Clusters in the training set are analysed.
- Based on the test-set results, we can assess whether the results obtained in point 2 carry predictive power.
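A sketch of this step, assuming `embeddings` holds the encoded cycles; the actual analysis, including the silhouette plots, is in `models/clustering/Kmeans.ipynb`:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_cycles(embeddings, k_range=range(2, 11)):
    """Fit K-means for several cluster counts and keep the best silhouette score."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
        scores[k] = silhouette_score(embeddings, labels)
    best_k = max(scores, key=scores.get)
    return best_k, KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(embeddings)
```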
4. Cycle profitability prediction:
- Study profitability prediction for arbitrage cycles.
- Multiple models are tested (logistic regression, SVM).
- The impact of adding token encoding to the models is tested.
- The performance of the different embeddings is evaluated.
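A sketch of the model comparison, assuming `X` holds the embedded features and `y` the binary profitability labels (the real experiments, including the token encodings, are in `models/prediction/prediction.ipynb`):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def evaluate_classifiers(X, y):
    """Compare simple classifiers on the binary profitability prediction task."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    models = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "SVM (RBF kernel)": SVC(kernel="rbf"),
    }
    return {
        name: accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
        for name, model in models.items()
    }
```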
Each folder contains a dedicated README where extra instructions and details are given.
.
├── data # Data folder
│ ├── uniswap_raw_data # data fetched from bitquery
│ │ ├── uniswap_raw_data_0_1000.json.gz # example of file
│ │ ├── ...
│ ├── liquid # directory containing datasets corresponding to liquid cycles
│ │ ├── uniswap_data_liquid.csv # csv version of the dataset fetched from bitquery (illiquid cycles filtered out)
│ │ ├── additional_features_train_liquid.csv # file used by the clustering and prediction task with extra features (train)
│ │ ├── additional_features_test_liquid.csv # file used by the clustering and prediction task with extra features (test)
│ │ ├── ML_features # directory for ML features
│ │ │ ├── ...
│ │ ├── pca # directory containing the encoded features from the PCA model
│ │ │ ├── ...
│ │ ├── rule_based # directory containing the encoded features from the Rule-Based model
│ │ │ ├── ...
│ ├── full # same structure as the liquid folder, but for the full dataset
│ │ ├── ...
│ ├── cycles_in_Uniswap.json # dataset from the paper
│ ├── filtered_cycles.json # only cycles of length 3
├── data_acquisition # Scripts to fetch the datasets (from bitquery and from the paper)
├── data_exploration # Contains visualisations of the datasets
├── data_processing # All scripts to process the raw data into usable features for ML
├── models # all ML related tasks
│ ├── clustering # files related to the clustering task
│ ├── embedding # files related to the embedding task
│ ├── prediction # files related to the profitability prediction task
├── figures # Contains the output images and HTML used for the data story
├── requirements.txt # Dependencies file
└── README.md
- Follow the steps in Data Acquisition to download the raw datasets.
- Follow the steps in Data Processing to generate the preprocessed data.
- Data exploration: run the `data_exploration/data_exploration.ipynb` notebook to see the data exploration steps taken.
- Embeddings: open the `models/embedding` folder:
  - Autoencoder: follow the steps in Train Autoencoders to understand how to train and use the available autoencoders.
  - PCA: run the `pca_embedding.ipynb` notebook to create the PCA embedding.
  - Rule-based: follow the steps in Build Rule-based features to generate preprocessed data useful for performance comparison.
- Clustering: run the `models/clustering/Kmeans.ipynb` notebook to see the code related to the clustering.
- Profitability prediction: run the `models/prediction/prediction.ipynb` notebook for the profitability prediction task.
In the repository, we provide a `requirements.txt` file from which you can create a virtual Python environment.
If you want to run our code on the SCITAS cluster, a few additional setup steps are needed:
- Create a compatible Jupyter/TensorFlow environment using the following official tutorial.
- To be able to import `talos` on the SCITAS cluster, update line 8 of `opt/venv-gcc/lib/python3.7/site-packages/kerasplotlib/traininglog.py` from `from keras.callbacks import Callback` to `from tensorflow.keras.callbacks import Callback`.
Task | Team member(s) | work hours |
---|---|---|
Literature research | Lucas & Augustin | 3h |
API choice and query design | Lucas & Augustin | 4h |
EPFL Cluster & environment setup | Lucas | 2h |
Data fetching script test | Augustin | 3h |
Data fetching validation | Augustin | 2h |
Data fetching improvements | Augustin | 2h |
Task | Team member(s) | work hours |
---|---|---|
Data cleaning | Augustin | 5h |
Data exploration paper dataset | Augustin | 2h |
Data exploration | Lucas | 3h |
Raw data => embedding format | Lucas | 3h |
Task | Team member(s) | work hours |
---|---|---|
Autoencoder Keras basic code | Lucas | 3h |
Comparison with PCA and debugging | Lucas | 1h |
K-means | Augustin | 2h |
Task | Team member(s) | work hours |
---|---|---|
Clustering analysis | Lucas | 4h |
Profitability prediction setup | Augustin | 2h |
GitHub Pages setup | Lucas | 2h |
Data story (1) | Lucas | 5h |
Data story (2) | Augustin | 2h |
Task | Team member(s) | work hours |
---|---|---|
Token-based scaling | Lucas & Augustin | 5h |
Token one-hot encoding | Lucas | 1h |
Token encoding in profitability prediction | Augustin | 1h |
Deep NN for profitability prediction | Augustin | 1h |
Better data processing | Augustin | 2h |
Improved data exploration | Lucas | 3h |
Better understanding of PCA output | Augustin | 1h |
Autoencoder testing | Augustin | 2h |
Data story (3) | Lucas | 1h |
Add rule-based indicators for autoencoder performance comparison | Lucas | 2h |
Task | Team member(s) | work hours |
---|---|---|
Filter illiquid data & debug | Lucas | 3h |
Update architecture for liquid data | Augustin | 3h |
Research on attention learning | Lucas | 2h |
Data processing simpler pipeline | Augustin | 2h |
Autoencoder improvement and debug | Augustin | 3h |
Autoencoder manual tests for several architectures | Augustin | 8h |
Testing optimizers | Augustin | 2h |
Talos setup | Lucas | 2h |
Running Talos | Augustin | 1h |
Hyperparameter optimisation | Lucas & Augustin | 4h |
Reporting losses | Augustin | 1h |
K-means: better silhouette analysis | Lucas | 3h |
K-means: update results for liquid data | Lucas | 4h |
PCA embedding | Lucas | 1h |
Rule-based data: pandas-ta implementation | Lucas | 1h |
Rule-based data: pandas implementation | Lucas | 3h |
Rule-based data: code optimisation | Lucas | 3h |
Rule-based data: performance comparison with AE | Lucas | 1h |
Repository cleaning | Lucas & Augustin | 1h |
Notebook comments and markdown | Lucas & Augustin | 4h |
Data story (4) | Lucas & Augustin | 6h |
Team member | work hours |
---|---|
Lucas Giordano | 81h |
Augustin Kapps | 60h |