Application of K-Means Clustering to Identify Groups of Cryptocurrencies

FreshOats/Cryptocurrencies

Clustering Cryptocurrencies

Application of K-Means Clustering to Identify Groups of Cryptocurrencies

by Justin R. Papreck


Overview

As cryptocurrencies continue to rise in popularity worldwide, it is difficult to determine which cryptocurrencies are similar. Many use similar algorithms, while others rely on unique algorithms that may yield the same types of results. As the number of distinct cryptocurrencies continues to grow, it becomes harder to ascertain how they are related, and whether the algorithm itself is even a good predictor of how each currency should be classified. Here, K-Means clustering, an unsupervised machine learning method, is applied to group over 500 actively trading currencies.

Knowing which cryptocurrencies are similar matters to potential crypto investors. Consider a periodic table enthusiast who recently discovered cryptocurrencies and wants to sample the elements while still diversifying: are the differences between Osmium, Actinium, Lithium, Einsteinium, and Radium as simple as their locations on the periodic table? Or consider a cannabis entrepreneur looking to invest in 3 different cryptocurrencies, who must choose between Cannabis Industry Coin, Canna Coin, Sativa Coin, GanjaCoin, KushCoin, and PotCoin and wants to diversify these investments as they diversify their stock. Or perhaps a sci-fi investor is willing to put all of their money into one place and wants it in whichever currency is most similar to Bitcoin, choosing among 42, Unobtainium, and Klingon Empire Darsek. This application aims to classify all of these currencies through unsupervised learning.

Deliverable 1: Preprocessing the Data for Principal Component Analysis (PCA)


At first glance, the data include cryptocurrencies that are not trading, as well as currencies that have not mined any coins or have no recorded coin supply.

      CoinName   Algorithm  IsTrading  ProofType  TotalCoinsMined  TotalCoinSupply
42    42 Coin    Scrypt     True       PoW/PoS    4.199995e+01     42
365   365Coin    X11        True       PoW/PoS    NaN              2300000000
404   404Coin    Scrypt     True       PoW/PoS    1.055185e+09     532000000
611   SixEleven  SHA-256    True       PoW        NaN              611000
808   808        SHA-256    True       PoW/PoS    0.000000e+00     0
1337  EliteCoin  X13        True       PoW/PoS    2.927942e+10     314159265359
2015  2015 coin  X11        True       PoW/PoS    NaN              0
BTC   Bitcoin    SHA-256    True       PoW        1.792718e+07     21000000
ETH   Ethereum   Ethash     True       PoW        1.076842e+08     0
LTC   Litecoin   Scrypt     True       PoW        6.303924e+07     84000000

First, all currencies that are not trading were removed, after which the IsTrading column was dropped. Next, any row containing a null value was removed. Finally, since purchasing a currency requires at least 1 coin, all cryptocurrencies with no coins mined were also removed. (The filter appears to act on TotalCoinsMined rather than TotalCoinSupply: Ethereum, with a listed supply of 0, remains in the cleaned data below, while 808, with 0 coins mined, does not.) Of the 1252 currencies listed, only 533 met all of the criteria: actively trading, a non-zero number of coins mined, and no missing values for coin name, algorithm, proof type, coins mined, or total supply. To prepare the data for machine learning, the coin code was used as the index and the coin names were saved in a separate dataframe, since the name should not influence a currency's grouping. The remaining non-numerical columns were the algorithm and proof type.

      CoinName
42    42 Coin
404   404Coin
1337  EliteCoin
BTC   Bitcoin
ETH   Ethereum

      Algorithm       ProofType  TotalCoinsMined  TotalCoinSupply
42    Scrypt          PoW/PoS    4.199995e+01     42
404   Scrypt          PoW/PoS    1.055185e+09     532000000
1337  X13             PoW/PoS    2.927942e+10     314159265359
BTC   SHA-256         PoW        1.792718e+07     21000000
ETH   Ethash          PoW        1.076842e+08     0
LTC   Scrypt          PoW        6.303924e+07     84000000
DASH  X11             PoW/PoS    9.031294e+06     22000000
XMR   CryptoNight-V7  PoW        1.720114e+07     0
ETC   Ethash          PoW        1.133597e+08     210000000
ZEC   Equihash        PoW        7.383056e+06     21000000
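The filtering steps described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the names raw_df, preprocess, crypto_df, and names_df are hypothetical, the rows are copied from the sample shown earlier, and the last filter is assumed to act on TotalCoinsMined, which is consistent with the sample rows that survive.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw data, using a few of the sample rows shown above.
raw_df = pd.DataFrame(
    {
        "CoinName": ["42 Coin", "365Coin", "808", "Bitcoin", "Ethereum"],
        "Algorithm": ["Scrypt", "X11", "SHA-256", "SHA-256", "Ethash"],
        "IsTrading": [True, True, True, True, True],
        "ProofType": ["PoW/PoS", "PoW/PoS", "PoW/PoS", "PoW", "PoW"],
        "TotalCoinsMined": [4.199995e01, np.nan, 0.0, 1.792718e07, 1.076842e08],
        "TotalCoinSupply": ["42", "2300000000", "0", "21000000", "0"],
    },
    index=["42", "365", "808", "BTC", "ETH"],
)


def preprocess(df):
    """Hypothetical helper: filter the coins and split names from features."""
    df = df[df["IsTrading"]].drop(columns="IsTrading")  # keep trading coins only
    df = df.dropna()                                    # drop rows with null values
    df = df[df["TotalCoinsMined"] > 0]                  # assumed filter: coins mined > 0
    names = df[["CoinName"]]                            # names are stored separately
    features = df.drop(columns="CoinName")              # names must not affect clustering
    return features, names


crypto_df, names_df = preprocess(raw_df)
print(crypto_df.index.tolist())
```

With these sample rows, only 42, BTC, and ETH remain: 365Coin is dropped for its missing value and 808 for having no coins mined.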

To address these columns, dummy variables were created for each algorithm and proof type with the pandas function 'get_dummies', which encodes each category as a binary column. Subsequently, all columns were standardized to a mean of 0 and a standard deviation of 1 using StandardScaler from the scikit-learn library.

TotalCoinsMined TotalCoinSupply Algorithm_1GB AES Pattern Search Algorithm_536 Algorithm_Argon2d Algorithm_BLAKE256 Algorithm_Blake Algorithm_Blake2S Algorithm_Blake2b Algorithm_C11 ... ProofType_PoW/PoS ProofType_PoW/PoS ProofType_PoW/PoW ProofType_PoW/nPoS ProofType_Pos ProofType_Proof of Authority ProofType_Proof of Trust ProofType_TPoS ProofType_Zero-Knowledge Proof ProofType_dPoW/PoW
42 4.199995e+01 42 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
404 1.055185e+09 532000000 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
1337 2.927942e+10 314159265359 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
BTC 1.792718e+07 21000000 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
ETH 1.076842e+08 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 100 columns

from sklearn.preprocessing import StandardScaler

# Standardize the data with StandardScaler().
X = StandardScaler().fit_transform(X)
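The encoding and standardization steps can be illustrated end to end on a few of the rows above. This is a sketch under assumptions: the small crypto_df below is a toy stand-in, and X_scaled is a hypothetical name for the standardized array.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the cleaned data (values taken from the tables above).
crypto_df = pd.DataFrame(
    {
        "Algorithm": ["Scrypt", "SHA-256", "Ethash", "Scrypt"],
        "ProofType": ["PoW/PoS", "PoW", "PoW", "PoW"],
        "TotalCoinsMined": [4.199995e01, 1.792718e07, 1.076842e08, 6.303924e07],
        "TotalCoinSupply": [42.0, 21000000.0, 0.0, 84000000.0],
    },
    index=["42", "BTC", "ETH", "LTC"],
)

# One 0/1 column per distinct algorithm and proof type.
X = pd.get_dummies(crypto_df, columns=["Algorithm", "ProofType"], dtype=int)

# Standardize every column to mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)

print(X.columns.tolist())
print(X_scaled.mean(axis=0).round(6))  # per-column means, approximately 0
print(X_scaled.std(axis=0).round(6))   # per-column standard deviations, approximately 1
```

Note that the standardized values are not confined to a fixed range; only their mean and spread are fixed.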

Deliverable 2: Reducing Data Dimensions Using PCA


Due to the use of dummy variables, the data expanded to 100 columns, far too many dimensions to present visually. Using PCA() from scikit-learn, the number of dimensions was reduced to 3 so that the data could be presented in a 3-dimensional graph. The components were named PC 1, PC 2, and PC 3, and the resulting dataframe keeps the index established in the previous dataframe.

PC 1 PC 2 PC 3
42 -0.299625 1.093503 -0.495499
404 -0.282859 1.093897 -0.495606
1337 2.317407 1.718744 -0.532707
BTC -0.152806 -1.298636 0.123549
ETH -0.162274 -2.025853 0.333898
LTC -0.133639 -1.091527 -0.035369
DASH -0.411299 1.238712 -0.421072
XMR -0.156051 -2.275822 0.285527
ETC -0.160707 -2.025936 0.333897
ZEC -0.150078 -2.175185 0.269426
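A reduction of this kind can be sketched as below. The input here is synthetic random data standing in for the standardized feature matrix, and the names X_scaled, index, and pca are assumptions made for illustration; only the column labels and the choice of 3 components come from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for the standardized feature matrix (20 coins, 10 features).
rng = np.random.default_rng(0)
index = [f"COIN{i}" for i in range(20)]
X_scaled = rng.normal(size=(20, 10))

# Reduce to three principal components.
pca = PCA(n_components=3)
pcs = pca.fit_transform(X_scaled)

# Keep the original index so the components can be joined back to the coins.
pcs_df = pd.DataFrame(pcs, columns=["PC 1", "PC 2", "PC 3"], index=index)

print(pcs_df.head())
# Fraction of the total variance retained by the three components.
print(pca.explained_variance_ratio_.sum())
```

Checking the retained variance is a useful sanity step, since three components may capture only part of the structure in 100 columns.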

Deliverable 3: Clustering Cryptocurrencies Using K-Means


Finding the Best Value for 'k' Using the Elbow Curve

To determine the optimal number of clusters for the K-Means Clustering application, an elbow curve was created comparing the inertia with the number of clusters. There is a clear bend at 4 clusters with diminishing returns after, so the analyses were performed with 4 clusters.

Elbow_Curve
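The inertia values behind an elbow curve can be computed with a loop like the one below. This is a sketch on synthetic data generated with make_blobs, not the project's principal components; the range of k values and the n_init setting are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 3-dimensional points with four true groups, standing in for the PCA output.
points, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=0)

# Fit K-Means for each candidate k and record the inertia
# (the sum of squared distances from each point to its nearest centroid).
k_values = list(range(1, 11))
inertia = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(points)
    inertia.append(model.inertia_)

for k, value in zip(k_values, inertia):
    print(k, round(value, 1))
```

Plotting inertia against k then shows where adding another cluster stops producing a large drop.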

from sklearn.cluster import KMeans

# Initialize the K-Means model.
model = KMeans(n_clusters=4, random_state=0)

# Fit the model
model.fit(pcs_df)

# Predict clusters
predictions = model.predict(pcs_df)

The outcome of the KMeans analysis with 4 clusters provided an array with clusters labeled 0, 1, 2, and 3.

[0 0 0 3 3 3 0 3 3 3 0 3 0 0 3 0 3 3 0 0 3 3 3 3 3 0 3 3 3 0 3 0 3 3 0 0 3 3 3 3 3 3 0 0 3 3 3 3 3 0 0 3 0 3 3 3 3 0 3 3 0 3 0 0 0 3 3 3 0 0 0 0 0 3 3 3 0 0 3 0 3 0 0 3 3 3 3 0 0 3 0 3 3 0 0 3 0 0 3 3 0 0 3 0 0 3 0 3 0 3 0 3 0 0 3 3 0 3 3 3 0 3 3 3 3 3 0 0 3 3 3 0 3 0 3 3 0 3 0 3 0 0 3 3 0 3 3 0 0 3 0 3 0 0 0 3 3 3 3 0 0 0 0 0 3 3 0 0 0 0 0 3 0 0 0 0 0 3 0 3 0 0 3 0 3 0 0 3 0 3 0 3 0 3 0 0 0 0 3 0 0 0 0 0 3 3 0 0 3 3 0 0 0 0 0 3 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 3 3 3 0 0 0 0 3 0 3 0 0 3 0 3 3 0 3 3 0 3 0 0 0 3 0 0 3 0 0 0 0 0 0 0 3 0 3 0 0 0 0 3 0 3 0 3 3 3 3 0 3 0 0 3 0 3 3 3 0 3 0 3 3 3 0 3 0 3 0 0 1 3 0 3 3 3 3 3 0 0 3 0 0 0 3 0 3 0 3 0 3 0 0 0 0 3 0 0 3 0 0 0 3 3 3 3 0 0 0 0 3 0 3 3 3 0 0 3 3 0 0 3 0 3 3 3 0 3 3 0 0 0 3 3 3 0 0 0 3 3 0 3 3 3 3 0 1 1 3 3 3 0 1 0 0 0 0 3 3 3 3 0 0 0 3 0 3 0 0 0 0 3 0 0 3 0 0 3 3 0 3 0 3 3 3 3 0 0 3 0 3 0 0 0 0 0 0 3 3 3 0 0 0 0 0 0 3 0 3 3 3 3 0 0 0 0 3 0 0 3 0 0 3 1 3 0 3 3 0 0 3 0 3 3 3 3 3 0 3 0 3 0 0 3 0 0 0 0 0 3 3 3 0 0 0 3 0 3 0 3 0 0 0 0 3 0 0 0 3 0 3 0 3 0 0 0 3 3 0 0 0 0 0 0 1 3 0 3 0 3 0 0 1 0 2 0 0 0 3 3 0]

These class labels were then combined into a single dataframe with the PCA data, the original cryptocurrency dataframe (Algorithm, ProofType, and the two coin totals), and the coin names.

      Algorithm       ProofType  TotalCoinsMined  TotalCoinSupply  PC 1       PC 2       PC 3       CoinName          Class
42    Scrypt          PoW/PoS    4.199995e+01     42               -0.299625  1.093503   -0.495499  42 Coin           0
404   Scrypt          PoW/PoS    1.055185e+09     532000000        -0.282859  1.093897   -0.495606  404Coin           0
1337  X13             PoW/PoS    2.927942e+10     314159265359     2.317407   1.718744   -0.532707  EliteCoin         0
BTC   SHA-256         PoW        1.792718e+07     21000000         -0.152806  -1.298636  0.123549   Bitcoin           3
ETH   Ethash          PoW        1.076842e+08     0                -0.162274  -2.025853  0.333898   Ethereum          3
LTC   Scrypt          PoW        6.303924e+07     84000000         -0.133639  -1.091527  -0.035369  Litecoin          3
DASH  X11             PoW/PoS    9.031294e+06     22000000         -0.411299  1.238712   -0.421072  Dash              0
XMR   CryptoNight-V7  PoW        1.720114e+07     0                -0.156051  -2.275822  0.285527   Monero            3
ETC   Ethash          PoW        1.133597e+08     210000000        -0.160707  -2.025936  0.333897   Ethereum Classic  3
ZEC   Equihash        PoW        7.383056e+06     21000000         -0.150078  -2.175185  0.269426   ZCash             3
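One way to assemble a dataframe with this layout is sketched below, using three of the rows above. The use of pd.concat is an assumption about how the pieces were joined, and crypto_df and names_df are hypothetical names for the feature and name dataframes.

```python
import pandas as pd

idx = ["42", "BTC", "ETH"]

# Original (unscaled) features, principal components, and names share one index.
crypto_df = pd.DataFrame(
    {
        "Algorithm": ["Scrypt", "SHA-256", "Ethash"],
        "ProofType": ["PoW/PoS", "PoW", "PoW"],
        "TotalCoinsMined": [4.199995e01, 1.792718e07, 1.076842e08],
        "TotalCoinSupply": ["42", "21000000", "0"],
    },
    index=idx,
)
pcs_df = pd.DataFrame(
    {
        "PC 1": [-0.299625, -0.152806, -0.162274],
        "PC 2": [1.093503, -1.298636, -2.025853],
        "PC 3": [-0.495499, 0.123549, 0.333898],
    },
    index=idx,
)
names_df = pd.DataFrame({"CoinName": ["42 Coin", "Bitcoin", "Ethereum"]}, index=idx)
predictions = [0, 3, 3]  # cluster labels, in the same row order as pcs_df

# Join column-wise on the shared index, then attach the labels.
clustered_df = pd.concat([crypto_df, pcs_df, names_df], axis=1)
clustered_df["Class"] = predictions

print(clustered_df)
```

Because the labels are assigned by position, this only works if the rows are in the same order as the data the model was fitted on.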

Deliverable 4: Visualizing Cryptocurrencies Results


3D-Scatter with Clusters

With the dimensions reduced by PCA, the results can be visualized in a 3-D plot. The plot was created using Plotly Express and shows each point with its name, the algorithm used, and its principal component values. The plot is interactive, allowing the user to hover over individual points, zoom, and rotate.

3D_Graph

Classes 1 and 2 stand out clearly in this graph. (Note: the class numbers change each time the application is run; the numbers used here refer to the figure above.)

  • BitTorrent stands in a class of its own (Class 2). It shows very little influence from PC 3, a higher influence from PC 2 than Classes 0 and 1, and an extremely high influence from PC 1, which none of the other classes share
  • Classes 0 and 3 are tightly associated, both showing a low impact from PC 1 and PC 3
  • Class 0 is more positive than Class 3 in PC 2, although it does contain some negative PC 2 values
  • Class 1 is a loosely associated class whose members share a non-zero influence from PC 3 and are otherwise greater than 0 in PC 2
  • Since Class 2 is such an outlier from the rest of the classes, it would be beneficial to reanalyze these data without BitTorrent to see whether the other currencies remain in their respective classes
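The reanalysis suggested in the last point could be approached as sketched below. Everything here is illustrative: the points are synthetic, "OUTLIER" stands in for the single-member class, and reducing k from 4 to 3 after removing it is an assumption. The adjusted Rand index is used for the comparison because it ignores how the classes happen to be numbered, which matters since the numbering changes between runs.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in: three tight groups plus one extreme point along PC 1.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0, 1, 0], scale=0.1, size=(30, 3))
group_b = rng.normal(loc=[0, -2, 0], scale=0.1, size=(30, 3))
group_c = rng.normal(loc=[0, 0, 5], scale=0.1, size=(5, 3))
outlier = np.array([[30.0, 3.0, 0.0]])
index = [f"COIN{i}" for i in range(65)] + ["OUTLIER"]
pcs_df = pd.DataFrame(
    np.vstack([group_a, group_b, group_c, outlier]),
    columns=["PC 1", "PC 2", "PC 3"],
    index=index,
)

# Cluster the full data with 4 clusters.
full_model = KMeans(n_clusters=4, n_init=10, random_state=0)
full_labels = pd.Series(full_model.fit_predict(pcs_df), index=pcs_df.index)

# Drop the outlier and cluster again with one fewer cluster (an assumption).
trimmed_df = pcs_df.drop(index="OUTLIER")
trimmed_model = KMeans(n_clusters=3, n_init=10, random_state=0)
trimmed_labels = pd.Series(trimmed_model.fit_predict(trimmed_df), index=trimmed_df.index)

# A score of 1.0 means the remaining points are grouped identically in both runs.
score = adjusted_rand_score(full_labels.loc[trimmed_df.index], trimmed_labels)
print(round(score, 3))
```

A score well below 1 on the real data would indicate that the outlier had been distorting how the remaining currencies were grouped.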

Since it is difficult to filter or find individual currencies, such as the ones mentioned in the Overview, hvplot was used to create an interactive table that can be sorted by each of the original parameters.

Table

From here the user can inspect the cryptocurrencies organized by any of the individual parameters.

# Print the total number of tradable currencies in the Clustered_df Dataframe
print(f"There are {len(clustered_df)} tradable currencies in the dataframe")

There are 533 tradable currencies in the dataframe

The final analysis considers the total number of coins mined and the total coin supply across the different classes. A low number of coins mined may suggest that the mining algorithm is complex, which would keep the supply low. Conversely, a currency that is readily mined may show both a high number of coins mined and a high supply, since coins become available quickly. These factors may affect the economics of each currency and could give investors an indication of which currencies are likely to increase or decrease in value over time. Because these values differ by up to 12 orders of magnitude, MinMaxScaler was used to rescale them to the range 0 to 1.

from sklearn.preprocessing import MinMaxScaler

# Select the columns to scale
totals = clustered_df[['TotalCoinSupply', 'TotalCoinsMined']]
scaled_totals = MinMaxScaler(feature_range=(0,1)).fit_transform(totals)

# Assumed missing step: copy clustered_df and overwrite the two columns with the scaled values
clustered_scaled_df = clustered_df.copy()
clustered_scaled_df[['TotalCoinSupply', 'TotalCoinsMined']] = scaled_totals
clustered_scaled_df = clustered_scaled_df[["TotalCoinSupply", "TotalCoinsMined", "CoinName", "Class"]]

      TotalCoinSupply  TotalCoinsMined  CoinName          Class
42    4.200000e-11     0.005942         42 Coin           0
404   5.320000e-04     0.007002         404Coin           0
1337  3.141593e-01     0.035342         EliteCoin         0
BTC   2.100000e-05     0.005960         Bitcoin           3
ETH   0.000000e+00     0.006050         Ethereum          3
LTC   8.400000e-05     0.006006         Litecoin          3
DASH  2.200000e-05     0.005951         Dash              0
XMR   0.000000e+00     0.005960         Monero            3
ETC   2.100000e-04     0.006056         Ethereum Classic  3
ZEC   2.100000e-05     0.005950         ZCash             3

# Alternatively, the above steps can be done...
# Create a new dataframe that has the scaled data
plot_df = pd.DataFrame(data=scaled_totals, columns=["TotalCoinSupply", "TotalCoinsMined"])

# Add the indexing from the original df
plot_df.index = clustered_df.index  

# Add the coin name and add the class from the original 
plot_df[["CoinName", "Class"]] = clustered_df[["CoinName", "Class"]] 

These data were plotted in a 2D scatter plot. Once again, Class 2 (the BitTorrent currency) stands out, having both the highest supply and the highest number of coins mined. One currency in Class 1, TurtleCoin, also matches the maximum total coin supply, but its number of coins mined is only a fraction of BitTorrent's.

import hvplot.pandas  # registers the .hvplot accessor on pandas DataFrames

# Create a hvplot.scatter plot using x="TotalCoinsMined" and y="TotalCoinSupply".
plot_df.hvplot.scatter(
    x="TotalCoinsMined",
    y="TotalCoinSupply",
    by="Class",
    hover_cols="Class",
    width=800,
    height=500,
    s=150,
    alpha=0.8,
    selection_alpha=0.1,
    line_color='black',
    line_alpha=0.5,
    title="K-Means Scatter of Cryptocurrencies, Total Coins Mined vs. Total Coin Supply",
)

2D_Graph


To finally address the question posed at the beginning regarding someone interested in a particular subset of cryptocurrencies by name, a function was created to do just this, requiring only the input of a list with those coin names.

def investor(investments, df=clustered_df):
    # Filter the dataframe passed in as df, rather than always using the global clustered_df
    results = df[df['CoinName'].isin(investments)]
    results = results[["CoinName", "Algorithm", "ProofType", "Class"]]
    return results

      CoinName     Algorithm  ProofType  Class
EMC2  Einsteinium  Scrypt     PoW        1
ACM   Actinium     Lyra2Z     PoW        1
LIT   Lithium      Blake      PoW        1
RADS  Radium       PoS        PoS        0

      CoinName               Algorithm  ProofType  Class
42    42 Coin                Scrypt     PoW/PoS    0
UNO   Unobtainium            SHA-256    PoW        1
KED   Klingon Empire Darsek  Scrypt     PoW/PoS    0

      CoinName                Algorithm    ProofType  Class
CCN   CannaCoin               Scrypt       PoW        1
POT   PotCoin                 Scrypt       PoW/PoS    0
STV   Sativa Coin             X13          PoW/PoS    0
DOPE  DopeCoin                Scrypt       PoW        1
GNJ   GanjaCoin V2            X14          PoW/PoS    0
XCI   Cannabis Industry Coin  CryptoNight  PoW        1
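A lookup of this kind can be exercised on a small frame as sketched below. The toy clustered_df holds a few rows copied from the result tables above, the function here takes the dataframe as an explicit argument, and the check for names that were not found is a hypothetical addition: isin silently skips names that do not appear in the data, so a misspelled or filtered-out coin would otherwise go unnoticed.

```python
import pandas as pd

# Toy stand-in with rows copied from the result tables above.
clustered_df = pd.DataFrame(
    {
        "CoinName": ["42 Coin", "Unobtainium", "Klingon Empire Darsek", "PotCoin", "CannaCoin"],
        "Algorithm": ["Scrypt", "SHA-256", "Scrypt", "Scrypt", "Scrypt"],
        "ProofType": ["PoW/PoS", "PoW", "PoW/PoS", "PoW/PoS", "PoW"],
        "Class": [0, 1, 0, 0, 1],
    },
    index=["42", "UNO", "KED", "POT", "CCN"],
)


def investor(investments, df):
    """Return name, algorithm, proof type, and class for the requested coins."""
    results = df[df["CoinName"].isin(investments)]
    return results[["CoinName", "Algorithm", "ProofType", "Class"]]


# "Osmium" is not in this toy frame, so it is simply absent from the result.
wanted = ["42 Coin", "Unobtainium", "Klingon Empire Darsek", "Osmium"]
results = investor(wanted, clustered_df)

# Hypothetical extra step: report requested names that were not found.
missing = sorted(set(wanted) - set(results["CoinName"]))

print(results)
print("Not found:", missing)
```

The names must match the CoinName column exactly, including spacing and capitalization.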

To answer our scientist investor: while the elements are spread across the periodic table, all but one of the element-named cryptocurrencies found fall into Class 1, with Radium alone in Class 0. Our science fiction enthusiast may be interested to know that one of the three, Unobtainium, is in Class 1, while the other two are in Class 0. For our cannabis entrepreneur trying to diversify across classes, however, the 6 cannabis-themed coins fall into only two of the classes. This is expected, since Class 2 contained only one cryptocurrency and Class 3 was much smaller than the others. Even so, the function gives users the ability to choose and compare individual currencies, provided they know the coin name.
