Application of K-Means Clustering to Identify Groups of Cryptocurrencies
As cryptocurrencies continue to rise in popularity worldwide, it is difficult to determine which cryptocurrencies are similar. Many cryptocurrencies use similar algorithms, yet others rely on unique algorithms that may yield the same types of results. As the number of unique cryptocurrencies continues to rise, it is difficult to ascertain how they are related, and whether the algorithm itself is even a good predictor of how each crypto can be classified. K-Means clustering is applied here to perform unsupervised machine learning in the grouping of over 500 active currencies. Knowing which cryptocurrencies are similar is important for potential crypto investors. Consider, for example, a periodic table enthusiast who recently discovered cryptocurrencies and wants to sample the elements while also diversifying. Are the differences between Osmium, Actinium, Lithium, Einsteinium, and Radium as simple as their locations on the periodic table? Or perhaps a cannabis entrepreneur is looking to invest in 3 different cryptocurrencies, has to choose between Cannabis Industry Coin, Canna Coin, Sativa Coin, GanjaCoin, KushCoin and PotCoin, and wants to diversify these investments in the same way as a stock portfolio. Or perhaps a sci-fi investor is willing to put all of their money into one place and wants it to go into whatever is most similar to Bitcoin, but only if it is 42, Unobtainium, or Klingon Empire Darsek. This application aims to classify all of these currencies through unsupervised learning.
At first glance, the data include cryptocurrencies that are not trading, as well as currencies that have not mined or acquired a supply of coins.
CoinName | Algorithm | IsTrading | ProofType | TotalCoinsMined | TotalCoinSupply | |
---|---|---|---|---|---|---|
42 | 42 Coin | Scrypt | True | PoW/PoS | 4.199995e+01 | 42 |
365 | 365Coin | X11 | True | PoW/PoS | NaN | 2300000000 |
404 | 404Coin | Scrypt | True | PoW/PoS | 1.055185e+09 | 532000000 |
611 | SixEleven | SHA-256 | True | PoW | NaN | 611000 |
808 | 808 | SHA-256 | True | PoW/PoS | 0.000000e+00 | 0 |
1337 | EliteCoin | X13 | True | PoW/PoS | 2.927942e+10 | 314159265359 |
2015 | 2015 coin | X11 | True | PoW/PoS | NaN | 0 |
BTC | Bitcoin | SHA-256 | True | PoW | 1.792718e+07 | 21000000 |
ETH | Ethereum | Ethash | True | PoW | 1.076842e+08 | 0 |
LTC | Litecoin | Scrypt | True | PoW | 6.303924e+07 | 84000000 |
Initially, all currencies that were not trading were removed, after which the "IsTrading" column was dropped. Subsequently, any row containing a null value was removed. Finally, since purchasing a currency requires at least 1 coin, all cryptocurrencies with no coins mined were also removed (currencies such as Ethereum, whose listed TotalCoinSupply is 0, remain in the data). From the 1252 currencies listed, only 533 met all of the criteria: a non-zero number of coins mined; a coin name, algorithm, proof type, number of coins mined and total supply that were all accounted for; and active trading. To prepare the data for machine learning, the coin code was used as the index, and the names were saved in a separate dataframe, as the name should not influence a currency's grouping. The remaining non-numerical data in the set were the algorithm and proof type.
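The filtering steps described above can be sketched with pandas as follows. This is a minimal illustration on a made-up four-row dataframe, not the notebook's actual code: the names `crypto_df`, `filtered`, `coin_names` and `features` are hypothetical, and it assumes the zero filter is applied to TotalCoinsMined, since the retained rows shown below include coins with a TotalCoinSupply of 0.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw data; every value here is made up for illustration.
crypto_df = pd.DataFrame(
    {
        "CoinName": ["Coin A", "Coin B", "Coin C", "Coin D"],
        "Algorithm": ["Scrypt", "X11", "SHA-256", "Scrypt"],
        "IsTrading": [True, True, False, True],
        "ProofType": ["PoW", "PoW/PoS", "PoW", "PoS"],
        "TotalCoinsMined": [100.0, np.nan, 50.0, 0.0],
        "TotalCoinSupply": ["1000", "2000", "500", "0"],
    },
    index=["AAA", "BBB", "CCC", "DDD"],
)

# 1. Keep only actively trading coins, then drop the now-constant column.
filtered = crypto_df[crypto_df["IsTrading"]].drop(columns="IsTrading")
# 2. Drop every row that contains a missing value.
filtered = filtered.dropna()
# 3. Keep only coins with a positive number of coins mined (assumed filter).
filtered = filtered[filtered["TotalCoinsMined"] > 0]
# 4. Set the names aside so they cannot influence the clustering.
coin_names = filtered[["CoinName"]]
features = filtered.drop(columns="CoinName")
```

In this toy case only the row indexed `AAA` survives: `BBB` has a missing value, `CCC` is not trading, and `DDD` has no coins mined.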
CoinName | |
---|---|
42 | 42 Coin |
404 | 404Coin |
1337 | EliteCoin |
BTC | Bitcoin |
ETH | Ethereum |
Algorithm | ProofType | TotalCoinsMined | TotalCoinSupply | |
---|---|---|---|---|
42 | Scrypt | PoW/PoS | 4.199995e+01 | 42 |
404 | Scrypt | PoW/PoS | 1.055185e+09 | 532000000 |
1337 | X13 | PoW/PoS | 2.927942e+10 | 314159265359 |
BTC | SHA-256 | PoW | 1.792718e+07 | 21000000 |
ETH | Ethash | PoW | 1.076842e+08 | 0 |
LTC | Scrypt | PoW | 6.303924e+07 | 84000000 |
DASH | X11 | PoW/PoS | 9.031294e+06 | 22000000 |
XMR | CryptoNight-V7 | PoW | 1.720114e+07 | 0 |
ETC | Ethash | PoW | 1.133597e+08 | 210000000 |
ZEC | Equihash | PoW | 7.383056e+06 | 21000000 |
To address these columns, dummy variables were created for each algorithm and proof type by applying the pandas function 'get_dummies', which encodes each category as a binary column. Subsequently, every column was standardized to a mean of 0 and a standard deviation of 1 using StandardScaler from the scikit-learn library. Note that standardized values are not confined to the range 0 to 1.
TotalCoinsMined | TotalCoinSupply | Algorithm_1GB AES Pattern Search | Algorithm_536 | Algorithm_Argon2d | Algorithm_BLAKE256 | Algorithm_Blake | Algorithm_Blake2S | Algorithm_Blake2b | Algorithm_C11 | ... | ProofType_PoW/PoS | ProofType_PoW/PoS | ProofType_PoW/PoW | ProofType_PoW/nPoS | ProofType_Pos | ProofType_Proof of Authority | ProofType_Proof of Trust | ProofType_TPoS | ProofType_Zero-Knowledge Proof | ProofType_dPoW/PoW | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
42 | 4.199995e+01 | 42 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
404 | 1.055185e+09 | 532000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1337 | 2.927942e+10 | 314159265359 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
BTC | 1.792718e+07 | 21000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ETH | 1.076842e+08 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 100 columns
# Standardize the data with StandardScaler().
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)
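To make the effect of standardization concrete, here is a small pure-Python sketch of the per-column arithmetic (subtract the mean, divide by the population standard deviation). The helper `standardize` and the sample columns are hypothetical and only illustrate the calculation; the analysis itself relies on scikit-learn.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]

# A 0/1 dummy column and a wide-ranging numeric column (made-up values).
dummy_column = [0, 0, 1, 0]
supply_column = [42.0, 5.32e8, 3.14e11, 2.1e7]

scaled_dummy = standardize(dummy_column)
scaled_supply = standardize(supply_column)
```

After this transformation each column has mean 0 and standard deviation 1, and individual values can fall below 0 or above 1. A rare dummy category, for instance, maps to a large positive value for the few coins that have it.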
Due to the use of dummies, the data expanded to 100 columns, a number of dimensions that is impossible to present visually. Using PCA() from scikit-learn, the number of dimensions was reduced to 3 so that the data could be presented in a 3-dimensional graph. The components were named PC 1, PC 2, and PC 3, and the resulting dataframe keeps the index established in the previous dataframe.
PC 1 | PC 2 | PC 3 | |
---|---|---|---|
42 | -0.299625 | 1.093503 | -0.495499 |
404 | -0.282859 | 1.093897 | -0.495606 |
1337 | 2.317407 | 1.718744 | -0.532707 |
BTC | -0.152806 | -1.298636 | 0.123549 |
ETH | -0.162274 | -2.025853 | 0.333898 |
LTC | -0.133639 | -1.091527 | -0.035369 |
DASH | -0.411299 | 1.238712 | -0.421072 |
XMR | -0.156051 | -2.275822 | 0.285527 |
ETC | -0.160707 | -2.025936 | 0.333897 |
ZEC | -0.150078 | -2.175185 | 0.269426 |
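A reduction of this kind might look like the sketch below. It runs on random stand-in data rather than the real scaled features, so the array `X`, the index labels and the variable `retained` are assumptions made for illustration. It also reads `explained_variance_ratio_`, which indicates how much of the original variance three components retain, a useful check when compressing many columns into three.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for the scaled feature matrix: 20 coins, 10 features.
index = [f"COIN{i}" for i in range(20)]
X = rng.normal(size=(20, 10))

# Project the features onto the first three principal components.
pca = PCA(n_components=3)
components = pca.fit_transform(X)

# Keep the coin codes as the index so rows can be matched up later.
pcs_df = pd.DataFrame(components, columns=["PC 1", "PC 2", "PC 3"], index=index)

# Fraction of the total variance kept by the three components.
retained = pca.explained_variance_ratio_.sum()
```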
Finding the Best Value for 'k' Using the Elbow Curve
To determine the optimal number of clusters for the K-Means Clustering application, an elbow curve was created comparing the inertia with the number of clusters. There is a clear bend at 4 clusters with diminishing returns after, so the analyses were performed with 4 clusters.
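The inertia values behind an elbow curve can be collected with a simple loop over candidate cluster counts. The sketch below uses synthetic, well-separated blobs in place of the PCA data, so `points`, `k_values` and `inertia` are illustrative names, and the range of 1 to 10 clusters is an assumption rather than the range used in the analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in: four well-separated blobs in three dimensions.
centers = np.array([[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(25, 3)) for c in centers])

k_values = list(range(1, 11))
inertia = []
for k in k_values:
    # Fit one model per candidate k and record its within-cluster sum of squares.
    candidate = KMeans(n_clusters=k, n_init=10, random_state=0)
    candidate.fit(points)
    inertia.append(candidate.inertia_)
```

Plotting `inertia` against `k_values` gives the elbow curve; with these synthetic blobs the sharp drop ends at 4 clusters, which is the kind of bend described above.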
# Initialize the K-Means model.
from sklearn.cluster import KMeans
model = KMeans(n_clusters=4, random_state=0)
# Fit the model
model.fit(pcs_df)
# Predict clusters
predictions = model.predict(pcs_df)
The outcome of the KMeans analysis with 4 clusters provided an array with clusters labeled 0, 1, 2, and 3.
[0 0 0 3 3 3 0 3 3 3 0 3 0 0 3 0 3 3 0 0 3 3 3 3 3 0 3 3 3 0 3 0 3 3 0 0 3 3 3 3 3 3 0 0 3 3 3 3 3 0 0 3 0 3 3 3 3 0 3 3 0 3 0 0 0 3 3 3 0 0 0 0 0 3 3 3 0 0 3 0 3 0 0 3 3 3 3 0 0 3 0 3 3 0 0 3 0 0 3 3 0 0 3 0 0 3 0 3 0 3 0 3 0 0 3 3 0 3 3 3 0 3 3 3 3 3 0 0 3 3 3 0 3 0 3 3 0 3 0 3 0 0 3 3 0 3 3 0 0 3 0 3 0 0 0 3 3 3 3 0 0 0 0 0 3 3 0 0 0 0 0 3 0 0 0 0 0 3 0 3 0 0 3 0 3 0 0 3 0 3 0 3 0 3 0 0 0 0 3 0 0 0 0 0 3 3 0 0 3 3 0 0 0 0 0 3 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 3 3 3 0 0 0 0 3 0 3 0 0 3 0 3 3 0 3 3 0 3 0 0 0 3 0 0 3 0 0 0 0 0 0 0 3 0 3 0 0 0 0 3 0 3 0 3 3 3 3 0 3 0 0 3 0 3 3 3 0 3 0 3 3 3 0 3 0 3 0 0 1 3 0 3 3 3 3 3 0 0 3 0 0 0 3 0 3 0 3 0 3 0 0 0 0 3 0 0 3 0 0 0 3 3 3 3 0 0 0 0 3 0 3 3 3 0 0 3 3 0 0 3 0 3 3 3 0 3 3 0 0 0 3 3 3 0 0 0 3 3 0 3 3 3 3 0 1 1 3 3 3 0 1 0 0 0 0 3 3 3 3 0 0 0 3 0 3 0 0 0 0 3 0 0 3 0 0 3 3 0 3 0 3 3 3 3 0 0 3 0 3 0 0 0 0 0 0 3 3 3 0 0 0 0 0 0 3 0 3 3 3 3 0 0 0 0 3 0 0 3 0 0 3 1 3 0 3 3 0 0 3 0 3 3 3 3 3 0 3 0 3 0 0 3 0 0 0 0 0 3 3 3 0 0 0 3 0 3 0 3 0 0 0 0 3 0 0 0 3 0 3 0 3 0 0 0 3 3 0 0 0 0 0 0 1 3 0 3 0 3 0 0 1 0 2 0 0 0 3 3 0]
These class labels were concatenated into a single dataframe along with the PCA data and the original cryptocurrency dataframe containing the Algorithm, Proof Type, and Coin Names.
Algorithm | ProofType | TotalCoinsMined | TotalCoinSupply | PC 1 | PC 2 | PC 3 | CoinName | Class | |
---|---|---|---|---|---|---|---|---|---|
42 | Scrypt | PoW/PoS | 4.199995e+01 | 42 | -0.299625 | 1.093503 | -0.495499 | 42 Coin | 0 |
404 | Scrypt | PoW/PoS | 1.055185e+09 | 532000000 | -0.282859 | 1.093897 | -0.495606 | 404Coin | 0 |
1337 | X13 | PoW/PoS | 2.927942e+10 | 314159265359 | 2.317407 | 1.718744 | -0.532707 | EliteCoin | 0 |
BTC | SHA-256 | PoW | 1.792718e+07 | 21000000 | -0.152806 | -1.298636 | 0.123549 | Bitcoin | 3 |
ETH | Ethash | PoW | 1.076842e+08 | 0 | -0.162274 | -2.025853 | 0.333898 | Ethereum | 3 |
LTC | Scrypt | PoW | 6.303924e+07 | 84000000 | -0.133639 | -1.091527 | -0.035369 | Litecoin | 3 |
DASH | X11 | PoW/PoS | 9.031294e+06 | 22000000 | -0.411299 | 1.238712 | -0.421072 | Dash | 0 |
XMR | CryptoNight-V7 | PoW | 1.720114e+07 | 0 | -0.156051 | -2.275822 | 0.285527 | Monero | 3 |
ETC | Ethash | PoW | 1.133597e+08 | 210000000 | -0.160707 | -2.025936 | 0.333897 | Ethereum Classic | 3 |
ZEC | Equihash | PoW | 7.383056e+06 | 21000000 | -0.150078 | -2.175185 | 0.269426 | ZCash | 3 |
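One way to assemble such a combined dataframe is shown below. This is a sketch with made-up values; the variable names are hypothetical and the notebook's own code may differ. Because `pd.concat` with `axis=1` aligns rows on the index, the three pieces need to share the same coin codes.

```python
import pandas as pd

index = ["AAA", "BBB", "CCC"]
# Made-up stand-ins for the three pieces that share the same index.
crypto_df = pd.DataFrame(
    {"Algorithm": ["Scrypt", "X11", "SHA-256"], "ProofType": ["PoW", "PoW/PoS", "PoW"]},
    index=index,
)
pcs_df = pd.DataFrame(
    {"PC 1": [-0.3, 2.3, -0.2], "PC 2": [1.1, 1.7, -1.3], "PC 3": [-0.5, -0.5, 0.1]},
    index=index,
)
coin_names = pd.DataFrame({"CoinName": ["Coin A", "Coin B", "Coin C"]}, index=index)
predictions = [0, 1, 0]

# Join the pieces side by side; rows are matched on the shared index.
clustered_df = pd.concat([crypto_df, pcs_df, coin_names], axis=1)
# The predictions array is in the same row order as the data that was clustered.
clustered_df["Class"] = predictions
```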
With the reduced dimensions from the PCA application, the results can be visualized in a 3-D plot. The plot was created using Plotly Express to show each point with its name, the algorithm used, and its principal component data. The plot is interactive, allowing the user to hover over individual points, zoom, and rotate.
The points in Classes 1 and 2 stand out clearly in this graph. (Note: the class numbers can change each time the application is run; the numbers used here refer to the figure above.)
- BitTorrent stands in a class of its own (Class 2). It has very little contribution from PC 3, a larger PC 2 value than Classes 0 and 1, and an extremely high PC 1 value, which none of the other classes share.
- Classes 0 and 3 are tightly associated, both having low values in PC 1 and PC 3.
- Class 0 is more positive than Class 3 in PC 2, although it does contain some negative PC 2 values.
- Class 1 is a loosely associated class whose members share a non-zero PC 3 value and otherwise lie above 0 in PC 2.
- Since Class 2 is such an outlier from the rest of the classes, it would be beneficial to reanalyze these data without BitTorrent to see whether the other currencies remain in their respective classes.
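The reanalysis suggested in the last point could be approached along the following lines. This is only a sketch on synthetic data: the rule "treat any single-member class as an outlier" and the choice of cluster counts are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the PCA data: two tight groups plus one extreme point.
group_a = rng.normal(loc=[0, 1, 0], scale=0.2, size=(30, 3))
group_b = rng.normal(loc=[0, -2, 0], scale=0.2, size=(30, 3))
outlier = np.array([[30.0, 3.0, 0.0]])
pcs_df = pd.DataFrame(
    np.vstack([group_a, group_b, outlier]), columns=["PC 1", "PC 2", "PC 3"]
)

# First pass: the extreme point ends up in a class of its own.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pcs_df)
counts = pd.Series(labels).value_counts()

# Assumed rule: classes with a single member are treated as outliers.
singleton_classes = counts[counts == 1].index
keep = ~pd.Series(labels).isin(singleton_classes).to_numpy()

# Second pass: refit on the remaining points with one fewer cluster.
trimmed = pcs_df[keep]
new_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(trimmed)
```

Comparing `new_labels` with the first-pass labels for the same rows would show whether the remaining points stay grouped together once the outlier is gone.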
Since it is difficult to filter or find individual currencies, such as the ones mentioned in the Overview, hvplot was used to create an interactive table that can be sorted by each of the original parameters.
From here the user can inspect the cryptocurrencies organized by any of the individual parameters.
# Print the total number of tradable currencies in the Clustered_df Dataframe
print(f"There are {len(clustered_df)} tradable currencies in the dataframe")
There are 533 tradable currencies in the dataframe
The final analysis considers the total number of coins mined and the total coin supply of the different classes. A low number of coins mined may suggest that the mining algorithm is complex, which keeps the supply down; conversely, a high number of coins mined may accompany a high supply if coins are readily mined and quickly become available. These factors may affect the economics of each currency, which can give an investment firm an indication of which currencies are likely to rise or fall in value over time. Because these values differ by up to 12 orders of magnitude, the MinMaxScaler function was used to bring them between 0 and 1.
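# Select the columns to scale
from sklearn.preprocessing import MinMaxScaler
totals = clustered_df[['TotalCoinSupply', 'TotalCoinsMined']]
scaled_totals = MinMaxScaler(feature_range=(0,1)).fit_transform(totals)
# Copy the clustered dataframe and overwrite the two columns with their scaled values
clustered_scaled_df = clustered_df.copy()
clustered_scaled_df[['TotalCoinSupply', 'TotalCoinsMined']] = scaled_totals
clustered_scaled_df = clustered_scaled_df[["TotalCoinSupply", "TotalCoinsMined", "CoinName", "Class"]]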
TotalCoinSupply | TotalCoinsMined | CoinName | Class | |
---|---|---|---|---|
42 | 4.200000e-11 | 0.005942 | 42 Coin | 0 |
404 | 5.320000e-04 | 0.007002 | 404Coin | 0 |
1337 | 3.141593e-01 | 0.035342 | EliteCoin | 0 |
BTC | 2.100000e-05 | 0.005960 | Bitcoin | 3 |
ETH | 0.000000e+00 | 0.006050 | Ethereum | 3 |
LTC | 8.400000e-05 | 0.006006 | Litecoin | 3 |
DASH | 2.200000e-05 | 0.005951 | Dash | 0 |
XMR | 0.000000e+00 | 0.005960 | Monero | 3 |
ETC | 2.100000e-04 | 0.006056 | Ethereum Classic | 3 |
ZEC | 2.100000e-05 | 0.005950 | ZCash | 3 |
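The min-max transformation used here maps each value x to (x - min) / (max - min). The pure-Python sketch below reproduces that arithmetic on a few made-up supply figures; the helper `min_max_scale` is hypothetical and is shown only to explain why most scaled values in the table sit so close to zero.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Linearly rescale values so the smallest maps to low and the largest to high."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    return [low + (v - v_min) / span * (high - low) for v in values]

# Made-up supplies spanning many orders of magnitude.
supply = [42.0, 2.1e7, 5.32e8, 1.0e12]
scaled = min_max_scale(supply)
```

Because a single very large supply sets the maximum, every other currency is compressed into a narrow band near 0, which matches the pattern seen in the scaled table.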
# Alternatively, the above steps can be done...
# Create a new dataframe that has the scaled data
plot_df = pd.DataFrame(data=scaled_totals, columns=["TotalCoinSupply", "TotalCoinsMined"])
# Add the indexing from the original df
plot_df.index = clustered_df.index
# Add the coin name and add the class from the original
plot_df[["CoinName", "Class"]] = clustered_df[["CoinName", "Class"]]
These data were plotted in a 2D scatter plot. As can be seen, Class 2, the BitTorrent currency, once again stands out, having the highest supply and the highest number of coins mined. Class 1 also has a single currency, TurtleCoin, that matches the maximum total coin supply, but its number of coins mined is only a fraction of BitTorrent's.
# Create a hvplot.scatter plot using x="TotalCoinsMined" and y="TotalCoinSupply".
plot_df.hvplot.scatter(x="TotalCoinsMined"
, y="TotalCoinSupply"
, by="Class"
, hover_cols="Class"
, width=800
, height=500
, s=150
, alpha = 0.8
, selection_alpha=0.1
, line_color='black'
, line_alpha = 0.5
, title="K-Means Scatter of Cryptocurrencies, Total Coins Mined vs. Total Coin Supply"
)
To finally address the question posed at the beginning regarding someone interested in a particular subset of cryptocurrencies by name, a function was created to do just this, requiring only the input of a list with those coin names.
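def investor(investments, df=clustered_df):
    # Keep only the rows whose coin name appears in the requested list
    results = df[df['CoinName'].isin(investments)]
    # Return just the columns needed to compare the selected coins
    results = results[["CoinName", "Algorithm", "ProofType", "Class"]]
    return results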
CoinName | Algorithm | ProofType | Class | |
---|---|---|---|---|
EMC2 | Einsteinium | Scrypt | PoW | 1 |
ACM | Actinium | Lyra2Z | PoW | 1 |
LIT | Lithium | Blake | PoW | 1 |
RADS | Radium | PoS | PoS | 0 |
CoinName | Algorithm | ProofType | Class | |
---|---|---|---|---|
42 | 42 Coin | Scrypt | PoW/PoS | 0 |
UNO | Unobtainium | SHA-256 | PoW | 1 |
KED | Klingon Empire Darsek | Scrypt | PoW/PoS | 0 |
CoinName | Algorithm | ProofType | Class | |
---|---|---|---|---|
CCN | CannaCoin | Scrypt | PoW | 1 |
POT | PotCoin | Scrypt | PoW/PoS | 0 |
STV | Sativa Coin | X13 | PoW/PoS | 0 |
DOPE | DopeCoin | Scrypt | PoW | 1 |
GNJ | GanjaCoin V2 | X14 | PoW/PoS | 0 |
XCI | Cannabis Industry Coin | CryptoNight | PoW | 1 |
To answer our scientist investor: while the elements are spread across the periodic table, all but one of the matching cryptocurrencies fall into Class 1. Our science fiction enthusiast may be interested to know that one of the three, Unobtainium, is in Class 1 while the others are in Class 0. However, for our cannabis entrepreneur trying to diversify across classes, the 6 cannabis-themed coins fall into only two of the classes. This is expected, since Class 2 contained only one cryptocurrency and Class 3 was much smaller than the others. Still, the function gives users the ability to choose and compare individual currencies, provided they know the coin names. (As noted earlier, class numbers can change between runs, so the labels in these tables need not match those in the earlier figures.)
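A quick way to check how diversified a selection is would be to count how many coins land in each class. The sketch below does this for the cannabis-themed results, with the rows copied from the table above; the variable names are illustrative.

```python
import pandas as pd

# Rows copied from the cannabis-themed result table above.
results = pd.DataFrame(
    {
        "CoinName": [
            "CannaCoin", "PotCoin", "Sativa Coin",
            "DopeCoin", "GanjaCoin V2", "Cannabis Industry Coin",
        ],
        "Class": [1, 0, 0, 1, 0, 1],
    },
    index=["CCN", "POT", "STV", "DOPE", "GNJ", "XCI"],
)

# Number of selected coins in each class, ordered by class label.
class_counts = results["Class"].value_counts().sort_index()
# Number of distinct classes the selection covers.
distinct_classes = results["Class"].nunique()
```

Here `distinct_classes` is 2, with three coins in each class, so a selection of three coins from this list can span at most two classes.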