Clustering Cryptocurrencies

Application of K-Means Clustering to Identify Groups of Cryptocurrencies

by Justin R. Papreck

Overview

As cryptocurrencies continue to rise in popularity worldwide, it is difficult to determine which cryptocurrencies are similar. Many cryptocurrencies use similar algorithms, yet others rely on unique algorithms that may yield the same types of results. As the number of unique cryptocurrencies continues to grow, it is hard to ascertain how they are related, and whether the algorithm itself is even a good predictor of how each one should be classified. K-Means clustering is applied here as an unsupervised machine learning method to group over 500 actively trading currencies. Knowing which cryptocurrencies are similar matters to potential crypto investors. Consider, for example, the periodic table enthusiast who recently discovered cryptocurrencies and wants to sample the elements while still diversifying. Are the differences between Osmium, Actinium, Lithium, Einsteinium, and Radium as simple as their locations on the periodic table? Or consider a cannabis entrepreneur looking to invest in 3 different cryptocurrencies, who has to choose between Cannabis Industry Coin, Canna Coin, Sativa Coin, GanjaCoin, KushCoin and PotCoin and wants to diversify those investments just as they diversify their stock. Or perhaps a sci-fi investor is willing to put all of their money into one place and wants it to go into whatever currency is most similar to Bitcoin, as long as it is 42, Unobtainium, or Klingon Empire Darsek. This application aims to classify all of these currencies through unsupervised learning.

Deliverable 1: Preprocessing the Data for Principal Component Analysis (PCA)

At first glance, the data include cryptocurrencies that are not trading, as well as currencies that have not mined or acquired a supply of coins.

	CoinName	Algorithm	IsTrading	ProofType	TotalCoinsMined	TotalCoinSupply
42	42 Coin	Scrypt	True	PoW/PoS	4.199995e+01	42
365	365Coin	X11	True	PoW/PoS	NaN	2300000000
404	404Coin	Scrypt	True	PoW/PoS	1.055185e+09	532000000
611	SixEleven	SHA-256	True	PoW	NaN	611000
808	808	SHA-256	True	PoW/PoS	0.000000e+00	0
1337	EliteCoin	X13	True	PoW/PoS	2.927942e+10	314159265359
2015	2015 coin	X11	True	PoW/PoS	NaN	0
BTC	Bitcoin	SHA-256	True	PoW	1.792718e+07	21000000
ETH	Ethereum	Ethash	True	PoW	1.076842e+08	0
LTC	Litecoin	Scrypt	True	PoW	6.303924e+07	84000000

First, all currencies that are not trading were removed, after which the "IsTrading" column was dropped. Next, every row containing a null value was removed. Finally, since purchasing a currency requires at least one coin to exist, all cryptocurrencies with no coins mined were also removed; coins such as Ethereum, whose listed TotalCoinSupply is 0, remain in the data. Of the 1252 currencies listed, only 533 met all the criteria: actively trading, a non-zero number of coins mined, and no missing values for coin name, algorithm, proof type, coins mined, or total supply. To prepare the data for machine learning, each coin's ticker code was used as the index and the names were saved in a separate dataframe, as the name should not influence a currency's grouping. The remaining non-numerical data in the set were the algorithm and proof type.
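The filtering steps above can be sketched with pandas roughly as follows. This is a minimal illustration on a few rows copied from the table above, not the notebook's actual code. The variable names (`raw`, `crypto_df`, `names_df`) are assumptions, and the final filter is assumed to act on `TotalCoinsMined`, since coins with a `TotalCoinSupply` of 0 still appear in the later tables.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw data, using rows shown in the table above.
raw = pd.DataFrame(
    {
        "CoinName": ["42 Coin", "365Coin", "404Coin", "808", "Bitcoin"],
        "Algorithm": ["Scrypt", "X11", "Scrypt", "SHA-256", "SHA-256"],
        "IsTrading": [True, True, True, True, True],
        "ProofType": ["PoW/PoS", "PoW/PoS", "PoW/PoS", "PoW/PoS", "PoW"],
        "TotalCoinsMined": [4.199995e01, np.nan, 1.055185e09, 0.0, 1.792718e07],
        "TotalCoinSupply": [42.0, 2300000000.0, 532000000.0, 0.0, 21000000.0],
    },
    index=["42", "365", "404", "808", "BTC"],
)

# 1. Keep only trading currencies, then drop the now-redundant column.
trading = raw[raw["IsTrading"]].drop(columns="IsTrading")

# 2. Drop every row that contains a null value.
complete = trading.dropna()

# 3. Keep only currencies with coins mined (assumed filter column).
crypto_df = complete[complete["TotalCoinsMined"] > 0]

# 4. Store the names separately so they cannot influence the clustering.
names_df = crypto_df[["CoinName"]]
crypto_df = crypto_df.drop(columns="CoinName")

print(crypto_df.index.tolist())
```

In this toy example, 365Coin is dropped for its missing value and 808 for having no coins mined.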

	CoinName
42	42 Coin
404	404Coin
1337	EliteCoin
BTC	Bitcoin
ETH	Ethereum

	Algorithm	ProofType	TotalCoinsMined	TotalCoinSupply
42	Scrypt	PoW/PoS	4.199995e+01	42
404	Scrypt	PoW/PoS	1.055185e+09	532000000
1337	X13	PoW/PoS	2.927942e+10	314159265359
BTC	SHA-256	PoW	1.792718e+07	21000000
ETH	Ethash	PoW	1.076842e+08	0
LTC	Scrypt	PoW	6.303924e+07	84000000
DASH	X11	PoW/PoS	9.031294e+06	22000000
XMR	CryptoNight-V7	PoW	1.720114e+07	0
ETC	Ethash	PoW	1.133597e+08	210000000
ZEC	Equihash	PoW	7.383056e+06	21000000

To address these columns, dummy variables were created for each algorithm and proof type, encoding each category as a binary column with the pandas function 'get_dummies'. Subsequently, each column was standardized to a mean of 0 and a standard deviation of 1 using StandardScaler from the scikit-learn library.

	TotalCoinsMined	TotalCoinSupply	...	ProofType_PoW/PoS
42	4.199995e+01	42	...	1
404	1.055185e+09	532000000	...	1
1337	2.927942e+10	314159265359	...	1
BTC	1.792718e+07	21000000	...	0
ETH	1.076842e+08	0	...	0

5 rows × 100 columns

from sklearn.preprocessing import StandardScaler

# Standardize the data with StandardScaler().
X = StandardScaler().fit_transform(X)
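As a rough end-to-end illustration of the encoding and scaling steps, the sketch below runs `get_dummies` and `StandardScaler` on five rows taken from the tables above. The names `crypto_df` and `X_scaled` are assumptions for this example, and the column count here (8) is far smaller than the 100 columns of the full data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Five rows copied from the preprocessed table above.
crypto_df = pd.DataFrame(
    {
        "Algorithm": ["Scrypt", "Scrypt", "X13", "SHA-256", "Ethash"],
        "ProofType": ["PoW/PoS", "PoW/PoS", "PoW/PoS", "PoW", "PoW"],
        "TotalCoinsMined": [4.199995e01, 1.055185e09, 2.927942e10, 1.792718e07, 1.076842e08],
        "TotalCoinSupply": [42.0, 532000000.0, 314159265359.0, 21000000.0, 0.0],
    },
    index=["42", "404", "1337", "BTC", "ETH"],
)

# One binary column per distinct algorithm and proof type.
X = pd.get_dummies(crypto_df, columns=["Algorithm", "ProofType"], dtype=int)

# Standardize every column to mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)

print(X.columns.tolist())
print(np.round(X_scaled.mean(axis=0), 6))
print(np.round(X_scaled.std(axis=0), 6))
```

Note that the standardized values are not confined to the range 0 to 1; only the per-column mean and standard deviation are fixed.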

Deliverable 2: Reducing Data Dimensions Using PCA

Due to the use of dummies, the data expanded to 100 columns, far too many dimensions to present visually. Using PCA() from scikit-learn, the number of dimensions was reduced to 3 so that the data can be presented in a 3-dimensional graph. The components were named PC 1, PC 2, and PC 3, and the dataframe maintains the indices established in the previous dataframe.
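The reduction step might look something like the sketch below. Random numbers stand in for the scaled feature matrix, so the output values are meaningless; the point is only the shape of the calls. The names `X_scaled`, `index`, and `pcs_df` are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Random stand-in for the scaled feature matrix (20 coins, 10 features).
rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(20, 10))
index = [f"COIN{i}" for i in range(20)]  # hypothetical ticker codes

# Project the features onto the first three principal components.
pca = PCA(n_components=3)
pcs = pca.fit_transform(X_scaled)

# Keep the ticker codes as the index so rows still map back to coins.
pcs_df = pd.DataFrame(pcs, columns=["PC 1", "PC 2", "PC 3"], index=index)

print(pcs_df.head())
# Fraction of the total variance retained by the three components.
print(pca.explained_variance_ratio_.sum())
```

Checking `explained_variance_ratio_` is a quick way to see how much information is lost when going down to three dimensions.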

	PC 1	PC 2	PC 3
42	-0.299625	1.093503	-0.495499
404	-0.282859	1.093897	-0.495606
1337	2.317407	1.718744	-0.532707
BTC	-0.152806	-1.298636	0.123549
ETH	-0.162274	-2.025853	0.333898
LTC	-0.133639	-1.091527	-0.035369
DASH	-0.411299	1.238712	-0.421072
XMR	-0.156051	-2.275822	0.285527
ETC	-0.160707	-2.025936	0.333897
ZEC	-0.150078	-2.175185	0.269426

Deliverable 3: Clustering Cryptocurrencies Using K-Means

Finding the Best Value for 'k' Using the Elbow Curve

To determine the optimal number of clusters for the K-Means Clustering application, an elbow curve was created comparing the inertia with the number of clusters. There is a clear bend at 4 clusters with diminishing returns after, so the analyses were performed with 4 clusters.
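One common way to build the data behind an elbow curve is to fit K-Means for a range of k values and record the inertia of each fit. The sketch below does this on synthetic points with four well-separated groups; the data, the range of 1 to 10, and the names `pcs_df` and `elbow_df` are assumptions for illustration, not the notebook's actual values.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic 3-D points scattered around four centers (stand-in for the PCA output).
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [5, 5, 0], [0, 5, 5], [5, 0, 5]])
points = np.vstack([c + rng.normal(scale=0.3, size=(25, 3)) for c in centers])
pcs_df = pd.DataFrame(points, columns=["PC 1", "PC 2", "PC 3"])

# Fit K-Means for each candidate k and record the inertia
# (sum of squared distances from each point to its cluster center).
k_values = list(range(1, 11))
inertia = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(pcs_df)
    inertia.append(model.inertia_)

elbow_df = pd.DataFrame({"k": k_values, "inertia": inertia})
print(elbow_df)
```

Plotting `inertia` against `k` with any plotting library gives the elbow curve; the bend marks the point after which adding clusters yields little further reduction in inertia.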

from sklearn.cluster import KMeans

# Initialize the K-Means model.
model = KMeans(n_clusters=4, random_state=0)

# Fit the model
model.fit(pcs_df)

# Predict clusters
predictions = model.predict(pcs_df)

The outcome of the KMeans analysis with 4 clusters provided an array with clusters labeled 0, 1, 2, and 3.

[0 0 0 3 3 3 0 3 3 3 0 3 0 0 3 0 3 3 0 0 3 3 3 3 3 0 3 3 3 0 3 0 3 3 0 0 3 3 3 3 3 3 0 0 3 3 3 3 3 0 0 3 0 3 3 3 3 0 3 3 0 3 0 0 0 3 3 3 0 0 0 0 0 3 3 3 0 0 3 0 3 0 0 3 3 3 3 0 0 3 0 3 3 0 0 3 0 0 3 3 0 0 3 0 0 3 0 3 0 3 0 3 0 0 3 3 0 3 3 3 0 3 3 3 3 3 0 0 3 3 3 0 3 0 3 3 0 3 0 3 0 0 3 3 0 3 3 0 0 3 0 3 0 0 0 3 3 3 3 0 0 0 0 0 3 3 0 0 0 0 0 3 0 0 0 0 0 3 0 3 0 0 3 0 3 0 0 3 0 3 0 3 0 3 0 0 0 0 3 0 0 0 0 0 3 3 0 0 3 3 0 0 0 0 0 3 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 3 3 3 0 0 0 0 3 0 3 0 0 3 0 3 3 0 3 3 0 3 0 0 0 3 0 0 3 0 0 0 0 0 0 0 3 0 3 0 0 0 0 3 0 3 0 3 3 3 3 0 3 0 0 3 0 3 3 3 0 3 0 3 3 3 0 3 0 3 0 0 1 3 0 3 3 3 3 3 0 0 3 0 0 0 3 0 3 0 3 0 3 0 0 0 0 3 0 0 3 0 0 0 3 3 3 3 0 0 0 0 3 0 3 3 3 0 0 3 3 0 0 3 0 3 3 3 0 3 3 0 0 0 3 3 3 0 0 0 3 3 0 3 3 3 3 0 1 1 3 3 3 0 1 0 0 0 0 3 3 3 3 0 0 0 3 0 3 0 0 0 0 3 0 0 3 0 0 3 3 0 3 0 3 3 3 3 0 0 3 0 3 0 0 0 0 0 0 3 3 3 0 0 0 0 0 0 3 0 3 3 3 3 0 0 0 0 3 0 0 3 0 0 3 1 3 0 3 3 0 0 3 0 3 3 3 3 3 0 3 0 3 0 0 3 0 0 0 0 0 3 3 3 0 0 0 3 0 3 0 3 0 0 0 0 3 0 0 0 3 0 3 0 3 0 0 0 3 3 0 0 0 0 0 0 1 3 0 3 0 3 0 0 1 0 2 0 0 0 3 3 0]

These data were concatenated into a dataframe along with the PCA data and the original cryptocurrency dataframe with the algorithm, proof type, and coin names.
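A combination like this can be done with `pd.concat` along the column axis, provided all pieces share the same index. The sketch below uses three rows from the tables in this document; whether the notebook used `concat` or a join is not stated, so treat the approach and the variable names as assumptions.

```python
import pandas as pd

index = ["42", "BTC", "ETH"]

# Original features (values copied from the tables above).
crypto_df = pd.DataFrame(
    {
        "Algorithm": ["Scrypt", "SHA-256", "Ethash"],
        "ProofType": ["PoW/PoS", "PoW", "PoW"],
        "TotalCoinsMined": [4.199995e01, 1.792718e07, 1.076842e08],
        "TotalCoinSupply": [42.0, 21000000.0, 0.0],
    },
    index=index,
)

# Principal components for the same coins.
pcs_df = pd.DataFrame(
    {
        "PC 1": [-0.299625, -0.152806, -0.162274],
        "PC 2": [1.093503, -1.298636, -2.025853],
        "PC 3": [-0.495499, 0.123549, 0.333898],
    },
    index=index,
)

# Coin names that were set aside during preprocessing.
names_df = pd.DataFrame({"CoinName": ["42 Coin", "Bitcoin", "Ethereum"]}, index=index)

# Cluster labels in the same row order as pcs_df.
predictions = [0, 3, 3]

# Align the three frames on their shared index, then attach the labels.
clustered_df = pd.concat([crypto_df, pcs_df, names_df], axis=1)
clustered_df["Class"] = predictions

print(clustered_df)
```

Because the labels are assigned positionally, this only works if the prediction array is in the same row order as the dataframe it is attached to.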

	Algorithm	ProofType	TotalCoinsMined	TotalCoinSupply	PC 1	PC 2	PC 3	CoinName	Class
42	Scrypt	PoW/PoS	4.199995e+01	42	-0.299625	1.093503	-0.495499	42 Coin	0
404	Scrypt	PoW/PoS	1.055185e+09	532000000	-0.282859	1.093897	-0.495606	404Coin	0
1337	X13	PoW/PoS	2.927942e+10	314159265359	2.317407	1.718744	-0.532707	EliteCoin	0
BTC	SHA-256	PoW	1.792718e+07	21000000	-0.152806	-1.298636	0.123549	Bitcoin	3
ETH	Ethash	PoW	1.076842e+08	0	-0.162274	-2.025853	0.333898	Ethereum	3
LTC	Scrypt	PoW	6.303924e+07	84000000	-0.133639	-1.091527	-0.035369	Litecoin	3
DASH	X11	PoW/PoS	9.031294e+06	22000000	-0.411299	1.238712	-0.421072	Dash	0
XMR	CryptoNight-V7	PoW	1.720114e+07	0	-0.156051	-2.275822	0.285527	Monero	3
ETC	Ethash	PoW	1.133597e+08	210000000	-0.160707	-2.025936	0.333897	Ethereum Classic	3
ZEC	Equihash	PoW	7.383056e+06	21000000	-0.150078	-2.175185	0.269426	ZCash	3

Deliverable 4: Visualizing Cryptocurrencies Results

3D-Scatter with Clusters

With the reduced dimensions from the PCA application, the results can be visualized in a 3-D plot. The plot was created using Plotly Express to show the unique points with the name, algorithm used, and principal component data. The plot is interactive allowing the user to hover over the individual points, zoom, and rotate.

The most striking feature of this graph is the placement of the points in Classes 1 and 2. (Note: the class numbers change each time the application is run; the classes here refer to the figure above.)

BitTorrent stands in a class of its own (Class 2). It had very low influence from PC 3 and a higher influence from PC 2 than Groups 0 and 1, but an extremely high influence from PC 1, something none of the other groups shared.
Groups 0 and 3 are tightly associated, both having a low impact from PC 1 and PC 3.
Group 0 is more positive than Group 3 in PC 2, although it does contain some negative PC 2 values.
Group 1 is a loosely associated group whose members share a non-zero influence from PC 3 and are otherwise greater than 0 in PC 2.
Since Group 2 is such an outlier from the rest of the classes, it would be beneficial to reanalyze these data without BitTorrent to see whether the other groups remain in their respective classes.

Since it is difficult to filter for or find individual currencies in the 3-D plot, such as the ones mentioned in the Overview, hvplot was used to create an interactive table that can be sorted by each of the original parameters.

From here the user can inspect the cryptocurrencies organized by any of the individual parameters.

# Print the total number of tradable currencies in the Clustered_df Dataframe
print(f"There are {len(clustered_df)} tradable currencies in the dataframe")

There are 533 tradable currencies in the dataframe

The final analysis considers the total coins mined and the total coin supply of the different classes. A low number of coins mined may suggest that the mining algorithm is complex, driving the supply down accordingly; conversely, a high number of coins mined may accompany a high supply if coins are readily mined and quickly become available. These factors may affect the economics of each currency, which can give an investment firm an indicator of which currencies are likely to see an increase or decrease in value over time. Because the values differ by up to 12 orders of magnitude, the MinMaxScaler function was used to bring them between 0 and 1.

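from sklearn.preprocessing import MinMaxScaler

# Select the columns to scale
totals = clustered_df[['TotalCoinSupply', 'TotalCoinsMined']]
scaled_totals = MinMaxScaler(feature_range=(0, 1)).fit_transform(totals)

# Assumed step (not shown in the original snippet): copy the clustered data and overwrite the totals with their scaled values
clustered_scaled_df = clustered_df.copy()
clustered_scaled_df[['TotalCoinSupply', 'TotalCoinsMined']] = scaled_totals
clustered_scaled_df = clustered_scaled_df[["TotalCoinSupply", "TotalCoinsMined", "CoinName", "Class"]]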

	TotalCoinSupply	TotalCoinsMined	CoinName	Class
42	4.200000e-11	0.005942	42 Coin	0
404	5.320000e-04	0.007002	404Coin	0
1337	3.141593e-01	0.035342	EliteCoin	0
BTC	2.100000e-05	0.005960	Bitcoin	3
ETH	0.000000e+00	0.006050	Ethereum	3
LTC	8.400000e-05	0.006006	Litecoin	3
DASH	2.200000e-05	0.005951	Dash	0
XMR	0.000000e+00	0.005960	Monero	3
ETC	2.100000e-04	0.006056	Ethereum Classic	3
ZEC	2.100000e-05	0.005950	ZCash	3

# Alternatively, the above steps can be done...
# Create a new dataframe that has the scaled data
plot_df = pd.DataFrame(data=scaled_totals, columns=["TotalCoinSupply", "TotalCoinsMined"])

# Add the indexing from the original df
plot_df.index = clustered_df.index  

# Add the coin name and add the class from the original 
plot_df[["CoinName", "Class"]] = clustered_df[["CoinName", "Class"]]

These data were plotted in a 2D scatter plot. Once again Class 2, the BitTorrent currency, stands out, having the highest supply and the highest number of coins mined. Class 1 also contains a single currency, TurtleCoin, that matches the maximum total coin supply, but its number of coins mined is only a fraction of BitTorrent's.

# Create a hvplot.scatter plot using x="TotalCoinsMined" and y="TotalCoinSupply".
plot_df.hvplot.scatter(x="TotalCoinsMined"
    , y="TotalCoinSupply"
    , by="Class"
    , hover_cols="Class"
    , width=800
    , height=500
    , s=150
    , alpha = 0.8
    , selection_alpha=0.1
    , line_color='black'
    , line_alpha = 0.5
    , title="K-Means Scatter of Cryptocurrencies, Total Coins Mined vs. Total Coin Supply"
    )

To finally address the question posed at the beginning regarding someone interested in a particular subset of cryptocurrencies by name, a function was created to do just this, requiring only the input of a list with those coin names.

def investor(investments, df=clustered_df):
    # Filter the dataframe passed in as df (not the global) to the requested coin names
    results = df[df['CoinName'].isin(investments)]
    results = results[["CoinName", "Algorithm", "ProofType", "Class"]]
    return results
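A call to this function might look like the following sketch, which rebuilds a tiny dataframe from rows shown in the result tables of this document and passes it in explicitly. The dataframe contents are limited to those rows, and the variable name `sci_fi` is an assumption for this example.

```python
import pandas as pd

# Small stand-in for clustered_df, using rows from the result tables in this document.
clustered_df = pd.DataFrame(
    {
        "CoinName": ["42 Coin", "Unobtainium", "Klingon Empire Darsek", "PotCoin"],
        "Algorithm": ["Scrypt", "SHA-256", "Scrypt", "Scrypt"],
        "ProofType": ["PoW/PoS", "PoW", "PoW/PoS", "PoW/PoS"],
        "Class": [0, 1, 0, 0],
    },
    index=["42", "UNO", "KED", "POT"],
)


def investor(investments, df):
    """Return the name, algorithm, proof type, and class of the requested coins."""
    results = df[df["CoinName"].isin(investments)]
    return results[["CoinName", "Algorithm", "ProofType", "Class"]]


# Look up the three sci-fi themed coins from the Overview.
sci_fi = investor(["42 Coin", "Unobtainium", "Klingon Empire Darsek"], clustered_df)
print(sci_fi)
```

Names must match the `CoinName` column exactly; a name that is spelled differently or missing from the data is silently left out of the result.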

	CoinName	Algorithm	ProofType	Class
EMC2	Einsteinium	Scrypt	PoW	1
ACM	Actinium	Lyra2Z	PoW	1
LIT	Lithium	Blake	PoW	1
RADS	Radium	PoS	PoS	0

	CoinName	Algorithm	ProofType	Class
42	42 Coin	Scrypt	PoW/PoS	0
UNO	Unobtainium	SHA-256	PoW	1
KED	Klingon Empire Darsek	Scrypt	PoW/PoS	0

	CoinName	Algorithm	ProofType	Class
CCN	CannaCoin	Scrypt	PoW	1
POT	PotCoin	Scrypt	PoW/PoS	0
STV	Sativa Coin	X13	PoW/PoS	0
DOPE	DopeCoin	Scrypt	PoW	1
GNJ	GanjaCoin V2	X14	PoW/PoS	0
XCI	Cannabis Industry Coin	CryptoNight	PoW	1

To answer our scientist investor, while the elements are spread across the periodic table, all but one of the listed cryptocurrencies fall into group 1. Our science fiction enthusiast may be interested to know that one of the three, Unobtainium, is in class 1 while the others are in class 0. However, for our cannabis entrepreneur trying to diversify across classes, the 6 cannabis-themed coins fall into only two of the groups. This is expected, since group 2 contained only one cryptocurrency and group 3 was much smaller than the others, but the function gives users the ability to choose and compare individual currencies, provided they know the coin names.
