Now that we have our cleaned Reddit comments, we need a way to derive sentiment from them and then quantify it. Whatever method we choose has to solve three problems:
- Our data is not prelabelled, and we have no target metric to predict (such as the star rating on a review)
- We don't have a predefined sentiment dictionary/lexicon of positive and negative words
- We need domain-specific sentiment words for Bitcoin as well as for finance. Financial sentiment is very different from everyday sentiment.
To best meet all of these needs, I chose VADER (Valence Aware Dictionary and sEntiment Reasoner), a "lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media" that is available in Python.
You can download VADER from its GitHub repository here, or install it with conda/pip.
While VADER does have a comprehensive lexicon, we may want to add sentiment words that are specific to our domain. To do so, go to the folder where Python is installed and navigate to Python_File_Location>Lib>site-packages>vaderSentiment. This folder holds the package's source files. In it you will find a text file called "vader_lexicon.txt", which contains the VADER lexicon: every sentiment word along with its weight. Copy this file into your R working directory. You can also download the VADER lexicon text file here.
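If you are not sure where Python keeps the package, you can ask Python itself. The snippet below is a minimal sketch that assumes the package is importable under the name vaderSentiment and that the lexicon sits directly inside the package folder, as described above. The helper name find_vader_lexicon is my own and is not part of VADER.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_vader_lexicon() -> Optional[Path]:
    """Return the path to vader_lexicon.txt, or None if it cannot be found.

    Hypothetical helper: it only looks inside the installed vaderSentiment
    package folder and does not import the package.
    """
    try:
        spec = importlib.util.find_spec("vaderSentiment")
    except (ImportError, ValueError):
        return None
    if spec is None or not spec.submodule_search_locations:
        return None
    for location in spec.submodule_search_locations:
        candidate = Path(location) / "vader_lexicon.txt"
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    path = find_vader_lexicon()
    print(path if path else "vaderSentiment does not appear to be installed here")
```

Run it with the same Python environment you installed VADER into. A different environment will report a different folder, or none at all.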
While you could build sentiment dictionaries yourself, many have already been created and published online. In our context, we are interested in financial sentiment. Here you can find the Loughran-McDonald financial sentiment dictionary. Using this resource, I compiled the financial sentiment lexicon into a formatted Excel file named Loughran-McDonald Master Dictionary.xlsx, which you can download.
From that file, I created a CSV called additional_sentiment.csv with one column of positive words and one column of negative words from the Loughran-McDonald dictionary, edited at my discretion. To this CSV, I also added domain-specific sentiment words that reflect the Bitcoin community on Reddit.
If you create your own additional sentiment dictionary, be wary of duplicate words. Conditional formatting in Excel is a useful tool for spotting them, which is why I recommend that you first create and edit the dictionary in Excel and then export it to a format like CSV.
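If you would rather check for duplicates after exporting, a few lines of Python can do the same job. This is a rough sketch, not part of my original workflow. It assumes the CSV has the two columns named negative and positive, and the sample words are made up for illustration.

```python
import csv
import io
from collections import Counter


def find_duplicates(csv_text: str) -> dict:
    """Map each repeated word to the columns it appears in.

    Words are compared case-insensitively, so "Crash" and "crash" count
    as the same word. Empty cells are ignored.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = []  # (word, column) pairs in file order
    for row in reader:
        for column in ("negative", "positive"):
            word = (row.get(column) or "").strip().lower()
            if word:
                seen.append((word, column))
    counts = Counter(word for word, _ in seen)
    duplicates = {}
    for word, column in seen:
        if counts[word] > 1:
            duplicates.setdefault(word, []).append(column)
    return duplicates


# Illustrative data only: "crash" is repeated within the negative column,
# and "scam" appears in both columns.
sample = "negative,positive\ncrash,moon\nscam,hodl\ncrash,\nfud,scam\n"
print(find_duplicates(sample))
```

A word listed in both columns is the case to watch most closely, since it would be added to the lexicon once with a positive weight and once with a negative one.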
In this step, you are going to add the words from additional_sentiment.csv to the VADER lexicon text file. To do so, run the following in R:
# import VADER lexicon txt file
library(readr)
vader <- read_delim("vader_lexicon.txt", "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
# format lexicon: keep the four columns and give them descriptive names
vader <- vader[ , 1:4]
colnames(vader) <- c("word", "mean", "sd", "raw")
# import additional sentiment data (readr is already loaded above)
additional_sentiment <- read_csv("additional_sentiment.csv")
negative <- additional_sentiment$negative
positive <- additional_sentiment$positive
# remove NA padding (the shorter column is padded with NA; here positive is shorter than negative)
negative <- negative[!is.na(negative)]
positive <- positive[!is.na(positive)]
# to lowercase
negative <- tolower(negative)
positive <- tolower(positive)
# words in common with VADER
neg_similar <- vader[vader$word %in% negative, ]
pos_similar <- vader[vader$word %in% positive, ]
# words not in VADER to add (unique() drops any repeats within a column)
neg_add <- unique(negative[!negative %in% vader$word])
pos_add <- unique(positive[!positive %in% vader$word])
# define weight of added negative/positive words (arbitrarily chosen)
neg_weight <- -1
pos_weight <- 1
# create neg df (one row per new negative word)
neg_df <- data.frame(
  word = neg_add,
  mean = rep(neg_weight, length(neg_add)),
  sd = rep(0.5, length(neg_add)),
  raw = rep("[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1]", length(neg_add)),
  stringsAsFactors = FALSE
)
# create pos df (one row per new positive word)
pos_df <- data.frame(
  word = pos_add,
  mean = rep(pos_weight, length(pos_add)),
  sd = rep(0.5, length(pos_add)),
  raw = rep("[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]", length(pos_add)),
  stringsAsFactors = FALSE
)
# append pos/neg df with VADER
additional_sentiment <- rbind(neg_df, pos_df)
darth_vader <- rbind(vader, additional_sentiment)
# export data
write.table(darth_vader, file = "vader_lexicon.txt", row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t", fileEncoding = "UTF-8")
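Before handing the merged file back to Python, it is worth a quick sanity check. The sketch below is written in Python rather than R, and it assumes the four tab-separated columns used in the script above (word, mean, sd, raw). It flags rows with the wrong number of fields, non-numeric weights (for example an NA that slipped through), and repeated tokens. The function name check_lexicon and the sample rows are illustrative.

```python
from collections import Counter


def check_lexicon(lines):
    """Return a list of human-readable problems found in lexicon lines."""
    problems = []
    tokens = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 4:
            problems.append(
                f"line {number}: expected 4 tab-separated fields, got {len(fields)}"
            )
            continue
        token, mean, sd, _raw = fields
        try:
            float(mean)
            float(sd)
        except ValueError:
            problems.append(f"line {number}: non-numeric mean or sd for {token!r}")
        tokens.append(token)
    for token, count in Counter(tokens).items():
        if count > 1:
            problems.append(f"token {token!r} appears {count} times")
    return problems


# Made-up rows: a repeated token, an NA weight, and a row with too few fields.
sample = [
    "hodl\t1\t0.5\t[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]\n",
    "hodl\t1\t0.5\t[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]\n",
    "fud\tNA\t0.5\t[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1]\n",
    "scam\t-1\n",
]
for problem in check_lexicon(sample):
    print(problem)
```

To check the real file, pass an open file handle instead of the sample list, for example open("vader_lexicon.txt", encoding="utf-8").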
The new VADER lexicon that I created is available here for download.
Go back to the same folder as before, Python_File_Location>Lib>site-packages>vaderSentiment, and replace the old lexicon with our newly created one. The two files have the same name, so make sure not to get them mixed up.
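Because the two files share a name, one way to avoid mixing them up is to keep a copy of the original before overwriting it. Here is a small Python sketch of that idea. The function name install_lexicon and the backup file name vader_lexicon.original.txt are my own choices, and you need to supply the package folder yourself.

```python
import shutil
from pathlib import Path


def install_lexicon(new_lexicon: Path, package_dir: Path) -> Path:
    """Back up the packaged lexicon, then copy the new one over it.

    Returns the path of the backup file. An existing backup is never
    overwritten, so the original lexicon survives repeated runs.
    """
    target = package_dir / "vader_lexicon.txt"
    backup = package_dir / "vader_lexicon.original.txt"  # hypothetical name
    if not target.is_file():
        raise FileNotFoundError(f"no lexicon found at {target}")
    if not backup.exists():
        shutil.copy2(target, backup)
    shutil.copy2(new_lexicon, target)
    return backup
```

Keep the new lexicon somewhere outside the package folder and pass its path as new_lexicon. To undo the change, copy the backup file back over vader_lexicon.txt.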