Jack Kelly edited this page May 2, 2023 · 2 revisions

Predicting the Stock Market with Reddit Posts

Predicting “meme” stock prices from posts from r/WallStreetBets

Introduction/Motivation

One common misconception about the stock market is that it belongs to the big boys on Wall Street, who sit in three-piece suits in shiny office buildings in the financial district of New York City. So when retail traders successfully short-squeezed GameStop (ticker symbol: GME), a failing brick-and-mortar video game retailer, in January 2021, people cheered. Through the power of the internet, retail traders from the Reddit forum r/WallStreetBets drove the price of GME from a closing price of $19.94 on Jan 11th to $347.51 on Jan 27th.

“GME-Stock”

While investment firms and hedge funds lost a total of $10 billion shorting GME, many retail traders cashed in hundreds and thousands of dollars. The leader of the short squeeze, Keith Gill (Reddit username “DeepFuckingValue”), increased the value of his position from $53,000 to $48 million at the height of the frenzy. GME was not the only stock targeted in short squeeze attempts. Movie theater chain AMC (AMC), Bed Bath and Beyond (BBBY), and even silver were all targeted by short squeezers on r/WallStreetBets.

Coordinating a targeted short squeeze like this through social media was a first in recent history. We wanted to see whether retail traders on r/WallStreetBets actually influenced the price of GameStop or whether outside influences were at work. Can everyday people like you and me take control of the mystical stock market? To answer this question, our team took the following approach:

  1. Download GME stock price data

  2. Scrape GME-related posts from r/WallStreetBets

  3. Run a sentiment analysis on those posts

  4. Compare daily sentiment with stock price movement

  5. Find the most important features that influence stock price

  6. If there was a correlation, use ML to predict future prices based on Reddit posts

Data “Scraping”

Grabbing the stock data was the easiest part. We just downloaded a CSV file from Yahoo Finance for the dates we needed. The CSV file gave us opening and closing prices, average daily price, and trade volume for each day. There was no need to access the Yahoo API because the CSV file was well formatted. Why do more work if we didn’t need to?
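As a rough sketch of what loading such a file can look like, the snippet below parses a price CSV with Python's standard library. It assumes the column layout of a Yahoo Finance historical-data export (Date, Open, High, Low, Close, Adj Close, Volume). That layout has no average-price column, so the sketch derives one as the midpoint of the daily high and low, which is our assumption and not necessarily how the original project computed it. The function name `load_prices` and the sample rows are placeholders.

```python
import csv
import io

def load_prices(handle):
    """Parse a price CSV into a list of dicts with numeric fields."""
    rows = []
    for row in csv.DictReader(handle):
        high, low = float(row["High"]), float(row["Low"])
        rows.append({
            "date": row["Date"],
            "open": float(row["Open"]),
            "close": float(row["Close"]),
            "volume": int(row["Volume"]),
            # Assumption: "average daily price" approximated as the high/low midpoint
            "avg_price": (high + low) / 2,
        })
    return rows

# Placeholder rows for illustration only, not real GME prices
sample_csv = io.StringIO(
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2021-01-04,10.0,12.0,9.0,11.0,11.0,1000\n"
    "2021-01-05,11.0,13.0,10.0,12.5,12.5,2000\n"
)
prices = load_prices(sample_csv)
```

With a downloaded file, the same function would take an open file handle instead of the in-memory sample, for example `open("GME.csv", newline="")`, where the filename is hypothetical.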

The next step was to grab posts from r/WallStreetBets to analyze. We found the Pushshift API on GitHub, which would let us query Reddit posts by parameters like time, size, and attribute. However, we later found a data set on Kaggle that had already been processed, so we simply downloaded it as a CSV. This data set had posts from r/WallStreetBets going all the way back to 2012. In the end, we did not have to scrape any data ourselves, which saved us some time.

Preprocessing

Before we could get to all the juicy stuff, we had to preprocess all of the data and run our sentiment analysis. To score sentiment for each day, we first aggregated all of the post content from that day into one string. That way we could get one sentiment score per day to compare to price movements in the stock.
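The daily aggregation step can be sketched with the standard library alone. The field names `date` and `body` and the sample posts are placeholders for illustration, not the actual columns of the Kaggle dataset.

```python
from collections import defaultdict

def aggregate_posts_by_day(posts):
    """Join the text of all posts that share a date into one string per day."""
    texts_by_day = defaultdict(list)
    for post in posts:
        texts_by_day[post["date"]].append(post["body"])
    # Sort by date so the output order is stable
    return {day: " ".join(texts) for day, texts in sorted(texts_by_day.items())}

# Made-up posts for illustration only
sample_posts = [
    {"date": "2021-01-26", "body": "GME to the moon"},
    {"date": "2021-01-27", "body": "holding"},
    {"date": "2021-01-26", "body": "diamond hands"},
]
daily_text = aggregate_posts_by_day(sample_posts)
```

Each value in `daily_text` is then a single string that can be handed to the sentiment analyzer once per day.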

To run the sentiment analysis we used the VADER sentiment analyzer. This was a natural choice for this project because VADER can already handle emojis and slang terms. However, we had to update VADER’s “score” for some emojis. VADER’s lexicon assigns each word a value between -4 and 4 that shows how negative or positive it is. For example, the word “hate” has a clearly negative score while the word “love” has a clearly positive one. For emojis, VADER converts them to words first and then analyzes the words. If you run the fire emoji through VADER, it converts the emoji to the word “fire” and then assigns a score that is more negative than positive. This is a problem for our purposes because r/WallStreetBets uses many emojis and slang terms whose meaning in the community differs from VADER’s default values.

To update the values of common emojis used in this Reddit community, we looked up VADER’s emoji translations. For example, the diamond emoji was “gem stone” and the rocket emoji was “rocket”. Then we could manually update the scores of emojis commonly found in the community before running all of our posts through the sentiment analyzer.

    def __init__(self, name):
        """The constructor of the SqueezeNet class"""
        # Sentiment overrides, keyed by VADER's text translation of each emoji
        self.custom_emoji_scores = {
            'rocket': 4.0,
            'gem stone': 4.0,
            'raising hands': 3.0,
            'bull': 3.5,
            'bear': -3.5,
            'toilet paper': -4,
            'fire': 3.0,
            'green': 2,
            'red': -2
        }

Another problem we ran into was timing. The stock market is closed on weekends and federal holidays, so we had to decide which trading day each post would impact. Conveniently, each post in the Kaggle dataset was marked with an epoch timestamp. After writing a function that converts epoch time into a readable date (UTC), we decided that posts from weekends and holidays should count toward the next market date, and that posts made after market close (2100 UTC) should also count toward the next market date.
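A minimal sketch of that rule follows. The holiday set is a hypothetical two-entry example and the function names are ours. The fixed 21:00 UTC cutoff comes from the text, and it is a simplification: 4 p.m. Eastern corresponds to 21:00 UTC only outside daylight saving time.

```python
from datetime import date, datetime, time, timedelta, timezone

MARKET_CLOSE_UTC = time(21, 0)  # cutoff used in the text (2100 UTC)

# Hypothetical holiday list for illustration; a real one would cover every year in the data
MARKET_HOLIDAYS = {date(2021, 1, 1), date(2021, 1, 18)}

def is_market_day(day):
    """A market day is a weekday that is not a listed holiday."""
    return day.weekday() < 5 and day not in MARKET_HOLIDAYS

def market_date_for_post(epoch_seconds):
    """Return the trading date that a post made at `epoch_seconds` is assigned to."""
    posted = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    day = posted.date()
    # Posts at or after the close count toward the following day
    if posted.time() >= MARKET_CLOSE_UTC:
        day += timedelta(days=1)
    # Roll forward past weekends and holidays
    while not is_market_day(day):
        day += timedelta(days=1)
    return day

# A Friday post made after the close
example = datetime(2021, 1, 15, 22, 0, tzinfo=timezone.utc)
print(market_date_for_post(example.timestamp()))  # 2021-01-19: skips the weekend and the Jan 18 holiday
```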

Our merged pandas data frame with all the data

Features

After preprocessing all of the data and aggregating it into one nice pandas data frame, we could look for interesting correlations within the data. Because we have the power of computer science and data science, instead of plotting a bunch of scatter plots to find correlations, we first ran a random forest regression to determine the most important features within the data. We chose a random forest regressor because we are trying to predict a continuous variable, the closing stock price, from our inputs, the Reddit features. Using the model’s feature_importances_ attribute, we found that the number of comments (0.44) and score, which is the aggregate of upvotes and downvotes (0.27), were the most important features in predicting the price of GameStop.
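For readers who have not used it, here is a small sketch of reading feature importances from scikit-learn's `RandomForestRegressor`. The data is synthetic and built so that the first feature dominates. The feature names mirror the ones discussed here, but the numbers have nothing to do with the real results, and the hyperparameters are our own choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data: the target depends mostly on the first feature by construction
feature_names = ["num_comments", "score", "compound"]
X = rng.normal(size=(200, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# feature_importances_ is an attribute populated by fit(); the values sum to 1
ranked = sorted(zip(feature_names, model.feature_importances_), key=lambda pair: -pair[1])
for name, importance in ranked:
    print(f"{name}: {importance:.2f}")
```

Importances rank how much each feature helped the trees split the data. They do not say whether a feature pushes the prediction up or down, which is why it still makes sense to look at the correlations separately.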

Correlations

Before we go into the accuracy of the predictions, let’s take a pause and look at the interesting correlations between the features and stock price. Looking at the histogram of the sentiment score of all posts in the data set, we can see that the majority of them were neutral or fairly neutral.

This suggests that the overall sentiment of these posts would not have much importance in dictating the stock price. This is corroborated by the feature importance graph, which shows that compound sentiment had an importance of only 0.06, and by the lack of correlation in the scatter plot of compound sentiment against daily closing price.

Simply looking at the graph, we can see that there is little correlation between compound sentiment and daily closing price, which is further supported by the correlation coefficient (r) of -0.012. What about positive and negative sentiments? One of our group members had a theory that negative sentiment might affect the stock price more than positive sentiment because people are more prone to panic selling than to euphoric buying.

Again, we see very little correlation between either positive or negative sentiment and the closing price of GME. Both features had an absolute r value of less than 0.1, which indicates a very weak correlation.
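For reference, the r values quoted here are Pearson correlation coefficients. A dependency-free sketch of the calculation is below. The input numbers are made up for illustration and are not taken from our data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: this divides by zero if either sequence is constant
    return cov / math.sqrt(var_x * var_y)

# Made-up numbers for illustration only
sentiment = [0.1, -0.2, 0.4, 0.0, 0.3]
close = [20.0, 19.5, 21.0, 20.2, 20.8]
r = pearson_r(sentiment, close)
```

A value near 0 means no linear relationship, while values near 1 or -1 mean a strong positive or negative one.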

Okay then, what about the most important features like the number of comments and score?

Although these features correlate slightly more with closing price, the correlation is still very weak.

What about time considerations? Frustrated by our inconclusive findings, we considered whether time was a factor. Our dataset from Kaggle goes back to 2012, way before the GameStop rally happened. Any correlation between Reddit activity and GME price would be a new phenomenon, so it would theoretically show up slightly before and during the GME rally of early 2021. So, for the next batch of scatter plots, we only took data from July 2020 to February 2021 (the end of the dataset). We also decided to look at daily price changes instead of closing prices to see if we could spot the trend better.
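The windowing and the switch to daily changes can be sketched as follows. Here a daily price change is taken to be the close minus the previous trading day's close, which is our assumption, since the change could also be measured from open to close. The prices are placeholders.

```python
from datetime import date

def daily_changes(closes):
    """Return (date, change) pairs, where change is the close minus the previous close."""
    return [
        (day, close - prev_close)
        for (_, prev_close), (day, close) in zip(closes, closes[1:])
    ]

def in_window(rows, start, end):
    """Keep only rows whose date falls within [start, end]."""
    return [(day, value) for day, value in rows if start <= day <= end]

# Placeholder closing prices for illustration only
closes = [
    (date(2020, 6, 30), 4.0),
    (date(2020, 7, 1), 4.5),
    (date(2020, 7, 2), 4.25),
]
changes = in_window(daily_changes(closes), date(2020, 7, 1), date(2021, 2, 28))
```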

Looking at the two graphs of the two most important features, score and number of comments, we can see that there is slightly more correlation between the features and daily price changes, but it is still weak. These weak correlations do not support our theory that the level of engagement would affect the price movement of GameStop. In the scatter plots, the points with the biggest price movements actually have lower scores and fewer comments.

Based on our findings, there is clearly little correlation between our features and daily price movement. But what about trading volume? Does increased Reddit activity on r/WallStreetBets lead to more people buying and selling GME?

Once more, we will look at the posts and stock prices from after July 2020. Surprisingly, we have the strongest correlation coefficients so far with compound sentiment (0.5403) and num_comments (-0.3879).

What is interesting is the negative relationship between the number of comments and stock volume. One might assume that if there was more activity on r/WallStreetBets, then there would be more volume on the next trading day. However, the data tells us the exact opposite.

Back to Random Forests

So determining correlation and causation proved to be unsuccessful with our feeble human minds. After training on the data, this is the most accurate decision tree our random forest regressor generated.

How did our random forest do in predicting closing prices? Training and testing on the historical data, the model had an accuracy of 70.96%, a mean squared error (MSE) of 19.1, and a root mean squared error (RMSE) of 4.37.
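For context on these numbers, here is a small sketch of how MSE and RMSE are computed, along with R², which is one common way to express a regressor's quality as a percentage. Whether the accuracy figure above is an R² score is an assumption on our part. The sample values are placeholders. As a sanity check, the reported RMSE of 4.37 matches the square root of the reported MSE of 19.1.

```python
import math

def regression_metrics(actual, predicted):
    """Return (mse, rmse, r2) for paired actual and predicted values."""
    n = len(actual)
    squared_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    mse = squared_error / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    total_ss = sum((a - mean_actual) ** 2 for a in actual)
    # R^2 compares the model's squared error with that of always predicting the mean
    r2 = 1 - squared_error / total_ss
    return mse, rmse, r2

# Placeholder values for illustration only
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 16.0]
mse, rmse, r2 = regression_metrics(actual, predicted)
```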

Taking a look at the prediction plots, we can see that the random forest did a pretty good job of predicting closing prices for GME based on the Reddit features. The random forest was able to predict prices and place them in the correct general “cluster” of points in the scatter plots.

Limitations and Improvements

Obviously, one problem that we encountered was that the data did not give us any conclusive information about the link between Reddit and the stock market. Maybe this is because there is no correlation and the small-fry retail traders cannot compete with the big hedge funds on Wall Street.

If we had more time, we would try to analyze this problem with more data and more features. Because the short squeeze happened in January 2021, it could be that all data before that is irrelevant to our question. It would be helpful to scrape more recent posts all the way up to the present day, May 2023, and compare them with the GME stock price, because there have been several more rallies since the initial one. We could also compare the prices and volumes of other stocks targeted for short squeezes by r/WallStreetBets, like Bed Bath and Beyond (BBBY) and AMC (AMC). Hopefully, by collecting more data, we will be able to find more meaningful correlations between stock volume and price and Reddit features.

Conclusions

  1. There is little correlation between Reddit post features like score, number of comments, and sentiment and the GME stock price.

  2. The strongest correlations were with GME trading volume: compound sentiment (r = 0.5403) and number of comments (r = -0.3879). (This does not mean these features influenced the volume! Correlation does not equal causation.)

  3. Our random forest regressor did a pretty good job of predicting the closing GME price based on the Reddit features despite there being little correlation.

This project was made by me, Jack Kelly, and Kirean O’Dell. If you are interested in replicating the project or taking it a step further, here is the link to the GitHub with all of our code. We hope this project inspires you to take a stab at predicting the stock market! (Trade and invest responsibly. Don’t gamble💎🙌🚀)