The goal of this project was to use Natural Language Processing to analyze posts from two subreddits and train a model to classify posts by which subreddit they came from. The two subreddits, r/Conspiracy and r/TheOnion, were chosen because of the similarity between their title styles and content, and the perceived difficulty of training a model to distinguish between the two.
The first portion of this project involved accessing and scraping data from Reddit. The data was pulled in .json format, and the relevant fields (subreddit, title, and unique ID) were extracted into a pandas DataFrame for easier analysis. After deleting duplicate rows, I explored the impact of lemmatizing the text data. Some exploratory data analysis was then done to determine the structure and composition of the title text, including the identification of frequently occurring terms (both individual words and bi-grams), candidate stop words (words to be excluded from analysis), and words common to both subreddits. Term frequencies were visualized in bar charts to get a sense of the most common terms and their rates of occurrence.

After this exploratory data analysis, I performed a train-test split to divide my data into training and testing sets. I then ran several classification models, using both CountVectorizer and TF-IDF Vectorizer to tokenize the titles and count term frequencies. Using Pipelines and GridSearch to determine optimal parameters for each model, I tested Logistic Regression, K-Nearest Neighbors, Multinomial Naive Bayes, and Random Forest. After numerous iterations tweaking the parameters of each model, my best result came from a combination of TF-IDF Vectorizer and Multinomial Naive Bayes with the parameters: maximum document frequency = 0.31, maximum features = 1400, minimum document frequency = 2, n-gram range = (1, 1), and stop words = None. This model had a training accuracy score of 0.9467 and a testing score of 0.8362. While this is certainly overfit, it was significantly less overfit than my other models, and it correctly classified 216 Onion posts and 177 Conspiracy posts, misclassifying only 77 posts in my testing set.
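The winning configuration above can be sketched as a scikit-learn Pipeline wrapped in GridSearchCV. This is an illustrative reconstruction, not the actual project code: the step names (`tvec`, `nb`) are arbitrary, the grid below contains only the reported best values rather than the full grids that were searched, and `X_train`/`y_train` stand in for the real data.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

# TF-IDF vectorizer feeding a Multinomial Naive Bayes classifier.
pipe = Pipeline([
    ("tvec", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Grid holding the best parameters reported above (the project searched
# wider ranges; only the winning values are shown here).
param_grid = {
    "tvec__max_df": [0.31],        # ignore terms in >31% of titles
    "tvec__max_features": [1400],  # cap the vocabulary size
    "tvec__min_df": [2],           # require a term in at least 2 titles
    "tvec__ngram_range": [(1, 1)], # single words only
    "tvec__stop_words": [None],    # keep all words
}
gs = GridSearchCV(pipe, param_grid=param_grid, cv=5)
# Fitting and scoring would then follow the usual pattern:
# gs.fit(X_train, y_train)
# gs.score(X_train, y_train), gs.score(X_test, y_test)
```

The `step__parameter` naming convention is how GridSearchCV routes grid values to the corresponding pipeline step.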
While my final model was overfit, it still performed much better than my baseline accuracy of 0.5217. There are a number of difficulties in distinguishing between these subreddits that likely made it harder for my model to classify every post accurately. These include the similarity of post titles, the use of real news headlines in some Onion articles, and the frequent occurrence of the names of famous people and politicians in both subreddits. My next steps in this analysis would be to remove proper nouns (names and places) from the data, try other forms of lemmatization or stemming of tokens, and implement boosting algorithms and support vector machines.
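As a first pass at the proper-noun removal mentioned above, one could use a crude capitalization heuristic before moving to a real POS tagger or named-entity recognizer (e.g. spaCy or NLTK). The function below is only a sketch of that idea: it drops capitalized tokens that are not the first word of a title, which will also discard some non-names.

```python
def strip_mid_title_capitals(title: str) -> str:
    """Crude proper-noun filter: remove capitalized tokens except the
    first word of the title (which is capitalized by convention)."""
    tokens = title.split()
    kept = tokens[:1] + [t for t in tokens[1:] if not t[:1].isupper()]
    return " ".join(kept)

# Hypothetical headline, not from the scraped data:
print(strip_mid_title_capitals("Senator claims Elvis is alive in Ohio"))
# → Senator claims is alive in
```

The mangled output shows why this is only a rough baseline: a tagger-based approach would remove "Elvis" and "Ohio" while preserving sentence structure.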