https://www.yelp.ca/dataset/challenge
MySQL 8.0 required
The goal of this project is to train a sentiment classifier using reviews from the Yelp dataset. The classifier can then be applied to unseen sentences or paragraphs, so that it can accurately classify the polarity of a given text and extract subjective information, such as positive or negative emotion.
The Bayesian classifier is a well-known machine learning algorithm for text classification. It returns the class with the maximum posterior probability. In this application, we implement a Naïve Bayes classifier, which assumes that, given the class, words occur independently of one another and that their order does not matter.
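As a sketch of the decision rule described above, the standard Naïve Bayes formulation can be written as follows. The symbols are introduced here for illustration and are not taken from the project code: $c$ ranges over the classes (good review, bad review) and $w_1, \dots, w_n$ are the words of the text.

```latex
\hat{c}
  = \arg\max_{c} \Pr(c \mid w_1, \dots, w_n)
  = \arg\max_{c} \Pr(c) \prod_{i=1}^{n} \Pr(w_i \mid c)
```

The second equality uses Bayes' rule (the denominator $\Pr(w_1, \dots, w_n)$ is the same for every class and can be dropped) together with the conditional independence assumption.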
The Review sentences.
The Yelp reviews are labeled by star rating: a 5-star review is marked as a good review, and a review with fewer than 2 stars is regarded as a bad review.
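The labeling rule could be sketched roughly as below. The function name `label_review` is hypothetical, and the handling of ratings in between (returning `None` so they can be skipped) is an assumption, since the text does not say what happens to them.

```python
from typing import Optional


def label_review(stars: int) -> Optional[str]:
    """Map a star rating to a sentiment label, following the rule in the text."""
    if stars == 5:
        return "good"
    if stars < 2:
        return "bad"
    # Assumption: ratings in between are left unlabeled and skipped.
    return None
```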
The data is randomly split into training data and testing data, with 10% held out for testing.
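A minimal sketch of such a split, using only the Python standard library. The function name `split_reviews` and the fixed seed are assumptions made for reproducibility of the example, not details from the project.

```python
import random


def split_reviews(reviews, test_ratio=0.1, seed=0):
    """Shuffle the reviews and hold out a fraction of them for testing."""
    rng = random.Random(seed)  # seeded so the split is repeatable
    shuffled = list(reviews)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    # First n_test items become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]
```

For example, `train, test = split_reviews(all_reviews)` would keep about 90% of `all_reviews` for training.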
In this implementation, Laplace smoothing is applied to the word counts so that a word unseen for a class, whether it appears in the training or the testing data, does not produce a zero probability. Because floating-point precision is limited, we transform the conditional probabilities to log space, so that products of many tiny numbers become sums of logarithms and do not underflow.
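The two ideas above (add-one smoothing and log-space scoring) can be sketched as follows. This is an illustration under stated assumptions rather than the project's actual code: the function names, the tokenized input format, and the smoothing constant `alpha=1.0` are all hypothetical.

```python
import math
from collections import Counter


def train(docs):
    """Count words per class. docs is an iterable of (tokens, label) pairs."""
    word_counts = {}
    class_counts = Counter()
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def log_posterior(tokens, label, word_counts, class_counts, vocab, alpha=1.0):
    """Unnormalized log posterior: log prior plus a sum of smoothed log likelihoods."""
    score = math.log(class_counts[label] / sum(class_counts.values()))
    # Laplace smoothing: add alpha to every count, alpha * |V| to the total.
    denom = sum(word_counts[label].values()) + alpha * len(vocab)
    for tok in tokens:
        # A word never seen with this class has count 0 but still gets alpha / denom.
        score += math.log((word_counts[label][tok] + alpha) / denom)
    return score


def classify(tokens, word_counts, class_counts, vocab):
    """Return the class with the highest log posterior."""
    return max(
        class_counts,
        key=lambda c: log_posterior(tokens, c, word_counts, class_counts, vocab),
    )
```

Summing logarithms instead of multiplying probabilities keeps the score in a comfortable floating-point range even for long reviews, and taking the maximum is unaffected because the logarithm is monotonic.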
I also made up some fake reviews and let the classifier label them:
In general, the words in a sentence are correlated. For instance, suppose Pr(“fantastic” | C = good review) = 0.7 and Pr(“delicious” | C = good review) = 0.8. Then Pr(“fantastic”, “delicious” | C = good review) does not necessarily equal 0.8 × 0.7 = 0.56; it is usually higher, since Pr(“fantastic” | C = good review, “delicious”) is usually greater than Pr(“fantastic” | C = good review). In theory, this might degrade performance to some extent. In practice, for binary classification, the algorithm turns out to perform well in terms of both accuracy and time efficiency.
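To make the example above concrete, the chain rule gives the exact joint probability. The value 0.9 for the conditional probability below is purely hypothetical, chosen only to illustrate the effect of correlation.

```latex
\Pr(\text{fantastic}, \text{delicious} \mid C)
  = \Pr(\text{delicious} \mid C)\,\Pr(\text{fantastic} \mid C, \text{delicious})
  = 0.8 \times 0.9 = 0.72 > 0.56
```

The Naïve Bayes estimate of 0.56 would be exact only if $\Pr(\text{fantastic} \mid C, \text{delicious}) = \Pr(\text{fantastic} \mid C) = 0.7$, that is, if the two words were conditionally independent given the class.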