Go to https://covid-vaccine-sentiment.herokuapp.com/ to checkout our web app.
- Detecting people’s opinion towards vaccine
- Identify overall customer ratings for various vaccines from different providers
- Investigate potential social problems about vaccine
- Create developer accounts for TwitterAPI
- Investigate JSON response and filter out relevant attributes for classification
- Use python Tweepy to connect the API
- Scrape data related to Covid19 Vaccine, Pfizer, Moderna, Astrazeneca, Sinovac
- Combine all sources of data
- Target: pos, neu, neg
- Evaluate if the tweet can be classified, if no, put ‘del’ in the target column
- Decision-making:
- How to deal with questions?
- Do we treat subjective and objective tweets differently?
- How to deal with tweets with several sentences and compound sentiments?
- How to deal with modality, eg. should, could have been etc.?
- Tweets with political standpoint might have personal sentiment tendency?
- How to deal with wish or command?
- How to deal with Sarcasm?
- Criteria regarding to annotation:
- Questions will depend, determine based on the tone.
- Usually subjective tweets have a strong sentiment and objective tweets have a neutral sentiment such as news headlines or ads. Classify subjective ones as pos or neg, except they are obviously neutral and objective ones as neutral unless they are obviously pos or neg.
- If tweets have more than two negative or positive words, tweets should be positive or negative. If it has a tie, classify as neutral.
- Modality will depend, but cases like could have been are usually negative
- Do not pick standpoints, stand in the middle point to classify the sentiments.
- Wishes and hope are classified as positive. Command usually is negative
- Sarcasm should be classified as negative.
- If we have to read more than two times to determine, put ‘del’ or ‘neu’
- Remove duplicates tweets based on text
- Extract
location
fromuser
column, discardcoordination
column - Extract
hashtags
fromentities
column - Split location into the format of city, province(two capitalized letters), county
- Preprocessing the
full_text
column- Exclude URLs
- Exclude mention sign
- Exclude HTML reference characters
- To keep
entities
column clean, convert it to the number of hashtags - Delete the untidy columns such as
entities
,users
,location
- Missing values analysis
- Univariate
- Numeric features frequencies (bar chart)
- Bivariate
- Relationship between predictors against target
- Multivariate
- Relationship among predictors(collinearity)
- WordCloud
- Subjectivity / Objectivity
- Number of first-person pronouns
- Number of second-person pronouns
- Number of third-person pronouns
- Part of speech
- Number of past-tense verbs
- Number of future-tense verbs
- Number of common nouns
- Number of proper nouns
- Number of adverbs
- Number of wh- words
- Number of coordinating conjunctions
- Corpus statistics
- Number of commas
- Number of multi-character punctuation tokens
- Average length of sentences, in tokens
- Average length of tokens, excluding punctuation-only tokens, in characters
- Number of sentences
- Number of words in uppercase(>= 3 letters long)
- Number of slang acronyms
- Number of hashtags
- Sentiment words and phrases:
- Positive words ratio per tweet
- Negative words ratio per tweet
- Other features
- Location(city, province, country)
- Retweet count
- Favourite count
- Further cleaning of text for machine learning algorithm:
- Exclude punctuations and digits
- Convert emojis to text description
- Lowercase all words
- Lemmatization
- Statistic approach:
- Bag of Words(hyperparameter tunning)
- TF-IDF(How relevant a word is associated to a doc)
- Neural network:
- Word embeddings
-Linear Models - Logistic Regression(Hyperparameter Tuning)
- Non-linear Models
- Naive Bayes
- SVM
- Decision Tree
- Ensembles
- Tree-based models
- Randomforest
- LGBM
- XGboost
- CatBoost
- Tree-based models
- Averaging
- Stacking
- Baseline: RNN
- More complex sequence models: GRU, (bidirectional)LSTM
- Integrated GloVe Embedding to improve performance
- Hyperparameters: Dropout Layer, Bidirectional layers, Loss functions(SGD, Adam), learning rate, regularization strength, Hidden size
- Unsupervised learning:
- Kmeans
- SVD + Topic modeling
- Gensim
- Combine corpus stats features with deep learning model(support configurations)
- Flask
- Heroku
- Co-training models(agreement on predicted labels)
- Self-training(set a threshold of the probability for predicted labels, marked as silver data and combined with golden data to retrain agian)
- Data augmentation(replace words with synsets from wordNet)
Data collection: python3 tweet_scraper.py -k <keyword to scrape> --num <max number to scrape> -o <file path of the output file>
Data cleaning: python3 tweet_washer.py -f <input file path> -o <output file path>
- Wiktor Jakubczyc, Verb Form Dictionary, https://github.com/monolithpl/verb.forms.dictionary
- Marcin, Positive word lexicon, https://gist.github.com/mkulakowski2/4289437
- Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA,
- Marcin, Negative word lexicon, https://gist.github.com/mkulakowski2/4289441
- Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA,
- Liang Wu, Fred Morstatter, Huan Liu, SlangSD, http://liangwu.me/slangsd/