Sivasurya Santhanam edited this page Oct 25, 2018 · 3 revisions

Study of word vectors

To study the word2vec and doc2vec models, emotion classification from text is used as the use case. The six basic emotions (happy, sad, anger, fear, surprise and disgust) are considered, together with an emotionless (none) class. Word2vec and doc2vec are used to obtain feature vectors from the sentences. The code is written in Python, which has good modules for NLP and ML; gensim is used for the word vector models.

Datasets used

Generating high-quality feature vectors with word2vec and doc2vec requires large datasets. Two main datasets are used in this work:

  • Twitter dataset

  • Wikipedia dataset

The Twitter dataset was collected by streaming German tweets for more than three months (November 2015 to February 2016). It is publicly available here. It contains 8 folders, each with approximately 2.5 million sentences. Concatenate all the files into a single file before use.

The Wikipedia dataset consists of selected files from Wortschatz and is available here. It contains 18 folders, each with 1 million sentences.

Training and test data for emotion classification are generated from the raw Twitter dataset by automatic labeling based on hashtags.

Preprocessing the raw data and generation of vector models:

The raw data contains stop words, numbers, obscene words and unwanted characters. These are removed during preprocessing, and the vectors are generated from the processed data.

[Figure: preprocessing]

vector_model_generation.py walks through the steps of this process. It uses the processing functions in dataprocessing.py for preprocessing and modeltrainer.py for generating the vector models. The generated vector models are saved with the .model extension.
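The cleaning step described above can be sketched roughly as follows. This is a minimal illustration, not the code in dataprocessing.py: the function name `preprocess`, the tiny `STOP_WORDS` set and the choice to drop URLs and @mentions are all assumptions made for the example. An obscene-word list could be passed in the same way as the stop words.

```python
import re

# Hypothetical, tiny stop-word list for illustration only; the list used by
# the project is not shown on this page.
STOP_WORDS = {"der", "die", "das", "und", "ist", "ich", "ein", "eine"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Keep lower-case letters (including German umlauts and ß), '#' and whitespace.
NON_LETTER_RE = re.compile(r"[^a-zäöüß#\s]")


def preprocess(sentence, stop_words=STOP_WORDS):
    """Lower-case a sentence and drop URLs, mentions, numbers,
    punctuation and stop words. Returns a list of tokens."""
    text = sentence.lower()
    text = URL_RE.sub(" ", text)         # assumption: URLs are unwanted
    text = MENTION_RE.sub(" ", text)     # assumption: @mentions are unwanted
    text = NON_LETTER_RE.sub(" ", text)  # digits and other characters
    return [tok for tok in text.split() if tok not in stop_words]


tokens = preprocess("Ich bin 100% glücklich!!! https://t.co/abc @user #freude")
```

Hashtags are kept here because the later labeling step depends on them; whether the project strips them before training the vector models is not stated.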

Word cloud representation of hashtags:

Word clouds are used to visualise the hashtags present in the Twitter data. The script hashtag_cloud.py outputs a word cloud from randomly sampled hashtags, which are extracted using extracthashtags.py.

[Figure: word cloud]
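A word cloud is drawn from word frequencies, so the part that precedes the plotting can be sketched with the standard library alone. The function name `sample_hashtag_counts` and the regular expression are assumptions for illustration; they are not taken from extracthashtags.py or hashtag_cloud.py.

```python
import random
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")


def sample_hashtag_counts(tweets, sample_size, seed=0):
    """Extract all hashtags, draw a random sample of them and
    return the frequency of each sampled hashtag."""
    hashtags = [tag.lower() for tweet in tweets
                for tag in HASHTAG_RE.findall(tweet)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(hashtags, min(sample_size, len(hashtags)))
    return Counter(sample)


tweets = ["So ein schöner Tag #Freude #sonne", "#freude pur", "kein Hashtag"]
counts = sample_hashtag_counts(tweets, sample_size=100)
```

The resulting counts are the kind of frequency table a word cloud library would take as input.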

Analogy based evaluation for vector models:

To test the quality of the feature vectors generated above, they are evaluated on analogy-based questions. The script analogy_test.py evaluates the model on odd-one-out questions using doesnt_match_eval.txt, semantic analogies using semantic_eval.txt and opposite semantic analogies using opposite_eval.txt. Relations among selected words can also be visualised with word_relations.py.

[Figures: countries and capitals; odd one out]
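The two kinds of question can be illustrated on toy vectors with plain Python. This is a sketch of the general idea (vector arithmetic plus cosine similarity), not the code in analogy_test.py; the 3-dimensional vectors in `TOY` are invented for the example, and a trained model would supply real ones.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the word closest to
    vec(b) - vec(a) + vec(c), excluding the three question words."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def odd_one_out(vectors, words):
    """Return the word least similar to the mean vector of the group."""
    dim = len(vectors[words[0]])
    mean = [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]
    return min(words, key=lambda w: cosine(vectors[w], mean))


# Invented toy vectors, for illustration only.
TOY = {
    "deutschland": [1.0, 0.0, 0.0],
    "berlin": [1.0, 1.0, 0.0],
    "frankreich": [0.0, 0.0, 1.0],
    "paris": [0.0, 1.0, 1.0],
    "apfel": [-1.0, -1.0, -1.0],
}

capital = analogy(TOY, "deutschland", "berlin", "frankreich")
outlier = odd_one_out(TOY, ["berlin", "paris", "apfel"])
```

An evaluation script would run such questions from the evaluation files and report the fraction answered correctly.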

Generate training and test dataset for the classification:

The Twitter dataset is labeled automatically, based on the hashtags in the tweets, by extractlabeledtweets.py. From the tweets and hashtags belonging to each emotion category, generate_training_data.py generates the training and test data. The root emotions from which the derived emotions are generated are listed in emotionslist.py.

[Figure: automatic classification]
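Hashtag-based labeling can be sketched as a lookup from hashtags to emotion categories. The mapping `EMOTION_HASHTAGS` below is hypothetical (the real root and derived emotions live in emotionslist.py), and two choices are assumptions of this sketch rather than documented behaviour: tweets matching zero or several emotions are skipped, and the matching hashtags are removed from the text so the label does not leak into the features.

```python
import re

# Hypothetical mapping from emotion label to German hashtags.
EMOTION_HASHTAGS = {
    "happy": {"freude", "glücklich"},
    "sad": {"traurig"},
    "anger": {"wut", "wütend"},
    "fear": {"angst"},
    "surprise": {"überraschung"},
    "disgust": {"ekel"},
}

HASHTAG_RE = re.compile(r"#(\w+)")


def label_tweet(tweet, emotion_hashtags=EMOTION_HASHTAGS):
    """Return (label, text) if the tweet's hashtags point to exactly one
    emotion, otherwise None."""
    tags = {tag.lower() for tag in HASHTAG_RE.findall(tweet)}
    matches = [emotion for emotion, words in emotion_hashtags.items()
               if tags & words]
    if len(matches) != 1:
        return None  # no emotion hashtag, or an ambiguous tweet
    label = matches[0]
    words = emotion_hashtags[label]
    # Drop only the hashtags that produced the label; keep the others.
    text = HASHTAG_RE.sub(
        lambda m: "" if m.group(1).lower() in words else m.group(0), tweet)
    return label, " ".join(text.split())


labeled = label_tweet("Endlich Ferien #Freude #urlaub")
```

How the emotionless (none) class is assigned is not described on this page, so the sketch leaves it out.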

Data visualisation of vectors using PCA and TSNE:

Feature vectors for sentences are computed using both word2vec and doc2vec models. For word2vec, the CBOW and SG (skip-gram) models are used; for doc2vec, the DBOW and DM models. With document vectors, each sentence is represented directly by a single feature vector; with word vectors, all the word vectors in a sentence are summed to form a single feature vector. To visualise them, the vectors are randomly sampled and plotted using PCA and t-SNE with tsneplots.py.

[Figures: CBOW model using PCA; CBOW model using TSNE]
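The summation of word vectors into one sentence vector can be sketched as below. The function name `sentence_vector` and the plain dictionary standing in for a trained word2vec model are assumptions for illustration; skipping words that have no vector is also an assumption, since the page does not say how out-of-vocabulary words are handled.

```python
def sentence_vector(tokens, word_vectors, dim):
    """Sum the vectors of all known tokens into a single feature vector."""
    total = [0.0] * dim
    for tok in tokens:
        vec = word_vectors.get(tok)
        if vec is None:
            continue  # assumption: out-of-vocabulary words are ignored
        for i, x in enumerate(vec):
            total[i] += x
    return total


# Invented 2-dimensional vectors standing in for a trained model.
word_vectors = {"guten": [1.0, 2.0], "morgen": [0.5, -1.0]}
features = sentence_vector(["guten", "morgen", "xyz"], word_vectors, dim=2)
```

A sentence with no known words yields the zero vector, which a classifier pipeline may want to filter out.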

Evaluation of vector models on the datasets:

The vector models are evaluated for emotion classification on both the Twitter and Wikipedia datasets, with feature vectors generated by the word2vec variants (CBOW, SG) and the doc2vec variants (DM, DBOW). The resulting 8 models (4 variants on each of the 2 datasets) are compared using ROC curves. Because the dataset is biased towards some classes, evaluation_unbiased_dataset.py applies clustering-based sampling to test on an unbiased dataset. Testing with an 80-20 train-test split is recommended over the provided annotated test file evaluation_sents.csv.

[Figure: ROC curves for the Twitter and Wikipedia models, 80-20 split]
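An 80-20 train-test split can be sketched with the standard library. The function name `stratified_split` is hypothetical, and splitting per class (so that each emotion keeps roughly the same share in both parts) is an assumption of this sketch; the page does not say how the project draws its split, and this is not the clustering-based sampling used in evaluation_unbiased_dataset.py.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_ratio=0.2, seed=0):
    """Split (sample, label) pairs into train and test lists,
    taking test_ratio of every class for the test set."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_test = round(len(items) * test_ratio)
        test.extend((s, label) for s in items[:n_test])
        train.extend((s, label) for s in items[n_test:])
    return train, test


samples = list(range(10))
labels = ["happy"] * 5 + ["sad"] * 5
train, test = stratified_split(samples, labels)
```

With heavily imbalanced classes, a per-class split at least guarantees that rare emotions appear in the test set whenever they have enough examples.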