This project was implemented as part of the "Natural Language Processing and Deep Learning" course during my master's degree. In this project I created two word embedding models, Word2Vec-SkipGram and GloVe, using the ArWiki Dump 2018 dataset. The SkipGram model is improved by tuning the vector size and window values, and each model is then evaluated in 4 stages, with the results visualized using t-SNE. The project report provides further information about the experiments and the analysis and discussion of the results.
Step 1.1: reading the corpus
Parse the compressed Arabic wiki articles (.bz2 format) using the Gensim utility WikiCorpus,
then make sure the encoding is set to UTF-8, since Arabic text requires a Unicode encoding rather than a Latin-based one.
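A minimal sketch of this step; the dump file name below is illustrative (the actual ArWiki 2018 file name may differ):

```python
from gensim.corpora import WikiCorpus

# Illustrative file name; the actual ArWiki 2018 dump name may differ.
dump_path = "arwiki-20180120-pages-articles.xml.bz2"

# WikiCorpus streams and tokenizes the compressed dump; passing an empty
# dictionary skips building a vocabulary, which we do not need here.
wiki = WikiCorpus(dump_path, dictionary={})

with open("corpus.txt", "w", encoding="utf-8") as out:
    for tokens in wiki.get_texts():          # each article as a list of tokens
        out.write(" ".join(tokens) + "\n")   # one article per line, UTF-8
```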
Step 1.2: remove unwanted characters from the scanned articles
- Non-Arabic characters (mainly English, in upper and lower case)
- Digits [0-9]
- Extra spaces
- Tashkeel and tatweel (Arabic diacritics and elongation)
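A minimal sketch of the cleaning step, assuming pyarabic's araby helpers (strip_tashkeel, strip_tatweel) and a simple regex for the non-Arabic characters and digits:

```python
import re
from pyarabic import araby

def clean_text(text):
    # Drop non-Arabic characters: Latin letters (upper and lower case) and digits.
    text = re.sub(r"[A-Za-z0-9]+", " ", text)
    # Strip Tashkeel (diacritics) and Tatweel (elongation) with pyarabic.
    text = araby.strip_tashkeel(text)
    text = araby.strip_tatweel(text)
    # Collapse extra spaces.
    return re.sub(r"\s+", " ", text).strip()
```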
The output of the preprocessing is stored, so this step of generating the cleaned data is only executed once. Note that corpus_cleaned.txt is omitted from the repository as it exceeds GitHub's file size limit.
gensim.models.Word2Vec(sentences, vector_size, window, sg, workers)
- List of used training parameters:
  - sentences: corpus of the dataset
  - vector_size (int, optional): Dimensionality of the word vectors.
  - window (int, optional): Maximum distance between the current and predicted word within a sentence.
  - sg ({0, 1}, optional): Training algorithm: 1 for skip-gram; otherwise CBOW.
  - workers (int, optional): Use these many worker threads to train the model (=faster training with multicore machines).
SkipGram
List of tuned parameters (see the training sketch below):
- vector_size_list = [500, 1000]
- window_list = [10, 15, 20]
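A minimal sketch of the SkipGram tuning loop, assuming the cleaned corpus is stored one article per line in corpus_cleaned.txt:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("corpus_cleaned.txt")  # streams one article per line

for vector_size in [500, 1000]:
    for window in [10, 15, 20]:
        # sg=1 selects the skip-gram training algorithm (Gensim 4 syntax).
        model = Word2Vec(sentences, vector_size=vector_size,
                         window=window, sg=1, workers=4)
        model.save(f"skipgram_vs{vector_size}_w{window}.model")
```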
GloVe
List of tuned parameters (see the sketch below):
- learning_rate = [0.01, 0.05]
- window_list = [10, 15, 20]
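A minimal sketch using the glove-python-binary API (Corpus/Glove); the epoch count and component size here are illustrative assumptions:

```python
from glove import Corpus, Glove

# `sentences` is the cleaned corpus as a list of token lists.
corpus = Corpus()
corpus.fit(sentences, window=10)                 # build the co-occurrence matrix

glove = Glove(no_components=500, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)          # attach the word -> id mapping
glove.save("glove_w10_lr0.05.model")
```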
Using the Gensim and GloVe libraries in Python, the Arabic word-embedding models are trained and saved. It is worth noting that the GloVe library only worked on Colab with older versions of Python (3.7 and lower), as the library implementation targets those versions of Python.
Test 1 : Most Similar Words
Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively. link
- Pick 8 Arabic words and, for each one, ask each model for the 10 most similar words. Plot the results using t-SNE (or a scatterplot) and discuss them.
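A minimal sketch of one query and its t-SNE plot, assuming `model` is one of the trained Gensim models and the query word is illustrative; arabic_reshaper and python-bidi (both listed in the requirements below) are used so the Arabic labels render correctly:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import arabic_reshaper
from bidi.algorithm import get_display

word = "كتاب"  # illustrative query word ("book")
neighbours = model.wv.most_similar(word, topn=10)

labels = [word] + [w for w, _ in neighbours]
vectors = np.array([model.wv[w] for w in labels])

# Project the high-dimensional vectors down to 2-D for plotting.
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), label in zip(coords, labels):
    # Reshape and reorder each label so Arabic displays properly in matplotlib.
    plt.annotate(get_display(arabic_reshaper.reshape(label)), (x, y))
plt.show()
```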
Test 2: Odd-One-Out
- We ask our model to give us the word that does not belong to the list (doc)
- Pick 5-10 triplets of Arabic words and, for each one, ask each model to pick the word in the triplet that does not belong. Discuss the results.
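A minimal sketch with Gensim's doesnt_match; the triplet is illustrative:

```python
# Ask the model which word does not belong to the triplet.
odd = model.wv.doesnt_match(["تفاحة", "موز", "سيارة"])  # apple, banana, car
print(odd)  # expected: "سيارة" (car)
```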
Test 3: Measuring Sentence Similarity
Find sentences that are similar to each other by computing the cosine similarity of their two embedding vectors, as in Paul Minogue's blog.
- Write 5 sentences in Arabic. For each sentence, pick 2-3 words and replace them with their synonyms or antonyms. Use your embeddings to compute the similarity between each sentence and its modified version. Discuss the results.
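A minimal sketch of this test, assuming the common approach of averaging word vectors to build a sentence vector before taking the cosine similarity:

```python
import numpy as np

def sentence_vector(sentence, kv):
    # Average the vectors of in-vocabulary tokens (a simple sentence embedding).
    vectors = [kv[w] for w in sentence.split() if w in kv]
    return np.mean(vectors, axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# `sentence` and `modified` hold an Arabic sentence and its edited version.
v1 = sentence_vector(sentence, model.wv)
v2 = sentence_vector(modified, model.wv)
print(cosine_similarity(v1, v2))
```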
Test 4: Analogy
- Syntax in link
- Pick 5-10 cases of analogies in Arabic, like the one we used in class:
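For instance, a minimal sketch of the classic king - man + woman ≈ queen analogy in Arabic (the specific words here are illustrative, not necessarily the in-class example):

```python
# ملك (king) - رجل (man) + امرأة (woman) should land near ملكة (queen).
result = model.wv.most_similar(positive=["ملك", "امرأة"], negative=["رجل"], topn=1)
print(result)
```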
- glove-python-binary
- arabic_reshaper
- python-bidi
- pyarabic.araby
- gensim
- matplotlib.pyplot
- seaborn
- sklearn.manifold.TSNE
All references and resources used in each step are documented in the .ipynb file in markdown.
- KDnuggets
- Machine Learning Mastery (uses Gensim 3 syntax)
- Word2Vec official documentation: page 1, page 2
- Updated Gensim syntax is used, as Gensim version 4 is used in the experiments
- Gensim documentation
- Kaggle gensim-word2vec-tutorial