3. Preprocessing the Corpus
akoksal edited this page May 1, 2018 · 2 revisions
To train a word2vec model with the gensim library, the corpus must be a plain-text file with one document per line and no punctuation. The output file should therefore contain every article, each on its own line. Gensim provides methods for this preprocessing step, but here the tokenize function is modified for the Turkish language. Run preprocess.py to convert your Wikipedia dump. It takes two arguments: the first is the path to the Wikipedia dump (still compressed, not extracted) and the second is the path to the output file. For example:
python3 preprocess.py trwiki-20180101-pages-articles.xml.bz2 wiki.tr.txt
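As a rough illustration of what "modified for Turkish" can mean, here is a minimal standard-library sketch. It is not the contents of preprocess.py. The helper names `turkish_lower`, `tokenize_tr`, and `write_corpus` are hypothetical, and the assumption is that the main Turkish-specific concern is case folding of the dotted and dotless I (`İ` to `i`, `I` to `ı`), which Python's default `str.lower()` does not handle the Turkish way.

```python
import re

# Hypothetical helpers for illustration; the real preprocess.py may differ.

# Runs of Unicode letters only: digits, underscores and punctuation are dropped.
_TOKEN_RE = re.compile(r"[^\W\d_]+")


def turkish_lower(text):
    """Lowercase text using Turkish rules for the dotted and dotless I."""
    # Map the two capital I letters by hand before the generic lower(),
    # since str.lower() would turn "I" into "i" rather than "ı".
    return text.replace("İ", "i").replace("I", "ı").lower()


def tokenize_tr(text):
    """Return the lowercased word tokens of text, without punctuation."""
    return _TOKEN_RE.findall(turkish_lower(text))


def write_corpus(articles, output_path):
    """Write each article as one space-separated line; return lines written."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for article in articles:
            tokens = tokenize_tr(article)
            if tokens:  # skip articles that contain no word tokens
                out.write(" ".join(tokens) + "\n")
                count += 1
    return count
```

In this sketch `articles` is any iterable of article strings. In practice the texts would come from the dump itself, for instance through gensim's `WikiCorpus`, with a tokenizer along these lines swapped in for the default one.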
Previous: 2. Getting the Corpus
Next: 4. Training Word2Vec Model