3. Preprocessing the Corpus

To train a word2vec model with the gensim library, the corpus must be a plain-text file with one document per line and no punctuation. The output file should therefore contain every article, each on its own line. The gensim library provides methods for this preprocessing step; here, however, the tokenize function is modified for the Turkish language. Run preprocess.py to convert your Wikipedia dump into this format. It takes two arguments: the first is the path to the Wikipedia dump (still compressed, without extracting it), and the second is the path to the output file. For example:

python3 preprocess.py trwiki-20180101-pages-articles.xml.bz2 wiki.tr.txt
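
For readers who want to see what such a script might look like, here is a minimal sketch. It is not the actual contents of preprocess.py. The names `turkish_lower`, `tokenize_tr` and `preprocess` are hypothetical, and the sketch assumes a gensim version whose `WikiCorpus` accepts a `tokenizer_func` argument. The Turkish-specific part is lowercasing: Python's default `str.lower()` turns "I" into "i", whereas Turkish expects "I" to become "ı" and "İ" to become "i".

```python
import re

# Runs of Unicode letters; digits, underscores and punctuation act as separators.
_WORD = re.compile(r"[^\W\d_]+", re.UNICODE)


def turkish_lower(text):
    """Lowercase text using the Turkish dotted/dotless I rules."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def tokenize_tr(text, token_min_len=2, token_max_len=15, lower=True):
    """Split text into letter-only tokens, dropping punctuation and numbers.

    The parameter list mirrors what gensim passes to a custom tokenizer_func.
    """
    if lower:
        text = turkish_lower(text)
    return [t for t in _WORD.findall(text)
            if token_min_len <= len(t) <= token_max_len]


def preprocess(dump_path, output_path):
    """Write one article per line, tokens separated by single spaces."""
    # Imported here so the tokenizer above can be used without gensim installed.
    from gensim.corpora import WikiCorpus

    # Assumption: passing dictionary={} skips the extra vocabulary-building pass.
    wiki = WikiCorpus(dump_path, dictionary={}, tokenizer_func=tokenize_tr)
    with open(output_path, "w", encoding="utf-8") as out:
        for tokens in wiki.get_texts():
            out.write(" ".join(tokens) + "\n")
```

Under these assumptions, calling `preprocess("trwiki-20180101-pages-articles.xml.bz2", "wiki.tr.txt")` would produce the same kind of output file as the command above.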

Previous: 2. Getting the Corpus
Next: 4. Training Word2Vec Model
