
Extract bag of words from Wiki dump and save to .txt.gz files


We use gensim to extract the texts from the Wiki dump. The first time we run the following lines, it goes through the entire .bz2 file and compiles the dictionary, which can take hours.

import gensim.corpora.wikicorpus

# First pass: WikiCorpus scans the whole dump and compiles the
# dictionary itself, since none is passed in.
wiki = gensim.corpora.WikiCorpus('enwiki-20170420-pages-articles.xml.bz2', processes=8)

# Save the compiled dictionary so later runs can skip the scan.
wiki.dictionary.save_as_text('wiki_dictionary.txt', sort_by_word=False)

Once the dictionary is saved, constructing the WikiCorpus object a second time is nearly instantaneous, because loading the saved dictionary lets gensim skip the full scan of the dump.

import gensim.corpora.dictionary
import gensim.corpora.wikicorpus

# Load the dictionary compiled on the first pass.
wiki_dictionary = gensim.corpora.Dictionary.load_from_text('wiki_dictionary.txt')

# With a pre-built dictionary supplied, construction returns quickly.
wiki = gensim.corpora.WikiCorpus('enwiki-20170420-pages-articles.xml.bz2', processes=2, dictionary=wiki_dictionary)
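
With the corpus in hand, wiki.get_texts() streams one tokenized document at a time. As a minimal sketch of the dump format described below (this is not the project's dump_bow.py itself; the output name sample_dump.txt.gz and the 1,000-document cap are illustrative assumptions):

import gzip
import itertools

# Minimal sketch: write each document as one line of space-separated
# lowercase tokens, the same format dump_bow.py produces.
with gzip.open('sample_dump.txt.gz', 'wt', encoding='utf-8') as fout:
    for tokens in itertools.islice(wiki.get_texts(), 1000):
        # Some gensim versions yield bytes tokens; decode defensively.
        words = [t.decode('utf-8') if isinstance(t, bytes) else t for t in tokens]
        fout.write(' '.join(words) + '\n')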

We can use corpuses/dump_bow.py to dump the bag-of-words of each document into a series of .txt.gz files. Each line of the uncompressed text represents one document, containing only lowercase words separated by spaces; punctuation is removed. An example invocation follows the usage text below.

usage: dump_bow.py [-h] [-j JOBS] [-p PARTITION_SIZE] [-l LIMIT]
                   [-o OUTPUT_PREFIX]
                   wikidump dictionary

Dump bag-of-words in .txt.gz files

positional arguments:
  wikidump              xxx-pages-articles.xml.bz2 wiki dump file
  dictionary            gensim dictionary .txt file

optional arguments:
  -h, --help            show this help message and exit
  -j JOBS, --jobs JOBS  Number of parallel jobs, default: 2
  -p PARTITION_SIZE, --partition-size PARTITION_SIZE
                        Number of documents in each .txt.gz file
  -l LIMIT, --limit LIMIT
                        Total number of documents to dump, or all documents
                        when not specified
  -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX
                        Prefix of dump .txt.gz files, default: dump
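
As a sketch only: the invocation below uses the flags documented above with arbitrary values, and the read-back loop assumes the first partition is named sample-0.txt.gz; the actual partition naming is determined by dump_bow.py and may differ.

# Illustrative invocation (all values arbitrary):
#   python corpuses/dump_bow.py -j 4 -p 1000 -l 10000 -o sample \
#       enwiki-20170420-pages-articles.xml.bz2 wiki_dictionary.txt

import gzip

# Read one partition back; 'sample-0.txt.gz' is a guess at the naming
# scheme, adjust to whatever dump_bow.py actually writes.
with gzip.open('sample-0.txt.gz', 'rt', encoding='utf-8') as fin:
    for i, line in enumerate(fin):
        tokens = line.split()  # one document per line, lowercase words
        print(i, len(tokens))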