Here are some tips on using the pignlproc tools to mine Wikipedia & DBpedia dumps:
- Download and unzip the latest binary release of Mahout from your closest mirror.
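For example, a minimal sketch of fetching and unpacking a binary release; the mirror URL and the 0.4 version number are placeholders, adjust them to your closest mirror and the current release:

```
# Placeholder mirror URL and version: pick your closest mirror and the
# latest release from the Apache Mahout download page
wget http://your.closest.mirror/mahout/0.4/mahout-distribution-0.4.tar.gz
tar xzf mahout-distribution-0.4.tar.gz
```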
- Add `MAHOUT_HOME/bin` to your path and check that mahout is correctly installed with:

```
$ mahout
no HADOOP_HOME set, running locally
An example program must be given as the first argument.
Valid program names are:
  arff.vector: : Generate Vectors from an ARFF file or directory
  canopy: : Canopy clustering
  [...]
  wikipediaXMLSplitter: : Reads wikipedia data and creates ch
```
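One way to set this up, assuming a bash shell and that the distribution was unpacked in your home directory (both are assumptions, adjust the path to your setup):

```
# Assumes the Mahout distribution was unpacked in $HOME; adjust as needed
export MAHOUT_HOME=$HOME/mahout-distribution-0.4
export PATH=$MAHOUT_HOME/bin:$PATH
```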
- Download the Wikipedia dump; there is no need to uncompress it. Replace "enwiki" with "frwiki" or another language code to select the language you are interested in:

```
$ wget -c http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```
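For instance, the same command for the French Wikipedia dump simply substitutes the language code in both parts of the URL:

```
# Same download with "enwiki" replaced by "frwiki"
wget -c http://download.wikimedia.org/frwiki/latest/frwiki-latest-pages-articles.xml.bz2
```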
- Split the dump into 100MB XML files on the local file system:

```
$ mahout wikipediaXMLSplitter -d enwiki-latest-pages-articles.xml.bz2 \
    -o wikipedia-xml-chunks -c 100
$ ls -l wikipedia-xml-chunks
-rw-r--r-- 1 ogrisel ogrisel 108387581 2010-12-31 17:17 chunk-0001.xml
-rw-r--r-- 1 ogrisel ogrisel 108414882 2010-12-31 17:18 chunk-0002.xml
-rw-r--r-- 1 ogrisel ogrisel 108221208 2010-12-31 17:18 chunk-0003.xml
-rw-r--r-- 1 ogrisel ogrisel 108059995 2010-12-31 17:18 chunk-0004.xml
[...]
```
You can append `-n 10` to the previous command to extract only the first 10 chunks, for instance (see the example below).
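For reference, this is what the previous command looks like with the chunk limit added (same arguments as above, plus `-n`):

```
# Extract only the first 10 chunks of the dump
$ mahout wikipediaXMLSplitter -d enwiki-latest-pages-articles.xml.bz2 \
    -o wikipedia-xml-chunks -c 100 -n 10
```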
- Alternatively, you can push the split dump directly to your Amazon S3 bucket (this is faster if you execute it from an EC2 instance):

```
$ mahout wikipediaXMLSplitter -d enwiki-latest-pages-articles.xml.bz2 \
    -o s3://mybucket/wikipedia-xml-chunks -c 100 \
    -i <your Amazon S3 ID key> \
    -s <your Amazon S3 secret key>
```
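If you have a standalone S3 client such as s3cmd installed and configured with the same credentials (an assumption, it is not part of Mahout or pignlproc), you can check that the chunks landed in the bucket:

```
# Assumes s3cmd is installed and configured with the same AWS credentials
$ s3cmd ls s3://mybucket/wikipedia-xml-chunks/
```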
TODO