Here are some tips on using the pignlproc tools to mine Wikipedia & DBpedia dumps:
- Download and unzip the latest binary release of Mahout from your closest mirror.
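For example, a minimal sketch of fetching and unpacking a binary release; the mirror URL and the 0.4 version number are placeholders, adjust them to your closest mirror and the current release:

```
# Placeholder mirror URL and version: pick your closest mirror and the
# latest release from the Apache Mahout download page
wget http://your.closest.mirror/mahout/0.4/mahout-distribution-0.4.tar.gz
tar xzf mahout-distribution-0.4.tar.gz
```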
- Add `MAHOUT_HOME/bin` to your path and check that mahout is correctly installed with:

```
$ mahout
no HADOOP_HOME set, running locally
An example program must be given as the first argument.
Valid program names are:
  arff.vector: : Generate Vectors from an ARFF file or directory
  canopy: : Canopy clustering
  [...]
  wikipediaXMLSplitter: : Reads wikipedia data and creates ch
```
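One way to set this up, assuming a bash shell and that the distribution was unpacked in your home directory (both are assumptions, adjust the path to your setup):

```
# Assumes the Mahout distribution was unpacked in $HOME; adjust as needed
export MAHOUT_HOME=$HOME/mahout-distribution-0.4
export PATH=$MAHOUT_HOME/bin:$PATH
```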
- Download the Wikipedia dump; there is no need to uncompress it. Replace "enwiki" with "frwiki" or another language code to select the language you are interested in:

```
$ wget -c http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```
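For instance, the same command for the French Wikipedia dump simply substitutes the language code in both parts of the URL:

```
# Same download with "enwiki" replaced by "frwiki"
wget -c http://download.wikimedia.org/frwiki/latest/frwiki-latest-pages-articles.xml.bz2
```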
- Split the dump into 100MB XML files on the local file system:

```
$ mahout wikipediaXMLSplitter -d enwiki-latest-pages-articles.xml.bz2 \
    -o wikipedia-xml-chunks -c 100
$ ls -l wikipedia-xml-chunks
-rw-r--r-- 1 ogrisel ogrisel 108387581 2010-12-31 17:17 chunk-0001.xml
-rw-r--r-- 1 ogrisel ogrisel 108414882 2010-12-31 17:18 chunk-0002.xml
-rw-r--r-- 1 ogrisel ogrisel 108221208 2010-12-31 17:18 chunk-0003.xml
-rw-r--r-- 1 ogrisel ogrisel 108059995 2010-12-31 17:18 chunk-0004.xml
[...]
```
You can append `-n 10` to the previous command to extract only the first 10 chunks, for instance (see the example below).
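For reference, this is what the previous command looks like with the chunk limit added (same arguments as above, plus `-n`):

```
# Extract only the first 10 chunks of the dump
$ mahout wikipediaXMLSplitter -d enwiki-latest-pages-articles.xml.bz2 \
    -o wikipedia-xml-chunks -c 100 -n 10
```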
- Alternatively, you can push the split dump directly to your Amazon S3 bucket (this is faster if you execute it from an EC2 instance):

```
$ mahout wikipediaXMLSplitter -d enwiki-latest-pages-articles.xml.bz2 \
    -o s3://mybucket/wikipedia-xml-chunks -c 100 \
    -i <your Amazon S3 ID key> \
    -s <your Amazon S3 secret key>
```
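If you have a standalone S3 client such as s3cmd installed and configured with the same credentials (an assumption, it is not part of Mahout or pignlproc), you can check that the chunks landed in the bucket:

```
# Assumes s3cmd is installed and configured with the same AWS credentials
$ s3cmd ls s3://mybucket/wikipedia-xml-chunks/
```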
TODO