ogrisel edited this page Dec 31, 2010 · 12 revisions

pignlproc usage tips

Here are some tips for using the pignlproc tools to mine Wikipedia and DBpedia dumps.

Splitting a Wikipedia dump using Mahout

  • Download and unzip the latest binary release of Mahout from your closest mirror

  • Add MAHOUT_HOME/bin to your PATH and check that mahout is correctly installed with:

    $ mahout
    no HADOOP_HOME set, running locally
    An example program must be given as the first argument.
    Valid program names are:
      arff.vector: : Generate Vectors from an ARFF file or directory
      canopy: : Canopy clustering
      wikipediaXMLSplitter: : Reads wikipedia data and creates ch
  • Download the Wikipedia dump; there is no need to uncompress it. Replace "enwiki" with "frwiki" or another language code to select the language you are interested in:

    $ wget -c
  • Split the dump into 100 MB XML files on the local file system:

    $ mahout wikipediaXMLSplitter -d enwiki-latest-pages-articles.xml.bz2 \
      -o wikipedia-xml-chunks -c 100
    $ ls -l wikipedia-xml-chunks
    -rw-r--r-- 1 ogrisel ogrisel 108387581 2010-12-31 17:17 chunk-0001.xml
    -rw-r--r-- 1 ogrisel ogrisel 108414882 2010-12-31 17:18 chunk-0002.xml
    -rw-r--r-- 1 ogrisel ogrisel 108221208 2010-12-31 17:18 chunk-0003.xml
    -rw-r--r-- 1 ogrisel ogrisel 108059995 2010-12-31 17:18 chunk-0004.xml

For instance, you can append -n 10 to the previous command to extract only the first 10 chunks.
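Once the split has finished, a quick count of the chunk files can confirm that the run produced what you expected (for example, exactly 10 files when -n 10 was used). The following is only a minimal sketch: count_chunks is a hypothetical helper, not part of Mahout or pignlproc, and it assumes the chunk-NNNN.xml naming shown in the listing above and a find that supports -maxdepth (GNU and BSD find both do).

```shell
# Hypothetical helper (not part of Mahout or pignlproc): count the chunk
# files written by the splitter into a local directory.
# Assumes the chunk-NNNN.xml naming shown in the `ls -l` listing.
count_chunks() {
  # $1: directory holding the chunk files
  # tr strips the padding that some wc implementations add around the count
  find "$1" -maxdepth 1 -type f -name 'chunk-*.xml' | wc -l | tr -d ' '
}

# Example usage, once the split has run:
#   count_chunks wikipedia-xml-chunks
```

Only regular files matching chunk-*.xml directly inside the directory are counted, so stray notes or subdirectories do not skew the result.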

  • Alternatively, you can push the split dump directly to your Amazon S3 bucket (this is faster if you run it directly from an EC2 instance):

    $ mahout wikipediaXMLSplitter -d enwiki-latest-pages-articles.xml.bz2 \
      -o s3://mybucket/wikipedia-xml-chunks -c 100 \
      -i <your Amazon S3 ID key> \
      -s <your Amazon S3 secret key>
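To avoid typing the S3 keys into the command itself (and thus into your shell history), the same command can be wrapped in a small function that reads them from the environment. This is a sketch under assumptions: require_s3_credentials and split_to_s3 are hypothetical helpers, and the variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are simply the conventional ones; Mahout itself only sees the -i and -s options shown above.

```shell
# Hypothetical helper: fail early if the S3 credentials are not available.
# Assumption: the keys are exported as AWS_ACCESS_KEY_ID and
# AWS_SECRET_ACCESS_KEY (conventional names, not required by Mahout).
require_s3_credentials() {
  if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be set" >&2
    return 1
  fi
}

# Hypothetical wrapper around the splitter command shown above.
# $1: path to the compressed dump, $2: s3:// output location
split_to_s3() {
  require_s3_credentials || return 1
  mahout wikipediaXMLSplitter -d "$1" -o "$2" -c 100 \
    -i "$AWS_ACCESS_KEY_ID" \
    -s "$AWS_SECRET_ACCESS_KEY"
}

# Example usage:
#   split_to_s3 enwiki-latest-pages-articles.xml.bz2 s3://mybucket/wikipedia-xml-chunks
```

The keys are still passed to mahout as arguments, so they remain visible in the process list while the job runs; the wrapper only keeps them out of scripts and history.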

Running pignlproc scripts on an EC2 Hadoop cluster using Apache Whirr

