
WikiDAT installation guide

This page provides step-by-step instructions to install WikiDAT on your system. Please feel free to suggest any comments or changes to improve this documentation.

Hardware requirements

WikiDAT has been designed to run even on modest hardware. Dump files are processed on the fly, working with data streams as the files are decompressed, so a dump is never loaded into memory in full (see the sketch below). Thus, you mainly need to make sure that you have enough disk space to store the extracted information locally.
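
As an illustration of this streaming approach, here is a minimal sketch (not WikiDAT's actual code) that scans a compressed dump without ever writing the decompressed XML to disk:

    import bz2

    def count_pages(dump_path):
        """Count <page> elements by streaming a .bz2 dump line by line."""
        pages = 0
        # BZ2File decompresses incrementally, so memory use stays constant
        # regardless of the size of the dump file.
        with bz2.BZ2File(dump_path) as dump:
            for line in dump:
                if '<page>' in line:
                    pages += 1
        return pages

    # File name following the usual Wikimedia dump naming scheme.
    print(count_pages('scowiki-latest-pages-meta-history.xml.bz2'))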

CPU and memory requirements

There are no absolute minimum requirements regarding CPU or memory. WikiDAT itself consumes less than 200MB of memory, even with large Wikipedia languages. However, your local database may require more memory (and additional configuration) to import the extracted information quickly (see the example below).
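
For instance (illustrative values only, not an official WikiDAT recommendation), raising the InnoDB buffer pool in your MySQL/MariaDB configuration file (my.cnf) is usually the first tuning step for faster bulk imports:

    [mysqld]
    # Illustrative values: size these to the memory you can actually spare.
    innodb_buffer_pool_size = 4G
    innodb_log_file_size    = 512M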

The following hardware configuration is recommended to ensure a smooth ETL process:

  • Intel Core i7 CPU (or comparable).
  • At least 8GB RAM (DDR3, 1600 MHz); mostly for your local DB.
  • At least one Solid-State Disk (SSD) drive, to speed up data loading.

Disk Storage requirements

Disk space requirements depend on the number and size of the projects you intend to analyze. For instance, to analyze the complete history dump (all revisions for each page) of the English Wikipedia, it is recommended to have at least 200 GB of free disk space (to store the compressed dump files as well as the DB tables). The following table provides a (non-exhaustive) list of recommended disk space to analyze some Wikipedia languages:

Language                Code     Recommended storage space
English Wikipedia       enwiki   ~ 200 GB
German Wikipedia        dewiki   ~ 60 GB
French Wikipedia        frwiki   ~ 50 GB
Spanish Wikipedia       eswiki   ~ 24 GB
Polish Wikipedia        plwiki   ~ 18 GB
Italian Wikipedia       itwiki   ~ 24 GB
Japanese Wikipedia      jawiki   ~ 21 GB
Dutch Wikipedia         nlwiki   ~ 18 GB
Portuguese Wikipedia    ptwiki   ~ 18 GB
Russian Wikipedia       ruwiki   ~ 32 GB

Software dependencies

IMPORTANT: WikiDAT has only been tested with Python 2.7. Python 3 is not supported yet, although plans to address this issue have already been outlined.
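
A quick way to confirm that your interpreter meets this requirement, from a Python console or script:

    import sys
    # WikiDAT is only tested on Python 2.7, so fail fast on anything else.
    assert (2, 7) <= sys.version_info < (3, 0), 'Python 2.7.x is required'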

The following software dependencies must be met before installing WikiDAT on your system:

Data extraction

  • MySQL (>= v5.5.x) or MariaDB (>= v5.5.x or >= v10.0.x).
  • The Python programming language (>= v2.7.x and < 3).
  • The following Python packages (installation from PyPI with easy_install or pip is recommended; a quick check script follows this list):
    • MySQLdb (>= v1.2.3)
    • lxml (>= v3.3.1-0)
    • requests (>= v2.2.1)
    • beautifulsoup4 (>= v4.2.1)
    • configparser (>= v3.3.0r2)
    • pyzmq (>= v14.3.0)
    • ujson (>= v1.30)
  • The 0MQ (ZeroMQ) message queuing library.
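
Once these packages are installed, the following short script (a hypothetical helper, not part of WikiDAT) verifies that every extraction dependency can be imported and reports its version:

    import importlib

    # Maps import names to the corresponding PyPI package names.
    DEPS = {
        'MySQLdb': 'MySQL-python',
        'lxml': 'lxml',
        'requests': 'requests',
        'bs4': 'beautifulsoup4',
        'configparser': 'configparser',
        'zmq': 'pyzmq',
        'ujson': 'ujson',
    }

    for module, package in sorted(DEPS.items()):
        try:
            mod = importlib.import_module(module)
            version = getattr(mod, '__version__', 'unknown')
            print('%-15s OK (version %s)' % (package, version))
        except ImportError:
            print('%-15s MISSING: try pip install %s' % (package, package))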

Data analysis

  • The R programming language.
  • The following R packages (available from CRAN):
    • RMySQL: Connect to MySQL databases from R.
    • Hmisc: Frank Harrell's miscellaneous functions.
    • car: Companion package to "An R Companion to Applied Regression", 2nd ed.
    • ineq: Calculate inequality metrics and graphics.
    • ggplot2: A wonderful library to create appealing graphics in R.
    • eha: Library for event history and survival analysis.
    • zoo: Library to handle time series data.

Since WikiDAT follows a modular design, you can run only the data extraction process and then undertake the data analysis phase with any tool of your choice (e.g. NumPy/SciPy, pandas or scikit-learn in Python), as in the sketch below.
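
For instance, here is a hypothetical sketch of that second route using pandas. The connection details and schema names below (wikidat_scowiki, revision, rev_timestamp) are assumptions for illustration; check the database actually created by your extraction run:

    import MySQLdb
    import pandas as pd

    # Hypothetical credentials and database name: adapt to your setup.
    conn = MySQLdb.connect(host='localhost', user='wikidat_user',
                           passwd='secret', db='wikidat_scowiki')
    cur = conn.cursor()

    # Revisions per month; table and column names are illustrative.
    cur.execute("""SELECT DATE_FORMAT(rev_timestamp, '%Y-%m') AS month,
                          COUNT(*) FROM revision
                   GROUP BY month ORDER BY month""")
    monthly = pd.DataFrame(list(cur.fetchall()),
                           columns=['month', 'num_revisions'])
    print(monthly.head())
    conn.close()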

Installation procedures

GNU/Linux (Ubuntu 13.10/14.04, Debian 7.0)

The following steps install all software dependencies required for both data extraction and data analysis with WikiDAT.

  1. Install pip (if required):

      $ sudo apt-get install python-pip
    
  2. Install or update all Python dependencies listed above:

      $ sudo pip install -U MySQL-python
      $ sudo pip install -U lxml
      $ sudo pip install -U requests
      $ sudo pip install -U beautifulsoup4
      $ sudo pip install -U configparser
      $ sudo pip install -U pyzmq
      $ sudo pip install -U ujson
    
  3. Install 0MQ (ZeroMQ):

      $ sudo apt-get install libzmq3
    
  4. Install R (on Ubuntu/Debian, available as the r-base package), then install all required R packages from an R console:

      > install.packages(c('RMySQL', 'Hmisc', 'car', 'ineq', 'ggplot2', 'eha', 'zoo'), dep=T)
    
  5. Clone the latest stable version of WikiDAT on your local machine:

      $ git clone https://github.com/glimmerphoenix/WikiDAT.git
    
  6. If you are not working with virtualenv in Python, make sure that the PYTHONPATH environment variable includes the path to the cloned WikiDAT directory:

      $ export PYTHONPATH=$PYTHONPATH:path/to/WikiDAT
    

    You can add this line at the end of your .bashrc file to make this change permanent.

  7. Then, change to the WikiDAT/wikidat directory and edit the default config.ini file to set a valid user and password for connecting to your local database (a sanity-check sketch is shown after the command below). Finally, execute the main.py file to run the program. By default, this runs the whole process for scowiki (the Scots Wikipedia).

      WikiDAT/wikidat$ python main.py
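
Before launching a long extraction, you may want to sanity-check the database credentials with a short standalone script. Note that the section and option names used below ('connection', 'user', 'passwd') are hypothetical; match them to the actual layout of your config.ini:

    import ConfigParser  # stdlib module name in Python 2.7
    import MySQLdb

    # Hypothetical section/option names: adapt to the real config.ini layout.
    config = ConfigParser.SafeConfigParser()
    config.read('config.ini')
    user = config.get('connection', 'user')
    passwd = config.get('connection', 'passwd')

    conn = MySQLdb.connect(host='localhost', user=user, passwd=passwd)
    print('Database connection OK')
    conn.close()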
    

Please refer to the Quick start guide for more information about how to quickly customize the execution of WikiDAT to your own needs.

Windows

To be created

MacOS

To be created