Installation guide
This page provides step-by-step instructions to install WikiDAT on your system. Please feel free to suggest any comments or changes to improve this documentation.
WikiDAT has been designed to run even on modest hardware. Dump files are processed on the fly, working with data streams as the files are decompressed. Thus, the main requirement is enough free disk space to store the extracted information locally.
There are no absolute minimum requirements regarding CPU or memory. WikiDAT itself consumes less than 200 MB of memory, even with large Wikipedia languages. However, your local database may require more memory (and additional configuration) to import the extracted information quickly.
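For instance, if you use MySQL or MariaDB, most of the import speed-up comes from giving InnoDB a generous buffer pool. A minimal sketch of the relevant settings, assuming a machine with 8 GB of RAM and a configuration file at /etc/mysql/my.cnf (values and path are only illustrative, not WikiDAT defaults):

```
# Hypothetical excerpt of /etc/mysql/my.cnf; adjust to your own hardware
[mysqld]
innodb_buffer_pool_size = 4G        # let InnoDB cache a large share of the imported tables
innodb_flush_log_at_trx_commit = 2  # relax flush-per-commit to speed up bulk inserts
```

Restart the database server after changing these values.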
The following hardware is recommended to ensure a smooth ETL process:
- Intel Core i7 CPU (or comparable).
- At least 8GB RAM (DDR3, 1600 MHz); mostly for your local DB.
- At least one Solid-State Disk (SSD) drive, to speed up data loading.
Disk space requirements depend on the number and size of the projects you intend to analyze. For instance, to analyze the complete history dump (all revisions for each page) of the English Wikipedia, it is recommended to have at least 200 GB of free disk space (to store both the compressed dump files and the DB tables). The following table provides a (non-exhaustive) list of recommended disk space for some Wikipedia languages:
Language | Code | Recommended storage space |
---|---|---|
English Wikipedia | enwiki | ~ 200 GB |
German Wikipedia | dewiki | ~ 60 GB |
French Wikipedia | frwiki | ~ 50 GB |
Spanish Wikipedia | eswiki | ~ 24 GB |
Polish Wikipedia | plwiki | ~ 18 GB |
Italian Wikipedia | itwiki | ~ 24 GB |
Japanese Wikipedia | jawiki | ~ 21 GB |
Dutch Wikipedia | nlwiki | ~ 18 GB |
Portuguese Wikipedia | ptwiki | ~ 18 GB |
Russian Wikipedia | ruwiki | ~ 32 GB |
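Before downloading any dump, you can quickly check the free space on the partitions that will hold the compressed files and the database tables; the paths below are only examples and depend on your own setup:

```
$ df -h /var/lib/mysql    # free space on the partition holding your DB data files
$ df -h ~/dumps           # free space where the compressed dump files will be stored
```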
IMPORTANT: WikiDAT has only been tested with Python 2.7. Python 3 is not supported yet, although plans to address this issue have already been outlined.
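You can verify which interpreter your system provides before going any further (this assumes `python` points to your default interpreter):

```
$ python --version    # should report a 2.7.x release to work with WikiDAT
```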
The following software dependencies must be met before installing WikiDAT on your system:
Data extraction
- MySQL (>= v5.5.x) or MariaDB (>= v5.5.x or >= v10.0.x).
- The Python programming language (>= v2.7.x and < 3).
- The following Python packages (PyPI installation recommended with `easy_install` or `pip`; a quick import check is sketched after this list):
  - MySQLdb (>= v1.2.3)
  - lxml (>= v3.3.1-0)
  - requests (>= v2.2.1)
  - beautifulsoup4 (>= v4.2.1)
  - configparser (>= v3.3.0r2)
  - pyzmq (>= v14.3.0)
  - ujson (>= v1.30)
- The 0MQ (ZeroMQ) message queuing library.
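Once the Python packages above are installed, a quick way to confirm that they can all be imported by your Python 2.7 interpreter is a one-liner like the following (each module name is the import name of one of the packages listed above):

```
$ python -c "import MySQLdb, lxml, requests, bs4, configparser, zmq, ujson; print 'extraction dependencies OK'"
```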
Data analysis
- The R programming language.
- The following R packages (available from CRAN):
- RMySQL: Connect to MySQL databases from R.
- Hmisc: Frank Harrell's miscellaneous functions.
- car: Companion functions for "An R Companion to Applied Regression", 2nd ed.
- ineq: Calculate inequality metrics and graphics.
- ggplot2: A wonderful library to create appealing graphics in R.
- eha: Library for event history and survival analysis.
- zoo: Library to handle time series data.
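After installing these packages (see the installation steps below), you can check from a terminal that they all load correctly; this sketch assumes `Rscript` is available on your PATH:

```
$ Rscript -e 'invisible(lapply(c("RMySQL", "Hmisc", "car", "ineq", "ggplot2", "eha", "zoo"), library, character.only = TRUE))'
```

If any package is missing, library() will stop with an error naming it.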
Since WikiDAT follows a modular design, you can just run the data extraction process and undertake the data analysis phase with any tool of your choice (e.g. NumPy/SciPy, Pandas or scikit-learn in Python).
The following steps install all software dependencies required for both data extraction and data analysis with WikiDAT.
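Note that some of the Python packages listed above (notably MySQL-python and lxml) are compiled during installation. On Debian/Ubuntu systems you will typically need the corresponding development headers first; the package names below are the usual ones on those distributions and may differ on other systems:

```
$ sudo apt-get install build-essential python-dev libmysqlclient-dev libxml2-dev libxslt1-dev
```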
- Install `pip` (if required):
  $ sudo apt-get install python-pip
- Install or update all Python dependencies listed above:
  $ sudo pip install -U MySQL-python
  $ sudo pip install -U lxml
  $ sudo pip install -U requests
  $ sudo pip install -U beautifulsoup4
  $ sudo pip install -U configparser
  $ sudo pip install -U pyzmq
  $ sudo pip install -U ujson
- Install 0MQ (ZeroMQ):
  $ sudo apt-get install libzmq3
- Install R, then all required R packages (run the following in an R session):
  > install.packages(c('RMySQL', 'Hmisc', 'car', 'ineq', 'ggplot2', 'eha', 'zoo'), dep=T)
- Clone the latest stable version of WikiDAT on your local machine:
  $ git clone https://github.com/glimmerphoenix/WikiDAT.git
- If you are not working with `virtualenv` in Python, make sure that your environment variable `PYTHONPATH` points to the cloned WikiDAT directory:
  $ export PYTHONPATH=$PYTHONPATH:path/to/WikiDAT
  You can add this line to the end of your `.bashrc` file to make these changes permanent (see the sketch after this list).
- Then, change to the `WikiDAT/wikidat` directory. Modify the default `config.ini` file to indicate a valid user and password to connect to your local database. Finally, execute the `main.py` file to run the program. This will run the whole process for the case of `scowiki`:
  WikiDAT/wikidat$ python main.py
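As a recap, the following sketch makes the `PYTHONPATH` change permanent and launches the default extraction; it assumes WikiDAT was cloned into your home directory, so adjust the path to your own layout:

```
$ echo 'export PYTHONPATH=$PYTHONPATH:~/WikiDAT' >> ~/.bashrc   # persist the change for new shells
$ source ~/.bashrc
$ cd ~/WikiDAT/wikidat
$ python main.py    # runs the whole process for scowiki with the default config.ini
```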
Please refer to the Quick start guide for more information about how to customize the execution of WikiDAT to your own needs.
WikiDAT: Wikipedia Data Analysis Toolkit. CC-BY-SA 3.0 Felipe Ortega. Icons: Font Awesome