The SparkWiki toolkit can be used in various scenarios where you want to study the Wikipedia graph and its pageview statistics. The graph and the pageviews can also be studied separately. A few examples are given below.
Detecting Spatio-temporal Anomalies in Wikipedia Viewership Statistics
This toolkit was used in the large-scale experiments for an algorithm performing Anomaly Detection in Dynamic Graphs. You can see an example of the usage of the pre-processed data in ch.epfl.lts2.wikipedia.PagecountProcessor. This class contains an implementation of the algorithm. The experiments used the Wikipedia graph and pagecounts to detect anomalies in the viewership statistics of Wikipedia visitors. The graph and pagecounts data are pre-processed using the tools presented in this repository. You can find a brief demo here.
What is Trending on Wikipedia?
The same algorithm and implementation were used to detect trends in multiple language editions of Wikipedia. See more details in the paper and a short 8-min presentation.
Forecasting Wikipedia Page Views with Graph Embeddings
Another project used this toolkit to pre-process pagecounts. The project's goal was to forecast pageviews on Wikipedia. See more details here.
A Knowledge-graph based Taxonomy Construction Method
A knowledge-graph project used the SparkWiki toolkit to construct a Wikipedia-based knowledge graph. See more details here.
SparkWiki is a set of tools written in Scala (2.12) that aims at processing data from Wikipedia, namely:
- SQL table dumps
- pageviews
There is a detailed deployment tutorial available here. Concise instructions are provided below.
You need:
- an installation of Apache Spark
- Scala sbt
- data dumps downloaded from the links mentioned above.
Important: for performance reasons, you should convert the .sql.gz table dumps to .sql.bz2 (so that Spark can process them in parallel and take advantage of all processors in the system), e.g. using a command such as
zcat dump.sql.gz | bzip2 > dump.sql.bz2
Every tool can be run via spark-submit, e.g.
./spark-submit --class ch.epfl.lts2.wikipedia.[ToolNameHere] --master 'local[*]' --executor-memory 4g --driver-memory 4g --packages org.rogach:scallop_2.12:4.1.0 ./sparkwiki/target/scala-2.12/sparkwiki_2.12-1.0.0.jar [ToolArgsHere]
From the project directory, run sbt package to build the jar file. If you want to edit the code in Eclipse, run sbt eclipse to generate the Eclipse project files.
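For instance, assuming the repository was cloned into a directory named sparkwiki, the build boils down to:
cd sparkwiki
sbt package
This produces the jar under target/scala-2.12/ (e.g. sparkwiki_2.12-1.0.0.jar), which is the file passed to spark-submit in the command above.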
This tool can be run using the class ch.epfl.lts2.wikipedia.DumpProcessor as entry point. It will read the SQL table dumps whose names start with the value supplied by the namePrefix argument, in the directory specified by the dumpPath argument, and write (compressed) csv files in the directory specified by outputPath, which can later be imported into the neo4j graph database.
This tool requires the following table dumps to be present:
- page
- pagelinks
- redirect
- categorylinks
Arguments:
- --dumpPath: directory containing the sql.bz2 files (gz not supported)
- --outputPath: output directory
- --namePrefix: leading part of the name of each SQL dump, e.g. enwiki-20180801
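For example, a DumpProcessor run could look as follows (the dump and output directories are placeholders to adapt to your setup; enwiki-20180801 is the prefix from the example above):
./spark-submit --class ch.epfl.lts2.wikipedia.DumpProcessor --master 'local[*]' --executor-memory 4g --driver-memory 4g --packages org.rogach:scallop_2.12:4.1.0 ./sparkwiki/target/scala-2.12/sparkwiki_2.12-1.0.0.jar --dumpPath /path/to/dumps --outputPath /path/to/neo4j-csv --namePrefix enwiki-20180801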
This tool is available using the class ch.epfl.lts2.wikipedia.DumpParser as entry point. It reads one or more SQL dumps and converts them to either a csv file or a parquet file. All the dumps should be of the same type (e.g. all page dumps) but can span multiple languages. The language is extracted from the dump filename and appended to each record in the output.
Arguments:
- --dumpFilePaths: paths to the .sql.gz or .sql.bz2 SQL dumps to read
- --dumpType: should be page, redirect, pagelinks, category, categorylinks or langlinks
- --outputPath: output directory
- --outputFormat: (default=csv) should be csv or parquet
- --languages: (optional, only used for langlinks) extract only links pointing to the specified languages, e.g. --languages fr es to keep only links pointing to the French and Spanish versions
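As an illustration, converting a single page dump to parquet could look like this (the dump file name and output directory are placeholders):
./spark-submit --class ch.epfl.lts2.wikipedia.DumpParser --master 'local[*]' --executor-memory 4g --driver-memory 4g --packages org.rogach:scallop_2.12:4.1.0 ./sparkwiki/target/scala-2.12/sparkwiki_2.12-1.0.0.jar --dumpFilePaths /path/to/dumps/enwiki-20180801-page.sql.bz2 --dumpType page --outputPath /path/to/parquet/page --outputFormat parquet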
This tool can be run using the class ch.epfl.lts2.wikipedia.PagecountProcessor as entry point. It will read a collection of pagecounts covering the period between the startDate and endDate arguments (inclusive bounds), filter the counts belonging to the Wikipedia project and having more daily visits than the threshold given by the minDailyVisit argument, and save the result to Parquet files or into a Cassandra database, after resolving page ids (either a SQL page dump or a processed SQL page dump (as parquet) must be supplied via the pageDump argument).
Arguments:
- --config: path to the configuration file (cf. the config folder for a sample)
- --basePath: directory containing the pagecounts files
- --startDate: first day to process, formatted as yyyy-MM-dd, e.g. 2018-08-03
- --endDate: last day to process, formatted as yyyy-MM-dd
- --languages: list of languages to extract pagecounts for, e.g. --languages en fr ru to process the English, French and Russian languages
- --pageDump: path to a page SQL dump, or a version processed by DumpParser and saved as parquet
- --outputPath: path to the Parquet files with pre-processed pagecounts. You can use these Parquet files for further processing in Spark or any other framework that supports the Parquet format. If you omit this option, you need a running Cassandra instance to store the pagecount data (and set the appropriate connection information in the configuration file).
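For example, processing one month of English pagecounts into Parquet files could look like this (the configuration file name, paths and dates are placeholders; the page dump is assumed to have been converted to parquet by DumpParser as shown above):
./spark-submit --class ch.epfl.lts2.wikipedia.PagecountProcessor --master 'local[*]' --executor-memory 4g --driver-memory 4g --packages org.rogach:scallop_2.12:4.1.0 ./sparkwiki/target/scala-2.12/sparkwiki_2.12-1.0.0.jar --config ./sparkwiki/config/pagecount.conf --basePath /path/to/pagecounts --startDate 2018-08-01 --endDate 2018-08-31 --languages en --pageDump /path/to/parquet/page --outputPath /path/to/parquet/pagecounts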
Please cite this paper if you use the code or the pre-processed dataset.
@inproceedings{aspert2019graph,
  title={A Graph-structured Dataset for Wikipedia Research},
  author={Aspert, Nicolas and Miz, Volodymyr and Ricaud, Benjamin and Vandergheynst, Pierre},
  booktitle={Companion Proceedings of the 2019 World Wide Web Conference},
  year={2019}
}