To allow mappings to Wikipedia and popularity calculations, the following dump files should be placed in their respective directories (NB: these can be symlinks to versions on external storage):
- The `wd_JSON` directory should contain the Wikidata JSON dump, as `latest-all.json.bz2` (download from http://dumps.wikimedia.org/wikidatawiki/entities/).
- The `wp_SQL` directory should contain the en.wikipedia SQL dump file, as `enwiki-latest-page.sql.gz` (download from http://dumps.wikimedia.org/enwiki/latest/).
- The `wp_pagecounts` directory should contain the Wikipedia pagevisits dump files: multiple files such as `wp_pagecounts/pageviews-202403-user.bz2` etc. (download from https://dumps.wikimedia.org/other/pageview_complete/monthly/).
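Before running the build, it can help to confirm that the inputs are where the list above says they should be. The following is a minimal sketch, not part of tree-build: the function name `check_wiki_inputs` is hypothetical, and it assumes the three directories sit under a single parent directory (for example `data/Wiki`) that is passed as the only argument.

```shell
# Hypothetical helper (not part of tree-build): report which of the expected
# dump files are present under a parent directory such as data/Wiki.
check_wiki_inputs() {
    local base="${1:-.}"
    local missing=0
    local f

    # Single-file inputs, named as in the list above.
    for f in "wd_JSON/latest-all.json.bz2" "wp_SQL/enwiki-latest-page.sql.gz"; do
        # -e follows symlinks, so links to external storage count as present;
        # a dangling symlink is reported as missing.
        if [ -e "$base/$f" ]; then
            echo "found:   $f"
        else
            echo "missing: $f"
            missing=1
        fi
    done

    # Pageview inputs: at least one monthly file is expected.
    # Assumption: raw monthly dumps are named pageviews-*-user.bz2.
    local found_pv=0
    for f in "$base"/wp_pagecounts/pageviews-*-user.bz2; do
        if [ -e "$f" ]; then
            found_pv=1
            break
        fi
    done
    if [ "$found_pv" -eq 1 ]; then
        echo "found:   wp_pagecounts/pageviews-*-user.bz2"
    else
        echo "missing: wp_pagecounts/pageviews-*-user.bz2"
        missing=1
    fi

    # Exit status 0 only when every expected input was found.
    return "$missing"
}

# Example: check_wiki_inputs data/Wiki
```

The pageviews check only looks for raw monthly dumps; if you use preprocessed pageviews files instead, a "missing" report for `wp_pagecounts` can be ignored.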
For `wp_pagecounts`, as a much faster alternative, you can download preprocessed pageviews files from a release. You can download the gz file and unpack it in one command; e.g. from `data/Wiki/wp_pagecounts`, run:
wget https://github.com/OneZoom/tree-build/releases/download/pageviews-202306-202403/OneZoom_pageviews-202306-202403.tar.gz -O - | tar -xz
If you use the preprocessed files, you should then omit the pageviews files from the arguments when you later run `generate_filtered_files` (see build steps).
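If you would rather inspect the release archive before unpacking it than pipe it straight into `tar`, the download and the extraction can be done as separate steps. This is a minimal sketch under stated assumptions: the function name `unpack_pageviews` is hypothetical, and the code makes no assumption about which files the archive contains.

```shell
# Hypothetical helper: validate and list a downloaded .tar.gz archive,
# then unpack it into a destination directory.
unpack_pageviews() {
    local archive="$1"
    local dest="${2:-.}"

    # Fail early if the archive is missing or is not a readable gzipped tar.
    tar -tzf "$archive" > /dev/null || return 1

    mkdir -p "$dest"
    # Show what will be extracted, then extract into the destination.
    tar -tzf "$archive"
    tar -xzf "$archive" -C "$dest"
}

# Example usage, run from the repository root:
#   wget https://github.com/OneZoom/tree-build/releases/download/pageviews-202306-202403/OneZoom_pageviews-202306-202403.tar.gz
#   unpack_pageviews OneZoom_pageviews-202306-202403.tar.gz data/Wiki/wp_pagecounts
```

Keeping the archive on disk costs some space, but it means an interrupted download can be detected before anything is written into `wp_pagecounts`.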