To install the project:

```
git clone https://github.com/Felihong/wikidata-sequence-analysis.git
cd wikidata-sequence-analysis
pip install -r requirements.txt
```
The sample data are collected mainly to support the following two perspectives:

- Descriptive statistics of the collected data
- Behaviour patterns discovered with the help of sequential pattern mining
- Randomly identify 100 items per current quality prediction class (A, B, C, D, E); these are retrieved from the `wikidatawiki_p` page table (column `page_latest`) and the ORES API
- The edit histories of all items are retrieved from the `wikidatawiki_p` revision table (a query sketch follows this list)
- All of the above is then combined with the respective editor information from the `wikidatawiki_p` user table, together with edit comments from the `wikidatawiki_p` comment table and user group information from the `wikidatawiki_p` user_groups table
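As an illustration of the retrieval step, here is a minimal sketch of querying the replica from Python. The host name, the use of pymysql, and the credential file are assumptions, not part of this repository:

```python
import pymysql

# Connect to the wikidatawiki_p replica; host and credential file are
# placeholders (on Toolforge the credentials live in ~/replica.my.cnf).
conn = pymysql.connect(
    host="wikidatawiki.analytics.db.svc.wikimedia.cloud",  # assumption
    db="wikidatawiki_p",
    read_default_file="~/replica.my.cnf",
)

with conn.cursor() as cur:
    # Current revision IDs (page_latest) for a handful of item pages.
    cur.execute(
        "SELECT page_id, page_title, page_latest "
        "FROM page WHERE page_namespace = 0 LIMIT 5"
    )
    for page_id, title, latest in cur.fetchall():
        # Full edit history of this item from the revision table.
        cur.execute(
            "SELECT rev_id, rev_parent_id, rev_timestamp "
            "FROM revision WHERE rev_page = %s ORDER BY rev_timestamp",
            (page_id,),
        )
        print(title, latest, cur.fetchall())
```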
Table `article`:

| Column | Description |
|---|---|
| `article_id` | table ID, primary key |
| `item_id` | edited item page ID |
| `item_title` | respective item page name |
| `label` | English label of the item page |
| `category` | classified content category based on label and description |
Table `editor`:

| Column | Description |
|---|---|
| `editor_id` | table ID, primary key |
| `user_id` | editor ID |
| `user_name` | editor name |
| `user_group` | editor's user group and the corresponding user rights |
| `user_editcount` | rough number of edits and edit-like actions the user has performed |
| `user_registration` | editor registration timestamp |
Quality predictions per revision:

| Column | Description |
|---|---|
| `rev_id` | revision (edit) ID, primary key |
| `prediction` | quality prediction for this revision, chosen as the class with the highest probability |
| `itemquality_A` … `itemquality_E` | concrete quality-level probability distribution of this revision |
| `js_distance` | Jensen-Shannon divergence value based on the given quality distribution (recomputed in the sketch below) |
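For reference, the Jensen-Shannon value can be recomputed from the five probability columns with scipy. Comparing the distributions of two consecutive revisions is an assumption here, as are the numbers; the repository defines what exactly is compared:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Quality probability distributions (A, B, C, D, E) of two revisions;
# the values are made up for illustration.
p = np.array([0.02, 0.05, 0.13, 0.30, 0.50])
q = np.array([0.01, 0.04, 0.10, 0.25, 0.60])

# scipy returns the Jensen-Shannon *distance*, i.e. the square root of
# the Jensen-Shannon divergence; square it if the divergence is needed.
js_dist = jensenshannon(p, q, base=2)
print(js_dist ** 2)
```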
Edit comments per revision:

| Column | Description |
|---|---|
| `rev_id` | revision (edit) ID, primary key |
| `comment` | original comment text of this edit |
| `edit_summary` | comment simplified with regular expressions (see the sketch below) |
| `edit_type` | schematized and classified edit summary for ease of use |
| `paraphrase` | paraphrase of the edit summary according to the Wikibase API |
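As an illustration of the `edit_summary` step: auto-generated Wikidata comments start with a magic prefix such as `/* wbsetlabel-add:1|en */`, from which the action name can be pulled with a regular expression. The pattern below is a hypothetical sketch, not the one used in this repository:

```python
import re

# Auto-generated comments follow "/* action-name:params */ free text";
# keep only the action name, fall back to "other" for manual comments.
SUMMARY_RE = re.compile(r"/\*\s*([a-z-]+)[:|]")

def simplify_comment(comment: str) -> str:
    match = SUMMARY_RE.search(comment)
    return match.group(1) if match else "other"

print(simplify_comment("/* wbsetlabel-add:1|en */ Douglas Adams"))
# -> "wbsetlabel-add"
```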
Revision history, linking the tables above:

| Column | Description |
|---|---|
| `rev_id` | revision (edit) ID, primary key |
| `parent_id` | preceding revision (edit) ID |
| `editor_id` | foreign key to table `editor` |
| `article_id` | foreign key to table `article` |
| `rev_timestamp` | revision timestamp |
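Since `editor_id` and `article_id` are foreign keys, the sample tables can be joined into one analysis frame. A minimal pandas sketch, assuming the tables are stored as CSV files with the column names above (the file names are placeholders):

```python
import pandas as pd

# File names are placeholders for however the sample tables are stored.
revisions = pd.read_csv("revision.csv")
editors = pd.read_csv("editor.csv")
articles = pd.read_csv("article.csv")

# Follow the foreign keys: revision -> editor and revision -> article.
df = (
    revisions
    .merge(editors, on="editor_id", how="left")
    .merge(articles, on="article_id", how="left")
    .sort_values(["article_id", "rev_timestamp"])
)
print(df.head())
```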
It is strongly suggested to install ORES inside a virtual environment. First install Python's virtualenv, create a directory named python-environments, and navigate into it:

```
sudo apt install virtualenv
mkdir python-environments
cd python-environments
```
Create a Python 3 virtual environment named project_ores, then activate it:

```
virtualenv -p python3 project_ores
source project_ores/bin/activate
```
Alternatively, you can create the virtual environment with Anaconda:

```
conda create --name project_ores python=3.5.0
conda activate project_ores
```
Now install the ORES package in the virtual environment:

```
pip install ores
```
The following steps show how to use the ORES itemquality model to score the given revisions; the data can be fetched from the command line using the ORES built-in tools.
To pull a sample, start with:
```
cat revision_id.csv | tsv2json int | ores score_revisions https://ores.wikimedia.org \
    'Example app, here should be your user agent' \
    wikidatawiki \
    itemquality \
    --input-format=plain \
    --parallel-requests=4 \
    > result.jsonlines
```
Please make sure that the input file revision_id.csv begins with the header "rev_id", followed by one revision ID per line.
After this, the script itemquality_scores_to_csv.py can be used to parse the results into CSV:

```
python itemquality_scores_to_csv.py < result.jsonlines > result.csv
```
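If the parsing step needs to be adapted, its core logic looks roughly like the following. The exact nesting of the JSON documents in result.jsonlines depends on the ORES utility version, so the `doc["score"]` layout below is an assumption, not the repository's actual implementation:

```python
import csv
import json
import sys

# Read ORES jsonlines from stdin and write the CSV layout described above.
writer = csv.writer(sys.stdout)
writer.writerow(["rev_id", "prediction"] +
                [f"itemquality_{q}" for q in "ABCDE"])

for line in sys.stdin:
    doc = json.loads(line)
    # The "rev_id" field and the score nesting below are assumptions.
    score = doc["score"]["itemquality"]["score"]
    probs = score["probability"]
    writer.writerow([doc["rev_id"], score["prediction"]] +
                    [probs[q] for q in "ABCDE"])
```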