1 Download the Wikidata dump from https://dumps.wikimedia.org/wikidatawiki/entities/
--latest-all.json.bz2 (52.6G)
2 Download the Wikipedia dumps from https://dumps.wikimedia.org/, including:
--dewiki-20200420-page.sql.gz (261M)
--dewiki-20200420-page_props.sql.gz (73M)
--dewiki-20200420-page_restrictions.sql.gz (21M)
--dewiki-20200420-pages-articles.xml.bz2 (5.2G)
3 Import the SQL files into MySQL. Get the restriction mapping from page_restrictions.sql, the Wikipedia page_id to page_title mapping from page.sql, and the Wikipedia page_id to Wikidata entity_id mapping from page_props.sql:
python gen_redirction.py
python get_from_sql.py
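For context, page_props stores the page_id-to-Wikidata link under the property name wikibase_item. The following is a minimal Python sketch of extracting that mapping by streaming the SQL dump directly (an alternative to the MySQL import); the file name and the regex-based parsing are illustrative assumptions, not the actual logic of get_from_sql.py.

import gzip, re

# Rows in page_props.sql look like (pp_page,'wikibase_item','Q42',NULL)
ROW = re.compile(r"\((\d+),'wikibase_item','(Q\d+)'")

def page_to_entity(path="dewiki-20200420-page_props.sql.gz"):
    mapping = {}
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith("INSERT INTO"):
                for page_id, qid in ROW.findall(line):
                    mapping[int(page_id)] = qid  # Wikipedia page_id -> Wikidata Q-id
    return mapping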
4 Use WikiExtractor_01.py to extract articles from the Wikipedia dump (https://github.com/attardi/wikiextractor).
Use write_to_one_file.py to write all extracted files into one file.
python WikiExtractor_01.py -o [output file path] -l input
python write_to_one_file.py
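What write_to_one_file.py does is not spelled out here; a minimal sketch of the idea, assuming WikiExtractor wrote its output under extracted/AA/wiki_00, extracted/AA/wiki_01, ..., is simply to concatenate every part file:

import os

def write_to_one_file(extracted_dir="extracted", out_path="wiki_all.txt"):
    # Walk the WikiExtractor output tree and append every part file to one output file.
    with open(out_path, "w", encoding="utf-8") as out:
        for root, _, files in os.walk(extracted_dir):
            for name in sorted(files):
                with open(os.path.join(root, name), encoding="utf-8") as part:
                    out.write(part.read())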
python get_wikidata_name_des.py
python change_wikiID.py
python integrate_wikidata_wikipedia.py
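As background for get_wikidata_name_des.py and integrate_wikidata_wikipedia.py: latest-all.json.bz2 is a JSON array with one entity per line, so entity names and descriptions can be pulled out in a single streaming pass. The sketch below is an assumption about the general shape of that step (language code, output tuple), not the scripts' actual implementation.

import bz2, json

def iter_name_des(path="latest-all.json.bz2", lang="de"):
    # Yield (entity_id, label, description) for every Wikidata entity.
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            entity = json.loads(line)
            label = entity.get("labels", {}).get(lang, {}).get("value", "")
            desc = entity.get("descriptions", {}).get(lang, {}).get("value", "")
            yield entity["id"], label, desc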
In this part we largely followed deep-ed (https://github.com/dalab/deep-ed), but we changed some parts of the code and adapted it to the de wiki data.
10 Install Torch and the required Torch libraries.
th data_gen/gen_p_e_m/gen_p_e_m_from_wiki.lua -root_data_dir $DATA_PATH
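This step builds the mention-entity prior p(e|m) from Wikipedia hyperlink statistics. Conceptually it can be sketched as below (a hedged illustration, not the Lua code; hyperlinks is an assumed iterable of (anchor_text, target_entity) pairs):

from collections import Counter, defaultdict

def build_p_e_m(hyperlinks):
    # Count how often each anchor text links to each entity, then normalize per mention.
    counts = defaultdict(Counter)
    for mention, entity in hyperlinks:
        counts[mention.lower()][entity] += 1
    return {
        mention: {e: c / sum(ent_counts.values()) for e, c in ent_counts.items()}
        for mention, ent_counts in counts.items()
    }

For example, p_e_m["berlin"] would map the entity Berlin to a probability close to 1 and rarer targets such as Berlin_(band) to small values.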
th entities/ent_name2id_freq/e_freq_gen.lua -root_data_dir $DATA_PATH
mkdir $DATA_PATH/generated/test_train_data/
th data_gen/gen_test_train_data/gen_all.lua -root_data_dir $DATA_PATH
i) From Wiki canonical pages: th data_gen/gen_wiki_data/gen_ent_wiki_w_repr.lua -root_data_dir $DATA_PATH
ii) From context windows surrounding Wiki hyperlinks: th data_gen/gen_wiki_data/gen_wiki_hyp_train_data.lua -root_data_dir $DATA_PATH
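For ii), one training example per hyperlink is conceptually just the link target plus a fixed-size window of surrounding words. A minimal sketch of that idea (the window size and the (tokens, links) input format are assumptions for illustration):

def hyperlink_contexts(tokens, links, window=50):
    # links: list of (token_position, target_entity) pairs found in one article.
    for pos, entity in links:
        left = max(0, pos - window)
        right = min(len(tokens), pos + window + 1)
        # One training example: the entity plus the words around its mention.
        yield entity, tokens[left:pos] + tokens[pos + 1:right]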
th words/w_freq/w_freq_gen.lua -root_data_dir $DATA_PATH
17 Compute the restricted training data for learning entity embeddings by using only candidate entities from the relatedness datasets and all ED sets
i) From Wiki canonical pages: th entities/relatedness/filter_wiki_canonical_words_RLTD.lua -root_data_dir $DATA_PATH
ii) From context windows surrounding Wiki hyperlinks: th entities/relatedness/filter_wiki_hyperlink_contexts_RLTD.lua -root_data_dir $DATA_PATH
mkdir $DATA_PATH/generated/ent_vecs
th entities/learn_e2v/learn_a.lua -root_data_dir $DATA_PATH | tee log_train_entity_vecs
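learn_a.lua trains the entity vectors with a max-margin objective in the spirit of deep-ed: an entity vector should score words from its canonical page and hyperlink contexts higher than randomly sampled words, while the word vectors stay fixed. A minimal NumPy sketch of that idea (the dimensions, learning rate, and the sampling of positives/negatives are illustrative assumptions, not the Lua implementation):

import numpy as np

rng = np.random.default_rng(0)
dim, vocab, margin, lr = 300, 50_000, 0.1, 0.3
word_vecs = rng.standard_normal((vocab, dim)).astype(np.float32)
word_vecs /= np.linalg.norm(word_vecs, axis=1, keepdims=True)  # stands in for fixed pre-trained word vectors

def train_entity_vec(pos_word_ids, n_steps=200, n_neg=5):
    # Learn one entity vector so it ranks its context words above random words.
    e = rng.standard_normal(dim).astype(np.float32)
    e /= np.linalg.norm(e)
    for _ in range(n_steps):
        w_pos = word_vecs[rng.choice(pos_word_ids)]             # word seen with this entity
        for w_neg in word_vecs[rng.integers(0, vocab, n_neg)]:  # words drawn from the unigram prior
            if margin - e @ w_pos + e @ w_neg > 0:              # hinge loss is active
                e += lr * (w_pos - w_neg)
        e /= np.linalg.norm(e)                                  # keep the vector on the unit sphere
    return e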
python prepro_hipe.py
python prepro_hipe_util.py
python3 -m model.train \
  --batch_size=4 --experiment_name=hipe --training_name=group_global/global_model_v$v \
  --ent_vecs_regularization=l2dropout --evaluation_minutes=10 --nepoch_no_imprv=6 \
  --span_emb="boundaries" --dim_char=50 --hidden_size_char=50 --hidden_size_lstm=150 \
  --nn_components=pem_lstm_attention_global --fast_evaluation=True --all_spans_training=True \
  --attention_ent_vecs_no_regularization=True --final_score_ffnn=0_0 --attention_R=10 --attention_K=100 \
  --train_datasets=HIPE-data-v1.0-train-de \
  --el_datasets=HIPE-data-v1.0-train-de_z_HIPE-data-v1.0-dev-de_z_HIPE-data-v1.0-test-de \
  --el_val_datasets=0 --global_thr=0.001 --global_score_ffnn=0_0