This is the code and data for KDD'23 paper SeCoGD: "Context-aware Event Forecasting via Graph Disentanglement" by Yunshan Ma, Chenchen Ye, Zijian Wu, Xiang Wang, Yixin Cao, and Tat-Seng Chua.
The code is tested to be runnable under the environment with python=3.9; pytorch=1.12; cuda=11.3.
To create environment, you could use commands as below:
conda create --name seco python=3.9
conda activate seco
conda install pytorch==1.12.0 -c pytorch
pip install pandas
pip install tensorboard
conda install tqdm
conda install -c dglteam dgl-cuda11.3
pip install matplotlib
pip install nltk
pip install gensim
Download data from the shared link: https://drive.google.com/file/d/12sREjxkVQnadCBysqYHz_2DXSINclBvt/view?usp=sharing
The input data files need to be unzipped first:
unzip data.zip
unzip data_disentangled.zip
The folder ./data
contains the original event data and the structured data.
The folder ./data_disentangled
contains extra context data.
We analyze the CAMEO event ontology and convert it into python dictionaries in ./data/CAMEO
.
NOTE:
[country_name]
isEG
orIR
orIS
.- In the dataset, each md5 corresponds to a news URL. This conversion is done by
md5 = hashlib.md5(URL.encode()).hexdigest()
. - All scripts are running under
./src
.
The original event data is stored in ./data/[country_name]/[country_name].csv
.
The generated structured data files are also put under ./data/[country_name]
.
To self-generate the structured data:
- Run the data generation script:
python generate_structured_data.py --c [country_name]
NOTE:
[context_generation_method]
isLDA
here, and it can be replaced with other text clustering methods, such as K-Means and GMM.[K]
is the number of contexts, i.e. the number of topics in LDA. We provide data forK=5
as example.
The new event data with context information is stored in ./data_disentangled/[country_name]_[context_generation_method]_K[K]/[country_name].csv
.
The files for generated structured data with context information are also put under ./data_disentangled/[country_name]_[context_generation_method]_K[K]
.
To self-generate the contexts:
- Prepare text data:
- The original news articles:
- Location:
./data/[country_name]/docs_title_paragraph.json
- Format:
[ [title, [paragraph1, paragraph2, ...]], ... ]
- Order:
./data/[country_name]/md5_list.json
- Location:
- Clean text data for LDA model learning and topic prediction:
python generate_clean_text.py --c [country_name]
- The cleaned text data will be saved as:
- Location:
./data/[country_name]/train_docs_cleaned_tokens.json
and./data/[country_name]/docs_cleaned_tokens.json
- Format:
[ [token1, token2, ...], ...]
- Order:
./data/[country_name]/train_md5_list.json
and./data/[country_name]/md5_list.json
respectively.
- Location:
-
Run context generation script:
python generate_context_lda.py --c [country_name] --K_TOPICS [K]
-
Create graph data:
python generate_graphs.py --datapath ../data/[country_name] python generate_graphs_disentangled.py --datapath ../data_disentangled/[country_name]_[context_generation_method]_K[K] --K [K]
-
Train SeCoGD model:
python main.py -d [country_name] \ --context [context_generation_method] \ --k_contexts [K] \ --hypergraph_ent \ --n_layers_hypergraph_ent 1 \ --hypergraph_rel \ --n_layers_hypergraph_rel 1 \ --score_aggregation hard \ --encoder rgcn \ --n_layers 2 \ --n_hidden 200 \ --self_loop \ --layer_norm \ --train_history_len 3 \ --lr 0.001 \ --wd 1e-6
-
View results:
- Loss and runs:
tensorboard --logdir=./runs/
- Logging results will be saved under:
./results/[country_name]/[country_name]_[context_generation_method]_K[K]/