Experiment using node2vec on arXiv papers metadata.
- Python ≥ 3.6
Create and activate a virtual environment (conda)
conda create --name py36_node2vec-arxiv python=3.6
source activate py36_node2vec-arxiv
If pip is configured in your conda environment, install the dependencies from within the project root directory:
pip install -r requirements.txt
The dataset used in this repository must be downloaded from Kaggle.
Create a folder data from within the project root directory.
Place the downloaded file arxivData.json in the data folder.
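Before running the experiment, it can help to confirm that the dataset is where the code expects it. The snippet below is a minimal sketch, not part of the repository: the helper name count_records is hypothetical, and it only assumes that arxivData.json is a valid JSON document whose top-level value is a collection of records.

```python
import json
from pathlib import Path

# Location described above: <project root>/data/arxivData.json
DATASET_PATH = Path("data") / "arxivData.json"


def count_records(path: Path) -> int:
    """Load the JSON file and return the number of top-level entries.

    Hypothetical helper; assumes the top-level JSON value supports len().
    """
    with path.open(encoding="utf-8") as handle:
        return len(json.load(handle))


if DATASET_PATH.exists():
    print(f"Found {count_records(DATASET_PATH)} records in {DATASET_PATH}")
else:
    print(f"Dataset not found, expected it at {DATASET_PATH}")
```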
Now that the environment is set up and the dataset is available, you can run the code with the following command:
python main.py
By default, this uses the arxivData.json file as input and generates the following embedding files in the same data folder:
- kg_node2vec_embed.emb: the embedding file, with the node id as first column followed by the vector dimensions
- kg_node2vec_label.tsv: a mapping of node id to node label
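The two files described above can be read back with a few lines of Python. This is a minimal sketch under stated assumptions rather than the repository's own loader: the function names are hypothetical, the embedding file is assumed to be whitespace-separated, the label file is assumed to be tab-separated, and an optional word2vec-style header line ("<count> <dimensions>") is skipped if present.

```python
from typing import Dict, Iterable, List


def parse_embeddings(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse lines of the form '<node id> <v1> <v2> ...' into a dict."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        fields = line.split()
        # Skip blank lines and an optional "<count> <dim>" header (assumption).
        # Note: this also skips 1-dimensional vectors, which are not expected here.
        if len(fields) <= 2:
            continue
        vectors[fields[0]] = [float(value) for value in fields[1:]]
    return vectors


def parse_labels(lines: Iterable[str]) -> Dict[str, str]:
    """Parse tab-separated '<node id>\\t<node label>' lines into a dict."""
    labels: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        node_id, _, label = line.partition("\t")
        labels[node_id] = label
    return labels
```

With real files, the same functions can be fed open file handles, for example `parse_embeddings(open("data/kg_node2vec_embed.emb", encoding="utf-8"))`.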
To simplify visualisation, we also output embeddings and labels compatible with the TensorFlow Projector tool. Note that we keep only Author nodes for the purpose of the blog post.
- kg_node2vec_tf_proj.tsv: an embedding file in the TensorFlow Projector format (vectors without label or id)
- kg_node2vec_label.tsv: a label file in the TensorFlow Projector format
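To illustrate the shape of these Projector files, here is a minimal sketch of how vectors and labels could be turned into two aligned lists of rows: tab-separated vectors with no id, and one label per line in the same order. The function name and the keep predicate are hypothetical; in particular, how the repository recognises Author nodes is not described here, so the filter is left to the caller.

```python
from typing import Callable, Dict, List, Tuple


def to_projector_rows(
    vectors: Dict[str, List[float]],
    labels: Dict[str, str],
    keep: Callable[[str], bool] = lambda label: True,
) -> Tuple[List[str], List[str]]:
    """Build aligned vector rows and label rows for the Projector.

    `keep` is a hypothetical hook for filtering nodes by label.
    """
    vector_rows: List[str] = []
    label_rows: List[str] = []
    for node_id, vector in vectors.items():
        # Fall back to the node id when no label is known.
        label = labels.get(node_id, node_id)
        if not keep(label):
            continue
        vector_rows.append("\t".join(str(value) for value in vector))
        label_rows.append(label)
    return vector_rows, label_rows


vector_rows, label_rows = to_projector_rows(
    {"0": [0.1, 0.2], "1": [0.3, 0.4]},
    {"0": "Alice", "1": "Bob"},
)
```

Writing each list to its own file, one row per line, yields a vectors file and a labels file whose lines match one to one.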
Use the TensorFlow Projector to visualise the embeddings by loading the two files (embeddings and labels).