This project focuses on implementing a graph-oriented database using Neo4j to explore and analyze citation networks.
The dataset used in this project is a network of academic papers and their associated metadata.
The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and other sources. Each paper is associated with an abstract, authors, year, venue, and title.
The dataset can be used for clustering with network and side information, studying influence in the citation network, finding the most influential papers, topic modeling analysis, and more.
DBLP-Citation-network V12: 4,894,081 papers and 45,564,149 citation relationships (2020-04-09)
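Conceptually, each record carries the fields listed above. The Python dictionary below is purely illustrative; the field names come from the description above, and the exact schema and nesting of dblp.v12.json may differ:

```python
# Illustrative only: a paper record with the fields described above.
# The real dblp.v12.json schema may name or nest these fields differently.
example_paper = {
    "title": "Example Paper Title",
    "abstract": "Short summary of the paper...",
    "authors": ["Author One", "Author Two"],
    "year": 2018,
    "venue": "Example Conference",
}
```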
The project uses a Neo4j database to store the data; it can be deployed locally or on a cloud service. For local deployment, download and install Neo4j Desktop.
Create a new database in Neo4j Desktop and set the following configurations:
- Database Name: citation-network
- Password: <db_pass>
- User: <db_user>
- Port: 7687
- URI: localhost
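Once the database is created and started, you can confirm it is reachable from Python with the official neo4j driver (`pip install neo4j`). This is a minimal sketch assuming the Bolt port and the credentials configured above:

```python
from neo4j import GraphDatabase

# Assumed values: adjust to the user/password you set in Neo4j Desktop.
URI = "bolt://localhost:7687"
AUTH = ("<db_user>", "<db_pass>")

driver = GraphDatabase.driver(URI, auth=AUTH)
driver.verify_connectivity()  # raises an exception if the database is not reachable
print("Connected to Neo4j at", URI)
driver.close()
```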
Open the database configuration (neo4j.conf) and adjust the memory settings. The heap is used for query execution, while the page cache holds the graph store files in memory, so leave a few GB free for the operating system. On a PC with 16 GB of RAM, I used the following values:
server.memory.heap.initial_size=8g
server.memory.heap.max_size=8g
server.memory.pagecache.size=6g
Download the dataset from Kaggle and extract 'dblp.v12.json' into the ./dataset folder.
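The file is large, so a quick sanity check is easier with a streaming parser. The sketch below uses the third-party ijson package (not part of the project's requirements) and assumes the file is a single top-level JSON array of paper objects with a "title" field:

```python
import ijson  # streaming JSON parser: pip install ijson

# Assumption: dblp.v12.json is one large JSON array of paper objects.
with open("./dataset/dblp.v12.json", "rb") as f:
    for i, paper in enumerate(ijson.items(f, "item")):
        print(paper.get("title"))  # print a few titles as a sanity check
        if i >= 4:
            break
```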
Create a .env file in the project's root folder and add the following variables with the corresponding values:
DB_URI="localhost:7687"
DB_NAME="citation-network"
DB_PASS="<db_pass>"
DB_USER="<db_user>"
# Optional. If you want to use a test database.
TEST_DB_NAME="citation-network-test"
TEST_DB_PASS="<test_db_pass>"
TEST_DB_USER="<test_db_user>"
DATASET_PATH="./dataset/dblp.v12.json" # If you downloaded a different version, update the file name.
# Optional. If not set, the default values are used.
BATCH_SIZE_PAPER_NODES=5000
BATCH_SIZE_REQUIRED_NODES=10000
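The project's scripts read these values from the environment. Purely as an illustration of how they might be consumed (not the project's actual code), the sketch below loads the .env file with python-dotenv and opens a session against the configured database; it can be run once the dependencies below are installed, and it assumes the Bolt scheme must be prepended to DB_URI:

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv
from neo4j import GraphDatabase

load_dotenv()  # reads the .env file in the current working directory

# DB_URI is stored without a scheme in this .env, so prepend bolt:// here.
uri = f"bolt://{os.getenv('DB_URI')}"
auth = (os.getenv("DB_USER"), os.getenv("DB_PASS"))

driver = GraphDatabase.driver(uri, auth=auth)
with driver.session(database=os.getenv("DB_NAME")) as session:
    count = session.run("MATCH (n) RETURN count(n) AS n").single()["n"]
    print(f"Nodes currently in the database: {count}")
driver.close()
```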
Open a terminal in the project's root folder and run:
python -m venv .venv
Activate the virtual environment. On Windows:
.venv\Scripts\activate
On Linux/macOS:
source .venv/bin/activate
Install the required dependencies by running:
pip install -r requirements.pip
Apply the labels to the database by running the commands below and following the prompts in the terminal:
cd database/utils
install_labels.bat
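The actual labels and constraints are defined by the script above. Purely as an illustration of the kind of statement such a step runs, a uniqueness constraint on paper IDs could be created from Python like this; the Paper label and id property are assumptions, and the Cypher uses Neo4j 5 syntax:

```python
from neo4j import GraphDatabase

# Hypothetical example: the real labels/constraints come from install_labels.bat.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("<db_user>", "<db_pass>"))
with driver.session(database="citation-network") as session:
    # Ensure each Paper node has a unique id (Neo4j 5 syntax;
    # Neo4j 4.x uses "ON (p:Paper) ASSERT p.id IS UNIQUE" instead).
    session.run(
        "CREATE CONSTRAINT paper_id_unique IF NOT EXISTS "
        "FOR (p:Paper) REQUIRE p.id IS UNIQUE"
    )
driver.close()
```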
Load the data into the database by running the script below and following the prompts in the terminal:
python database/populate_db_batches.py
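The script writes nodes in batches (see BATCH_SIZE_PAPER_NODES above). As a simplified sketch of that pattern, not the project's actual implementation, batched writes with UNWIND look roughly like this; the Paper label and the id/title/year fields are assumptions:

```python
from neo4j import GraphDatabase

BATCH_SIZE = 5000  # mirrors BATCH_SIZE_PAPER_NODES

def write_papers(tx, rows):
    # MERGE one Paper node per row; UNWIND keeps it to a single query per batch.
    tx.run(
        "UNWIND $rows AS row "
        "MERGE (p:Paper {id: row.id}) "
        "SET p.title = row.title, p.year = row.year",
        rows=rows,
    )

def load_in_batches(papers, uri, auth, db):
    # papers: an iterable of dicts streamed from the JSON file.
    driver = GraphDatabase.driver(uri, auth=auth)
    with driver.session(database=db) as session:
        batch = []
        for paper in papers:
            batch.append(paper)
            if len(batch) >= BATCH_SIZE:
                # execute_write requires neo4j driver 5+; 4.x uses write_transaction.
                session.execute_write(write_papers, batch)
                batch = []
        if batch:
            session.execute_write(write_papers, batch)
    driver.close()
```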