Citation Network Graph DataBase


About

This project focuses on implementing a graph-oriented database using Neo4j to explore and analyze citation networks.

Dataset

The dataset used in this project is a network of articles and their related information.

The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and other sources. Each paper is associated with an abstract, authors, a year, a venue, and a title.

The dataset can be used for clustering with network and side information, studying influence in the citation network, finding the most influential papers, topic modeling analysis, etc.

DBLP-Citation-network V12: 4,894,081 papers and 45,564,149 citation relationships (2020-04-09)
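
To make the record shape concrete, here is a minimal Python sketch of one paper and the citation edges derived from it. The field names (`id`, `references`, and so on) are assumptions based on the fields listed above; they are not a confirmed schema of the dataset or of this project's graph model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Paper:
    # Hypothetical field names; check them against the JSON you downloaded.
    id: int
    title: str
    year: Optional[int] = None
    venue: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    abstract: Optional[str] = None
    references: List[int] = field(default_factory=list)


def citation_edges(paper: Paper) -> List[Tuple[int, int]]:
    """Return (citing_id, cited_id) pairs for one paper."""
    return [(paper.id, cited_id) for cited_id in paper.references]


paper = Paper(id=1, title="Example paper", year=2020, references=[2, 3])
print(citation_edges(paper))  # [(1, 2), (1, 3)]
```

Each pair corresponds to one citation relationship in the graph, so the 45,564,149 relationships above are the total number of such pairs across all papers.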

Deployment

For deployment, the project uses a Neo4j database to store the data. The database can be deployed locally or on a cloud service. For local deployment, download and install Neo4j Desktop.

Database Configuration

Create a new database in Neo4j Desktop and set the following configurations:

  • Database Name: citation-network
  • Password: <db_pass>
  • User: <db_user>
  • Port: 7687
  • URI: localhost

Open the database configuration and set the memory options below.
For this project, on a PC with 16 GB of RAM, I used the following values:

server.memory.heap.initial_size=8g
server.memory.heap.max_size=8g
server.memory.pagecache.size=6g
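
As a rough sanity check when adapting these values to a different machine, the sketch below computes how much RAM is left after the heap and the page cache. It is only an approximation: Neo4j and the operating system use memory beyond these two settings, and `parse_size` and `headroom_gib` are hypothetical helpers, not part of this project.

```python
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(value: str) -> int:
    """Convert a size such as '8g' or '512m' to bytes."""
    value = value.strip().lower()
    if value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)


def headroom_gib(total_ram: str, heap_max: str, pagecache: str) -> float:
    """RAM left for the OS and other processes, in GiB."""
    left = parse_size(total_ram) - parse_size(heap_max) - parse_size(pagecache)
    return left / UNITS["g"]


print(headroom_gib("16g", "8g", "6g"))  # 2.0
```

With the values above, 8g of heap plus 6g of page cache leaves about 2 GB for everything else on a 16 GB machine.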

Download the Dataset

Download the dataset from Kaggle and extract 'dblp.v12.json' into the ./dataset folder.

Setting the Environment

Create a .env file in the project's root folder and add the following variables with the corresponding values:

DB_URI="localhost:7687"

DB_NAME="citation-network"
DB_PASS="<db_pass>"
DB_USER="<db_user>"

# Optional. If you want to use a test database.
TEST_DB_NAME="citation-network-test"
TEST_DB_PASS="<test_db_pass>"
TEST_DB_USER="<test_db_user>"

DATASET_PATH="./dataset/dblp.v12.json" # If you downloaded a different version, update the file name.

# Optional. If not set, the default values are used.
BATCH_SIZE_PAPER_NODES=5000
BATCH_SIZE_REQUIRED_NODES=10000
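
For illustration, the sketch below shows one way a script could read these variables once they are in the process environment. It is not the project's own loader: `Settings` and `load_settings` are hypothetical names, and the fallback batch sizes are assumed to match the example values above, since the actual defaults are not stated here.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    db_uri: str
    db_name: str
    db_user: str
    db_pass: str
    dataset_path: str
    batch_size_paper_nodes: int
    batch_size_required_nodes: int


def load_settings(env=os.environ) -> Settings:
    """Build Settings from a mapping of environment variables.

    Required variables raise KeyError when missing; the two batch sizes
    fall back to assumed defaults.
    """
    return Settings(
        db_uri=env["DB_URI"],
        db_name=env["DB_NAME"],
        db_user=env["DB_USER"],
        db_pass=env["DB_PASS"],
        dataset_path=env["DATASET_PATH"],
        batch_size_paper_nodes=int(env.get("BATCH_SIZE_PAPER_NODES", 5000)),
        batch_size_required_nodes=int(env.get("BATCH_SIZE_REQUIRED_NODES", 10000)),
    )


# Example with an explicit mapping instead of the real environment.
example = {
    "DB_URI": "localhost:7687",
    "DB_NAME": "citation-network",
    "DB_USER": "<db_user>",
    "DB_PASS": "<db_pass>",
    "DATASET_PATH": "./dataset/dblp.v12.json",
}
print(load_settings(example).batch_size_paper_nodes)  # 5000
```

Failing fast on a missing required variable makes a misconfigured .env file visible before any data is read.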

Virtual Environment

Open a terminal in the project's root folder and run:

python -m venv .venv

Activate the virtual environment (the command below is for Windows; on Linux or macOS, use source .venv/bin/activate instead):

.venv\Scripts\activate

Install Dependencies

Install the required dependencies by running:

pip install -r requirements.pip

Apply Constraints and Relations to the Database

Apply the labels to the database by running the commands below and following the instructions in the terminal:

cd database/utils
install_labels.bat

Load Data into the Database

Load the data into the database by running the command below and following the instructions in the terminal:

python database/populate_db_batches.py
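
The script name and the BATCH_SIZE_* variables suggest that records are written in fixed-size batches rather than one at a time. Below is a minimal sketch of that batching idea. It assumes nothing about what populate_db_batches.py does internally, and `batched` is a hypothetical helper.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


for batch in batched(range(7), 3):
    print(batch)
# [0, 1, 2]
# [3, 4, 5]
# [6]
```

Because the helper consumes any iterable lazily, it can sit on top of a streaming reader, so a multi-gigabyte file never has to be held in memory all at once. Larger batches mean fewer round trips to the database but more memory per transaction.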

Useful Links

Websites

Documentation

Downloads

Courses