This sample code shows how to import and work with the Open Academic Graph (OAG) snapshot of the Microsoft Academic Graph on various data platforms: SQL Server / Azure SQL DB, Azure Cosmos DB, Apache Spark, and so on. For now, the repo only contains scripts for SQL Server / Azure SQL DB.
You need a client VM to run the custom bulk import tool. Ideally, this VM should have the same number of vCPUs as the VM running SQL Server, or be sized to match the Azure SQL DB database. I use E32s_v3 VMs for both the SQL Server instance and the client, with a 1 TB managed disk on the SQL Server VM to hold the database. For Azure SQL DB, I recommend a Gen 5 database with 32 vCores. For best performance with Azure SQL DB, make sure the client VM is in the same region as the database. For increased security, use a VNET service endpoint plus firewall rules so that only that VNET can access Azure SQL DB.
If you choose to use a SQL Server instance, start by running the code in 0_CreateDB.sql. Ideally you need a large VM (I tested with 32-vCPU VMs in Azure). If you do use a smaller VM size, adjust the number of data files, the number of threads in the custom importer tool, and similar settings accordingly. The code parallelizes well, so tune these parameters to the hardware at your disposal.
If you choose to run against Azure SQL DB, just create a database and skip 0_CreateDB.sql.
Next, create the tables and other objects by running the scripts 1_CreateGraphTables.sql and 2_ConvertToGraph.sql. On Azure SQL DB, skip the "USE [OpenAcademicGraph]" lines.
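
If you would rather not edit the scripts by hand for Azure SQL DB, the step of skipping the "USE [OpenAcademicGraph]" lines can be sketched roughly as follows. This is a minimal Python illustration and not part of the repo; the function name `strip_use_statements` is hypothetical, and it assumes each USE statement sits on a line of its own.

```python
import re

# Matches a line that only switches database context, for example
# "USE [OpenAcademicGraph]" or "use OpenAcademicGraph;".
# Assumption: the USE statement is alone on its line.
USE_LINE = re.compile(r"^\s*USE\s+\[?\w+\]?\s*;?\s*$", re.IGNORECASE)


def strip_use_statements(sql_text: str) -> str:
    """Return the script text with every standalone USE line removed."""
    kept = [line for line in sql_text.splitlines() if not USE_LINE.match(line)]
    return "\n".join(kept)
```

You could read 1_CreateGraphTables.sql or 2_ConvertToGraph.sql into a string, pass it through this function, and run the result against the Azure SQL DB database you created.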
On the client VM, download and extract the OAG v1 files from https://www.openacademic.ai/oag/. The repo includes a helper PowerShell script for this (download_client.ps1).
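
If PowerShell is not available on the client, the extraction step could be approximated as below. This is only a sketch under assumptions: it presumes the downloaded OAG v1 files are .zip archives already sitting in a local folder, and the function name and folder arguments are hypothetical. It does not reproduce what download_client.ps1 does.

```python
import zipfile
from pathlib import Path


def extract_archives(download_dir: str, target_dir: str) -> list[str]:
    """Extract every .zip archive in download_dir into target_dir.

    Returns the names of all extracted members so the caller can check
    that the expected TXT files are present.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    # Assumption: archives use the .zip extension; sorted for a stable order.
    for archive in sorted(Path(download_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
            extracted.extend(zf.namelist())
    return extracted
```

The target folder is then the path you point the bulk load scripts at.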
The actual bulk load is done by running either ParseAndExplodeBlockingCollection_FastMember.linq or ParseAndExplodeBlockingCollection_FastMember_SQLDB.linq in LINQPad. Before running either script, check:
- the path to the extracted OAG v1 TXT files
- the connection string to connect to SQL Server / Azure SQL DB
- the number of threads to use (the default is 30)
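
To show why the thread count matters, here is a rough Python sketch of the producer/consumer pattern that the script names (a blocking collection feeding parallel workers) suggest. It is not the LINQPad code. The JSON field names (`id`, `authors`, `name`), the `write_batch` callback, and the batch and queue sizes are assumptions made for illustration; only the default of 30 workers comes from the text above.

```python
import json
import queue
import threading

SENTINEL = None  # tells one worker that no more input is coming


def parse_line(line):
    """Parse one JSON record and 'explode' it into (paper_id, author_name) rows.

    Assumption: each input line is a JSON object with "id" and "authors" fields.
    """
    record = json.loads(line)
    return [(record["id"], a.get("name")) for a in record.get("authors", [])]


def run_pipeline(lines, write_batch, num_workers=30, batch_size=1000, queue_size=10000):
    """Feed lines through a bounded queue to num_workers parsing threads.

    write_batch is a caller-supplied function that stores a list of rows,
    for example by bulk inserting them; it must be safe to call concurrently.
    """
    work = queue.Queue(maxsize=queue_size)  # bounded, so the reader cannot outrun the writers

    def worker():
        batch = []
        while True:
            line = work.get()
            if line is SENTINEL:
                break
            batch.extend(parse_line(line))
            if len(batch) >= batch_size:
                write_batch(batch)
                batch = []
        if batch:  # flush the final partial batch
            write_batch(batch)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for line in lines:
        work.put(line)
    for _ in threads:
        work.put(SENTINEL)  # one sentinel per worker
    for t in threads:
        t.join()
```

With fewer vCPUs on the client or the server, lowering `num_workers` in a design like this reduces contention. The sketch has no error handling, so a real loader would also need to deal with workers that fail part-way through.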
Basic search is in the 4_BFS.sql file. The PageRank implementation is in 5_PageRank.sql.
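
For readers unfamiliar with the two algorithms, the sketch below shows them in plain Python on a small in-memory graph. It is a generic illustration, not a translation of the T-SQL in the repo: it assumes, from the file name, that the basic search is a breadth-first search, and the adjacency-dictionary representation, the damping factor of 0.85, and the fixed iteration count are all choices made for this example.

```python
from collections import deque


def bfs_shortest_path(graph, start, goal):
    """Return the shortest path from start to goal as a list of nodes, or None."""
    parents = {start: None}  # also serves as the visited set
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                frontier.append(neighbor)
    return None


def pagerank(graph, damping=0.85, iterations=50):
    """Compute PageRank by power iteration over a dict of node -> list of targets."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not graph.get(n))
        new_rank = {n: (1.0 - damping + damping * dangling) / len(nodes) for n in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

For example, with `graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}`, each key can be read as a paper and each list as the papers it cites. `bfs_shortest_path(graph, "D", "B")` follows the citation chain from D to B, and `pagerank(graph)` gives the highest score to C, which is cited by all three other papers.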
References for the underlying data sets:
- Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. 2008. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD ’08). 990-998.
- Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15 Companion). ACM, New York, NY, USA, 243-246.