Benchmark Datasets

Dataset Details

  1. Graph Datasets

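    | Dataset  | Samples | Dimension | Edges  | Classes | URL          |
    |----------|---------|-----------|--------|---------|--------------|
    | CORA     | 2708    | 1433      | 5278   | 7       | cora.zip     |
    | CITESEER | 3327    | 3703      | 4552   | 6       | citeseer.zip |
    | PUBMED   | 19717   | 500       | 44325  | 3       | pubmed.zip   |
    | DBLP     | 4057    | 334       | 3528   | 4       | dblp.zip     |
    | CITE     | 3327    | 3703      | 4552   | 6       | cite.zip     |
    | ACM      | 3025    | 1870      | 13128  | 3       | acm.zip      |
    | AMAP     | 7650    | 745       | 119081 | 8       | amap.zip     |
    | AMAC     | 13752   | 767       | 245861 | 10      | amac.zip     |
    | CORAFULL | 19793   | 8710      | 63421  | 70      | corafull.zip |
    | WIKI     | 2405    | 4973      | 8261   | 19      | wiki.zip     |
    | COCS     |         |           |        |         |              |
    | BAT      | 131     | 81        | 1038   | 4       | bat.zip      |
    | EAT      | 399     | 203       | 5994   | 4       | eat.zip      |
    | UAT      | 1190    | 239       | 13599  | 4       | uat.zip      |
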
  2. Non-graph Datasets

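    | Dataset | Samples | Dimension | Type   | Classes | URL      |
    |---------|---------|-----------|--------|---------|----------|
    | USPS    | 9298    | 256       | Image  | 10      | usps.zip |
    | HHAR    | 10299   | 561       | Record | 6       | hhar.zip |
    | REUT    | 10000   | 2000      | Text   | 4       | reut.zip |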

Dataset Introduction

Cora

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
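To make the 0/1-valued word vector concrete, here is a minimal sketch of how such a feature vector could be built. The toy dictionary and paper text are hypothetical and only illustrate the encoding; they are not taken from the dataset files.

```python
def binary_word_vector(document_words, dictionary):
    """Return a 0/1 vector: 1 where the dictionary word occurs in the document."""
    present = set(document_words)
    return [1 if word in present else 0 for word in dictionary]


# Hypothetical toy dictionary (the real Cora dictionary has 1433 words).
dictionary = ["graph", "neural", "learning", "genetic"]
paper = "neural networks for graph learning on graph data".split()

vector = binary_word_vector(paper, dictionary)
print(vector)  # [1, 1, 1, 0]
```

Repeated words do not change the result: the vector records only presence or absence, not counts.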

Citeseer

The Citeseer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words.

Pubmed

The Pubmed dataset consists of 19717 scientific publications from the PubMed database pertaining to diabetes, classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words.
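As an illustration of TF/IDF weighting, the sketch below computes one common variant: term frequency normalised by document length, multiplied by the log inverse document frequency. Which exact variant was used to build the Pubmed features is not stated here, so treat the formula, the function name, and the toy documents as assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(documents, dictionary):
    """Compute tf * log(N / df) for each dictionary word in each document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    vectors = []
    for doc in documents:
        counts = Counter(doc)
        row = []
        for word in dictionary:
            if counts[word] == 0:
                row.append(0.0)  # word absent from this document
                continue
            tf = counts[word] / len(doc)
            idf = math.log(n_docs / doc_freq[word])
            row.append(tf * idf)
        vectors.append(row)
    return vectors


# Hypothetical toy corpus (the real dictionary has 500 words).
docs = [["insulin", "glucose", "insulin"], ["glucose", "diet"]]
dictionary = ["insulin", "glucose", "diet"]
vectors = tfidf_vectors(docs, dictionary)
```

Under this variant a word that appears in every document (here "glucose") gets weight 0, which is why TF/IDF features emphasise words that distinguish one publication from another.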

DBLP

This is an author network from the DBLP dataset. There is an edge between two authors if they have co-authored a paper. The authors are divided into four areas: database, data mining, machine learning and information retrieval. We label each author’s research area according to the conferences they submitted to. Author features are a bag-of-words representation of keywords.
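The co-author rule can be sketched as follows: every pair of authors that shares at least one paper gets a single undirected edge. The author names and the input layout (one author list per paper) are hypothetical; the actual preprocessing of the DBLP data may differ.

```python
from itertools import combinations


def coauthor_edges(papers):
    """Build an undirected co-author edge set from per-paper author lists."""
    edges = set()
    for authors in papers:
        # Sort so that each undirected pair has one canonical form.
        for a, b in combinations(sorted(set(authors)), 2):
            edges.add((a, b))
    return edges


# Hypothetical toy input: each inner list is the author list of one paper.
papers = [
    ["alice", "bob", "carol"],
    ["bob", "dave"],
    ["alice", "bob"],
]
edges = coauthor_edges(papers)
print(sorted(edges))
# [('alice', 'bob'), ('alice', 'carol'), ('bob', 'carol'), ('bob', 'dave')]
```

Storing edges in a set means that authors who collaborate on several papers still contribute only one edge.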

ACM

This is a paper network from the ACM dataset. There is an edge between two papers if they are written by the same author. Paper features are the bag-of-words of the keywords. We select papers published in KDD, SIGMOD, SIGCOMM, MobiCOMM and divide the papers into three classes (database, wireless communication, data mining) by their research area.

AMAP & AMAC

A-Photo (AMAP) and A-Computers (AMAC) are extracted from the Amazon co-purchase graph, where nodes represent products, edges indicate that two products are frequently co-purchased, features are product reviews encoded as bag-of-words vectors, and labels are predefined product categories.

CORAFULL

WIKI

Wikipedia (WIKI) is an online encyclopedia created and edited by volunteers around the world. The dataset is a word co-occurrence network constructed from the entire set of English Wikipedia pages. It contains 2405 nodes, 17981 edges and 19 labels.

COCS

Coauthor-CS and Coauthor-Physics are two academic networks of co-authorship relationships based on the Microsoft Academic Graph. Nodes in these graphs denote authors, and edges denote co-authorship. Authors are classified into 15 and 5 classes, respectively, based on their research field, and the node feature is a bag-of-words representation of the paper keywords.

BAT

The data were collected from the National Civil Aviation Agency (ANAC) and cover January to December 2016. The network has 131 nodes and 1,038 edges (diameter 5). Airport activity is measured by the total number of landings plus takeoffs in the corresponding year.

EAT

The data were collected from the Statistical Office of the European Union (Eurostat) and cover January to November 2016. The network has 399 nodes and 5,995 edges (diameter 5). Airport activity is measured by the total number of landings plus takeoffs in the corresponding period.

UAT

The data were collected from the Bureau of Transportation Statistics and cover January to October 2016. The network has 1,190 nodes and 13,599 edges (diameter 8). Airport activity is measured by the total number of people who passed through (arrived at plus departed from) the airport in the corresponding period.
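The diameter quoted for each air-traffic network is the longest shortest path between any two airports, counted in hops. A minimal sketch of how it could be computed with breadth-first search is shown below. It assumes a connected, unweighted graph stored as an adjacency dictionary, and the toy network is hypothetical.

```python
from collections import deque


def eccentricity(adjacency, source):
    """Longest shortest-path distance (in hops) from source, found by BFS."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return max(distance.values())


def diameter(adjacency):
    """Diameter of a connected, unweighted graph: the largest eccentricity."""
    return max(eccentricity(adjacency, node) for node in adjacency)


# Hypothetical toy network: path A-B-C-D plus a shortcut B-D.
adjacency = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B", "D"],
    "D": ["B", "C"],
}
print(diameter(adjacency))  # 2
```

Running one BFS per node costs O(V * (V + E)), which is cheap for graphs of this size (at most 1,190 nodes and 13,599 edges).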

If you find this repository useful for your research or work, a star would be really appreciated. ❤️