Benchmark Datasets

Dataset Details

Graph Datasets

Dataset	Samples	Dimension	Edges	Classes	URL
CORA	2708	1433	5278	7	cora.zip
CITESEER	3327	3703	4552	6	citeseer.zip
PUBMED	19717	500	44325	3	pubmed.zip
DBLP	4057	334	3528	4	dblp.zip
CITE	3327	3703	4552	6	cite.zip
ACM	3025	1870	13128	3	acm.zip
AMAP	7650	745	119081	8	amap.zip
AMAC	13752	767	245861	10	amac.zip
CORAFULL	19793	8710	63421	70	corafull.zip
WIKI	2405	4973	8261	19	wiki.zip
COCS
BAT	131	81	1038	4	bat.zip
EAT	399	203	5994	4	eat.zip
UAT	1190	239	13599	4	uat.zip
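As a rough illustration of how the table columns relate to the underlying data, the sketch below derives Samples, Dimension, Edges, and Classes from a feature matrix, an adjacency matrix, and a label list. The function name `dataset_stats` and the toy graph are hypothetical, and the sketch assumes that Edges counts each undirected edge once; the archives listed above may store the data in a different layout.

```python
def dataset_stats(features, adjacency, labels):
    """Return (samples, dimension, edges, classes) for one graph dataset."""
    n_samples = len(features)
    dimension = len(features[0]) if features else 0
    # Assumption: the graph is undirected, so each edge is counted once
    # by scanning only the upper triangle of the adjacency matrix.
    n_edges = sum(
        1
        for i in range(n_samples)
        for j in range(i + 1, n_samples)
        if adjacency[i][j]
    )
    n_classes = len(set(labels))
    return n_samples, dimension, n_edges, n_classes


# Hypothetical toy graph: 4 nodes, 3 features each, edges 0-1, 1-2, 2-3.
features = [[1, 0, 0], [0, 1, 0], [0, 1, 1], [1, 0, 1]]
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
labels = [0, 0, 1, 1]

stats = dataset_stats(features, adjacency, labels)
print(stats)  # (4, 3, 3, 2)
```

Counting directed links or duplicate citations instead would give a larger edge total for the same graph, which is one possible reason edge counts differ between sources.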

Non-graph Datasets

Dataset	Samples	Dimension	Type	Classes	URL
USPS	9298	256	Image	10	usps.zip
HHAR	10299	561	Record	6	hhar.zip
REUT	10000	2000	Text	4	reut.zip

Dataset Introduction

Cora

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
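The 0/1-valued word vector described above can be sketched as follows. This is a minimal illustration only: the function `binary_bow`, the four-word dictionary, and the sample paper are hypothetical and are not taken from the actual Cora preprocessing.

```python
def binary_bow(document_words, dictionary):
    """Encode a document as a 0/1 vector over a fixed dictionary."""
    present = set(document_words)
    # 1 if the dictionary word occurs in the document, 0 otherwise.
    # Repeated occurrences do not change the value.
    return [1 if word in present else 0 for word in dictionary]


# Hypothetical dictionary and paper; Cora's real dictionary has 1433 words.
dictionary = ["graph", "neural", "network", "cluster"]
paper = ["neural", "network", "neural"]

vector = binary_bow(paper, dictionary)
print(vector)  # [0, 1, 1, 0]
```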

Citeseer

The Citeseer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words.

Pubmed

The Pubmed dataset consists of 19717 scientific publications from the PubMed database pertaining to diabetes, classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words.
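For readers unfamiliar with TF/IDF weighting, here is a minimal sketch of one common variant (term frequency normalized by document length, multiplied by the log of inverse document frequency). The exact weighting used to build Pubmed is not stated here, so the formula, the function name `tfidf_vectors`, and the sample documents are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(documents, dictionary):
    """Return one TF-IDF vector per document over a fixed dictionary."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    vectors = []
    for doc in documents:
        counts = Counter(doc)
        row = []
        for word in dictionary:
            tf = counts[word] / len(doc) if doc else 0.0
            # Assumed variant: idf = log(N / df), and 0 for unseen words.
            idf = math.log(n_docs / doc_freq[word]) if doc_freq[word] else 0.0
            row.append(tf * idf)
        vectors.append(row)
    return vectors


# Hypothetical two-document corpus and three-word dictionary.
documents = [["insulin", "glucose", "insulin"], ["glucose", "diet"]]
dictionary = ["insulin", "glucose", "diet"]

vectors = tfidf_vectors(documents, dictionary)
print(vectors)
```

In this toy corpus "glucose" appears in every document, so its weight is zero everywhere, while words specific to one document receive positive weights.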

DBLP

This is an author network from the DBLP dataset. There is an edge between two authors if they have coauthored a paper. The authors are divided into four areas: database, data mining, machine learning, and information retrieval. We label each author's research area according to the conferences they submitted to. Author features are a bag-of-words representation of keywords.
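The coauthor edge rule can be sketched as below. The function `coauthor_edges` and the author lists are hypothetical; this shows only the general construction, not the preprocessing actually used for the DBLP archive.

```python
from itertools import combinations


def coauthor_edges(papers):
    """Build undirected author-author edges from per-paper author lists."""
    edges = set()
    for authors in papers:
        # Every pair of distinct authors on the same paper gets one edge.
        # Sorting makes (a, b) and (b, a) collapse to the same tuple.
        for a, b in combinations(sorted(set(authors)), 2):
            edges.add((a, b))
    return edges


# Hypothetical author lists, one per paper.
papers = [["alice", "bob"], ["bob", "carol", "alice"], ["dave"]]

edges = coauthor_edges(papers)
print(sorted(edges))
# [('alice', 'bob'), ('alice', 'carol'), ('bob', 'carol')]
```

A single-author paper contributes no edges, so "dave" would be an isolated node in this toy network.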

ACM

This is a paper network from the ACM dataset. There is an edge between two papers if they are written by the same author. Paper features are a bag-of-words representation of the keywords. We select papers published in KDD, SIGMOD, SIGCOMM, and MobiCOMM and divide them into three classes (database, wireless communication, data mining) by research area.
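Linking papers through shared authors can be sketched by first grouping papers per author and then connecting every pair within a group. The function `shared_author_edges` and the paper identifiers are hypothetical and serve only to illustrate the rule described above.

```python
from collections import defaultdict
from itertools import combinations


def shared_author_edges(paper_authors):
    """Connect two papers whenever they have at least one author in common."""
    # Invert the mapping: author -> set of papers they wrote.
    papers_by_author = defaultdict(set)
    for paper, authors in paper_authors.items():
        for author in authors:
            papers_by_author[author].add(paper)

    edges = set()
    for papers in papers_by_author.values():
        # Sorting keeps each undirected edge in one canonical orientation.
        for p, q in combinations(sorted(papers), 2):
            edges.add((p, q))
    return edges


# Hypothetical paper -> authors mapping.
paper_authors = {
    "p1": ["alice", "bob"],
    "p2": ["bob"],
    "p3": ["carol"],
    "p4": ["alice", "carol"],
}

paper_edges = shared_author_edges(paper_authors)
print(sorted(paper_edges))
# [('p1', 'p2'), ('p1', 'p4'), ('p3', 'p4')]
```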

AMAP & AMAC

A-Computers (AMAC) and A-Photo (AMAP) are extracted from the Amazon co-purchase graph, where nodes represent products, edges indicate that two products are frequently co-purchased, features are product reviews encoded as bag-of-words, and labels are predefined product categories.

CORAFULL

WIKI

Wikipedia (WIKI) is an online encyclopedia created and edited by volunteers around the world. The dataset is a word co-occurrence network constructed from the entire set of English Wikipedia pages. It contains 2405 nodes, 17981 edges, and 19 labels.

COCS

Coauthor-CS and Coauthor-Physics are two academic networks built from co-authorship relationships in the Microsoft Academic Graph. Nodes in these graphs denote authors, and edges denote co-authorship. Authors are classified into 15 classes (Coauthor-CS) and 5 classes (Coauthor-Physics) based on their research field, and the node feature is a bag-of-words representation of the paper keywords.

BAT

Data collected from the National Civil Aviation Agency (ANAC) from January to December 2016. It has 131 nodes, 1,038 edges (diameter is 5). Airport activity is measured by the total number of landings plus takeoffs in the corresponding year.

EAT

Data collected from the Statistical Office of the European Union (Eurostat) from January to November 2016. It has 399 nodes, 5,995 edges (diameter is 5). Airport activity is measured by the total number of landings plus takeoffs in the corresponding period.

UAT

Data collected from the Bureau of Transportation Statistics from January to October 2016. It has 1,190 nodes, 13,599 edges (diameter is 8). Airport activity is measured by the total number of people that passed through (arrived plus departed) the airport in the corresponding period.
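The diameter quoted for BAT, EAT, and UAT is the longest shortest-path distance between any two nodes. A minimal sketch of how it could be computed with breadth-first search is shown below; it assumes an unweighted, undirected, connected graph stored as an adjacency dictionary, and the names `eccentricity` and `diameter` as well as the toy graph are hypothetical.

```python
from collections import deque


def eccentricity(adj, source):
    """Largest hop distance from source to any reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return max(dist.values())


def diameter(adj):
    """Diameter of a connected graph: the maximum eccentricity."""
    # Assumption: the graph is connected; unreachable nodes are ignored.
    return max(eccentricity(adj, node) for node in adj)


# Hypothetical path graph 0 - 1 - 2 - 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

d = diameter(adj)
print(d)  # 3
```

Running one BFS per node costs O(V * (V + E)), which is inexpensive for graphs of the sizes listed for the three airport datasets.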

If you find this repository useful for your research or work, a star would be much appreciated. ❤️
