-
Kaggle - the leading platform for predictive modeling competitions
-
UCI MLR - UC Irvine Machine Learning Repository
-
google.com/publicdata - public data maintained by Google
-
Freebase - A community-curated database of well-known people, places, and things
-
mldata.org - machine learning data set repository for uploading and finding data sets
-
Infochimps - a huge collection of large-sized data sets
-
Amazon Web Services - Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.
-
Databib - a searchable catalog / registry / directory / bibliography of research data repositories.
-
figshare - an online digital repository where researchers can preserve and share their research outputs, including figures, datasets, images, and videos.
-
reddit r/datasets - datasets shared on reddit
-
datahub - the free, powerful data management platform from the Open Knowledge Foundation
-
Quandl - a search engine for numerical data
-
enigma - a search engine for public records published by governments, companies and organizations.
-
Tiny Images Dataset - a dataset of 79,302,017 images, each being a 32x32 color image
-
Mobio - bi-modal (audio and video) data taken from 152 people
-
1000 Genomes Project - A Deep Catalog of Human Genetic Variation
-
The Wayback Machine - 80 terabytes of archived web crawl data available for research
-
ImageNet - a searchable image database
-
Social Network Analysis Interactive Dataset Library - an accessible library of many of the 'open' social network analysis datasets
-
Cancer Program Data Sets - a collection of genomic datasets
-
EconData - economic time series, produced by a number of U.S. Government agencies and distributed in a variety of formats and media
-
USGovXML - an index to publicly available web services and XML data sources provided by the US government
-
Titanic Survivors - dataset with 1313 samples and 10 features about Titanic survivors
-
SMS Spam Collection - A collection of 425 SMS spam messages manually extracted from the Grumbletext website
-
SNAP - Stanford Large Network Dataset Collection
-
Amazon Google Books Ngrams - A data set containing Google Books n-gram corpora
-
The Million Song Dataset - Audio features and metadata for a million contemporary popular music tracks.
-
Modeling Online Auctions - Datasets of bidding for different eBay auctions
-
CAT Dataset - A dataset of 10,000 cat images
-
Click Dataset - A large dataset of about 53.5 billion HTTP requests made by users at Indiana University
-
Meteorites - Registered meteorites that have struck Earth
-
Common Crawl 2012 web corpus - A hyperlink graph of 3.5 billion web pages and 128 billion hyperlinks between these pages
-
PyPI/Maven Dependency Data - State of the Maven/Java dependency graph and of the PyPI/Python dependency graph
-
NYPD Crash Data Band-Aid - NYPD traffic crash data as a geocoded CSV
-
Pass rates, race & gender - Detailed data on pass rates, race, and gender for 2013
-
Nominate/vote data - Datasets including all the D-NOMINATE and W-NOMINATE scores
-
aiHit Datasets - Information on 10,000 UK companies randomly sampled from the aiHit database
-
Amsterdam Library of Object Images (ALOI) - A color image collection of 1,000 small objects, recorded for scientific purposes