Kaggle - the leading platform for predictive modeling competitions.
UCI MLR - UC Irvine Machine Learning Repository
google.com/publicdata - public data maintained by Google
Freebase - A community-curated database of well-known people, places, and things
mldata.org - machine learning data set repository for uploading and finding data sets
Infochimps - a large collection of big data sets
Amazon Web Services - Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.
Databib - a searchable catalog / registry / directory / bibliography of research data repositories.
figshare - an online digital repository where researchers can preserve and share their research outputs, including figures, datasets, images, and videos.
reddit r/datasets - datasets shared on reddit
datahub - the free, powerful data management platform from the Open Knowledge Foundation
Quandl - a search engine for numerical data
enigma - a search engine for public records published by governments, companies and organizations.
Tiny Images Dataset - a dataset of 79,302,017 color images, each 32x32 pixels
Mobio - bi-modal (audio and video) data taken from 152 people
1000 Genomes Project - A Deep Catalog of Human Genetic Variation
The Wayback Machine - 80 terabytes of archived web crawl data available for research
ImageNet - a searchable image database
Social Network Analysis Interactive Dataset Library - an accessible library of many of the 'open' social network analysis datasets
Cancer Program Data Sets - a collection of genomic datasets
EconData - economic time series, produced by a number of U.S. Government agencies and distributed in a variety of formats and media
USGovXML - an index to publicly available web services and XML data sources that are provided by the US government
Titanic Survivors - dataset with 1313 samples and 10 features about Titanic survivors
SMS Spam Collection - A collection of 425 SMS spam messages manually extracted from the Grumbletext Web site
SNAP - Stanford Large Network Dataset Collection
Amazon Google Books Ngrams - A data set containing Google Books n-gram corpora
The Million Song Dataset - Audio features and metadata for a million contemporary popular music tracks.
Modeling Online Auctions - Datasets of bids from different eBay auctions
CAT Dataset - A dataset of 10,000 cat images
Click Dataset - A large dataset of about 53.5 billion HTTP requests made by users at Indiana University
Meteorites - Registered meteorites that have impacted Earth
Common Crawl 2012 web corpus - A hyperlink graph of 3.5 billion web pages and 128 billion hyperlinks between these pages
PyPi/Maven Dependency Data - State of the Maven/Java dependency graph and of the PyPI/Python dependency graph.
NYPD Crash Data Band-Aid - NYPD traffic crash data as a geocoded CSV
Pass rates, race & gender - Detailed data on pass rates, race, and gender for 2013
Nominate/vote data - Datasets including all the D-NOMINATE and W-NOMINATE scores
aiHit Datasets - Information on 10,000 UK companies randomly sampled from the aiHit DB
Amsterdam Library of Object Images (ALOI) - A color image collection of 1,000 small objects, recorded for scientific purposes