Name | Purpose | File Size | Link | |
---|---|---|---|---|
20 Newsgroups | Text of 20,000 messages taken from 20 Usenet newsgroups, for text analysis, classification, etc. | 61.6MB | http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html | |
Amazon Reviews | Over 142 million product reviews for sentiment analysis, recommender systems, and more. | 20GB | http://jmcauley.ucsd.edu/data/amazon/ | |
Football Strategy | Thousands of football scenarios, for choosing the best coaching decisions. | 876KB | https://www.crowdflower.com/wp-content/uploads/2016/03/Football-Scenarios-DFE-832307.csv | |
Horses for Courses | Horse-racing data for predicting race results. | 19MB | https://www.kaggle.com/lukebyrne/horses-for-courses | |
Human Activity Recognition with Smartphones | Sensor data for recognizing human activities such as walking and sitting. | 25MB | https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones | |
Labeled Faces in the Wild | 13,000 named faces for facial recognition. Multiple training and test sets. | 173MB | http://vis-www.cs.umass.edu/lfw/ | |
National Survey on Drug Use and Health | Predict drug use based on health survey questions. | 2GB | http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/34933 | |
NORB 3D Object Recognition | Binocular images of 50 toy figurines, for 3D object recognition from images. | Multiple files, over 5GB total | http://www.cs.nyu.edu/~ylclab/data/norb-v1.0/ | |
One Million Songs | Audio features and metadata for a 10,000-song subset of the Million Song Dataset, for recognition/classification. | 1.8GB | http://labrosa.ee.columbia.edu/millionsong/ | |
SMS Spam Collection | A collection of 5,574 SMS (text) messages, some spam, some normal, for spam filtering. | 204KB | http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ | |
Hate Speech Identification | A sampling of Twitter posts that have been judged based on whether they are offensive or contain hate speech, as a training set for text analysis. | 2.66MB | https://www.crowdflower.com/wp-content/uploads/2016/03/twitter-hate-speech-classifier-DFE-a845520.csv | |
Hidden Beauty of Flickr Pictures | 15,000 Flickr photo IDs that have received ratings based on aesthetics, for image analysis. | 138KB, use Flickr API to get images | http://www.di.unito.it/~schifane/dataset/beauty-icwsm15/ | |
Yahoo Instant Messenger Friends Connectivity Graph | Connections between Yahoo users who communicate with each other via Yahoo Messenger; can be used to identify key social contacts/influencers. Add dataset to cart to access. | 28MB | http://webscope.sandbox.yahoo.com/catalog.php?datatype=g | |
Record of Heart Sound | Recordings of normal and abnormal heartbeats, used to recognize heart murmurs, etc. | 47.7MB | http://mldata.org/repository/data/viewslug/record-of-heart-sound/ | |
Prostate Cancer | Tumor and nontumor samples, used to recognize prostate cancer. | 4.8MB | http://mldata.org/repository/data/viewslug/prostate-cancer/ | |
Wine Quality | Chemical properties of red and white wines (separately) and quality, for classification. | 3 files, 343KB total | http://archive.ics.uci.edu/ml/datasets/Wine+Quality | |
Mushroom Identification | For classifying hypothetical mushroom samples as edible or poisonous based on their characteristics. | 3 files, 480KB | http://archive.ics.uci.edu/ml/datasets/Mushroom | |
UFO Reports | 80,000 historic reports for classification or regression. This dataset has been standardized from the source data at nuforc.org. | 14.6MB | https://github.com/planetsig/ufo-reports | |
Militarized Interstate Disputes | Nearly 200 years of international threats, conflicts, etc. for modelling or prediction. Includes action taken, level of hostility, fatalities, and outcomes. | Multiple datasets, e.g., 962KB, 179KB | http://www.correlatesofwar.org/data-sets/MIDs | |
NBA & MLB Stats | Current and past season stats for teams and players for fantasy sports predictions. | Multiple datasets, e.g., 2016 MLB batters = 65KB | http://www.dougstats.com/ | |
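
Several of the smaller text datasets above, such as the SMS Spam Collection, can be read with the Python standard library alone. The sketch below assumes a plain-text file with one message per line in the form `label<TAB>message`; that layout, the helper names `load_labeled_messages` and `label_counts`, and the sample rows are illustrative assumptions, not taken from the dataset pages, so check the actual download before relying on it.

```python
import csv
import io
from collections import Counter

# Made-up sample rows in the assumed "label<TAB>message" layout.
# These are placeholders, not real records from any dataset listed above.
SAMPLE = (
    "ham\tSee you at lunch?\n"
    "spam\tYou won a prize, reply now\n"
    "ham\tRunning late, sorry\n"
)


def load_labeled_messages(stream):
    """Return a list of (label, text) pairs from a tab-separated text stream."""
    # QUOTE_NONE keeps quote characters inside messages as literal text.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        # Re-join any extra fields in case the message itself contained a tab.
        pairs.append((row[0], "\t".join(row[1:])))
    return pairs


def label_counts(pairs):
    """Count how many messages carry each label."""
    return Counter(label for label, _ in pairs)


pairs = load_labeled_messages(io.StringIO(SAMPLE))
counts = label_counts(pairs)
print(counts["ham"], counts["spam"])
```

To use this on a downloaded file, pass an open file handle (for example `open(path, encoding="utf-8", newline="")`, where `path` is wherever the file was saved) in place of the `io.StringIO` sample.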