Overview of Datasets

The examples in this book use the "Chimpmark" datasets: a set of freely-redistributable datasets, converted to simple standard formats, with traceable provenance and documented schema. They are the same datasets as used in the upcoming Chimpmark Challenge big-data benchmark. The datasets are:

  • Wikipedia English-language Article Corpus (wikipedia_corpus; 38 GB, 619 million records, 4 billion tokens): the full text of every English-language Wikipedia article, in

  • Wikipedia Pagelink Graph (wikipedia_pagelinks; ) --

  • Wikipedia Pageview Stats (wikipedia_pageviews; 2.3 TB, about 250 billion records (FIXME: verify num records)) -- hour-by-hour pageview

  • ASA SC/SG Data Expo Airline Flights (airline_flights; 12 GB, 120 million records): every US airline flight from 1987-2008, with information on arrival/departure times and delay causes, and accompanying data on airlines, airports and airplanes.

  • NCDC Hourly Global Weather Measurements, 1929-2009 (ncdc_weather_hourly; 59 GB, XX billion records): hour-by-hour weather from the National Climatic Data Center for the entire globe, with reasonably dense spatial coverage back to the 1950s and in some cases coverage back to 1929.

  • 1998 World Cup access logs (access_logs/ita_world_cup_apachelogs; 123 GB, 1.3 billion records): every request made to the 1998 World Cup Web site between April 30, 1998 and July 26, 1998, in Apache log format.
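
Since the access logs are described as being in Apache log format, a minimal parsing sketch may help. It assumes the files follow the Common Log Format (host, ident, user, timestamp, request, status, size); the sample line and the `parse_clf_line` helper are made up for illustration and are not taken from the dataset.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [timestamp] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A size of "-" means no body was sent; treat it as zero bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# Hypothetical sample line, not copied from the World Cup logs.
sample = '192.0.2.1 - - [30/Apr/1998:21:30:17 +0000] "GET /images/logo.gif HTTP/1.0" 200 1204'
record = parse_clf_line(sample)
print(record["host"], record["status"], record["size"])
```

Returning None for unparseable lines, rather than raising, lets a job over 1.3 billion records count and skip bad lines instead of failing.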

=== Wikipedia Page Traffic Statistic V3 ===

  • a 150 GB sample of the data used to power It includes a full 3 months of hourly page traffic statistics from Wikipedia (1/1/2011-3/31/2011).
  • Twilio/ Street Vector Data Set -- -- geo -- Twilio/ database of mapped US street names and address ranges.

  • 2008 TIGER/Line Shapefiles -- 125 GB -- geo -- This data set is a complete set of Census 2000 and Current shapefiles for American states, counties, subdivisions, districts, places, and areas. The data is available as shapefiles suitable for use in GIS, along with their associated metadata. The official source of this data is the US Census Bureau, Geography Division.

==== ASA SC/SG Data Expo Airline Flights

This data set is from the ASA Statistical Computing / Statistical Graphics section 2009 contest, "Airline Flight Status -- Airline On-Time Statistics and Delay Causes". The documentation below is largely adapted from that site.

The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released.

The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and it takes up 1.6 gigabytes of space compressed and 12 gigabytes uncompressed.

The data comes originally from the DOT's Research and Innovative Technology Administration (RITA) group, where it is described in detail. You can download the original data there. The files here have derivable variables removed, are packaged in yearly chunks and have been more heavily compressed than the originals.

Here are a few ideas to get you started exploring the data:

  • When is the best time of day/day of week/time of year to fly to minimise delays?
  • Do older planes suffer more delays?
  • How does the number of people flying between different locations change over time?
  • How well does weather predict plane delays?
  • Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?
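
As a starting point for the first question, here is a small sketch that averages arrival delay by scheduled departure hour. The column names `CRSDepTime` (scheduled departure as hhmm) and `ArrDelay` (minutes, with `NA` for missing values) are assumptions about the file layout, and the sample rows are invented; check them against the actual files before relying on this.

```python
import csv
import io
from collections import defaultdict

def mean_delay_by_hour(rows):
    """Return {scheduled departure hour: mean arrival delay in minutes}."""
    totals = defaultdict(lambda: [0.0, 0])  # hour -> [sum of delays, count]
    for row in rows:
        if row["ArrDelay"] in ("", "NA"):
            continue  # skip flights with no recorded arrival delay
        hour = int(row["CRSDepTime"]) // 100
        totals[hour][0] += float(row["ArrDelay"])
        totals[hour][1] += 1
    return {hour: total / count for hour, (total, count) in sorted(totals.items())}

# Invented sample rows standing in for one of the yearly files.
sample_csv = """CRSDepTime,ArrDelay
0630,4
0645,-2
1730,25
1755,35
1800,NA
"""
result = mean_delay_by_hour(csv.DictReader(io.StringIO(sample_csv)))
print(result)
```

The same group-and-average shape carries over to day of week or month by changing the key that is computed from each row.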

===== Support data

The airport datasets contain errors and conflicts; we've done some hand-curation and verification to reconcile them. The file wikipedia_conflicting.tsv shows where my patience wore out.

=== ITA World Cup Apache Logs

=== Daily Global Weather Measurements, 1929-2009 (NCDC, GSOD) ===

  • 20 GB
  • geo, stats

=== Retrosheet

        25	 	Retrosheet: MLB play-by-play, high detail, 1840-2011	ripd/
        25	 	Retrosheet: MLB box scores, 1871-2011               	ripd/

=== Other Datasets ===

approx size	 Mrecs	source Data
       huge		US Patent Data from Google                 [Google Patent Collection]
       huge	      1	Mathematical constants to billion+'th-place
  2_300_000	 250000	Wikipedia Pageview Stats                  
   470_000	      	Wikipedia Log traces
   124_000	1300000	Access Logs, 1998 World Cup (Internet Traffic Archive) 	access_logs/ita/ita_world_cup
    40_000	 B	NCDC: Hourly Weather (full)                
    34_000	     10	MLB Gameday Pitch-by-pitch data, 2007-2011 
    16_000	    619	Wikipedia corpus and pagelinks             
    14_000	      	NCDC: Hourly weather (simplified)          
    14_000	       	Memetracker                                
    14_000	      	Amazon Co-Purchasing Data                  
    11_000	      	Crosswikis                                 
     6_400	      	NCDC: Daily Weather                        
     6_300	      	Berkeley Earth Surface Temperature                  	stats/earth_surface_temperature
     2_900	      	Twilio TigerLINE US Street Map                      	geo/us_street_map/addresses
     1_900	      	All US Airline Flights 1987-2009 (ASA Data Expo)
     1_300	      	Geonames Points of Interest                         	geo/geonames/info
     1_300	      	Daily Prices for all US stocks, 1962–2011           	stats/stock_prices
     1_040	      	Patent data (see Google data too)          
       573	      	TAKS Exam Scores for all Texas students, 2007-2010  	ripd/texas_taks_exam
       571	      	Pi to 1 Billion decimal places             
       419	      	Enron Email Corpus                                  	lang/corpora/enron_trial_coporate_email_corpus
       362	      	DBpedia Wikipedia Article Features         
       331	      	DBpedia                                    
       310	       	Grouplens: User-Movie affinity                      	graph/grouplens_movies
       305	 	UFO Sightings (UFORC)                               	geo/ufo_sightings
       223	 	Geonames Postal Codes                               	geo/geonames/postal_codes
       121	 	Book Crossing: User-Book affinity                   	graph/book_crossing
       111		Maxmind GeoLite (IP-Geo) data                       	ripd/
        91	 	Access Logs:'s Star Wars Kid logs          	access_logs/star_wars_kid
        62	 	Metafilter corpus of postings with metadata         	ripd/
        47	 	Word frequencies from the British National Corpus
        36	 	Mobywords thesaurus                                 	lang/corpora/thesaurus_mobywords
        25	 	Retrosheet: MLB play-by-play, high detail, 1840-2011	ripd/
        25	 	Retrosheet: MLB box scores, 1871-2011               	ripd/
        20	 	US Federal Reserve Bank Loans (Bloomberg)           	misc/bank_loans_by_fed
        11	 	Scrabble dictionaries                               	lang/corpora/scrabble
        11	 	All Scrabble tile combinations with rack value      	misc/words_quackle
      1000	 	Marvel Universe Social Graph
         . 		Materials Safety Datasheets
         . 		Crunchbase
         . 		Natural Earth detailed geographic boundaries
         . 		US Census 2009 ACS (Long-form census)
         .		US Census Geographic boundaries
         .		Zillow US Neighborhood Boundaries
         . 		Open Street Map
2_000_000		Google Books N-Grams                       

60_000_000		Common Crawl Web Corpus
   600_000		Apache Software Foundation Public Mail Archives
   300_000		Million-Song dataset
         .		Reference Energy Disaggregation Dataset (REDD)
         .		US Legislation Co-Sponsorship
         .		VoteView: Political Spectrum Rank of US Legislators/Laws (DW-NOMINATE Rank Orderings, all Houses and Senates)
         .		World Bank
         .		Record of American Democracy
         .		Human Mortality DB
         .		FCC Antenna locations
         .		Pew Research Datasets
         .		YouTube Related Videos
         .		Westbury Usenet Archive (2005-2010)
         .		Wikipedia Page Traffic Statistics          	snap-753dfc1c
         .		Wikipedia Traffic Statistics V2            	snap-0c155c67
         .		Wikipedia Page Traffic Statistic V3        	snap-f57dec9a
         .		Marvel Universe Social Graph               	snap-7766d116
    10_000		Daily Global Weather, 1929-2009            	snap-ac47f4c5
   220_000		Twilio/ Street Vector Data Set             	snap-5eaf5537
         .		US Economic Data 2003-2006                 	snap-0bdf3f62

Notes on some of these datasets:

  • Record of American Democracy: The Record Of American Democracy (ROAD) data includes election returns, socioeconomic summaries, and demographic measures of the American public at unusually low levels of geographic aggregation. The NSF-supported ROAD project covers every state in the country from 1984 through 1990 (including some off-year elections). One collection of data sets includes every election at and above State House, along with party registration and other variables, in each state for the roughly 170,000 precincts nationwide (about 60 times the number of counties). Another collection has added to these (roughly 30-40) political variables an additional 3,725 variables merged from the 1990 U.S. Census for 47,327 aggregate units (about 15 times the number of counties), each about the size of one or more cities or towns. These units completely tile the U.S. landmass. The collection also includes geographic boundary files so users can easily draw maps with these data.
  • Human Mortality DB: The Human Mortality Database (HMD) was created to provide detailed mortality and population data to researchers, students, journalists, policy analysts, and others interested in the history of human longevity. The project began as an outgrowth of earlier projects in the Department of Demography at the University of California, Berkeley, USA, and at the Max Planck Institute for Demographic Research in Rostock, Germany. It is the work of two teams of researchers in the USA and Germany, with the help of financial backers and scientific collaborators from around the world.
  • Twilio/ Street Vector Data Set (MySQL; geo): A complete database of US street names and address ranges mapped to zip codes and latitude/longitude ranges, with DTMF key mappings for all street names.
  • US Economic Data 2003-2006 (stats): US Economic Data for 2003-2006 from the US Census Bureau -- raw census data (ACS2002-2006).

==== Wikipedia Log traces ====

  • logs/wikibench_logtraces (470 GB)

==== Amazon Co-Purchasing Data ====

==== Patents ====

==== Marvel Universe Social Graph ====

  • 1 GB
  • graph
  • Social collaboration network of the Marvel comic book universe based on co-appearances.
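
To make "based on co-appearances" concrete, the sketch below builds a weighted co-appearance edge list from (character, issue) pairs: two characters are linked once for every issue they share. The input layout and the placeholder names are assumptions for illustration, not the actual file format of this dataset.

```python
from collections import defaultdict
from itertools import combinations

def coappearance_edges(appearances):
    """Map each unordered character pair to the number of issues they share."""
    cast_by_issue = defaultdict(set)
    for character, issue in appearances:
        cast_by_issue[issue].add(character)

    weights = defaultdict(int)
    for cast in cast_by_issue.values():
        # Sorting gives each pair one canonical (a, b) ordering.
        for a, b in combinations(sorted(cast), 2):
            weights[(a, b)] += 1
    return dict(weights)

# Placeholder records; real data would list characters and comic issues.
appearances = [
    ("HERO_A", "ISSUE_1"), ("HERO_B", "ISSUE_1"), ("HERO_C", "ISSUE_1"),
    ("HERO_A", "ISSUE_2"), ("HERO_B", "ISSUE_2"),
]
edges = coappearance_edges(appearances)
print(edges)
```

Note that an issue with n characters contributes n(n-1)/2 pairs, so large crossover issues dominate the edge count; at scale this pairing step is the expensive part.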

==== Google Books Ngrams ====

==== Common Crawl web corpus ====


A corpus of web crawl data composed of 5 billion web pages. This data set is freely available on Amazon S3 and formatted in the ARC (.arc) file format.


  • Size: 60 TB
  • Source: Common Crawl Foundation
  • Created On: February 15, 2012 2:23 AM GMT
  • Last Updated: February 15, 2012 2:23 AM GMT
  • Available at: s3://aws-publicdatasets/common-crawl/crawl-002/

Common Crawl is a non-profit organization that builds and maintains an open repository of web crawl data for the purpose of driving innovation in research, education and technology. This data set contains web crawl data from 5 billion web pages and is released under the Common Crawl Terms of Use.

The ARC (.arc) file format used by Common Crawl was developed by the Internet Archive to store their archived crawl data. It is essentially a multi-part gzip file, with each entry in the master gzip (ARC) file being an independent gzip stream in itself. You can use a tool like zcat to spill the contents of an ARC file to stdout. For more information see the Internet Archive's Arc File Format description.
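
The "independent gzip streams" structure can be illustrated with Python's standard library. This sketch builds a tiny two-member gzip blob in memory and walks it one member at a time; it demonstrates only the concatenated-gzip container, not the ARC record headers inside each entry, and the `iter_gzip_members` helper is our own name.

```python
import gzip
import zlib

def iter_gzip_members(blob):
    """Yield the decompressed payload of each gzip member in a concatenated blob."""
    while blob:
        # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
        payload = decompressor.decompress(blob) + decompressor.flush()
        yield payload
        # Whatever follows the end of this member is the start of the next one.
        blob = decompressor.unused_data

# Two independently compressed entries, concatenated like entries in an ARC file.
entries = [b"first record\n", b"second record\n"]
archive = b"".join(gzip.compress(entry) for entry in entries)

members = list(iter_gzip_members(archive))
print(members)

# Reading the whole blob at once yields the concatenation, as zcat would.
whole = gzip.decompress(archive)
```

Because each member is self-contained, a reader that knows a member's byte offset can start decompressing there without touching the entries before it.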

Common Crawl provides the glue code required to launch Hadoop jobs on Amazon Elastic MapReduce that can run against the crawl corpus residing in the Amazon Public Data Sets. By using Amazon Elastic MapReduce to access the S3-resident data, end users can avoid costly network transfers.

To learn more about Amazon Elastic MapReduce please see the product detail page.

Common Crawl's Hadoop classes and other code can be found in its GitHub repository.

A tutorial for analyzing Common Crawl's dataset with Amazon Elastic MapReduce called MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl may be found on the Common Crawl blog.

==== Apache Software Foundation Public Mail Archives ====

==== Reference Energy Disaggregation Dataset (REDD) ====

Initial REDD Release, Version 1.0

This is the home page for the REDD data set. Below you can download an initial version of the data set, containing several weeks of power data for 6 different homes, and high-frequency current/voltage data for the main power supply of two of these homes. The data itself and the hardware used to collect it are described more thoroughly in the Readme below and in the paper:

J. Zico Kolter and Matthew J. Johnson. REDD: A public data set for energy disaggregation research. In Proceedings of the SustKDD workshop on Data Mining Applications in Sustainability, 2011.

Those wishing to use the dataset in academic work should cite this paper as the reference. Although the data set is freely available, for the time being we still ask those interested in downloading the data to email us to receive the username/password for the download. See the readme.txt file for a full description of the different downloads and their formats.

==== The Book-Crossing dataset ====

  • [Book Crossing] Collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. Contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books. Freely available for research use when acknowledged with the following reference (further details on the dataset are given in this publication): Improving Recommendation Lists Through Topic Diversification, Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen; Proceedings of the 14th International World Wide Web Conference (WWW '05), May 10-14, 2005, Chiba, Japan. To appear. As a courtesy, if you use the data, I would appreciate knowing your name, what research group you are in, and the publications that may result.

The Book-Crossing dataset comprises 3 tables.

  • BX-Users: Contains the users. Note that user IDs (User-ID) have been anonymized and map to integers. Demographic data is provided (Location, Age) if available. Otherwise, these fields contain NULL-values.
  • BX-Books: Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (Book-Title, Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (Image-URL-S, Image-URL-M, Image-URL-L), i.e., small, medium, large. These URLs point to the Amazon web site.
  • BX-Book-Ratings: Contains the book rating information. Ratings (Book-Rating) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.
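
Because a rating of 0 means "implicit" rather than "worst", it is easy to skew an average by mixing the two kinds. The sketch below separates them before computing a mean. The column names come from the description above; the semicolon-delimited, quoted layout and the sample rows are assumptions made for illustration.

```python
import csv
import io

def split_ratings(rows):
    """Separate explicit ratings (1-10) from implicit ones (0)."""
    explicit, implicit = [], []
    for row in rows:
        rating = int(row["Book-Rating"])
        key = (row["User-ID"], row["ISBN"])
        if rating == 0:
            implicit.append(key)
        elif 1 <= rating <= 10:
            explicit.append((key, rating))
        else:
            raise ValueError(f"rating out of range: {rating}")
    return explicit, implicit

# Invented rows; the delimiter and quoting are assumed, not confirmed.
sample = (
    '"User-ID";"ISBN";"Book-Rating"\n'
    '"1";"0000000001";"0"\n'
    '"1";"0000000002";"8"\n'
    '"2";"0000000001";"5"\n'
    '"3";"0000000003";"0"\n'
)
reader = csv.DictReader(io.StringIO(sample), delimiter=";", quotechar='"')
explicit, implicit = split_ratings(reader)

mean_explicit = sum(rating for _, rating in explicit) / len(explicit)
print(len(explicit), len(implicit), mean_explicit)
```

Averaging all four sample rows instead would give 3.25, which understates how much the rated books were liked.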

==== Westbury Usenet Archive ====

  • [Westbury Usenet Archive] -- USENET corpus (2005-2010). This corpus is a collection of public USENET postings. It was collected between Oct 2005 and Jan 2011, and covers 47,860 English-language, non-binary-file newsgroups. Despite our best efforts, this corpus includes a very small number of non-English words, non-words, and spelling errors. The corpus is untagged, raw text. It may be necessary to process the corpus further to put it in a format that suits your needs.

==== Million Song Dataset ====

The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

Its purposes are:

  • To encourage research on algorithms that scale to commercial sizes
  • To provide a reference dataset for evaluating research
  • As a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's)
  • To help new researchers get started in the MIR field

The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Note, however, that sample audio can be fetched from services like 7digital, using code we provide.

The Million Song Dataset is also a cluster of complementary datasets contributed by the community:

  • SecondHandSongs dataset: cover songs
  • musiXmatch dataset: lyrics
  • dataset: song-level tags and similarity
  • Taste Profile subset: user data


From the original documentation:

Field name                      Type              Description
analysis_sample_rate            float             sample rate of the audio used
artist_7digitalid               int               ID from or -1
artist_familiarity              float             algorithmic estimation
artist_hotttnesss               float             algorithmic estimation
artist_id                       string            Echo Nest ID
artist_latitude                 float             latitude
artist_location                 string            location name
artist_longitude                float             longitude
artist_mbid                     string            ID from
artist_mbtags                   array string      tags from
artist_mbtags_count             array int         tag counts for musicbrainz tags
artist_name                     string            artist name
artist_playmeid                 int               ID from, or -1
artist_terms                    array string      Echo Nest tags
artist_terms_freq               array float       Echo Nest tags freqs
artist_terms_weight             array float       Echo Nest tags weight
audio_md5                       string            audio hash code
bars_confidence                 array float       confidence measure
bars_start                      array float       beginning of bars, usually on a beat
beats_confidence                array float       confidence measure
beats_start                     array float       result of beat tracking
danceability                    float             algorithmic estimation
duration                        float             in seconds
end_of_fade_in                  float             seconds at the beginning of the song
energy                          float             energy from listener point of view
key                             int               key the song is in
key_confidence                  float             confidence measure
loudness                        float             overall loudness in dB
mode                            int               major or minor
mode_confidence                 float             confidence measure
release                         string            album name
release_7digitalid              int               ID from or -1
sections_confidence             array float       confidence measure
sections_start                  array float       largest grouping in a song, e.g. verse
segments_confidence             array float       confidence measure
segments_loudness_max           array float       max dB value
segments_loudness_max_time      array float       time of max dB value, i.e. end of attack
segments_loudness_start         array float       dB value at onset
segments_pitches                2D array float    chroma feature, one value per note
segments_start                  array float       musical events, ~ note onsets
segments_timbre                 2D array float    texture features (MFCC+PCA-like)
similar_artists                 array string      Echo Nest artist IDs (sim. algo. unpublished)
song_hotttnesss                 float             algorithmic estimation
song_id                         string            Echo Nest song ID
start_of_fade_out               float             time in sec
tatums_confidence               array float       confidence measure
tatums_start                    array float       smallest rhythmic element
tempo                           float             estimated tempo in BPM
time_signature                  int               estimate of number of beats per bar, e.g. 4
time_signature_confidence       float             confidence measure
title                           string            song title
track_id                        string            Echo Nest track ID
track_7digitalid                int               ID from or -1
year                            int               song release year from MusicBrainz or 0

An Example Track Description

Below is a list of all the fields associated with each track in the database. This is simply an annotated version of the output of the example code. For the fields that include a large amount of numerical data, we indicate only the shape of the data array. Since most of these fields are taken directly from the Echo Nest Analyze API, more details can be found in the Echo Nest Analyze API documentation.

A more technically-oriented list of these fields is given on the field list page.

This example data is shown for the track whose track_id is TRAXLZU12903D05F94 - namely, "Never Gonna Give You Up" by Rick Astley.

artist_mbid:                    db92a151-1ac2-438b-bc43-b82e149ddd50            the ID for this artist is db9...
artist_mbtags:                  shape = (4,)                                    this artist received 4 tags on
artist_mbtags_count:            shape = (4,)                                    raw tag count of the 4 tags this artist received on
artist_name:                    Rick Astley                                     artist name
artist_playmeid:                1338                                            the ID of that artist on the service
artist_terms:                   shape = (12,)                                   this artist has 12 terms (tags) from The Echo Nest
artist_terms_freq:              shape = (12,)                                   frequency of the 12 terms from The Echo Nest (number between 0 and 1)
artist_terms_weight:            shape = (12,)                                   weight of the 12 terms from The Echo Nest (number between 0 and 1)
audio_md5:                      bf53f8113508a466cd2d3fda18b06368                hash code of the audio used for the analysis by The Echo Nest
bars_confidence:                shape = (99,)                                   confidence value (between 0 and 1) associated with each bar by The Echo Nest
bars_start:                     shape = (99,)                                   start time of each bar according to The Echo Nest, this song has 99 bars
beats_confidence:               shape = (397,)                                  confidence value (between 0 and 1) associated with each beat by The Echo Nest
beats_start:                    shape = (397,)                                  start time of each beat according to The Echo Nest, this song has 397 beats
danceability:                   0.0                                             danceability measure of this song according to The Echo Nest (between 0 and 1, 0 => not analyzed)
duration:                       211.69587                                       duration of the track in seconds
end_of_fade_in:                 0.139                                           time of the end of the fade in, at the beginning of the song, according to The Echo Nest
energy:                         0.0                                             energy measure (not in the signal processing sense) according to The Echo Nest (between 0 and 1, 0 => not analyzed)
key:                            1                                               estimation of the key the song is in by The Echo Nest
key_confidence:                 0.324                                           confidence of the key estimation
loudness:                       -7.75                                           general loudness of the track
mode:                           1                                               estimation of the mode the song is in by The Echo Nest
mode_confidence:                0.434                                           confidence of the mode estimation
release:                        Big Tunes - Back 2 The 80s                      album name from which the track was taken, some songs / tracks can come from many albums, we give only one
release_7digitalid:             786795                                          the ID of the release (album) on the service
sections_confidence:            shape = (10,)                                   confidence value (between 0 and 1) associated with each section by The Echo Nest
sections_start:                 shape = (10,)                                   start time of each section according to The Echo Nest, this song has 10 sections
segments_confidence:            shape = (935,)                                  confidence value (between 0 and 1) associated with each segment by The Echo Nest
segments_loudness_max:          shape = (935,)                                  max loudness during each segment
segments_loudness_max_time:     shape = (935,)                                  time of the max loudness during each segment
segments_loudness_start:        shape = (935,)                                  loudness at the beginning of each segment
segments_pitches:               shape = (935, 12)                               chroma features for each segment (normalized so max is 1.)
segments_start:                 shape = (935,)                                  start time of each segment (~ musical event, or onset) according to The Echo Nest, this song has 935 segments
segments_timbre:                shape = (935, 12)                               MFCC-like features for each segment
similar_artists:                shape = (100,)                                  a list of 100 artists (their Echo Nest ID) similar to Rick Astley according to The Echo Nest
song_hotttnesss:                0.864248830588                                  according to The Echo Nest, when downloaded (in December 2010), this song had a 'hotttnesss' of 0.8 (on a scale of 0 and 1)
song_id:                        SOCWJDB12A58A776AF                              The Echo Nest song ID, note that a song can be associated with many tracks (with very slight audio differences)
start_of_fade_out:              198.536                                         start time of the fade out, in seconds, at the end of the song, according to The Echo Nest
tatums_confidence:              shape = (794,)                                  confidence value (between 0 and 1) associated with each tatum by The Echo Nest
tatums_start:                   shape = (794,)                                  start time of each tatum according to The Echo Nest, this song has 794 tatums
tempo:                          113.359                                         tempo in BPM according to The Echo Nest
time_signature:                 4                                               time signature of the song according to The Echo Nest, i.e. usual number of beats per bar
time_signature_confidence:      0.634                                           confidence of the time signature estimation
title:                          Never Gonna Give You Up                         song title
track_7digitalid:               8707738                                         the ID of this song on the service
track_id:                       TRAXLZU12903D05F94                              The Echo Nest ID of this particular track on which the analysis was done
year:                           1987                                            year when this song was released, according to
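
The array shapes in this listing can be cross-checked against the scalar fields with a little arithmetic. The sketch below uses only numbers copied from the listing above; the dictionary keys `n_bars`, `n_beats`, and `n_tatums` are our own names for the array lengths, not fields in the dataset.

```python
# Values copied from the example track listing.
track = {
    "duration": 211.69587,   # seconds
    "tempo": 113.359,        # beats per minute
    "time_signature": 4,     # estimated beats per bar
    "n_bars": 99,            # length of bars_start
    "n_beats": 397,          # length of beats_start
    "n_tatums": 794,         # length of tatums_start
}

# Beats per bar implied by the beat and bar counts.
beats_per_bar = track["n_beats"] / track["n_bars"]

# Tatums are the smallest rhythmic element, so expect a whole number per beat.
tatums_per_beat = track["n_tatums"] / track["n_beats"]

# Beat count implied by tempo and duration, for comparison with n_beats.
beats_from_tempo = track["tempo"] * track["duration"] / 60.0

print(round(beats_per_bar, 2), tatums_per_beat, round(beats_from_tempo, 1))
```

For this track the counts agree with the estimated time signature of 4, and the tempo-based estimate lands within a few beats of the tracked beat count, which is a cheap sanity check to run before trusting the per-beat arrays.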

==== Google / Stanford Crosswiki ====[wikipedia_words]

This data set accompanies

Valentin I. Spitkovsky and Angel X. Chang. 2012. A Cross-Lingual Dictionary for English Wikipedia Concepts. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012).

Please cite the appropriate publication if you use this data. (See for .bib entries.)

There are six line-based (and two other) text files, each of them lexicographically sorted, encoded with UTF-8, and compressed using bzip2 (-9). One way to view the data without fully expanding it first is with the bzcat command, e.g.,

bzcat dictionary.bz2 | grep ... | less

Note that raw data were gathered from heterogeneous sources, at different points in time, and are thus sometimes contradictory. We made a best effort at reconciling the information, but likely also introduced some bugs of our own, so be prepared to write fault-tolerant code... keep in mind that even tiny error rates translate into millions of exceptions, over billions of datums.
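
In the same spirit as the bzcat pipeline, the following sketch streams a bzip2 file line by line in Python without expanding it on disk, and counts rather than crashes on lines that fail to decode. The demo file contents and the `grep_bz2` helper are invented for illustration; the real files' line layout is not assumed beyond being UTF-8 text.

```python
import bz2
import os
import tempfile

def grep_bz2(path, needle):
    """Stream a .bz2 text file, returning matching lines and a bad-line count."""
    matches, bad_lines = [], 0
    with bz2.open(path, "rb") as handle:
        for raw in handle:
            try:
                line = raw.decode("utf-8")
            except UnicodeDecodeError:
                bad_lines += 1  # tolerate the occasional malformed line
                continue
            if needle in line:
                matches.append(line.rstrip("\n"))
    return matches, bad_lines

# Build a small demo file, including one line that is not valid UTF-8.
fd, demo_path = tempfile.mkstemp(suffix=".bz2")
os.close(fd)
with bz2.open(demo_path, "wb") as handle:
    handle.write(b"alpha\tone\n")
    handle.write(b"\xff\xfe broken line\n")
    handle.write(b"alphabet\ttwo\n")
    handle.write(b"beta\tthree\n")

matches, bad_lines = grep_bz2(demo_path, "alpha")
os.remove(demo_path)
print(matches, bad_lines)
```

Counting the skipped lines, instead of silently dropping them, makes it possible to notice when the error rate is higher than expected.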

==== English Gigaword Dataset (LDC) ====

The English Gigaword corpus, now being released in its fourth edition, is a comprehensive archive of newswire text data that has been acquired over several years by the LDC at the University of Pennsylvania. The fourth edition includes all of the contents in English Gigaword Third Edition (LDC2007T07) plus new data covering the 24-month period of January 2007 through December 2008. Portions of the dataset are © 1994-2008 Agence France Presse, © 1994-2008 The Associated Press, © 1997-2008 Central News Agency (Taiwan), © 1994-1998, 2003-2008 Los Angeles Times-Washington Post News Service, Inc., © 1994-2008 New York Times, © 1995-2008 Xinhua News Agency, © 2009 Trustees of the University of Pennsylvania. The six distinct international sources of English newswire included in this edition are the following:

  • Agence France-Presse, English Service (afp_eng)
  • Associated Press Worldstream, English Service (apw_eng)
  • Central News Agency of Taiwan, English Service (cna_eng)
  • Los Angeles Times/Washington Post Newswire Service (ltw_eng)
  • New York Times Newswire Service (nyt_eng)
  • Xinhua News Agency, English Service (xin_eng)

For an example of the data in this corpus, please review [this sample file].

=== Sources of Public and Commercial Data


  • Infochimps
  • Factual
  • CKAN
  • Get.theinfo
  • Microsoft Azure Data Marketplace
