== Overview of Datasets ==

The examples in this book use the "Chimpmark" datasets: a set of freely-redistributable datasets, converted to simple standard formats, with traceable provenance and documented schema. They are the same datasets as used in the upcoming Chimpmark Challenge big-data benchmark. The datasets are:

  • Wikipedia English-language Article Corpus (wikipedia_corpus; 38 GB, 619 million records, 4 billion tokens): the full text of every English-language Wikipedia article.

  • Wikipedia Pagelink Graph (wikipedia_pagelinks; ) -- the graph of links between Wikipedia pages.

  • Wikipedia Pageview Stats (wikipedia_pageviews; 2.3 TB, about 250 billion records (FIXME: verify num records)) -- hour-by-hour pageview counts for each Wikipedia page.

  • ASA SC/SG Data Expo Airline Flights (airline_flights; 12 GB, 120 million records): every US airline flight from 1987-2008, with information on arrival/departure times and delay causes, and accompanying data on airlines, airports and airplanes.

  • NCDC Hourly Global Weather Measurements, 1929-2009 (ncdc_weather_hourly; 59 GB, XX billion records): hour-by-hour weather from the National Climatic Data Center for the entire globe, with reasonably-dense spatial coverage back to the 1950s and in some cases coverage back to 1929.

  • 1998 World Cup access logs (access_logs/ita_world_cup_apachelogs; 123 GB, 1.3 billion records): every request made to the 1998 World Cup Web site between April 30, 1998 and July 26, 1998, in Apache log format.

=== Wikipedia Page Traffic Statistic V3 ===

  • a 150 GB sample of the data used to power trendingtopics.org. It includes a full 3 months of hourly page traffic statistics from Wikipedia (1/1/2011-3/31/2011).
  • Twilio/Wigle.net Street Vector Data Set -- geo -- a database of mapped US street names and address ranges.

  • 2008 TIGER/Line Shapefiles -- 125 GB -- geo -- This data set is a complete set of Census 2000 and Current shapefiles for American states, counties, subdivisions, districts, places, and areas. The data is available as shapefiles suitable for use in GIS, along with their associated metadata. The official source of this data is the US Census Bureau, Geography Division.

=== ASA SC/SG Data Expo Airline Flights

This data set is from the ASA Statistical Computing / Statistical Graphics section 2009 contest, "Airline Flight Status -- Airline On-Time Statistics and Delay Causes". The documentation below is largely adapted from that site.

The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end, as well as in summary tables posted on the BTS website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released.

The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, which take up 1.6 gigabytes compressed and 12 gigabytes uncompressed.

The data comes originally from the DOT's Research and Innovative Technology Administration (RITA) group, where it is described in detail. You can download the original data there. The files here have derivable variables removed, are packaged in yearly chunks and have been more heavily compressed than the originals.

Here are a few ideas to get you started exploring the data (a starter sketch for the first question follows the list):

  • When is the best time of day/day of week/time of year to fly to minimise delays?
  • Do older planes suffer more delays?
  • How does the number of people flying between different locations change over time?
  • How well does weather predict plane delays?
  • Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?
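
As a concrete starting point for the first question, here is a minimal pandas sketch. It assumes the Data Expo's per-year CSV files (e.g. 2007.csv) and the column names documented on the contest site (CRSDepTime, ArrDelay, DayOfWeek); verify those against the variable list before relying on it.

    import pandas as pd

    # Mean arrival delay by scheduled departure hour and by day of week.
    # Column names are from the Data Expo variable list; check your copy.
    flights = pd.read_csv("2007.csv", usecols=["DayOfWeek", "CRSDepTime", "ArrDelay"])
    flights["dep_hour"] = (flights["CRSDepTime"] // 100).clip(0, 23)  # hhmm -> hour

    print(flights.groupby("dep_hour")["ArrDelay"].mean().round(1))   # best hour to fly
    print(flights.groupby("DayOfWeek")["ArrDelay"].mean().round(1))  # 1 = Monday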

==== Support data

The airport datasets contain errors and conflicts; we've done some hand-curation and verification to reconcile them. The file wikipedia_conflicting.tsv shows where our patience wore out.

=== ITA World Cup Apache Logs

  • 1998 World Cup access logs (access_logs/ita_world_cup_apachelogs; 123 GB, 1.3 billion records): every request made to the 1998 World Cup Web site between April 30, 1998 and July 26, 1998, in Apache log format.

=== Daily Global Weather Measurements, 1929-2009 (NCDC, GSOD) ===

  • 20 GB
  • geo, stats

=== Retrosheet

  • Retrosheet: MLB play-by-play, high detail, 1840-2011 -- 25 MB -- ripd/www.retrosheet.org-2007/boxesetc/2006
  • Retrosheet: MLB box scores, 1871-2011 -- 25 MB -- ripd/www.retrosheet.org-2007/boxesetc/2006

=== Other Datasets ===


approx size (MB)	 Mrecs	dataset                                             	source / location
       huge		US Patent Data from Google                          	www.google.com/googlebooks/uspto-patents.html[Google Patent Collection]
       huge	      1	Mathematical constants to billion+'th-place         	www.numberworld.org/ftp
  2_300_000	 250000	Wikipedia Pageview Stats                           	dumps.wikimedia.org/other/pagecounts-raw
   470_000	      	Wikibench.eu Wikipedia Log traces                   	Wikibench.eu
   124_000	1300000	Access Logs, 1998 World Cup (Internet Traffic Archive) 	access_logs/ita/ita_world_cup
    40_000	 B	NCDC: Hourly Weather (full)                         	ftp.ncdc.noaa.gov/pub/data/noaa
    34_000	     10	MLB Gameday Pitch-by-pitch data, 2007-2011          	gd2.mlb.com/components/game/mlb
    16_000	    619	Wikipedia corpus and pagelinks                      	dumps.wikimedia.org/enwiki/20120601
    14_000	      	NCDC: Hourly weather (simplified)                   	ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite
    14_000	       	Memetracker                                         	snap.stanford.edu/data/bigdata/memetracker9
    14_000	      	Amazon Co-Purchasing Data                           	snap.stanford.edu/data/bigdata/amazon0312.html
    11_000	      	Crosswikis                                          	nlp.stanford.edu/pubs/crosswikis-data.tar.bz2
     6_400	      	NCDC: Daily Weather                                 	ftp.ncdc.noaa.gov/pub/data/gsod
     6_300	      	Berkeley Earth Surface Temperature                  	stats/earth_surface_temperature
     2_900	      	Twilio TigerLINE US Street Map                      	geo/us_street_map/addresses
     1_900	      	All US Airline Flights 1987-2009 (ASA Data Expo)    	stat-computing.org/dataexpo/2009
     1_300	      	Geonames Points of Interest                         	geo/geonames/info
     1_300	      	Daily Prices for all US stocks, 1962–2011           	stats/stock_prices
     1_040	      	Patent data (see Google data too)                   	www.nber.org/~jbessen
       573	      	TAKS Exam Scores for all Texas students, 2007-2010  	ripd/texas_taks_exam
       571	      	Pi to 1 Billion decimal places                      	ja0hxv.calico.jp/value/pai/val01/pi
       419	      	Enron Email Corpus                                  	lang/corpora/enron_trial_coporate_email_corpus
       362	      	DBpedia Wikipedia Article Features                  	downloads.dbpedia.org/3.7/links
       331	      	DBpedia                                             	spotlight.dbpedia.org/datasets
       310	       	Grouplens: User-Movie affinity                      	graph/grouplens_movies
       305	 	UFO Sightings (UFORC)                               	geo/ufo_sightings
       223	 	Geonames Postal Codes                               	geo/geonames/postal_codes
       121	 	Book Crossing: User-Book affinity                   	graph/book_crossing
       111		Maxmind GeoLite (IP-Geo) data                       	ripd/geolite.maxmind.com/download
        91	 	Access Logs: waxy.org's Star Wars Kid logs          	access_logs/star_wars_kid
        62	 	Metafilter corpus of postings with metadata         	ripd/stuff.metafilter.com/infodump
        47	 	Word frequencies from the British National Corpus   	ucrel.lancs.ac.uk/bncfreq/lists
        36	 	Mobywords thesaurus                                 	lang/corpora/thesaurus_mobywords
        25	 	Retrosheet: MLB play-by-play, high detail, 1840-2011	ripd/www.retrosheet.org-2007/boxesetc/2006
        25	 	Retrosheet: MLB box scores, 1871-2011               	ripd/www.retrosheet.org-2007/boxesetc/2006
        20	 	US Federal Reserve Bank Loans (Bloomberg)           	misc/bank_loans_by_fed
        11	 	Scrabble dictionaries                               	lang/corpora/scrabble
        11	 	All Scrabble tile combinations with rack value      	misc/words_quackle
      1000	 	Marvel Universe Social Graph
         . 		Materials Safety Datasheets
         . 		Crunchbase
         . 		Natural Earth detailed geographic boundaries
         . 		US Census 2009 ACS (Long-form census)
         .		US Census Geographic boundaries
         .		Zillow US Neighborhood Boundaries
         . 		Open Street Map
2_000_000		Google Books N-Grams                                	aws.amazon.com/datasets/8172056142375670

 60_000_000	 	Common Crawl Web Corpus
    600_000	 	Apache Software Foundation Public Mail Archives     	aws.amazon.com/datasets/7791434387204566
    300_000	 	Million-Song dataset                                	labrosa.ee.columbia.edu/millionsong
    220_000	 	Twilio/Wigle.net Street Vector Data Set             	aws.amazon.com/datasets/2408 (snap-5eaf5537)
     10_000	 	Daily Global Weather, 1929-2009                     	aws.amazon.com/datasets/2759 (snap-ac47f4c5)
          .	 	Marvel Universe Social Graph                        	aws.amazon.com/datasets/5621954952932508 (snap-7766d116)
          .	 	Reference Energy Disaggregation Dataset (REDD)      	redd.csail.mit.edu
          .	 	US Legislation Co-Sponsorship                       	jhfowler.ucsd.edu/cosponsorship.htm
          .	 	VoteView: Political Spectrum Rank of US Legislators/Laws	voteview.org/downloads.asp
          .	 	World Bank                                          	data.worldbank.org
          .	 	Record of American Democracy (ROAD)                 	road.hmdc.harvard.edu/pages/road-documentation
          .	 	Human Mortality Database                            	www.mortality.org
          .	 	FCC Antenna locations                               	transition.fcc.gov/mb/databases/cdbs
          .	 	Pew Research Datasets                               	pewinternet.org/Static-Pages/Data-Tools/Download-Data/Data-Sets.aspx
          .	 	Youtube Related Videos                              	netsg.cs.sfu.ca/youtubedata
          .	 	Westbury Usenet Archive (2005-2010)                 	www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html
          .	 	Wikipedia Page Traffic Statistics                   	aws.amazon.com/datasets/2596 (snap-753dfc1c)
          .	 	Wikipedia Traffic Statistics V2                     	aws.amazon.com/datasets/4182 (snap-0c155c67)
          .	 	Wikipedia Page Traffic Statistic V3                 	aws.amazon.com/datasets/6025882142118545 (snap-f57dec9a)
          .	 	US Economic Data 2003-2006                          	aws.amazon.com/datasets/2341 (snap-0bdf3f62)

Notes on selected entries:

  • VoteView: DW-NOMINATE rank orderings for all Houses and Senates.
  • Record of American Democracy (ROAD): election returns, socioeconomic summaries, and demographic measures of the American public at unusually low levels of geographic aggregation. The NSF-supported ROAD project covers every state in the country from 1984 through 1990 (including some off-year elections). One collection of data sets includes every election at and above State House, along with party registration and other variables, in each state for the roughly 170,000 precincts nationwide (about 60 times the number of counties). Another collection adds to these (roughly 30-40) political variables an additional 3,725 variables merged from the 1990 U.S. Census for 47,327 aggregate units (about 15 times the number of counties), each about the size of one or more cities or towns; these units completely tile the U.S. landmass. The collection also includes geographic boundary files so users can easily draw maps with these data.
  • Human Mortality Database (HMD): created to provide detailed mortality and population data to researchers, students, journalists, policy analysts, and others interested in the history of human longevity. The project began as an outgrowth of earlier projects in the Department of Demography at the University of California, Berkeley, USA, and at the Max Planck Institute for Demographic Research in Rostock, Germany. It is the work of two teams of researchers in the USA and Germany, with the help of financial backers and scientific collaborators from around the world.
  • Westbury Usenet Archive: a collection of public USENET postings gathered between Oct 2005 and Jan 2011, covering 47,860 English-language, non-binary-file newsgroups. Despite the collectors' best efforts, the corpus includes a very small number of non-English words, non-words, and spelling errors. It is untagged, raw text, and may need further processing into a format that suits your needs.
  • Twilio/Wigle.net Street Vector Data Set (MySQL, geo): a complete database of US street names and address ranges mapped to zip codes and latitude/longitude ranges, with DTMF key mappings for all street names.
  • US Economic Data 2003-2006 (stats): US economic data for 2003-2006 from the US Census Bureau -- raw census data (ACS 2002-2006).

==== Wikibench.eu Wikipedia Log traces ====

  • logs/wikibench_logtraces (470 GB)

==== Amazon Co-Purchasing Data ====

==== Patents ====

==== Marvel Universe Social Graph ====

  • 1 GB
  • graph
  • Social collaboration network of the Marvel comic book universe based on co-appearances.

==== Google Books Ngrams ====

==== Common Crawl web corpus ====

http://aws.amazon.com/datasets/41740

s3://aws-publicdatasets/common-crawl/crawl-002

A corpus of web crawl data composed of 5 billion web pages. This data set is freely available on Amazon S3 and formatted in the ARC (.arc) file format.

Details

  • Size: 60 TB
  • Source: Common Crawl Foundation -- http://commoncrawl.org
  • Created On: February 15, 2012 2:23 AM GMT
  • Last Updated: February 15, 2012 2:23 AM GMT
  • Available at: s3://aws-publicdatasets/common-crawl/crawl-002/


Common Crawl is a non-profit organization that builds and maintains an open repository of web crawl data for the purpose of driving innovation in research, education and technology. This data set contains web crawl data from 5 billion web pages and is released under the Common Crawl Terms of Use.

The ARC (.arc) file format used by Common Crawl was developed by the Internet Archive to store their archived crawl data. It is essentially a multi-part gzip file, with each entry in the master gzip (ARC) file being an independent gzip stream in itself. You can use a tool like zcat to spill the contents of an ARC file to stdout. For more information see the Internet Archive's Arc File Format description.
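
To process the format programmatically, the sketch below walks the independent gzip members of an ARC file in Python; the file name is hypothetical, and a production version should stream the file rather than read it all into memory.

    import zlib

    def arc_members(path):
        """Yield the decompressed bytes of each gzip member in an ARC file."""
        with open(path, "rb") as f:
            data = f.read()  # fine for a sketch; stream incrementally in real code
        pos = 0
        while pos < len(data):
            d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # 16 => gzip wrapper
            member = d.decompress(data[pos:]) + d.flush()
            yield member
            pos = len(data) - len(d.unused_data)  # jump past this member's end

    for member in arc_members("sample.arc.gz"):   # hypothetical file name
        header = member.split(b"\n", 1)[0]        # ARC record header line
        print(header.decode("utf-8", "replace"))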

Common Crawl provides the glue code required to launch Hadoop jobs on Amazon Elastic MapReduce that can run against the crawl corpus residing here in the Amazon Public Data Sets. By utilizing Amazon Elastic MapReduce to access the S3 resident data, end users can bypass costly network transfer costs.

To learn more about Amazon Elastic MapReduce please see the product detail page.

Common Crawl's Hadoop classes and other code can be found in its GitHub repository.

A tutorial for analyzing Common Crawl's dataset with Amazon Elastic MapReduce called MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl may be found on the Common Crawl blog.

==== Apache Software Foundation Public Mail Archives ====

==== Reference Energy Disaggregation Dataset (REDD) ====

http://redd.csail.mit.edu/[Reference Energy Disaggregation Data Set]

Initial REDD Release, Version 1.0

This is the home page for the REDD data set. Below you can download an initial version of the data set, containing several weeks of power data for 6 different homes, and high-frequency current/voltage data for the main power supply of two of these homes. The data itself and the hardware used to collect it are described more thoroughly in the Readme below and in the paper:

J. Zico Kolter and Matthew J. Johnson. REDD: A public data set for energy disaggregation research. In Proceedings of the SustKDD Workshop on Data Mining Applications in Sustainability, 2011.

Those wishing to use the dataset in academic work should cite this paper as the reference. Although the data set is freely available, for the time being the authors ask those interested in downloading the data to email them (kolter@csail.mit.edu) to receive the username/password for the download. See the readme.txt file for a full description of the different downloads and their formats.

==== The Book-Crossing dataset ====

  • http://www.informatik.uni-freiburg.de/~cziegler/BX/[Book Crossing] -- collected by Cai-Nicolas Ziegler in a 4-week crawl (August/September 2004) from the Book-Crossing community, with kind permission from Ron Hornbaker, CTO of Humankind Systems. Contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit/implicit) about 271,379 books. Freely available for research use when acknowledged with the following reference (further details on the dataset are given in this publication): Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen, "Improving Recommendation Lists Through Topic Diversification", Proceedings of the 14th International World Wide Web Conference (WWW '05), May 10-14, 2005, Chiba, Japan. As a courtesy, the collector asks that you let him know your name, your research group, and any publications that result.

The Book-Crossing dataset comprises 3 tables (a loading sketch follows the list).

  • BX-Users: Contains the users. Note that user IDs (User-ID) have been anonymized and map to integers. Demographic data is provided (Location, Age) if available. Otherwise, these fields contain NULL-values.
  • BX-Books: Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (Book-Title, Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (Image-URL-S, Image-URL-M, Image-URL-L), i.e., small, medium, large. These URLs point to the Amazon web site.
  • BX-Book-Ratings: Contains the book rating information. Ratings (Book-Rating) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.
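
A minimal sketch of loading and joining the three tables with pandas; the file names (BX-Users.csv, BX-Books.csv, BX-Book-Ratings.csv), the semicolon separator, and the Latin-1 encoding are assumptions about the CSV dump, so check them against your download.

    import pandas as pd

    def read_bx(name):
        # Assumed: semicolon-separated, Latin-1 files; skip the few malformed rows.
        return pd.read_csv(name, sep=";", encoding="latin-1", on_bad_lines="skip")

    users   = read_bx("BX-Users.csv")         # User-ID, Location, Age
    books   = read_bx("BX-Books.csv")         # ISBN, Book-Title, Book-Author, ...
    ratings = read_bx("BX-Book-Ratings.csv")  # User-ID, ISBN, Book-Rating

    explicit = ratings[ratings["Book-Rating"] > 0]   # 0 marks implicit ratings
    rated = explicit.merge(books, on="ISBN").merge(users, on="User-ID")
    print(rated.groupby("Book-Title")["Book-Rating"].mean().nlargest(10))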

==== Westbury Usenet Archive ====

  • http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html[Westbury Usenet Archive] -- USENET corpus (2005-2010). This corpus is a collection of public USENET postings, collected between Oct 2005 and Jan 2011 and covering 47,860 English-language, non-binary-file newsgroups. Despite the collectors' best efforts, it includes a very small number of non-English words, non-words, and spelling errors. The corpus is untagged, raw text; it may be necessary to process it further to put it in a format that suits your needs.

==== Million Song Dataset ====

The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

Its purposes are:

  • To encourage research on algorithms that scale to commercial sizes
  • To provide a reference dataset for evaluating research
  • As a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's)
  • To help new researchers get started in the MIR field

The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Note, however, that sample audio can be fetched from services like 7digital, using code we provide.

The Million Song Dataset is also a cluster of complementary datasets contributed by the community:

  • SecondHandSongs dataset: cover songs
  • musiXmatch dataset: lyrics
  • Last.fm dataset: song-level tags and similarity
  • Taste Profile subset: user data

Fields

From the original documentation:

Field name                    Type             Description
analysis sample rate          float            sample rate of the audio used
artist 7digitalid             int              ID from 7digital.com or -1
artist familiarity            float            algorithmic estimation
artist hotttnesss             float            algorithmic estimation
artist id                     string           Echo Nest ID
artist latitude               float            latitude
artist location               string           location name
artist longitude              float            longitude
artist mbid                   string           ID from musicbrainz.org
artist mbtags                 array string     tags from musicbrainz.org
artist mbtags count           array int        tag counts for musicbrainz tags
artist name                   string           artist name
artist playmeid               int              ID from playme.com, or -1
artist terms                  array string     Echo Nest tags
artist terms freq             array float      Echo Nest tags freqs
artist terms weight           array float      Echo Nest tags weight
audio md5                     string           audio hash code
bars confidence               array float      confidence measure
bars start                    array float      beginning of bars, usually on a beat
beats confidence              array float      confidence measure
beats start                   array float      result of beat tracking
danceability                  float            algorithmic estimation
duration                      float            in seconds
end of fade in                float            seconds at the beginning of the song
energy                        float            energy from listener point of view
key                           int              key the song is in
key confidence                float            confidence measure
loudness                      float            overall loudness in dB
mode                          int              major or minor
mode confidence               float            confidence measure
release                       string           album name
release 7digitalid            int              ID from 7digital.com or -1
sections confidence           array float      confidence measure
sections start                array float      largest grouping in a song, e.g. verse
segments confidence           array float      confidence measure
segments loudness max         array float      max dB value
segments loudness max time    array float      time of max dB value, i.e. end of attack
segments loudness start       array float      dB value at onset
segments pitches              2D array float   chroma feature, one value per note
segments start                array float      musical events, ~ note onsets
segments timbre               2D array float   texture features (MFCC+PCA-like)
similar artists               array string     Echo Nest artist IDs (sim. algo. unpublished)
song hotttnesss               float            algorithmic estimation
song id                       string           Echo Nest song ID
start of fade out             float            time in sec
tatums confidence             array float      confidence measure
tatums start                  array float      smallest rhythmic element
tempo                         float            estimated tempo in BPM
time signature                int              estimate of number of beats per bar, e.g. 4
time signature confidence     float            confidence measure
title                         string           song title
track id                      string           Echo Nest track ID
track 7digitalid              int              ID from 7digital.com or -1
year                          int              song release year from MusicBrainz or 0

An Example Track Description

Below is a list of all the fields associated with each track in the database. This is simply an annotated version of the output of the example code display_song.py. For the fields that include a large amount of numerical data, we indicate only the shape of the data array. Since most of these fields are taken directly from the Echo Nest Analyze API, more details can be found at the Echo Nest Analyze API documentation.

A more technically-oriented list of these fields is given on the field list page.

This example data is shown for the track whose track_id is TRAXLZU12903D05F94 - namely, "Never Gonna Give You Up" by Rick Astley.

artist_mbid:                    db92a151-1ac2-438b-bc43-b82e149ddd50            the musicbrainz.org ID for this artists is db9...
artist_mbtags:                  shape = (4,)                                    this artist received 4 tags on musicbrainz.org
artist_mbtags_count:            shape = (4,)                                    raw tag count of the 4 tags this artist received on musicbrainz.org
artist_name:                    Rick Astley                                     artist name
artist_playmeid:                1338                                            the ID of that artist on the service playme.com
artist_terms:                   shape = (12,)                                   this artist has 12 terms (tags) from The Echo Nest
artist_terms_freq:              shape = (12,)                                   frequency of the 12 terms from The Echo Nest (number between 0 and 1)
artist_terms_weight:            shape = (12,)                                   weight of the 12 terms from The Echo Nest (number between 0 and 1)
audio_md5:                      bf53f8113508a466cd2d3fda18b06368                hash code of the audio used for the analysis by The Echo Nest
bars_confidence:                shape = (99,)                                   confidence value (between 0 and 1) associated with each bar by The Echo Nest
bars_start:                     shape = (99,)                                   start time of each bar according to The Echo Nest, this song has 99 bars
beats_confidence:               shape = (397,)                                  confidence value (between 0 and 1) associated with each beat by The Echo Nest
beats_start:                    shape = (397,)                                  start time of each beat according to The Echo Nest, this song has 397 beats
danceability:                   0.0                                             danceability measure of this song according to The Echo Nest (between 0 and 1, 0 => not analyzed)
duration:                       211.69587                                       duration of the track in seconds
end_of_fade_in:                 0.139                                           time of the end of the fade in, at the beginning of the song, according to The Echo Nest
energy:                         0.0                                             energy measure (not in the signal processing sense) according to The Echo Nest (between 0 and 1, 0 => not analyzed)
key:                            1                                               estimation of the key the song is in by The Echo Nest
key_confidence:                 0.324                                           confidence of the key estimation
loudness:                       -7.75                                           general loudness of the track
mode:                           1                                               estimation of the mode the song is in by The Echo Nest
mode_confidence:                0.434                                           confidence of the mode estimation
release:                        Big Tunes - Back 2 The 80s                      album name from which the track was taken, some songs / tracks can come from many albums, we give only one
release_7digitalid:             786795                                          the ID of the release (album) on the service 7digital.com
sections_confidence:            shape = (10,)                                   confidence value (between 0 and 1) associated with each section by The Echo Nest
sections_start:                 shape = (10,)                                   start time of each section according to The Echo Nest, this song has 10 sections
segments_confidence:            shape = (935,)                                  confidence value (between 0 and 1) associated with each segment by The Echo Nest
segments_loudness_max:          shape = (935,)                                  max loudness during each segment
segments_loudness_max_time:     shape = (935,)                                  time of the max loudness during each segment
segments_loudness_start:        shape = (935,)                                  loudness at the beginning of each segment
segments_pitches:               shape = (935, 12)                               chroma features for each segment (normalized so max is 1.)
segments_start:                 shape = (935,)                                  start time of each segment (~ musical event, or onset) according to The Echo Nest, this song has 935 segments
segments_timbre:                shape = (935, 12)                               MFCC-like features for each segment
similar_artists:                shape = (100,)                                  a list of 100 artists (their Echo Nest ID) similar to Rick Astley according to The Echo Nest
song_hotttnesss:                0.864248830588                                  according to The Echo Nest, when downloaded (in December 2010), this song had a 'hotttnesss' of 0.8 (on a scale of 0 and 1)
song_id:                        SOCWJDB12A58A776AF                              The Echo Nest song ID, note that a song can be associated with many tracks (with very slight audio differences)
start_of_fade_out:              198.536                                         start time of the fade out, in seconds, at the end of the song, according to The Echo Nest
tatums_confidence:              shape = (794,)                                  confidence value (between 0 and 1) associated with each tatum by The Echo Nest
tatums_start:                   shape = (794,)                                  start time of each tatum according to The Echo Nest, this song has 794 tatums
tempo:                          113.359                                         tempo in BPM according to The Echo Nest
time_signature:                 4                                               time signature of the song according to The Echo Nest, i.e. usual number of beats per bar
time_signature_confidence:      0.634                                           confidence of the time signature estimation
title:                          Never Gonna Give You Up                         song title
track_7digitalid:               8707738                                         the ID of this song on the service 7digital.com
track_id:                       TRAXLZU12903D05F94                              The Echo Nest ID of this particular track on which the analysis was done
year:                           1987                                            year when this song was released, according to musicbrainz.org
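
A minimal sketch of pulling a few of these fields out of one track file with h5py; the HDF5 layout assumed here (compound tables under /metadata/songs and /analysis/songs, arrays such as /analysis/segments_pitches) should be checked against the dataset's official hdf5_getters.py helpers, which are the authoritative way to read these files.

    import h5py

    # Each MSD track ships as one HDF5 file, e.g. TRAXLZU12903D05F94.h5.
    with h5py.File("TRAXLZU12903D05F94.h5", "r") as h5:
        meta = h5["/metadata/songs"][0]      # one-row compound table: scalar metadata
        analysis = h5["/analysis/songs"][0]  # one-row compound table: scalar analysis
        print(meta["artist_name"].decode(), "-", meta["title"].decode())
        print("tempo:", analysis["tempo"], "BPM; duration:", analysis["duration"], "s")
        pitches = h5["/analysis/segments_pitches"][:]   # (num_segments, 12) chroma
        print("segments:", pitches.shape[0])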

==== Google / Stanford Crosswiki ====

http://www-nlp.stanford.edu/pubs/crosswikis-data.tar.bz2/[wikipedia_words]

This data set accompanies

Valentin I. Spitkovsky and Angel X. Chang. 2012. A Cross-Lingual Dictionary for English Wikipedia Concepts. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012).

Please cite the appropriate publication if you use this data. (See http://nlp.stanford.edu/publications.shtml for .bib entries.)

There are six line-based (and two other) text files, each of them lexicographically sorted, encoded with UTF-8, and compressed using bzip2 (-9). One way to view the data without fully expanding it first is with the bzcat command, e.g.,

bzcat dictionary.bz2 | grep ... | less

Note that raw data were gathered from heterogeneous sources, at different points in time, and are thus sometimes contradictory. We made a best effort at reconciling the information, but likely also introduced some bugs of our own, so be prepared to write fault-tolerant code... keep in mind that even tiny error rates translate into millions of exceptions, over billions of datums.
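
In that spirit, here is a minimal fault-tolerant reading loop. The column layout it assumes (a tab-separated anchor string followed by a space-separated score and concept) is only an illustration; consult the README that accompanies the data for the real format.

    import bz2

    good, bad = 0, 0
    with bz2.open("dictionary.bz2", "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            try:
                anchor, rest = line.rstrip("\n").split("\t", 1)  # assumed layout
                score, concept = rest.split(" ", 1)
                float(score)          # malformed scores raise ValueError here
                good += 1
            except ValueError:
                bad += 1              # tolerate and count; don't crash
    print(f"parsed {good} lines, skipped {bad} malformed")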

==== English Gigaword Dataset (LDC) ====

The http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T13[English Gigaword] corpus, now being released in its fourth edition, is a comprehensive archive of newswire text data that has been acquired over several years by the LDC at the University of Pennsylvania. The fourth edition includes all of the contents in English Gigaword Third Edition (LDC2007T07) plus new data covering the 24-month period of January 2007 through December 2008. Portions of the dataset are © 1994-2008 Agence France Presse, © 1994-2008 The Associated Press, © 1997-2008 Central News Agency (Taiwan), © 1994-1998, 2003-2008 Los Angeles Times-Washington Post News Service, Inc., © 1994-2008 New York Times, © 1995-2008 Xinhua News Agency, © 2009 Trustees of the University of Pennsylvania. The six distinct international sources of English newswire included in this edition are the following:

  • Agence France-Presse, English Service (afp_eng)
  • Associated Press Worldstream, English Service (apw_eng)
  • Central News Agency of Taiwan, English Service (cna_eng)
  • Los Angeles Times/Washington Post Newswire Service (ltw_eng)
  • New York Times Newswire Service (nyt_eng)
  • Xinhua News Agency, English Service (xin_eng)

For an example of the data in this corpus, please review http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC2009T13.html[this sample file].

=== Sources of Public and Commercial Data

((data_commons))

  • Infochimps
  • Factual
  • CKAN
  • Get.theinfo
  • Microsoft Azure Data Marketplace