These instructions are specifically for OSX and *.nix users who have access to Bash. If you're on Windows...you should probably just use Microsoft Access...
For a more convoluted example of how to use mdbtools to automate the conversion of an Access database, you can see this example I posted with the Florida prisons database (not recommended for Bash novices)
To just get the data as CSV, you can download it from here:
http://stash.compjour.org/samples/faa-wildlife-strikes.zip
A sample of the CSV data is attached to this gist: sample-strikes.csv.
The FAA Wildlife Strike database (homepage) contains self-reported incidents in which aircraft have run into wildlife (e.g. birds -- see the gallery here). The data fields includes the type of aircraft, animal, nearest airport, and descriptions of the resulting damage. At the time of writing, the database included ~175,000 reports from as far back as 1990.
The FAA has a web front end for its Wildlife Strike database at: http://wildlife.faa.gov/database.aspx
The website allows export-to-Excel of selections of the data. To get the entire database, you either have to write a webform-scraper. Or download it in (zipped) Microsoft Access format.
The direct URL to the zipped database is: http://wildlife.faa.gov/downloads/wildlife.zip
To download and unpack the archive:
$ curl -O http://wildlife.faa.gov/downloads/wildlife.zip
$ unzip wildlife.zip
Two files are unpacked: wildlife.accdb and read_me.xls. The latter contains the data schema for the 94 fields, which I've listed in a separate file in this gist: read_me-schema-csv.
Now, onto the Access-to-CSV conversion process...
The github repo for mdbtools has in-depth instructions: https://github.com/brianb/mdbtools
If you're a Mac user and you've installed Homebrew, installing mdbtools is as easy as:
brew install mdbtools
Once mdbtools is installed, you'll have command-line access to several utilities. The ones we care about specifically are mdb-tables
and mdb-export
.
Like any database system, Access allows for the possibility of multiple tables. Ideally, it would seem that the wildlife strike database would only have one table...but let's use mdb-tables to find out:
$ mdb-tables -1 wildlife.accdb
(the -1
flag will put each table name on its own line)
The output is:
STRIKE_REPORTS (1990-1999)
STRIKE_REPORTS (2000-2009)
STRIKE_REPORTS (2010-Current)
STRIKE_REPORTS_BASH (1990-Current)
Um, OK...seems like the data is split into three time periods, each separately represented by table (e.g. "STRIKE_REPORTS (2010-Current)"
)...and then there's "STRIKE_REPORTS_BASH (1990-Current)"
...which covers the entire period.
You can use the mdb-schema tool to eyeball-check that the schemas for each table are nominally the same:
$ mdb-schema wildlife.accdb
To keep things simple, we'll hope that the nature of the data is such that we can dump all 4 tables into one.
Extracting the data as CSV with mdb-export is as simple as:
$ mdb-export wildlife.accdb 'STRIKE_REPORTS (1990-1999)'
The mdb-export command by default will print the header row. Adding the -H
flag will suppress the printing of the headers. There's also a -D
flag, to specify the date format using strftime.
The following loop exports each table into its own file:
$ mdb-tables -1 wildlife.accdb | while read -r tablename; do
fname=$(echo "$tablename" | tr -d '()' | tr ' ' '_').csv
echo "Extracting $tablename to $fname"
mdb-export -D '%Y-%m-%d %H:%M:%S' wildlife.accdb "$tablename" > "$fname"
done
To get a quick line count of each CSV:
$ wc -l *.csv
30085 STRIKE_REPORTS_1990-1999.csv
70132 STRIKE_REPORTS_2000-2009.csv
66203 STRIKE_REPORTS_2010-Current.csv
8047 STRIKE_REPORTS_BASH_1990-Current.csv
174467 total
To get everything as one big CSV named faa-wildlife-strikes.csv
:
# export the headers of the smallest file to set things up; suppress headers elsewhere
$ allname="faa-wildlife-strikes.csv"
$ mdb-export wildlife.accdb 'STRIKE_REPORTS_BASH (1990-Current)' | head -n1 > "$allname"
# then do the loop as before, accept append to $allname
$ mdb-tables -1 wildlife.accdb | while read -r tablename; do
echo "Extracting $tablename to $allname"
mdb-export -H -D '%Y-%m-%d %H:%M:%S' wildlife.accdb \
"$tablename" >> "$allname"
done
The data is, to put it mildly, not normalized. Play around with it using csvkit to get a feel for it.
Here's how to get a frequency count (in descending order) of the SPECIES
column:
$ csvcut -c SPECIES faa-wildlife-strikes.csv | tail -n +2 | sort | uniq -c | sort -rn
(note: tail -n +2
is just an inefficient way to skip the header line)
The top ten results (I've attached the full list to the gist: species-count.txt:
38531 Unknown bird - medium
32981 Unknown bird - small
7653 Mourning dove
6580 Gulls
6400 Unknown bird
4562 Killdeer
4476 American kestrel
4215 Barn swallow
3952 Horned lark
3915 European starling
Some of the other fields have cleaner, more consistent enumerations, but even then, expect to deal with missing data. Here are the top 20 combinations of operator and aircraft type:
$ csvcut -c OPERATOR,ATYPE faa-wildlife-strikes.csv | tail -n +2 | sort | uniq -c | sort -rn | head -n 20
38867 UNKNOWN,UNKNOWN
6734 SOUTHWEST AIRLINES,B-737-700
4862 SOUTHWEST AIRLINES,B-737-300
2340 UNITED AIRLINES,B-737-300
2320 UNITED AIRLINES,A-320
2052 AMERICAN AIRLINES,MD-82
1895 UNITED AIRLINES,B-757-200
1881 JETBLUE AIRWAYS,A-320
1716 AMERICAN AIRLINES,MD-80
1592 SKYWEST AIRLINES,CRJ100/200
1475 AMERICAN AIRLINES,B-737-800
1433 UPS AIRLINES,A-300
1346 DELTA AIR LINES,MD-88
1313 UPS AIRLINES,B-757-200
1271 UNITED AIRLINES,A-319
1240 BUSINESS,C-172
1226 FEDEX EXPRESS,A-300
1181 AMERICAN EAGLE AIRLINES,EMB-145
1178 UNITED AIRLINES,B-737-500
1124 FEDEX EXPRESS,B-727-200
The database seems to not contain up-to-date information (i.e. just part of 2015, none of 2016):
$ csvcut -c REPORTED_DATE faa-wildlife-strikes.csv | tail -n +2 | cut -c -4 | sort | uniq -c
Note that the above command does not capture the blank lines:
1 1006
1278 1990
1376 1991
1048 1992
822 1993
723 1994
871 1995
986 1996
1374 1997
1457 1998
1517 1999
4835 2000
2929 2001
5101 2002
3786 2003
3283 2004
3133 2005
1814 2006
1288 2007
1188 2008
1181 2009
862 2010
874 2011
501 2012
430 2013
252 2014
132 2015
The data seems to cut off sometime in October 2015:
$ csvcut -c REPORTED_DATE faa-wildlife-strikes.csv | tail -n +2 | cut -c -7 | sort | uniq -c | tail -n 20
20 2014-03
20 2014-04
57 2014-05
11 2014-06
21 2014-07
21 2014-08
28 2014-09
26 2014-10
10 2014-11
6 2014-12
5 2015-01
5 2015-02
13 2015-03
18 2015-04
17 2015-05
10 2015-06
19 2015-07
18 2015-08
23 2015-09
4 2015-10