dannguyen/fec-indiv2016-simplified.md

## fec-indiv2016-simplified.md

      
    Raw
  

              fec-indiv2016-simplified.md
            
          
    Fitting 2016 FEC campaign donor data into your humble spreadsheet

The 2016 U.S. Election is so hot and rich that even with a month to go, the Federal Election Commission has recorded 12.6 million individual campaign donations.

For comparison's sake, the 2014 cycle had just 2.1 million donors. And the last presidential cycle, 2012, had 3.3 million donors.
The bottom line is that we can't import all the data directly into Excel (nevermind Google Sheets). But we can at least import the subset of data that we need (under a million rows) without having to crack out a database.
This tutorial assumes you have csvkit installed and know your way around the command-line. The typing in Windows will probably be different. The parts with curl and unzip can be done with however you managed to download files and unzip them via point and click.
Download the data, compile into a single CSV file

The steps here download the indiv16.zip file from the FEC, unzips it, combines it with headers, and makes a new comma-delimited file named all-indiv2016.csv. Which is too big for Excel, but we'll filter it later.
The FEC has separate URL for the headers and for the data, which is a zip file at:
ftp://ftp.fec.gov/FEC/2016/indiv16.zip
And, for whatever reason, the headers are comma-delimited but the data file is pipe-delimted.
# Download zip file and save as indiv16.zip
curl ftp://ftp.fec.gov/FEC/2016/indiv16.zip \
  -o indiv16.zip

# Unzip zip file 
unzip indiv16.zip
# This produces itcont.txt


# Create a new text file named `all-indiv2016.csv`, starting with 
# the headers that the FEC provides
curl http://www.fec.gov/finance/disclosure/metadata/indiv_header_file.csv \
  -o all-indiv2016.csv

# Because the `itcont.txt` file is pipe-delimited, we use csvformat -d
# to read it with pipes, and spit it out with commas
# and then append w/ >> to all-indiv2016.csv
csvformat -d '|' itcont.txt >> all-indiv2016.csv
Filter big data file by committee ID

Every candidate has one primary campaign committee, and up to several other committees. Let's keep things simple and focus on Republican candidate Donald Trump.
Here's his FEC candidate page.
His primary committee, DONALD J. TRUMP FOR PRESIDENT, INC., has an ID of C00580100.
So, we want to filter all-indiv2016.csv by whatever column has committee IDs. Now's a good time to check out the dataset's data dictionary.
The very first column CMTE_ID seems to have what we want:

Filter CSVs by column with csvgrep

The csvgrep utility let's us specify a column name, e.g. CMTE_ID, and then, with the -r flag, specify a regex pattern that we want to match against the value of CMTE_ID
(if you just want a literal match, -m works fine too)
Making trump-indiv2016.csv

Here's the command to read all-indiv2016.csv and filter it for Trump's committee, and then to send the matching rows into a file named trump-indiv2016.csv:
csvgrep -c 'CMTE_ID' \
        -r 'C00580100' all-indiv2016.csv \
          > trump-indiv2016.csv
A couple of warnings: this will take awhile (a literal match might be faster), and it probably won't capture all of the donors to Trump, if there are any other committees affiliated with him.
Making trump-indiv2016.csv

Here's the command to read all-indiv2016.csv and filter it for Trump's committee, and then to send the matching rows into a file named trump-indiv2016.csv:
csvgrep -c 'CMTE_ID' \
        -r 'C00580100' all-indiv2016.csv \
          > trump-indiv2016.csv
A couple of warnings: this will take awhile (a literal match might be faster), and it probably won't capture all of the donors to Trump, if there are any other committees affiliated with him.
Using wc, I get about 73,000 records:
wc -l trump-indiv2016.csv
Getting Trump and Clinton together

We could repeat the process for each of the other presidential candidates, which requires looking up each individual committee ID.
Clinton's committee ID is C00575795.
Here's how to use a regex to get them both in the same file:
csvgrep -c 'CMTE_ID' \
        -r 'C00580100|C00575795' all-indiv2016.csv \
          > trumphill-indiv2016.csv
Clinton's donors add more than 860,000 records to the file. Still small enough to open in Excel, but barely.
Hollywood baby

Knowing the data dictionary better will increase the ways you can filter the data to your interests.
Let's try updating this Los Angeles Times story from 2015: List The Hollywood heavyweights funding the 2016 presidential fight

It's pretty difficult to generate a list of all the names of possible entertainment-donors. So let's just search by name for celebrities we know have donated, and see what they look like in the data.
First, let's create a new file that contains only donors with a STATE of CA or NY:
csvgrep -c 'STATE' -r 'CA|NY' all-indiv2016.csv > nyca-indiv2016.csv
Those two states have 2.8 million records alone.
In this localized data, let's do a name search for a few people listed in the LAT story, just to see how they describe their occupations. In the example below, I chain a couple of other csvkit commands to trim the columns and make it easier to read:
 csvgrep -c 'NAME' \
         -r 'DEVITO, DAN|BRUCKHEIMER, JERRY|SPIELBERG, ST' \
         nyca-indiv2016.csv  \
  | csvcut -c 'CMTE_ID,NAME,STATE,EMPLOYER,OCCUPATION,TRANSACTION_AMT' \
  | csvlook
In the example below, I'm going to throw in a csvcut so that only a few columns are shown, and a csvlook to make things look nice.
Here's a sample of the results:


CMTE_ID
NAME
STATE
EMPLOYER
OCCUPATION
TRANSACTION_AMT


C00311944
SPIELBERG, STEVEN
CA
FREELANCE
DIRECTOR
1200


C00068353
SPIELBERG, STEVEN
CA
SELF
DIRECTOR
2700


C00068353
SPIELBERG, STEVEN
CA
SELF
DIRECTOR
2700


C00578658
DEVITO, DANNY
CA
N/A
ACTOR
2700


C00540195
DEVITO, DANNY
CA
SELF EMPLOYED
ACTOR
5000


C00492991
SPIELBERG, STEVEN
CA
SELF-EMPLOYED
DIRECTOR
2700


C00608497
BRUCKHEIMER, JERRY
CA
JERRY BRUCKHEIMER FILMS
PRODUCER
5400


C00264101
SPIELBERG, STEVEN
CA
DREAMWORKS SKG ENTERTAINMENT
PRINCIPAL PARTNER
2700


C00258475
SPIELBERG, STEVEN
CA
DREAMWORKS SKG
FILM EXECUTIVE
2700


C00346312
SPIELBERG, STEVEN
CA
SELF EMPLOYED
DIRECTOR/PRODUCER
2700


Looks like a variety of fields. Doing a complete search for entertainment-types will require some manual labor, but for now, here's one pattern for scooping many of them up. This time, we focus on OCCUPATION, and dump the results into fun-indiv2016.csv
In the code below, I pipe into csvcut so I can filter out unnecessary columns -- else, the full layout causes Google Sheets to choke:
 csvgrep -c 'OCCUPATION' \
         -r 'ACTOR|ACTRESS|DIRECTOR|PRODUCER|FILM' \
         nyca-indiv2016.csv  \
  | csvcut -c 'CMTE_ID,NAME,CITY,STATE,ZIP_CODE,EMPLOYER,OCCUPATION,TRANSACTION_DT,TRANSACTION_AMT' \
  > fun-indiv2016.csv
This filter produces 57,000+ rows. Here it is as a Google Spreadsheet.
Stars of the stars

Looks like filtering on "DIRECTOR" was too loose of a filter, as we can see a few SPACEX folks in the data:

Let's do a pivot table that includes only "ACTOR/ACTRESS"-types, and then group along the CMTE_ID to get a summation of donations per committee:

So who do actors and actresses give their money to? You'll just have to look it up individually. Or learn a little SQL.
The individual contributions table doesn't have the "label" data that would make it easy to which committee belongs to which candidate. We can't even filter it by party. It's easy enough to just look the FEC data on committee IDs, but it's obviously not scalable for wide-range analysis. Experiencing this difficulty will help you understand why SQL is so powerful.
CMTE_ID	NAME	STATE	EMPLOYER	OCCUPATION	TRANSACTION_AMT
C00311944	SPIELBERG, STEVEN	CA	FREELANCE	DIRECTOR	1200
C00068353	SPIELBERG, STEVEN	CA	SELF	DIRECTOR	2700
C00068353	SPIELBERG, STEVEN	CA	SELF	DIRECTOR	2700
C00578658	DEVITO, DANNY	CA	N/A	ACTOR	2700
C00540195	DEVITO, DANNY	CA	SELF EMPLOYED	ACTOR	5000
C00492991	SPIELBERG, STEVEN	CA	SELF-EMPLOYED	DIRECTOR	2700
C00608497	BRUCKHEIMER, JERRY	CA	JERRY BRUCKHEIMER FILMS	PRODUCER	5400
C00264101	SPIELBERG, STEVEN	CA	DREAMWORKS SKG ENTERTAINMENT	PRINCIPAL PARTNER	2700
C00258475	SPIELBERG, STEVEN	CA	DREAMWORKS SKG	FILM EXECUTIVE	2700
C00346312	SPIELBERG, STEVEN	CA	SELF EMPLOYED	DIRECTOR/PRODUCER	2700