Skip to content

Instantly share code, notes, and snippets.

@marionhalftermeyer
Created January 21, 2014 14:59
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save marionhalftermeyer/8541631 to your computer and use it in GitHub Desktop.
Save marionhalftermeyer/8541631 to your computer and use it in GitHub Desktop.
Data Visualization II: Final Assignment Marion Halftermeyer
Goal
For my final project with R I wanted to practice creating Decision Trees. I decided to try to recreate what Jeff Larson and ProPublica did with political microtargeting of campaign email messages to constituents during the most recent presidential elections. I am using this exercise as a way to make use of all the knowledge in R that I gained over the course of the class. This exercise in particular and the idea of recreating Larson’s work will enable me to use many functions and ways of pulling data to make sense of it
Code
#Making the Decision Tree
#Merging the separate data frames of data to create one large data frame with #demographics, mailing ids and cluster ids.
cu=merge(clusters_users, users, by.x="user_id", by.y="id")
cu2=merge(cu, clusters[,c("cluster_id","mailing_id")], by.x="cluster_id", by.y="cluster_id")
#Adds in donation amount--how much on average the user was asked to donate by the campaign.
cu3=merge(cu2, users_dollar_amount, by.x="user_id", by.y="user_id")
#Will need to make a decision tree for a specific mailing.
#Mailings found to be targeted by Larson by mailing id number:
# 78, 369, 377, 1001, 1056, 1136, 1281, 1295, 1404, 1737, 1791, 2032, 2059, 2240, 2299,
# 2396, 2964, 3104, 3110, 3112, 3311, 3359, 3475, 3526
# Starting with mailing 1001
versions1001 = clusters[clusters$mailing_id == 1001,]
#There are 9 versions/clusters.
users1001 = clusters_users[clusters_users$cluster_id %in% versions1001$id,]
dollar_amounts_camp2= users_dollar_amount[users_dollar_amount$campaign_id==2,]
data1001 = merge(users1001,users,by.x="user_id",by.y="id")
data1001 = merge(data1001, dollar_amounts_camp2, by.x="user_id", by.y="user_id")
#Looking at the data by cluster
cluster19545=data1001[data1001$cluster_id == 19545,]
#4 out of the 5 are grad school educated. The other is college. 3 are from New York. All but one are above 50K in household income.
# All but one are likely voters. The one who is written down as false is not a registered voter. All but on are white, the one declined to state.
cluster19546=data1001[data1001$cluster_id==19546,]
# Four people: all white, all likely voters, all didnt attend any events,3 grad school, 1 college
cluster19547=data1001[data1001$cluster_id == 19547,]
# 89 total: Mostly white, all went to at least college, some have grad school, (a few have high school)
racewhite=cluster19547[cluster19547$race== "White",]
#74 out of 89 are white for this particular cluster of the mailing 1001.
educationcollege=cluster19547[cluster19547$education== "College",]
educationgrad=cluster19547[cluster19547$education== "Grad School",]
educationhigh=cluster19547[cluster19547$education=="High School",]
educationsome=cluster19547[cluster19547$education== "Some College",]
# 35 have college level education, 36 have grad school, 3 have high school, 7 have some college.
eventattendee=cluster19547[cluster19547$event_attendee== "false",]
#73 of the 89 have not attended events.
recentdonor=cluster19547[cluster19547$recent_donor== "false",]
# 36 did not donate.
likelyvoter=cluster19547[cluster19547$likely_voter== "true",]
# 76 are likely voters.
cluster19548=data1001[data1001$cluster_id == 19548,]
# 18 people total: All likely voters, most have grad school.All white but one listed as two or more races.Majority are recent donors
# and most are full time workers.
cluster19549=data1001[data1001$cluster_id == 19549,]
cat(versions1001$clean_text[5])
#Only one person received this one. She is from California but there are 12 others in this mailing list that are from California
# Earns more than 150K but has not donated recently.
data1001$cluster_id[data1001$state == "California"]
cali2 = data1001$cluster_id[data1001$state == "California"]
cat(clusters$clean_text[clusters$id %in% cali2])
data1001[data1001$state=="California",]
#The 3 others from California that earn more than 150K have all donated recently.
cluster19550=data1001[data1001$cluster_id==19550,]
# 1 person from Ohio, grad school, married, white male, likely voter, donor, event attendee, competitive state
# There is 1 other person from Ohio in this mailing list (that person received cluster 19547)
cluster19551=data1001[data1001$cluster_id==19551,]
# 1 person from New Jersey (there are 9 others), white female, married, grad school, more than 150K in income,
# likely voter, event attendee, and a recent donor
data1001[data1001$state=="New Jersey",]
# All others from New Jersey earn less than 150K.
cluster19552=data1001[data1001$cluster_id==19552,]
# 1 person from Illinois, not registered to vote but is a recent donor, male, grad school,
data1001[data1001$state=="Illinois",]
cluster19553=data1001[data1001$cluster_id==19553,]
# 1 white male from New Jersey, single, unemployed, some college, likely voter, hasn't donated recently, and hasn't attended an event
# Determining factors for targeting according to Larson:
# Donor amount (or donation signal), state, age
# Based on my observations income, recent donorship and education seem to also be #factors.
#Changing date of birth into age for my specific data set
class(data1001$date_of_birth)
[1] "character"
tst = data1001$date_of_birth[1:10]
strptime(tst,"%Y-%m-%d")
[1] "1982-01-22" NA "1939-07-13" "1947-02-27" "1988-03-01" "1978-09-02" "1963-04-10"
[8] "1952-03-10" NA "1960-12-18"
tst
[1] "1982-01-22" "" "1939-07-13" "1947-02-27" "1988-03-01" "1978-09-02" "1963-04-10"
[8] "1952-03-10" "" "1960-12-18"
data1001$age = strptime(data1001$date_of_birth,"%Y-%m-%d")
data1001$age = strptime("2013-12-17","%Y-%m-%d") - strptime(data1001$date_of_birth,"%Y-%m-%d")
data1001$age[1:10]
Time differences in days
[1] 11652.00 NA 27186.04 24400.00 9422.00 12890.04 18514.00 22562.00 NA 19357.00
data1001$age = (strptime("2013-12-17","%Y-%m-%d") - strptime(data1001$date_of_birth,"%Y-%m-%d"))/365
data1001$age[1:10]
Time differences in days
[1] 31.92329 NA 74.48231 66.84932 25.81370 35.31518 50.72329 61.81370 NA 53.03288
data1001$age = trunc((strptime("2013-12-17","%Y-%m-%d") - strptime(data1001$date_of_birth,"%Y-%m-%d"))/365)
data1001$age[1:10]
Time differences in days
[1] 31 NA 74 66 25 35 50 61 NA 53
#Assigning numbers to all the states in the data.
cu3$statenum = factor(cu3$state)
cu3$statenum[1:10]
[1] New York New York New York New York New York New York New York New York New York New York
46 Levels: Alabama Alaska Arizona Arkansas California Colorado ... Wyoming
levels(cu3$statenum)
[1] "" "Alabama" "Alaska" "Arizona"
[5] "Arkansas" "California" "Colorado" "Connecticut"
[9] "District of Columbia" "Florida" "Georgia" "Hawaii"
[13] "Idaho" "Illinois" "Indiana" "Iowa"
[17] "Kansas" "Kentucky" "Louisiana" "Maryland"
[21] "Massachusetts" "Michigan" "Minnesota" "Missouri"
[25] "Montana" "Nevada" "New Jersey" "New Mexico"
[29] "New York" "North Carolina" "North Dakota" "Ohio"
[33] "Oklahoma" "Oregon" "Pennsylvania" "Rhode Island"
[37] "South Carolina" "Tennessee" "Texas" "Utah"
[41] "Vermont" "Virginia" "Washington" "West Virginia"
[45] "Wisconsin" "Wyoming"
cu3$statenum = factor(cu3$state)
#Now applying what I did to the states for my specific data
data1001$statenum = factor(data1001$state,levels=levels(cu3$statenum))
data1001$statenum[1:10]
[1] New York New York California Oregon California California Colorado
[9] Florida
46 Levels: Alabama Alaska Arizona Arkansas California Colorado ... Wyoming
as.numeric(data1001$statenum[1:10])
[1] 29 29 6 34 6 6 7 1 1 10
#To change education, need to factor it like the states
cu3$educnum= factor(cu3$education, levels=c("", "Decline to state", "Less than high school", "High School", "Some College", "College", "Grad School"))
levels(cu3$educnum)
data1001$educnum=factor(data1001$education, levels=levels(cu3$educnum))
#To change income level, need to factor it as well
cu3$incomenum=factor(cu3$household_income, levels=c("","Decline to State","0 to 35K", "35K to 50K", "50K to 75K","75K to 100K", "100K to 150K","More than 150K"))
data1001$incomenum=factor(data1001$household_income, levels=levels(cu3$incomenum))
#Changing recent donorship from character to factor
cu3$recent_donornum=factor(cu3$recent_donor, levels=c("","false","true"))
data1001$recent_donornum=factor(data1001$recent_donor, levels=levels(cu3$recent_donornum))
# Changing cluster id into a factor (so that it doesn't take the average of the numbers but reads them as characters not numbers)
data1001$cluster_id = factor(data1001$cluster_id)
#Taking out clusters that only have 1 person in them and cluster that has 89 (most generic email)
table(data1001$cluster_id)
19545 19546 19547 19548 19549 19550 19551 19552 19553
5 4 89 18 1 1 1 1 1
data2 = data1001[data1001$cluster_id %in% c("19545","19546","19548"),]
#Changing the cp means forcing r to make a bushier tree the smaller the number of cp
?rpart
?rpart.control
fit=rpart(cluster_id~, data=data2,cp=0.001)
plot(fit)
text(fit,use.n=T)
#Allows R to spread the tree past the outlines of the plotting space.
par(xpd=TRUE)
?par
#TESTING
library(rpart)
fit=rpart(cluster_id~., data=data2)
plot(fit)
text(fit,use.n=T)
data2 = data1001[data1001$cluster_id %in% c("19545","19546","19548"),]
fit=rpart(cluster_id~donation_amount+statenum+age, data=data2)
plot(fit, main="Donation Signal, State & Age (3 clusters)")
text(fit,use.n=T)
#donation signl was a split at $44.67. If true that it is greater, go right, if false go left.
# Users who were asked to donate large amounts all received 19548, this makes sense: The text below
# is what was in the email. They had all been previous donors too.
#Because you've saved your payment information, your donation will go through immediately:
#QUICK DONATE: $X:
#[url]
#You can also choose from the amounts below.
#--QUICK DONATE: $X:
#[url]
#--QUICK DONATE: $X:
#[url]
#--QUICK DONATE: $X:
#[url]
#--QUICK DONATE: $X:
#[url]
#Or donate another amount:
#Adding in other variables to be considered
data2 = data1001[data1001$cluster_id %in% c("19545","19546","19548"),]
fit=rpart(cluster_id~donation_amount+statenum+age+educnum+incomenum+recent_donornum, data=data2, cp=0.0001)
plot(fit, main="Donation Signal, State, Age, Educ, Income (3 clusters)")
text(fit,use.n=T)
# Adding in fourth cluster
data2 = data1001[data1001$cluster_id %in% c("19545","19546","19547","19548"),]
fit=rpart(cluster_id~donation_amount+statenum+age, data=data2, cp=0.0001)
par(xpd=TRUE)
plot(fit, main="Donation Signal, State, Age (4 clusters)")
text(fit,use.n=T, minlength=3)
#Adding in the fourth cluster and keeping the original three variables changes
#the donation split to around $204. If greater, then splits by states
#(Colorado, Illinois, Massachusetts and Vermont) and sorts between cluster
#19547 and 19548. If not greater, then get 19547 (the presumably control email).
#It misclassified 1 of the 89 that received 19547, 8 of the 18 that received 19548,
#and all 4 and all 5 of those that received 19546 and 19545 respectively.
Memo
To successfully recreate the data Larson was working with I merged all the separate data files he provided into one large data set. From there I looked at the specific mailings he pre-identified as being targeted. It is important to note that at this point we have already let in room for error. I am trusting his tests for which mailings have been targeted.
My intention was to use recursive partitioning in R to identify the splits in the data that made the less misclassification errors. Before even applying the rpart() function in R I needed to explore the data a little more.
I selected mailing 1001 and saw that there were 9 clusters or versions of the mailing by creating a new data frame (versions1001) that only contained the clusters with that particular mailing id number. I then subsetted from the large data frame the users who received the particular mailing and created a new data frame. I also subsetted from the user_dollar_amount data set for the specific campaign id, again creating a new data frame. Finally I merged all these subsetted data frames into one large data frame called data1001. This held all the data for the specific mailing 1001.
From there I was able to pull out the different versions or clusters of the mailings and look at them individually to see what demographics some of the users (receivers) might have in common in an attempt to identify targeting. Four of the nine clusters only had 1 receiver so for these I compared the user with other people from the same state. My intuition here was that if everyone else from the state was receiving the same person there would be some distinguishing factor about this one particular person to identify them as receiving a separate mailings. One particular version was received by 89 people. Larson assumed this might be the more generic mailing or the control mailing.
Some determining factors I found among the mailing clusters (things that the people had in common) were household income levels, education levels, recent donorship. Larson identified donation amount (the amount the mailings suggested to be donated), age, and state.
Now that I understood what were some of the factors that could have been used for targeting, I knew which variables I wanted to test with rpart(). Before doing this though, I needed to transform some of the data by pulling it and then putting it back in as new columns. Firstly the data did not come with age but with date of births. For this, I used strptime and subtracted the birth dates from the present date, which gave me the age in days. To get years, divided by 365 and truncated.
The states were assigned numbers based on alphabetical order. Note that not all states were in the complete data set (only 48 were). I did the same with the education, income and recent donorship variables paying attention to the order of the levels (going from lowest to highest). The cluster_ids also needed to be transformed into factors for the rpart().
I then played around with including different variables into the rpart function. As Larson did I created a separated data frame that excluded 5 of the clusters that only were received by 1 person. I changed between using the 4 other clusters and just using 3 of those clusters (taking out the cluster with 89 receivers) to see the differences.
Larson ended up with starting at a donation signal split of greater $69, then splitting if the state was Massachusetts. If the donation signal was less than $69, it then asked to split by age 44, it then split again on donation signal of $64.
When I used rpart(), excluding the same clusters and using the same variables as Larson (age, state and donation signal): I get a split on donation amount of $44.67. My decision tree in this respect is accurate in placing 18 has having received cluster 19548 where they were all asked donation amounts greater than about $50. It makes sense based on the text of the email as well—they were all previous donors with saved payment information and had about 5 suggested amounts in the email. Adding in other variables such as educ or income doesn’t change anything.
Adding in the fourth cluster and keeping the original three variables changes the donation split to around $204. If greater, then splits by states (Colorado, Illinois, Massachusetts and Vermont) and sorts between cluster 19547 and 19548. If not greater, then get 19547 (the presumably control email). It misclassified 1 of the 89 that received 19547, 8 of the 18 that received 19548, and all 4 and all 5 of those that received 19546 and 19545 respectively.
My experiment wasn’t a perfect match to Larson’s. Presumably there was some other cleaning up that he did and other factors that he looked at. However, my goal was to use the knowledge I gained accumulated over the semester and using this assignment/example provided me with the context to do just that. I was able to read in data from csv files, to merge data frames, to subset, to tabulate, to pull data from data frames, to create variables, to create new data frames, to transform character strings into factor strings, to use the strptime function, to rename columns, to use the rpart function, to plot, to manipulate the plot (by adding labels etc.)
Had I had a little more time I would have liked to apply this recreation to other mailings and look at what factors might have determined targeting there, and see if the rpart splitting made sense with the text in the emails. For mailing 1001, the donation amount asked for in the body of the text certainly made a difference because the campaigns wouldn’t be asking everyone to donate over $200. Many of those who were asked to donate large amounts had high incomes, education levels, and had recently donated.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment