@briatte
Last active January 2, 2019 02:45
A few snippets to quickly get a selection of social science datasets into R.

Here are little chunks of R code to quickly access survey datasets. Some of the linked datasets are provided on an "as-is" basis: typically, you might want to check the official ANES files rather than rely on the extracts linked to below. This also applies to data extracts bundled in packages.

HOWTO

The code occasionally calls the downloader and foreign packages, the former to get files from HTTPS sources and the latter to read foreign formats (Stata, SPSS). The Gelman and Hill replication also uses plyr and ggplot2; see the original code for equivalent functions written in base R.
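
Most snippets below follow the same pattern: download a file only if no local copy exists, then read it. A minimal sketch of that pattern in base R follows. The helper name `cached_download` and its `fetch` argument are hypothetical and not part of the original snippets. It assumes a recent version of R, where `download.file` can handle HTTPS sources on its own.

```r
# Hypothetical helper (not part of the original snippets): fetch a file once,
# then reuse the local copy on later runs.
cached_download <- function(url, destfile, fetch = utils::download.file) {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps binary files (zip, dta, sav) intact on Windows
    fetch(url, destfile, mode = "wb")
  }
  # return the local path so the call can be passed straight to a reader
  destfile
}

# Usage, with the ANES 1992 extract linked further down (not run here):
# path <- cached_download(
#   "https://raw.github.com/MichaelMBishop/Coefficient-Plot-Driscoll/master/anes1992.csv",
#   "anes1992.csv")
# anes1992 <- read.csv(path)
```

The `fetch` argument only exists so that another downloader, such as `downloader::download`, can be swapped in where needed.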

SEEALSO

Stata survey code on Github:

There's also a Python example with NSFG data in Think Stats.

TODO

  • Same list for country-level data, using the onlineData tag at CRANtastic (FAOSTAT, Quandl, rdatamarket, WDI, etc.).
library(foreign)
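data = "anes.1948.2002.rda"
if(!file.exists(data)) {
  # convert.factors = FALSE keeps the numeric codes that the cleaning code below does arithmetic on
  brdata = read.dta("http://www.stat.columbia.edu/~gelman/arm/examples/nes/nes5200_processed_voters_realideo.dta", convert.factors = FALSE)
  save(brdata, file = data)
}
load(data)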
## Gelman and Hill code for data cleaning
brdata <- brdata[is.na(brdata$black)==FALSE&is.na(brdata$female)==FALSE&is.na(brdata$educ1)==FALSE & is.na(brdata$age)==FALSE&is.na(brdata$income)==FALSE&is.na(brdata$state)==FALSE,]
kept_cases <- 1952:2000
matched_cases <- match(brdata$year, kept_cases)
keep <- !is.na(matched_cases)
data <- brdata[keep,]
plotyear <- unique(sort(data$year))
year_new <- match(data$year,unique(data$year))
n_year <- length(unique(data$year))
income_new <-data$income-3
age_new <- (data$age-mean(data$age))/10
y <- data$rep_pres_intent
data<-cbind(data, year_new, income_new, age_new, y)
nes_year <- data[,"year"]
age_discrete <- as.numeric (cut (data[,"age"], c(0,29.5, 44.5, 64.5, 200)))
race_adj <- ifelse (data[,"race"]>=3, 1.5, data[,"race"])
data <- cbind (data, age_discrete, race_adj)
female <- data[,"gender"] - 1
black <- ifelse (data[,"race"]==2, 1, 0)
rvote <- ifelse (data[,"presvote"]==1, 0, ifelse(data[,"presvote"]==2, 1, NA))
region_codes <- c(3,4,4,3,4,4,1,1,5,3,3,4,4,2,2,2,2,3,3,1,1,1,2,2,3,2,4,2,4,1,1,4,1,3,2,2,3,4,1,1,3,2,3,3,4,1,3,4,1,2,4)
## partyid model to illustrate secret weapon in chapter 4 (modified)
coef_names <- c("Intercept", "Ideology", "Black", "Age 30-44", "Age 45-64", "Age 65+", "Education", "Female", "Income")
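# regress_year() is not defined in this snippet and must exist before this call;
# it should return a matrix of estimates and standard errors for the given year
summary2 <- lapply(seq(1972, 2000, 4), function(yr) {
  i <- as.data.frame(regress_year(yr))
  i <- cbind(yr, coef_names, i)
  names(i) <- c("year", "x", "b", "se")
  i
})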
library(plyr)
summary2 <- rbind.fill(summary2)
summary2$x <- factor(summary2$x, levels = coef_names)
yrs <- seq(1972,2000,4)
coef_names <- c("Intercept", "Ideology", "Black", "Age.30.44", "Age.45.64", "Age.65.up", "Education", "Female", "Income")
library(ggplot2)
p <- ggplot(summary2, aes(x = year, y = b)) +
  geom_pointrange(aes(ymin = b - se, ymax = b + se), colour = "grey25") +
  # geom_hline() takes yintercept, not y
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey50") +
  scale_x_continuous(breaks = seq(1972,2000,4)) +
  facet_wrap(~ x, ncol = 3, scales = "free_y") +
  labs(x = NULL, y = NULL) +
  theme_bw(16) +
  theme(strip.background = element_rect(fill = NA, linetype = 0), legend.position = "right")
p + geom_smooth(method = "loess", span = 15, se = FALSE, fill = "grey75")

Data used to produce the 'secret weapon' plot in Chapter 4 of Gelman and Hill's Data Analysis Using Regression and Multilevel/Hierarchical Models (2007, pp. 73-74):

Gelman and Hill's 'secret weapon'

The data and code come from the online example folders at Andrew Gelman's website. See the arm package for related functions.
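
The plotting code above calls `regress_year()`, which this snippet does not define. The sketch below is an assumed reconstruction, not the authors' verbatim code. The model formula is inferred from the nine coefficient names (intercept, ideology, race, three age groups, education, gender, income), and the `df` and `years` arguments are added for illustration. Check it against the original example folder before relying on it.

```r
# Assumed reconstruction of the helper used by the 'secret weapon' loop.
# By default it relies on objects created by the data-cleaning code: `data`,
# with the added columns race_adj and age_discrete, and the vector `nes_year`.
regress_year <- function(yr, df = data, years = nes_year) {
  # restrict the sample to one survey year
  this_year <- df[years == yr, ]
  fit <- lm(partyid7 ~ real_ideo + race_adj + factor(age_discrete) +
              educ1 + gender + income, data = this_year)
  # keep only the estimates and standard errors (first two columns)
  coef(summary(fit))[, 1:2]
}
```

With four age groups, each call returns a matrix with nine rows and two columns, which matches the `coef_names` vector and the `b` and `se` columns used in the plot.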

> table(brdata$year)
1952 1956 1958 1960 1962 1964 1966 1968 1970 1972 1974 1976 1978 1980 1982 1984 1986 1988 1990 1992 1994 
1388 1386 1377 1025 1218 1292 1236 1121 1440 2168 1454 1653 1984 1065 1247 1660 1968 1493 1763 1724 1582 
1996 1998 2000 
1284 1196 1184 

> names(brdata)
 [1] "year"            "resid"           "weight1"         "weight2"         "weight3"        
 [6] "age"             "gender"          "race"            "educ1"           "urban"          
[11] "region"          "income"          "occup1"          "union"           "religion"       
[16] "educ2"           "educ3"           "martial_status"  "occup2"          "icpsr_cty"      
[21] "fips_cty"        "partyid7"        "partyid3"        "partyid3_b"      "str_partyid"    
[26] "father_party"    "mother_party"    "dlikes"          "rlikes"          "dem_therm"      
[31] "rep_therm"       "regis"           "vote"            "regisvote"       "presvote"       
[36] "presvote_2party" "presvote_intent" "ideo_feel"       "ideo7"           "ideo"           
[41] "cd"              "state"           "inter_pre"       "inter_post"      "black"          
[46] "female"          "age_sq"          "rep_presvote"    "rep_pres_intent" "south"          
[51] "real_ideo"       "presapprov"      "perfin1"         "perfin2"         "perfin"         
[56] "presadm"         "age_10"          "age_sq_10"       "newfathe"        "newmoth"        
[61] "parent_party"    "white" 
ANES <- read.csv("http://www.oberlin.edu/faculty/cdesante/assets/downloads/ANES.csv")

ANES 1948-2008:

> table(ANES$year)
1948 1952 1954 1956 1958 1960 1962 1964 1966 1968 1970 1972 1974 1976 1978 1980 1982 1984 
 662 1899 1139 1762 1450 1181 1297 1571 1291 1557 1507 2705 1575 2248 2304 1614 1418 2257 
1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2008 
2176 2040 1980 2485 1795 1714 1281 1807 1511 1212 2322 

> names(ANES)
 [1] "year"      "age"       "cohort"    "female"    "race6"     "religion"  "dems"     
 [8] "ftwelfare" "ftpoor"    "ftaliens"  "ftyoung"   "pid7"      "trust"     "ideo7"    
[15] "inerrant"  "south"     "dempres"

Year 1948 of the ANES is also bundled in the memisc package, in SPSS format:

library(memisc)
vignette(package = "memisc")
library(downloader)
link = "https://raw.github.com/MichaelMBishop/Coefficient-Plot-Driscoll/master/anes1992.csv"
file = "anes1992.txt"
if(!file.exists(file)) download(link, file, mode = "wb")
data = read.csv(file)

ANES 1992:

> names(data)
 [1] "X"                        "Year"                     "Turnout"                 
 [4] "Ideology"                 "No.Identification"        "Ideology.Folded"         
 [7] "Conservative"             "Liberal"                  "Liberal.Thermometer"     
[10] "Conservative.Thermometer" "Republican.Thermometer"   "Education"               
[13] "Party.ID"                 "PartyID.Folded"           "Household.Income"        
[16] "Age"                      "Female"                   "Race"                    
[19] "Black"                    "Region"                   "Union.Member"            
[22] "Church.Attendance"        "Married"                  "Social.Class"            

> nrow(data)
[1] 2485

Used in this script.

file = "esscumulative1_5.zip"
spss = "ESS1-5_cumulative_e01_1.sav"
data = "ESS1-5_cumulative_e01_1.Rda"
if(!file.exists(file))
  download.file("http://extweb3.nsd.uib.no/esscumulative1_5.zip", file, mode = "wb")
if(!file.exists(spss))
  unzip(file, spss)
if(!file.exists(data)) {
  library(foreign)
  ess <- read.spss(spss, to.data.frame = TRUE)
  save(ess, file = data)
}
# clear the workspace but keep the path that load() needs
rm(list = setdiff(ls(), "data"))
load(data)

This turns ESS survey waves 1-5 into an R data frame. The SPSS file is 309 MB (49 MB zipped). The final Rda file produced by the routine is 28.5 MB (self-compressed). Don't forget to select the appropriate survey weights.
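
As a rough illustration of the weighting step, the sketch below computes weighted means on a toy data frame. It assumes that the design weight and the population size weight are stored under the names `dweight` and `pweight`, as in the ESS documentation. The `happy_score` column and all values are made up. Check the ESS weighting guidance before applying this to the real file.

```r
# Toy stand-in for the `ess` data frame; the column names dweight and pweight
# are assumed to match the ESS weight variables, and happy_score is made up.
toy <- data.frame(
  cntry       = c("Austria", "Austria", "Belgium", "Belgium"),
  happy_score = c(6, 8, 7, 9),
  dweight     = c(1.0, 0.5, 1.2, 0.8),
  pweight     = c(0.3, 0.3, 0.4, 0.4)
)

# Within a single country, apply the design weight only
by_country <- sapply(split(toy, toy$cntry), function(d) {
  weighted.mean(d$happy_score, d$dweight)
})

# When pooling several countries, multiply design and population size weights
pooled <- weighted.mean(toy$happy_score, toy$dweight * toy$pweight)
```

These are point estimates only; standard errors that account for the survey design need a dedicated tool such as the survey package.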

> dim(ess)
[1] 237253    954

> table(ess$cntry)
       Austria        Belgium       Bulgaria    Switzerland 
          6918           8939           6064           9310 
        Cyprus Czech Republic        Germany        Denmark 
          3293           8790          14487           7684 
       Estonia          Spain        Finland         France 
          6960           9729           9991           9096 
United Kingdom         Greece        Croatia        Hungary 
         11117           9759           3133           7806 
       Ireland         Israel          Italy     Luxembourg 
         10472           7283           2736           3187 
   Netherlands         Norway         Poland       Portugal 
          9741           8643           8917          10302 
        Russia         Sweden       Slovenia       Slovakia 
          7544           9201           7126           6944 
        Turkey        Ukraine 
          4272           7809 
setwd("~/Downloads")
get_gss <- function(year = "all", save.rda = TRUE, keep.zip = FALSE, keep.dta = FALSE) {
  stopifnot(require(foreign))
  # the cumulative file is requested as GSS_stata.zip
  if (year == "all") year = "GSS"
  url = paste0("http://publicdata.norc.org/GSS/DOCUMENTS/OTHR/", year, "_stata.zip")
  zip = paste0(year, "_stata.zip")
  if (!file.exists(zip))
    download.file(url, zip, mode = "wb")
  # the cumulative archive unzips to GSS7212_R2.dta
  if (year == "GSS")
    year = "7212_R2"
  dta = paste0("GSS", year, ".dta")
  if (!file.exists(dta))
    unzip(zip)
  if (!keep.zip)
    file.remove(zip)
  gss = read.dta(dta)
  if (!keep.dta)
    file.remove(dta)
  rda = paste0("GSS", year, ".Rda")
  # return the Rda file name if saved, the data frame otherwise
  if (save.rda) {
    save(gss, file = rda)
    return(rda)
  } else {
    return(gss)
  }
}
# accepts any survey year 1972-2012
get_gss(2004)
gss2012 = get_gss(2012, keep.zip = TRUE, save.rda = FALSE)
# this takes some time to complete
get_gss(keep.zip = TRUE)

Converts any GSS Stata survey year dataset to R. Defaults to all survey years using the current cumulative file (Release 2, June 2013), which weighs 354 MB unzipped (30 MB zipped). The final Rda file produced by the routine on the cumulative file is 20 MB (self-compressed).

You can also use Hadley Wickham's extract from the GSS 1972–2006, which is available in two R packages:

# library(GGally)
# data(happy)

library(productplots)
data(happy)

> names(happy)
 [1] "id"      "happy"   "year"    "age"     "sex"     "marital" "degree"  "finrela" "health" 
[10] "wtssall"

> table(happy$year)
1972 1973 1974 1975 1976 1977 1978 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1993 
1613 1504 1484 1490 1499 1530 1532 1468 1860 1599 1473 1534 1470 1819 1481 1537 1372 1517 1606 
1994 1996 1998 2000 2002 2004 2006 
2992 2904 2832 2817 2765 2812 4510 

For an example analysis, see this repository of a study that was apparently bought by the Obama electoral campaign.

For GSS weight specifications with the survey package, see Anthony Damico's GSS scripts.
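
For quick descriptive checks, weighted shares can also be computed in base R. The sketch below uses a toy data frame with made-up values; only the column names `happy` and `wtssall` follow the `happy` extract shown above. It gives point estimates only and is not a substitute for a full survey design specification.

```r
# Toy stand-in for the `happy` extract; values are made up
toy <- data.frame(
  happy   = c("very happy", "pretty happy", "pretty happy", "not too happy"),
  wtssall = c(0.5, 1.0, 1.5, 1.0)
)

# Sum the weights within each answer category, then turn them into shares
weighted <- xtabs(wtssall ~ happy, data = toy)
shares <- prop.table(weighted)
```

Putting the weight variable on the left-hand side of the `xtabs` formula sums weights instead of counting rows, so respondents with larger weights contribute more to each share.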

library(questionr)
data(hdv2003)

An extract from the Histoire de vie survey conducted in 2003 by INSEE, the French national statistics office:

> names(hdv2003)
 [1] "id"            "age"           "sexe"         
 [4] "nivetud"       "poids"         "occup"        
 [7] "qualif"        "freres.soeurs" "clso"         
[10] "relig"         "trav.imp"      "trav.satisf"  
[13] "hard.rock"     "lecture.bd"    "peche.chasse" 
[16] "cuisine"       "bricol"        "cinema"       
[19] "sport"         "heures.tv"    

The extract is a sample of 2,000 observations and 20 variables by Julien Barnier and comes from his questionr package for survey data.
