Computing then plotting race vs. age counts from ACS data
Brendan O'Connor - 2018-02-07 (http://brenocon.com, https://gist.github.com/brendano/bf094c4044adccec429628eb2f0194da)
Data: 2016 ACS 1-year Public Use Microdata Sample (PUMS), CSV format,
using the "United States Population Records" link at:
https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_pums_csv_2016&prodType=document
or from
https://www2.census.gov/programs-surveys/acs/data/pums/2016/1-Year/
Information
https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.2016.html
https://www2.census.gov/programs-surveys/acs/tech_docs/pums/ACS2016_PUMS_README.pdf
https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMSDataDict16.txt
=====
hobbes:~/census/pums % wget -O 2016_acs_1yr_csv.zip 'https://www2.census.gov/programs-surveys/acs/data/pums/2016/1-Year/csv_pus.zip'
There are two .csv files with the same header. I think they're just two halves
of the sample, intended to be concatenated. Summing the person-weight column
PWGTP (column 7) yields 323,127,515, which sounds like the total U.S.
population (I can't find a live factfinder page with this info for the 1-year
2016, but the 1-year 2017 is 325,719,178). With 3,156,487 records, that is
about 0.98 records per 100 people, so this looks like a 1% sample.
hobbes:~/census/pums % pv ss16pus*.csv | awk -F, '{print $7}' | grep -v PWGTP|awk '{s+= $1; n+=1} END{print s,n,n/s}'
2.79GB 0:00:13 [ 217MB/s] [============================================>] 100%
323127515 3156487 0.00976855
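Before concatenating, the "same header" assumption can be checked directly. A minimal sketch: same_header is a hypothetical helper (not part of the original session), and the toy files with an abbreviated, made-up header stand in for ss16pusa.csv and ss16pusb.csv.

```shell
# Hypothetical helper: succeeds only if the first lines of both files are identical.
same_header() {
  [ "$(head -n 1 "$1")" = "$(head -n 1 "$2")" ]
}

# Toy stand-ins for the two PUMS halves (made-up rows, not ACS data).
tmp=$(mktemp -d)
printf 'RT,SERIALNO,PWGTP\nP,1,100\n' > "$tmp/a.csv"
printf 'RT,SERIALNO,PWGTP\nP,2,90\n'  > "$tmp/b.csv"

if same_header "$tmp/a.csv" "$tmp/b.csv"; then
  echo "headers match"
else
  echo "headers differ"
fi
```

On the real download this would be called as: same_header ss16pusa.csv ss16pusb.csv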
===============
library(readr)
library(questionr)
# first step is slow, loading a few million records
d1=read_csv("ss16pusa.csv");d2=read_csv("ss16pusb.csv")
d=rbind(d1,d2)
# see RAC1P and HISP descriptions in
# https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMSDataDict16.txt
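# RAC1P codes 1..9 index into this vector: 3-5 are the American Indian /
# Alaska Native codes, 6-7 are Asian and Native Hawaiian / Pacific Islander
rstr=c('wh','aa','amin','amin','amin','asn','asn','other','multi')[d$RAC1P]
# HISP "01" means not Hispanic; anyone else is counted as Hispanic regardless of race
r2=ifelse(d$HISP != "01","hisp", sprintf("nh_%s",rstr))
# weighted race x age counts, rows sorted by total population
tt=wtd.table(r2,d$AGEP,weights=as.numeric(d$PWGTP)); tt=tt[order(-rowSums(tt)),]; tt
# write_csv needs a data frame, so convert the table to long format first
# (columns Var1, Var2, Freq, which the next step reads back)
write_csv(as.data.frame(as.table(tt)),"race_vs_age.csv")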
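As I understand it, the weighted table is just the sum of the person weights within each (group, age) cell, so each record counts for as many people as its weight says. A small sketch of that idea on toy data; weighted_counts is a hypothetical helper and the three-column input (group, age, weight) is made up, not ACS data.

```shell
# Hypothetical helper: sum the weight column within each (group, age) cell.
# Input on stdin: lines of group,age,weight
weighted_counts() {
  awk -F, '{ cell[$1 "," $2] += $3 }
           END { for (k in cell) print k "," cell[k] }' | LC_ALL=C sort
}

# Two hisp records aged 5 collapse into one cell with weight 100 + 50.
printf '%s\n' 'hisp,5,100' 'nh_wh,5,80' 'hisp,5,50' 'nh_wh,40,120' | weighted_counts
# prints:
# hisp,5,150
# nh_wh,40,120
# nh_wh,5,80
```

The sort is only there to make the output order stable, since awk does not guarantee an iteration order for arrays.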
====================
> library(dplyr); library(tidyr); library(ggplot2)
> d=read.csv("race_vs_age.csv")
> x=d %>% group_by(Var1,wh_vs_nonwh=ifelse(Var1=="nh_wh","nh_wh","not_nh_wh"),agebin=cut(Var2,breaks=c(-1,seq(9,109,10)))) %>% summarise(n=sum(Freq))
> levels(x$agebin)=c("<10",sprintf("%s0-%s9",1:10,1:10))
# every entry is named and looked up by group code; indexing by the factor
# itself would go by position and swap the "other" and "multi" labels
> nicenames= c(hisp="Hispanic / Latino", nh_aa="Black / African American (NH)", nh_amin="American Indian, Alaskan Native (NH)",nh_asn="Asian, Native Hawaiian, Pacific Islander (NH)",nh_other="Other Race (NH)", nh_multi="Two or More Races (NH)",nh_wh="White (NH)")
# order levels by prevalence since that governs ggplot's ordering
> x$race=unname(nicenames[as.character(x$Var1)]);x$race=factor(x$race,levels=(x %>% group_by(race) %>% summarise(n=sum(n)) %>% arrange(-n))$race)
> spread(x,agebin,n) %>% select(-race,-wh_vs_nonwh)
Adding missing grouping variables: `wh_vs_nonwh`
Source: local data frame [7 x 12]
Groups: Var1, wh_vs_nonwh [7]
# A tibble: 7 x 12
  wh_vs_nonwh Var1         `<10`   `10-19`   `20-29`   `30-39`   `40-49`   `50-59`   `60-69`   `70-79`   `80-89`   `90-99`
* <chr>       <fctr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1 not_nh_wh   hisp      10313144   9921612   9356556   8724654   7597972   5567980   3386720   1642538    723183    155392
2 not_nh_wh   nh_aa      4614624   4874924   5318127   4754134   4473783   4725170   3801549   1939053    857568    205388
3 not_nh_wh   nh_amin     288889    289642    318043    294449    285490    312398    261597    134297     55850     13078
4 not_nh_wh   nh_asn     2089665   2257305   2789453   2814462   2732702   2600474   2129257   1170944    563307    137442
5 not_nh_wh   nh_multi   1541241   1402860   1213400   1021840    896614    890721    730321    390686    175053     41020
6 not_nh_wh   nh_other    773277    812905    908492    864089    821845    946265    807962    440870    217319     59325
7 nh_wh       nh_wh     20616586  22513242  24989022  24114557  24087952  28421060  25374920  14569237   7138771   1853248
> y=spread(x,agebin,n) %>% select(-race,-wh_vs_nonwh)
> z=y[,3:ncol(y)]
# nh_wh is row 7 here; select it by name rather than by position
> p=z[y$Var1=="nh_wh",]/colSums(z)
Non-Hispanic white (row 2) vs. everyone else (row 1), as a fraction of each age group:
> rbind(1-p,p)
     <10  10-19  20-29  30-39  40-49  50-59  60-69  70-79  80-89  90-99
1 0.4876 0.4649 0.4434 0.4338  0.411 0.3461 0.3047 0.2819 0.2664 0.2481
2 0.5124 0.5351 0.5566 0.5662  0.589 0.6539 0.6953 0.7181 0.7336 0.7519
Ratio of the non-Hispanic white population to everyone else, per age group:
> p/(1-p)
    <10 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89 90-99
1 1.051 1.151 1.255 1.305 1.433 1.889 2.282 2.548 2.754  3.03
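The "<10" column of both results can be reproduced by hand from the counts in the table: the share is the nh_wh count over the column total, and the ratio is the nh_wh count over everyone else. A minimal sketch, where share_and_ratio is a hypothetical helper and the counts are copied from the "<10" column of the table.

```shell
# Hypothetical helper: given "group count" lines on stdin and a reference
# group name as $1, print the reference group's share of the total and
# its ratio to all other groups combined.
share_and_ratio() {
  awk -v ref="$1" '{ total += $2; if ($1 == ref) w = $2 }
       END { printf "%.4f %.3f\n", w / total, w / (total - w) }'
}

# Counts for the "<10" age bin, copied from the table.
printf '%s\n' \
  'hisp 10313144' 'nh_aa 4614624' 'nh_amin 288889' 'nh_asn 2089665' \
  'nh_multi 1541241' 'nh_other 773277' 'nh_wh 20616586' \
  | share_and_ratio nh_wh
# prints: 0.5124 1.051
```

Here the column total is 40,237,426, of which 20,616,586 are nh_wh and 19,620,840 are everyone else.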
# define the y-axis breaks first so that breaks= and labels= use the same vector
> b=seq(0,30e6,5e6)
> ggplot(x)+theme_grey()+
    geom_bar(aes(wh_vs_nonwh,n,fill=race),position='stack',stat='identity')+
    facet_wrap(~agebin,nrow=1,strip.position='bottom')+
    theme(panel.spacing=unit(0,"lines"))+
    theme(axis.ticks.x=element_blank(),axis.text.x=element_blank())+
    ylab("Number of people")+xlab("Age")+
    ggtitle("Number of people by race and age, ACS 2016 PUMS data")+
    scale_y_continuous(breaks=b,labels=sprintf("%.0f M",b/1e6))