Computing then plotting race vs. age counts from ACS data
Brendan O'Connor - 2018-02-07 (http://brenocon.com, https://gist.github.com/brendano/bf094c4044adccec429628eb2f0194da)
PUMS-CSV 2016 ACS 1-year Public Use Microdata Samples (PUMS) - CSV format
2016 ACS 1-year estimates
... using "United States Population Records" link at:
https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_pums_csv_2016&prodType=document
or from
https://www2.census.gov/programs-surveys/acs/data/pums/2016/1-Year/
Information
https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.2016.html
https://www2.census.gov/programs-surveys/acs/tech_docs/pums/ACS2016_PUMS_README.pdf
https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMSDataDict16.txt
=====
hobbes:~/census/pums % wget -O 2016_acs_1yr_csv.zip 'https://www2.census.gov/programs-surveys/acs/data/pums/2016/1-Year/csv_pus.zip'
There are two .csv files with the same header. I think they're just two halves
of the sample, meant to be concatenated. Summing the person-weight column PWGTP
(column 7) yields 323,127,515, which sounds like the total U.S. population (I
can't find a live factfinder page with this info for the 1-year 2016, but the
1-year 2017 is 325,719,178). The record count divided by the weight sum is
about 0.0098, so this looks like a 1% sample.
hobbes:~/census/pums % pv ss16pus*.csv | awk -F, '{print $7}' | grep -v PWGTP|awk '{s+= $1; n+=1} END{print s,n,n/s}'
2.79GB 0:00:13 [ 217MB/s] [============================================>] 100%
323127515 3156487 0.00976855
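As a sanity check on that arithmetic, here is a minimal R sketch that recomputes the sampling fraction from the two numbers the awk one-liner printed. The variable names are made up for illustration; the mean weight is simply the inverse of the fraction, i.e. roughly how many people each record stands for.

```r
# Figures printed by the awk one-liner above
weight_sum <- 323127515  # sum of PWGTP over all person records
n_records  <- 3156487    # number of person records in both files

# Fraction of the estimated population that appears as a record
sampling_fraction <- n_records / weight_sum

# Average number of people represented by one record
mean_weight <- weight_sum / n_records

cat(sprintf("sampling fraction: %.5f\n", sampling_fraction))
cat(sprintf("mean weight: %.1f\n", mean_weight))
```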
===============
library(readr)
library(questionr)
# first step is slow, loading a few million records
d1=read_csv("ss16pusa.csv");d2=read_csv("ss16pusb.csv")
d=rbind(d1,d2)
# see RAC1P and HISP descriptions in
# https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMSDataDict16.txt
rstr=c('wh','aa','amin','amin','amin','asn','asn','other','multi')[d$RAC1P]
r2=ifelse(d$HISP != "01","hisp", sprintf("nh_%s",rstr))
tt=wtd.table(r2,d$AGEP,weights=as.numeric(d$PWGTP)); tt=tt[order(-rowSums(tt)),]; tt
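# write_csv wants a data frame; as.data.frame() on the table gives long format (Var1, Var2, Freq)
write_csv(as.data.frame(tt),"race_vs_age.csv")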
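To make the recoding and weighting steps concrete, here is a small self-contained sketch on made-up records (the toy values are invented, not ACS data). It swaps in base R's xtabs() for questionr's wtd.table() so that it runs without extra packages, on the assumption that the two agree for a plain weighted two-way count like this one.

```r
# Toy person records (made-up values, not ACS data)
toy <- data.frame(
  RAC1P = c(1, 2, 6, 9, 1, 3),
  HISP  = c("01", "01", "01", "01", "02", "01"),
  AGEP  = c(34, 34, 8, 8, 34, 70),
  PWGTP = c(100, 50, 80, 20, 60, 10),
  stringsAsFactors = FALSE
)

# Same lookup as above: position i holds the short label for RAC1P code i
race_lookup <- c("wh", "aa", "amin", "amin", "amin", "asn", "asn", "other", "multi")

# Anyone with HISP other than "01" counts as Hispanic, whatever their race code
toy$group <- ifelse(toy$HISP != "01", "hisp",
                    paste0("nh_", race_lookup[toy$RAC1P]))

# Weighted two-way table: each cell is the sum of PWGTP, not a row count
weighted <- xtabs(PWGTP ~ group + AGEP, data = toy)
print(weighted)

# Long format, one row per (group, age) cell
print(as.data.frame(weighted))
```

Note that the record with RAC1P 1 and HISP "02" ends up in "hisp", not "nh_wh", which is the point of checking HISP first.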
====================
> library(dplyr); library(tidyr); library(ggplot2)
> d=read.csv("race_vs_age.csv")
> x=d %>% group_by(Var1,wh_vs_nonwh=ifelse(Var1=="nh_wh","nh_wh","not_nh_wh"),agebin=cut(d$Var2,breaks=c(-1,seq(9,109,10)))) %>% summarise(n=sum(Freq))
> levels(x$agebin)=c("<10",sprintf("%s0-%s9",1:10,1:10))
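A quick way to see what the cut() call and the relabeling do, on a handful of made-up ages. The intervals are closed on the right (cut's default), so 9 falls in "<10" and 10 falls in "10-19"; the eleven breaks-intervals line up with the eleven labels.

```r
# Made-up ages that sit on and around the bin edges
ages <- c(0, 9, 10, 19, 20, 99, 100)

# Breaks at -1, 9, 19, ..., 109 give intervals (-1,9], (9,19], ..., (99,109]
agebin <- cut(ages, breaks = c(-1, seq(9, 109, 10)))

# Replace the interval labels with readable ones, in the same order
levels(agebin) <- c("<10", sprintf("%s0-%s9", 1:10, 1:10))

print(data.frame(age = ages, agebin = agebin))
```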
> nicenames= c(hisp="Hispanic / Latino", nh_aa="Black / African American (NH)", nh_amin="American Indian, Alaskan Native (NH)", nh_asn="Asian, Native Hawaiian, Pacific Islander (NH)", nh_other="Other Race (NH)", nh_multi="Two or More Races (NH)", nh_wh="White (NH)")
# look up by name via as.character(): indexing with the factor itself uses its integer codes, which would swap the nh_multi and nh_other labels
# order levels by prevalence since that governs ggplot's ordering
> x$race=nicenames[as.character(x$Var1)];x$race=factor(x$race,levels=(x %>% group_by(race) %>% summarise(n=sum(n)) %>% arrange(-n))$race)
> spread(x,agebin,n) %>% select(-race,-wh_vs_nonwh)
Adding missing grouping variables: `wh_vs_nonwh`
Source: local data frame [7 x 12]
Groups: Var1, wh_vs_nonwh [7]
# A tibble: 7 x 12
  wh_vs_nonwh Var1        `<10`  `10-19`  `20-29`  `30-39`  `40-49`  `50-59`  `60-69`  `70-79`  `80-89`  `90-99`
* <chr>       <fctr>      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 not_nh_wh   hisp     10313144  9921612  9356556  8724654  7597972  5567980  3386720  1642538   723183   155392
2 not_nh_wh   nh_aa     4614624  4874924  5318127  4754134  4473783  4725170  3801549  1939053   857568   205388
3 not_nh_wh   nh_amin    288889   289642   318043   294449   285490   312398   261597   134297    55850    13078
4 not_nh_wh   nh_asn    2089665  2257305  2789453  2814462  2732702  2600474  2129257  1170944   563307   137442
5 not_nh_wh   nh_multi  1541241  1402860  1213400  1021840   896614   890721   730321   390686   175053    41020
6 not_nh_wh   nh_other   773277   812905   908492   864089   821845   946265   807962   440870   217319    59325
7 nh_wh       nh_wh    20616586 22513242 24989022 24114557 24087952 28421060 25374920 14569237  7138771  1853248
> y=spread(x,agebin,n) %>% select(-race,-wh_vs_nonwh)
> z=y[,3:ncol(y)]
> p=z[7,]/colSums(z)
Share of each age group that is non-Hispanic white (row 2) versus everyone else (row 1), as proportions.
> rbind(1-p,p)
     <10  10-19  20-29  30-39  40-49  50-59  60-69  70-79  80-89  90-99
1 0.4876 0.4649 0.4434 0.4338  0.411 0.3461 0.3047 0.2819 0.2664 0.2481
2 0.5124 0.5351 0.5566 0.5662  0.589 0.6539 0.6953 0.7181 0.7336 0.7519
Ratio of the non-Hispanic white population to everyone else, per age group (1.0 would mean equal sizes).
> p/(1-p)
    <10 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89 90-99
1 1.051 1.151 1.255 1.305 1.433 1.889 2.282 2.548 2.754  3.03
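As a worked check of the two ratios, this sketch recomputes them for the "<10" bin only, with the counts copied from the wide table above. It is a standalone illustration with its own variable names, not part of the session.

```r
# Weighted counts for the "<10" age bin, copied from the wide table above
under10 <- c(hisp = 10313144, nh_aa = 4614624, nh_amin = 288889,
             nh_asn = 2089665, nh_multi = 1541241, nh_other = 773277,
             nh_wh = 20616586)

white    <- under10[["nh_wh"]]
nonwhite <- sum(under10) - white

# Share of the age group that is non-Hispanic white
share_white <- white / sum(under10)

# Odds form: same thing as white / nonwhite
ratio <- share_white / (1 - share_white)

print(round(c(share_white = share_white, ratio = ratio), 4))
# share_white rounds to 0.5124 and ratio to 1.051, matching the first column above
```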
> b=seq(0,30e6,5e6)
> ggplot(x)+theme_grey()+
    geom_bar(aes(wh_vs_nonwh,n,fill=race),position='stack',stat='identity')+
    facet_wrap(~agebin,nrow=1,strip.position='bottom')+
    theme(panel.spacing=unit(0,"lines"),axis.ticks.x=element_blank(),axis.text.x=element_blank())+
    ylab("Number of people")+xlab("Age")+
    ggtitle("Number of people by race and age, ACS 2016 PUMS data")+
    scale_y_continuous(breaks=b,labels=sprintf("%.0f M",b/1e6))