## race vs age, non-whites as stacked bar, ACS 2016 PUMS data.pdf

      
## z_NOTES.txt
Computing then plotting race vs. age counts from ACS data
Brendan O'Connor - 2018-02-07 (http://brenocon.com, https://gist.github.com/brendano/bf094c4044adccec429628eb2f0194da)

PUMS-CSV    2016 ACS 1-year Public Use Microdata Samples (PUMS) - CSV format
2016 ACS 1-year estimates
... using "United States Population Records" link at:
https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_pums_csv_2016&prodType=document
or from
https://www2.census.gov/programs-surveys/acs/data/pums/2016/1-Year/

Information
https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.2016.html
https://www2.census.gov/programs-surveys/acs/tech_docs/pums/ACS2016_PUMS_README.pdf
https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMSDataDict16.txt

=====
hobbes:~/census/pums % wget -O 2016_acs_1yr_csv.zip 'https://www2.census.gov/programs-surveys/acs/data/pums/2016/1-Year/csv_pus.zip'


There are two .csv files with the same header.  I think they're just two halves
of the sample, intended to be concatenated.  Summing the weight column (7,
PWGTP) yields 323,127,515, which sounds like the total U.S. population (I can't
find a live factfinder page with this info for the 1-year 2016, but the 1-year
2017 is 325,719,178).  With 3,156,487 records, that is about one record per
100 people, so this looks like a 1% sample.
hobbes:~/census/pums % pv ss16pus*.csv | awk -F, '{print $7}' | grep -v PWGTP|awk '{s+= $1; n+=1} END{print s,n,n/s}'
2.79GB 0:00:13 [ 217MB/s] [============================================>] 100%
323127515 3156487 0.00976855
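
The last number printed above is the record count divided by the summed weights. A minimal R check of that arithmetic, using only the two totals from the awk output; the variable names are illustrative and not part of the original notes.

```r
# Totals copied from the awk output above.
weight_sum <- 323127515   # sum of PWGTP over all person records
n_records  <- 3156487     # number of person records

sampling_fraction <- n_records / weight_sum   # share of the population sampled
mean_weight       <- weight_sum / n_records   # people represented per record, on average

print(sampling_fraction)
print(mean_weight)
```

The sampling fraction comes out just under 0.01, and each record stands in for roughly 102 people on average.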


===============
library(readr)
library(questionr)

# first step is slow, loading a few million records
d1=read_csv("ss16pusa.csv");d2=read_csv("ss16pusb.csv")
d=rbind(d1,d2)

# see RAC1P and HISP descriptions in
# https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMSDataDict16.txt

rstr=c('wh','aa','amin','amin','amin','asn','asn','other','multi')[d$RAC1P]
r2=ifelse(d$HISP != "01","hisp", sprintf("nh_%s",rstr))

tt=wtd.table(r2,d$AGEP,weights=as.numeric(d$PWGTP)); tt=tt[order(-rowSums(tt)),]; tt
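
To make the recode and the weighted table concrete, here is a small sketch on made-up records. It assumes base R's xtabs as a stand-in for questionr's wtd.table, and the toy data frame and its values are hypothetical.

```r
# Toy person records; the values are made up for illustration only.
toy <- data.frame(
  RAC1P = c(1, 1, 2, 6, 9, 8),
  HISP  = c("01", "02", "01", "01", "01", "24"),
  AGEP  = c(34, 34, 7, 7, 61, 34),
  PWGTP = c(100, 80, 120, 90, 110, 95),
  stringsAsFactors = FALSE
)

# Same recode as above: RAC1P codes 1-9 index into a vector of short labels.
rstr <- c('wh','aa','amin','amin','amin','asn','asn','other','multi')[toy$RAC1P]
# HISP "01" means not Hispanic; any other code counts as Hispanic, whatever the race code.
toy$r2 <- ifelse(toy$HISP != "01", "hisp", sprintf("nh_%s", rstr))

# Weighted cross-tab: each cell is the sum of PWGTP, not a count of rows.
tt <- xtabs(PWGTP ~ r2 + AGEP, data = toy)
print(tt)
```

Two of the toy records land in the hisp row at age 34, so that cell holds the sum of their weights rather than the number 2.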

# write_csv needs a data frame; as.data.frame gives one row per (race, age) cell
write_csv(as.data.frame(tt),"race_vs_age.csv")
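
The later steps read columns named Var1, Var2 and Freq back from the CSV. A small sketch of where those names plausibly come from, assuming the table is flattened with base R's as.data.frame before it is written; the matrix m and its counts are made up.

```r
# Made-up 2 x 2 weighted table: rows are race groups, columns are ages.
m <- matrix(c(100, 175, 120, 0), nrow = 2,
            dimnames = list(c("nh_wh", "hisp"), c("7", "34")))

# Flatten to long form: one row per (row label, column label) cell.
long <- as.data.frame(as.table(m))
print(long)
```

After a round trip through CSV, the Var2 column reads back as numbers, which is what cut() needs for the age binning.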


====================

> library(dplyr); library(tidyr); library(ggplot2)
> d=read.csv("race_vs_age.csv")

> x=d %>% group_by(Var1,wh_vs_nonwh=ifelse(Var1=="nh_wh","nh_wh","not_nh_wh"),agebin=cut(d$Var2,breaks=c(-1,seq(9,109,10)))) %>% summarise(n=sum(Freq))

> levels(x$agebin)=c("<10",sprintf("%s0-%s9",1:10,1:10))
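
A short sketch of how these breaks and labels line up, run on a handful of made-up ages. The vector ages is hypothetical; the cut and levels calls mirror the ones above.

```r
ages <- c(0, 9, 10, 19, 20, 64, 99, 100)

# Breaks at -1, 9, 19, ..., 109 give right-closed bins (-1,9], (9,19], ..., (99,109].
agebin <- cut(ages, breaks = c(-1, seq(9, 109, 10)))

# Relabel the 11 bins: "<10", then "10-19" through "100-109".
levels(agebin) <- c("<10", sprintf("%s0-%s9", 1:10, 1:10))

print(data.frame(ages, agebin))
```

Starting the breaks at -1 is what puts age 0 inside the first bin, since cut() excludes the left endpoint of each interval by default.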

# every Var1 code needs a name here; as.character() makes the lookup go by name rather than by factor level position
> nicenames= c(hisp="Hispanic / Latino", nh_aa="Black / African American (NH)", nh_amin="American Indian, Alaskan Native (NH)", nh_asn="Asian, Native Hawaiian, Pacific Islander (NH)", nh_multi="Two or More Races (NH)", nh_other="Other Race (NH)", nh_wh="White (NH)")

# order levels by prevalence since that governs ggplot's ordering
> x$race=nicenames[as.character(x$Var1)];x$race=factor(x$race,levels=(x %>% group_by(race) %>% summarise(n=sum(n)) %>% arrange(-n))$race)


> spread(x,agebin,n) %>% select(-race,-wh_vs_nonwh)
Adding missing grouping variables: `wh_vs_nonwh`
Source: local data frame [7 x 12]
Groups: Var1, wh_vs_nonwh [7]

# A tibble: 7 x 12
  wh_vs_nonwh     Var1    `<10`  `10-19`  `20-29`  `30-39`  `40-49`  `50-59`  `60-69`  `70-79` `80-89` `90-99`
*       <chr>   <fctr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>   <dbl>
1   not_nh_wh     hisp 10313144  9921612  9356556  8724654  7597972  5567980  3386720  1642538  723183  155392
2   not_nh_wh    nh_aa  4614624  4874924  5318127  4754134  4473783  4725170  3801549  1939053  857568  205388
3   not_nh_wh  nh_amin   288889   289642   318043   294449   285490   312398   261597   134297   55850   13078
4   not_nh_wh   nh_asn  2089665  2257305  2789453  2814462  2732702  2600474  2129257  1170944  563307  137442
5   not_nh_wh nh_multi  1541241  1402860  1213400  1021840   896614   890721   730321   390686  175053   41020
6   not_nh_wh nh_other   773277   812905   908492   864089   821845   946265   807962   440870  217319   59325
7       nh_wh    nh_wh 20616586 22513242 24989022 24114557 24087952 28421060 25374920 14569237 7138771 1853248

> y=spread(x,agebin,n) %>% select(-race,-wh_vs_nonwh)
> z=y[,3:ncol(y)]
> p=z[7,]/colSums(z)

Share of each age group that is not non-Hispanic white (row 1) vs. non-Hispanic white (row 2), as proportions.
> rbind(1-p,p)
     <10  10-19  20-29  30-39 40-49  50-59  60-69  70-79  80-89  90-99
1 0.4876 0.4649 0.4434 0.4338 0.411 0.3461 0.3047 0.2819 0.2664 0.2481
2 0.5124 0.5351 0.5566 0.5662 0.589 0.6539 0.6953 0.7181 0.7336 0.7519

Ratio of the non-Hispanic white population to everyone else, per age group.
> p/(1-p)
    <10 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89 90-99
1 1.051 1.151 1.255 1.305 1.433 1.889 2.282 2.548 2.754  3.03
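
As a worked check of the two computations above, here is the arithmetic for the "<10" bin alone, using the counts from the spread table. The variable names are illustrative.

```r
# Counts for the "<10" age bin, copied from the table above.
under10 <- c(hisp = 10313144, nh_aa = 4614624, nh_amin = 288889,
             nh_asn = 2089665, nh_multi = 1541241, nh_other = 773277,
             nh_wh = 20616586)

p     <- under10[["nh_wh"]] / sum(under10)   # non-Hispanic white share of the bin
ratio <- p / (1 - p)                         # non-Hispanic whites per one other person

print(round(c(nonwhite = 1 - p, white = p, ratio = ratio), 4))
```

The white share rounds to 0.5124 and the ratio to 1.051, matching the first column of both tables.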


> ggplot(x)+theme_grey()+
    geom_bar(aes(wh_vs_nonwh,n,fill=race),position='stack',stat='identity')+
    facet_wrap(~agebin,nrow=1,strip.position='bottom')+
    theme(panel.spacing=unit(0,"lines"))+
    theme(axis.ticks.x=element_blank(),axis.text.x=element_blank())+
    ylab("Number of people")+xlab("Age")+
    ggtitle("Number of people by race and age, ACS 2016 PUMS data")+
    scale_y_continuous(breaks=b<-seq(0,30e6,5e6),labels=sprintf("%.0f M",b/1e6))