brendano/race vs age, non-whites as stacked bar, ACS 2016 PUMS data.pdf

## race vs age, non-whites as stacked bar, ACS 2016 PUMS data.pdf

      
## z_NOTES.txt
Computing then plotting race vs. age counts from ACS data
Brendan O'Connor - 2018-02-07 (http://brenocon.com, https://gist.github.com/brendano/bf094c4044adccec429628eb2f0194da)

PUMS-CSV    2016 ACS 1-year Public Use Microdata Samples (PUMS) - CSV format
2016 ACS 1-year estimates
... using "United States Population Records" link at:
https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_pums_csv_2016&prodType=document
or from
https://www2.census.gov/programs-surveys/acs/data/pums/2016/1-Year/

Information
https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.2016.html
https://www2.census.gov/programs-surveys/acs/tech_docs/pums/ACS2016_PUMS_README.pdf
https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMSDataDict16.txt

=====
hobbes:~/census/pums % wget -O 2016_acs_1yr_csv.zip 'https://www2.census.gov/programs-surveys/acs/data/pums/2016/1-Year/csv_pus.zip'


There are two .csv files with the same header.  I think they're just two halves
of the sample, intended to be concatenated.  Summing the weight column (7, PWGTP)
yields 323,127,515, which sounds like the total U.S. population (I can't find a
live factfinder page with this info for the 1-year 2016, but the 1-year 2017 is
325,719,178).  With 3,156,487 records that is n/s = 0.0098, so this looks like a
1% sample:
hobbes:~/census/pums % pv ss16pus*.csv | awk -F, '{print $7}' | grep -v PWGTP|awk '{s+= $1; n+=1} END{print s,n,n/s}'
2.79GB 0:00:13 [ 217MB/s] [============================================>] 100%
323127515 3156487 0.00976855
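
The same check can be sketched in R. This is only an illustration: the four weights are made up, and only the last line uses the real totals printed above.

```r
# Toy person weights (made up); in the real files this is the PWGTP column.
pwgtp <- c(120, 95, 80, 105)

est_population    <- sum(pwgtp)     # each record stands for PWGTP people
n_records         <- length(pwgtp)
sampling_fraction <- n_records / est_population   # 4 / 400

# The real totals from the awk output above: records / summed weights.
real_fraction <- 3156487 / 323127515
```

A fraction near 0.01 is what a roughly 1% sample should give.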


===============
library(readr)
library(questionr)

# first step is slow, loading a few million records
d1=read_csv("ss16pusa.csv");d2=read_csv("ss16pusb.csv")
d=rbind(d1,d2)

# see RAC1P and HISP descriptions in
# https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMSDataDict16.txt

rstr=c('wh','aa','amin','amin','amin','asn','asn','other','multi')[d$RAC1P]
r2=ifelse(d$HISP != "01","hisp", sprintf("nh_%s",rstr))

tt=wtd.table(r2,d$AGEP,weights=as.numeric(d$PWGTP)); tt=tt[order(-rowSums(tt)),]; tt

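# write_csv() wants a data frame; as.data.frame() flattens the table to long
# format (one row per race/age cell), which is what the next step reads back
write_csv(as.data.frame(tt),"race_vs_age.csv")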
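
To see what the recoding and the weighted table do, here is a small sketch on made-up records. The data frame `toy` and its values are invented for illustration, and base R `xtabs` stands in for `questionr::wtd.table` so the example has no dependencies; I have not checked that the two agree in every edge case.

```r
# Made-up records using the same column names as the PUMS person file.
toy <- data.frame(
  RAC1P = c(1, 2, 6, 9, 1),
  HISP  = c("01", "01", "01", "01", "02"),
  AGEP  = c(30, 30, 5, 5, 30),
  PWGTP = c(100, 50, 70, 20, 60),
  stringsAsFactors = FALSE
)

# Position i of the lookup vector is the label for RAC1P code i.
race_labels <- c("wh", "aa", "amin", "amin", "amin", "asn", "asn", "other", "multi")
rstr <- race_labels[toy$RAC1P]

# Any HISP code other than "01" overrides the race label.
toy$r2 <- ifelse(toy$HISP != "01", "hisp", sprintf("nh_%s", rstr))

# Weighted crosstab: sum PWGTP within each (group, age) cell.
tt_toy <- xtabs(PWGTP ~ r2 + AGEP, data = toy)

# Long format: one row per cell, with the summed weight in Freq.
long_toy <- as.data.frame(tt_toy)
```

The last record is white by RAC1P but has HISP "02", so its weight of 60 lands in the "hisp" row rather than "nh_wh".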


====================

> library(dplyr); library(tidyr); library(ggplot2)
> d=read.csv("race_vs_age.csv")

> x=d %>% group_by(Var1,wh_vs_nonwh=ifelse(Var1=="nh_wh","nh_wh","not_nh_wh"),agebin=cut(d$Var2,breaks=c(-1,seq(9,109,10)))) %>% summarise(n=sum(Freq))

> levels(x$agebin)=c("<10",sprintf("%s0-%s9",1:10,1:10))
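
A quick sketch of how the binning behaves at the edges, using a handful of made-up ages (the vector `ages` is only for illustration). `cut` closes intervals on the right by default, so the breaks -1, 9, 19, ... put ages 0 through 9 together, 10 through 19 together, and so on.

```r
ages   <- c(0, 9, 10, 19, 20, 99)
breaks <- c(-1, seq(9, 109, 10))          # -1, 9, 19, ..., 109: 11 bins
agebin <- cut(ages, breaks = breaks)      # intervals are (lo, hi]

# Same relabeling as above: one label per bin, in order.
levels(agebin) <- c("<10", sprintf("%s0-%s9", 1:10, 1:10))

as.character(agebin)
# "<10" "<10" "10-19" "10-19" "20-29" "90-99"
```

The eleventh label, "100-109", exists as a level even when no record falls in it.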

# every group needs a name here: nicenames[] is looked up by name below
> nicenames= c(hisp="Hispanic / Latino", nh_aa="Black / African American (NH)", nh_amin="American Indian, Alaskan Native (NH)", nh_asn="Asian, Native Hawaiian, Pacific Islander (NH)", nh_multi="Two or More Races (NH)", nh_other="Other Race (NH)", nh_wh="White (NH)")

# order levels by prevalence since that governs ggplot's ordering
# (as.character() so the lookup is by name, not by factor level number)
> x$race=nicenames[as.character(x$Var1)];x$race=factor(x$race,levels=(x %>% group_by(race) %>% summarise(n=sum(n)) %>% arrange(-n))$race)


> spread(x,agebin,n) %>% select(-race,-wh_vs_nonwh)
Adding missing grouping variables: `wh_vs_nonwh`
Source: local data frame [7 x 12]
Groups: Var1, wh_vs_nonwh [7]

# A tibble: 7 x 12
  wh_vs_nonwh     Var1    `<10`  `10-19`  `20-29`  `30-39`  `40-49`  `50-59`  `60-69`  `70-79` `80-89` `90-99`
*       <chr>   <fctr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>   <dbl>
1   not_nh_wh     hisp 10313144  9921612  9356556  8724654  7597972  5567980  3386720  1642538  723183  155392
2   not_nh_wh    nh_aa  4614624  4874924  5318127  4754134  4473783  4725170  3801549  1939053  857568  205388
3   not_nh_wh  nh_amin   288889   289642   318043   294449   285490   312398   261597   134297   55850   13078
4   not_nh_wh   nh_asn  2089665  2257305  2789453  2814462  2732702  2600474  2129257  1170944  563307  137442
5   not_nh_wh nh_multi  1541241  1402860  1213400  1021840   896614   890721   730321   390686  175053   41020
6   not_nh_wh nh_other   773277   812905   908492   864089   821845   946265   807962   440870  217319   59325
7       nh_wh    nh_wh 20616586 22513242 24989022 24114557 24087952 28421060 25374920 14569237 7138771 1853248

> y=spread(x,agebin,n) %>% select(-race,-wh_vs_nonwh)
> z=y[,3:ncol(y)]
> p=z[7,]/colSums(z)

Share of each age group that is not non-Hispanic white (row 1) vs. non-Hispanic white (row 2), as proportions.
> rbind(1-p,p)
     <10  10-19  20-29  30-39 40-49  50-59  60-69  70-79  80-89  90-99
1 0.4876 0.4649 0.4434 0.4338 0.411 0.3461 0.3047 0.2819 0.2664 0.2481
2 0.5124 0.5351 0.5566 0.5662 0.589 0.6539 0.6953 0.7181 0.7336 0.7519

Ratio of the non-Hispanic white population to everyone else, per age group
> p/(1-p)
    <10 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89 90-99
1 1.051 1.151 1.255 1.305 1.433 1.889 2.282 2.548 2.754  3.03
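
As a check on the arithmetic, the "<10" column can be redone by hand. The counts below are copied from the table above; the variable names are just for this sketch.

```r
# Weighted counts for the "<10" column, copied from the table above.
nonwhite <- c(hisp = 10313144, nh_aa = 4614624, nh_amin = 288889,
              nh_asn = 2089665, nh_multi = 1541241, nh_other = 773277)
white <- 20616586

p <- white / (white + sum(nonwhite))   # share that is non-Hispanic white
round(p, 4)                            # 0.5124
round(p / (1 - p), 3)                  # 1.051
```

Note that p/(1-p) is the same number as white / sum(nonwhite), so the ratio row can be read directly as "non-Hispanic whites per person in any other group".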


> ggplot(x)+theme_grey()+
    geom_bar(aes(wh_vs_nonwh,n,fill=race),position='stack',stat='identity')+
    facet_wrap(~agebin,nrow=1,strip.position='bottom')+
    theme(panel.spacing=unit(0,"lines"))+
    theme(axis.ticks.x=element_blank(),axis.text.x=element_blank())+
    ylab("Number of people")+xlab("Age")+
    ggtitle("Number of people by race and age, ACS 2016 PUMS data")+
    scale_y_continuous(breaks=b<-seq(0,30e6,5e6),labels=sprintf("%.0f M",b/1e6))