Today we will be working with data from the National Health and Nutrition Examination Survey (NHANES).
NHANES is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it combines interviews and physical examinations. NHANES is a major program of the National Center for Health Statistics (NCHS). NCHS is part of the Centers for Disease Control and Prevention (CDC) and has the responsibility for producing vital and health statistics for the Nation.
The NHANES interview includes demographic, socioeconomic, dietary, and health-related questions. The examination component consists of medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel.
Findings from this survey will be used to determine the prevalence of major diseases and risk factors for diseases. Information will be used to assess nutritional status and its association with health promotion and disease prevention. NHANES findings are also the basis for national standards for such measurements as height, weight, and blood pressure. Data from this survey will be used in epidemiological studies and health sciences research, which help develop sound public health policy, direct and design health programs and services, and expand the health knowledge for the Nation.
The dataset we will be working with data from the 2013-2014 and 2015-2016 surveys. Information about the variables recorded in the data can be found in the following links.
- 2013-2014: https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2013
- Demographics: https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm
- Sex Steroid Hormone - Serum : https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/TST_H.htm
- Reproductive Health: https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/RHQ_H.htm
- 2015-2016: https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2015
- Demographics: https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm
- Sex Steroid Hormone - Serum: https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/TST_I.htm
- Reproductive Health: https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/RHQ_I.htm
Explore the NHANES website (https://www.cdc.gov/nchs/nhanes/index.htm) and in Etherpad describe some of the information and data that can be found there.
We are going to create a new RStudio project called “nhanes” with the following subdirectories.
- data
- results
- scripts
NHANES distributes its data in the SAS transport format (.XPT), and Tidyverse
includes a function haven::read_xpt()
to read these files directly. For
example:
- https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.XPT
- https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT
library(tidyverse)
# Construct URLs for the data we are going to use
tables <- c("DEMO", "TST", "RHQ")
url_h <- str_glue("https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/{tables}_H.XPT")
url_i <- str_glue("https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/{tables}_I.XPT")
url_h
url_i
# Read data using a for loop
nhanes_h <- list()
for(u in url_h) {
nhanes_h[[u]] <- haven::read_xpt(u)
}
# Read data using map()
nhanes_i <- url_i |> map(haven::read_xpt, .progress = TRUE)
# Set the names first
nhanes_i <- set_names(url_i) |> map(haven::read_xpt, .progress = TRUE)
# adjust names
names(nhanes_h) <- names(nhanes_h) |> basename() |> str_remove("_.*$")
names(nhanes_i) <- names(nhanes_i) |> basename() |> str_remove("_.*$")
Various sample weights are available on the data release files – such as the interview weight (WTINT2YR), the MEC exam weight (WTMEC2YR), and several subsample weights. Use of the correct sample weight for NHANES analyses depends on the variables being used. A good rule of thumb is to use “the least common denominator” where the variable of interest that was collected on the smallest number of respondents is the “least common denominator.” The sample weight that applies to that variable is the appropriate one to use for that particular analysis.
In order to work with four years of data, we will bind the rows from both of the datasets and then create new weights appropriate for the larger dataset.
# WTINT2YR - Full sample 2 year interview weight
# WTMEC2YR = Full sample 2 year MEC exam weight
dat <- bind_rows(nhanes_h$DEMO, nhanes_i$DEMO) |>
mutate(WTINT4YR = WTINT2YR/2,
WTMEC4YR = WTMEC2YR/2)
dat
Using the demographic dataset we can measure the the fraction of women who are pregnant based on age. Pregnancy was accessed during the exam stage and we will use the MEC weights in this analysis.
# RIDAGEYR = respondent's age in years
# RIDEXPRG = Pregnancy status at exam
tab <- dat |>
drop_na(RIDEXPRG) |>
count(RIDAGEYR, RIDEXPRG, wt = WTMEC4YR)
tab
Wait. What the heck do 1, 2, and 3 mean under RIDEXPRG. Let's look at the documentation. https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm
RIDEXPRG: Pregnancy status at the time of the health examination was ascertained for females 8–59 years of age. Due to disclosure risks pregnancy status is only released for women 20-44 years of age. The information used to code RIDEXPRG values included self-reported pregnancy status and urine pregnancy test results. Persons who reported they were pregnant at the time of exam were assumed to be pregnant (RIDEXPRG=1). Those who reported they were not pregnant or did not know their pregnancy status were further classified based on the results of the urine pregnancy test. If the respondent reported “no” or “don’t know” and the urine test result was positive, the respondent was coded as pregnant (RIDEXPRG=1). If the respondent reported “no” and the urine test was negative, the respondent was coded not pregnant (RIDEXPRG=2). If the respondent reported did not know her pregnancy status and the urine test was negative, the respondent was coded “could not be determined” (RIDEXPRG=3). Persons who were interviewed, but not examined also have an RIDEXPRG value = 3 (could not be determined).
Lets drop the “could not be determined” pregnancy status and recalculate.
tab <- dat |>
filter(RIDEXPRG %in% 1:2) |>
count(RIDAGEYR, RIDEXPRG, wt = WTMEC4YR) |>
group_by(RIDAGEYR) |>
mutate(p = n / sum(n))
Let's reshape the table
tab |> select(-n) |>
pivot_wider(names_from="RIDEXPRG", values_from="p")
Let's make the table a bit nicer before saving it.
tab2 <- tab |> select(-n) |>
rename(Age = RIDAGEYR, Status = RIDEXPRG) |>
mutate(Status = case_match(Status,
1 ~ "Yes",
2 ~ "No"
)) |>
pivot_wider(names_from="Status", values_from="p")
tail(tab2)
tab2 <- tab2 |> replace_na(list(Yes = 0.0, No = 0.0))
tab2
write_csv(tab2, "results/frac_preg_by_age.csv")
To work with data from multiple tables, we will need to combine the tables in some fashion. Let's use map to quickly look at the number of rows per table.
map(nhanes_h, nrow)
map_int(nhanes_i, nrow)
The number (and order) of rows vary between the tables. We will use a join operation to combine the tables based on the respondent sequence number, “SEQN". Dplyr offers multiple ways to join two data frames
inner_join(x, y)
keeps observations fromx
that have a matching key iny
.left_join(x, y)
keeps observations inx
.right_join(x, y)
keeps observations iny
.full_join(x, y)
keeps observations inx
andy
.- among others.
# SEQN = Respondent sequence number
dat_h <- inner_join(nhanes_h$DEMO, nhanes_h$TST, by = "SEQN")
dat_i <- inner_join(nhanes_i$DEMO, nhanes_i$TST, by = "SEQN")
dat <- bind_rows(dat_h, dat_i) |>
mutate(WTINT4YR = WTINT2YR/2,
WTMEC4YR = WTMEC2YR/2)
In our next analysis we are going to look at levels of sex hormones vary by age. First we are going to create a csv file of our results.
# RIAGENDR = respondent's sex
# RIDAGEYR = respondent's age in years
# LBXTST = Testosterone, total (ng/dL)
# LBXEST = Estradiol (pg/mL)
# Note: 1 ng/dL = 10 pg/mL
tab <- dat |> group_by(RIAGENDR, RIDAGEYR) |>
summarize(mean_est = weighted.mean(LBXEST, w = WTMEC4YR, na.rm = TRUE),
mean_tst = weighted.mean(LBXTST, w = WTMEC4YR, na.rm = TRUE))
write_csv(tab, file = "results/tst_by_age_and_sex.csv")
Next we are going to create a plot of our results.
# First plot attempt of TST
ggplot(tab, aes(x = RIDAGEYR, y = mean_tst)) +
geom_line()
# The previous plot did not distinguish males and females.
# Let's fix that.
ggplot(tab, aes(x = RIDAGEYR, y = mean_tst, color = RIAGENDR)) +
geom_line()
# The plot still didn't look good because RIAGENDR was treated as a continuous variable
# Let's fix that.
ggplot(tab, aes(x = RIDAGEYR, y = mean_tst, color = as.factor(RIAGENDR))) +
geom_line()
ggplot(tab, aes(x = RIDAGEYR, y = mean_est, color = as.factor(RIAGENDR))) +
geom_line()
tab <- tab |> pivot_longer(c(mean_est, mean_tst))
ggplot(tab, aes(x = RIDAGEYR, y = value, color = as.factor(RIAGENDR))) +
facet_wrap(vars(name)) +
geom_line()
There's a lot of noise in the EST data for females. Let's see if it is due to pregnancy. Challenge: Remove pregnant women from the data set and rerun the analysis.
# SOLUTION
# Filter out pregnant women and recreate plot
tab <- dat |>
filter(is.na(RIDEXPRG) | RIDEXPRG != 1) |>
group_by(RIAGENDR, RIDAGEYR) |>
summarize(mean_est = weighted.mean(LBXEST, w = WTMEC4YR, na.rm = TRUE),
mean_tst = weighted.mean(LBXTST, w = WTMEC4YR, na.rm = TRUE)) |>
pivot_longer(c(mean_est, mean_tst))
ggplot(tab, aes(x = RIDAGEYR, y = value, color = as.factor(RIAGENDR))) +
facet_wrap(vars(name)) +
geom_line()
Let's split respondents into Male, Female (not-pregnant), and Female (pregnant).
tab <- dat |>
mutate(sex = case_when(
RIAGENDR == 1 ~ "M",
RIDEXPRG == 1 ~ "FP",
TRUE ~ "FN"
)) |>
group_by(sex, RIDAGEYR) |>
summarize(mean_est = weighted.mean(LBXEST, w = WTMEC4YR, na.rm = TRUE),
mean_tst = weighted.mean(LBXTST, w = WTMEC4YR, na.rm = TRUE)) |>
pivot_longer(c(mean_est, mean_tst))
ggplot(tab, aes(x = RIDAGEYR, y = value, color = sex)) +
facet_wrap(vars(name)) +
geom_line()
If you need to join more than two tables together, you can use multiple inner joins.
# use two inner joins
dat_h <- nhanes_h$DEMO |>
inner_join(nhanes_h$TST) |>
inner_join(nhanes_h$RHQ)
# use the reduce function
dat_i <- nhanes_i |> reduce(inner_join)
dat <- bind_rows(dat_h, dat_i) |>
mutate(WTMEC4YR = WTMEC2YR/2,
WTINT4YR = WTINT2YR/2)
table(dat$RIAGENDR)
Let's plot the pregnancy-by-age data we created previously.
ggplot(tab2, aes(x = Age, y = Yes)) + geom_line()
ggplot(tab2, aes(x = Age, y = Yes)) + geom_col()
We can add more information to this plot if we consider how many deliveries the respondents have had.
# RHQ131 - Ever been pregnant before.
# RHQ171 - How many deliveries live birth result
tab <- dat |>
filter(RIDEXPRG %in% 1:2 & RHQ131 %in% 1:2) |>
rename(Age = RIDAGEYR, Status = RIDEXPRG) |>
mutate(Births = coalesce(RHQ171, 0)) |>
count(Age, Births, Status, wt = WTMEC4YR)
tab <- dat |>
filter(RIDEXPRG %in% 1:2 & RHQ131 %in% 1:2) |>
rename(Age = RIDAGEYR, Status = RIDEXPRG) |>
mutate(Births = coalesce(RHQ171, 0)) |>
count(Age, Births, Status, wt = WTMEC4YR) |>
complete(Age, Births, Status, fill = list(n = 0))
tab2 <- tab |>
group_by(Age) |>
mutate(p = n/sum(n)) |>
filter(Status == 1)
ggplot(tab2, aes(x = Age, y = p, fill = Births)) +
geom_col()
ggplot(tab2, aes(x = Age, y = p, fill = as.character(Births))) +
geom_col() +
scale_fill_brewer(palette = "Set1")
ggplot(tab2, aes(x = Age, y = p, fill = fct_rev(as.character(Births)))) +
geom_col() +
scale_fill_brewer(palette = "Set1", direction = -1)
cut(tab2$Births, breaks = 0:4)
cut(tab2$Births, breaks = 0:4, include.lowest = TRUE)
cut(tab2$Births, breaks = c(0:4,Inf), include.lowest = TRUE)
cut(tab2$Births, breaks = c(0:4,Inf),
include.lowest = TRUE, right = FALSE)
cut(tab2$Births, breaks = c(0:4,Inf),
labels = c(0:3, "4+"),
include.lowest = TRUE, right = FALSE)
cut_births <- function(x) {
cut(x, breaks = c(0:4,Inf),
labels = c(0:3, "4+"),
include.lowest = TRUE,
right = FALSE)
}
ggplot(tab2, aes(x = Age, y = p, fill = fct_rev(cut_births(Births)))) +
geom_col() +
scale_fill_brewer(palette = "Set1", direction = -1)
Calculate the fraction of females who are pregnant in the following age ranges: 20-25, 26-30, 31-35, 36-40, 41-44. Write the results to a file.
# SOLUTION
cut_breaks <- c(20,26,31,36,41,45)
cut_labels <- str_glue("{cut_breaks[-length(cut_breaks)]}-{cut_breaks[-1]-1}")
tab <- dat |>
mutate(Age = cut(RIDAGEYR, breaks = cut_breaks, labels = cut_labels, right = FALSE)) |>
filter(RIAGENDR == 2 & !is.na(Age) & RIDEXPRG %in% 1:2) |>
count(Age, RIDEXPRG, wt = WTMEC4YR) |>
group_by(Age) |>
mutate(Preg = n / sum(n))
tab <- tab |>
filter(RIDEXPRG == 1) |>
select(Age, Preg)
write_csv(tab, "challenge_3.csv")