reedacartwright/lesson.md

## lesson.md

      
    Raw
  

              lesson.md
            
          
    Introduction

Today we will be working with data from the National Health and Nutrition
Examination Survey (NHANES).
NHANES is a program of studies designed to assess the health and nutritional
status of adults and children in the United States. The survey is unique in
that it combines interviews and physical examinations. NHANES is a major
program of the National Center for Health Statistics (NCHS). NCHS is part of
the Centers for Disease Control and Prevention (CDC) and has the responsibility
for producing vital and health statistics for the Nation.
The NHANES interview includes demographic, socioeconomic, dietary, and
health-related questions. The examination component consists of medical,
dental, and physiological measurements, as well as laboratory tests
administered by highly trained medical personnel.
Findings from this survey will be used to determine the prevalence of major
diseases and risk factors for diseases. Information will be used to assess
nutritional status and its association with health promotion and disease
prevention. NHANES findings are also the basis for national standards for such
measurements as height, weight, and blood pressure. Data from this survey will
be used in epidemiological studies and health sciences research, which help
develop sound public health policy, direct and design health programs and
services, and expand the health knowledge for the Nation.
The dataset we will be working with data from the 2013-2014 and 2015-2016
surveys. Information about the variables recorded in the data can be found in
the following links.

2013-2014: https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2013

Demographics: https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm
Sex Steroid Hormone - Serum : https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/TST_H.htm
Reproductive Health: https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/RHQ_H.htm


2015-2016: https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2015

Demographics: https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm
Sex Steroid Hormone - Serum: https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/TST_I.htm
Reproductive Health: https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/RHQ_I.htm


Challenge

Explore the NHANES website (https://www.cdc.gov/nchs/nhanes/index.htm) and in
Etherpad describe some of the information and data that can be found there.
Create a New Project

We are going to create a new RStudio project called “nhanes” with the following
subdirectories.

data
results
scripts

Import NHANES Data from the Web

NHANES distributes its data in the SAS transport format (.XPT), and Tidyverse
includes a function haven::read_xpt() to read these files directly. For
example:

https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.XPT
https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT

library(tidyverse)

# Construct URLs for the data we are going to use
tables <- c("DEMO", "TST", "RHQ")

url_h <- str_glue("https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/{tables}_H.XPT")
url_i <- str_glue("https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/{tables}_I.XPT")

url_h
url_i

# Read data using a for loop
nhanes_h <- list()
for(u in url_h) {
    nhanes_h[[u]] <- haven::read_xpt(u)
}

# Read data using map()
nhanes_i <- url_i |> map(haven::read_xpt, .progress = TRUE)

# Set the names first
nhanes_i <- set_names(url_i) |> map(haven::read_xpt, .progress = TRUE)

# adjust names
names(nhanes_h) <- names(nhanes_h) |> basename() |> str_remove("_.*$")
names(nhanes_i) <- names(nhanes_i) |> basename() |> str_remove("_.*$")

Weights in NHANES

Various sample weights are available on the data release files – such as the
interview weight (WTINT2YR), the MEC exam weight (WTMEC2YR), and several
subsample weights. Use of the correct sample weight for NHANES analyses depends
on the variables being used. A good rule of thumb is to use “the least common
denominator” where the variable of interest that was collected on the smallest
number of respondents is the “least common denominator.” The sample weight that
applies to that variable is the appropriate one to use for that particular
analysis.
In order to work with four years of data, we will bind the rows from both of
the datasets and then create new weights appropriate for the larger dataset.
# WTINT2YR - Full sample 2 year interview weight
# WTMEC2YR = Full sample 2 year MEC exam weight
dat <- bind_rows(nhanes_h$DEMO, nhanes_i$DEMO) |>
    mutate(WTINT4YR = WTINT2YR/2,
           WTMEC4YR = WTMEC2YR/2)

dat
Using the demographic dataset we can measure the the fraction of women who
are pregnant based on age. Pregnancy was accessed during the exam stage
and we will use the MEC weights in this analysis.
# RIDAGEYR = respondent's age in years
# RIDEXPRG = Pregnancy status at exam
tab <- dat |>
    drop_na(RIDEXPRG) |>
    count(RIDAGEYR, RIDEXPRG, wt = WTMEC4YR)

tab
Wait. What the heck do 1, 2, and 3 mean under RIDEXPRG. Let's look at the
documentation. https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm

RIDEXPRG: Pregnancy status at the time of the health examination was
ascertained for females 8–59 years of age. Due to disclosure risks pregnancy
status is only released for women 20-44 years of age. The information used to
code RIDEXPRG values included self-reported pregnancy status and urine
pregnancy test results. Persons who reported they were pregnant at the time
of exam were assumed to be pregnant (RIDEXPRG=1). Those who reported they
were not pregnant or did not know their pregnancy status were further
classified based on the results of the urine pregnancy test. If the
respondent reported “no” or “don’t know” and the urine test result was
positive, the respondent was coded as pregnant (RIDEXPRG=1). If the
respondent reported “no” and the urine test was negative, the respondent was
coded not pregnant (RIDEXPRG=2). If the respondent reported did not know her
pregnancy status and the urine test was negative, the respondent was
coded “could not be determined” (RIDEXPRG=3). Persons who were interviewed,
but not examined also have an RIDEXPRG value = 3 (could not be determined).

Lets drop the “could not be determined” pregnancy status and recalculate.
tab <- dat |>
    filter(RIDEXPRG %in% 1:2) |>
    count(RIDAGEYR, RIDEXPRG, wt = WTMEC4YR) |>
    group_by(RIDAGEYR) |>
    mutate(p = n / sum(n))
Let's reshape the table
tab |> select(-n) |>
    pivot_wider(names_from="RIDEXPRG", values_from="p")
Let's make the table a bit nicer before saving it.
tab2 <- tab |> select(-n) |>
    rename(Age = RIDAGEYR, Status = RIDEXPRG) |>
    mutate(Status = case_match(Status,
        1 ~ "Yes",
        2 ~ "No"
    )) |>
    pivot_wider(names_from="Status", values_from="p")

tail(tab2)

tab2 <- tab2 |> replace_na(list(Yes = 0.0, No = 0.0))

tab2

write_csv(tab2, "results/frac_preg_by_age.csv")
Changes in Hormones Levels by Age

To work with data from multiple tables, we will need to combine the tables in
some fashion. Let's use map to quickly look at the number of rows per table.
map(nhanes_h, nrow)
map_int(nhanes_i, nrow)
The number (and order) of rows vary between the tables. We will use a join
operation to combine the tables based on the respondent sequence number, “SEQN".
Dplyr offers multiple ways to join two data frames

inner_join(x, y) keeps observations from x that have a matching key in y.
left_join(x, y) keeps observations in x.
right_join(x, y) keeps observations in y.
full_join(x, y) keeps observations in x and y.
among others.

# SEQN = Respondent sequence number
dat_h <- inner_join(nhanes_h$DEMO, nhanes_h$TST, by = "SEQN")
dat_i <- inner_join(nhanes_i$DEMO, nhanes_i$TST, by = "SEQN")

dat <- bind_rows(dat_h, dat_i) |>
    mutate(WTINT4YR = WTINT2YR/2,
           WTMEC4YR = WTMEC2YR/2)
In our next analysis we are going to look at levels of sex hormones
vary by age. First we are going to create a csv file of our results.
# RIAGENDR = respondent's sex
# RIDAGEYR = respondent's age in years
# LBXTST = Testosterone, total (ng/dL)
# LBXEST = Estradiol (pg/mL)
# Note: 1 ng/dL = 10 pg/mL
tab <- dat |> group_by(RIAGENDR, RIDAGEYR) |> 
    summarize(mean_est = weighted.mean(LBXEST, w = WTMEC4YR, na.rm = TRUE),
        mean_tst = weighted.mean(LBXTST, w = WTMEC4YR, na.rm = TRUE))

write_csv(tab, file = "results/tst_by_age_and_sex.csv")
Next we are going to create a plot of our results.
Testosterone Levels

# First plot attempt of TST
ggplot(tab, aes(x = RIDAGEYR, y = mean_tst)) +
    geom_line()

# The previous plot did not distinguish males and females.
# Let's fix that.
ggplot(tab, aes(x = RIDAGEYR, y = mean_tst, color = RIAGENDR)) +
    geom_line()

# The plot still didn't look good because RIAGENDR was treated as a continuous variable
# Let's fix that.
ggplot(tab, aes(x = RIDAGEYR, y = mean_tst, color = as.factor(RIAGENDR))) +
    geom_line()
Estrogen Levels

ggplot(tab, aes(x = RIDAGEYR, y = mean_est, color = as.factor(RIAGENDR))) +
    geom_line()
Both Levels Together

tab <- tab |> pivot_longer(c(mean_est, mean_tst))

ggplot(tab, aes(x = RIDAGEYR, y = value, color = as.factor(RIAGENDR))) +
    facet_wrap(vars(name)) +
    geom_line()
Challenge

There's a lot of noise in the EST data for females. Let's see if it is due to
pregnancy. Challenge: Remove pregnant women from the data set and rerun the
analysis.
# SOLUTION
# Filter out pregnant women and recreate plot
tab <- dat |>
    filter(is.na(RIDEXPRG) | RIDEXPRG != 1) |>
    group_by(RIAGENDR, RIDAGEYR) |>
    summarize(mean_est = weighted.mean(LBXEST, w = WTMEC4YR, na.rm = TRUE),
        mean_tst = weighted.mean(LBXTST, w = WTMEC4YR, na.rm = TRUE)) |>
    pivot_longer(c(mean_est, mean_tst))

ggplot(tab, aes(x = RIDAGEYR, y = value, color = as.factor(RIAGENDR))) +
    facet_wrap(vars(name)) +
    geom_line()
Splitting

Let's split respondents into Male, Female (not-pregnant), and Female (pregnant).
tab <- dat |>
    mutate(sex = case_when(
        RIAGENDR == 1 ~ "M",
        RIDEXPRG == 1 ~ "FP",
        TRUE ~ "FN"
    )) |>
    group_by(sex, RIDAGEYR) |>
    summarize(mean_est = weighted.mean(LBXEST, w = WTMEC4YR, na.rm = TRUE),
        mean_tst = weighted.mean(LBXTST, w = WTMEC4YR, na.rm = TRUE)) |>
    pivot_longer(c(mean_est, mean_tst))

ggplot(tab, aes(x = RIDAGEYR, y = value, color = sex)) +
    facet_wrap(vars(name)) +
    geom_line()
Exploring Pregnancy in More Detail

If you need to join more than two tables together, you can use multiple inner joins.
# use two inner joins
dat_h <- nhanes_h$DEMO |>
    inner_join(nhanes_h$TST) |>
    inner_join(nhanes_h$RHQ)

# use the reduce function
dat_i <- nhanes_i |> reduce(inner_join)

dat <- bind_rows(dat_h, dat_i) |>
    mutate(WTMEC4YR = WTMEC2YR/2,
           WTINT4YR = WTINT2YR/2)

table(dat$RIAGENDR)
Let's plot the pregnancy-by-age data we created previously.
ggplot(tab2, aes(x = Age, y = Yes)) + geom_line()

ggplot(tab2, aes(x = Age, y = Yes)) + geom_col()
We can add more information to this plot if we consider how many
deliveries the respondents have had.
# RHQ131 - Ever been pregnant before.
# RHQ171 - How many deliveries live birth result
tab <- dat |>
    filter(RIDEXPRG %in% 1:2 & RHQ131 %in% 1:2) |>
    rename(Age = RIDAGEYR, Status = RIDEXPRG) |>
    mutate(Births = coalesce(RHQ171, 0)) |>
    count(Age, Births, Status, wt = WTMEC4YR)

tab <- dat |>
    filter(RIDEXPRG %in% 1:2 & RHQ131 %in% 1:2) |>
    rename(Age = RIDAGEYR, Status = RIDEXPRG) |>
    mutate(Births = coalesce(RHQ171, 0)) |>
    count(Age, Births, Status, wt = WTMEC4YR) |>
    complete(Age, Births, Status, fill = list(n = 0))

tab2 <- tab |>
    group_by(Age) |>
    mutate(p = n/sum(n)) |>
    filter(Status == 1)

ggplot(tab2, aes(x = Age, y = p, fill = Births)) +
    geom_col()

ggplot(tab2, aes(x = Age, y = p, fill = as.character(Births))) +
    geom_col() +
    scale_fill_brewer(palette = "Set1")

ggplot(tab2, aes(x = Age, y = p, fill = fct_rev(as.character(Births)))) +
    geom_col() +
    scale_fill_brewer(palette = "Set1", direction = -1)


cut(tab2$Births, breaks = 0:4)
cut(tab2$Births, breaks = 0:4, include.lowest = TRUE)
cut(tab2$Births, breaks = c(0:4,Inf), include.lowest = TRUE)
cut(tab2$Births, breaks = c(0:4,Inf),
    include.lowest = TRUE, right = FALSE)
cut(tab2$Births, breaks = c(0:4,Inf),
    labels = c(0:3, "4+"),
    include.lowest = TRUE, right = FALSE)

cut_births <- function(x) {
    cut(x, breaks = c(0:4,Inf),
        labels = c(0:3, "4+"),
        include.lowest = TRUE,
        right = FALSE)
}

ggplot(tab2, aes(x = Age, y = p, fill = fct_rev(cut_births(Births)))) +
    geom_col() +
    scale_fill_brewer(palette = "Set1", direction = -1)
Challenge

Calculate the fraction of females who are pregnant in the following age ranges:
20-25, 26-30, 31-35, 36-40, 41-44. Write the results to a file.
# SOLUTION
cut_breaks <- c(20,26,31,36,41,45)
cut_labels <- str_glue("{cut_breaks[-length(cut_breaks)]}-{cut_breaks[-1]-1}")
tab <- dat |> 
    mutate(Age = cut(RIDAGEYR, breaks = cut_breaks, labels = cut_labels, right = FALSE)) |>
    filter(RIAGENDR == 2 & !is.na(Age) & RIDEXPRG %in% 1:2) |>
    count(Age, RIDEXPRG, wt = WTMEC4YR) |>
    group_by(Age) |>
    mutate(Preg = n / sum(n))

tab <- tab |>
    filter(RIDEXPRG == 1) |>
    select(Age, Preg)

write_csv(tab, "challenge_3.csv")