Skip to content

Instantly share code, notes, and snippets.

@reedacartwright
Last active April 28, 2024 22:06
Show Gist options
  • Save reedacartwright/8d15fd091a30c38d980232b288d7ab69 to your computer and use it in GitHub Desktop.
Save reedacartwright/8d15fd091a30c38d980232b288d7ab69 to your computer and use it in GitHub Desktop.
NHANES Analysis

Introduction

Today we will be working with data from the National Health and Nutrition Examination Survey (NHANES).

NHANES is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it combines interviews and physical examinations. NHANES is a major program of the National Center for Health Statistics (NCHS). NCHS is part of the Centers for Disease Control and Prevention (CDC) and has the responsibility for producing vital and health statistics for the Nation.

The NHANES interview includes demographic, socioeconomic, dietary, and health-related questions. The examination component consists of medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel.

Findings from this survey will be used to determine the prevalence of major diseases and risk factors for diseases. Information will be used to assess nutritional status and its association with health promotion and disease prevention. NHANES findings are also the basis for national standards for such measurements as height, weight, and blood pressure. Data from this survey will be used in epidemiological studies and health sciences research, which help develop sound public health policy, direct and design health programs and services, and expand the health knowledge for the Nation.

The dataset we will be working with data from the 2013-2014 and 2015-2016 surveys. Information about the variables recorded in the data can be found in the following links.

Challenge

Explore the NHANES website (https://www.cdc.gov/nchs/nhanes/index.htm) and in Etherpad describe some of the information and data that can be found there.

Create a New Project

We are going to create a new RStudio project called “nhanes” with the following subdirectories.

  • data
  • results
  • scripts

Import NHANES Data from the Web

NHANES distributes its data in the SAS transport format (.XPT), and Tidyverse includes a function haven::read_xpt() to read these files directly. For example:

library(tidyverse)

# Construct URLs for the data we are going to use
tables <- c("DEMO", "TST", "RHQ")

url_h <- str_glue("https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/{tables}_H.XPT")
url_i <- str_glue("https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/{tables}_I.XPT")

url_h
url_i

# Read data using a for loop
nhanes_h <- list()
for(u in url_h) {
    nhanes_h[[u]] <- haven::read_xpt(u)
}

# Read data using map()
nhanes_i <- url_i |> map(haven::read_xpt, .progress = TRUE)

# Set the names first
nhanes_i <- set_names(url_i) |> map(haven::read_xpt, .progress = TRUE)

# adjust names
names(nhanes_h) <- names(nhanes_h) |> basename() |> str_remove("_.*$")
names(nhanes_i) <- names(nhanes_i) |> basename() |> str_remove("_.*$")

Weights in NHANES

Various sample weights are available on the data release files – such as the interview weight (WTINT2YR), the MEC exam weight (WTMEC2YR), and several subsample weights. Use of the correct sample weight for NHANES analyses depends on the variables being used. A good rule of thumb is to use “the least common denominator” where the variable of interest that was collected on the smallest number of respondents is the “least common denominator.” The sample weight that applies to that variable is the appropriate one to use for that particular analysis.

In order to work with four years of data, we will bind the rows from both of the datasets and then create new weights appropriate for the larger dataset.

# WTINT2YR - Full sample 2 year interview weight
# WTMEC2YR = Full sample 2 year MEC exam weight
dat <- bind_rows(nhanes_h$DEMO, nhanes_i$DEMO) |>
    mutate(WTINT4YR = WTINT2YR/2,
           WTMEC4YR = WTMEC2YR/2)

dat

Using the demographic dataset we can measure the the fraction of women who are pregnant based on age. Pregnancy was accessed during the exam stage and we will use the MEC weights in this analysis.

# RIDAGEYR = respondent's age in years
# RIDEXPRG = Pregnancy status at exam
tab <- dat |>
    drop_na(RIDEXPRG) |>
    count(RIDAGEYR, RIDEXPRG, wt = WTMEC4YR)

tab

Wait. What the heck do 1, 2, and 3 mean under RIDEXPRG. Let's look at the documentation. https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm

RIDEXPRG: Pregnancy status at the time of the health examination was ascertained for females 8–59 years of age. Due to disclosure risks pregnancy status is only released for women 20-44 years of age. The information used to code RIDEXPRG values included self-reported pregnancy status and urine pregnancy test results. Persons who reported they were pregnant at the time of exam were assumed to be pregnant (RIDEXPRG=1). Those who reported they were not pregnant or did not know their pregnancy status were further classified based on the results of the urine pregnancy test. If the respondent reported “no” or “don’t know” and the urine test result was positive, the respondent was coded as pregnant (RIDEXPRG=1). If the respondent reported “no” and the urine test was negative, the respondent was coded not pregnant (RIDEXPRG=2). If the respondent reported did not know her pregnancy status and the urine test was negative, the respondent was coded “could not be determined” (RIDEXPRG=3). Persons who were interviewed, but not examined also have an RIDEXPRG value = 3 (could not be determined).

Lets drop the “could not be determined” pregnancy status and recalculate.

tab <- dat |>
    filter(RIDEXPRG %in% 1:2) |>
    count(RIDAGEYR, RIDEXPRG, wt = WTMEC4YR) |>
    group_by(RIDAGEYR) |>
    mutate(p = n / sum(n))

Let's reshape the table

tab |> select(-n) |>
    pivot_wider(names_from="RIDEXPRG", values_from="p")

Let's make the table a bit nicer before saving it.

tab2 <- tab |> select(-n) |>
    rename(Age = RIDAGEYR, Status = RIDEXPRG) |>
    mutate(Status = case_match(Status,
        1 ~ "Yes",
        2 ~ "No"
    )) |>
    pivot_wider(names_from="Status", values_from="p")

tail(tab2)

tab2 <- tab2 |> replace_na(list(Yes = 0.0, No = 0.0))

tab2

write_csv(tab2, "results/frac_preg_by_age.csv")

Changes in Hormones Levels by Age

To work with data from multiple tables, we will need to combine the tables in some fashion. Let's use map to quickly look at the number of rows per table.

map(nhanes_h, nrow)
map_int(nhanes_i, nrow)

The number (and order) of rows vary between the tables. We will use a join operation to combine the tables based on the respondent sequence number, “SEQN". Dplyr offers multiple ways to join two data frames

  • inner_join(x, y) keeps observations from x that have a matching key in y.
  • left_join(x, y) keeps observations in x.
  • right_join(x, y) keeps observations in y.
  • full_join(x, y) keeps observations in x and y.
  • among others.
# SEQN = Respondent sequence number
dat_h <- inner_join(nhanes_h$DEMO, nhanes_h$TST, by = "SEQN")
dat_i <- inner_join(nhanes_i$DEMO, nhanes_i$TST, by = "SEQN")

dat <- bind_rows(dat_h, dat_i) |>
    mutate(WTINT4YR = WTINT2YR/2,
           WTMEC4YR = WTMEC2YR/2)

In our next analysis we are going to look at levels of sex hormones vary by age. First we are going to create a csv file of our results.

# RIAGENDR = respondent's sex
# RIDAGEYR = respondent's age in years
# LBXTST = Testosterone, total (ng/dL)
# LBXEST = Estradiol (pg/mL)
# Note: 1 ng/dL = 10 pg/mL
tab <- dat |> group_by(RIAGENDR, RIDAGEYR) |> 
    summarize(mean_est = weighted.mean(LBXEST, w = WTMEC4YR, na.rm = TRUE),
        mean_tst = weighted.mean(LBXTST, w = WTMEC4YR, na.rm = TRUE))

write_csv(tab, file = "results/tst_by_age_and_sex.csv")

Next we are going to create a plot of our results.

Testosterone Levels

# First plot attempt of TST
ggplot(tab, aes(x = RIDAGEYR, y = mean_tst)) +
    geom_line()

# The previous plot did not distinguish males and females.
# Let's fix that.
ggplot(tab, aes(x = RIDAGEYR, y = mean_tst, color = RIAGENDR)) +
    geom_line()

# The plot still didn't look good because RIAGENDR was treated as a continuous variable
# Let's fix that.
ggplot(tab, aes(x = RIDAGEYR, y = mean_tst, color = as.factor(RIAGENDR))) +
    geom_line()

Estrogen Levels

ggplot(tab, aes(x = RIDAGEYR, y = mean_est, color = as.factor(RIAGENDR))) +
    geom_line()

Both Levels Together

tab <- tab |> pivot_longer(c(mean_est, mean_tst))

ggplot(tab, aes(x = RIDAGEYR, y = value, color = as.factor(RIAGENDR))) +
    facet_wrap(vars(name)) +
    geom_line()

Challenge

There's a lot of noise in the EST data for females. Let's see if it is due to pregnancy. Challenge: Remove pregnant women from the data set and rerun the analysis.

# SOLUTION
# Filter out pregnant women and recreate plot
tab <- dat |>
    filter(is.na(RIDEXPRG) | RIDEXPRG != 1) |>
    group_by(RIAGENDR, RIDAGEYR) |>
    summarize(mean_est = weighted.mean(LBXEST, w = WTMEC4YR, na.rm = TRUE),
        mean_tst = weighted.mean(LBXTST, w = WTMEC4YR, na.rm = TRUE)) |>
    pivot_longer(c(mean_est, mean_tst))

ggplot(tab, aes(x = RIDAGEYR, y = value, color = as.factor(RIAGENDR))) +
    facet_wrap(vars(name)) +
    geom_line()

Splitting

Let's split respondents into Male, Female (not-pregnant), and Female (pregnant).

tab <- dat |>
    mutate(sex = case_when(
        RIAGENDR == 1 ~ "M",
        RIDEXPRG == 1 ~ "FP",
        TRUE ~ "FN"
    )) |>
    group_by(sex, RIDAGEYR) |>
    summarize(mean_est = weighted.mean(LBXEST, w = WTMEC4YR, na.rm = TRUE),
        mean_tst = weighted.mean(LBXTST, w = WTMEC4YR, na.rm = TRUE)) |>
    pivot_longer(c(mean_est, mean_tst))

ggplot(tab, aes(x = RIDAGEYR, y = value, color = sex)) +
    facet_wrap(vars(name)) +
    geom_line()

Exploring Pregnancy in More Detail

If you need to join more than two tables together, you can use multiple inner joins.

# use two inner joins
dat_h <- nhanes_h$DEMO |>
    inner_join(nhanes_h$TST) |>
    inner_join(nhanes_h$RHQ)

# use the reduce function
dat_i <- nhanes_i |> reduce(inner_join)

dat <- bind_rows(dat_h, dat_i) |>
    mutate(WTMEC4YR = WTMEC2YR/2,
           WTINT4YR = WTINT2YR/2)

table(dat$RIAGENDR)

Let's plot the pregnancy-by-age data we created previously.

ggplot(tab2, aes(x = Age, y = Yes)) + geom_line()

ggplot(tab2, aes(x = Age, y = Yes)) + geom_col()

We can add more information to this plot if we consider how many deliveries the respondents have had.

# RHQ131 - Ever been pregnant before.
# RHQ171 - How many deliveries live birth result
tab <- dat |>
    filter(RIDEXPRG %in% 1:2 & RHQ131 %in% 1:2) |>
    rename(Age = RIDAGEYR, Status = RIDEXPRG) |>
    mutate(Births = coalesce(RHQ171, 0)) |>
    count(Age, Births, Status, wt = WTMEC4YR)

tab <- dat |>
    filter(RIDEXPRG %in% 1:2 & RHQ131 %in% 1:2) |>
    rename(Age = RIDAGEYR, Status = RIDEXPRG) |>
    mutate(Births = coalesce(RHQ171, 0)) |>
    count(Age, Births, Status, wt = WTMEC4YR) |>
    complete(Age, Births, Status, fill = list(n = 0))

tab2 <- tab |>
    group_by(Age) |>
    mutate(p = n/sum(n)) |>
    filter(Status == 1)

ggplot(tab2, aes(x = Age, y = p, fill = Births)) +
    geom_col()

ggplot(tab2, aes(x = Age, y = p, fill = as.character(Births))) +
    geom_col() +
    scale_fill_brewer(palette = "Set1")

ggplot(tab2, aes(x = Age, y = p, fill = fct_rev(as.character(Births)))) +
    geom_col() +
    scale_fill_brewer(palette = "Set1", direction = -1)


cut(tab2$Births, breaks = 0:4)
cut(tab2$Births, breaks = 0:4, include.lowest = TRUE)
cut(tab2$Births, breaks = c(0:4,Inf), include.lowest = TRUE)
cut(tab2$Births, breaks = c(0:4,Inf),
    include.lowest = TRUE, right = FALSE)
cut(tab2$Births, breaks = c(0:4,Inf),
    labels = c(0:3, "4+"),
    include.lowest = TRUE, right = FALSE)

cut_births <- function(x) {
    cut(x, breaks = c(0:4,Inf),
        labels = c(0:3, "4+"),
        include.lowest = TRUE,
        right = FALSE)
}

ggplot(tab2, aes(x = Age, y = p, fill = fct_rev(cut_births(Births)))) +
    geom_col() +
    scale_fill_brewer(palette = "Set1", direction = -1)

Challenge

Calculate the fraction of females who are pregnant in the following age ranges: 20-25, 26-30, 31-35, 36-40, 41-44. Write the results to a file.

# SOLUTION
cut_breaks <- c(20,26,31,36,41,45)
cut_labels <- str_glue("{cut_breaks[-length(cut_breaks)]}-{cut_breaks[-1]-1}")
tab <- dat |> 
    mutate(Age = cut(RIDAGEYR, breaks = cut_breaks, labels = cut_labels, right = FALSE)) |>
    filter(RIAGENDR == 2 & !is.na(Age) & RIDEXPRG %in% 1:2) |>
    count(Age, RIDEXPRG, wt = WTMEC4YR) |>
    group_by(Age) |>
    mutate(Preg = n / sum(n))

tab <- tab |>
    filter(RIDEXPRG == 1) |>
    select(Age, Preg)

write_csv(tab, "challenge_3.csv")
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment