@jenniferthompson
Last active November 11, 2023 23:56
Example structure for data dictionary + code used for derivation using RMarkdown. Creates three data tables and documents general + field-specific info.
---
title: "Example Data Dictionary"
author: "Jennifer Thompson"
date: "11/1/2018"
output:
  html_document:
    theme: yeti
    code_folding: hide
---
*This is a toy example of how I created a data dictionary for tables with flags
to include patients in specific cohorts (eg, all hospital survivors, all patients
with data at a follow-up time point, etc).*

Several analysts will be using the combined data from Study A and Study B,
necessitating a single "source of truth" for criteria to determine common
cohorts. Indicators for specific cohorts will be created and documented below,
alongside a data dictionary for each of three data tables which should be
incorporated into all future analyses using this data. Our goal is to eliminate
confusion and inconsistencies resulting from different analysts making slightly
different decisions when building a cohort for a project.

# File Structure

These cohort definitions are applied to *analysis* datasets, created in separate
scripts and stored in `[data file]`. Code for creating individual variables for
analysis is available in `derivationscripts/[study]_datamgmt.R`; code used to
combine the two studies' data into a single deidentified data file is in
`combine_data.R`, all in this directory.
All tables created in this document are stored in `cohorttables/` within this
directory, in both RDS and CSV formats, and are intended for merging with the
tables in `[data file]`.
```{r setup, message = FALSE}
knitr::opts_chunk$set(message = FALSE, eval = FALSE) ## obvs, eval = TRUE in real life
library(knitr)
library(tidyverse)
library(kableExtra)
```
```{r load_data}
load('analysisdata/datafile.Rdata')
```

# In-Hospital Indicators

```{r inhosp, eval = TRUE}
################################################################################
## In-hospital indicators
################################################################################
## -- Prep ---------------------------------------------------------------------
## Derivation code would go here!
## -- Create indicators for in-hospital statuses: ------------------------------
## More derivation code!
## data.frame to contain info; putting it in a data.frame allows the table to be
## prettified with kableExtra
inhosp_info <- tribble(
~ indicator, ~ description,
"`wd_data`", "Patient withdrew from study *and* revoked permission to use any data. **All other data should be missing.**",
"`died_inhosp`", "Died during index hospitalization at any point after enrollment.",
"`wd_inhosp`", "Withdrew from study during index hospitalization (but allowed use of data already collected).",
"`hosp_survivor`", "Survived index hospitalization without death or withdrawal.",
"`had_biomarker`", "Had >=1 measurement of the following biomarkers during hospitalization (per protocol, these were drawn on days 1, 3, 5 following enrollment): [list specific markers]"
)
```
`inhosp_df` includes one row per enrolled patient (total N = `nrow(df1)`) and
one column for each of the following indicators (`TRUE/FALSE`):
```{r print_inhosp, eval = TRUE}
kable(
inhosp_info,
format = "html",
col.names = c("", "")
) %>%
group_rows(index = c("Discharge Status" = 4, "In-Hospital Cohorts" = 1)) %>%
kable_styling(bootstrap_options = c("hover"))
```
Each patient must have >=1 of `died_inhosp`, `wd_inhosp`, and `hosp_survivor` =
`TRUE`. It is possible to have death information on a withdrawn patient, as some
patients allowed us to continue to access their medical records after
withdrawal.
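
The rule above lends itself to an automated check before the table is saved. The following is a minimal sketch on toy data, not the derivation code itself: the `id` column, the toy rows, and the choice to exempt `wd_data` patients (whose other fields should be missing) are all assumptions made for illustration.

```r
library(dplyr)

## Hypothetical toy data; in the real document, `inhosp_df` would come from the
## derivation code in the `inhosp` chunk. The `id` column name is an assumption.
inhosp_toy <- tibble::tribble(
  ~ id,    ~ wd_data, ~ died_inhosp, ~ wd_inhosp, ~ hosp_survivor,
  "A-001", FALSE,     FALSE,         FALSE,       TRUE,
  "A-002", FALSE,     TRUE,          TRUE,        FALSE,  ## withdrew; death found in records
  "B-001", TRUE,      NA,            NA,          NA      ## revoked all data
)

## Count how many discharge statuses are TRUE for each patient
status_check <- inhosp_toy %>%
  mutate(n_status = died_inhosp + wd_inhosp + hosp_survivor)

## Patients whose data remain usable need at least one discharge status
no_status <- status_check %>%
  filter(!wd_data, is.na(n_status) | n_status < 1)

## Survivors should have neither died nor withdrawn in the hospital
bad_survivor <- status_check %>%
  filter(!wd_data, hosp_survivor, died_inhosp | wd_inhosp)

stopifnot(
  !anyDuplicated(inhosp_toy$id),  ## one row per enrolled patient
  nrow(no_status) == 0,
  nrow(bad_survivor) == 0
)
```

A check like this fails loudly at knit time if a derivation change breaks the documented rule, rather than letting the inconsistency reach downstream analyses.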

# Overall Patient Status at Each Time Point

These studies are longitudinal, and have several prespecified time points for
data collection:
- **In-hospital**: Data was collected daily in the hospital, during and
following critical illness, until death, withdrawal, or discharge from the
hospital.
- **[Original follow-up points]**: The original follow-up time points for these
studies, when a full this and that battery was performed for all available
patients. If patients were found to have died since last contact, this
information was also entered; similarly, if patients withdrew from the study
when they were contacted for assessment, this was noted. Some patients could not
be found or could not complete an assessment; these patients are considered
"lost to follow-up."
- **[Later follow-up points]**: These follow-up points were added for Study A
**only** after the initial studies were complete, using a battery similar to
the one performed at the time1 and time2 follow-ups.
```{r overall_status}
################################################################################
## Status indicators at ALL time points
## (in-hospital, ...)
################################################################################
## -- Prep ---------------------------------------------------------------------
## You guessed it! More derivation code!
## -- Create dummy dataset: all IDs, all time points ---------------------------
## And yet MORE CODE
status_info <- tribble(
~ indicator, ~ description,
"`timept`", "Time point (`inhosp`, `time1`, `time2`, `time3`, `time4`)",
"`wd_data`", "Patient withdrew from study *and* revoked permission to use any data. **All other data should be missing.**",
"`status`", "Status at this time point (`Deceased`, `Withdrawn`, or `Survived, in study`)",
"`died`", "Patient deceased at (or prior to) this time point",
"`wd`", "Patient withdrew at (or prior to) this time point",
"`alive_instudy`", "Patient remained alive and in the study at this time point; may or may not have assessment data"
)
```
`status_df` includes one row per enrolled patient per time point (5 time points
for Study A patients, and 3 time points for Study B patients), and one column
for each of the following fields (indicators are `TRUE/FALSE`):
```{r print_status}
kable(
status_info,
format = "html",
col.names = c("", "")
) %>%
group_rows(index = c(" " = 3, "Status Indicators" = 3)) %>%
kable_styling(bootstrap_options = c("hover"))
```
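
One way to build the "all IDs, all time points" skeleton that `status_df` is described as having is sketched below. The `enrolled` table and its `id` and `study` columns are hypothetical; only the time point names and the Study A versus Study B counts come from this document.

```r
library(dplyr)
library(tidyr)

## Hypothetical enrollment list; `id` and `study` are assumed column names
enrolled <- tibble(
  id    = c("A-001", "A-002", "B-001"),
  study = c("A", "A", "B")
)

all_timepts <- c("inhosp", "time1", "time2", "time3", "time4")

## Cross every enrolled patient with every time point...
skeleton <- enrolled %>%
  crossing(timept = all_timepts) %>%
  ## ...then drop the later follow-ups, which were added for Study A only
  filter(study == "A" | !(timept %in% c("time3", "time4"))) %>%
  mutate(timept = factor(timept, levels = all_timepts)) %>%
  arrange(id, timept)

## Expect 5 rows per Study A patient and 3 per Study B patient
count(skeleton, study, id)
```

Starting from a complete skeleton and joining status information onto it makes missing rows impossible by construction, which is easier to reason about than filling gaps afterward.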
Patients are only `alive_instudy` if they have neither died nor withdrawn. A
patient who withdrew could, however, also be deceased, if the patient still
allowed us to access health records or public information after withdrawal, and
that patient was found to have died.
<insert specific example>
Patients who withdrew and revoked access to all data have `NA` for each
indicator. Patients who died or withdrew, but allowed continued record access,
have `FALSE` for `alive_instudy` and `TRUE` for `died` and/or `wd` as applicable.

# Follow-Up Status

At each follow-up assessment point, patients could be fully assessed; partially
assessed; alive, but not assessed; withdrawn; or deceased. Some tests on the
assessment battery (..., ..., ...) had to be done in person, whereas other tests
(...) could be done over the phone; therefore, more patients were able to
complete the phone-based tests than the in-person tests.
<info about how we tend to handle analysis for patients with incomplete data>
```{r followup}
################################################################################
## Follow-Up Indicators
################################################################################
## COOOODDDDEEEEE
fu_info <- tribble(
~ indicator, ~ description,
"`timept`", "Time point (`time1`, `time2`, `time3`, `time4`)",
"`any_outcomes`", "Patient has data for any this and/or that outcome assessment",
"`this_outcomes`", "Patient has data for >=1 *this* assessment (...)",
"`that_outcomes`", "Patient has data for >=1 *that* assessment (...)"
)
```
`fu_df` includes one row per enrolled patient per follow-up time point (time1 +
time2 for all patients; time3 + time4 for Study A only) and one column for each
of the following fields (indicators are `TRUE/FALSE`):
```{r print_fu}
kable(
fu_info,
format = "html",
col.names = c("", "")
) %>%
group_rows(index = c(" " = 1, "Assessment Indicators" = 3)) %>%
kable_styling(bootstrap_options = c("hover"))
```
Patients who withdrew and revoked access to all data have `NA` for each
indicator. Patients who died or withdrew, but allowed continued access, have
`FALSE` for each indicator.
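
A rough sketch of how these assessment indicators and their `NA`/`FALSE` conventions could be derived follows. The per-assessment flags `this_a`, `this_b`, and `that_a` are hypothetical stand-ins for the elided assessment names, and the toy patients are invented for illustration.

```r
library(dplyr)

## Hypothetical per-assessment completion flags at one follow-up time point;
## `this_a`, `this_b`, and `that_a` stand in for the elided assessment names
fu_raw <- tibble(
  id      = c("P1", "P2", "P3", "P4"),
  timept  = "time1",
  wd_data = c(FALSE, FALSE, FALSE, TRUE),
  this_a  = c(TRUE, FALSE, FALSE, NA),
  this_b  = c(TRUE, FALSE, FALSE, NA),
  that_a  = c(TRUE, TRUE, FALSE, NA)
)

fu_toy <- fu_raw %>%
  mutate(
    ## At least one assessment of each type was completed
    this_outcomes = this_a | this_b,
    that_outcomes = that_a,
    ## Any assessment of either type was completed
    any_outcomes  = this_outcomes | that_outcomes
  ) %>%
  select(id, timept, any_outcomes, this_outcomes, that_outcomes)
```

Here `P2` is partially assessed, `P3` has `FALSE` throughout (for example, a patient who died but allowed continued access), and `P4`, who revoked all data, carries `NA` through every indicator because the inputs are missing.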

### Code to Save All Reference Tables

```{r save_tables}
## Save each reference table as RDS (preserves R column types)...
walk2(
.x = list(inhosp_df, status_df, fu_df),
.y = c("inhosp_cohorts", "overall_status", "fu_cohorts"),
~ saveRDS(.x, file = paste0("cohorttables/", .y, ".rds"))
)
## ...and as CSV, using the same file names
walk2(
.x = list(inhosp_df, status_df, fu_df),
.y = c("inhosp_cohorts", "overall_status", "fu_cohorts"),
~ write.csv(.x, row.names = FALSE, file = paste0("cohorttables/", .y, ".csv"))
)
```
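
Since the saved tables are intended for merging with the analysis data, a downstream script might use them roughly as follows. This is a sketch only: the temporary directory stands in for `cohorttables/`, and the `id` key, the `age` column, and both toy tables are assumptions.

```r
library(dplyr)

## Stand-in for `cohorttables/`; a temporary directory keeps the sketch self-contained
cohort_dir <- file.path(tempdir(), "cohorttables")
dir.create(cohort_dir, showWarnings = FALSE)

## Hypothetical cohort table and analysis table, keyed on an assumed `id` column
inhosp_toy   <- tibble(id = c("P1", "P2", "P3"), hosp_survivor = c(TRUE, FALSE, TRUE))
analysis_toy <- tibble(id = c("P1", "P2", "P3"), age = c(54, 67, 71))
saveRDS(inhosp_toy, file = file.path(cohort_dir, "inhosp_cohorts.rds"))

## Read the saved indicators, merge them onto the analysis data, and keep the
## hospital survivor cohort
survivors <- analysis_toy %>%
  left_join(readRDS(file.path(cohort_dir, "inhosp_cohorts.rds")), by = "id") %>%
  filter(hosp_survivor)
```

Because every analyst filters on the same stored indicator rather than re-deriving it, the cohort definition stays identical across projects.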
@jenniferthompson

Screenshot of first data dictionary section:

image

@caitlinhudon

This is so beautiful I can hardly stand it. Thank you so much for putting this together and sharing.

@jenniferthompson

🙌

Shoutout to @haozhu233 for enabling these beautiful kableExtra tables!

@sushmitavgopalan16

I love this! Literally using it right away. Thank you :)
