Skip to content

Instantly share code, notes, and snippets.

@nstrayer
Created December 6, 2017 15:39
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save nstrayer/d3b04808a1fae907cb10ad37bba57d7d to your computer and use it in GitHub Desktop.
Save nstrayer/d3b04808a1fae907cb10ad37bba57d7d to your computer and use it in GitHub Desktop.
---
title: "Homework6"
author: "Ben Lane"
date: "11/22/2017"
output: pdf_document
---
### Grade 44/50
*Due Thursday, 30 November, 1:00 PM*
$5^{n=day}$ points taken off for each day late.
50 points total.
Submit a single knitr file (named `homework6.rmd`), along with a valid PDF output file. Inside the file, clearly indicate which parts of your responses go with which problems (you may use the original homework document as a template). Add your name as `author` to the file's metadata section. Raw R code/output or word processor files are not acceptable.
Failure to name file `homework6.rmd` or include author name may result in 5 points taken off.
### Question 1 ###
**15 points**
Consider the following very simple genetic model (*very* simple -- don't worry if you're not a geneticist!). A population consists of equal numbers of two sexes: male and female. At each generation men and women are paired at random, and each pair produces exactly two offspring, one male and one female. We are interested in the distribution of height from one generation to the next. Suppose that the height of both children is just the average of the height of their parents, how will the distribution of height change across generations?
Represent the heights of the current generation as a dataframe with two variables, m and f, for the two sexes. We can use `rnorm` to randomly generate the population at generation 1:
```{r}
pop <- data.frame(m = rnorm(100, 160, 20), f = rnorm(100, 160, 20))
```
The following function takes the data frame `pop` and randomly permutes the ordering of the men. Men and women are then paired according to rows, and heights for the next generation are calculated by taking the mean of each row. The function returns a data frame with the same structure, giving the heights of the next generation.
```{r}
next_gen <- function(pop) {
pop$m <- sample(pop$m)
pop$m <- rowMeans(pop)
pop$f <- pop$m
pop
}
```
Use the function `next_gen` to generate nine generations (you already have the first), then use the function `hist` to plot the distribution of male heights in each generation (this will require multiple calls to `hist`). The phenomenon you see is called regression to the mean. Provide (at least) minimal decorations such as title and x-axis labels.
```{r}
m.height.list <- list()
for (i in 1:9){
m.height.list[[i]] <- pop$m
pop <- next_gen(pop)
}
head(m.height.list)
par(mfrow = c(3,3))
for (i in 1:9){
hist(m.height.list[[i]], xlim = c(120,200), ylim = c(0, 30), breaks = 10,
xlab = "heights",
ylab = "counts",
main = "Male height regressing to mean")
}
# careful of all the extra printout this has before the plot. That being said, good job with making it efficient.
```
### Question 2 ###
**10 points**
Use the simulated results from question 1 to reproduce (as closely as possible) the following plot in ggplot2.
![generations plot](http://d.pr/i/Xh0d+)
```{r}
rm(list = ls())
# why do you need this?
set.seed(42)
pop <- data.frame(m = rnorm(100, 160, 20), f = rnorm(100, 160, 20))
next_gen <- function(pop) {
pop$m <- sample(pop$m)
pop$m <- rowMeans(pop)
pop$f <- pop$m
pop
}
height.list <- list()
for (i in 1:9){
height.list[[i]] <- pop
pop <- next_gen(pop)
}
height.data <- data.frame(matrix(unlist(height.list), nrow = 100, byrow = T))
names(height.data) <- c("m1", "f1", "m2", "f2", "m3", "f3", "m4", "f4", "m5", "f5", "m6", "f6", "m7", "f7", "m8", "f8", "m9", "f9")
library(tidyr)
library(ggplot2)
# rm(height.data.tidy)
height.data.tidy1 <- gather(height.data, key = "gen.m", value = "height.m", c(m1, m2, m3, m4, m5, m6, m7, m8, m9))
height.data.tidy2 <- gather(height.data, key = "gen.f", value = "height.f", c(f1, f2, f3, f4, f5, f6, f7, f8, f9))
height.data.tidy3 <- height.data.tidy1[,10:11]
height.data.tidy4 <- height.data.tidy2[,10:11]
height.data.tidy <- cbind(height.data.tidy3, height.data.tidy4)
names(height.data.tidy)
height.data.tidy$generation <- rep(seq(1:9), each = 100)
height.plot <- ggplot(data = height.data.tidy) + geom_point(mapping = aes(x = height.m, y = height.f), alpha = .5) + facet_wrap( ~ generation)
height.plot
# -4
```
```{r}
# Here's how I would do it. I think you overthought it a tiny bit. We just want to iterate a few times through the next_gen function and store the results each time then plot those.
pop <- data.frame(m = rnorm(100, 160, 20), f = rnorm(100, 160, 20))
# setup the holder dataframe and add a column to keep track of the generation
pop_hist <- pop %>% dplyr::mutate(gen = 1)
for (i in 2:9){
new_pop <- next_gen(pop)
# add new population to the population history dataframe.
pop_hist <- pop_hist %>%
dplyr::bind_rows(new_pop %>% dplyr::mutate(gen = i))
# set the newest pop and the current pop and restart
pop <- new_pop
}
ggplot(pop_hist, aes(x = m, y = f)) +
geom_point(alpha = 0.5) +
facet_wrap(~gen)
```
<br><br>
I'm not sure why my heights aren't regressing to the mean. Some more guidance on how to structure the data correctly would also have been welcome.
### Question 3 ###
**10 points**
You calculated the power of a study design in question #2 of assignment 3. The study has two variables, treatment group and outcome. There are two treatment groups (0, 1) and they should be assigned randomly with equal probability. The outcome should be a random normal variable with a mean of 60 and standard deviation of 20. If a patient is in the treatment group, add 5 to the outcome.
Starting with a sample size of 250, create a 95% bootstrap percentile interval for the mean of each group. Then create a new bootstrap interval by increasing the sample size by 250 until the sample is 2500. Thus you will create a total of 10 bootstrap intervals. Each bootstrap should create 1000 bootstrap samples. (4 points)
Produce a line chart that includes the bootstrapped mean and lower and upper percentile intervals for each group. Add appropriate labels and a legend. (6 points)
You may use base graphics or ggplot2. It should look similar to this (in base).
Here's an example of how you could create transparent shaded areas.
```{r}
size <- seq(250, 2500, by = 250)
z <- 0
bootmean1 <- rep(NA, 1000)
bootmean0 <- rep(NA, 1000)
boot.interval1 <- matrix(ncol = 2, nrow = 10)
boot.interval0 <- matrix(ncol = 2, nrow = 10)
for (j in size) {
z <- z + 1
n <- j # why not just do for (n in size)?
grp <- sample(0:1, n, replace = T)
out <- rnorm(n, 60, 20)
out[grp == 1] <- out[grp == 1] + 5
for (i in 1:1000) {
bootmean1[i] <- mean(sample(out[grp == 1], 1000, replace = T))
bootmean0[i] <- mean(sample(out[grp == 0], 1000, replace = T))
}
boot.interval1[z, ] <- quantile(bootmean1, c(.025, .975))
boot.interval0[z, ] <- quantile(bootmean0, c(.025, .975))
}
boot.interval1
boot.interval0
bmeans1 <- rowMeans(boot.interval1)
bmeans0 <- rowMeans(boot.interval0)
bmeans.data <- data.frame(cbind(boot.interval1, bmeans1, boot.interval0, bmeans0))
names(bmeans.data) <- c("lb1", "ub1", "bmeans1", "lb0", "ub0", "bmeans0")
bmeans.data$strapnumber <- seq(1:10)
library(ggplot2)
bmeans.plot <- ggplot(data = bmeans.data, mapping = aes(x = strapnumber)) +
geom_line(mapping = aes(y = bmeans1)) +
geom_line(mapping = aes(y = bmeans0)) +
ylim(50, 75) +
ggtitle("Bootstrap means and percentile interval") +
ylab("treatment value") +
xlab("bootstrap number") +
theme_bw()
bmeans.plot + geom_ribbon(aes(ymin = lb1, ymax = ub1), alpha = 0.3, fill = "red") +
geom_ribbon(aes(ymin = lb0, ymax = ub0), alpha = 0.3, fill = "blue") +
scale_x_continuous(breaks = unique(bmeans.data$strapnumber)[seq(2, 10, 2)]) +
theme(
plot.title = element_text(hjust = 0.5),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black")
) + #I'm not sure why my legend is not appearing.
# in order for a legend to appear you need to use the same geometry and have something like
# aes(color = different group).
scale_color_manual(name =c("bmeans1", "bmeans0"), values = c("red", "blue"))
#I didn't use any of the below code, but I'm keeping it here so I can have it for future reference.
# makeTransparent <- function(..., alpha=0.5) {
# if (alpha < 0 | alpha > 1) stop("alpha must be between 0 and 1")
# alpha <- floor(255 * alpha)
# newColor <- col2rgb(col = unlist(list(...)), alpha = FALSE)
# .makeTransparent <- function(col, alpha) {
# rgb(red = col[1], green = col[2], blue = col[3], alpha = alpha, maxColorValue = 255)
# }
# newColor <- apply(newColor, 2, .makeTransparent, alpha = alpha)
# return(newColor)
# }
#
# par(new = FALSE)
# plot(
# NULL,
# xlim = c(-1, 1),
# ylim = c(-1, 1),
# xlab = "",
# ylab = ""
# )
#
# polygon(
# x = c(seq(-0.75, 0.25, length.out = 100), seq(0.25, -0.75, length.out = 100)),
# y = c(rep(-0.25, 100), rep(0.75, 100)), border = NA, col = makeTransparent("blue", alpha = 0.5)
# )
# polygon(
# x = c(seq(-0.25, 0.75, length.out = 100), seq(0.75, -0.25, length.out = 100)),
# y = c(rep(-0.75, 100), rep(0.25, 100)), border = NA, col = makeTransparent("red", alpha = 0.5)
# )
```
_Your lines should be getting thinner as the plot goes along. I think what you did was accidentally use the same n for every bootstrap sample. Below is a super-tidyverse method to get the correct looking plot. -2_
```{r}
library(tidyverse)
get_boot_int <- function(n){
data <- data_frame(
outcome = rnorm(n, 60, 20),
treat = sample(c(0,1), n, replace = T)
) %>%
mutate(outcome = ifelse(treat == 1, outcome + 5, outcome))
data %>%
group_by(treat) %>%
summarise(
bs_mean = replicate(1000, mean(sample(outcome, replace = T))) %>%
quantile(c(0.025, 0.975)) %>%
paste(collapse = '-')
) %>%
separate(bs_mean, c('lower', 'upper'), sep = '-') %>%
mutate(lower = as.numeric(lower),
upper = as.numeric(upper),
treat = as.character(treat),
size = n)
}
all_sizes <- seq(250, 2500, 250) %>%
purrr::map_df(get_boot_int)
all_sizes %>%
ggplot(aes(x = size, ymin = lower, ymax = upper)) +
geom_ribbon(aes(fill = treat), alpha = 0.5)
```
### Question 4 ###
**15 points**
Programming with classes. The following function will generate random patient information.
```{r}
makePatient <- function() {
vowel <- grep("[aeiou]", letters)
cons <- grep("[^aeiou]", letters)
name <- paste(sample(LETTERS[cons], 1), sample(letters[vowel], 1), sample(letters[cons], 1), sep='')
gender <- factor(sample(0:1, 1), levels=0:1, labels=c('female','male'))
dob <- as.Date(sample(7500, 1), origin="1970-01-01")
n <- sample(6, 1)
doa <- as.Date(sample(1500, n), origin="2010-01-01")
pulse <- round(rnorm(n, 80, 10))
temp <- round(rnorm(n, 98.4, 0.3), 2)
fluid <- round(runif(n), 2)
list(name, gender, dob, doa, pulse, temp, fluid)
}
```
1. Create an S3 class `medicalRecord` for objects that are a list with the named elements `name`, `gender`, `date_of_birth`, `date_of_admission`, `pulse`, `temperature`, `fluid_intake`. Note that an individual patient may have multiple measurements for some measurements. Set the RNG seed to `8` and create a medical record by taking the output of `makePatient`. Print the medical record, and print the class of the medical record. (5 points)
```{r}
medicalRecord <- function(patient) {
names(patient) <- c('name','gender','date_of_birth','date_of_admission',
'pulse','temperature','fluid_intake')
class(patient) <- 'medicalRecord'
patient
}
set.seed(8)
(patient <- medicalRecord(makePatient()))
```
<br><br>
I don't think I did this correctly. I don't fully understand what the question is asking - is the patient supposed to be class "medicalRecord?" The only way I know how to do that would be to assign it manually, which is what we did in the lecture.
2. Write a `medicalRecord` method for the generic function `mean`, which returns averages for pulse, temperature and fluids. Also write a `medicalRecord` method for `print`, which employs some nice formatting, perhaps arranging measurements by date, and `plot`, that generates a composite plot of measurements over time. Call each function for the medical record created in part 1. (5 points)
```{r}
mean.medicalRecord <- function(patient){
cat(sprintf("pulse: %s \n temperature: %s \n fluids: %s",
mean(patient$pulse), mean(patient$temperature), mean(patient$fluid_intake), "\n"))
}
mean(patient)
library(dplyr)
print.medicalRecord <- function(patient) {
data.measure <- data.frame(matrix(unlist(patient[5:7]), ncol = 3, nrow = length(patient[[4]])))
names(data.measure) <- c("pulse", "temperature", "fluid_intake")
dates <- patient[4]
data.patient <- cbind(dates, data.measure)
data.patient.sorted <- arrange(data.patient, date_of_admission)
cat(sprintf("Name: %s \n Gender: %s \n Date of Birth: %s \n \n", patient$name, patient$gender, patient$date_of_birth))
print(data.patient.sorted)
}
patient
plot.medicalRecord <- function(patient){
library(ggplot2)
data.measure <- data.frame(matrix(unlist(patient[5:7]), ncol = 3, nrow = length(patient[[4]])))
names(data.measure) <- c("pulse", "temperature", "fluid_intake")
dates <- patient[4]
data.patient <- cbind(dates, data.measure)
library(dplyr)
data.patient.sorted <- arrange(data.patient, date_of_admission)
detach("package:dplyr", unload=TRUE)
library(gridExtra)
p <- ggplot(data = data.patient.sorted, mapping = aes(x = date_of_admission)) +
ggtitle("Patient Data") + xlab("Date Admitted") + theme(axis.text.x=element_text(angle=90, hjust=1))
p1 <- p + geom_line(mapping = aes(y = pulse))
p2 <- p + geom_line(mapping = aes(y = temperature))
p3 <- p + geom_line(mapping = aes(y = fluid_intake))
grid.arrange(p1, p2, p3, ncol = 3)
}
plot(patient)
```
3. Create a further class for a cohort (group) of patients, and write methods for `mean` and `print` which, when applied to a cohort, apply mean or print to each patient contained in the cohort. Hint: think of this as a "container" for patients. Reset the RNG seed to 8 and create a cohort of ten patients, then show the output for `mean` and `print`. (5 points)
```{r}
library(dplyr)
data.cohort <- list()
for (i in 1:10){
data.cohort[[i]] <- medicalRecord(makePatient())
}
cohort <- function(x){
class(x) <- "cohort"
x
}
mean(patient)
mean(data.cohort[[1]])
length(data.cohort)
cohort(data.cohort)
mean.cohort <- function(x){
for (i in 1:length(x)){
mean(x[[i]])
}
}
mean.cohort(data.cohort)
print.cohort <- function(x){
for (i in 1:length(x)){
print(x[[i]])
}
}
print.cohort(data.cohort)
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment