Last active
April 26, 2020 03:59
-
-
Save nmatzke/13eb664b915e6b88e50badf1495cf199 to your computer and use it in GitHub Desktop.
week_06_exercise_v4.R
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
####################################################### | |
# BIOSCI220 WEEK 6 LAB EXERCISE | |
####################################################### | |
################################################ | |
# Note for people using online R | |
################################################ | |
# | |
# In previous labs, people who couldn't get R | |
# installed on a home machine could use the | |
# "R Snippets" website to run the lab code. | |
# | |
# However, I have discovered that R Snippets | |
# puts a 10-second limit on R runs, and our | |
# Maximum Likelihood searches in this lab | |
# will often take a little longer than this. | |
# | |
# Therefore, we have to use something else. | |
# The best option I have found is RStudio Cloud. | |
# | |
# I have run all of the lab code in RStudio Cloud, | |
# and it works just like regular RStudio. | |
# | |
# Also, it remembers your previous actions, so | |
# you no longer have to do the "copy-paste the | |
# whole script down to your question" business | |
# that R Snippets required. (However, this can | |
# still be a useful tactic for "resetting" | |
# everything, if you are getting errors due to | |
# having accidentally skipped some steps.) | |
# | |
# However, you do need to register for a free | |
# account. I have put instructions below. | |
# | |
################################################ | |
# Instructions for using RStudio Cloud: | |
################################################ | |
# | |
# 1. Go to https://rstudio.cloud/ | |
# | |
# 2. Register for a free account, e.g. using a gmail address. | |
# | |
# 3. Wait awhile for your Workspace to load. Be patient. | |
# | |
# 4. Click the blue "New Project" button in the upper right. | |
# | |
# 5. Wait awhile for your new project to load. Be patient. | |
# | |
# 6. Click File -> New File -> R script | |
# | |
# 7. Paste your R script into this window. You should be able to | |
# run the script's code by either (a) copy-pasting into | |
# the Console, or (b) clicking the "Run" button. | |
# | |
# 8. Go to the Console. Type | |
# | |
# install.packages("deSolve") | |
# | |
# ...to get deSolve installed. | |
# | |
# 9. Type library(deSolve). If no errors are returned, then | |
# deSolve is working, and you can start the lab. | |
################################################ | |
####################################################### | |
# Introduction | |
####################################################### | |
# In week 5, we got a lot of experience working with SIR models, | |
# and simulating possible futures, given different values of R0 | |
# for different levels of lockdown, different timings of interventions, | |
# different numbers of starting infections, etc. | |
# | |
# However, many students came across one big difficulty -- how do we | |
# know what parameters to use? In some cases, we can justify parameters | |
# with background knowledge. For example, it appears, from medical | |
# observations, that COVID-19 patients are infectious for about 14 days. | |
# We don't necessarily need fancy inference methods for that. | |
# | |
# However, some parameters, like R0, are fairly uncertain -- in the | |
# media, you will see scientists quoting numbers between 2 and 6. | |
# Furthermore, the R0 will (hopefully!) change depending on how well | |
# a country's leadership and population enact the various control | |
# measures. We would like to *measure* this, if possible. | |
# | |
# Regarding the number of starting infections, we rarely know | |
# exactly how many infections were imported into a country at | |
# the start of an epidemic. We can guesstimate based on the number | |
# of early cases, but we might want to use a more formal method. | |
# | |
# Furthermore, we would like to have some measure of uncertainty | |
# about these parameters. Are we highly confident that R0 is 3.0, | |
# plus or minus 0.1? Or are we highly uncertain, and we can only | |
# put R0 between 2 and 6? | |
# | |
# Answers to these questions can be found, pretty easily, with | |
# MODEL-BASED INFERENCE. | |
# | |
# In this lab, we will perform some model-based inference on SIR | |
# models. This will give you a decent taste of what real-life | |
# model-fitting is like -- although I hasten to add that we are | |
# using quite simple models, and professional epidemiologists | |
# would use somewhat more complex models to help fix parts of | |
# our models that are unrealistic (for example, our models allow | |
# there to be 0.0639 of an infection, whereas in real life there | |
# is either 0 or 1 infections). | |
# | |
# In a quick appendix, I connect up the model-based, Maximum | |
# Likelihood inference we have studied, to Ordinary Least Squares | |
# (OLS) or linear regression, which you probably remember | |
# from introductory statistics. It turns | |
# out that a great deal of statistics is tied to likelihood, | |
# Maximum Likelihood, and model-based inference in one | |
# form or another, although often the connections are not | |
# always apparent in introductory courses. | |
####################################################### | |
####################################################### | |
# Likelihood of a case count | |
####################################################### | |
# In week 4, we introduced the concept of likelihood, where | |
# "likelihood" means the probability of the data, given a model. | |
# We can write this definition like this: | |
# | |
# L = P(data | model) | |
# | |
# The "|" means "given". | |
# | |
# Sometimes, we write the likelihood as a function of the model, | |
# and of the parameters. The list of parameters is often abbreviated | |
# as the Greek symbol "theta". We cannot use greek symbols in | |
# R scripts, which have to be plain-text, so we will just write | |
# "theta". | |
# | |
# L = P(data | model, theta) | |
# | |
# ...this just means that the likelihood (L) is the probability (P) | |
# of the data, given a model and particular parameter values stored | |
# in theta. | |
# | |
# | |
# As we saw in week 4, we can try to learn what parameter values | |
# maximize the probability of the data. This is the method of | |
# "Maximum Likelihood" (ML), and is one common method to attempt | |
# to estimate likely values of parameters. It is commonly used | |
# in hundreds of different scientific fields. | |
# | |
# Let's set up a ML analysis, in order to fit SIR model parameters | |
# to real data. | |
####################################################### | |
# NOTE: AS IN PREVIOUS WEEKS, IT IS CRUCIAL TO COPY- | |
# PASTE *ALL* OF THE CODE, WITHOUT SKIPPING STEPS. | |
# | |
# IT IS A BIT TEDIOUS, COMPARED TO E.G. LOADING FUNCTIONS | |
# FROM A WEBSITE, BUT AT LEAST IT KEEPS US ALL | |
# ON THE SAME PAGE, AND MAKES THE LAB PORTABLE | |
# | |
# In R on a computer, you can run the code in chunks. | |
# | |
# This should also be true on RStudio Cloud. | |
# | |
####################################################### | |
# We are using the deSolve library again; see Week 5 for install instructions | |
library(deSolve) | |
# Let's start with the New Zealand COVID-19 "active cases" | |
# data. I acquired this data from: | |
# | |
# Source: Hancock, Faith (2020). Covid-19 in NZ. NZ Herald, April 13, 2020. | |
# https://www.newsroom.co.nz/2020/03/27/1101216/covid-19-in-nz-the-numbers | |
# | |
# ...which gets its info from the Ministry of Health. These are | |
# basically the numbers that are announced by the government | |
# in a press conference at 1 pm each day. | |
# | |
# I have the data saved as a text file, and normally I would | |
# load it with: | |
################################################ | |
## EXAMPLE CODE FOR LOADING A TEXT FILE IN R | |
################################################ | |
## Set the working directory (wd) to wherever you have the file saved: | |
# getwd() # getwd = "get working directory", to check what | |
# working directory you are in | |
## You would type the full path for the directory containing your file here: | |
# wd = "/drives/GDrive/__classes/BIOSCI220/__week06/" | |
# | |
## Then, set R to make this your working directory | |
# setwd(wd) # setwd = "set working directory: | |
# getwd() # getwd = "get working directory", to check what | |
# working directory you are in | |
## Load the file into table "dat" | |
## fn = filename | |
# fn = "2020-04-13_NZ_daily_cases.txt" | |
# dat = read.table(file=fn, header=TRUE, stringsAsFactors=FALSE) | |
# dat | |
# dim(dat) | |
################################################ | |
# WHY WE AREN'T TRYING TO LOAD FILES IN THIS WORKSHEET | |
# | |
# Getting everyone to save the file somewhere they can | |
# find it on their computer systems, then getting everyone to | |
# successfully reference that exact file path in their R | |
# script, is always a huge pain. This is why people say, | |
# "the hardest part about teaching R is getting students to | |
# successfully load their data." | |
# | |
# It would be even harder to do this while teaching remotely, | |
# so we'll "cheat", and just have the data pasted straight | |
# into the R script. | |
# | |
# In real life R analyses, you will have to put some effort | |
# into saving your data files into a directory, setting | |
# R's working directory with setwd(), and loading | |
# the data file with a function like read.table(). | |
# | |
################################################ | |
datafile_as_a_string = ' | |
# Source: Hancock, Faith (2020). Covid-19 in NZ. NZ Herald, April 13, 2020. | |
# https://www.newsroom.co.nz/2020/03/27/1101216/covid-19-in-nz-the-numbers | |
Date Total_cases Active_cases | |
2020-02-28 1 1 | |
2020-02-29 1 1 | |
2020-03-01 1 1 | |
2020-03-02 1 1 | |
2020-03-03 1 1 | |
2020-03-04 2 2 | |
2020-03-05 3 3 | |
2020-03-06 3 3 | |
2020-03-07 5 5 | |
2020-03-08 5 5 | |
2020-03-09 5 5 | |
2020-03-10 5 5 | |
2020-03-11 5 5 | |
2020-03-12 5 5 | |
2020-03-13 5 5 | |
2020-03-14 6 6 | |
2020-03-15 8 8 | |
2020-03-16 8 8 | |
2020-03-17 12 12 | |
2020-03-18 20 20 | |
2020-03-19 28 28 | |
2020-03-20 39 39 | |
2020-03-21 52 52 | |
2020-03-22 66 66 | |
2020-03-23 102 102 | |
2020-03-24 155 143 | |
2020-03-25 205 183 | |
2020-03-26 283 256 | |
2020-03-27 368 331 | |
2020-03-28 451 401 | |
2020-03-29 514 457 | |
2020-03-30 589 525 | |
2020-03-31 647 572 | |
2020-04-01 708 625 | |
2020-04-02 797 704 | |
2020-04-03 868 764 | |
2020-04-04 950 822 | |
2020-04-05 1039 882 | |
2020-04-06 1106 929 | |
2020-04-07 1160 918 | |
2020-04-08 1210 927 | |
2020-04-09 1239 921 | |
2020-04-10 1283 908 | |
2020-04-11 1312 886 | |
2020-04-12 1330 855 | |
2020-04-13 1349 798 | |
2020-04-14 1366 729 | |
2020-04-15 1386 649 | |
2020-04-16 1401 622 | |
2020-04-17 1409 582 | |
' | |
# Here, we will load the data by reading in the text string we | |
# created above | |
dat = read.table(text=datafile_as_a_string, header=TRUE, stringsAsFactors=FALSE) | |
# Look at the full data: | |
# Look at the head/tail of the data | |
head(dat) | |
tail(dat) | |
# Looking at dat, please answer the questions below: | |
# | |
# QUESTION #1: How many rows are in the dataset loaded into 'dat'? | |
# | |
# QUESTION #2: How many columns are in the dataset loaded into 'dat'? | |
# | |
# QUESTION #3: What is the first day data are reported? Put your answer in | |
# YYYY-MM-DD format. | |
# | |
# QUESTION #4: What is the last day data are reported? Put your answer in | |
# YYYY-MM-DD format. | |
# | |
# Hint: You can also at the dimensions of dat, with dim(). | |
# R *ALWAYS* refers to rows, then columns | |
dim(dat) | |
# Or, you can use "nrow" and "ncol" | |
nrow(dat) | |
ncol(dat) | |
####################################################### | |
# Let's plot this data | |
####################################################### | |
# Let's make x-values just be 1-46 | |
xvals = 1:nrow(dat) | |
plot(x=xvals, y=dat$Total_cases, col="grey70", pch=19, xlab="Day", ylab="Number of cases") | |
points(x=xvals, y=dat$Active_cases, col="red", pch=19) | |
legend(x="topleft", legend=c("Total cases", "Active cases"), col=c("grey50", "red"), pch=c(19,19)) | |
# QUESTION #5: Why are "Total cases" and "Active cases" the | |
# same at the start, but then diverge later in the plot? | |
# Distinguishing "classes" (different data types) in R | |
# | |
# It might be nice to make this plot have dates on the | |
# x-axis. | |
# | |
# What happens if we type this? | |
# | |
# plot(x=dat$Date, y=dat$Total_cases, col="grey70", pch=19, xlab="Day", ylab="Number of cases") | |
# | |
# (I am commenting this out, as it will "crash" peoples' | |
# R-Snippets runs. Uncomment it to try it.) | |
# | |
# The error message returned is: | |
# | |
# Error in plot.window(...) : need finite 'xlim' values | |
# | |
# This error occurs because the x-values are not numbers | |
# or dates, they are character strings representing | |
# dates, and character strings don't provide a way for | |
# R to determine the finite numbers determining the | |
# limits of the x-axis. | |
# | |
# Although the dates in dat$Date look like dates to us humans, | |
# R just sees them as character strings. In R, character strings | |
# have quote marks around them. You can see this by looking at a | |
# single value from the column: | |
dat$Date[1] # the first value in column "Date" | |
dat$Date[2] # the second value in column "Date" | |
dat$Date[3] # the third value in column "Date" | |
# We can do the plot with calendar dates. To do this, | |
# we have to convert the "Dates" column | |
# - from "character" format (meaning a string of text) | |
# - to "Date" format (meaning a number R will recognize as a calendar date) | |
# Convert the dates to R's "Date" format, then order d by date | |
# | |
# Functions like "as.character", and "as.Date", can convert between | |
# these formats (also known as data types or variable classes). | |
# | |
# You can check the format/class of a variable or column with the class() function. | |
# The initial class for column "Date" | |
class(dat$Date) | |
# QUESTION #6: What was the initial class of dat$Date? (put the exact text returned, | |
# i.e. the text inside the quotes, but not the quotes themselves) | |
dat$Date = as.Date(dat$Date, format="%Y-%m-%d") | |
# The new class for column "Date" | |
class(dat$Date) | |
# QUESTION #7: What was the new class of dat$Date? (put the exact text returned, | |
# i.e. the text inside the quotes, but not the quotes themselves) | |
# Now we can plot the data, with dates on the x-axis: | |
plot(x=dat$Date, y=dat$Total_cases, col="grey70", pch=19, xlab="Day", ylab="Number of cases") | |
points(x=dat$Date, y=dat$Active_cases, col="red", pch=19) | |
legend(x="topleft", legend=c("Total cases", "Active cases"), col=c("grey70", "red"), pch=c(19,19)) | |
# Compare this plot to the "Total and active cases plot" available here: | |
# https://www.newsroom.co.nz/2020/03/27/1101216/covid-19-in-nz-the-numbers | |
# | |
# QUESTION #8: What is different about the NZ Herald plot, compared to | |
# your R plot? | |
# | |
# QUESTION #9: What would you have to do to make your R plot match the NZ Herald plot? | |
# | |
####################################################### | |
# Maximum likelihood inference of R0, | |
# pre- and post-lockdown | |
####################################################### | |
# You will recall the several steps required to do | |
# maximum likelihood inference: | |
# | |
# PART 1. Be able to calculate the likelihood of a single | |
# datum, given a model and parameter values. | |
# | |
# PART 2. Calculate the likelihood for all data points under | |
# a model+parameters. | |
# | |
# NOTE: For raw likelihoods, this means MULTIPLYING the likelihoods | |
# of the individual datapoints together. (In other words, | |
# getting the product of all of the individual likelihoods.) | |
# Remember the coin-flipping example: if the likelihood of | |
# a single heads is 0.7, then the likelihood of two Heads is | |
# 0.7 * 0.7, or 0.7^2. | |
# | |
# For log-likelihoods, we ADD UP the log-likelihoods of | |
# all of the individual data. This is true because adding | |
# logged likelihoods is equivalent to multiplying raw likelihoods. | |
# | |
# Proof of this: | |
# | |
# If Prob_heads is 0.7, then the raw likelihood of 2 Heads is: | |
0.7 * 0.7 | |
# If Prob_heads is 0.7, then the log-likelihood of 2 Heads is: | |
log(0.7) + log(0.7) | |
# ...which is the same as... | |
log(0.7 * 0.7) | |
log(0.49) | |
# ...so, the log operation turns multiplication into addition. | |
# | |
# (The log operation also turns power operations into multiplication, | |
# so these are equivalent: | |
log(0.7^2) | |
2*log(0.7) | |
# ...remember, whenever you find yourself confused about some bit of | |
# math like this, just experiment with it in R to figure it out.) | |
# | |
# PART 3. Optimize the parameters in some way, in order | |
# to find the parameter values that maximize | |
# the likelihood. | |
# | |
# The "maximum likelihood parameter values" are the | |
# parameter values that maximize the probability of | |
# the observed data. They are the "best fit" parameters. | |
# The method of Maximum Likelihood is one very common, | |
# very simple method of trying to learn parameter | |
# values from the data. Learning parameters from the | |
# data is also called "estimation" or "inference". | |
# A vast amount of statistics is, deep down, based on | |
# Maximum Likelihood methods, even if they are | |
# (unfortunately) not usually emphasized in | |
# introductory classes. | |
# | |
####################################################### | |
# Here, we will use ML to infer R0 ("R-naught") for | |
# the NZ Covid epidemic. | |
# | |
# PLEASE NOTE that this is a practice exercise, using | |
# simplified models. "Real-life" epidemiology models | |
# would be somewhat more complex, and allow, for | |
# example, multiple different introductions of Covid | |
# cases on multiple different days. | |
# | |
# I think the models we are using are very instructive -- | |
# I'm just saying that you shouldn't go around claiming | |
# you are a professional epidemiologist based on this | |
# exercise. | |
####################################################### | |
####################################################### | |
# Part 1. Be able to calculate the likelihood of a single | |
# datum, given a model and parameter values. | |
####################################################### | |
# Let's start with calculating the likelihood of a | |
# single data point. | |
# | |
# We will do this with the "likelihood_single_point()" | |
# function. Following Funk (2019), we will calculate | |
# the probability of a particular observed | |
# count with a Poisson distribution. The Poisson | |
# distribution is often used for count data, | |
# for several reasons: | |
# | |
# * It has only one parameter: lambda, which is equal | |
# to both the expected mean and the variance | |
# * It has a minimum of zero (so it is impossible | |
# to produce negative counts). | |
# | |
# So, all that likelihood_single_point() does is | |
# take a model projection for a particular day | |
# ("model.point"), set that as the lambda (expected | |
# mean) of the the dpois() function, and then get | |
# the probability of the observed case count | |
# ("data.point") for the day. | |
# | |
# The "log=TRUE" statement means that the | |
# log-likelihood is returned, instead of the | |
# regular or "raw" likelihood. | |
# | |
# Let's load this function, then use it to | |
# calculate a few likelihoods for single | |
# points. | |
# Function definition for: likelihood_single_point | |
# This function returns the likelihood or log-likelihood of a | |
# single day's case count (data.point), given a prediction | |
# for that day from an SIR model (model.point) | |
likelihood_single_point <- function(data.point, model.point, log=TRUE) | |
{ | |
# Get the probability density of the count, assuming | |
# the observed count is a poisson process with an | |
# average value matching the true process. | |
# | |
# This allows for variation, e.g. if a case is reported | |
# days after it occurs. | |
likelihood = dpois(x=data.point, lambda=model.point, log=log) | |
# Return the likelihood value | |
return(likelihood) | |
} # END of likelihood_single_point | |
# Let's say that a particular set of model parameters | |
# predict that, on average, 10 active cases will | |
# be observed. What is the likelihood for an observation | |
# of exactly 10 cases? | |
# What is the likelihood for an observation | |
# of exactly 10 cases? | |
likelihood_single_point(data.point=10, model.point=10, log=FALSE) | |
# What is the log-likelihood for an observation | |
# of exactly 10 cases? | |
likelihood_single_point(data.point=10, model.point=10, log=TRUE) | |
# Now that you have seen how this code works, please answer the questions | |
# below. | |
# QUESTION #10: What is the log-likelihood for an observation of | |
# 8 active cases, if the model predicts 10 cases? | |
# QUESTION #11: What is the log-likelihood for an observation of | |
# 6 active cases, if the model predicts 10 cases? | |
# QUESTION #12: What is the log-likelihood for an observation of | |
# 4 active cases, if the model predicts 10 cases? | |
# QUESTION #13: Does the log-likelihood get higher or lower as | |
# the number of observed cases gets further from the model | |
# projection? | |
# NOTE: FOR QUESTIONS #14-15, MAKE SURE YOU READ AND UNDERSTAND | |
# THE NOTES ABOVE ON MULTIPLYING RAW LIKELIHOOD, VERSUS SUMMING | |
# LOG-LIKELIHOOD: | |
# | |
# "For raw likelihoods, this means MULTIPLYING the likelihoods | |
# of the individual datapoints together..." | |
# "For log-likelihoods, we ADD UP the log-likelihoods of | |
# all of the individual data..." | |
# QUESTION #14: What is the total log-likelihood of the 3 | |
# observed counts (8, 6, and 4), assuming the model prediction | |
# was 10 for each? | |
# QUESTION #15: What is the total log-likelihood of the 3 | |
# observed counts (8, 6, and 4), assuming the model prediction | |
# was 6 for each? | |
# QUESTION #16: Which model prediction yields a higher | |
# total log-likelihood for the 3 observed counts, the | |
# prediction of 6, or 10? | |
############################################################### | |
# Part 2. Calculate the likelihood for all data points under | |
# a model+parameters, and multiply all of these | |
# likelihoods together to get their product. | |
# (Or, more commonly, add together the logs of the | |
# likelihoods, to get their sum.) | |
############################################################### | |
# Rather than summing log-likelihoods by hand, we can just | |
# have another function. This way, we can take a series | |
# of observed counts, and a model prediction, and get the | |
# total log-likelihood easily. | |
# | |
# I have done this in likelihood_time_series(), which | |
# just runs the likelihood_single_point() function across | |
# each pair of data.points and model.points. | |
likelihood_time_series <- function(data.points, model.points, log=TRUE) | |
{ | |
# Error check - return a stop() message if check fails | |
if (length(data.points) != length(model.points)) | |
{ stop("ERROR in likelihood_time_series(): the lengths of data.points and model.points must match.") } | |
# Calculate the likelihoods of the data.points | |
likelihoods = mapply(FUN=likelihood_single_point, data.point=data.points, model.point=model.points, MoreArgs=list(log=log)) | |
# If you calculated log-likelihoods, then sum them | |
# If you calculated raw likelihoods, then multiply them | |
if (log == TRUE) | |
{ | |
total_likelihood = sum(likelihoods) | |
} else { | |
total_likelihood = prod(likelihoods) # prod=take the product | |
} # END if/else statement: if (log == TRUE) | |
# Return that value | |
return (total_likelihood) | |
} # END of likelihood_time_series | |
# We can use this function as another way to answer our question, above. | |
likelihood_time_series(data.points=c(8,6,4), model.points=c(10,10,10), log=TRUE) | |
# Now, in order to fit a series of observations to a | |
# model projection, we need the functions we used last | |
# week to make model projections. As before, load | |
# these functions by pasting them in ALL AT ONCE, | |
# i.e. from the function name, all the way through to the | |
# close of the function at the "}" character. | |
####################################################### | |
# Functions from Week 5 | |
####################################################### | |
library(deSolve) | |
SIR_ode <- function(time, state, parameters) | |
{ | |
# The above inputs are: | |
# time = a time point | |
# state = an R list of the 3 state values, i.e. the | |
# population sizes of S, I, and R | |
# parameters = an R list of the parameters R0 and D_inf | |
# Convert the input parameters "R0" and "D_inf" | |
# into the rates "beta" and "nu" | |
beta <- parameters[["R0"]] / parameters[["D_inf"]] | |
nu <- 1 / parameters[["D_inf"]] | |
# Extract the current values of the states S, I, and R | |
S <- state[["S"]] | |
I <- state[["I"]] | |
R <- state[["R"]] | |
# Calculate N, the total of S+I+R, | |
# i.e. the population size | |
N <- S + I + R | |
# Write out the system of | |
# Ordinary Differential Equations (ODEs) | |
dS <- -beta * I/N * S | |
dI <- beta * I/N * S - nu * I | |
dR <- nu * I | |
# Return the current rates of change of S, I, and R: | |
return(list(c(dS, dI, dR))) | |
} # End of function SIR_ode | |
simulate_SIR <- function(theta, init.state, times) | |
{ | |
trajectory <- data.frame(ode(y = init.state, | |
times = times, | |
func = SIR_ode, | |
parms = theta, | |
method = "ode45")) | |
return(trajectory) | |
} # End of simulate_SIR() | |
simulate_SIR_changes <- function(thetas_states_df, times) | |
{ | |
# Loop through the rows | |
trajectory = NULL | |
for (rn in 1:nrow(thetas_states_df)) | |
{ | |
if (rn == 1) | |
{ | |
init.state = c(S=thetas_states_df$S[rn], I=thetas_states_df$I[rn], R=thetas_states_df$R[rn]) | |
theta = list(R0=thetas_states_df$R0[rn], D_inf=thetas_states_df$D_inf[rn]) | |
} else { | |
last_row = trajectory[nrow(trajectory),] | |
init.state = c(S=last_row$S+thetas_states_df$S[rn], I=last_row$I+thetas_states_df$I[rn], R=last_row$R+thetas_states_df$R[rn]) | |
theta = list(R0=thetas_states_df$R0[rn], D_inf=thetas_states_df$D_inf[rn]) | |
} | |
# Read in the parameters for the initial, versus later, time-bins | |
if (rn >= nrow(thetas_states_df)) | |
{ | |
start_time = thetas_states_df$time[rn] | |
end_time = max(times) | |
} else { | |
start_time = thetas_states_df$time[rn] | |
end_time = thetas_states_df$time[rn+1] | |
} | |
TF1 = times >= start_time | |
TF2 = times <= end_time | |
TF = (TF1 + TF2) == 2 | |
tmp_times = times[TF] | |
tmp_trajectory = simulate_SIR(theta=theta, init.state=init.state, times=tmp_times) | |
if (rn == 1) | |
{ | |
trajectory = rbind(trajectory, tmp_trajectory) | |
} else { | |
trajectory_minus_last_state = trajectory[-nrow(trajectory),] | |
trajectory = rbind(trajectory_minus_last_state, tmp_trajectory) | |
} # if (rn == 1) | |
} # END for (rn in 1:nrow(thetas_states_df)) | |
return(trajectory) | |
} # End simulate_SIR_changes() | |
####################################################### | |
# Code to plot a model projection, observed counts, | |
# and the total log-likelihood | |
####################################################### | |
# Let's get our active case counts into a variable, "data.points" | |
data.points = dat$Active_cases | |
# ...and, let's make a model projection like we did last week: | |
times = 1:length(data.points) | |
time = c(1, 38) # NZ Level 4 lockdown went in on March 26, day 24 since 1st case; | |
# Assume 24+14=38 days to see effect. | |
R0 = c(3.0, 0.3) | |
D_inf = c(14.0, 14.0) | |
S = c(5000000, 0) | |
I = c(1, 0) | |
R = c(0, 0) | |
thetas_states_table = cbind(time, R0, D_inf, S, I, R) | |
thetas_states_df = as.data.frame(thetas_states_table, stringsAsFactors=FALSE) | |
trajectory = simulate_SIR_changes(thetas_states_df=thetas_states_df, times=1:length(data.points)) | |
# Plot the results | |
maxy = max(max(trajectory$I), max(data.points)) | |
xvals = c(trajectory$time, times) | |
yvals = c(trajectory$I, data.points) | |
plot(x=xvals, y=yvals, pch=".", col="white", xlim=c(0, max(trajectory$time)), ylim=c(0, maxy), xlab="Day", ylab="Number of individuals") | |
lines(x=trajectory$time, y=trajectory$I, lwd=3, col="firebrick2") | |
points(times, data.points) | |
legend(x="topleft", legend=c("Active COVID-19 case count", 'Model projection of "I" (Infected)'), lty=c("blank", "solid"), lwd=c(1,3), pch=c(1, "."), col=c("black","firebrick2"), cex=0.8) | |
# Print the total log-likelihood in the plot title | |
# NOTE how we take the model projection of # infectious, trajectory$I, and | |
# store that in model.points: | |
model.points = trajectory$I | |
total_log_likelihood = likelihood_time_series(data.points=data.points, model.points=model.points, log=TRUE) | |
titletxt = paste0("Total log-likelihood = ", round(total_log_likelihood, 2)) | |
title(titletxt) | |
####################################################### | |
# Now, that model projection (red curve) doesn't look very close to your | |
# observations (the circles), does it? Can we do better? | |
# | |
# Let's spend a few minutes changing the two "R0" values, and/or the | |
# first "I" value, and re-running the code above. | |
# | |
# For example, I've made another attempt: | |
# | |
####################################################### | |
# Code to plot a model projection, observed counts, | |
# and the total log-likelihood | |
####################################################### | |
# Let's get our active case counts into a variable, "data.points" | |
data.points = dat$Active_cases | |
# ...and, let's make a model projection like we did last week: | |
times = 1:length(data.points) | |
time = c(1, 38) # NZ Level 4 lockdown went in on March 26, day 24 since 1st case; | |
# Assume 24+14=38 days to see effect. | |
R0 = c(3.5, 0.4) | |
D_inf = c(14.0, 14.0) | |
S = c(5000000, 0) | |
I = c(2, 0) | |
R = c(0, 0) | |
thetas_states_table = cbind(time, R0, D_inf, S, I, R) | |
thetas_states_df = as.data.frame(thetas_states_table, stringsAsFactors=FALSE) | |
trajectory = simulate_SIR_changes(thetas_states_df=thetas_states_df, times=1:length(data.points)) | |
# Plot the results | |
maxy = max(max(trajectory$I), max(data.points)) | |
xvals = c(trajectory$time, times) | |
yvals = c(trajectory$I, data.points) | |
plot(x=xvals, y=yvals, pch=".", col="white", xlim=c(0, max(trajectory$time)), ylim=c(0, maxy), xlab="Day", ylab="Number of individuals") | |
lines(x=trajectory$time, y=trajectory$I, lwd=3, col="firebrick2") | |
points(times, data.points) | |
legend(x="topleft", legend=c("Active COVID-19 case count", 'Model projection of "I" (Infected)'), lty=c("blank", "solid"), lwd=c(1,3), pch=c(1, "."), col=c("black","firebrick2"), cex=0.8) | |
# Print the total log-likelihood in the plot title | |
# NOTE how we take the model projection of # infectious, trajectory$I, and | |
# store that in model.points: | |
model.points = trajectory$I | |
total_log_likelihood = likelihood_time_series(data.points=data.points, model.points=model.points, log=TRUE) | |
titletxt = paste0("Total log-likelihood = ", round(total_log_likelihood, 2)) | |
title(titletxt) | |
####################################################### | |
# QUESTION #17: Spend about 10 minutes re-running the | |
# code chunk above, changing the two "R0" values, | |
# and/or the first "I" value. | |
# | |
# Answer these questions in the answer box: | |
# | |
# What is the highest log-likelihood you found? | |
# | |
# What parameter values did you use to find this | |
# log-likelihood (lnL)? | |
# | |
# How hard was it to find parameter values that | |
# made a curve that matched the data? | |
# | |
# I will not mark down for finding the "wrong" answer, this | |
# is about getting an intuition about how | |
# changing parameters -> changes model projections -> | |
# changes fit between line & data -> changes the | |
# log-likelihood. | |
# | |
# Report your answer like this: | |
# The highest log-likelihood (lnL) I found was lnL = -12500.74 | |
# | |
# What parameter values did you use to find this | |
# log-likelihood (lnL)? | |
# | |
# R0 = c(3.0, 0.3) | |
# I = c(1, 0) | |
# | |
# How hard was it to find parameter values that | |
# made a curve that roughly matched the data? | |
# | |
# I found it very easy/very difficult to find | |
# the parameter values because... | |
# | |
########################################################## | |
# Part 3. Optimize the parameters in some way, in order | |
# to find the parameter values that maximize | |
# the likelihood. | |
########################################################## | |
# Hopefully, you just saw that "optimization by hand" | |
# is pretty difficult, although possible if you | |
# had no other options. | |
# | |
# Fortunately, we have other options! | |
# | |
# The function below, "params_to_likelihood_v1()", will help us | |
# automate this process of finding good parameter values, by: | |
# | |
# (1) taking the free parameters you want to infer ("params") | |
# (2) loading them into simulate_SIR_changes() to get a trajectory (model.points) | |
# (3) calculating the likelihood of the observed data.points against the model.points | |
# | |
# It takes parameters, simulates the model projection, | |
# then calculates the total log-likelihood - this just | |
# combines everything we have done separately above! | |
# | |
# It takes as inputs: | |
# | |
# params - starting values for the parameters that you want to infer | |
# data.points - the observed active cases, as before | |
# thetas_states_df - the data.frame holding time,R0, D_inf, S, I, R | |
# delay_val - if you think there is a delay between an intervention, | |
# and seeing the effects, put it here. Default is 14. | |
# printvals - if TRUE, prints the parameter values you are trying. | |
# | |
params_to_likelihood_v1 <- function(params, data.points, thetas_states_df, delay_val=14, printvals=TRUE, makePlots=FALSE) | |
{ | |
# The times are just the length of the data.points | |
times = 1:length(data.points) | |
# (1) input the free parameters | |
thetas_states_temp = thetas_states_df | |
TF = thetas_states_temp == "free" | |
thetas_states_temp[TF] = params | |
# Error check | |
if ((max(thetas_states_temp$time)+delay_val) > max(times)) | |
{ | |
stop("ERROR in params_to_likelihood_v1: there is a 'thetas_states_df$time'+delay_val greater than found in 'times'") | |
} | |
# Make sure everything in thetas_states_df is numeric, instead of character | |
for (i in 1:ncol(thetas_states_temp)) | |
{ | |
thetas_states_temp[,i] = as.numeric(thetas_states_temp[,i]) | |
} | |
# Ad2 the delay (allows time for interventions to show up | |
# in the detections) | |
thetas_states_temp$time[2:length(thetas_states_temp$time)] = thetas_states_temp$time[2:length(thetas_states_temp$time)] + delay_val | |
# Calculate the trajectory given the parameters, | |
# input into model.points | |
trajectory = simulate_SIR_changes(thetas_states_df=thetas_states_temp, times) | |
model.points = trajectory$I | |
lnL = likelihood_time_series(data.points, model.points, log=TRUE) | |
if (is.finite(lnL)== FALSE) | |
{ | |
lnL=-10e100 | |
} | |
# Print, if desired | |
if (printvals == TRUE) | |
{ | |
print(thetas_states_temp) | |
cat("log-likelihood = ", lnL, "\n", sep="") | |
} | |
# Make plots, if desired | |
if (makePlots == TRUE) | |
{ | |
plot_trajectory_lines(trajectory, theta=NULL, thetas_states_df=thetas_states_df, title_prefix="Plot #5:") | |
} | |
return(lnL) | |
} # END params_to_likelihood_v1 | |
# We can use params_to_likelihood_v1(), in conjunction with with | |
# the ML optimization function optim() (we used this in Week 4!), | |
# to try lots of parameter values, and gradually find the | |
# parameters that maximize the likelihood. | |
# | |
# Let's try it! | |
# | |
# The major difference from what we did above is that, after | |
# setting up thetas_states_df, we write "free" in certain | |
# cells of the table. These "free" parameters are then filled | |
# in with the values from "params" -- and the optim() function | |
# will change the values in params, in an attempt to find the | |
# best parameters. | |
# | |
####################################################### | |
# Fitting a 2-regime model | |
####################################################### | |
time = c(1, 24) # NZ Level 4 lockdown went in on March 26, day 24 since 1st case | |
R0 = c(3.0, 0.3) | |
D_inf = c(14.0, 14.0) | |
S = c(5000000, 0) | |
I = c(1, 0) | |
R = c(0, 0) | |
thetas_states_table = cbind(time, R0, D_inf, S, I, R) | |
thetas_states_df = as.data.frame(thetas_states_table, stringsAsFactors=FALSE) | |
# Make some parameters into "free" parameters, to be inferred | |
thetas_states_df$R0 = c("free", "free") | |
thetas_states_df$I[1] = "free" | |
thetas_states_df | |
# Run the params_to_likelihood_v1() function once, to see your starting likelihood | |
data.points = dat$Active_cases | |
params = c(2.5, 1.0, 0.03) # starting parameter values | |
params_to_likelihood_v1(params, data.points, thetas_states_df, delay_val=14, printvals=TRUE, makePlots=FALSE) | |
# Running the Maximum Likelihood search with the optim() function. | |
# LOOK AT THE OUTPUT THAT PRINTS TO SCREEN!! | |
delay_val = 14 | |
ML_results = optim(par=params, fn=params_to_likelihood_v1, data.points=data.points, thetas_states_df=thetas_states_df, delay_val=delay_val, printvals=TRUE, method="L-BFGS-B", lower=0.0, control=list(fnscale=-1)) | |
# Graphing the ML model | |
# Take the learned parameters from "ML_results", put them | |
# into a theta_states data.frame for simulation and plotting | |
thetas_states_ML = thetas_states_df | |
TF = thetas_states_ML == "free" | |
thetas_states_ML[TF] = ML_results$par | |
for (i in 1:ncol(thetas_states_ML)) | |
{ thetas_states_ML[,i] = as.numeric(thetas_states_ML[,i]) } | |
thetas_states_ML$time[2:length(thetas_states_ML$time)] = thetas_states_ML$time[2:length(thetas_states_ML$time)] + delay_val | |
# Plot the results | |
trajectory = simulate_SIR_changes(thetas_states_df=thetas_states_ML, times=1:length(data.points)) | |
maxy = max(max(trajectory$I), max(data.points)) | |
xvals = c(trajectory$time, times) | |
yvals = c(trajectory$I, data.points) | |
plot(x=xvals, y=yvals, pch=".", col="white", xlim=c(0, max(trajectory$time)), ylim=c(0, maxy), xlab="Day", ylab="Number of individuals") | |
lines(x=trajectory$time, y=trajectory$I, lwd=3, col="firebrick2") | |
points(times, data.points) | |
legend(x="topleft", legend=c("Active COVID-19 case count", 'ML-fitted projection of "I" (Infected)'), lty=c("blank", "solid"), lwd=c(1,3), pch=c(1, "."), col=c("black","firebrick2"), cex=0.8) | |
titletxt = paste0("ML fit, 3-regime model\nTotal log-likelihood = ", round(ML_results$value, 2)) | |
title(titletxt) | |
# Congratulations! You have run a Maximum Likelihood inference | |
# on New Zealand's Covid-19 active case count data. | |
# Using the plot, and the variables & screen output above, please answer | |
# the questions below. | |
# | |
# QUESTION #18. What was the log-likelihood obtained for the | |
# Maximum Likelihood parameters? Provide the answer to 2 | |
# decimal places. | |
# | |
# QUESTION #19. Was this lnL higher than the one you found | |
# when you searched the parameter space "by hand"? | |
# | |
# QUESTION #20. During the ML search, the thetas_states table, | |
# and the log-likelihood, were printed to screen at each | |
# step of the search. Which option best describes what this | |
# looks like? | |
# | |
# NOTE: FOR QUESTIONS 21-24, AND ANY OTHER QUESTIONS ASKING ABOUT | |
# THE "first", "second", etc. R0 parameters, see lecture slide | |
# #101. And this example: | |
# | |
# If your thetas_states_ML table looks like this: | |
# | |
# time R0 D_inf S I R | |
# 1 1 2.574564 14 5e+06 1.028189 0 | |
# 2 38 1.007426 14 0e+00 0.000000 0 | |
# | |
# ...then: | |
# the 1st R0 parameter is: 2.574564 | |
# the 2nd R0 parameter is: 1.007426 | |
# the starting number of Infectious is: 1.028189 | |
# | |
# | |
# QUESTION #21. What was the first R0 parameter, as inferred by ML? | |
# (This parameter was operating from Days 1-36.) | |
# | |
# QUESTION #22. What was the second R0 parameter, as inferred by ML? | |
# (This parameter was operating from Day 37 to the end.) | |
# | |
# QUESTION #23. What was the starting number of Infected, as inferred by ML? | |
# (This parameter was operating from Day 37 to the end.) | |
# | |
# QUESTION #24. Is this "starting number of Infected" a plausible number in | |
# real life, or is this more of a "nuisance parameter" in our analysis? | |
# | |
# You may have noticed that even our ML-fitted model parameters | |
# did not produce an excellent match to our data points. | |
# | |
# For example, the projected curve first rises faster than | |
# our observed case counts, but then it gets lower than the | |
# counts, and then it gets higher than the counts again. | |
# | |
# This is probably the result of trying to fit a single | |
# R0 parameter, which will produce a curve increasing | |
# exponentially by a constant proportion, to a dataset | |
# where, possibly, multiple things are going on. | |
# | |
# For example, what if, before the Level 4 lockdown happened, | |
# New Zealanders were already reducing transmission of the virus, | |
# for example through: | |
# | |
# * Level 2 and Level 3 restrictions | |
# * Increased handwashing & use of sanitizer | |
# * Stopping handshaking etc. | |
# | |
# This suggests that we might end up with a better-fitting | |
# model, and better parameter estimates, if we fit a | |
# 3-regime model, instead of a 2-regime model. | |
# | |
# Let's try that! | |
####################################################### | |
# Fitting a 3-regime model | |
####################################################### | |
time = c(1, 16, 24) # NZ Level 4 lockdown went in on March 26, day 24 since 1st case | |
R0 = c(3.0, 1.0, 0.3) | |
D_inf = c(14.0, 14.0) | |
S = c(5000000, 0) | |
I = c(1, 0) | |
R = c(0, 0) | |
thetas_states_table = cbind(time, R0, D_inf, S, I, R) | |
thetas_states_df = as.data.frame(thetas_states_table, stringsAsFactors=FALSE) | |
# Make some parameters into "free" parameters, to be inferred | |
thetas_states_df$R0 = c("free", "free", "free") | |
thetas_states_df$I[1] = "free" | |
thetas_states_df | |
# Run the params_to_likelihood_v1() function once, to see your starting likelihood | |
data.points = dat$Active_cases | |
params = c(2.5, 1.0, 0.3, 0.03) # starting parameter values | |
params_to_likelihood_v1(params, data.points, thetas_states_df, printvals=TRUE, makePlots=FALSE) | |
# Running the Maximum Likelihood search with the optim() function. | |
# LOOK AT THE OUTPUT THAT PRINTS TO SCREEN!! | |
delay_val = 14 | |
ML_results = optim(par=params, fn=params_to_likelihood_v1, data.points=data.points, thetas_states_df=thetas_states_df, delay_val=delay_val, printvals=TRUE, method="L-BFGS-B", lower=0.0, control=list(fnscale=-1)) | |
# Graphing the ML model | |
# Take the learned parameters from "ML_results", put them | |
# into a theta_states data.frame for simulation and plotting | |
thetas_states_ML = thetas_states_df | |
TF = thetas_states_ML == "free" | |
thetas_states_ML[TF] = ML_results$par | |
for (i in 1:ncol(thetas_states_ML)) | |
{ thetas_states_ML[,i] = as.numeric(thetas_states_ML[,i]) } | |
thetas_states_ML$time[2:length(thetas_states_ML$time)] = thetas_states_ML$time[2:length(thetas_states_ML$time)] + delay_val | |
# Plot the results | |
trajectory = simulate_SIR_changes(thetas_states_df=thetas_states_ML, times=1:length(data.points)) | |
maxy = max(max(trajectory$I), max(data.points)) | |
xvals = c(trajectory$time, times) | |
yvals = c(trajectory$I, data.points) | |
plot(x=xvals, y=yvals, pch=".", col="white", xlim=c(0, max(trajectory$time)), ylim=c(0, maxy), xlab="Day", ylab="Number of individuals") | |
lines(x=trajectory$time, y=trajectory$I, lwd=3, col="firebrick2") | |
points(times, data.points) | |
legend(x="topleft", legend=c("Active COVID-19 case count", 'ML-fitted projection of "I" (Infected)'), lty=c("blank", "solid"), lwd=c(1,3), pch=c(1, "."), col=c("black","firebrick2"), cex=0.8) | |
titletxt = paste0("ML fit, 3-regime model\nTotal log-likelihood = ", round(ML_results$value, 2)) | |
title(titletxt) | |
# QUESTION #25. Visually, in the 3-regime model, does it look like | |
# this curve fits the data better? | |
# | |
# QUESTION #26. In the 3-regime model, what was the log-likelihood obtained | |
# for the Maximum Likelihood parameters? Provide the answer to 2 | |
# decimal places. | |
# | |
# QUESTION #27. In the 3-regime model, is the lnL higher than the one you found | |
# for the 2-regime model? | |
# | |
# QUESTION #28. What was the first R0 parameter, as inferred by ML? | |
# (This parameter was operating from Days 1-30.) | |
# | |
# QUESTION #29. What was the second R0 parameter, as inferred by ML? | |
# (This parameter was operating from Days 31-38.) | |
# | |
# QUESTION #30. What was the third R0 parameter, as inferred by ML? | |
# (This parameter was operating from Day 39-end.) | |
# | |
# QUESTION #31. What was the starting number of Infected, as inferred by ML? | |
# (note: as we saw before, this is a nuisance parameter, not intended | |
# as a real estimate in this simplified model) | |
# | |
# QUESTION #32. What do our inferred R0 values suggest about the | |
# coronavirus's natural spreading ability, and about the impact of | |
# sanitation and physical distancing measures? | |
# | |
####################################################### | |
# Statistical model comparison | |
# | |
# One huge advantage of likelihood-based inference is | |
# that the likelihoods can be used to statistically | |
# compare model fits. | |
# | |
# One very common method for comparing ML-fitted models | |
# is Akaike Information Criterion (AIC). AIC has been | |
# used in tens of thousands of research papers to | |
# compare multiple model fits. See the lecture slides/ | |
# audio, and the Week 6 reading, for an introduction to | |
# AIC. | |
# | |
# The very-very short version of AIC is that AIC | |
# *estimates the relative distance from the truth*. | |
# | |
# We never know what the "true model" is in real life, | |
# since any true biological history that produced observations | |
# is preposterously complex. However, we can at least | |
# compare the *relative* fit of models to data, using AIC. | |
# | |
# AIC is a simple function of likelihood: | |
# | |
# AIC = -2 * (lnL - nparams) | |
# | |
# lnL = the maximized log-likelihood under some model | |
# nparams = the number of free parameters that you fit with ML | |
# | |
# Basically, the AIC is a form of penalized likelihood: we | |
# favor models that give higher lnLs, but then we penalize | |
# models base on the number of free parameters that we fit. | |
# | |
# This helps enforce parsimony in our explanations. If we | |
# put no penalty on extra free parameters, we could always | |
# fit models with higher and higher likelihood, until we | |
# ended up with a free parameter for every data point, at | |
# which point we fit all the data perfectly, but we have | |
# explained nothing at all, because the model just restates | |
# the dataset. | |
# | |
# Some details: | |
# | |
# The "-2" in the AIC formula needs explanation. The "-" | |
# just converts the log-likelihood (which is always negative, | |
# if the data are discrete) from negative to positive. This | |
# makes sense because AIC is a distance, and distances | |
# should be positive. | |
# | |
# The "2" part in AIC is actually an arbitrary convention - | |
# it would work the same if we used 1, 3, etc. The number "2" | |
# was chosen because another famous statistical test, the | |
# Likelihood Ratio Test, uses 2*(difference in log-likelihoods) | |
# as the test statistic, often called the "deviance". | |
# | |
# The rationale for 1 parameter = 1 unit of log-likelihood | |
# is complex, but it can be derived from a view of the world | |
# where we think that processes are controlled by a series of | |
# parameters of decreasing importance, such that there are | |
# a few major controlling variables, and an exponential | |
# decay in the importance of other controlling variables. | |
# | |
# Other rationales can produce other penalties, creating | |
# other model comparison criteria, such as BIC (Bayesian | |
# Information Criterion). Don't worry about these for now. | |
# | |
# AIC model weights | |
# | |
# AIC is just a number that doesn't mean much by itself. | |
# AIC values are used to *compare* multiple models. We | |
# do this with AIC weights, which assign each model a | |
# weight, where all of the weights sum to 1. | |
# | |
# For example, if you had these weights: | |
# | |
# Model 1 AIC weight = 0.05 | |
# Model 2 AIC weight = 0.9 | |
# Model 3 AIC weight = 0.05 | |
# | |
# You would say that "Model 2 acquired 90% of the AIC | |
# weight, and was the best-fitting model in our | |
# model set. It fits about 9 times better than the | |
# other models combined, which is moderately good | |
# evidence that it is a better fit. Models 1 and 3 | |
# have equal model weight, so they are approximately | |
# equivalent fits to the data." | |
# | |
############################################## | |
# Here is how to calculate AIC weights: | |
############################################## | |
# | |
# 1. Calculate AIC values for all the models | |
# in your model set. | |
# | |
# 2. Find the AIC-best model. This is the model | |
# with the *LOWEST* AIC (i.e., the smallest | |
# relative distance from the truth). | |
# | |
# 3. For all the models, calculate the AIC difference | |
# from the best model. (The AIC difference between | |
# the best model and itself is 0.0.). This | |
# number is called "deltaAIC" (in science, "delta" | |
# often means "change" or "difference", so "deltaAIC" | |
# just means "difference in AIC". | |
# | |
# 4. For each model, calculate exp(-0.5 * deltaAIC). These | |
# are the relative likelihoods. | |
# | |
# 5. Sum the relative likelihoods, and divide each | |
# individual relative likelihood by the sum of | |
# relative likelihoods. This is the AIC weight. | |
# The AIC weights will sum to 1.0. | |
# | |
# 6. If desired, convert these fractions (numbers | |
# between 0 and 1) to percentages, by multiplying | |
# by 100, or by clicking the "%" option in Excel. | |
# | |
############################################### | |
############################################### | |
# Advantages and disadvantages of AIC | |
############################################### | |
# The advantages of the AIC framework include: | |
# | |
# * Multiple models (more than 2) can be compared easily. | |
# Traditional statistical tests, which give you p-values, | |
# typically only compare two models at a time, namely | |
# a null model and an alternate. | |
# | |
# * There is no concept of "p-value cutoffs" or | |
# "statistical signficance" in AIC-based analyses. | |
# The proponents of AIC acknowledge that evidence is | |
# continuous. If AIC favours a model 1 over model 2 | |
# 60%-40%, this is a tiny bit of evidence that the | |
# model 1 is closer to the truth. If AIC favours a | |
# model 1 99.999% to 0.001%, then this is strong | |
# evidence that model 1 closer to the truth. | |
# | |
# * AIC assumes that we never have access to the | |
# complete, true model that produced our data. It | |
# acknowledges that the "true" model behind most | |
# empirical data would be fantastically complex. So | |
# all we are really trying to do is find better | |
# simple approximations of a very complex reality. | |
# AIC is one easy way to measure the *relative* | |
# distance of models from the truth. We never know | |
# the "absolute" distance from the truth, because | |
# that would require knowing the true model, which | |
# we never do. | |
# | |
# * Assuming you have maximized lnL values for each | |
# model, the formula for AIC, and AIC weights, is very | |
# simple. The calculations can be done by hand, or | |
# in Excel or Google Sheets (or R). All you have to | |
# be able to do is count the number of free parameters | |
# in each model. | |
# | |
# * AIC can be used very broadly for comparing models. | |
# This includes not just models where the lnL is | |
# explicitly reported (like those above), but | |
# also in linear regression and other linear models. | |
# Here, model fit is more commonly reported in | |
# terms of "deviance" or "RSS" (residual sum | |
# of squares, the sum of the squared errors between | |
# the line of best fit, and the data). These are | |
# all proportional to -1 * log-likelihood, under | |
# the assumption that the errors are normally | |
# distributed. So, the "least squares" method can | |
# also be interpret as a Maximum Likelihood method! | |
# | |
# (Unfortunately, many introductory students only | |
# ever learn about least-squares, when Maximum | |
# Likelihood has much broader application to | |
# all sorts of models, not just cases where | |
# a normal distribution of errors can be assumed.) | |
# | |
# AIC has some limitations as well: | |
# | |
# * If you are literally doing a randomized controlled | |
# experiment, where you have two hypotheses (models) | |
# that exhaust the space of models, namely "no effect" | |
# (the null model) and "effect" (the alternative model), | |
# then the traditional p-value framework can make a | |
# lot of sense. This is especially the case where | |
# further decisions will be based on the result, so | |
# the effect of false-positives and false-negatives | |
# needs serious consideration. | |
# | |
# * If you *do* have the true model in your set, for | |
# example when your data come from a computer simulation | |
# where you know the true model, then a criterion | |
# like BIC, which has a stronger penalty against | |
# extra parameters. | |
# | |
# * If you have a dataset where the number of data is | |
# small (typically <30 data points), and/or the number | |
# of free parameters is large, it is more appropriate | |
# to use "AICc", which is sample-size corrected AIC. | |
# The formula for this is a little more complex, so | |
# we will ignore it here. | |
# | |
# * There are all kinds of intense philosophical debates | |
# in statistics and philosophy of science about | |
# models, inference, and the best way to measure | |
# support for models/hypotheses. AIC is just one | |
# choice amongst these. I would put it roughly in | |
# the "likelihoodist" school. Other major schools of | |
# thought are "frequentism" (which includes p-values | |
# and worries about the long-term error rate) and | |
# Bayesianism (which explicitly puts prior probability | |
# distributions on parameter values, and on models.) | |
# | |
# * It is easy forget that "all models are wrong", and | |
# get very attached to your best model. This is a | |
# challenge in all of statistics & data analysis, | |
# however. The famous statement from George Box is | |
# "All models are wrong, but some models are useful." | |
# Constant critical thinking is required, as I have | |
# tried to encourage! | |
####################################################### | |
####################################################### | |
# EXAMPLE USE OF AIC TO COMPARE MODELS | |
####################################################### | |
# Let's compare four models for our NZ Covid-19 | |
# active cases dataset. Here, we run the 2-regime | |
# and 3-regime models we used before, but I also | |
# include a 1-regime model, and a 4-regime model. | |
####################################################### | |
# Model 1: A 1-regime model | |
####################################################### | |
time = c(1, 49) # Day 49 is the 2nd-to-last day, days 49-50 will have | |
# a fixed R0 of 2.5 | |
R0 = c(3.0, 2.5) | |
D_inf = c(14.0, 14.0) | |
S = c(5000000, 0) | |
I = c(1, 0) | |
R = c(0, 0) | |
thetas_states_table = cbind(time, R0, D_inf, S, I, R) | |
thetas_states_df = as.data.frame(thetas_states_table, stringsAsFactors=FALSE) | |
# NOTE: ONLY 2 FREE PARAMETERS | |
thetas_states_df$R0[1] = "free" | |
thetas_states_df$I[1] = "free" | |
thetas_states_df | |
# Run the params_to_likelihood_v1() function once, to see your starting likelihood | |
data.points = dat$Active_cases | |
params = c(2.5, 0.03) # starting parameter values | |
params_to_likelihood_v1(params, data.points, thetas_states_df, delay_val=0, printvals=TRUE, makePlots=FALSE) | |
# Running the Maximum Likelihood search with the optim() function. | |
# LOOK AT THE OUTPUT THAT PRINTS TO SCREEN!! | |
delay_val = 0 | |
ML_results = optim(par=params, fn=params_to_likelihood_v1, data.points=data.points, thetas_states_df=thetas_states_df, delay_val=delay_val, printvals=TRUE, method="L-BFGS-B", lower=0.0, control=list(fnscale=-1)) | |
# Graphing the ML model | |
# Take the learned parameters from "ML_results", put them | |
# into a theta_states data.frame for simulation and plotting | |
thetas_states_ML = thetas_states_df | |
TF = thetas_states_ML == "free" | |
thetas_states_ML[TF] = ML_results$par | |
for (i in 1:ncol(thetas_states_ML)) | |
{ thetas_states_ML[,i] = as.numeric(thetas_states_ML[,i]) } | |
thetas_states_ML$time[2:length(thetas_states_ML$time)] = thetas_states_ML$time[2:length(thetas_states_ML$time)] + delay_val | |
# Plot the results | |
trajectory = simulate_SIR_changes(thetas_states_df=thetas_states_ML, times=1:length(data.points)) | |
maxy = max(max(trajectory$I), max(data.points)) | |
xvals = c(trajectory$time, times) | |
yvals = c(trajectory$I, data.points) | |
plot(x=xvals, y=yvals, pch=".", col="white", xlim=c(0, max(trajectory$time)), ylim=c(0, maxy), xlab="Day", ylab="Number of individuals") | |
lines(x=trajectory$time, y=trajectory$I, lwd=3, col="firebrick2") | |
points(times, data.points) | |
legend(x="topleft", legend=c("Active COVID-19 case count", 'ML-fitted projection of "I" (Infected)'), lty=c("blank", "solid"), lwd=c(1,3), pch=c(1, "."), col=c("black","firebrick2"), cex=0.8) | |
titletxt = paste0("ML fit, Model 1 (a 1-regime model)\nTotal log-likelihood = ", round(ML_results$value, 2)) | |
title(titletxt) | |
# Save this model's parameters and log-likelihood | |
thetas_states_ML_model1 = thetas_states_ML | |
total_lnL_Model1 = ML_results$value | |
####################################################### | |
# Model 2: A 2-regime model | |
####################################################### | |
time = c(1, 24) # NZ Level 4 lockdown went in on March 26, day 24 since 1st case | |
R0 = c(3.0, 0.3) | |
D_inf = c(14.0, 14.0) | |
S = c(5000000, 0) | |
I = c(1, 0) | |
R = c(0, 0) | |
thetas_states_table = cbind(time, R0, D_inf, S, I, R) | |
thetas_states_df = as.data.frame(thetas_states_table, stringsAsFactors=FALSE) | |
# Make some parameters into "free" parameters, to be inferred | |
thetas_states_df$R0 = c("free", "free") | |
thetas_states_df$I[1] = "free" | |
thetas_states_df | |
# Run the params_to_likelihood_v1() function once, to see your starting likelihood | |
data.points = dat$Active_cases | |
params = c(3.0, 0.3, 0.03) # starting parameter values | |
params_to_likelihood_v1(params, data.points, thetas_states_df, delay_val=14, printvals=TRUE, makePlots=FALSE) | |
# Running the Maximum Likelihood search with the optim() function. | |
# LOOK AT THE OUTPUT THAT PRINTS TO SCREEN!! | |
delay_val = 14 | |
ML_results = optim(par=params, fn=params_to_likelihood_v1, data.points=data.points, thetas_states_df=thetas_states_df, delay_val=delay_val, printvals=TRUE, method="L-BFGS-B", lower=0.0, control=list(fnscale=-1)) | |
# Graphing the ML model | |
# Take the learned parameters from "ML_results", put them | |
# into a theta_states data.frame for simulation and plotting | |
thetas_states_ML = thetas_states_df | |
TF = thetas_states_ML == "free" | |
thetas_states_ML[TF] = ML_results$par | |
for (i in 1:ncol(thetas_states_ML)) | |
{ thetas_states_ML[,i] = as.numeric(thetas_states_ML[,i]) } | |
thetas_states_ML$time[2:length(thetas_states_ML$time)] = thetas_states_ML$time[2:length(thetas_states_ML$time)] + delay_val | |
# Plot the results | |
trajectory = simulate_SIR_changes(thetas_states_df=thetas_states_ML, times=1:length(data.points)) | |
maxy = max(max(trajectory$I), max(data.points)) | |
xvals = c(trajectory$time, times) | |
yvals = c(trajectory$I, data.points) | |
plot(x=xvals, y=yvals, pch=".", col="white", xlim=c(0, max(trajectory$time)), ylim=c(0, maxy), xlab="Day", ylab="Number of individuals") | |
lines(x=trajectory$time, y=trajectory$I, lwd=3, col="firebrick2") | |
points(times, data.points) | |
legend(x="topleft", legend=c("Active COVID-19 case count", 'ML-fitted projection of "I" (Infected)'), lty=c("blank", "solid"), lwd=c(1,3), pch=c(1, "."), col=c("black","firebrick2"), cex=0.8) | |
titletxt = paste0("ML fit, Model 2 (a 2-regime model)\nTotal log-likelihood = ", round(ML_results$value, 2)) | |
title(titletxt) | |
# Save this model's parameters and log-likelihood | |
thetas_states_ML_model2 = thetas_states_ML | |
total_lnL_Model2 = ML_results$value | |
####################################################### | |
# Model 3: A 3-regime model | |
####################################################### | |
time = c(1, 16, 24) # NZ Level 4 lockdown went in on March 26, day 24 since 1st case | |
R0 = c(3.0, 1.0, 0.3) | |
D_inf = c(14.0, 14.0, 14.0) | |
S = c(5000000, 0, 0) | |
I = c(1, 0, 0) | |
R = c(0, 0, 0) | |
thetas_states_table = cbind(time, R0, D_inf, S, I, R) | |
thetas_states_df = as.data.frame(thetas_states_table, stringsAsFactors=FALSE) | |
# Make some parameters into "free" parameters, to be inferred | |
thetas_states_df$R0 = c("free", "free", "free") | |
thetas_states_df$I[1] = "free" | |
thetas_states_df | |
# Run the params_to_likelihood_v1() function once, to see your starting likelihood | |
data.points = dat$Active_cases | |
params = c(3.0, 2.0, 0.3, 0.03) # starting parameter values | |
params_to_likelihood_v1(params, data.points, thetas_states_df, delay_val=14, printvals=TRUE, makePlots=FALSE) | |
# Running the Maximum Likelihood search with the optim() function. | |
# LOOK AT THE OUTPUT THAT PRINTS TO SCREEN!! | |
delay_val = 14 | |
ML_results = optim(par=params, fn=params_to_likelihood_v1, data.points=data.points, thetas_states_df=thetas_states_df, delay_val=delay_val, printvals=TRUE, method="L-BFGS-B", lower=0.0, control=list(fnscale=-1)) | |
# Graphing the ML model | |
# Take the learned parameters from "ML_results", put them | |
# into a theta_states data.frame for simulation and plotting | |
thetas_states_ML = thetas_states_df | |
TF = thetas_states_ML == "free" | |
thetas_states_ML[TF] = ML_results$par | |
for (i in 1:ncol(thetas_states_ML)) | |
{ thetas_states_ML[,i] = as.numeric(thetas_states_ML[,i]) } | |
thetas_states_ML$time[2:length(thetas_states_ML$time)] = thetas_states_ML$time[2:length(thetas_states_ML$time)] + delay_val | |
# Plot the results | |
trajectory = simulate_SIR_changes(thetas_states_df=thetas_states_ML, times=1:length(data.points)) | |
maxy = max(max(trajectory$I), max(data.points)) | |
xvals = c(trajectory$time, times) | |
yvals = c(trajectory$I, data.points) | |
plot(x=xvals, y=yvals, pch=".", col="white", xlim=c(0, max(trajectory$time)), ylim=c(0, maxy), xlab="Day", ylab="Number of individuals") | |
lines(x=trajectory$time, y=trajectory$I, lwd=3, col="firebrick2") | |
points(times, data.points) | |
legend(x="topleft", legend=c("Active COVID-19 case count", 'ML-fitted projection of "I" (Infected)'), lty=c("blank", "solid"), lwd=c(1,3), pch=c(1, "."), col=c("black","firebrick2"), cex=0.8) | |
titletxt = paste0("ML fit, Model 3 (a 3-regime model)\nTotal log-likelihood = ", round(ML_results$value, 2)) | |
title(titletxt) | |
# Save this model's parameters and log-likelihood | |
thetas_states_ML_model3 = thetas_states_ML | |
total_lnL_Model3 = ML_results$value | |
####################################################### | |
# Model 4: A 4-regime model | |
####################################################### | |
time = c(1, 16, 20, 24) # NZ Level 4 lockdown went in on March 26, day 24 since 1st case | |
R0 = c(3.0, 1.0, 1.0, 0.3) | |
D_inf = c(14.0, 14.0, 14.0, 14.0) | |
S = c(5000000, 0, 0, 0) | |
I = c(1, 0, 0, 0) | |
R = c(0, 0, 0, 0) | |
thetas_states_table = cbind(time, R0, D_inf, S, I, R) | |
thetas_states_df = as.data.frame(thetas_states_table, stringsAsFactors=FALSE) | |
# Make some parameters into "free" parameters, to be inferred | |
thetas_states_df$R0 = c("free", "free", "free", "free") | |
thetas_states_df$I[1] = "free" | |
thetas_states_df | |
# Run the params_to_likelihood_v1() function once, to see your starting likelihood | |
data.points = dat$Active_cases | |
params = c(3.0, 2.0, 1.0, 0.3, 0.03) # starting parameter values | |
params_to_likelihood_v1(params, data.points, thetas_states_df, delay_val=14, printvals=TRUE, makePlots=FALSE) | |
# Running the Maximum Likelihood search with the optim() function. | |
# LOOK AT THE OUTPUT THAT PRINTS TO SCREEN!! | |
delay_val = 14 | |
ML_results = optim(par=params, fn=params_to_likelihood_v1, data.points=data.points, thetas_states_df=thetas_states_df, delay_val=delay_val, printvals=TRUE, method="L-BFGS-B", lower=0.0, control=list(fnscale=-1)) | |
# Graphing the ML model | |
# Take the learned parameters from "ML_results", put them | |
# into a theta_states data.frame for simulation and plotting | |
thetas_states_ML = thetas_states_df | |
TF = thetas_states_ML == "free" | |
thetas_states_ML[TF] = ML_results$par | |
for (i in 1:ncol(thetas_states_ML)) | |
{ thetas_states_ML[,i] = as.numeric(thetas_states_ML[,i]) } | |
thetas_states_ML$time[2:length(thetas_states_ML$time)] = thetas_states_ML$time[2:length(thetas_states_ML$time)] + delay_val | |
# Plot the results | |
trajectory = simulate_SIR_changes(thetas_states_df=thetas_states_ML, times=1:length(data.points)) | |
maxy = max(max(trajectory$I), max(data.points)) | |
xvals = c(trajectory$time, times) | |
yvals = c(trajectory$I, data.points) | |
plot(x=xvals, y=yvals, pch=".", col="white", xlim=c(0, max(trajectory$time)), ylim=c(0, maxy), xlab="Day", ylab="Number of individuals") | |
lines(x=trajectory$time, y=trajectory$I, lwd=3, col="firebrick2") | |
points(times, data.points) | |
legend(x="topleft", legend=c("Active COVID-19 case count", 'ML-fitted projection of "I" (Infected)'), lty=c("blank", "solid"), lwd=c(1,3), pch=c(1, "."), col=c("black","firebrick2"), cex=0.8) | |
titletxt = paste0("ML fit, Model 4 (a 4-regime model)\nTotal log-likelihood = ", round(ML_results$value, 2)) | |
title(titletxt) | |
# Save this model's parameters and log-likelihood | |
thetas_states_ML_model4 = thetas_states_ML | |
total_lnL_Model4 = ML_results$value | |
####################################################### | |
# AIC comparison of the 4 models | |
####################################################### | |
# Below, I will do AIC comparison in R, but I will | |
# also provide the same example in Excel and in | |
# Google Sheets. | |
# | |
# (NOTE: The exact results you get may differ, if you | |
# change the input data e.g. by adding days.) | |
# | |
# Excel: # http://phylo.wikidot.com/local--files/biosci220:quantitative-biology/Week6_AIC_with_SIRs_fit_to_NZ_Ministry_of_Health_data.xlsx | |
# | |
# Google Sheet: | |
# https://docs.google.com/spreadsheets/d/1WNk1la1TfoIdKcadxm0kUW5JNTTl0ZDImzDEQq0fdSQ/edit?usp=sharing | |
# | |
# Making the AIC table in R: | |
# | |
# First, let's get the number of free parameters | |
# for each of the 4 models. We get this by counting | |
# the number of "free" statement in the | |
# theta_states_df, or more simply by just thinking | |
# about it. | |
# | |
# Model 1 estimated 1 R0 parameter, and a starting "I" count | |
# That is 2 free parameters estimated by ML. | |
# | |
# Model 2 added another R0, Model 3 added another, and | |
# Model 4 added another. So they have 3, 4, and 5 | |
# parameters. | |
# Let's store the number of free parameters | |
num_free_params = c(2, 3, 4, 5) | |
# Let's collect the maximized log-likelihoods from | |
# above | |
lnL = c(total_lnL_Model1, total_lnL_Model2, total_lnL_Model3, total_lnL_Model4) | |
# AIC is just -2 * (lnL - num_free_params) | |
AIC_vals = -2*(lnL - num_free_params) | |
# Get the delta AICs -- the difference from the best (lowest!) AIC | |
best_AIC = min(AIC_vals) | |
delta_AICs = AIC_vals - best_AIC | |
# Calculate the relative likelihoods | |
relative_likelihoods = exp(-0.5 * delta_AICs) | |
# And, calculate the AIC weights | |
AIC_weight = relative_likelihoods / sum(relative_likelihoods) | |
# AIC weights as percentages (rounded to 2 decimals) | |
percent_weight = round(AIC_weight * 100, digits=2) | |
# Make model names, and results table | |
model_names = c("Model 1", "Model 2", "Model 3", "Model 4") | |
AIC_results_matrix = cbind(lnL, num_free_params, AIC_vals, AIC_weight, percent_weight) | |
AIC_results_df = as.data.frame(AIC_results_matrix, stringsAsFactors=FALSE) | |
# Make better row & column names | |
row.names(AIC_results_df) = model_names | |
names(AIC_results_df) = c("lnL", "nparams", "AIC", "AIC_weight", "pct_weight") | |
# Look at the results table! | |
AIC_results_df | |
# A "real" results table would often also include the parameter values | |
parameter_estimates = matrix(0, nrow=4, ncol=5) | |
# It takes some work to extract the parameter estimates from the | |
# "thetas_states_ML_model" objects where we saved them | |
parameter_estimates[1,1:5] = c(thetas_states_ML_model1$I[1], thetas_states_ML_model1$R0[1], thetas_states_ML_model1$R0[1], thetas_states_ML_model1$R0[1], thetas_states_ML_model1$R0[1]) | |
parameter_estimates[2,1:5] = c(thetas_states_ML_model2$I[1], thetas_states_ML_model2$R0[1], thetas_states_ML_model2$R0[1], thetas_states_ML_model2$R0[1], thetas_states_ML_model2$R0[2]) | |
parameter_estimates[3,1:5] = c(thetas_states_ML_model3$I[1], thetas_states_ML_model3$R0[1], thetas_states_ML_model3$R0[1], thetas_states_ML_model3$R0[2], thetas_states_ML_model3$R0[3]) | |
parameter_estimates[4,1:5] = c(thetas_states_ML_model4$I[1], thetas_states_ML_model4$R0[1], thetas_states_ML_model4$R0[2], thetas_states_ML_model4$R0[3], thetas_states_ML_model4$R0[4]) | |
parameter_estimates | |
# Add them to the table, add some nice rounding | |
first_two_columns = cbind(round(AIC_results_df[,1],1), AIC_results_df[,2]) | |
middle_columns = cbind(round(parameter_estimates, 2)) | |
last_columns = round(AIC_results_df[,3:ncol(AIC_results_df)], 2) | |
AIC_results_df2 = cbind(first_two_columns, middle_columns, last_columns) | |
names(AIC_results_df2) = c("lnL", "nparams", "I_ini", "R0_1", "R0_2", "R0_3", "R0_4", "AIC", "AIC_weight", "pct_weight") | |
# Look at the final table | |
AIC_results_df2 | |
####################################################### | |
# Congratulations! You have now calculated AIC model | |
# weights for your 4 models! You will see AIC tables | |
# like this in the results sections of many scientific | |
# papers. | |
# | |
# The main interpretation of the table is about the | |
# AIC weights or percent weights (which give the | |
# same information). | |
# | |
# Model 1 (R0=2.18 for the entire epidemic) | |
# is clearly a horrible fit. It gets 0.0% of | |
# the model weight, and the plot also shows a | |
# huge difference between the data and the best-fit | |
# curve. Yay! This shows that NZ's epidemic clearly | |
# deviates from constant exponential growth! We are | |
# nowhere near the millions of cases required for | |
# herd immunity to take over, so we're not just | |
# flattening the curve, we are squashing the curve! | |
# | |
# Model 2 (R0=3.5 until day 24, then R0=0.3 after the | |
# lockdown) has a fit that is 4918.75 AIC units | |
# better (meaning lower) than Model 1. An AIC difference of 10 | |
# translates to a model having about 150 times more | |
# likelihood (exp(0.5*10) = about 150), so a difference | |
# of thousands of AIC units is huge. So this is | |
# a pretty good result, for adding just 1 more | |
# parameter! | |
# | |
# But was Model 2 the best fit? We can see that, in | |
# Model 3, adding a third R0, allowing R0 to decrease | |
# before the lockdown starts, improves the fit by | |
# another 451 AIC units. This is an increase in | |
# relative likelihood of exp(0.5*451)=8.6*10^97 times. | |
# Model 3 has an initial R0 of 4.85, then a switch to | |
# R0=2.53 on Day 16, then a switch to R0=0.53 on Day 24. | |
# | |
# Can we do even better? Model 4 further subdivides | |
# Days 16-20 from Days 20-24. This *does* increase | |
# the log-likelihood, from -207.74 (Model 3) to | |
# -207.38 (Model 4). However, does this improvement | |
# in log-likelihood outweigh the penalty for having | |
# an extra parameter? We know that the penalty term | |
# is equivalent to 1 unit of log-likelihood, so a | |
# log-likelihood improvement of only 0.36 units is | |
# not going to overcome the penalty. | |
# | |
# The AIC weights allow us to quantify this effect. | |
# Model 3 gets 65.49% of the model weight, and Model 4 | |
# gets 34.51%. The research would conclude that, of the | |
# models considered, Model 3 is the best fit (or at | |
# least, the least-worst fit). Model 4 retains | |
# plausibility, because it is always possible that a | |
# more complex model is true, but we can say that Model 3 | |
# is an equivalent match to the data, but is more | |
# parsimonious. | |
# | |
####################################################### | |
####################################################### | |
# What is the point of all of this? | |
# | |
# In biology, we almost never know the "true model". We | |
# are just trying to find better approximations of | |
# reality. AIC is one common choice to balance fitting | |
# the observations against parsimony. | |
# | |
# AIC can be used to address common questions like | |
# "which model is the best, given the observations I | |
# have", or "out of this set of models, which models | |
# are reasonable, given the dataset"? | |
# | |
# AIC can be used any time you can get a maximized | |
# log-likelihood, and a count of the number of free | |
# parameters, on each model. It is used so often that | |
# R has some standard functions to provide them for | |
# linear models provided by the lm() function. | |
# Note: these functions will not work | |
# outside of standard "base" R functions like lm() and | |
# glm(). | |
# | |
# Here is a quick example of that: | |
data("iris") # A default R dataset, measurements of iris flowers | |
plot(iris$Sepal.Length, iris$Petal.Length) | |
# Run a simple linear model (lm), with Sepal Length | |
# predicting Petal Length | |
lm_result = lm(iris$Petal.Length~Sepal.Length, data=iris) | |
# The "logLik" function reports the lnL, and the | |
# number of free parameters (df=3) | |
# (df means "degrees of freedom", i.e. number of free | |
# parameters) | |
logLik(lm_result) | |
AIC(lm_result) | |
# You can see that the AIC formula is being used: | |
lnL = as.numeric(logLik(lm_result)) | |
-2*(lnL - 3) | |
AIC(lm_result) | |
# | |
# MAIN POINT: The skills you have used for using | |
# Maximum Likelihood and AIC on SIR models can be | |
# used anywhere from quite complex process-based | |
# models used in epidemiology, ecology, phylogenetics | |
# etc., all the way to the basic linear models used | |
# in introductory statistics. | |
# | |
####################################################### | |
################################################ | |
# Appendix: The connection between Ordinary | |
# Least Squares (OLS), i.e. linear regression, | |
# and Maximum Likelihood. | |
# | |
# This is something I wish I had been taught | |
# early in my statistics education, so I am | |
# mentioning it here. | |
# | |
# This Appendix material is "bonus", I will not | |
# examine you on it for my section of the course. | |
# | |
################################################ | |
# | |
# In standard "least squares" analyses | |
# (OLS, Ordinary Least Squares), where | |
# we talk about the line of | |
# best fit being the one that minimizes | |
# the sum of squared residuals (or RSS, residual | |
# sum of squares), this is just another way | |
# of talking about maximizing the log-likelihood, | |
# *if* we make the assumption that the residuals | |
# (the difference between the predicted line and | |
# the observed response) all follow the same | |
# normally distribution (in other words, the | |
# residuals are independent, normally distributed | |
# with constant variance). | |
# | |
# In this situation, the log-likelihood for the | |
# line of best fit is derived from the log of | |
# the probability density function for a normal | |
# distribution. | |
# | |
# I find that this is rarely explained in R code, | |
# so I am putting various formulations here. | |
# | |
# Here is the logLik() function code: | |
getAnywhere("logLik.lm") | |
# This is the formula from logLik, with weights (w) | |
# set to 1s (meaning equal weighting of all residuals): | |
0.5 * - N * (log(2 * pi) + 1 - log(N) + | |
log(sum(lm_result$residuals^2))) | |
# Here are several simpler formulations that | |
# give the same result: | |
# | |
# Formula for the log-likelihood of a linear model | |
# with x predicting y | |
# Source: | |
# https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/06/lecture-06.pdf | |
# ...page 2, equation (3) | |
# | |
# N = number of observations | |
# | |
# RSS = residual sum of squares = | |
# sum of (yi - (b0 + b1*xi))^2 | |
# where | |
# yi = ith response (observation) | |
# b0 = intercept | |
# b1 = slope | |
# xi = ith predictor | |
# | |
# stdev_squared = variance = RSS / N | |
# stdev = standard deviation = sqrt(stdev_squared) = sqrt(variance) | |
N = length(lm_result$residuals) | |
RSS = sum(lm_result$residuals^2) | |
stdev_squared = RSS/N | |
stdev = sqrt(RSS/N) | |
-N/2 * log(2*pi) - N*log(stdev) - 1/(2*stdev_squared)*RSS | |
# The formula works the same if you just use RSS | |
-N/2 * log(2*pi) - N*log(sqrt(RSS/N)) - 1/(2*RSS/N)*RSS | |
# R's logLik() function has the above formula re-arranged | |
# so that RSS appears only once: | |
0.5 * - N * (log(2 * pi) + 1 - log(N) + | |
log(RSS)) | |
logLik(lm_result) | |
# These formulations all give the same result! | |
####################################################### | |
# References | |
# | |
# Cosma Shalizi (2015). "Lecture 6: The Method of | |
# Maximum Likelihood for Simple Linear Regression." | |
# Lecture 6 from Modern Regression, 36-401, Fall 2015. | |
# https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/06/lecture-06.pdf | |
# | |
####################################################### | |
####################################################### | |
# References | |
# | |
# My SIR code relies heavily on code from the fitR package | |
# (Funk et al. 2019). | |
# | |
# Adam, David (2020). "UK has enough intensive care units for | |
# coronavirus, expert predicts." New Scientist, March 25, 2020. | |
# https://www-newscientist-com.ezproxy.auckland.ac.nz/article/2238578-uk-has-enough-intensive-care-units-for-coronavirus-expert-predicts/ | |
# | |
# Anderson, Charles (2020). "New Zealand coronavirus deaths during lockdown could | |
# be just 20, modelling suggests." The Guardian. 27 Mar 2020. | |
# https://www.theguardian.com/world/2020/mar/27/new-zealand-coronavirus-deaths-during-lockdown-could-be-just-20-modelling-suggests | |
# | |
# Dillane, Tom; Knox, Chris (2020), "Coronavirus Covid 19: New Zealand's | |
# best and worst prepared DHBs by elderly population and ICU beds." | |
# NZ Herald. April 5, 2020 | |
# https://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=12321390 | |
# | |
# Fefferman, Nina (2020). "The role of applied math in real-time pandemic | |
# response: How basic disease models work." Date: 3:30 EDT Tuesday, | |
# March 31, 2020. http://www.nimbios.org/webinars#fefferman | |
# | |
# Funk, Sebastian (2019). "Introduction to model fitting in R." http://sbfnk.github.io/mfiidd/introduction.html | |
# | |
# Funk, Sebastian; Camacho, Anton; Johnson, Helen; Minter, Amanda; O’Reilly, Kathleen; Davies, Nicholas (2019). "Model fitting and inference for infectious disease dynamics." Centre for the Mathematical Modelling of Infectious Diseases (CMMID), London School of Hygiene & Tropical Medicine. | |
# | |
# James, Alex; Hendy, Shaun C.; Plank, Michael J.; Steyn, Nicholas (2020). "Suppression and Mitigation | |
# Strategies for Control of COVID-19 in New Zealand." 25 March 2020. | |
# https://cpb-ap-se2.wpmucdn.com/blogs.auckland.ac.nz/dist/d/75/files/2017/01/Supression-and-Mitigation-New-Zealand-TPM-006.pdf | |
# | |
# Newton, Kate (2020). "The man modelling NZ's Covid-19 spread | |
# from his kitchen table." Radio New Zealand. 27 March 2020. | |
# https://www.rnz.co.nz/news/in-depth/412744/the-man-modelling-nz-s-covid-19-spread-from-his-kitchen-table | |
######################################################### | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment