nmatzke/week_06_exercise_v4.R

## week_06_exercise_v4.R

#######################################################
# BIOSCI220 WEEK 6 LAB EXERCISE
#######################################################


################################################
# Note for people using online R
################################################
#
# In previous labs, people who couldn't get R
# installed on a home machine could use the
# "R Snippets" website to run the lab code.
#
# However, I have discovered that R Snippets
# puts a 10-second limit on R runs, and our
# Maximum Likelihood searches in this lab
# will often take a little longer than this.
#
# Therefore, we have to use something else.
# The best option I have found is RStudio Cloud.
#
# I have run all of the lab code in RStudio Cloud,
# and it works just like regular RStudio.
#
# Also, it remembers your previous actions, so
# you no longer have to do the "copy-paste the
# whole script down to your question" business
# that R Snippets required. (However, this can
# still be a useful tactic for "resetting"
# everything, if you are getting errors due to
# having accidentally skipped some steps.)
#
# However, you do need to register for a free
# account. I have put instructions below.
#
################################################
# Instructions for using RStudio Cloud:
################################################
#
# 1. Go to https://rstudio.cloud/
#
# 2. Register for a free account, e.g. using a gmail address.
#
# 3. Wait awhile for your Workspace to load. Be patient.
#
# 4. Click the blue "New Project" button in the upper right.
#
# 5. Wait awhile for your new project to load. Be patient.
#
# 6. Click File -> New File -> R script
#
# 7. Paste your R script into this window. You should be able to
#    run the script's code by either (a) copy-pasting into
#    the Console, or (b) clicking the "Run" button.
#
# 8. Go to the Console. Type
#
# install.packages("deSolve")
#
# ...to get deSolve installed.
#
# 9. Type library(deSolve).  If no errors are returned, then
#    deSolve is working, and you can start the lab.
################################################

#######################################################
# Introduction
#######################################################
# In week 5, we got a lot of experience working with SIR models,
# and simulating possible futures, given different values of R0
# for different levels of lockdown, different timings of interventions,
# different numbers of starting infections, etc.
#
# However, many students came across one big difficulty -- how do we
# know what parameters to use?  In some cases, we can justify parameters
# with background knowledge.  For example, it appears, from medical
# observations, that COVID-19 patients are infectious for about 14 days.
# We don't necessarily need fancy inference methods for that.
#
# However, some parameters, like R0, are fairly uncertain -- in the
# media, you will see scientists quoting numbers between 2 and 6.
# Furthermore, the R0 will (hopefully!) change depending on how well
# a country's leadership and population enact the various control
# measures. We would like to *measure* this, if possible.
#
# Regarding the number of starting infections, we rarely know
# exactly how many infections were imported into a country at
# the start of an epidemic. We can guesstimate based on the number
# of early cases, but we might want to use a more formal method.
#
# Furthermore, we would like to have some measure of uncertainty
# about these parameters. Are we highly confident that R0 is 3.0,
# plus or minus 0.1?  Or are we highly uncertain, and we can only
# put R0 between 2 and 6?
#
# Answers to these questions can be found, pretty easily, with
# MODEL-BASED INFERENCE.
#
# In this lab, we will perform some model-based inference on SIR
# models. This will give you a decent taste of what real-life
# model-fitting is like -- although I hasten to add that we are
# using quite simple models, and professional epidemiologists
# would use somewhat more complex models to help fix parts of
# our models that are unrealistic (for example, our models allow
# there to be 0.0639 of an infection, whereas in real life there
# is either 0 or 1 infections).
#
# In a quick appendix, I connect up the model-based, Maximum
# Likelihood inference we have studied, to Ordinary Least Squares
# (OLS) or linear regression, which you probably remember
# from introductory statistics.  It turns
# out that a great deal of statistics is tied to likelihood,
# Maximum Likelihood, and model-based inference in one
# form or another, although often the connections are not
# always apparent in introductory courses.
#######################################################


#######################################################
# Likelihood of a case count
#######################################################
# In week 4, we introduced the concept of likelihood, where
# "likelihood" means the probability of the data, given a model.
# We can write this definition like this:
#
# L = P(data | model)
#
# The "|" means "given".
#
# Sometimes, we write the likelihood as a function of the model,
# and of the parameters. The list of parameters is often abbreviated
# as the Greek symbol "theta".  We cannot use greek symbols in
# R scripts, which have to be plain-text, so we will just write
# "theta".
#
# L = P(data | model, theta)
#
# ...this just means that the likelihood (L) is the probability (P)
# of the data, given a model and particular parameter values stored
# in theta.
#
#
# As we saw in week 4, we can try to learn what parameter values
# maximize the probability of the data. This is the method of
# "Maximum Likelihood" (ML), and is one common method to attempt
# to estimate likely values of parameters. It is commonly used
# in hundreds of different scientific fields.
#
# Let's set up a ML analysis, in order to fit SIR model parameters
# to real data.

#######################################################
# NOTE: AS IN PREVIOUS WEEKS, IT IS CRUCIAL TO COPY-
# PASTE *ALL* OF THE CODE, WITHOUT SKIPPING STEPS.
#
# IT IS A BIT TEDIOUS, COMPARED TO E.G. LOADING FUNCTIONS
# FROM A WEBSITE, BUT AT LEAST IT KEEPS US ALL
# ON THE SAME PAGE, AND MAKES THE LAB PORTABLE
#
# In R on a computer, you can run the code in chunks.
#
# This should also be true on RStudio Cloud.
#
#######################################################


# We are using the deSolve library again; see Week 5 for install instructions
library(deSolve)

# Let's start with the New Zealand COVID-19 "active cases"
# data.  I acquired this data from:
#
# Source: Hancock, Faith (2020). Covid-19 in NZ. NZ Herald, April 13, 2020.
# https://www.newsroom.co.nz/2020/03/27/1101216/covid-19-in-nz-the-numbers
#
# ...which gets its info from the Ministry of Health. These are
# basically the numbers that are announced by the government
# in a press conference at 1 pm each day.
#

# I have the data saved as a text file, and normally I would
# load it with:

################################################
## EXAMPLE CODE FOR LOADING A TEXT FILE IN R
################################################
## Set the working directory (wd) to wherever you have the file saved:
# getwd()    # getwd = "get working directory", to check what
#                      working directory you are in

## You would type the full path for the directory containing your file here:
# wd = "/drives/GDrive/__classes/BIOSCI220/__week06/"
#
## Then, set R to make this your working directory
# setwd(wd)  # setwd = "set working directory:
# getwd()    # getwd = "get working directory", to check what
#                      working directory you are in

## Load the file into table "dat"
## fn = filename
# fn = "2020-04-13_NZ_daily_cases.txt"
# dat = read.table(file=fn, header=TRUE, stringsAsFactors=FALSE)
# dat
# dim(dat)

################################################
# WHY WE AREN'T TRYING TO LOAD FILES IN THIS WORKSHEET
#
# Getting everyone to save the file somewhere they can
# find it on their computer systems, then getting everyone to
# successfully reference that exact file path in their R
# script, is always a huge pain. This is why people say,
# "the hardest part about teaching R is getting students to
#  successfully load their data."
#
# It would be even harder to do this while teaching remotely,
# so we'll "cheat", and just have the data pasted straight
# into the R script.
#
# In real life R analyses, you will have to put some effort
# into saving your data files into a directory, setting
# R's working directory with setwd(), and loading
# the data file with a function like read.table().
#
################################################


datafile_as_a_string = '
# Source: Hancock, Faith (2020). Covid-19 in NZ. NZ Herald, April 13, 2020.
# https://www.newsroom.co.nz/2020/03/27/1101216/covid-19-in-nz-the-numbers
Date	Total_cases	Active_cases
2020-02-28	1	1
2020-02-29	1	1
2020-03-01	1	1
2020-03-02	1	1
2020-03-03	1	1
2020-03-04	2	2
2020-03-05	3	3
2020-03-06	3	3
2020-03-07	5	5
2020-03-08	5	5
2020-03-09	5	5
2020-03-10	5	5
2020-03-11	5	5
2020-03-12	5	5
2020-03-13	5	5
2020-03-14	6	6
2020-03-15	8	8
2020-03-16	8	8
2020-03-17	12	12
2020-03-18	20	20
2020-03-19	28	28
2020-03-20	39	39
2020-03-21	52	52
2020-03-22	66	66
2020-03-23	102	102
2020-03-24	155	143
2020-03-25	205	183
2020-03-26	283	256
2020-03-27	368	331
2020-03-28	451	401
2020-03-29	514	457
2020-03-30	589	525
2020-03-31	647	572
2020-04-01	708	625
2020-04-02	797	704
2020-04-03	868	764
2020-04-04	950	822
2020-04-05	1039	882
2020-04-06	1106	929
2020-04-07	1160	918
2020-04-08	1210	927
2020-04-09	1239	921
2020-04-10	1283	908
2020-04-11	1312	886
2020-04-12	1330	855
2020-04-13	1349	798
2020-04-14	1366	729
2020-04-15	1386	649
2020-04-16	1401	622
2020-04-17	1409	582
'

# Here, we will load the data by reading in the text string we
# created above
dat = read.table(text=datafile_as_a_string, header=TRUE, stringsAsFactors=FALSE)

# Look at the full data:

# Look at the head/tail of the data
head(dat)
tail(dat)

# Looking at dat, please answer the questions below:
#
# QUESTION #1: How many rows are in the dataset loaded into 'dat'?
#
# QUESTION #2: How many columns are in the dataset loaded into 'dat'?
#
# QUESTION #3: What is the first day data are reported? Put your answer in
#              YYYY-MM-DD format.
#
# QUESTION #4: What is the last day data are reported? Put your answer in
#              YYYY-MM-DD format.
#

# Hint: You can also at the dimensions of dat, with dim().
# R *ALWAYS* refers to rows, then columns
dim(dat)

# Or, you can use "nrow" and "ncol"
nrow(dat)
ncol(dat)


#######################################################
# Let's plot this data
#######################################################

# Let's make x-values just be 1-46
xvals = 1:nrow(dat)
plot(x=xvals, y=dat$Total_cases, col="grey70", pch=19, xlab="Day", ylab="Number of cases")
points(x=xvals, y=dat$Active_cases, col="red", pch=19)
legend(x="topleft", legend=c("Total cases", "Active cases"), col=c("grey50", "red"), pch=c(19,19))


# QUESTION #5: Why are "Total cases" and "Active cases" the
# same at the start, but then diverge later in the plot?


# Distinguishing "classes" (different data types) in R
#
# It might be nice to make this plot have dates on the
# x-axis.
#
# What happens if we type this?
#
# plot(x=dat$Date, y=dat$Total_cases, col="grey70", pch=19, xlab="Day", ylab="Number of cases")
#
# (I am commenting this out, as it will "crash" peoples'
#  R-Snippets runs. Uncomment it to try it.)
#
# The error message returned is:
#
# Error in plot.window(...) : need finite 'xlim' values
#
# This error occurs because the x-values are not numbers
# or dates, they are character strings representing
# dates, and character strings don't provide a way for
# R to determine the finite numbers determining the
# limits of the x-axis.
#
# Although the dates in dat$Date look like dates to us humans,
# R just sees them as character strings. In R, character strings
# have quote marks around them. You can see this by looking at a
# single value from the column:

dat$Date[1]  # the first value in column "Date"
dat$Date[2]  # the second value in column "Date"
dat$Date[3]  # the third value in column "Date"


# We can do the plot with calendar dates. To do this,
# we have to convert the "Dates" column
# - from "character" format (meaning a string of text)
# - to "Date" format (meaning a number R will recognize as a calendar date)
# Convert the dates to R's "Date" format, then order d by date
#
# Functions like "as.character", and "as.Date", can convert between
# these formats (also known as data types or variable classes).
#
# You can check the format/class of a variable or column with the class() function.

# The initial class for column "Date"
class(dat$Date)

# QUESTION #6: What was the initial class of dat$Date? (put the exact text returned,
#             i.e. the text inside the quotes, but not the quotes themselves)

dat$Date = as.Date(dat$Date, format="%Y-%m-%d")

# The new class for column "Date"
class(dat$Date)

# QUESTION #7: What was the new class of dat$Date? (put the exact text returned,
#             i.e. the text inside the quotes, but not the quotes themselves)


# Now we can plot the data, with dates on the x-axis:
plot(x=dat$Date, y=dat$Total_cases, col="grey70", pch=19, xlab="Day", ylab="Number of cases")
points(x=dat$Date, y=dat$Active_cases, col="red", pch=19)
legend(x="topleft", legend=c("Total cases", "Active cases"), col=c("grey70", "red"), pch=c(19,19))


# Compare this plot to the "Total and active cases plot" available here:
# https://www.newsroom.co.nz/2020/03/27/1101216/covid-19-in-nz-the-numbers
#
# QUESTION #8: What is different about the NZ Herald plot, compared to
# your R plot?
#
# QUESTION #9: What would you have to do to make your R plot match the NZ Herald plot?
#


#######################################################
# Maximum likelihood inference of R0,
# pre- and post-lockdown
#######################################################

# You will recall the several steps required to do
# maximum likelihood inference:
#
# PART 1. Be able to calculate the likelihood of a single
#         datum, given a model and parameter values.
#
# PART 2. Calculate the likelihood for all data points under
#         a model+parameters.
#
# NOTE: For raw likelihoods, this means MULTIPLYING the likelihoods
#     of the individual datapoints together. (In other words,
#     getting the product of all of the individual likelihoods.)
#     Remember the coin-flipping example: if the likelihood of
#     a single heads is 0.7, then the likelihood of two Heads is
#     0.7 * 0.7, or 0.7^2.
#
#    For log-likelihoods, we ADD UP the log-likelihoods of
#     all of the individual data. This is true because adding
#     logged likelihoods is equivalent to multiplying raw likelihoods.
#
#    Proof of this:
#
#    If Prob_heads is 0.7, then the raw likelihood of 2 Heads is:

0.7 * 0.7

#    If Prob_heads is 0.7, then the log-likelihood of 2 Heads is:

log(0.7) + log(0.7)

# ...which is the same as...
log(0.7 * 0.7)
log(0.49)

# ...so, the log operation turns multiplication into addition.
#
# (The log operation also turns power operations into multiplication,
# so these are equivalent:

log(0.7^2)
2*log(0.7)

# ...remember, whenever you find yourself confused about some bit of
# math like this, just experiment with it in R to figure it out.)

#
# PART 3. Optimize the parameters in some way, in order
#         to find the parameter values that maximize
#         the likelihood.
#
# The "maximum likelihood parameter values" are the
# parameter values that maximize the probability of
# the observed data. They are the "best fit" parameters.
# The method of Maximum Likelihood is one very common,
# very simple method of trying to learn parameter
# values from the data. Learning parameters from the
# data is also called "estimation" or "inference".
# A vast amount of statistics is, deep down, based on
# Maximum Likelihood methods, even if they are
# (unfortunately) not usually emphasized in
# introductory classes.
#
#######################################################
# Here, we will use ML to infer R0 ("R-naught") for
# the NZ Covid epidemic.
#
# PLEASE NOTE that this is a practice exercise, using
# simplified models. "Real-life" epidemiology models
# would be somewhat more complex, and allow, for
# example, multiple different introductions of Covid
# cases on multiple different days.
#
# I think the models we are using are very instructive --
# I'm just saying that you shouldn't go around claiming
# you are a professional epidemiologist based on this
# exercise.
#######################################################


#######################################################
# Part 1. Be able to calculate the likelihood of a single
#         datum, given a model and parameter values.
#######################################################
# Let's start with calculating the likelihood of a
# single data point.
#
# We will do this with the "likelihood_single_point()"
# function. Following Funk (2019), we will calculate
# the probability of a particular observed
# count with a Poisson distribution. The Poisson
# distribution is often used for count data,
# for several reasons:
#
# * It has only one parameter: lambda, which is equal
#   to both the expected mean and the variance
# * It has a minimum of zero (so it is impossible
#   to produce negative counts).
#
# So, all that likelihood_single_point() does is
# take a model projection for a particular day
# ("model.point"), set that as the lambda (expected
# mean) of the the dpois() function, and then get
# the probability of the observed case count
# ("data.point") for the day.
#
# The "log=TRUE" statement means that the
# log-likelihood is returned, instead of the
# regular or "raw" likelihood.
#


# Let's load this function, then use it to
# calculate a few likelihoods for single
# points.

# Function definition for: likelihood_single_point
# This function returns the likelihood or log-likelihood of a
# single day's case count (data.point), given a prediction
# for that day from an SIR model (model.point)
likelihood_single_point <- function(data.point, model.point, log=TRUE)
	{
	# Get the probability density of the count, assuming
	# the observed count is a poisson process with an
	# average value matching the true process.
	#
	# This allows for variation, e.g. if a case is reported
	# days after it occurs.
	likelihood = dpois(x=data.point, lambda=model.point, log=log)

	# Return the likelihood value
	return(likelihood)
	} # END of likelihood_single_point


# Let's say that a particular set of model parameters
# predict that, on average, 10 active cases will
# be observed. What is the likelihood for an observation
# of exactly 10 cases?

# What is the likelihood for an observation
# of exactly 10 cases?
likelihood_single_point(data.point=10, model.point=10, log=FALSE)

# What is the log-likelihood for an observation
# of exactly 10 cases?
likelihood_single_point(data.point=10, model.point=10, log=TRUE)


# Now that you have seen how this code works, please answer the questions
# below.

# QUESTION #10: What is the log-likelihood for an observation of
# 8 active cases, if the model predicts 10 cases?


# QUESTION #11: What is the log-likelihood for an observation of
# 6 active cases, if the model predicts 10 cases?


# QUESTION #12: What is the log-likelihood for an observation of
# 4 active cases, if the model predicts 10 cases?


# QUESTION #13: Does the log-likelihood get higher or lower as
# the number of observed cases gets further from the model
# projection?


# NOTE: FOR QUESTIONS #14-15, MAKE SURE YOU READ AND UNDERSTAND
# THE NOTES ABOVE ON MULTIPLYING RAW LIKELIHOOD, VERSUS SUMMING
# LOG-LIKELIHOOD:
#
# "For raw likelihoods, this means MULTIPLYING the likelihoods
#     of the individual datapoints together..."
# "For log-likelihoods, we ADD UP the log-likelihoods of
#     all of the individual data..."


# QUESTION #14: What is the total log-likelihood of the 3
# observed counts (8, 6, and 4), assuming the model prediction
# was 10 for each?


# QUESTION #15: What is the total log-likelihood of the 3
# observed counts (8, 6, and 4), assuming the model prediction
# was 6 for each?


# QUESTION #16: Which model prediction yields a higher
# total log-likelihood for the 3 observed counts, the
# prediction of 6, or 10?


###############################################################
# Part 2. Calculate the likelihood for all data points under
#         a model+parameters, and multiply all of these
#         likelihoods together to get their product.
#         (Or, more commonly, add together the logs of the
#          likelihoods, to get their sum.)
###############################################################


# Rather than summing log-likelihoods by hand, we can just
# have another function.  This way, we can take a series
# of observed counts, and a model prediction, and get the
# total log-likelihood easily.
#
# I have done this in likelihood_time_series(), which
# just runs the likelihood_single_point() function across
# each pair of data.points and model.points.

likelihood_time_series <- function(data.points, model.points, log=TRUE)
	{
	# Error check - return a stop() message if check fails
	if (length(data.points) != length(model.points))
		{ stop("ERROR in likelihood_time_series(): the lengths of data.points and model.points must match.") }

	# Calculate the likelihoods of the data.points
	likelihoods = mapply(FUN=likelihood_single_point, data.point=data.points, model.point=model.points, MoreArgs=list(log=log))

	# If you calculated log-likelihoods, then sum them
	# If you calculated raw likelihoods, then multiply them
	if (log == TRUE)
		{
		total_likelihood = sum(likelihoods)
		} else {
		total_likelihood = prod(likelihoods)  # prod=take the product
		} # END if/else statement: if (log == TRUE)

	# Return that value
	return (total_likelihood)
	} # END of likelihood_time_series


# We can use this function as another way to answer our question, above.
likelihood_time_series(data.points=c(8,6,4), model.points=c(10,10,10), log=TRUE)


# Now, in order to fit a series of observations to a
# model projection, we need the functions we used last
# week to make model projections. As before, load
# these functions by pasting them in ALL AT ONCE,
# i.e. from the function name, all the way through to the
# close of the function at the "}" character.


#######################################################
# Functions from Week 5
#######################################################
library(deSolve)
SIR_ode <- function(time, state, parameters)
	{
	# The above inputs are:
	# time = a time point
	# state = an R list of the 3 state values, i.e. the
	#         population sizes of S, I, and R
	# parameters = an R list of the parameters R0 and D_inf

	# Convert the input parameters "R0" and "D_inf"
	# into the rates "beta" and "nu"
	beta <- parameters[["R0"]] / parameters[["D_inf"]]
	nu <- 1 / parameters[["D_inf"]]

	# Extract the current values of the states S, I, and R
	S <- state[["S"]]
	I <- state[["I"]]
	R <- state[["R"]]

	# Calculate N, the total of S+I+R,
	# i.e. the population size
	N <- S + I + R

	# Write out the system of
	# Ordinary Differential Equations (ODEs)
	dS <- -beta * I/N * S
	dI <- beta * I/N * S - nu * I
	dR <- nu * I

	# Return the current rates of change of S, I, and R:
	return(list(c(dS, dI, dR)))
	} # End of function SIR_ode

simulate_SIR <- function(theta, init.state, times)
	{
	trajectory <- data.frame(ode(y = init.state,
                times = times,
                func = SIR_ode,
                parms = theta,
                method = "ode45"))
	return(trajectory)
	} # End of simulate_SIR()

simulate_SIR_changes <- function(thetas_states_df, times)
	{
	# Loop through the rows
	trajectory = NULL
	for (rn in 1:nrow(thetas_states_df))
		{
		if (rn == 1)
			{
			init.state = c(S=thetas_states_df$S[rn], I=thetas_states_df$I[rn], R=thetas_states_df$R[rn])
			theta = list(R0=thetas_states_df$R0[rn], D_inf=thetas_states_df$D_inf[rn])
			} else {
			last_row = trajectory[nrow(trajectory),]
			init.state = c(S=last_row$S+thetas_states_df$S[rn], I=last_row$I+thetas_states_df$I[rn], R=last_row$R+thetas_states_df$R[rn])
			theta = list(R0=thetas_states_df$R0[rn], D_inf=thetas_states_df$D_inf[rn])
			}

		# Read in the parameters for the initial, versus later, time-bins
		if (rn >= nrow(thetas_states_df))
			{
			start_time = thetas_states_df$time[rn]
			end_time = max(times)
			} else {
			start_time = thetas_states_df$time[rn]
			end_time = thetas_states_df$time[rn+1]
			}

		TF1 = times >= start_time
		TF2 = times <= end_time
		TF = (TF1 + TF2) == 2
		tmp_times = times[TF]

		tmp_trajectory = simulate_SIR(theta=theta, init.state=init.state, times=tmp_times)

		if (rn == 1)
			{
			trajectory = rbind(trajectory, tmp_trajectory)
			} else {
			trajectory_minus_last_state = trajectory[-nrow(trajectory),]
			trajectory = rbind(trajectory_minus_last_state, tmp_trajectory)
			} # if (rn == 1)
		} # END for (rn in 1:nrow(thetas_states_df))
	return(trajectory)
	} # End simulate_SIR_changes()


#######################################################
# Code to plot a model projection, observed counts,
# and the total log-likelihood
#######################################################
# Let's get our active case counts into a variable, "data.points"
data.points = dat$Active_cases

# ...and, let's make a model projection like we did last week:
times = 1:length(data.points)
time = c(1, 38) # NZ Level 4 lockdown went in on March 26, day 24 since 1st case;
                # Assume 24+14=38 days to see effect.
R0 = c(3.0, 0.3)
D_inf = c(14.0, 14.0)
S = c(5000000, 0)
I = c(1, 0)
R = c(0, 0)
thetas_states_table = cbind(time, R0, D_inf, S, I, R)
thetas_states_df = as.data.frame(thetas_states_table, stringsAsFactors=FALSE)

trajectory = simulate_SIR_changes(thetas_states_df=thetas_states_df, times=1:length(data.points))

# Plot the results
maxy = max(max(trajectory$I), max(data.points))
xvals = c(trajectory$time, times)
yvals = c(trajectory$I, data.points)
plot(x=xvals, y=yvals, pch=".", col="white", xlim=c(0, max(trajectory$time)), ylim=c(0, maxy), xlab="Day", ylab="Number of individuals")
lines(x=trajectory$time, y=trajectory$I, lwd=3, col="firebrick2")
points(times, data.points)
legend(x="topleft", legend=c("Active COVID-19 case count", 'Model projection of "I" (Infected)'), lty=c("blank", "solid"), lwd=c(1,3), pch=c(1, "."), col=c("black","firebrick2"), cex=0.8)

# Print the total log-likelihood in the plot title
# NOTE how we take the model projection of # infectious, trajectory$I, and
# store that in model.points:
model.points = trajectory$I
total_log_likelihood = likelihood_time_series(data.points=data.points, model.points=model.points, log=TRUE)
titletxt = paste0("Total log-likelihood = ", round(total_log_likelihood, 2))
title(titletxt)
#######################################################


# Now, that model projection (red curve) doesn't look very close to your
# observations (the circles), does it?  Can we do better?
#
# Let's spend a few minutes changing the two "R0" values, and/or the
# first "I" value, and re-running the code above.
#
# For example, I've made another attempt:
#

#######################################################
# Code to plot a model projection, observed counts,
# and the total log-likelihood
#######################################################
# Let's get our active case counts into a variable, "data.points"
data.points = dat$Active_cases

# ...and, let's make a model projection like we did last week:
times = 1:length(data.points)
time = c(1, 38) # NZ Level 4 lockdown went in on March 26, day 24 since 1st case;
                # Assume 24+14=38 days to see effect.
R0 = c(3.5, 0.4)
D_inf = c(14.0, 14.0)
S = c(5000000, 0)
I = c(2, 0)
R = c(0, 0)
thetas_states_table = cbind(time, R0, D_inf, S, I, R)
thetas_states_df = as.data.frame(thetas_states_table, stringsAsFactors=FALSE)

trajectory = simulate_SIR_changes(thetas_states_df=thetas_states_df, times=1:length(data.points))

# Plot the results
maxy = max(max(trajectory$I), max(data.points))
xvals = c(trajectory$time, times)
yvals = c(trajectory$I, data.points)
plot(x=xvals, y=yvals, pch=".", col="white", xlim=c(0, max(trajectory$time)), ylim=c(0, maxy), xlab="Day", ylab="Number of individuals")
lines(x=trajectory$time, y=trajectory$I, lwd=3, col="firebrick2")
points(times, data.points)
legend(x="topleft", legend=c("Active COVID-19 case count", 'Model projection of "I" (Infected)'), lty=c("blank", "solid"), lwd=c(1,3), pch=c(1, "."), col=c("black","firebrick2"), cex=0.8)

# Print the total log-likelihood in the plot title
# NOTE how we take the model projection of # infectious, trajectory$I, and
# store that in model.points:
model.points = trajectory$I
total_log_likelihood = likelihood_time_series(data.points=data.points, model.points=model.points, log=TRUE)
titletxt = paste0("Total log-likelihood = ", round(total_log_likelihood, 2))
title(titletxt)
#######################################################


# QUESTION #17: Spend about 10 minutes re-running the
# code chunk above, changing the two "R0" values,
# and/or the first "I" value.
#
# Answer these questions in the answer box:
#
# What is the highest log-likelihood you found?
#
# What parameter values did you use to find this
# log-likelihood (lnL)?
#
# How hard was it to find parameter values that
# made a curve that matched the data?
#
# I will not mark down for finding the "wrong" answer, this
# is about getting an intuition about how
# changing parameters -> changes model projections ->
# changes fit between line & data -> changes the
# log-likelihood.
#
# Report your answer like this:
# The highest log-likelihood (lnL) I found was lnL = -12500.74
#
# What parameter values did you use to find this
# log-likelihood (lnL)?
#
# R0 = c(3.0, 0.3)
# I = c(1, 0)
#
# How hard was it to find parameter values that
# made a curve that roughly matched the data?
#
# I found it very easy/very difficult to find
# the parameter values because...
#


##########################################################
# Part 3. Optimize the parameters in some way, in order
#         to find the parameter values that maximize
#         the likelihood.
##########################################################

# Hopefully, you just saw that "optimization by hand"
# is pretty difficult, although possible if you
# had no other options.
#
# Fortunately, we have other options!
#
# The function below, "params_to_likelihood_v1()", will help us
# automate this process of finding good parameter values, by:
#
# (1) taking the free parameters you want to infer ("params")
# (2) loading them into simulate_SIR_changes() to get a trajectory (model.points)
# (3) calculating the likelihood of the observed data.points against the model.points
#
# It takes parameters, simulates the model projection,
# then calculates the total log-likelihood - this just
# combines everything we have done separately above!
#
# It takes as inputs:
#
# params - starting values for the parameters that you want to infer
# data.points - the observed active cases, as before
# thetas_states_df - the data.frame holding time,R0, D_inf, S, I, R
# delay_val - if you think there is a delay between an intervention,
#             and seeing the effects, put it here. Default is 14.
# printvals - if TRUE, prints the parameter values you are trying.
#
params_to_likelihood_v1 <- function(params, data.points, thetas_states_df, delay_val=14, printvals=TRUE, makePlots=FALSE)
	{
	# The times are just the length of the data.points
	times = 1:length(data.points)

	# (1) input the free parameters
	thetas_states_temp = thetas_states_df
	TF = thetas_states_temp == "free"
	thetas_states_temp[TF] = params

	# Error check
	if ((max(thetas_states_temp$time)+delay_val) > max(times))
		{
		stop("ERROR in params_to_likelihood_v1: there is a 'thetas_states_df$time'+delay_val greater than found in 'times'")
		}

	# Make sure everything in thetas_states_df is numeric, instead of character
	for (i in 1:ncol(thetas_states_temp))
		{
		thetas_states_temp[,i] = as.numeric(thetas_states_temp[,i])
		}

	# Ad2 the delay (allows time for interventions to show up
	# in the detections)
	thetas_states_temp$time[2:length(thetas_states_temp$time)] = thetas_states_temp$time[2:length(thetas_states_temp$time)] + delay_val

	# Calculate the trajectory given the parameters,
	# input into model.points
	trajectory = simulate_SIR_changes(thetas_states_df=thetas_states_temp, times)
	model.points = trajectory$I


	lnL = likelihood_time_series(data.points, model.points, log=TRUE)

	if (is.finite(lnL)== FALSE)
		{
		lnL=-10e100
		}

	# Print, if desired
	if (printvals == TRUE)
		{
		print(thetas_states_temp)
		cat("log-likelihood = ", lnL, "\n", sep="")
		}
	# Make plots, if desired
	if (makePlots == TRUE)
		{
		plot_trajectory_lines(trajectory, theta=NULL, thetas_states_df=thetas_states_df, title_prefix="Plot #5:")
		}
	return(lnL)
	} # END params_to_likelihood_v1


# We can use params_to_likelihood_v1(), in conjunction with with
# the ML optimization function optim() (we used this in Week 4!),
# to try lots of parameter values, and gradually find the
# parameters that maximize the likelihood.
#
# Let's try it!
#
# The major difference from what we did above is that, after
# setting up thetas_states_df, we write "free" in certain
# cells of the table.  These "free" parameters are then filled
# in with the values from "params" -- and the optim() function
# will change the values in params, in an attempt to find the
# best parameters.
#

#######################################################
# Fitting a 2-regime model
#######################################################
time = c(1, 24) # NZ Level 4 lockdown went in on March 26, day 24 since 1st case
R0 = c(3.0, 0.3)
D_inf = c(14.0, 14.0)
S = c(5000000, 0)
I = c(1, 0)
R = c(0, 0)
thetas_states_table = cbind(time, R0, D_inf, S, I, R)
thetas_states_df = as.data.frame(thetas_states_table, stringsAsFactors=FALSE)

# Make some parameters into "free" parameters, to be inferred
thetas_states_df$R0 = c("free", "free")
thetas_states_df$I[1] = "free"
thetas_states_df

# Run the params_to_likelihood_v1() function once, to see your starting likelihood
data.points = dat$Active_cases
params = c(2.5, 1.0, 0.03) # starting parameter values
params_to_likelihood_v1(params, data.points, thetas_states_df, delay_val=14, printvals=TRUE, makePlots=FALSE)

# Running the Maximum Likelihood search with the optim() function.
# LOOK AT THE OUTPUT THAT PRINTS TO SCREEN!!
delay_val = 14
ML_results = optim(par=params, fn=params_to_likelihood_v1, data.points=data.points, thetas_states_df=thetas_states_df, delay_val=delay_val, printvals=TRUE, method="L-BFGS-B", lower=0.0, control=list(fnscale=-1))

# Graphing the ML model
# Take the learned parameters from "ML_results", put them
# into a theta_states data.frame for simulation and plotting
thetas_states_ML = thetas_states_df
TF = thetas_states_ML == "free"
thetas_states_ML[TF] = ML_results$par
for (i in 1:ncol(thetas_states_ML))
	{ thetas_states_ML[,i] = as.numeric(thetas_states_ML[,i]) }
thetas_states_ML$time[2:length(thetas_states_ML$time)] = thetas_states_ML$time[2:length(thetas_states_ML$time)] + delay_val


# Plot the results
trajectory = simulate_SIR_changes(thetas_states_df=thetas_states_ML, times=1:length(data.points))
maxy = max(max(trajectory$I), max(data.points))
xvals = c(trajectory$time, times)
yvals = c(trajectory$I, data.points)
plot(x=xvals, y=yvals, pch=".", col="white", xlim=c(0, max(trajectory$time)), ylim=c(0, maxy), xlab="Day", ylab="Number of individuals")
lines(x=trajectory$time, y=trajectory$I, lwd=3, col="firebrick2")
points(times, data.points)
legend(x="topleft", legend=c("Active COVID-19 case count", 'ML-fitted projection of "I" (Infected)'), lty=c("blank", "solid"), lwd=c(1,3), pch=c(1, "."), col=c("black","firebrick2"), cex=0.8)

titletxt = paste0("ML fit, 3-regime model\nTotal log-likelihood = ", round(ML_results$value, 2))
title(titletxt)


# Congratulations!  You have run a Maximum Likelihood inference
# on New Zealand's Covid-19 active case count data.

# Using the plot, and the variables & screen output above, please answer
# the questions below.
#
# QUESTION #18. What was the log-likelihood obtained for the
# Maximum Likelihood parameters?  Provide the answer to 2
# decimal places.
#
# QUESTION #19. Was this lnL higher than the one you found
# when you searched the parameter space "by hand"?
#
# QUESTION #20. During the ML search, the thetas_states table,
# and the log-likelihood, were printed to screen at each
# step of the search.  Which option best describes what this
# looks like?
#
# NOTE: FOR QUESTIONS 21-24, AND ANY OTHER QUESTIONS ASKING ABOUT
# THE "first", "second", etc. R0 parameters, see lecture slide
# #101. And this example:
#
# If your thetas_states_ML table looks like this:
#
#   time       R0 D_inf     S        I R
# 1    1 2.574564    14 5e+06 1.028189 0
# 2   38 1.007426    14 0e+00 0.000000 0
#
# ...then:
# the 1st R0 parameter is: 2.574564
# the 2nd R0 parameter is: 1.007426
# the starting number of Infectious is: 1.028189
#
#
# QUESTION #21. What was the first R0 parameter, as inferred by ML?
# (This parameter was operating from Days 1-36.)
#
# QUESTION #22. What was the second R0 parameter, as inferred by ML?
# (This parameter was operating from Day 37 to the end.)
#
# QUESTION #23. What was the starting number of Infected, as inferred by ML?
# (This parameter was operating from Day 37 to the end.)
#
# QUESTION #24. Is this "starting number of Infected" a plausible number in
# real life, or is this more of a "nuisance parameter" in our analysis?
#


# You may have noticed that even our ML-fitted model parameters
# did not produce an excellent match to our data points.
#
# For example, the projected curve first rises faster than
# our observed case counts, but then it gets lower than the
# counts, and then it gets higher than the counts again.
#
# This is probably the result of trying to fit a single
# R0 parameter, which will produce a curve increasing
# exponentially by a constant proportion, to a dataset
# where, possibly, multiple things are going on.
#
# For example, what if, before the Level 4 lockdown happened,
# New Zealanders were already reducing transmission of the virus,
# for example through:
#
# * Level 2 and Level 3 restrictions
# * Increased handwashing & use of sanitizer
# * Stopping handshaking etc.
#
# This suggests that we might end up with a better-fitting
# model, and better parameter estimates, if we fit a
# 3-regime model, instead of a 2-regime model.
#
# Let's try that!


#######################################################
# Fitting a 3-regime model
#######################################################
time = c(1, 16, 24) # NZ Level 4 lockdown went in on March 26, day 24 since 1st case
R0 = c(3.0, 1.0, 0.3)
D_inf = c(14.0, 14.0)
S = c(5000000, 0)
I = c(1, 0)
R = c(0, 0)
thetas_states_table = cbind(time, R0, D_inf, S, I, R)
thetas_states_df = as.data.frame(thetas_states_table, stringsAsFactors=FALSE)

# Make some parameters into "free" parameters, to be inferred
thetas_states_df$R0 = c("free", "free", "free")
thetas_states_df$I[1] = "free"
thetas_states_df

# Run the params_to_likelihood_v1() function once, to see your starting likelihood
data.points = dat$Active_cases
params = c(2.5, 1.0, 0.3, 0.03) # starting parameter values
params_to_likelihood_v1(params, data.points, thetas_states_df, printvals=TRUE, makePlots=FALSE)

# Running the Maximum Likelihood search with the optim() function.
# LOOK AT THE OUTPUT THAT PRINTS TO SCREEN!!
delay_val = 14
ML_results = optim(par=params, fn=params_to_likelihood_v1, data.points=data.points, thetas_states_df=thetas_states_df, delay_val=delay_val, printvals=TRUE, method="L-BFGS-B", lower=0.0, control=list(fnscale=-1))

# Graphing the ML model
# Take the learned parameters from "ML_results", put them
# into a theta_states data.frame for simulation and plotting
thetas_states_ML = thetas_states_df
TF = thetas_states_ML == "free"
thetas_states_ML[TF] = ML_results$par
for (i in 1:ncol(thetas_states_ML))
	{ thetas_states_ML[,i] = as.numeric(thetas_states_ML[,i]) }
thetas_states_ML$time[2:length(thetas_states_ML$time)] = thetas_states_ML$time[2:length(thetas_states_ML$time)] + delay_val


# Plot the results
trajectory = simulate_SIR_changes(thetas_states_df=thetas_states_ML, times=1:length(data.points))
maxy = max(max(trajectory$I), max(data.points))
xvals = c(trajectory$time, times)
yvals = c(trajectory$I, data.points)
plot(x=xvals, y=yvals, pch=".", col="white", xlim=c(0, max(trajectory$time)), ylim=c(0, maxy), xlab="Day", ylab="Number of individuals")
lines(x=trajectory$time, y=trajectory$I, lwd=3, col="firebrick2")
points(times, data.points)
legend(x="topleft", legend=c("Active COVID-19 case count", 'ML-fitted projection of "I" (Infected)'), lty=c("blank", "solid"), lwd=c(1,3), pch=c(1, "."), col=c("black","firebrick2"), cex=0.8)

titletxt = paste0("ML fit, 3-regime model\nTotal log-likelihood = ", round(ML_results$value, 2))
title(titletxt)


# QUESTION #25. Visually, in the 3-regime model, does it look like
# this curve fits the data better?
#
# QUESTION #26. In the 3-regime model, what was the log-likelihood obtained
# for the Maximum Likelihood parameters?  Provide the answer to 2
# decimal places.
#
# QUESTION #27. In the 3-regime model, is the lnL higher than the one you found
# for the 2-regime model?
#
# QUESTION #28. What was the first R0 parameter, as inferred by ML?
# (This parameter was operating from Days 1-30.)
#
# QUESTION #29. What was the second R0 parameter, as inferred by ML?
# (This parameter was operating from Days 31-38.)
#
# QUESTION #30. What was the third R0 parameter, as inferred by ML?
# (This parameter was operating from Day 39-end.)
#
# QUESTION #31. What was the starting number of Infected, as inferred by ML?
# (note: as we saw before, this is a nuisance parameter, not intended
#  as a real estimate in this simplified model)
#
# QUESTION #32. What do our inferred R0 values suggest about the
# coronavirus's natural spreading ability, and about the impact of
# sanitation and physical distancing measures?
#


#######################################################
# Statistical model comparison
#
# One huge advantage of likelihood-based inference is
# that the likelihoods can be used to statistically
# compare model fits.
#
# One very common method for comparing ML-fitted models
# is Akaike Information Criterion (AIC). AIC has been
# used in tens of thousands of research papers to
# compare multiple model fits. See the lecture slides/
# audio, and the Week 6 reading, for an introduction to
# AIC.
#
# The very-very short version of AIC is that AIC
# *estimates the relative distance from the truth*.
#
# We never know what the "true model" is in real life,
# since any true biological history that produced observations
# is preposterously complex.  However, we can at least
# compare the *relative* fit of models to data, using AIC.
#
# AIC is a simple function of likelihood:
#
# AIC = -2 * (lnL - nparams)
#
# lnL = the maximized log-likelihood under some model
# nparams = the number of free parameters that you fit with ML
#
# Basically, the AIC is a form of penalized likelihood: we
# favor models that give higher lnLs, but then we penalize
# models base on the number of free parameters that we fit.
#
# This helps enforce parsimony in our explanations. If we
# put no penalty on extra free parameters, we could always
# fit models with higher and higher likelihood, until we
# ended up with a free parameter for every data point, at
# which point we fit all the data perfectly, but we have
# explained nothing at all, because the model just restates
# the dataset.
#
# Some details:
#
# The "-2" in the AIC formula needs explanation. The "-"
# just converts the log-likelihood (which is always negative,
# if the data are discrete) from negative to positive. This
# makes sense because AIC is a distance, and distances
# should be positive.
#
# The "2" part in AIC is actually an arbitrary convention -
# it would work the same if we used 1, 3, etc. The number "2"
# was chosen because another famous statistical test, the
# Likelihood Ratio Test, uses 2*(difference in log-likelihoods)
# as the test statistic, often called the "deviance".
#
# The rationale for 1 parameter = 1 unit of log-likelihood
# is complex, but it can be derived from a view of the world
# where we think that processes are controlled by a series of
# parameters of decreasing importance, such that there are
# a few major controlling variables, and an exponential
# decay in the importance of other controlling variables.
#
# Other rationales can produce other penalties, creating
# other model comparison criteria, such as BIC (Bayesian
# Information Criterion). Don't worry about these for now.
#
# AIC model weights
#
# AIC is just a number that doesn't mean much by itself.
# AIC values are used to *compare* multiple models. We
# do this with AIC weights, which assign each model a
# weight, where all of the weights sum to 1.
#
# For example, if you had these weights:
#
# Model 1 AIC weight = 0.05
# Model 2 AIC weight = 0.9
# Model 3 AIC weight = 0.05
#
# You would say that "Model 2 acquired 90% of the AIC
# weight, and was the best-fitting model in our
# model set. It fits about 9 times better than the
# other models combined, which is moderately good
# evidence that it is a better fit. Models 1 and 3
# have equal model weight, so they are approximately
# equivalent fits to the data."
#
##############################################
# Here is how to calculate AIC weights:
##############################################
#
# 1. Calculate AIC values for all the models
#    in your model set.
#
# 2. Find the AIC-best model. This is the model
#    with the *LOWEST* AIC (i.e., the smallest
#    relative distance from the truth).
#
# 3. For all the models, calculate the AIC difference
#    from the best model. (The AIC difference between
#    the best model and itself is 0.0.).  This
#    number is called "deltaAIC" (in science, "delta"
#    often means "change" or "difference", so "deltaAIC"
#    just means "difference in AIC".
#
# 4. For each model, calculate exp(-0.5 * deltaAIC). These
#    are the relative likelihoods.
#
# 5. Sum the relative likelihoods, and divide each
#    individual relative likelihood by the sum of
#    relative likelihoods. This is the AIC weight.
#    The AIC weights will sum to 1.0.
#
# 6. If desired, convert these fractions (numbers
#    between 0 and 1) to percentages, by multiplying
#    by 100, or by clicking the "%" option in Excel.
#
###############################################


###############################################
# Advantages and disadvantages of AIC
###############################################
# The advantages of the AIC framework include:
#
# * Multiple models (more than 2) can be compared easily.
#   Traditional statistical tests, which give you p-values,
#   typically only compare two models at a time, namely
#   a null model and an alternate.
#
# * There is no concept of "p-value cutoffs" or
#   "statistical signficance" in AIC-based analyses.
#   The proponents of AIC acknowledge that evidence is
#   continuous. If AIC favours a model 1 over model 2
#   60%-40%, this is a tiny bit of evidence that the
#   model 1 is closer to the truth.  If AIC favours a
#   model 1 99.999% to 0.001%, then this is strong
#   evidence that model 1 closer to the truth.
#
# * AIC assumes that we never have access to the
#   complete, true model that produced our data. It
#   acknowledges that the "true" model behind most
#   empirical data would be fantastically complex. So
#   all we are really trying to do is find better
#   simple approximations of a very complex reality.
#   AIC is one easy way to measure the *relative*
#   distance of models from the truth. We never know
#   the "absolute" distance from the truth, because
#   that would require knowing the true model, which
#   we never do.
#
# * Assuming you have maximized lnL values for each
#   model, the formula for AIC, and AIC weights, is very
#   simple.  The calculations can be done by hand, or
#   in Excel or Google Sheets (or R). All you have to
#   be able to do is count the number of free parameters
#   in each model.
#
# * AIC can be used very broadly for comparing models.
#   This includes not just models where the lnL is
#   explicitly reported (like those above), but
#   also in linear regression and other linear models.
#   Here, model fit is more commonly reported in
#   terms of "deviance" or "RSS" (residual sum
#   of squares, the sum of the squared errors between
#   the line of best fit, and the data).  These are
#   all proportional to -1 * log-likelihood, under
#   the assumption that the errors are normally
#   distributed. So, the "least squares" method can
#   also be interpret as a Maximum Likelihood method!
#
#   (Unfortunately, many introductory students only
#    ever learn about least-squares, when Maximum
#    Likelihood has much broader application to
#    all sorts of models, not just cases where
#    a normal distribution of errors can be assumed.)
#
# AIC has some limitations as well:
#
# * If you are literally doing a randomized controlled
#   experiment, where you have two hypotheses (models)
#   that exhaust the space of models, namely "no effect"
#   (the null model) and "effect" (the alternative model),
#   then the traditional p-value framework can make a
#   lot of sense. This is especially the case where
#   further decisions will be based on the result, so
#   the effect of false-positives and false-negatives
#   needs serious consideration.
#
# * If you *do* have the true model in your set, for
#   example when your data come from a computer simulation
#   where you know the true model, then a criterion
#   like BIC, which has a stronger penalty against
#   extra parameters.
#
# * If you have a dataset where the number of data is
#   small (typically <30 data points), and/or the number
#   of free parameters is large, it is more appropriate
#   to use "AICc", which is sample-size corrected AIC.
#   The formula for this is a little more complex, so
#   we will ignore it here.
#
# * There are all kinds of intense philosophical debates
#   in statistics and philosophy of science about
#   models, inference, and the best way to measure
#   support for models/hypotheses. AIC is just one
#   choice amongst these. I would put it roughly in
#   the "likelihoodist" school. Other major schools of
#   thought are "frequentism" (which includes p-values
#   and worries about the long-term error rate) and
#   Bayesianism (which explicitly puts prior probability
#   distributions on parameter values, and on models.)
#
# * It is easy forget that "all models are wrong", and
#   get very attached to your best model.  This is a
#   challenge in all of statistics & data analysis,
#   however.  The famous statement from George Box is
#   "All models are wrong, but some models are useful."
#   Constant critical thinking is required, as I have
#   tried to encourage!
#######################################################


#######################################################
# EXAMPLE USE OF AIC TO COMPARE MODELS
#######################################################
# Let's compare four models for our NZ Covid-19
# active cases dataset. Here, we run the 2-regime
# and 3-regime models we used before, but I also
# include a 1-regime model, and a 4-regime model.


#######################################################
# Model 1: A 1-regime model
#######################################################
time = c(1, 49) # Day 49 is the 2nd-to-last day, days 49-50 will have
                # a fixed R0 of 2.5
R0 = c(3.0, 2.5)
D_inf = c(14.0, 14.0)
S = c(5000000, 0)
I = c(1, 0)
R = c(0, 0)
thetas_states_table = cbind(time, R0, D_inf, S, I, R)
thetas_states_df = as.data.frame(thetas_states_table, stringsAsFactors=FALSE)

# NOTE: ONLY 2 FREE PARAMETERS
thetas_states_df$R0[1] = "free"
thetas_states_df$I[1] = "free"
thetas_states_df

# Run the params_to_likelihood_v1() function once, to see your starting likelihood
data.points = dat$Active_cases
params = c(2.5, 0.03) # starting parameter values
params_to_likelihood_v1(params, data.points, thetas_states_df, delay_val=0, printvals=TRUE, makePlots=FALSE)

# Running the Maximum Likelihood search with the optim() function.
# LOOK AT THE OUTPUT THAT PRINTS TO SCREEN!!
delay_val = 0
ML_results = optim(par=params, fn=params_to_likelihood_v1, data.points=data.points, thetas_states_df=thetas_states_df, delay_val=delay_val, printvals=TRUE, method="L-BFGS-B", lower=0.0, control=list(fnscale=-1))

# Graphing the ML model
# Take the learned parameters from "ML_results", put them
# into a theta_states data.frame for simulation and plotting
thetas_states_ML = thetas_states_df
TF = thetas_states_ML == "free"
thetas_states_ML[TF] = ML_results$par
for (i in 1:ncol(thetas_states_ML))
	{ thetas_states_ML[,i] = as.numeric(thetas_states_ML[,i]) }
thetas_states_ML$time[2:length(thetas_states_ML$time)] = thetas_states_ML$time[2:length(thetas_states_ML$time)] + delay_val


# Plot the results
trajectory = simulate_SIR_changes(thetas_states_df=thetas_states_ML, times=1:length(data.points))
maxy = max(max(trajectory$I), max(data.points))
xvals = c(trajectory$time, times)
yvals = c(trajectory$I, data.points)
plot(x=xvals, y=yvals, pch=".", col="white", xlim=c(0, max(trajectory$time)), ylim=c(0, maxy), xlab="Day", ylab="Number of individuals")
lines(x=trajectory$time, y=trajectory$I, lwd=3, col="firebrick2")
points(times, data.points)
legend(x="topleft", legend=c("Active COVID-19 case count", 'ML-fitted projection of "I" (Infected)'), lty=c("blank", "solid"), lwd=c(1,3), pch=c(1, "."), col=c("black","firebrick2"), cex=0.8)

titletxt = paste0("ML fit, Model 1 (a 1-regime model)\nTotal log-likelihood = ", round(ML_results$value, 2))
title(titletxt)

# Save this model's parameters and log-likelihood
thetas_states_ML_model1 = thetas_states_ML
total_lnL_Model1 = ML_results$value


#######################################################
# Model 2: A 2-regime model
#######################################################
time = c(1, 24) # NZ Level 4 lockdown went in on March 26, day 24 since 1st case
R0 = c(3.0, 0.3)
D_inf = c(14.0, 14.0)
S = c(5000000, 0)
I = c(1, 0)
R = c(0, 0)
thetas_states_table = cbind(time, R0, D_inf, S, I, R)
thetas_states_df = as.data.frame(thetas_states_table, stringsAsFactors=FALSE)

# Make some parameters into "free" parameters, to be inferred
thetas_states_df$R0 = c("free", "free")
thetas_states_df$I[1] = "free"
thetas_states_df

# Run the params_to_likelihood_v1() function once, to see your starting likelihood
data.points = dat$Active_cases
params = c(3.0, 0.3, 0.03) # starting parameter values
params_to_likelihood_v1(params, data.points, thetas_states_df, delay_val=14, printvals=TRUE, makePlots=FALSE)

# Running the Maximum Likelihood search with the optim() function.
# LOOK AT THE OUTPUT THAT PRINTS TO SCREEN!!
delay_val = 14
ML_results = optim(par=params, fn=params_to_likelihood_v1, data.points=data.points, thetas_states_df=thetas_states_df, delay_val=delay_val, printvals=TRUE, method="L-BFGS-B", lower=0.0, control=list(fnscale=-1))

# Graphing the ML model
# Take the learned parameters from "ML_results", put them
# into a theta_states data.frame for simulation and plotting
thetas_states_ML = thetas_states_df
TF = thetas_states_ML == "free"
thetas_states_ML[TF] = ML_results$par
for (i in 1:ncol(thetas_states_ML))
	{ thetas_states_ML[,i] = as.numeric(thetas_states_ML[,i]) }
thetas_states_ML$time[2:length(thetas_states_ML$time)] = thetas_states_ML$time[2:length(thetas_states_ML$time)] + delay_val


# Plot the results
trajectory = simulate_SIR_changes(thetas_states_df=thetas_states_ML, times=1:length(data.points))
maxy = max(max(trajectory$I), max(data.points))
xvals = c(trajectory$time, times)
yvals = c(trajectory$I, data.points)
plot(x=xvals, y=yvals, pch=".", col="white", xlim=c(0, max(trajectory$time)), ylim=c(0, maxy), xlab="Day", ylab="Number of individuals")
lines(x=trajectory$time, y=trajectory$I, lwd=3, col="firebrick2")
points(times, data.points)
legend(x="topleft", legend=c("Active COVID-19 case count", 'ML-fitted projection of "I" (Infected)'), lty=c("blank", "solid"), lwd=c(1,3), pch=c(1, "."), col=c("black","firebrick2"), cex=0.8)

titletxt = paste0("ML fit, Model 2 (a 2-regime model)\nTotal log-likelihood = ", round(ML_results$value, 2))
title(titletxt)

# Save this model's parameters and log-likelihood
thetas_states_ML_model2 = thetas_states_ML
total_lnL_Model2 = ML_results$value


#######################################################
# Model 3: A 3-regime model
#######################################################
time = c(1, 16, 24) # NZ Level 4 lockdown went in on March 26, day 24 since 1st case
R0 = c(3.0, 1.0, 0.3)
D_inf = c(14.0, 14.0, 14.0)
S = c(5000000, 0, 0)
I = c(1, 0, 0)
R = c(0, 0, 0)
thetas_states_table = cbind(time, R0, D_inf, S, I, R)
thetas_states_df = as.data.frame(thetas_states_table, stringsAsFactors=FALSE)

# Make some parameters into "free" parameters, to be inferred
thetas_states_df$R0 = c("free", "free", "free")
thetas_states_df$I[1] = "free"
thetas_states_df

# Run the params_to_likelihood_v1() function once, to see your starting likelihood
data.points = dat$Active_cases
params = c(3.0, 2.0, 0.3, 0.03) # starting parameter values
params_to_likelihood_v1(params, data.points, thetas_states_df, delay_val=14, printvals=TRUE, makePlots=FALSE)

# Running the Maximum Likelihood search with the optim() function.
# LOOK AT THE OUTPUT THAT PRINTS TO SCREEN!!
delay_val = 14
ML_results = optim(par=params, fn=params_to_likelihood_v1, data.points=data.points, thetas_states_df=thetas_states_df, delay_val=delay_val, printvals=TRUE, method="L-BFGS-B", lower=0.0, control=list(fnscale=-1))

# Graphing the ML model
# Take the learned parameters from "ML_results", put them
# into a theta_states data.frame for simulation and plotting
thetas_states_ML = thetas_states_df
TF = thetas_states_ML == "free"
thetas_states_ML[TF] = ML_results$par
for (i in 1:ncol(thetas_states_ML))
	{ thetas_states_ML[,i] = as.numeric(thetas_states_ML[,i]) }
thetas_states_ML$time[2:length(thetas_states_ML$time)] = thetas_states_ML$time[2:length(thetas_states_ML$time)] + delay_val


# Plot the results
trajectory = simulate_SIR_changes(thetas_states_df=thetas_states_ML, times=1:length(data.points))
maxy = max(max(trajectory$I), max(data.points))
xvals = c(trajectory$time, times)
yvals = c(trajectory$I, data.points)
plot(x=xvals, y=yvals, pch=".", col="white", xlim=c(0, max(trajectory$time)), ylim=c(0, maxy), xlab="Day", ylab="Number of individuals")
lines(x=trajectory$time, y=trajectory$I, lwd=3, col="firebrick2")
points(times, data.points)
legend(x="topleft", legend=c("Active COVID-19 case count", 'ML-fitted projection of "I" (Infected)'), lty=c("blank", "solid"), lwd=c(1,3), pch=c(1, "."), col=c("black","firebrick2"), cex=0.8)

titletxt = paste0("ML fit, Model 3 (a 3-regime model)\nTotal log-likelihood = ", round(ML_results$value, 2))
title(titletxt)

# Save this model's parameters and log-likelihood
thetas_states_ML_model3 = thetas_states_ML
total_lnL_Model3 = ML_results$value


#######################################################
# Model 4: A 4-regime model
#######################################################
time = c(1, 16, 20, 24) # NZ Level 4 lockdown went in on March 26, day 24 since 1st case
R0 = c(3.0, 1.0, 1.0, 0.3)
D_inf = c(14.0, 14.0, 14.0, 14.0)
S = c(5000000, 0, 0, 0)
I = c(1, 0, 0, 0)
R = c(0, 0, 0, 0)
thetas_states_table = cbind(time, R0, D_inf, S, I, R)
thetas_states_df = as.data.frame(thetas_states_table, stringsAsFactors=FALSE)

# Make some parameters into "free" parameters, to be inferred
thetas_states_df$R0 = c("free", "free", "free", "free")
thetas_states_df$I[1] = "free"
thetas_states_df

# Run the params_to_likelihood_v1() function once, to see your starting likelihood
data.points = dat$Active_cases
params = c(3.0, 2.0, 1.0, 0.3, 0.03) # starting parameter values
params_to_likelihood_v1(params, data.points, thetas_states_df, delay_val=14, printvals=TRUE, makePlots=FALSE)

# Running the Maximum Likelihood search with the optim() function.
# LOOK AT THE OUTPUT THAT PRINTS TO SCREEN!!
delay_val = 14
ML_results = optim(par=params, fn=params_to_likelihood_v1, data.points=data.points, thetas_states_df=thetas_states_df, delay_val=delay_val, printvals=TRUE, method="L-BFGS-B", lower=0.0, control=list(fnscale=-1))

# Graphing the ML model
# Take the learned parameters from "ML_results", put them
# into a theta_states data.frame for simulation and plotting
thetas_states_ML = thetas_states_df
TF = thetas_states_ML == "free"
thetas_states_ML[TF] = ML_results$par
for (i in 1:ncol(thetas_states_ML))
	{ thetas_states_ML[,i] = as.numeric(thetas_states_ML[,i]) }
thetas_states_ML$time[2:length(thetas_states_ML$time)] = thetas_states_ML$time[2:length(thetas_states_ML$time)] + delay_val


# Plot the results
trajectory = simulate_SIR_changes(thetas_states_df=thetas_states_ML, times=1:length(data.points))
maxy = max(max(trajectory$I), max(data.points))
xvals = c(trajectory$time, times)
yvals = c(trajectory$I, data.points)
plot(x=xvals, y=yvals, pch=".", col="white", xlim=c(0, max(trajectory$time)), ylim=c(0, maxy), xlab="Day", ylab="Number of individuals")
lines(x=trajectory$time, y=trajectory$I, lwd=3, col="firebrick2")
points(times, data.points)
legend(x="topleft", legend=c("Active COVID-19 case count", 'ML-fitted projection of "I" (Infected)'), lty=c("blank", "solid"), lwd=c(1,3), pch=c(1, "."), col=c("black","firebrick2"), cex=0.8)

titletxt = paste0("ML fit, Model 4 (a 4-regime model)\nTotal log-likelihood = ", round(ML_results$value, 2))
title(titletxt)

# Save this model's parameters and log-likelihood
thetas_states_ML_model4 = thetas_states_ML
total_lnL_Model4 = ML_results$value


#######################################################
# AIC comparison of the 4 models
#######################################################

# Below, I will do AIC comparison in R, but I will
# also provide the same example in Excel and in
# Google Sheets.
#
# (NOTE: The exact results you get may differ, if you
#  change the input data e.g. by adding days.)
#
# Excel: # http://phylo.wikidot.com/local--files/biosci220:quantitative-biology/Week6_AIC_with_SIRs_fit_to_NZ_Ministry_of_Health_data.xlsx
#
# Google Sheet:
# https://docs.google.com/spreadsheets/d/1WNk1la1TfoIdKcadxm0kUW5JNTTl0ZDImzDEQq0fdSQ/edit?usp=sharing
#

# Making the AIC table in R:
#
# First, let's get the number of free parameters
# for each of the 4 models. We get this by counting
# the number of "free" statement in the
# theta_states_df, or more simply by just thinking
# about it.
#
# Model 1 estimated 1 R0 parameter, and a starting "I" count
# That is 2 free parameters estimated by ML.
#
# Model 2 added another R0, Model 3 added another, and
# Model 4 added another.  So they have 3, 4, and 5
# parameters.

# Let's store the number of free parameters
num_free_params = c(2, 3, 4, 5)

# Let's collect the maximized log-likelihoods from
# above
lnL = c(total_lnL_Model1, total_lnL_Model2, total_lnL_Model3, total_lnL_Model4)

# AIC is just -2 * (lnL - num_free_params)
AIC_vals = -2*(lnL - num_free_params)

# Get the delta AICs -- the difference from the best (lowest!) AIC
best_AIC = min(AIC_vals)
delta_AICs = AIC_vals - best_AIC

# Calculate the relative likelihoods
relative_likelihoods = exp(-0.5 * delta_AICs)

# And, calculate the AIC weights
AIC_weight = relative_likelihoods / sum(relative_likelihoods)

# AIC weights as percentages (rounded to 2 decimals)
percent_weight = round(AIC_weight * 100, digits=2)

# Make model names, and results table
model_names = c("Model 1", "Model 2", "Model 3", "Model 4")
AIC_results_matrix = cbind(lnL, num_free_params, AIC_vals, AIC_weight, percent_weight)
AIC_results_df = as.data.frame(AIC_results_matrix, stringsAsFactors=FALSE)

# Make better row & column names
row.names(AIC_results_df) = model_names
names(AIC_results_df) = c("lnL", "nparams", "AIC", "AIC_weight", "pct_weight")

# Look at the results table!
AIC_results_df


# A "real" results table would often also include the parameter values
parameter_estimates = matrix(0, nrow=4, ncol=5)

# It takes some work to extract the parameter estimates from the
# "thetas_states_ML_model" objects where we saved them
parameter_estimates[1,1:5] = c(thetas_states_ML_model1$I[1], thetas_states_ML_model1$R0[1], thetas_states_ML_model1$R0[1], thetas_states_ML_model1$R0[1], thetas_states_ML_model1$R0[1])
parameter_estimates[2,1:5] = c(thetas_states_ML_model2$I[1], thetas_states_ML_model2$R0[1], thetas_states_ML_model2$R0[1], thetas_states_ML_model2$R0[1], thetas_states_ML_model2$R0[2])
parameter_estimates[3,1:5] = c(thetas_states_ML_model3$I[1], thetas_states_ML_model3$R0[1], thetas_states_ML_model3$R0[1], thetas_states_ML_model3$R0[2], thetas_states_ML_model3$R0[3])
parameter_estimates[4,1:5] = c(thetas_states_ML_model4$I[1], thetas_states_ML_model4$R0[1], thetas_states_ML_model4$R0[2], thetas_states_ML_model4$R0[3], thetas_states_ML_model4$R0[4])
parameter_estimates

# Add them to the table, add some nice rounding
first_two_columns = cbind(round(AIC_results_df[,1],1), AIC_results_df[,2])
middle_columns = cbind(round(parameter_estimates, 2))
last_columns = round(AIC_results_df[,3:ncol(AIC_results_df)], 2)
AIC_results_df2 = cbind(first_two_columns, middle_columns, last_columns)
names(AIC_results_df2) = c("lnL", "nparams", "I_ini", "R0_1", "R0_2", "R0_3", "R0_4", "AIC", "AIC_weight", "pct_weight")

# Look at the final table
AIC_results_df2


#######################################################
# Congratulations! You have now calculated AIC model
# weights for your 4 models!  You will see AIC tables
# like this in the results sections of many scientific
# papers.
#
# The main interpretation of the table is about the
# AIC weights or percent weights (which give the
# same information).
#
# Model 1 (R0=2.18 for the entire epidemic)
# is clearly a horrible fit. It gets 0.0% of
# the model weight, and the plot also shows a
# huge difference between the data and the best-fit
# curve.  Yay! This shows that NZ's epidemic clearly
# deviates from constant exponential growth! We are
# nowhere near the millions of cases required for
# herd immunity to take over, so we're not just
# flattening the curve, we are squashing the curve!
#
# Model 2 (R0=3.5 until day 24, then R0=0.3 after the
# lockdown) has a fit that is 4918.75 AIC units
# better (meaning lower) than Model 1. An AIC difference of 10
# translates to a model having about 150 times more
# likelihood (exp(0.5*10) = about 150), so a difference
# of thousands of AIC units is huge. So this is
# a pretty good result, for adding just 1 more
# parameter!
#
# But was Model 2 the best fit? We can see that, in
# Model 3, adding a third R0, allowing R0 to decrease
# before the lockdown starts, improves the fit by
# another 451 AIC units. This is an increase in
# relative likelihood of exp(0.5*451)=8.6*10^97 times.
# Model 3 has an initial R0 of 4.85, then a switch to
# R0=2.53 on Day 16, then a switch to R0=0.53 on Day 24.
#
# Can we do even better?  Model 4 further subdivides
# Days 16-20 from Days 20-24. This *does* increase
# the log-likelihood, from -207.74 (Model 3) to
# -207.38 (Model 4). However, does this improvement
# in log-likelihood outweigh the penalty for having
# an extra parameter?  We know that the penalty term
# is equivalent to 1 unit of log-likelihood, so a
# log-likelihood improvement of only 0.36 units is
# not going to overcome the penalty.
#
# The AIC weights allow us to quantify this effect.
# Model 3 gets 65.49% of the model weight, and Model 4
# gets 34.51%. The research would conclude that, of the
# models considered, Model 3 is the best fit (or at
# least, the least-worst fit).  Model 4 retains
# plausibility, because it is always possible that a
# more complex model is true, but we can say that Model 3
# is an equivalent match to the data, but is more
# parsimonious.
#
#######################################################


#######################################################
# What is the point of all of this?
#
# In biology, we almost never know the "true model". We
# are just trying to find better approximations of
# reality.  AIC is one common choice to balance fitting
# the observations against parsimony.
#
# AIC can be used to address common questions like
# "which model is the best, given the observations I
#  have", or "out of this set of models, which models
#  are reasonable, given the dataset"?
#
# AIC can be used any time you can get a maximized
# log-likelihood, and a count of the number of free
# parameters, on each model.  It is used so often that
# R has some standard functions to provide them for
# linear models provided by the lm() function.
# Note: these functions will not work
# outside of standard "base" R functions like lm() and
# glm().
#
# Here is a quick example of that:

data("iris")  # A default R dataset, measurements of iris flowers
plot(iris$Sepal.Length, iris$Petal.Length)

# Run a simple linear model (lm), with Sepal Length
# predicting Petal Length
lm_result = lm(iris$Petal.Length~Sepal.Length, data=iris)

# The "logLik" function reports the lnL, and the
# number of free parameters (df=3)
# (df means "degrees of freedom", i.e. number of free
#  parameters)
logLik(lm_result)

AIC(lm_result)

# You can see that the AIC formula is being used:
lnL = as.numeric(logLik(lm_result))
-2*(lnL - 3)
AIC(lm_result)

#
# MAIN POINT: The skills you have used for using
# Maximum Likelihood and AIC on SIR models can be
# used anywhere from quite complex process-based
# models used in epidemiology, ecology, phylogenetics
# etc., all the way to the basic linear models used
# in introductory statistics.
#
#######################################################


################################################
# Appendix: The connection between Ordinary
# Least Squares (OLS), i.e. linear regression,
# and Maximum Likelihood.
#
# This is something I wish I had been taught
# early in my statistics education, so I am
# mentioning it here.
#
# This Appendix material is "bonus", I will not
# examine you on it for my section of the course.
#
################################################
#
# In standard "least squares" analyses
# (OLS, Ordinary Least Squares), where
# we talk about the line of
# best fit being the one that minimizes
# the sum of squared residuals (or RSS, residual
# sum of squares), this is just another way
# of talking about maximizing the log-likelihood,
# *if* we make the assumption that the residuals
# (the difference between the predicted line and
#  the observed response) all follow the same
# normally distribution (in other words, the
# residuals are independent, normally distributed
# with constant variance).
#
# In this situation, the log-likelihood for the
# line of best fit is derived from the log of
# the probability density function for a normal
# distribution.
#
# I find that this is rarely explained in R code,
# so I am putting various formulations here.
#

# Here is the logLik() function code:
getAnywhere("logLik.lm")

# This is the formula from logLik, with weights (w)
# set to 1s (meaning equal weighting of all residuals):
0.5 * - N * (log(2 * pi) + 1 - log(N) +
        log(sum(lm_result$residuals^2)))


# Here are several simpler formulations that
# give the same result:
#
# Formula for the log-likelihood of a linear model
# with x predicting y
# Source:
# https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/06/lecture-06.pdf
# ...page 2, equation (3)
#
# N = number of observations
#
# RSS = residual sum of squares =
# sum of (yi - (b0 + b1*xi))^2
# where
# yi = ith response (observation)
# b0 = intercept
# b1 = slope
# xi = ith predictor
#
# stdev_squared = variance = RSS / N
# stdev = standard deviation = sqrt(stdev_squared) = sqrt(variance)

N = length(lm_result$residuals)
RSS = sum(lm_result$residuals^2)
stdev_squared = RSS/N
stdev = sqrt(RSS/N)

-N/2 * log(2*pi) - N*log(stdev) - 1/(2*stdev_squared)*RSS

# The formula works the same if you just use RSS
-N/2 * log(2*pi) - N*log(sqrt(RSS/N)) - 1/(2*RSS/N)*RSS

# R's logLik() function has the above formula re-arranged
# so that RSS appears only once:
0.5 * - N * (log(2 * pi) + 1 - log(N) +
        log(RSS))

logLik(lm_result)

# These formulations all give the same result!

#######################################################
# References
#
# Cosma Shalizi (2015). "Lecture 6: The Method of
# Maximum Likelihood for Simple Linear Regression."
# Lecture 6 from Modern Regression, 36-401, Fall 2015.
# https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/06/lecture-06.pdf
#
#######################################################


#######################################################
# References
#
# My SIR code relies heavily on code from the fitR package
# (Funk et al. 2019).
#
# Adam, David (2020). "UK has enough intensive care units for
# coronavirus, expert predicts." New Scientist, March 25, 2020.
# https://www-newscientist-com.ezproxy.auckland.ac.nz/article/2238578-uk-has-enough-intensive-care-units-for-coronavirus-expert-predicts/
#
# Anderson, Charles (2020). "New Zealand coronavirus deaths during lockdown could
# be just 20, modelling suggests." The Guardian. 27 Mar 2020.
# https://www.theguardian.com/world/2020/mar/27/new-zealand-coronavirus-deaths-during-lockdown-could-be-just-20-modelling-suggests
#
# Dillane, Tom; Knox, Chris (2020), "Coronavirus Covid 19: New Zealand's
# best and worst prepared DHBs by elderly population and ICU beds."
# NZ Herald. April 5, 2020
# https://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=12321390
#
# Fefferman, Nina (2020). "The role of applied math in real-time pandemic
# response: How basic disease models work." Date: 3:30 EDT Tuesday,
# March 31, 2020.  http://www.nimbios.org/webinars#fefferman
#
# Funk, Sebastian (2019). "Introduction to model fitting in R." http://sbfnk.github.io/mfiidd/introduction.html
#
# Funk, Sebastian; Camacho, Anton; Johnson, Helen; Minter, Amanda; O’Reilly, Kathleen; Davies, Nicholas (2019). "Model fitting and inference for infectious disease dynamics."  Centre for the Mathematical Modelling of Infectious Diseases (CMMID), London School of Hygiene & Tropical Medicine.
#
# James, Alex; Hendy, Shaun C.; Plank, Michael J.; Steyn, Nicholas (2020). "Suppression and Mitigation
# Strategies for Control of COVID-19 in New Zealand." 25 March 2020.
# https://cpb-ap-se2.wpmucdn.com/blogs.auckland.ac.nz/dist/d/75/files/2017/01/Supression-and-Mitigation-New-Zealand-TPM-006.pdf
#
# Newton, Kate (2020). "The man modelling NZ's Covid-19 spread
# from his kitchen table." Radio New Zealand. 27 March 2020.
# https://www.rnz.co.nz/news/in-depth/412744/the-man-modelling-nz-s-covid-19-spread-from-his-kitchen-table
#########################################################