BIOSCI220: Lab 4 exercise and questions
# Module 2: "Model-based Inference"
# https://canvas.auckland.ac.nz/courses/48087/pages/module-2-model-based-inference?module_item_id=747842

# You can find the Lab 4 Worksheet questions at:
#
# Canvas -> BIOSCI 220 -> Quizzes -> Lab 4 Worksheet

# The below Lab 4 code is posted at:
#
# https://gist.github.com/nmatzke/e7a49f963912174136c3d319e2981231

#######################################################
# WEEK 4 LAB
#######################################################
# When I planned Week 4 and recorded the lectures,
# COVID-19 was a faraway news story. Now, of course,
# it has taken over our lives.
#
# My goals for BIOSCI220 have been to (1) communicate
# the importance of model-based inference in modern
# science, but also (2) encourage you to think critically
# about models, and how decisions are made based on them.
#
# I usually like to emphasize #2, because there is a
# tendency, inside and outside of science, to naively
# believe whatever some model says, without much
# understanding of what goes into the model, or what
# the consequences might be if the model is wrong.
#
# However, in the light of COVID-19, we are now in the
# midst of a situation where pretty literal life-and-
# death decisions are based on models -- frankly,
# rightly so, because we are in a situation where
# ignoring the models and trying to carry on with
# "business as usual" would have devastating
# consequences.
#
# Even in this situation, it is better to understand
# what is going on, and to understand why decisions
# are being made the way they are, than to just
# shrug and decide to believe or disbelieve the
# authorities on faith.
#
# To help you in this endeavour, I have scrapped some
# of my original material to focus "Module 2: Model-based
# Inference" on epidemiological models. The basic plan is:
#
# Week 4: Basics of model-based inference
#   Lectures: What are models?
#   Lab: Likelihood and maximizing the likelihood
#
# Week 5: The SIR model (Susceptible, Infectious, Recovered)
#   Lectures: Intro to SIR models
#   Lab: Forecasting epidemics with SIR models in R (simulation)
#
# Week 6: Inference and model comparison
#   Lectures: Methods of inference
#   Lab: Inference of SIR model parameters from data
#
#######################################################
#######################################################
#
# LAB 4 SETUP INSTRUCTIONS
#
# For lab 4, you will need access to these:
#
# 1. Web browser, and access to CANVAS (for quizzes,
#    lectures, online discussions, etc.).
#
# 2. Ability to copy/paste text. This is possible even
#    on a smartphone, but probably is much easier on
#    a desktop/laptop/tablet.
#
# 3. R. You can either install R on your desktop/laptop
#    machine, or you can use R through the web, for
#    example with this website: https://rdrr.io/snippets/
#
#    Note that, for the R-through-the-web pages, you
#    will have to paste ALL necessary code in one go --
#    they do not remember whatever you ran previously.
#    This may mean that, later in the lab exercise, you
#    have to paste a large chunk of the lab code into
#    the run window at once.
#
# 4. The TAs and I will take questions on Piazza all
#    week. I also plan to do live Zoom help sessions,
#    so if you want to access these, please install
#    Zoom.
#
#######################################################
#######################################################
# LAB 4, PART 1: READING AND THINKING --
# ADVANTAGES/DISADVANTAGES OF MODELS
#######################################################
#
# In lieu of an in-person discussion in class, I ask
# all students to contribute to a Canvas discussion
# on "models and COVID".
#
# You will recall that we have 3 assigned readings for
# the week:
#
# 1. Asimov - The Relativity of Wrong
# 2. O'Neil - Weapons of Math Destruction
# 3. Stevens - "Why outbreaks like coronavirus spread
#    exponentially, and how to 'flatten the
#    curve'" (free online). March 14, 2020.
#    https://www.washingtonpost.com/graphics/2020/world/corona-simulator/
#
# ASSIGNMENT: Each student should read the above
# articles, and think about what they mean about
# models, their importance, and where and how
# critical thinking should be applied to them.
#
# THEN, with those thoughts in mind, *find another
# media article* that discusses the modeling of
# COVID-19 in New Zealand or elsewhere. There
# are *many* such articles up right now. They cover
# topics like describing the model projections,
# describing debates between epidemiologists (some
# of which have been very public!), or describing
# the reaction/nonreaction of politicians or the
# public to model projections.
#
# Instructions:
#
# 1. Make a post on the CANVAS discussion page.
#
# 2. Start the post with the reference for the article:
#    Author, date, title, publication, link.
#
# 3. Write a short paragraph summarizing the
#    article.
#
# 4. Write a second short paragraph giving your
#    reaction. Don't overthink this -- we are
#    in a stressful time, and I want this to be
#    open to anything from a highly academic
#    evaluation ("This article gets the science
#    wrong because...") to questions ("I wish
#    this article answered question X that I have
#    because..."), to personal reactions ("It makes
#    me angry/sad/happy that person/country X/Y
#    did/did not do Z because...").
#
#    Include the reference & 2 paragraphs in a SINGLE post.
#
# 5. After a number of posts have gone up,
#    please post a short reply to at least one
#    other post. Ideally, a discussion of
#    interesting topics will naturally emerge, but
#    I would like to see every post get at least
#    one reaction. OF COURSE, REMEMBER TO BE POLITE
#    AND COLLEGIAL AND SCHOLARLY.
#
# 6. It's OK if different posts discuss the same articles,
#    although, for your own interest, you might try to
#    find a new article.
#
# 7. This discussion assignment will be worth 10%
#    of your lab 4 grade. The only ways to lose points
#    are (a) not doing it, or (b) doing something
#    insulting/flippant/mean.
#
#######################################################
#######################################################
# LAB 4, PART 2: INTRODUCTION TO ML INFERENCE
#######################################################
#
# In Week 4, we will begin our exploration of
# model-based inference with a very simple dataset
# and model.
#
# In Weeks 5 & 6, we will apply these basic skills to
# SIR models from epidemiology, which are not nearly
# as complex or hard to understand as you might have
# thought. Basic SIR models can be coded and run in
# a few lines of R code.
#
# BACKGROUND
#
# Many biological problems, in many different fields,
# from ecology to epidemiology, are concerned with
# trying to estimate the prevalence or frequency of
# something.
#
# For example:
#
# - Out of 1000 ponds in a region, how many contain a
#   particular endangered frog species?
#
# - Out of a million people, how many are resistant
#   to measles (due to e.g. vaccination or previous
#   infection)?
#
# These examples (leaving out numerous real-world
# complexities) boil down to this: there are 2
# possible outcomes, and we are interested in how
# often each outcome occurs.
#
# Another process with 2 outcomes is coin-flipping.
#
# Coin-flipping, and any other 2-outcome process,
# can be described with a one-parameter model, using
# a binomial distribution ("bi-nomial" = "two names").
#
# (We will use coin-flipping as an example, but you can
# mentally substitute any other binomial process if
# you like.)
#
# The single parameter describing a binomial process
# is "p", the probability of one of the outcomes. The
# probability of the other outcome is 1-p.
#
# Because there are so many different probabilities and
# "p" terms used in probability and statistics, I will
# use "P_heads" -- the probability that a coin will
# produce "Heads" on a single flip -- to describe this
# parameter.
#
#######################################################
# Let's consider some coin-flip data.
#
# Here are 100 coin flips:
coin_flips = c('H','T','H','T','H','H','T','H','H','H','T','H','H','T','T','T','T','H','H','H','H','H','H','H','H','H','H','H','H','H','H','H','H','T','T','T','H','T','T','T','H','T','T','T','H','H','H','T','T','H','H','H','T','H','H','H','T','T','H','H','H','H','H','H','H','T','T','H','H','H','H','T','T','H','H','H','T','T','H','H','H','H','H','H','T','T','T','H','H','H','H','H','H','T','H','T','H','H','T','T')

# Look at the data
coin_flips

# What is your guess at "P_heads", the probability of heads?
#
#
# In the case of binomial data, we actually have a simple
# formula to calculate the best estimate of P_heads:

# Find the heads
heads_TF = (coin_flips == "H")
heads_TF

# Find the tails
tails_TF = (coin_flips == "T")
tails_TF

numHeads = sum(heads_TF)
numHeads

numTails = sum(tails_TF)
numTails

numTotal = length(coin_flips)
numTotal
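
# A quick cross-check (a minimal sketch): table() tallies the
# outcomes in one step, and should match numHeads and numTails.
table(coin_flips)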
# Here's the formula:
P_heads_ML_estimate = numHeads / numTotal
P_heads_ML_estimate

# PLEASE ANSWER LAB 4 QUESTION #1, NOW:
#
# 1. What number (exactly) was reported for P_heads_ML_estimate?
#
# (Obviously the answer is 0.65, but I want you to realize that
# the computer will auto-grade this. So, 0.65 will be marked correct,
# but .65, 0.650, etc., will be marked incorrect. Extra spaces may
# also cause the answer to be marked incorrect, e.g. "0.65" is correct,
# but " 0.65" or "0.65 " are incorrect. For future answers, when numbers
# are asked for, be sure to report the exact number R gave, without
# spaces etc.)

# Well, duh, that seems pretty obvious. At least it would have been, if we
# weren't thinking of coins, where we have a strong prior belief that the
# coin is probably fair.
#
# It turns out that this formula can be justified through a technique
# known as Maximum Likelihood.
#
# What does it mean to say we have a "maximum likelihood" estimate of P_heads?
#
# "Likelihood", in statistics, means "the probability of the data under the model".
#
# So, a "Maximum Likelihood estimate" of P_heads is the value of the parameter
# P_heads that maximizes the probability of seeing the data you saw, namely,
# 65 Heads and 35 Tails.
#
# | |
# Comparing likelihoods for different possible parameter values of P_heads | |
# | |
# Let's calculate the probability of these same coin flip data under the | |
# hypothesis/model that P_heads=0.5 | |
# | |
# We'll be very inefficient, and use a for-loop, and | |
# if/else statements | |
# Loop through all 100 flips | |
# Make a list of the probability of | |
# each datum | |
P_heads_guess = 0.5 | |
# Empty list of probabilities | |
probs_list = rep(NA, times=length(coin_flips)) | |
probs_list | |
for (i in 1:length(coin_flips)) | |
{ | |
# Print an update | |
cat("\nAnalysing coin flip #", i, "/", length(coin_flips), sep="") | |
# Get the current coin flip | |
coin_flip = coin_flips[i] | |
# If the coin flip is heads, give that datum | |
# probability P_heads_guess. | |
# If tails, give it (1-P_heads_guess) | |
if (coin_flip == "H") | |
{ | |
probs_list[i] = P_heads_guess | |
} # End if heads | |
if (coin_flip == "T") | |
{ | |
probs_list[i] = (1-P_heads_guess) | |
} # End if tails | |
} # End for-loop | |
# Look at the resulting probabilities | |
probs_list | |
# We get the probability of all the data by multiplying | |
# all the probabilities, with the prod() function. | |
likelihood_of_data_given_P_heads_guess1 = prod(probs_list) | |
likelihood_of_data_given_P_heads_guess1 | |
# PLEASE ANSWER LAB 4, QUESTION #2, NOW: | |
# 2. What number (exactly) was reported for likelihood_of_data_given_P_heads_guess1? | |
# That's a pretty small number! You'll see that it's | |
# just 0.5^100: | |
0.5^100 | |
# A probability of 0.5 is not small, but multiply it | |
# 100 values of 0.5 together, and you get a small value. | |
# That's the probability of that specific sequence of | |
# heads/tails, given the hypothesis that the true | |
# probability is P_heads_guess. | |
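
# The same calculation can be done without a loop. A minimal
# vectorized sketch: ifelse() picks P_heads_guess or (1-P_heads_guess)
# for each flip, and prod() multiplies them, exactly as the loop did.
probs_list_vectorized = ifelse(coin_flips == "H", P_heads_guess, 1 - P_heads_guess)
prod(probs_list_vectorized)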
# Let's try another parameter value:

# Loop through all 100 flips.
# Make a list of the probability of
# each datum.
P_heads_guess = 0.7

# Empty list of probabilities
probs_list = rep(NA, times=length(coin_flips))
probs_list

for (i in 1:length(coin_flips))
{
    # Print an update
    cat("\nAnalysing coin flip #", i, "/", length(coin_flips), sep="")

    # Get the current coin flip
    coin_flip = coin_flips[i]

    # If the coin flip is heads, give that datum
    # probability P_heads_guess.
    # If tails, give it (1-P_heads_guess)
    if (coin_flip == "H")
    {
        probs_list[i] = P_heads_guess
    } # End if heads

    if (coin_flip == "T")
    {
        probs_list[i] = (1-P_heads_guess)
    } # End if tails
} # End for-loop

# Look at the resulting probabilities
probs_list

# We get the probability of all the data by multiplying
# all the probabilities
likelihood_of_data_given_P_heads_guess2 = prod(probs_list)
likelihood_of_data_given_P_heads_guess2

# PLEASE ANSWER LAB 4, QUESTION #3, NOW:
# 3. What number (exactly) was reported for likelihood_of_data_given_P_heads_guess2?

# We got a different likelihood. It's also very small.
# But that's not important. What's important is,
# how many times higher is it?
likelihood_of_data_given_P_heads_guess2 / likelihood_of_data_given_P_heads_guess1

# PLEASE ANSWER LAB 4, QUESTION #4, NOW:
# 4. What number (exactly) was reported for
#    likelihood_of_data_given_P_heads_guess2 / likelihood_of_data_given_P_heads_guess1 ?

# Whoa! That's a lot higher! This means the coin flip data is 54 times more
# probable under the hypothesis that P_heads=0.7 than under the
# hypothesis that P_heads=0.5.

# Maximum likelihood: Hopefully you can see that the data is in some sense
# "best explained" by the parameter value (the value of P_heads) that
# maximizes the probability of the data. This is what we call the Maximum
# Likelihood solution.
# To try different values of P_heads, we could keep copying and pasting code,
# but that seems annoying. Let's define an R "function" instead.
#
# Functions, in R, collect a number of commands, and run them when you
# type the name of the function. Functions are simple programs. All of
# the R commands you have ever run, e.g. mean(), prod(), etc., are
# just functions that R has pre-programmed.
#
# NOTES ON FUNCTIONS IN R:
#
# * To run, functions have to be followed by (). Usually, but not always,
#   inputs go into the (). One example of a function that needs no inputs
#   is getwd(), which gets your current working directory (wd) and prints it
#   to screen:
getwd()

# * To see the code that is running inside the function, you can often
#   just type the name of the function, without the (). This works for
#   most R functions except for "primitive" ones (these are core R
#   functions, like length, sum, etc., that are programmed a different
#   way to be super-fast). (An example appears below, after Question 5.)
#
# * Typing the names of functions and figuring out exactly what is
#   happening in the code is a great way to learn R and learn a
#   specific scientific field, since you can see *exactly* what
#   some scientist is doing in their analysis.
#
# * For functions that are part of packages, you can get the help
#   page by typing a ? before the function name. E.g., ?mean will open
#   the help page for the "mean" function. R help is often not
#   written for beginners, but it is better than nothing.
#
# * When you try to run a function and get an error, this usually
#   means you left out a required input, or you put in the wrong
#   kind of input (e.g., if the input is supposed to be a number,
#   and you put in a word like "hello", you will get an error
#   message).
#
# When you type
mean("hello")
# ...what happens?

# PLEASE ANSWER LAB 4, QUESTION #5, NOW:
# 5. When you run 'mean("hello")', paste in the last part of the error message you get back.
#    Your paste should begin with "argument" and end with "NA".
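
# As noted above, typing a (non-primitive) function's name without
# parentheses prints its source code. For example, sd, which computes
# the standard deviation, is written in R code you can read:
sd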
# Let's define a function.
#
# This function calculates the probability of a
# sequence of coin flip data, given a value of P_heads_guess
calc_prob_coin_flip_data <- function(P_heads_guess, coin_flips)
{
    # Empty list of probabilities
    probs_list = rep(NA, times=length(coin_flips))
    probs_list

    for (i in 1:length(coin_flips))
    {
        # Print an update
        #cat("\nAnalysing coin flip #", i, "/", length(coin_flips), sep="")

        # Get the current coin flip
        coin_flip = coin_flips[i]

        # If the coin flip is heads, give that datum
        # probability P_heads_guess.
        # If tails, give it (1-P_heads_guess)
        if (coin_flip == "H")
        {
            probs_list[i] = P_heads_guess
        } # End if heads

        if (coin_flip == "T")
        {
            probs_list[i] = (1-P_heads_guess)
        } # End if tails
    } # End for-loop

    # Look at the resulting probabilities
    probs_list

    # We get the probability of all the data by multiplying
    # all the probabilities
    likelihood_of_data_given_P_heads_guess = prod(probs_list)

    # Return result
    return(likelihood_of_data_given_P_heads_guess)
}

# Now, we can just use this function:
calc_prob_coin_flip_data(P_heads_guess=0.5, coin_flips=coin_flips)
calc_prob_coin_flip_data(P_heads_guess=0.6, coin_flips=coin_flips)
calc_prob_coin_flip_data(P_heads_guess=0.7, coin_flips=coin_flips)

# Look at that! We did all of that work in a split-second.

# PLEASE ANSWER LAB 4, QUESTION #6, NOW:
# 6. What number (exactly) was reported for
#    calc_prob_coin_flip_data(P_heads_guess=0.6, coin_flips=coin_flips) ?

# In fact, we can make another for-loop, and search for the ML
# value of P_heads by trying all of the values and plotting them.

# Sequence of 50 possible values of P_heads between 0 and 1
P_heads_values_to_try = seq(from=0, to=1, length.out=50)
likelihoods = rep(NA, times=length(P_heads_values_to_try))

for (i in 1:length(P_heads_values_to_try))
{
    # Get the current guess at P_heads_guess
    P_heads_guess = P_heads_values_to_try[i]

    # Calculate likelihood of the coin flip data under
    # this value of P_heads
    likelihood = calc_prob_coin_flip_data(P_heads_guess=P_heads_guess, coin_flips=coin_flips)

    # Store the likelihood value
    likelihoods[i] = likelihood
} # End for-loop

# Here are the resulting likelihoods:
likelihoods
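
# A compact alternative (a sketch using sapply() in place of the loop):
# sapply() calls our function once per candidate value of P_heads,
# returning the same vector of likelihoods.
likelihoods_sapply = sapply(P_heads_values_to_try, calc_prob_coin_flip_data, coin_flips=coin_flips)
all.equal(likelihoods, likelihoods_sapply)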
# Let's try plotting the likelihoods to see if there's a peak
plot(x=P_heads_values_to_try, y=likelihoods)
lines(x=P_heads_values_to_try, y=likelihoods)

# Whoa! That's quite a peak! You can see that the likelihoods
# vary over several orders of magnitude.
#
# Partially because of this extreme variation, we often use the
# logarithm of the likelihood, also known as
#
# * log-likelihood
# * lnL
#
# In R, the log() function returns the natural logarithm, while
# log10() returns the base-10 logarithm. I mention this because
# on some calculators, the LN button returns the natural log,
# and the LOG button returns the base-10 log.
#
# There are several reasons scientists, and scientific papers,
# often use the log-likelihood instead of the raw likelihood.
# These include:
#
# * Computers have a minimum numerical precision. If a likelihood
#   value gets too small, it gets rounded to 0. You can see this
#   happen if you calculate the raw likelihood of over 1000 coin-flips
#   using the parameter P_heads = 0.5
#
#   Machine precision can vary between computers, so I will give you
#   the results on my machine (but, try it on yours):
0.5^1073
# ...equals 9.881313e-324, which is a very small number
0.5^1074
# ...equals 4.940656e-324, which is half the size of the previous number,
# which makes sense, because we just multiplied it by 0.5
#
# However, on my machine,
0.5^1075
# ...equals 0. Instead of one more coin-flip causing the likelihood to be
# halved again, the value has "underflowed" to zero -- it became smaller
# than the smallest number the computer can represent. So the
# computer has made a "mistake" here.
#
# Log-likelihoods do not have this problem. Compare:
1073 * log(0.5)
1074 * log(0.5)
1075 * log(0.5)
1076 * log(0.5)

# We can convert log-likelihoods to raw likelihoods with exp(), which reverses
# the log() operation.
exp(1073 * log(0.5))
exp(1074 * log(0.5))
exp(1075 * log(0.5))
exp(1076 * log(0.5))

# You can see that, working in log-likelihood space, we have no
# numerical precision issues, but with raw likelihoods, we do.

# PLEASE ANSWER LAB 4, QUESTION #7, NOW:
# 7. What number (exactly) was reported for 1075 * log(0.5) ?

# Other reasons scientists use log-likelihoods instead of likelihoods:
#
# * The log-likelihood and raw likelihood will have their maxima
#   at the same parameter values.
#
# * Taking the logarithm of an equation converts multiplication to
#   addition, which can make calculus much easier. (See the quick
#   check after this list.)
#
# * Log-likelihoods are easier to read and interpret than raw
#   likelihoods, once you get used to log-likelihoods. For any
#   large dataset, the raw likelihood will be a very tiny
#   number, hard to read/remember/interpret.
#
# * It turns out that a number of fundamental statistical measures
#   and tests use the log-likelihood as an input. These include
#   the Likelihood Ratio Test (LRT) and Akaike Information Criterion (AIC).
#   We will talk about these in future weeks.
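
# A quick check of the multiplication-to-addition point (a sketch,
# reusing the probs_list computed above): the log of a product
# equals the sum of the logs.
log(prod(probs_list))
sum(log(probs_list))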
# PLOTTING THE LOG-LIKELIHOOD
#
# We will repeat the code to calculate the likelihood
# for a sequence of 50 possible values of P_heads between 0 and 1:
P_heads_values_to_try = seq(from=0, to=1, length.out=50)
likelihoods = rep(NA, times=length(P_heads_values_to_try))

for (i in 1:length(P_heads_values_to_try))
{
    # Get the current guess at P_heads_guess
    P_heads_guess = P_heads_values_to_try[i]

    # Calculate likelihood of the coin flip data under
    # this value of P_heads
    likelihood = calc_prob_coin_flip_data(P_heads_guess=P_heads_guess, coin_flips=coin_flips)

    # Store the likelihood value
    likelihoods[i] = likelihood
} # End for-loop

# Here are the resulting likelihoods:
likelihoods

# Let's take the log (the natural log, i.e. the base is exp(1)).
log_likelihoods = log(likelihoods, base=exp(1))
plot(x=P_heads_values_to_try, y=log_likelihoods)
lines(x=P_heads_values_to_try, y=log_likelihoods)

# Let's plot the raw- and log-likelihood together:
par(mfrow=c(2,1))
plot(x=P_heads_values_to_try, y=likelihoods, main="Likelihood (L) of the data")
lines(x=P_heads_values_to_try, y=likelihoods)
plot(x=P_heads_values_to_try, y=log_likelihoods, main="Log-likelihood (LnL) of the data")
lines(x=P_heads_values_to_try, y=log_likelihoods)

# PLEASE ANSWER LAB 4, QUESTION #8, NOW:
# 8. Compare the raw-likelihood curve and the log-likelihood curve. Does it look
#    like the top of each curve appears in about the same location on the
#    x-axis? (i.e., at the same parameter value for P_heads?)
#
# Maximum likelihood optimization
#
# You can see that the maximum likelihood of the data occurs when
# P_heads is somewhere around 0.6 or 0.7. What is it
# exactly?
#
# We could just keep trying more values until we reach whatever
# precision we desire. But, R has a function for
# maximum likelihood optimization!
#
# It's called optim(). The optim() function takes a function as an input.
# Fortunately, we've already written a function!
#
# Let's modify our function a bit to return the log-likelihood,
# and print the result:

# Function that calculates the probability of coin flip data
# given a value of P_heads_guess
calc_prob_coin_flip_data2 <- function(P_heads_guess, coin_flips)
{
    # Empty list of probabilities
    probs_list = rep(NA, times=length(coin_flips))
    probs_list

    for (i in 1:length(coin_flips))
    {
        # Print an update
        #cat("\nAnalysing coin flip #", i, "/", length(coin_flips), sep="")

        # Get the current coin flip
        coin_flip = coin_flips[i]

        # If the coin flip is heads, give that datum
        # probability P_heads_guess.
        # If tails, give it (1-P_heads_guess)
        if (coin_flip == "H")
        {
            probs_list[i] = P_heads_guess
        } # End if heads

        if (coin_flip == "T")
        {
            probs_list[i] = (1-P_heads_guess)
        } # End if tails
    } # End for-loop

    # Look at the resulting probabilities
    probs_list

    # We get the probability of all the data by multiplying
    # all the probabilities
    likelihood_of_data_given_P_heads_guess = prod(probs_list)

    # Get the log-likelihood
    LnL = log(likelihood_of_data_given_P_heads_guess)
    LnL

    # Error correction: if -Inf, reset to a low value
    if (is.finite(LnL) == FALSE)
    {
        LnL = -1000
    }

    # Print some output
    print_txt = paste("\nWhen P_heads=", P_heads_guess, ", LnL=", LnL, sep="")
    cat(print_txt)

    # Return result
    return(LnL)
}

# Try the function out:
LnL = calc_prob_coin_flip_data2(P_heads_guess=0.1, coin_flips=coin_flips)
LnL = calc_prob_coin_flip_data2(P_heads_guess=0.2, coin_flips=coin_flips)
LnL = calc_prob_coin_flip_data2(P_heads_guess=0.3, coin_flips=coin_flips)

# Looks like it works! Let's use optim() to search for the
# best P_heads value:

# Set a starting value of P_heads
starting_value = 0.1

# Set the limits of the search
limit_bottom = 0
limit_top = 1

optim_result = optim(par=starting_value, fn=calc_prob_coin_flip_data2, coin_flips=coin_flips, method="L-BFGS-B", lower=limit_bottom, upper=limit_top, control=list(fnscale=-1))
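
# Note: optim() minimizes by default; control=list(fnscale=-1)
# flips the sign so that it maximizes our log-likelihood instead.
# method="L-BFGS-B" is the optimizer that accepts the lower/upper
# bounds, keeping the search for P_heads inside [0, 1].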
# You can see the search print out as it proceeds.

# Let's see what the ML search decided on:
optim_result

# Let's compare the ML estimate of P_heads found by the search
# with the simple formula (the fraction of heads):
optim_result$par

# Here's the formula:
P_heads_ML_estimate = numHeads / numTotal
P_heads_ML_estimate

# PLEASE ANSWER LAB 4, QUESTION #9, NOW:
# 9. Is the ML value of P_heads_guess found by numerical optimization
#    (found in optim_result$par ) very close to the ML value found
#    by analytical solution (P_heads_ML_estimate)?

# PLEASE ANSWER LAB 4, QUESTION #10, NOW:
# 10. What was the maximized log-likelihood found by optim (hint: look at
#     optim_result$value)? Numerical optimization results might vary
#     slightly, so round this value to 2 decimal places in your answer.
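
# As a sanity check (a sketch, using the 65 heads and 35 tails counted
# earlier): the maximized log-likelihood can also be computed directly
# from the counts and the analytic estimate. It should match
# optim_result$value closely.
numHeads * log(P_heads_ML_estimate) + numTails * log(1 - P_heads_ML_estimate)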
# Where does that formula -- the "analytic solution" for P_heads_ML_estimate --
# come from? We will discuss this further in future lectures, but here is a summary.
#
# It turns out that, in simple cases, we can use calculus to derive
# "analytical solutions" for maximizing the likelihood. Basically, this involves:
#
# - taking the derivative of the likelihood function
#   (the derivative is the slope of a function)
# - setting the formula for the derivative to equal 0
#   (because, when the derivative=0, you are at a maximum or minimum)
# - solving for the parameter value that causes the derivative to
#   equal 0.
#
# This same basic calculus-based procedure provides a justification for
# the formulas that are used in e.g. linear regression.
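
# A quick numerical sketch of that recipe for our coin data: for
# LnL = numHeads*log(p) + numTails*log(1-p), the derivative is
# numHeads/p - numTails/(1-p). It should be ~0 at the analytic
# solution p = numHeads/numTotal.
p = P_heads_ML_estimate
numHeads/p - numTails/(1 - p)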
#
# The nice things about calculus-derived formulas for maximizing likelihood
# -- known as "analytic solutions" -- include:
#
# - they are very fast
# - the solution can be found without a computer (this was useful back in
#   the 20th century, when statistics was being developed without computers)
# - there is no chance of optimization error (optimization errors can occur
#   if likelihood surfaces are too flat, too bumpy, etc.)
#
# This raises the question:
# Why would anyone ever go through all the optimization rigamarole, when
# they could just calculate P_heads directly?
#
# Well, only in simple cases do we have a formula for the maximum likelihood
# estimate. The optim() strategy works whether or not
# there is a simple formula.
#
# In real-life science, with complex models of biological processes (such as
# epidemiological models), ML optimization gets used A LOT, but most scientists
# don't learn it until graduate school, if then.
#
# WHAT WE LEARNED
#
# You can now see that there is nothing magical, or even particularly
# difficult, about these concepts:
#
# - likelihood
# - log-likelihood
# - maximum likelihood (ML)
# - inferring a parameter through ML
#
# IMPLICATIONS
#
# Despite their simplicity, it turns out that likelihood is one of
# the core concepts in model-based inference, and statistics
# generally. Bayesian analyses have likelihood as a core component
# (we will talk about this more later), and the methods from introductory
# statistics (linear models etc.) can also be derived from
# likelihood-based reasoning.
#######################################################
#######################################################
# PART 3: BUILDING BLOCKS OF MODELS IN R
#######################################################
#
# R has a number of functions that speed up the process
# of building models, and simulating from them.
#
# For example, for binomial processes, these are:
#
# dbinom(x, size, prob, log) -- the probability density
#   (the likelihood) of x "heads"
#
#   size = the number of tries (the number of coin flips)
#
#   prob = the parameter value (e.g. P_heads)
#
#   log = should the likelihood be log-transformed when output?
#
# The other functions are:
#
# rbinom(n, size, prob) -- simulate n binomial outcomes,
#   using "size" and "prob" as above
#
# pbinom(q, ...) -- used for calculating tail probabilities
#   (the proportion of the distribution less than or equal to the value q)
#
# qbinom(p, ...) -- the inverse of pbinom()
#
# Any time you hear about "probability densities", "distributions",
# "random number generators", "random draws", or "simulations"
# in R, functions like these are being used.
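
# A few example calls (a sketch, reusing the 65-heads count from above):
dbinom(x=65, size=100, prob=0.65, log=TRUE)  # log-probability of 65 heads in 100 flips
rbinom(n=5, size=100, prob=0.65)             # simulate 5 experiments of 100 flips each
pbinom(q=50, size=100, prob=0.65)            # probability of 50 or fewer heads out of 100
qbinom(p=0.5, size=100, prob=0.65)           # the median number of heads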
#
# For uniform distributions, the equivalent functions are:
#
# dunif, runif, punif, qunif
#
# For normal (=Gaussian/bell-curve) distributions, the equivalents are:
#
# dnorm, rnorm, pnorm, qnorm

# Random seeds, and random number generators
#
# Random numbers in computers are not truly random. To make random numbers
# repeatable, a "seed" number is always used.

# Compare:
rnorm(n=1, mean=6, sd=1)
rnorm(n=1, mean=6, sd=1)
rnorm(n=1, mean=6, sd=1)

set.seed(54321)
rnorm(n=1, mean=6, sd=1)
rnorm(n=1, mean=6, sd=1)

set.seed(54321)
rnorm(n=1, mean=6, sd=1)
rnorm(n=1, mean=6, sd=1)

set.seed(54321)
rnorm(n=1, mean=6, sd=1)
rnorm(n=1, mean=6, sd=1)

# By re-setting the seed, we re-set the random number generator.

# PLEASE ANSWER LAB 4, QUESTION #11, NOW:
# 11. If you set the random number seed to 54322, what is the first
#     random number generated by rnorm(n=1, mean=6, sd=1)?

# Generate 1000 random numbers from a normal distribution, and
# then plot a histogram of them, with this code:
random_numbers = rnorm(n=1000, mean=6, sd=1)
hist(random_numbers)

# PLEASE ANSWER LAB 4, QUESTION #12, NOW:
# 12. What shape do you see?

# PLEASE ANSWER LAB 4, QUESTION #13, NOW:
# 13. I am 6 feet tall. What is the raw likelihood of my height,
#     assuming that I am a random draw from a population that is
#     7 feet tall, with a standard deviation of 0.5 feet?
dnorm(x=6.0, mean=7.0, sd=0.5)

# PLEASE ANSWER LAB 4, QUESTION #14, NOW:
# 14. What is the log-likelihood of my height?
log(dnorm(x=6.0, mean=7.0, sd=0.5))

# PLEASE ANSWER LAB 4, QUESTION #15, NOW:
# 15. If we change the model of my population from mean height=7
#     feet to mean height=6 feet, will the likelihood of my 6-foot
#     height go up or down?

# PLEASE ANSWER LAB 4, QUESTION #16, NOW:
# 16. What is the raw likelihood of my height, assuming I am a
#     random draw from a population that is 6 feet tall, with
#     a standard deviation of 0.5 feet?
dnorm(x=6.0, mean=6.0, sd=0.5)

# NOTE: The "d" in "dnorm" refers to probability density. For continuous data,
# the probability of any exact value is 0, because there are an infinite number
# of possible values for a continuous variable. So we talk about the probability
# density instead of the exact probability (or probability mass). Probability
# densities work the same as probability masses for our purposes.
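
# A quick illustration of that NOTE (a sketch): the probability of
# landing in a narrow interval is approximately the density times
# the interval width. Compare:
dnorm(x=6.0, mean=6.0, sd=0.5) * 0.01
pnorm(q=6.005, mean=6.0, sd=0.5) - pnorm(q=5.995, mean=6.0, sd=0.5)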
# PLEASE ANSWER LAB 4, QUESTION #17, NOW:
# 17. What is the log-likelihood corresponding to the Question 16
#     likelihood?

# PLEASE ANSWER LAB 4, QUESTION #18, NOW:
# 18. What shape does dnorm() make, for a variety of
#     height values? Try:
height_values = seq(4, 8, 0.01)
probability_densities = dnorm(x=height_values, mean=6.0, sd=0.5)
plot(x=height_values, y=probability_densities)

#######################################################
# LAB 4 EXERCISE: PLEASE FILL OUT THE EXERCISE ON CANVAS.
# IT IS DUE BY 11:59 PM, SATURDAY, APRIL 4.
#######################################################

#######################################################
# LAB 5: WHEN YOU FINISH LAB 4, WHICH WAS JUST
# INTRODUCTORY MATERIAL, YOU ARE WELCOME TO MOVE
# ON TO LAB 5, WHEN I POST IT. IT WILL INVOLVE
# SIR MODELS OF EPIDEMICS, AND SO IT WILL BE MUCH MORE
# INTERACTIVE.
#######################################################