Skip to content

Instantly share code, notes, and snippets.

Avatar
🏠
Working from home

Christopher Peters statwonk

🏠
Working from home
View GitHub Profile
@statwonk
statwonk / coronavirus.R
Last active Mar 1, 2020
Showing that the number of deaths by coronavirus is somewhat robust to CFR restricted above 1% (South Korea lower bound at 0.4%).
View coronavirus.R
set.seed(1)
N <- 1e7 # sims
quantiles_of_interest <- function(x) { quantile(x, c(0.0001, 0.25, 0.5, 0.75, 0.9999)) }
death_rate <- function() { pmin(pmax(rlnorm(N, log(0.02), 0.35), 0.001), 0.06) }
quantiles_of_interest(death_rate())
# very robust / ignorant belief of range of case 2.4k to 60M cases.
susceptible_cases <- function() { runif(N, 0.0024, 150) }
@statwonk
statwonk / coronavirus.R
Created Mar 1, 2020
Coronavirus estimates
View coronavirus.R
set.seed(0)
N <- 1e7 # sims
quantiles_of_interest <- function(x) { quantile(x, c(0.0001, 0.25, 0.5, 0.75, 0.9999)) }
death_rate <- function() { pmin(pmax(rlnorm(N, log(0.02), 0.35), 0.001), 0.06) }
# Statwonk's belief of final US death rate
quantiles_of_interest(death_rate())
# 0.01% 25% 50% 75% 99.99%
@statwonk
statwonk / bootstrapping.R
Last active Feb 22, 2020
The essence of bootstrapping: resampling with replacement to estimate a sampling distribution and its statistics of interest.
View bootstrapping.R
library(tidyverse)
library(gganimate)
# sample of 100 data points
tibble(x = rexp(1e2)) -> d
# 20 bootstrap samples, you likely want several orders of magnitude more.
tibble(boot_samples = seq_len(2e1)) %>%
mutate(data = map(boot_samples, ~ d %>% sample_frac(1, replace = TRUE))) %>%
unnest(cols = c(data)) %>%
@statwonk
statwonk / retweets.R
Created Dec 30, 2019
Measuring retweets, for fun!
View retweets.R
library(fitdistrplus)
library(countreg)
library(tidyverse)
library(rtweet)
get_timeline("statwonk", n = 1e4) %>%
filter(!is_retweet) %>%
pull(retweet_count) -> my_retweets
# I know a priori it's negbin,
@statwonk
statwonk / durations_to_counts.R
Created Sep 7, 2019
The relationship between intra-arrival times and count models.
View durations_to_counts.R
library(tidyverse)
# https://www.kellogg.northwestern.edu/faculty/weber/decs-430/Notes%20on%20the%20Poisson%20and%20exponential%20distributions.pdf
# http://www.mscs.mu.edu/~jsta/issues/11(3)/JSTA11(3)p2.pdf
# Case 1: memoryless intra-arrival times. They're not correlated.
tibble::tibble(
durations_between = rexp(3e4, rate = 20)
) %>%
mutate(time = cumsum(durations_between)) %>%
mutate(time_interval = cut(time, breaks = seq(0, 1e4, 1), labels = FALSE)) %>%
@statwonk
statwonk / simple_multilevel.R
Last active Sep 4, 2019
A simulation of a minimal multilevel model w/ two levels: inning level, opposing pitcher level.
View simple_multilevel.R
library(tidyverse)
library(brms)
library(tidybayes)
library(ggthemes)
# Simulation settings
innings <- 9
games <- 162
number_of_opposing_pitchers <- 20 # purposly set low to illustrate variation
opposing_pitchers <- rnorm(number_of_opposing_pitchers, sd = 0.5) # a set of opposing pitcher effects
@statwonk
statwonk / rootograms.R
Last active Sep 28, 2019
A gist that shows how a rootogram helps find that the zero-inflated negative binomial was the data generating mechanism.
View rootograms.R
# install.packages("countreg", repos="http://R-Forge.R-project.org")
# https://www.fromthebottomoftheheap.net/2016/06/07/rootograms/
# https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference/countreg-Tools-for-count-data-regression
library(countreg)
rzinbinom(3e3, size = 4, mu = 20, pi = 0.05) -> x
table(x)
hist(x, col = "orange")
rootogram(glm(x ~ 1, family = "poisson")) # zeros under fit
@statwonk
statwonk / risk_adds_up.R
Last active Aug 11, 2019
Risk adds up. This code piece answers, "how quickly?" https://twitter.com/statwonk/status/1160542394544267265
View risk_adds_up.R
library(tidyverse)
expand.grid(
risk = seq(0.001, 0.02, 0.001),
units_of_exposure = seq_len(24*7)
) %>% as_tibble() %>%
mutate(total_risk = map2_dbl(risk, units_of_exposure, ~ 1 - (1 - .x)^(.y))) %>%
ggplot(aes(x = risk, y = units_of_exposure)) +
geom_raster(aes(fill = total_risk), alpha = 0.90) +
scale_fill_gradient2(name = "Total chance",
low = "white", mid = "white", high = "purple4", midpoint = 0.5,
@statwonk
statwonk / outages.R
Created Jul 12, 2019
Assuming outages are totally unrelated and occur at a rate of every other month, what's the chance of seeing more than four in a given week? About one in seven million.
View outages.R
# https://news.ycombinator.com/item?id=20356610
# chance of the cluster size being >5 when we can cluster events within a one-week period
# and the rate of events is one every 2 months (debatable, but feels right to me)?
# My re-write:
# probability of more than four outages in a given week given a rate of every other month
# _assuming_: totally unrelated infra (none share AWS, etc.)
ppois(4, lambda = 1/2/4.5, lower.tail = FALSE) # ~ 1/7M chance.
# Given that we've observed five in one-week,
@statwonk
statwonk / quantile_analysis.R
Created Jul 11, 2019
Time to 10% ending / converting / failure.
View quantile_analysis.R
library(tidyverse)
library(survival)
rweibull_cens <- function(n, shape, scale) {
# http://statwonk.com/weibull.html
a_random_death_time <- rweibull(n, shape = shape, scale = scale)
a_random_censor_time <- rweibull(n, shape = shape, scale = scale)
observed_time <- pmin(a_random_censor_time, a_random_death_time)
censor <- observed_time == a_random_death_time
tibble(time = observed_time, censor = censor)
You can’t perform that action at this time.