{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np, matplotlib.pyplot as plt, pandas as pd\n",
"pd.set_option('display.max_rows', 8)\n",
"!date\n",
"\n",
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequsites:\n",
"\n",
"* SWC Intro to Python\n",
"* SWC Software Testing\n",
"\n",
"* Basic Matplotlib Graphics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Minimal, familiar example for IHME workers new to Bayesian methods\n",
"\n",
"Logistic regression... who has coded their own logistic regression before? Who has used logistic regression? Who has heard of logistic regression?\n",
"\n",
"[poll results go here]\n",
"\n",
"The _Unifying Political Methodology_ version of logistic regression looks like this:\n",
"\n",
"\\begin{align}\n",
"Y_i &\\sim \\text{Bernoulli}(\\pi_t),\\\\\n",
"\\pi_i &= \\text{logit}(\\beta_0 + \\beta_1 \\cdot X_i).\n",
"\\end{align}\n",
"\n",
"Gary King calls the first line is the _stochastic component_ and the second line the _systematic component_.\n",
"\n",
"## Exercise 1: Scientific Python Warm up\n",
"I always need to look up the formula for $\\text{logit}$. Let's do that now, and take a look at it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def logit(x):\n",
" # fill this in here\n",
" \n",
"logit(0) # what should this be?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# some minimal testing of our new logit function\n",
"\n",
"# write tests here, and if you do them first, you are doing test-driven development"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# plot this\n",
"xx = np.linspace(-10, 10)\n",
"yy = logit(xx)\n",
"plt.plot(...) # fill in these details\n",
"plt.xlabel(...) # label your plots---you might not remember what this was next time you look at it\n",
"plt.ylabel(...)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 2: \"forward\" simulation\n",
"\n",
"Is it helpful to have the bubble diagram of the logistic regression model? To be fully Bayesian, we should probably go beyond Gary's formulation, and write down priors explicitly.\n",
"\n",
"\\begin{align}\n",
"Y_i &\\sim \\text{Bernoulli}(\\pi_t),\\\\\n",
"\\pi_i &= \\text{logit}(\\beta_0 + \\beta_1 \\cdot X_i),\\\\\n",
"\\beta_0, \\beta_1 &\\sim \\text{Uninformative}.\n",
"\\end{align}\n",
"\n",
"(This is an \"improper prior\"; you decide if you consider such a model fully Bayesian.)\n",
"\n",
"How to make a graphical representation of this?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# if necessary, from command line do\n",
"# pip install daft\n",
"import daft"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 680.315x175.748 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Instantiate the PGM.\n",
"pgm = daft.PGM([12, 3.1], origin=[0.3, 1.2])\n",
"\n",
"# Add in the nodes.\n",
"pgm.add_node(daft.Node(\"beta0\", r\"$\\beta_0$\", 1.5, 4))\n",
"pgm.add_node(daft.Node(\"beta1\", r\"$\\beta_1$\", 2.5, 4))\n",
"pgm.add_node(daft.Node(\"pi_i\", r\"$\\pi_i$\", 2, 3))\n",
"pgm.add_node(daft.Node(\"X_i\", r\"$X_i$\", 3, 3, observed=True))\n",
"pgm.add_node(daft.Node(\"Y_i\", r\"$Y_i$\", 2, 2, observed=True))\n",
"\n",
"# Add in the edges.\n",
"pgm.add_edge(\"beta0\", \"pi_i\")\n",
"pgm.add_edge(\"beta1\", \"pi_i\")\n",
"pgm.add_edge(\"X_i\", \"pi_i\")\n",
"pgm.add_edge(\"pi_i\", \"Y_i\")\n",
"\n",
"# And a plate.\n",
"pgm.add_plate(daft.Plate([1.5, 1.4, 2, 2.1], label=r\"$i = 1, \\cdots, N$\",\n",
" shift=-0.1))\n",
"\n",
"# Render\n",
"pgm.render();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This sort of figure is sometimes called a \"DAG\" because it is a directed, acyclic graph, and one way to start figuring out what it means is to use it to simulate data starting from the \"source\" nodes and working your way in. Try it!\n",
"\n",
"What are the source nodes here?\n",
"\n",
"(Answer: $\\beta_0$, $\\beta_1$ and $X_i$ for $i=1, \\ldots, N$.)\n",
"\n",
"We are stopped before we start in this process, because the source nodes are $\\beta_0$ and $\\beta_1$ and above we gave them an improper prior distribution. Let's change that. What prior do you want to use?\n",
"\n",
"(One answer: $\\beta_0, \\beta_1 \\sim \\text{Normal}(0,1^2)$.)\n",
"\n",
"Now can you sample from that?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# generate a sample value for beta_0 and beta_1\n",
"beta_0 = ...\n",
"beta_1 = ..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now generate $\\pi_i$ with your $\\beta_0$ and $\\beta_1$ values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# this is a \"deterministic\" variable in the PyMC nomenclature\n",
"..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wait, something is missing here. We need a value for $X_i$ first."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# assign something to be X_i\n",
"X_i = ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# another solution --- helpful or distracting?\n",
"X = np.array([3,1,4,1,5,9])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# now we can generate a sample of pi_i\n",
"pi_i = ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# another solution - helpful or distracting\n",
"pi = logit(beta_0 + beta_1 * X)\n",
"pi"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To finish up this data generating simulation, we can now generate $Y_i$ from $\\pi_i$:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Y_i = ...\n",
"Y_i"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# another soluation --- helpful or distracting?\n",
"Y = np.random.binomial(n=1, p=pi)\n",
"Y"
]
},
{
"attachments": {
"image.png": {
"image/png": ""
}
},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Bayesian inference (inverse probability)\n",
"\n",
"Now that we have done the forward simulation, perhaps concepts with cohere better when we do Bayesian inference. This is sometimes called _the inverse problem_, because instead of starting from the source nodes of the DAG and working our way forward along the arrows, we will start with information on the sink nodes of the DAG (and the other grey shaded nodes) and work our way back.\n",
"\n",
"![image.png](attachment:image.png)\n",
"\n",
"Because of the semantics of Python, we still write our program from the source nodes forward, starting from the priors. I'm using PyMC2, because I think it is cleanest, but you should try PyMC3."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pymc as pm"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# priors\n",
"\n",
"beta_0 = pm.Normal(...)\n",
"beta_1 = pm.Normal(...)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Want to see what the priors look like? One way to do that is with MCMC sampling (soon to be described in more detail):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pm.MCMC([beta_0, beta_1]).sample(iter=10_000, burn=5_000, thin=5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.hist(beta_0.trace(), density=True)\n",
"# remember to label your plots for later\n",
"plt.title('Prior Distribution of $\\\\beta_0$');"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.hist(beta_1.trace(), density=True)\n",
"plt.title('Prior Distribution of $\\\\beta_1$');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now for the systematic part:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pi = beta_0 + beta_1 * X # it is pretty tricky that this works"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# oops but we forgot the logit link\n",
"pi = ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# failed solution\n",
"# too tricky... what does that error mean?\n",
"pi = logit(beta_0 + beta_1 * X)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# can you do it a differen way, that is clearer?\n",
"..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# one solution\n",
"@pm.deterministic\n",
"def pi(beta_0=beta_0, beta_1=beta_1, X=X):\n",
" return logit(beta_0 + beta_1*X)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And finally the stochastic part (sometimes also called the likelihood):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Y_obs = pm.Bernoulli('Y_obs', ...) # called Y_obs to signify that it is about \"observed\" data, \n",
" # and because var name Y is already taken by the data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ingredients for this model are now all defined. Want to see some things about them?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Y_obs.value"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Y_obs.logp"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Y_obs.parents"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Y_obs.children"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pi.children"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pi.parents"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pi.value"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"beta_0.children"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"beta_1.children"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"beta_0.value"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And here is how you can fit the model with MCMC:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Y_obs.parents"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pi.value"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Y_obs.value"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Y_obs.logp"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"beta_0.value"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"beta_1.value"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's refactor all of this into a function, make an `pm.MCMC` object out of it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"def model(X, Y):\n",
" # prior\n",
" beta_0 = pm.Normal('beta_0', mu=0, tau=1)\n",
" beta_1 = pm.Normal('beta_1', mu=0, tau=1)\n",
" \n",
" # systematic component\n",
" @pm.deterministic\n",
" def pi(beta_0=beta_0, beta_1=beta_1, X=X):\n",
" return logit(beta_0 + beta_1*X)\n",
" \n",
" # stochastic component (aka likelihood)\n",
" Y_obs = pm.Bernoulli('Y_obs', pi, value=Y, observed=True)\n",
" \n",
" return locals() # cool python affordance, see https://stackoverflow.com/questions/7969949/whats-the-difference-between-globals-locals-and-vars\n",
"model(X, Y)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"m = pm.MCMC(model(X,Y))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"m.logp"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"m.beta_0.value = 10"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"m.logp"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"m.sample(iter=20_000, burn=10_000, thin=10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.hist(m.beta_0.trace())\n",
"plt.title('Posterior Distribution of $\\\\beta_0$');"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.hist(m.beta_1.trace())\n",
"plt.title('Posterior Distribution of $\\\\beta_1$');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Questions?"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python (dismod_env)",
"language": "python",
"name": "dismod_env"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}