lgarrison/SAO_python_worksheet.ipynb

## SAO_python_worksheet.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SAO Python Tutorial\n",
    "7/14/2018\n",
    "\n",
    "Lehman Garrison (lgarrison@cfa.harvard.edu)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook shows how to do simple simulations with Python and NumPy.  It shows the \"naive\" way using Python `for` loops, then compares it to the fast solution using NumPy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Simulating events: coin flips"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start simple by simulating a million coin flips and counting the number of heads.\n",
    "\n",
    "The naive (\"pure Python\") to do this is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "N = 2000000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#%%timeit\n",
    "n_heads = 0\n",
    "for i in range(N):\n",
    "    flip = np.random.randint(0,2)\n",
    "    if flip == 1:\n",
    "        n_heads += 1\n",
    "print(n_heads)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should try uncommenting the `%%timeit` command, which will run the cell multiple times to estimate the runtime."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The way to do this in Numpy is to generate the whole array of flips, then sum the array.  This is much faster, because native Python \"`for`\" loops are very slow, while generating random numbers and summing arrays are fast in Numpy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#%%timeit\n",
    "flips = np.random.randint(0, 2, size=N)\n",
    "n_heads = np.sum(flips)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, try uncommenting the `%%timeit` command to see how much faster this way is."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the flips array, should be length `N` and filled with 0s and 1s:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "flips"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The fraction of heads should be close to 0.5:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.sum(flips)/len(flips)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What about a weighted coin toss?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "weighted_nheads = np.random.binomial(N, 0.75)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "weighted_nheads"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Advanced: computing $\\pi$ by generating random numbers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can compute the value of $\\pi$ by generating random points in a 2D square and counting the number of points that fall inside a circumscribed circle.  This gives us the area of the circle, from which we can compute $\\pi$.  This general technique of computing area or volume by generating random points is known as \"Monte Carlo integration\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's the naive (slow) way to write such a simulation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "N = int(1e6)  # Number of random points to use\n",
    "R = 1.0  # Radius of circle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#%%timeit\n",
    "N_inside = 0\n",
    "for i in range(N):\n",
    "    point = np.random.rand(2)*2*R - R  # (x,y) point between -R and R\n",
    "    \n",
    "    r = np.sqrt(point[0]**2 + point[1]**2)\n",
    "    \n",
    "    inside_circle = r < R\n",
    "    \n",
    "    if inside_circle:\n",
    "        N_inside += 1\n",
    "        \n",
    "frac_in_circle = N_inside/N\n",
    "area_of_square = (2*R)**2\n",
    "pi_estimate = frac_in_circle*area_of_square/R**2\n",
    "print(pi_estimate)\n",
    "print((pi_estimate - np.pi)/np.pi)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 1\n",
    "\n",
    "Can you re-write this simulation without a `for` loop?  Remember how we avoided the `for` loop in the coin flip example: we created a big array with all the flips, and then summed it up, rather than checking each flip one at a time.\n",
    "\n",
    "#### Bonus:\n",
    "What other tricks can you come up with to speed up this simulation?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your solution here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 2\n",
    "\n",
    "Write a function that takes a parameter `n_trials` that executes the above code `n_trials` times.  It's okay to use a `for` loop over the number of trials, since this will typically be small (~1000), while the number of points `N` will be more like 100000.  Return an array of all the `pi_estimate` values (this array should be of length `n_trials`).  Plot a histogram of this array.\n",
    "\n",
    "Also plot the true value of pi, and the mean value of all your trials."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your solution here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 3 (Bonus)\n",
    "The trials have some built-in scatter due to the random nature of Monte Carlo integration.  That's the spread of the histogram you just plotted.  But you can see that the mean is a better estimate than any one trial.\n",
    "\n",
    "1. How much more precise do you expect the mean to be than the individual trials?  If you're not sure, try the below experiment and then come back to this.\n",
    "\n",
    "1. Compute the standard deviation of the distribution of trials that you plotted above.\n",
    "\n",
    "1. Write a function that will repeat the above experiment of generating trials and taking the mean.  Record the mean each time.  Note: you can reduce the number of trials if you need this to run faster.  You should finish with an array of mean values.\n",
    "\n",
    "1. Compute the standard deviation of this distribution of means.\n",
    "\n",
    "1. How much smaller is the standard deviation of the means than the standard devation of the samples?\n",
    "\n",
    "1. If you wanted your \"error bars\" (standard deviation of the mean) to shrink by a factor of two, how many more trials would you need to run?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your solution here"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# SAO Python Tutorial\n",
	"7/14/2018\n",
	"\n",
	"Lehman Garrison (lgarrison@cfa.harvard.edu)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"This notebook shows how to do simple simulations with Python and NumPy. It shows the \"naive\" way using Python `for` loops, then compares it to the fast solution using NumPy."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Simulating events: coin flips"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"import numpy as np"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Let's start simple by simulating a million coin flips and counting the number of heads.\n",
	"\n",
	"The naive (\"pure Python\") to do this is:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"N = 2000000"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"#%%timeit\n",
	"n_heads = 0\n",
	"for i in range(N):\n",
	" flip = np.random.randint(0,2)\n",
	" if flip == 1:\n",
	" n_heads += 1\n",
	"print(n_heads)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"You should try uncommenting the `%%timeit` command, which will run the cell multiple times to estimate the runtime."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"The way to do this in Numpy is to generate the whole array of flips, then sum the array. This is much faster, because native Python \"`for`\" loops are very slow, while generating random numbers and summing arrays are fast in Numpy."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"#%%timeit\n",
	"flips = np.random.randint(0, 2, size=N)\n",
	"n_heads = np.sum(flips)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Again, try uncommenting the `%%timeit` command to see how much faster this way is."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Let's look at the flips array, should be length `N` and filled with 0s and 1s:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"scrolled": true
	},
	"outputs": [],
	"source": [
	"flips"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"The fraction of heads should be close to 0.5:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"np.sum(flips)/len(flips)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"What about a weighted coin toss?"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"weighted_nheads = np.random.binomial(N, 0.75)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"weighted_nheads"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Advanced: computing $\\pi$ by generating random numbers"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We can compute the value of $\\pi$ by generating random points in a 2D square and counting the number of points that fall inside a circumscribed circle. This gives us the area of the circle, from which we can compute $\\pi$. This general technique of computing area or volume by generating random points is known as \"Monte Carlo integration\"."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Here's the naive (slow) way to write such a simulation:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"N = int(1e6) # Number of random points to use\n",
	"R = 1.0 # Radius of circle"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"#%%timeit\n",
	"N_inside = 0\n",
	"for i in range(N):\n",
	" point = np.random.rand(2)2R - R # (x,y) point between -R and R\n",
	" \n",
	" r = np.sqrt(point[0]2 + point[1]2)\n",
	" \n",
	" inside_circle = r < R\n",
	" \n",
	" if inside_circle:\n",
	" N_inside += 1\n",
	" \n",
	"frac_in_circle = N_inside/N\n",
	"area_of_square = (2R)*2\n",
	"pi_estimate = frac_in_circlearea_of_square/R*2\n",
	"print(pi_estimate)\n",
	"print((pi_estimate - np.pi)/np.pi)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Exercise 1\n",
	"\n",
	"Can you re-write this simulation without a `for` loop? Remember how we avoided the `for` loop in the coin flip example: we created a big array with all the flips, and then summed it up, rather than checking each flip one at a time.\n",
	"\n",
	"#### Bonus:\n",
	"What other tricks can you come up with to speed up this simulation?"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# Your solution here"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Exercise 2\n",
	"\n",
	"Write a function that takes a parameter `n_trials` that executes the above code `n_trials` times. It's okay to use a `for` loop over the number of trials, since this will typically be small (~1000), while the number of points `N` will be more like 100000. Return an array of all the `pi_estimate` values (this array should be of length `n_trials`). Plot a histogram of this array.\n",
	"\n",
	"Also plot the true value of pi, and the mean value of all your trials."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# Your solution here"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Exercise 3 (Bonus)\n",
	"The trials have some built-in scatter due to the random nature of Monte Carlo integration. That's the spread of the histogram you just plotted. But you can see that the mean is a better estimate than any one trial.\n",
	"\n",
	"1. How much more precise do you expect the mean to be than the individual trials? If you're not sure, try the below experiment and then come back to this.\n",
	"\n",
	"1. Compute the standard deviation of the distribution of trials that you plotted above.\n",
	"\n",
	"1. Write a function that will repeat the above experiment of generating trials and taking the mean. Record the mean each time. Note: you can reduce the number of trials if you need this to run faster. You should finish with an array of mean values.\n",
	"\n",
	"1. Compute the standard deviation of this distribution of means.\n",
	"\n",
	"1. How much smaller is the standard deviation of the means than the standard devation of the samples?\n",
	"\n",
	"1. If you wanted your \"error bars\" (standard deviation of the mean) to shrink by a factor of two, how many more trials would you need to run?"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# Your solution here"
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.6.5"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 1
	}