/Hierarchical Bayesian movie ratings.ipynb

## Hierarchical Bayesian movie ratings.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a Bayesian hierarchical (pooling) model for movie ratings.  Each movie can be rated on a scale from zero to five.  Each movie is rated several times.  We want to find a smoothed distribution of ratings for each movie.\n",
    "\n",
    "We are going to learn a top-level prior distribution (hyperprior) on movie ratings from the data.  Each movie will then have its own prior that is smoothed by this top-level prior.  Another way of thinking about this is that the prior for ratings for each movie will be shrunk towards the group-level, or pooled, distribution.\n",
    "\n",
    "If a movie has an atypical rating distribution, this approach will shrink the ratings to something more in-line with what is expected.  Furthermore, this learned prior can be useful to bootstrap movies with few ratings to allow them to be meaningfully compared to movies with many ratings.\n",
    "\n",
    "The model is as follows:\n",
    "\n",
    "$\\gamma_{k=1...K} \\sim Gamma(\\alpha, \\beta)$\n",
    "\n",
    "$\\theta_{m=1...M} \\sim Dirichlet_M(c\\gamma_1, ..., c\\gamma_K)$\n",
    "\n",
    "$z_{m=1...M,n=1...N_m} \\sim Categorical_M(\\theta_m)$\n",
    "\n",
    "where:\n",
    "\n",
    "* $K$ number of movie rating levels (e.g. $K = 6$ implies ratings 0, ..., 5)\n",
    "* $M$ number of movies being rated\n",
    "* $N_m$ number of ratings for movie $m$ \n",
    "* $\\alpha = 1 / K$ in order to make the collection of gamma r.v.s [act as an exponential coefficient](http://tdunning.blogspot.com/2010/04/sampling-dirichlet-distribution-revised.html)\n",
    "* $\\beta$ rate parameter for the exponential top-level prior\n",
    "* $c$ concentration parameter dictating the strength of the top-level prior\n",
    "* $\\gamma_k$ top-level prior for rating level $k$\n",
    "* $\\theta_m$ movie-level prior for rating levels (multivariate with dimension = $K$)\n",
    "* $z_{mn}$ rating $n$ for movie $m$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# uses pymc3\n",
    "import numpy as np\n",
    "import pymc3 as pm\n",
    "import scipy.stats\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAgUAAAFkCAYAAACw3EhvAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAAPYQAAD2EBqD+naQAAIABJREFUeJzt3X+UXWV97/H3d/Jj8pMEkpAfFauoEAq9SAZqc1WURkis\nC9RWi4NeERSkUqVpWtRlLYq3UvUGLNYordaA4FjspVfstcYGtV6kITVDUDBB5VcIkJAEnJhkmEwm\n3/vH2cFkzGTm7MzMOXPyfq11FnOeeZ69v3tNOOez93723pGZSJIkNdW6AEmSVB8MBZIkCTAUSJKk\ngqFAkiQBhgJJklQwFEiSJMBQIEmSCoYCSZIEGAokSVLBUCBJkoCSoSAiLo+IhyOiMyJWRcQZh+j7\npYjYGxE9xX/3vX5cvmxJkjTYqg4FEXE+sBS4CjgNuBdYERHT+xjyPmAWMLv47/OAp4FbyxQsSZKG\nRlT7QKSIWAXcnZlXFO8DeAy4PjM/OYDxbwD+GXhhZj5WfcmSJGkoVHWkICLGAC3AHfvaspIqVgLz\nB7iYi4GVBgJJkurL6Cr7TwdGAZt7tW8GTuxvcETMBl4LvKWfftOAhcAjwLNV1ihJ0pFsHPACYEVm\nbqtmYLWh4HC9A3gG+Ho//RYCtwx5NZIkNa63Al+pZkC1oWAr0APM7NU+E9g0gPEXATdl5p5++j0C\ncPPNN3PSSSdVWWL9ee+7LuW8ezbxspuv5aiTXlzrcg7b4sWLue6662pdxqBppO1ppG0Bt6eeNdK2\nQGNtz7p163jb294GxXdpNaoKBZnZHRFrgAXA7fDcRMMFwPWHGhsRrwZeBHxxAKt6FuCkk05i3rx5\n1ZRYl46aPJnjeYZTT/otjp53Sq3LOWxTpkxpiL/LPo20PY20LeD21LNG2hZovO0pVH36vczpg2uB\n5UU4WA0sBiYAywEi4hpgTmZe2GvcO6lctbCuxDolSdIQqzoUZOatxT0JrqZy2mAtsDAztxRdZgHH\n7T8mIo4C3kjlngWSJKkOlZpomJnLgGV9/O6ig7RtByaVWZckSRoePvtgGPzhwtfWuoRB1draWusS\nBlUjbU8jbQu4PfWskbYFGm97yqr6jobDISLmAWvWrFnTEBM/nmm/j5Utr+c1a77eEBMNJUn1q729\nnZaWFoCWzGyvZqxHCiRJEmAokCRJheG+o2FVNm7cyNSpU2tdxiGNHz+e2bNn17oMSZIOW12Hgp/9\nrIe9e/u7+WHt9PTsYerUZ5g+fTpjxoypdTmSJB2Wug4Fs2b9Jscdd0Kty+jTjh0ddHf/vNZlSJI0\nKJxTIEmSAEOBJEkqGAokSRJgKJAkSQVDgSRJAgwFkiSpYCiQJEmAoUCSJBUMBZIkCTAUSJKkgqFA\nkiQBhgJJklQwFEiSJMBQIEmSCoYCSZIEGAokSVLBUCBJkgBDgSRJKhgKJEkSYCiQJEkFQ4EkSQIM\nBZIkqWAokCRJgKFAkiQVDAWSJAkwFEiSpEKpUBARl0fEwxHRGRGrIuKMfvqPjYi/johHIuLZiHgo\nIt5RqmJJkjQkRlc7ICLOB5YClwKrgcXAiog4ITO39jHsa8AM4CLgQWA2HqWQJKmuVB0KqISAGzLz\nJoCIuAx4HXAx8MnenSNiEfBK4PjM/EXRvKFcuZIkaahUtbceEWOAFuCOfW2ZmcBKYH4fw84Ffgi8\nPyI2RsQDEfGpiBhXsmZJkjQEqj1SMB0YBWzu1b4ZOLGPMcdTOVLwLPCGYhmfA44B3lnl+iVJ0hAp\nc/qgWk3AXuCCzNwBEBF/BnwtIt6TmV3DUIMkSepHtaFgK9ADzOzVPhPY1MeYJ4HH9wWCwjoggOdR\nmXh4UEuXLmbSpCkHtC1c2MqiRa1Vli1JUuNpa2ujra3tgLaOjo7Sy6sqFGRmd0SsARYAtwNERBTv\nr+9j2A+AN0XEhMzcVbSdSOXowcZDrW/JkuuYO3deNSVKknTEaG1tpbX1wB3l9vZ2WlpaSi2vzGWB\n1wKXRMTbI2Iu8HlgArAcICKuiYgb9+v/FWAb8KWIOCkizqRylcIXPXUgSVL9qHpOQWbeGhHTgaup\nnDZYCyzMzC1Fl1nAcfv13xkRZwOfAf6LSkD4J+DDh1m7JEkaRKUmGmbmMmBZH7+76CBtPwUWllmX\nJEkaHt5VUJIkAYYCSZJUMBRIkiTAUCBJkgqGAkmSBBgKJElSwVAgSZIAQ4EkSSoYCiRJEmAokCRJ\nBUOBJEkCDAWSJKlgKJAkSYChQJIkFQwFkiQJMBRIkqSCoUCSJAGGAkmSVDAUSJIkwFAgSZIKhgJJ\nkgQYCiRJUsFQIEmSAEOBJEkqGAokSRJgKJAkSQVDgSRJAgwFkiSpYCiQJEmAoUCSJBUMBZIkCTAU\nSJKkgqFAkiQBhgJJklQoFQoi4vKIeDgiOiNiVUSccYi+r4qIvb1ePRFxbPmyJUnSYKs6FETE+cBS\n4CrgNOBeYEVETD/EsAReAswqXrMz86nqy5UkSUOlzJGCxcANmXlTZq4HLgN2ARf3M25LZj6171Vi\nvZIkaQhVFQoiYgzQAtyxry0zE1gJzD/UUGBtRDwREd+OiP9eplhJkjR0qj1SMB0YBWzu1b6ZymmB\ng3kSeDfwh8AfAI8B34uIl1a5bkmSNIRGD/UKMvOnwE/3a1oVES+ichriwkONXbp0MZMmTTmgbeHC\nVhYtah30OiVJGmna2tpoa2s7oK2jo6P08qoNBVuBHmBmr/aZwKYqlrMaeHl/nZYsuY65c+dVsVhJ\nko4cra2ttLYeuKPc3t5OS0tLqeVVdfogM7uBNcCCfW0REcX7u6pY1EupnFaQJEl1oszpg2uB5RGx\nhsoe/2JgArAcICKuAeZk5oXF+yuAh4H7gXHAJcBZwNmHW7wkSRo8VYeCzLy1uCfB1VROG6wFFmbm\nlqLLLOC4/YaMpXJfgzlULl38EbAgM79/OIVLkqTBVWqiYWYuA5b18buLer3/FPCpMuuRJEnDx2cf\nSJIkwFAgSZIKhgJJkgQYCiRJUsFQIEmSAEOBJEkqGAokSRJgKJAkSQVDgSRJAgwFkiSpYCiQJEmA\noUCSJBUMBZIkCTAUSJKkgqFAkiQBhgJJklQwFEiSJMBQIEmSCoYCSZIEGAokSVLBUCBJkgBDgSRJ\nKhgKJEkSYCiQJEkFQ4EkSQIMBZIkqWAokCRJgKFAkiQVDAWSJAkwFEiSpIKhQJIkAYYCSZJUMBRI\nkiSgZCiIiMsj4uGI6IyIVRFxxgDHvTwiuiOivcx6JUnS0Kk6FETE+cBS4CrgNOBeYEVETO9n3BTg\nRmBliTolSdIQK3OkYDFwQ2belJnrgcuAXcDF/Yz7PHALsKrEOiVJ0hCrKhRExBigBbhjX1tmJpW9\n//mHGHcR8ELgo+XKlCRJQ210lf2nA6OAzb3aNwMnHmxARLwE+DjwiszcGxFVFylJkoZetaGgKhHR\nROWUwVWZ+eC+5oGOX7p0MZMmTTmgbeHCVhYtah28IiVJGqHa2tpoa2s7oK2jo6P08qoNBVuBHmBm\nr/aZwKaD9J8MnA68NCI+W7Q1ARERu4FzMvN7fa1syZLrmDt3XpUlSpJ0ZGhtbaW19cAd5fb2dlpa\nWkotr6o5BZnZDawBFuxri8r5gAXAXQcZsh04BXgpcGrx+jywvvj57lJVS5KkQVfm9MG1wPKIWAOs\npnI1wgRgOUBEXAPMycwLi0mIP9l/cEQ8BTybmesOp3BJkjS4qg4FmXlrcU+Cq6mcNlgLLMzMLUWX\nWcBxg1eiJEkaDqUmGmbmMmBZH7+7qJ+xH8VLEyVJqjs++0CSJAGGAkmSVDAUSJIkwFAgSZIKhgJJ\nkgQYCiRJUsFQIEmSAEOBJEkqGAokSRJgKJAkSQVDgSRJAgwFkiSpYCiQJEmAoUCSJBUMBZIkCTAU\nSJKkgqFAkiQBhgJJklQwFEiSJMBQIEmSCoYCSZIEGAokSVLBUCBJkgBDgSRJKhgKJEkSYCiQJEkF\nQ4EkSQIMBZIkqWAokCRJgKFAkiQVDAWSJAkwFEiSpIKhQJIkASVDQURcHhEPR0RnRKyKiDMO0ffl\nEXFnRGyNiF0RsS4i/rR8yZIkaSiMrnZARJwPLAUuBVYDi4EVEXFCZm49yJCdwGeAHxU/vwL4+4jY\nkZlfKF25JEkaVGWOFCwGbsjMmzJzPXAZsAu4+GCdM3NtZv5TZq7LzA2Z+RVgBfDK0lVLkqRBV1Uo\niIgxQAtwx762zExgJTB/gMs4rej7vWrWLUmShla1pw+mA6OAzb3aNwMnHmpgRDwGzCjGfyQzv1Tl\nuiVJ0hCqek7BYXgFMAn4XeATEfHzzPynYVy/JEk6hGpDwVagB5jZq30msOlQAzPz0eLH+yNiFvAR\n4JChYOnSxUyaNOWAtoULW1m0qLWKkiVJakxtbW20tbUd0NbR0VF6eVWFgszsjog1wALgdoCIiOL9\n9VUsahTQ3F+nJUuuY+7cedWUKEnSEaO1tZXW1gN3lNvb22lpaSm1vDKnD64FlhfhYN8liROA5QAR\ncQ0wJzMvLN6/B9gArC/GvwpYAny6VMWSJGlIVB0KMvPWiJgOXE3ltMFaYGFmbim6zAKO229IE3AN\n8AJgD/Ag8BeZ+feHUbckSRpkpSYaZuYyYFkfv7uo1/u/A/6uzHrqXUQTXV1w9933HbJf9wMPAbD2\nngcYs6t7OEo7qBe9aDazZ8+q2folSfVtOK8+aDgTJ05mz54X0tm555D99nTtAODZrmPZ0/kbw1Ha\nr+no2Ma0adsNBZKkPhkKDtOUKcf026drylPsAKZMOZrmY44d+qIOorNzJ1C7oxSSpPrnUxIlSRJg\nKJAkSQVDgSRJAgwFkiSpYCiQJEmAoUCSJBUMBZIkCTAUSJKkgqFAkiQBhgJJklQwFEiSJMBQIEmS\nCoYCSZIEGAokSVLBUCBJkgBDgSRJKhgKJEkSYCiQJEkFQ4EkSQIMBZIkqWAokCRJgKFAkiQVDAWS\nJAkwFEiSpIKhQJIkAYYCSZJUMBRIkiTAUCBJkgqGAkmSBBgKJElSwVAgSZKAkqEgIi6PiIcjojMi\nVkXEGYfo+8aI+HZEPBURHRFxV0ScU75kSZI0FKoOBRFxPrAUuAo4DbgXWBER0/sYcibwbeC1wDzg\nu8A3IuLUUhVLkqQhUeZIwWLghsy8KTPXA5cBu4CLD9Y5Mxdn5v/KzDWZ+WBmfgj4GXBu6aolSdKg\nqyoURMQYoAW4Y19bZiawEpg/wGUEMBl4upp1S5KkoVXtkYLpwChgc6/2zcCsAS7jL4CJwK1VrluS\nJA2h0cO5soi4APgwcF5mbh3OdUuSpEOrNhRsBXqAmb3aZwKbDjUwIt4C/D3wpsz87kBWtnTpYiZN\nmnJA28KFrSxa1DrggiVJalRtbW20tbUd0NbR0VF6eVWFgszsjog1wALgdnhujsAC4Pq+xkVEK/AF\n4PzM/NZA17dkyXXMnTuvmhIlSTpitLa20tp64I5ye3s7LS0tpZZX5vTBtcDyIhyspnI1wgRgOUBE\nXAPMycwLi/cXFL97H/BfEbHvKENnZm4vVbUkSRp0VYeCzLy1uCfB1VROG6wFFmbmlqLLLOC4/YZc\nQmVy4meL1z430sdljJIkafiVmmiYmcuAZX387qJe788qsw5JkjS8fPaBJEkCDAWSJKlgKJAkSYCh\nQJIkFQwFkiQJMBRIkqSCoUCSJAGGAkmSVDAUSJIkwFAgSZIKhgJJkgQYCiRJUsFQIEmSAEOBJEkq\nGAokSRJgKJAkSQVDgSRJAgwFkiSpYCiQJEmAoUCSJBVG17oADZ+enh527txZ6zL61dTUxPjx42td\nhiQdcQwFR4ixY5vZsOFpHntsfa1L6dfYsXD66S9mypQptS5Fko4ohoIjxIwZc+jqOqbWZQzIU0/d\nT3d3d63LkKQjjqHgCNLcPK7WJUiS6pgTDSVJEmAokCRJBUOBJEkCDAWSJKlgKJAkSYChQJIkFQwF\nkiQJMBRIkqSCoUCSJAGGAkmSVCgVCiLi8oh4OCI6I2JVRJxxiL6zIuKWiHggInoi4try5UqSpKFS\ndSiIiPOBpcBVwGnAvcCKiJjex5Bm4CngY8DaknVKkqQhVuZIwWLghsy8KTPXA5cBu4CLD9Y5Mx/N\nzMWZeTOwvXypkiRpKFUVCiJiDNAC3LGvLTMTWAnMH9zSjlzdGzew687v8ux9a8m9ew/aZ/fPH+CZ\nf/jMMFcmSWpk1T46eTowCtjcq30zcOKgVHQEy7172fKxD7Dj//7Lc22jf+M4pl/5ESbMP/OAvl0/\nW88z/3A9R1/y3uEuU5LUoKoNBcNq6dLFTJo05YC2hQtbWbSotUYVDa1f3vZVdvzrbUw+901MOOsc\nerY+RUfbcjZd8U6O/uM/4+iL/rjWJUqS6khbWxttbW0HtHV0dJReXrWhYCvQA8zs1T4T2FS6ij4s\nWXIdc+fOG+zF1q3t//JVJp61kBl/9TfPtU0+9w/Z+omreGbZUnqe2sz093+kdgVKkupKa2srra0H\n7ii3t7fT0tJSanlVzSnIzG5gDbBgX1tERPH+rlIV6DndGx9l/PxXHtAWo8cw40Mf55j3Xsn2/30L\nmz/4XnJPd40qlCQ1sjKnD64FlkfEGmA1lasRJgDLASLiGmBOZl64b0BEnAoEMAmYUbzfnZnrDq/8\nxtI0fgJ7d+446O+mvv1SRh19DFv+54fY1PFOJv7ewmGuTpLU6KoOBZl5a3FPgqupnDZYCyzMzC1F\nl1nAcb2G3QNk8fM84ALgUeD4MkU3qrEvPpHOVXcy9W3vOujvJ5/7JpomT+GpD/0pz/74nmGuTpLU\n6Erd0TAzl2XmCzJzfGbOz8wf7ve7izLz93r1b8rMUb1eBoJeJpy5gM7VP2D3wz/vs8/EV5/NrM98\niRg1ahgrkyQdCer66oMjzeTz3sz4lpcx6uhph+w3ft7v8LyvfpPuxx8bpsokSUcCQ0EdaRo3jrEv\nOmFAfUfPmsPoWXOGuCJJ0pHEUFCnck833Y8+zJ4tT5FdzxLN4xg941jG/OYLidFjal2eJKkBGQrq\nTPfGR3nmhr9l53+sJJ/trDRmQgQAMW48E89cwNGXvo8xz39hDSuVJDUaQ0Ed6Vp/P0+8+wJi1Cgm\nLTqP5pNPZfT0GURzM9nVxZ6tW+i67x523vEtdt35XWZ//haa555c67IlSQ3CUFBHtn3644yeNYc5\nN9zCqKnHHLzT69/MMZf/OU+8+61s+9trmPO5m4e3yGGye/duOjs7a11Gv8aMGcPo0f5vJKkx+GlW\nR7ru/xHTrvhA34GgMGrqMUx589vYdv0nhqmy4ZU5lh//+EngyVqX0q9p08bS0nIKUZzekaSRzFBQ\nR2LcOHp+8cyA+vb84mmiuXmIK6qNOXPm0t29u9Zl9Gvnzu388pdPkJmGAkkNwVBQRya++mx+8eV/\nYOyLXsLEs/q+jfHO76zgF1/+ApPOed0wVjd8Ro8ew+gRcIXF7t1dtS5BkgaVoaCOTLvig+x+8Gds\nvvJyRk2bQfPckxk1fQYxppns7qJn6xa61t9Pz9NbaT7lVKZd8cFalyxJaiCGgjrSNGkyc754KztX\nfpOd3/kWux/4CZ3tq391n4LpMxj30tOZuGARExe8lmgqdZdqSZIOylBQZyKCSWe/jklnN+apAUlS\n/XJXU5IkAR4pGJGyp4eeLZsBfP6BJGnQGApGoO7HHmXjm8+BpiaOv/untS5HktQgDAUjUNOkyUx6\n3Rufex6CJEmDwVAwAo2ePoNjP/KpWpchSWowTjQcofbu3MGeTU/UugxJUgMxFIxQHV+9kQ3nvarW\nZUiSGoihQJIkAc4pqCu//NfbBtx39wM/GcJKJElHIkNBHdny0SsrVxRkDmyAVx9IkgaRoaCONB01\nhbEnnMS0972/376//PrX2H5b2zBUJUk6UhgK6kjzyf+N7kceovmk3+637667vj8MFUmSjiRONKwj\nzSefyp4nH6fn6a399m2afJS3OJYkDSpDQR2Z+vZLOe72/6DpqCn99p3yR/+D59/+H8NQlSTpSOHp\ngzrSNH4CTeMn1LoMSdIRyiMFkiQJMBRIkqSCoUCSJAHOKZAO26ZNm4gRcCOpadOmMXbs2FqXIamO\nGQqkksaPn8imTeNYvbr/S0hrradnD6ed1snxxx9f61Ik1TFDwTD497u+yW/VuohB9K1vtbFoUWut\nyxg0Zbdn7Nhmnv/8k4egovL62paNG39ODvT22XWkra2N1tbG+bfWSNvTSNsCjbc9ZZUKBRFxOfDn\nwCzgXuC9mflfh+j/amApcDKwAfjrzLyxzLpHon//z39rqFCwYkVjhYJG2p5DbUtnZycbN24c5oqq\nN3r0aGbOnElENNwHdSNtTyNtCzTe9pRVdSiIiPOpfMFfCqwGFgMrIuKEzPy146gR8QLgX4FlwAXA\na4AvRMQTmfnv5UuXNFCTJh3Ngw9uAjpqXcohZSbNzV3Mnz+OqVOn1roc6YhT5kjBYuCGzLwJICIu\nA14HXAx88iD9/xh4KDOvLN4/EBGvKJZjKJCGwdSp05g6dVqty+hXT08Pmzevpbu7m66uLvbu3UtX\nV1ety+rTmDFjaGryIi41jqpCQUSMAVqAj+9ry8yMiJXA/D6G/S6wslfbCuC6atYtqfFFBHv2jGLt\n2g0AbNu2ne9//74aV9W3yZNhzpwZA+7f2dnJhg0bhrCivs2cOZPm5uaarFsjR7VHCqYDo4DNvdo3\nAyf2MWZWH/2PiojmzDzYbsA4gPXr2+ns3FllifVn+45f8BBdbFvfzujOZ2pdzmHbvn0b99zz/2pd\nxqBppO1phG3Zs2c3jz++B4AdO3bzwAP1eXVHd/dudu3aTlPT+gGP2bhxG7fcMvx/nwiYPXtwl/nY\nYxu58cZbBnehVGqdMKF52I/APPnkk9x2221VjZkxYwYTJ04coorKW7du3b4fx1U7NqqZkRwRs4HH\ngfmZefd+7Z8AzszMXztaEBEPAP+YmZ/Yr+21VOYZTDhYKIiIC4DB/9cmSdKR462Z+ZVqBlR7pGAr\n0APM7NU+E9jUx5hNffTf3sdRAqicXngr8AjwbJU1SpJ0JBsHvIDKd2lVqgoFmdkdEWuABcDtAFG5\nldsC4Po+hv0n8NpebecU7X2tZxtQVbqRJEnPuavMoDInba4FLomIt0fEXODzwARgOUBEXBMR+9+D\n4PPA8RHxiYg4MSLeA7ypWI4kSaoTVV+SmJm3RsR04GoqpwHWAgszc0vRZRZw3H79H4mI11G52uB9\nwEbgnZnZ+4oESZJUQ1VNNJQkSY3Lu25IkiTAUCBJkgp1Fwoi4vKIeDgiOiNiVUScUeuayoiIV0bE\n7RHxeETsjYjzal3T4YiID0bE6ojYHhGbI+JfIuKEWtdVRkRcFhH3RkRH8borIhbVuq7BEhEfKP7N\njcjJvBFxVVH//q+f1LqusiJiTkR8OSK2RsSu4t/evFrXVUbx2dz7b7M3Ij5T69rKiIimiPhYRDxU\n/G1+HhF/Weu6yoqISRHx6Yh4pNieOyPi9GqWUVehYL+HLV0FnEblCYwriomNI81EKpMw3wM0wsSN\nVwKfAV5G5aFWY4BvR8T4mlZVzmPA+4F5VG7b/R3g6xFxUk2rGgRFiL6Uyv87I9l9VCYyzyper6ht\nOeVExFTgB0AXsBA4CVgCjNRbm57Or/4ms4CzqXy+3VrLog7DB4B3U/mcngtcCVwZEX9S06rK+yKV\nWwS8FTiFyvOFVhY3HhyQuppoGBGrgLsz84rifVD5AL8+Mw/2sKURISL2Am/IzNtrXctgKYLaU1Tu\nZHlnres5XBGxDfjzzPxSrWspKyImAWuoPITsw8A9mflnta2qehFxFfD6zByRe9P7i4i/oXIH2FfV\nupahEBGfBn4/M0fqUcNvAJsy85L92v4Z2JWZb69dZdWLiHHAL4FzM/Nb+7X/EPhmZv7VQJZTN0cK\n9nvY0h372rKSWA71sCXVzlQqewhP17qQw1EcPnwLlXtt9HlDrRHis8A3MvM7tS5kELykOPX2YETc\nHBHH9T+kLp0L/DAibi1Ou7VHxLtqXdRgKD6z30pl73SkugtYEBEvAYiIU4GXA9+saVXljKbybKLe\ndwrupIojbWUenTxUyjxsSTVQHMH5NHBnZo7Ic70RcQqVELAvXb8xMwf+ZJs6UwSbl1I5vDvSrQLe\nATwAzAY+Anw/Ik7JzJH2hLTjqRy5WQr8NfA7wPUR0ZWZX65pZYfvjcAU4Mb+OtaxvwGOAtZHRA+V\nHeUPZeZXa1tW9TJzR0T8J/DhiFhP5bvzAio71T8b6HLqKRRo5FgG/BaVRD1SrQdOpfKh9ibgpog4\ncyQGg4h4HpWQ9prM7K51PYcrM/e/X/t9EbEaeBT4I2Cknd5pAlZn5oeL9/cWgfQyYKSHgouBf8vM\nvp57MxKcT+WL8y3AT6gE67+NiCdGaGh7G/CPVB5cuAdop/LIgJaBLqCeQkGZhy1pmEXE3wG/D7wy\nM5+sdT1lZeYe4KHi7T0R8TvAFVT26kaaFmAG0F4cxYHKUbcziwlTzVlPk4eqlJkdEfFT4MW1rqWE\nJ4F1vdrWAX9Qg1oGTUQ8n8qE4zfUupbD9Engmsz8WvH+/oh4AfBBRmBoy8yHgbOKCeBHZebmiPgq\nv/qs61fdzCko9nD2PWwJOOBhS6Ue7KDBVQSC1wNnZeaGWtczyJqA5loXUdJK4Lep7OWcWrx+CNwM\nnDqSAwE8N4HyxVS+YEeaH/Drpz9PpHLkYyS7mMrh6ZF47n1/E6jsjO5vL3X03VhGZnYWgeBoKle9\n/J+Bjq2nIwVQeUjS8qg8iXE1sJj9HrY0kkTERCofZPv23I4vJrE8nZmP1a6yciJiGdAKnAfsjIh9\nR3Q6MnNEPd46Ij4O/BuwAZhMZbLUq6g8vXPEKc6zHzC3IyJ2Atsys/deat2LiE8B36DyxfkbwEeB\nbqCtlnWVdB3wg4j4IJXL9l4GvAu45JCj6lixs/YOYHlm7q1xOYfrG8BfRsRG4H4qlykvBr5Q06pK\niohzqHznPAC8hMqRkJ9QxXdoXYWCATxsaSQ5HfgulRn6SWWiEVQm5Vxcq6IOw2VUtuN7vdovAm4a\n9moOz7HBbsQoAAAArklEQVRU/g6zgQ7gR8A5DTJrf5+RfHTgeVTOg04DtgB3Ar9bPFJ9RMnMH0bE\nG6lMaPsw8DBwxUicyLaf11B56N1Im99xMH8CfIzKlTvHAk8AnyvaRqIpwDVUwvTTwD8Df5mZvY+G\n9Kmu7lMgSZJqZ0SfN5EkSYPHUCBJkgBDgSRJKhgKJEkSYCiQJEkFQ4EkSQIMBZIkqWAokCRJgKFA\nkiQVDAWSJAkwFEiSpML/BwNuOMilNKxWAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x7f3821fbb850>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# show the exponential distribution coefficient prior for\n",
    "# the movie-level priors and the impact on the choice of\n",
    "# the rate parameter\n",
    "\n",
    "# when alpha (a) is 1, then the gamma distribution is an\n",
    "# exponential distribution with rate parameter 1 / scale\n",
    "samples = scipy.stats.gamma.rvs(a=1., scale=1., size=2000)\n",
    "\n",
    "# the exponential distribution mean is equal to its inverse\n",
    "# rate parameter\n",
    "mean = samples.mean()\n",
    "\n",
    "plt.hist(samples, normed=True, histtype=\"stepfilled\", alpha=.2)\n",
    "plt.axvline(mean, color=\"#AA0022\")\n",
    "plt.annotate(\"{:.2f}\".format(mean), xy=(mean,0), xycoords=\"data\" ,xytext=(5,10),\n",
    "    textcoords=\"offset points\", rotation=90, va=\"bottom\", fontsize=\"large\", color=\"#AA0022\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# this array stores the counts of each rating for each\n",
    "# movie where each row corresponds to a movie and each\n",
    "# column corresponds to counts for a rating at that level\n",
    "data = np.array([\n",
    "    [ 0, 0, 0, 0, 0, 2 ], # Movie 1 with 2 5-star ratings\n",
    "    [ 0, 0, 0, 0, 1, 4 ], # Movie 2 with 1 4-star rating and 4 5-star ratings\n",
    "    [ 0, 1, 1, 2, 4, 2 ], # ...\n",
    "    [ 0, 3, 3, 6, 12, 6 ],\n",
    "    [ 0, 0, 2, 4, 5, 1 ],\n",
    "    [ 0, 1, 3, 3, 2, 1 ],\n",
    "    [ 1, 0, 4, 4, 1, 2 ],\n",
    "])\n",
    "\n",
    "# rate hyperparameter for the exponential coefficient\n",
    "rate = 1.\n",
    "\n",
    "# concentration hyperparameter dictating how much effect\n",
    "# the top-level hyperprior has on individual movies\n",
    "concentration = 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|������������������������������| 3000/3000 [00:20<00:00, 149.70it/s]\n"
     ]
    }
   ],
   "source": [
    "(numberOfMovies, numberOfRatingLevels) = data.shape\n",
    "\n",
    "model = pm.Model()\n",
    "\n",
    "model.verbose = 0\n",
    "\n",
    "with model:\n",
    "    # top-level prior\n",
    "    gamma = []\n",
    "    \n",
    "    alpha = 1. / numberOfRatingLevels\n",
    "    \n",
    "    for k in range(numberOfRatingLevels):\n",
    "        gamma_k = pm.Gamma(\"gamma_%i\" % k, alpha=alpha, beta=rate)\n",
    "        \n",
    "        gamma_k = gamma_k * concentration\n",
    "        \n",
    "        gamma.append(gamma_k)\n",
    "        \n",
    "    # each movie\n",
    "    for m in range(numberOfMovies):\n",
    "        # sample the Dirichlet by sampling from gammas and normalizing\n",
    "        theta_m = []\n",
    "        \n",
    "        for k in range(0, numberOfRatingLevels):\n",
    "            theta_mk = pm.Gamma(\"theta_%i%i\" % (m, k), alpha=gamma[k], beta=1.)\n",
    "        \n",
    "            theta_m.append(theta_mk)\n",
    "            \n",
    "        theta_m = theta_m / np.sum(theta_m)\n",
    "\n",
    "        # rating counts for the movie\n",
    "        ratingCounts = data[m]\n",
    "        \n",
    "        # use a multinomial to achieve repeated categorical draws\n",
    "        phi_m = pm.Multinomial(\"phi_%i\" % m, n=ratingCounts.sum(), p=theta_m, observed=ratingCounts)\n",
    "        \n",
    "    # use the Metropolis sampler as the NUTS sampler doesn't like this model\n",
    "    step = pm.Metropolis()\n",
    "    \n",
    "    # obtain 3000 samples\n",
    "    trace = pm.sample(3000, step=step)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# get the mean sample values of the last 2000 samples\n",
    "sampleMeans = np.array(pm.df_summary(trace[-2000:])['mean'])\n",
    "\n",
    "# the first variables correspond to gamma and the rest\n",
    "# correspond to each theta_m\n",
    "sampleMeans = sampleMeans.reshape((numberOfMovies + 1, numberOfRatingLevels))\n",
    "\n",
    "# normalize these to be the parameters of a multinomial\n",
    "# distribution and in the case of gamma, it would be the\n",
    "# expected value for draws from a Dirichlet with these\n",
    "# parameters\n",
    "for i in range(numberOfMovies + 1):\n",
    "    sampleMeans[i] = sampleMeans[i] / sampleMeans[i].sum()\n",
    "\n",
    "# our infererred parameter estimates\n",
    "gamma_ = sampleMeans[0]\n",
    "\n",
    "theta_m_ = sampleMeans[1:]\n",
    "\n",
    "# some additional statistics for comparison\n",
    "normalizedRatingCounts = np.array(data, np.float64)\n",
    "\n",
    "normalizedRatingCountMeans = normalizedRatingCounts.sum(axis=0)\n",
    "\n",
    "normalizedRatingCountMeans = normalizedRatingCountMeans / normalizedRatingCountMeans.sum()\n",
    "\n",
    "for i in range(numberOfMovies):\n",
    "    normalizedRatingCounts[i] = normalizedRatingCounts[i] / normalizedRatingCounts[i].sum()\n",
    "\n",
    "normalizedRatingCountStds = normalizedRatingCounts.std(axis=0)\n",
    "\n",
    "ratingCountSums = data.sum(axis=0)\n",
    "\n",
    "# compute the average ratings\n",
    "levels = np.diag(np.arange(numberOfRatingLevels))\n",
    "\n",
    "maximumLikelihoodRatingAverages = np.dot(normalizedRatingCounts, levels).sum(axis=1)\n",
    "\n",
    "inferredRatingAverages = np.dot(theta_m_, levels).sum(axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Comparing the top-level prior parameter estimate and the maximum-likelihood estimate\n",
      "shows how the model is pooling the data.\n",
      "\n",
      "Original data (movies x rating counts):\n",
      "\n",
      "[[ 0  0  0  0  0  2]\n",
      " [ 0  0  0  0  1  4]\n",
      " [ 0  1  1  2  4  2]\n",
      " [ 0  3  3  6 12  6]\n",
      " [ 0  0  2  4  5  1]\n",
      " [ 0  1  3  3  2  1]\n",
      " [ 1  0  4  4  1  2]]\n",
      "\n",
      "Top-level prior parameters (the pooled estimate of the probability of chosing a rating):\n",
      "\n",
      "[ 0.02406627  0.07585432  0.17492069  0.22285698  0.250879    0.25142274]\n",
      "\n",
      "Maximum-likelihood probability of chosing a rating from the raw count data:\n",
      "\n",
      "[ 0.01234568  0.0617284   0.16049383  0.2345679   0.30864198  0.22222222]\n",
      "\n",
      "Total counts for each rating:\n",
      "\n",
      "[ 1  5 13 19 25 18]\n",
      "\n",
      "Standard deviation for the normalized rating counts across all movies:\n",
      "\n",
      "[ 0.02916059  0.04948717  0.12307474  0.13384256  0.15478022  0.34554174]\n",
      "\n",
      "Estimates for ratings 3, 4, and 5 show how the model accounts for lower / higher\n",
      "number of counts and lower / higher standard deviation in forming its pooled estimate.\n",
      "\n",
      "Estimates for ratings 0 and 1 show how the model accounts for low numbers of counts.\n"
     ]
    }
   ],
   "source": [
    "print \"\"\"\n",
    "Comparing the top-level prior parameter estimate and the maximum-likelihood estimate\n",
    "shows how the model is pooling the data.\"\"\"\n",
    "\n",
    "print \"\"\"\n",
    "Original data (movies x rating counts):\n",
    "\"\"\"\n",
    "print data\n",
    "\n",
    "print \"\"\"\n",
    "Top-level prior parameters (the pooled estimate of the probability of chosing a rating):\n",
    "\"\"\"\n",
    "print gamma_\n",
    "\n",
    "print \"\"\"\n",
    "Maximum-likelihood probability of chosing a rating from the raw count data:\n",
    "\"\"\"\n",
    "print normalizedRatingCountMeans\n",
    "\n",
    "print \"\"\"\n",
    "Total counts for each rating:\n",
    "\"\"\"\n",
    "print ratingCountSums\n",
    "\n",
    "print \"\"\"\n",
    "Standard deviation for the normalized rating counts across all movies:\n",
    "\"\"\"\n",
    "print normalizedRatingCountStds\n",
    "\n",
    "print \"\"\"\n",
    "Estimates for ratings 3, 4, and 5 show how the model accounts for lower / higher\n",
    "number of counts and lower / higher standard deviation in forming its pooled estimate.\n",
    "\n",
    "Estimates for ratings 0 and 1 show how the model accounts for low numbers of counts.\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Comparing the inferred estimates of the probability of chosing a rating for\n",
      "individual movies against the respective maximum-likelihood estimates is\n",
      "also informative.\n",
      "\n",
      "Rating count data for movies 1 and 2 below.  Movie 1 has less than half the\n",
      "ratings as movie 2 but all of its ratings are 5-star.  Movie 2 has more\n",
      "5-star ratings but also one 4-star rating.  The maximum-likelihood estimate\n",
      "would ignore the number of ratings and focus on the proportions.  The smoothed\n",
      "inferred estimate accounts for the number of ratings and also applies the\n",
      "pooled prior estimate of the rating probabilities.\n",
      "\n",
      "[[0 0 0 0 0 2]\n",
      " [0 0 0 0 1 4]]\n",
      "\n",
      "Maximum-likelihood rating probability estimates for movies 1 and 2:\n",
      "\n",
      "[[ 0.   0.   0.   0.   0.   1. ]\n",
      " [ 0.   0.   0.   0.   0.2  0.8]]\n",
      "\n",
      "Inferred rating probability estimates for movies 1 and 2:\n",
      "\n",
      "[[ 0.02144172  0.05540504  0.12788657  0.16527906  0.18168197  0.44830564]\n",
      " [ 0.01063829  0.04511137  0.09015204  0.11610496  0.22394176  0.51405158]]\n",
      "\n",
      "The inferred estimate places higher probability on chosing a 5-star\n",
      "rating for movie 2 versus movie 1 as opposed to the maximum-likelihood\n",
      "probability.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print \"\"\"\n",
    "Comparing the inferred estimates of the probability of chosing a rating for\n",
    "individual movies against the respective maximum-likelihood estimates is\n",
    "also informative.\"\"\"\n",
    "\n",
    "print \"\"\"\n",
    "Rating count data for movies 1 and 2 below.  Movie 1 has less than half the\n",
    "ratings as movie 2 but all of its ratings are 5-star.  Movie 2 has more\n",
    "5-star ratings but also one 4-star rating.  The maximum-likelihood estimate\n",
    "would ignore the number of ratings and focus on the proportions.  The smoothed\n",
    "inferred estimate accounts for the number of ratings and also applies the\n",
    "pooled prior estimate of the rating probabilities.\n",
    "\"\"\"\n",
    "\n",
    "print data[0:2]\n",
    "\n",
    "print \"\"\"\n",
    "Maximum-likelihood rating probability estimates for movies 1 and 2:\n",
    "\"\"\"\n",
    "print normalizedRatingCounts[0:2]\n",
    "\n",
    "print \"\"\"\n",
    "Inferred rating probability estimates for movies 1 and 2:\n",
    "\"\"\"\n",
    "print theta_m_[0:2]\n",
    "\n",
    "print \"\"\"\n",
    "The inferred estimate places higher probability on chosing a 5-star\n",
    "rating for movie 2 versus movie 1 as opposed to the maximum-likelihood\n",
    "probability.\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Rating count data for movies 3 and 4 below.  Movies 3 and 4 have identical\n",
      "rating count proportions (and identical maximum-likelihood estimates) but\n",
      "movie 4 has three times as many ratings as movie 3.  The higher number\n",
      "of ratings should push the inferred estimate of movie 4 closer to the\n",
      "maximum- likelihood estimate as it should start to dominate the prior.\n",
      "\n",
      "[[ 0  1  1  2  4  2]\n",
      " [ 0  3  3  6 12  6]]\n",
      "\n",
      "Maximum-likelihood rating probability estimates for movies 3 and 4:\n",
      "\n",
      "[[ 0.   0.1  0.1  0.2  0.4  0.2]\n",
      " [ 0.   0.1  0.1  0.2  0.4  0.2]]\n",
      "\n",
      "Inferred rating probability estimates for movies 3 and 4:\n",
      "\n",
      "[[ 0.00738755  0.09079506  0.12985281  0.20999552  0.3488494   0.21311968]\n",
      " [ 0.00366227  0.09739613  0.10971591  0.2088047   0.36784195  0.21257903]]\n",
      "\n",
      "The inferred estimate for movie 4 is closer to the maximum-likelihood\n",
      "estimate as we would expect given its higher number of overall counts.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print \"\"\"\n",
    "Rating count data for movies 3 and 4 below.  Movies 3 and 4 have identical\n",
    "rating count proportions (and identical maximum-likelihood estimates) but\n",
    "movie 4 has three times as many ratings as movie 3.  The higher number\n",
    "of ratings should push the inferred estimate of movie 4 closer to the\n",
    "maximum- likelihood estimate as it should start to dominate the prior.\n",
    "\"\"\"\n",
    "\n",
    "print data[2:4]\n",
    "\n",
    "print \"\"\"\n",
    "Maximum-likelihood rating probability estimates for movies 3 and 4:\n",
    "\"\"\"\n",
    "print normalizedRatingCounts[2:4]\n",
    "\n",
    "print \"\"\"\n",
    "Inferred rating probability estimates for movies 3 and 4:\n",
    "\"\"\"\n",
    "print theta_m_[2:4]\n",
    "\n",
    "print \"\"\"\n",
    "The inferred estimate for movie 4 is closer to the maximum-likelihood\n",
    "estimate as we would expect given its higher number of overall counts.\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "To compute an average rating for each movie, a maximum-likelihood\n",
      "approach could be used (in this case an average) or the inferred\n",
      "rating probabilities can be used.\n",
      "\n",
      "Raw movie rating count data for reference (movie x rating counts):\n",
      "\n",
      "[[ 0  0  0  0  0  2]\n",
      " [ 0  0  0  0  1  4]\n",
      " [ 0  1  1  2  4  2]\n",
      " [ 0  3  3  6 12  6]\n",
      " [ 0  0  2  4  5  1]\n",
      " [ 0  1  3  3  2  1]\n",
      " [ 1  0  4  4  1  2]]\n",
      "\n",
      "Maximum-likelihood estimate of rating for each movie:\n",
      "\n",
      "[ 5.          4.8         3.5         3.5         3.41666667  2.9\n",
      "  2.83333333]\n",
      "\n",
      "Inferred estimate of rating for each movie:\n",
      "\n",
      "[ 3.77527145  4.03975528  3.44148319  3.47750503  3.39474854  3.09424091\n",
      "  3.00101262]\n",
      "\n",
      "The inferred estimate for movie 2 is higher than movie 1 as\n",
      "expected compared to the maximum-likelihood estimates because\n",
      "of the higher number of overall ratings for movie 2.\n",
      "\n",
      "The inferred estimate for movie 4 moves closer to the\n",
      "maximum-likelihood estimate than movie 3 as expected because\n",
      "of the  higher number of ratings (the influence of the prior\n",
      "can be changed with the concentration parameter).\n",
      "\n",
      "The inferred estimate for movie 7 is increased relative to\n",
      "the maximum-likelihood estimate despite the presence of a\n",
      "zero rating.\n"
     ]
    }
   ],
   "source": [
    "print \"\"\"\n",
    "To compute an average rating for each movie, a maximum-likelihood\n",
    "approach could be used (in this case an average) or the inferred\n",
    "rating probabilities can be used.\"\"\"\n",
    "\n",
    "print \"\"\"\n",
    "Raw movie rating count data for reference (movie x rating counts):\n",
    "\"\"\"\n",
    "print data\n",
    "\n",
    "print \"\"\"\n",
    "Maximum-likelihood estimate of rating for each movie:\n",
    "\"\"\"\n",
    "print maximumLikelihoodRatingAverages\n",
    "\n",
    "print \"\"\"\n",
    "Inferred estimate of rating for each movie:\n",
    "\"\"\"\n",
    "print inferredRatingAverages\n",
    "\n",
    "print \"\"\"\n",
    "The inferred estimate for movie 2 is higher than movie 1 as\n",
    "expected compared to the maximum-likelihood estimates because\n",
    "of the higher number of overall ratings for movie 2.\"\"\"\n",
    "\n",
    "print \"\"\"\n",
    "The inferred estimate for movie 4 moves closer to the\n",
    "maximum-likelihood estimate than movie 3 as expected because\n",
    "of the  higher number of ratings (the influence of the prior\n",
    "can be changed with the concentration parameter).\"\"\"\n",
    "\n",
    "print \"\"\"\n",
    "The inferred estimate for movie 7 is increased relative to\n",
    "the maximum-likelihood estimate despite the presence of a\n",
    "zero rating.\"\"\""
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [conda root]",
   "language": "python",
   "name": "conda-root-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}