shaunagm/gist:c09d045c7a0751e7f0c4

## gistfile1.json
{
 "metadata": {
  "name": "Crowdstorming Project - S. Gordon-McKeon"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Introduction\n",
      "\n",
      "This notebook records my analysis for the [Crowdstorming Research: Many analysts, one dataset](https://docs.google.com/document/d/1uCF5wmbcL90qvrk_J27fWAvDcDNrO9o_APkicwRkOKc/edit) project.  You can find the datasets on the OpenScienceFramework [project page](https://osf.io/gvm2z/).  \n",
      "\n",
      "The research questions for this project are:\n",
      "\n",
      "__1: Are soccer referees more likely to give red cards to dark skin toned players than light skin toned players?__\n",
      "\n",
      "__2: Are soccer referees from countries high in skin-tone prejudice more likely to award red cards to dark skin toned players?__\n",
      "\n",
      "As part of the project, other reachers read a summarized version of the analysis plan and gave feedback.  That feedback has been incorporated into this document.  To see the earlier version of the document, before feedback, go here: ???.  Where I made changes in response to feedback, I have noted it with the phrase, \"In response to feedback\"."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Analysis Plan\n",
      "\n",
      "### Choosing the Model\n",
      "\n",
      "The dependent variable in this analysis is the number of red cards given.  In order to choose an appropriate statistical test, we'll need to look at the data as a whole.  For ease of (my) use, we'll load the data and create a [list of dicts](http://developer.nokia.com/community/wiki/Archived:List_of_Dictionaries_in_Python).  To check that we've done so successfully, we'll print out the first two rows."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "import csv\n",
      "\n",
      "data_reader = csv.DictReader(open('./data/new_crowdstorming.csv', 'r'))   # Open datafile\n",
      "data = []\n",
      "for c,row in enumerate(data_reader):    # Create list of dics from the DictReader object\n",
      "    data.append(row)\n",
      "print data[:2]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[{'weight': '72', 'nExp': '750', 'height': '177', 'player': 'Lucas Wilchez', 'meanExp': '0.396', 'rater2': '0.5', 'yellowReds': '0', 'leagueCountry': 'Spain', 'rater1': '0.25', 'nIAT': '712', 'seIAT': '0.000564112354334542', 'seExp': '0.0026964901062936', 'Alpha_3': 'GRC', 'yellowCards': '0', 'photoID': '95212.jpg', 'club': 'Real Zaragoza', 'birthday': '31.08.1983', 'goals': '0', 'ties': '0', 'defeats': '1', 'meanIAT': '0.326391469021736', 'refCountry': '1', 'refNum': '1', 'victories': '0', 'games': '1', 'position': 'Attacking Midfielder', 'redCards': '0', 'playerShort': 'lucas-wilchez'}, {'weight': '82', 'nExp': '49', 'height': '179', 'player': 'John Utaka', 'meanExp': '-0.204081632653061', 'rater2': '0.75', 'yellowReds': '0', 'leagueCountry': 'France', 'rater1': '0.75', 'nIAT': '40', 'seIAT': '0.0108748941063986', 'seExp': '0.0615044043187379', 'Alpha_3': 'ZMB', 'yellowCards': '1', 'photoID': '1663.jpg', 'club': 'Montpellier HSC', 'birthday': '08.01.1982', 'goals': '0', 'ties': '0', 'defeats': '1', 'meanIAT': '0.203374724564378', 'refCountry': '2', 'refNum': '2', 'victories': '0', 'games': '1', 'position': 'Right Winger', 'redCards': '0', 'playerShort': 'john-utaka'}]\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Success!  The next step is to check the distribution of the data so we can choose an appropriate model.  The key here is the dependent variable, which will be the number of redCards.  We can use matplotlib to take a look at the distribution of redCards awarded:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import matplotlib.pyplot as p  # Import for visualizations\n",
      "\n",
      "redCards = [float(row[\"redCards\"]) for row in data]\n",
      "\n",
      "# Create a histogram\n",
      "p.figure(1)\n",
      "n, bins, patches = p.hist(redCards,bins=5,range=(0,5))\n",
      "p.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "display_data",
       "png": "iVBORw0KGgoAAAANSUhEUgAAAYcAAAD9CAYAAABX0LttAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3X9M3Pd9x/HnuXdViDLAXOMjvaMigSOYmNpeDEZVvV1D\nAdtbwAsJhG4G144qOVpHvKlhm1RFrpRAk25t1pr9UeEG02nnyH8A6+ILjjPqrF2uCs66yq7mU3U2\ndwemMr9MZocr+LM/sL+N83UwYDuH49dD+kp37/t+vry/X+zvy99fPocxxiAiIvIBK1LdgIiILD8K\nBxERsVE4iIiIjcJBRERsFA4iImKjcBAREZt5w2Hnzp14PB6Ki4uvqn//+99n9erVrFmzhpaWFqve\n2tqK3++nsLCQvr4+qz4wMEBxcTF+v5/m5marPj09TX19PX6/n7KyMs6cOWN91tnZSUFBAQUFBRw4\ncOCGV1RERBbBzOPYsWPm+PHjZs2aNVbtzTffNF/+8pdNMpk0xhjz29/+1hhjzIkTJ8zatWtNMpk0\n0WjU5OXlmUuXLhljjCkpKTHhcNgYY8yWLVvM4cOHjTHG7Nu3z+zevdsYY0wwGDT19fXGGGNGR0fN\nAw88YMbHx834+Lj1WkREPh7zHjls2rSJlStXXlX753/+Z/7u7/4Ol8sFwL333gtAT08PDQ0NuFwu\ncnNzyc/PJxwOMzw8zNTUFKWlpQA0NjbS3d0NQG9vL01NTQDU1tZy9OhRAF5//XUqKyvJzMwkMzOT\niooKQqHQTYxEERGZj3OxAyKRCMeOHePv//7vueuuu/jOd77Dhg0bGBoaoqyszJrP5/ORSCRwuVz4\nfD6r7vV6SSQSACQSCXJycuYacTrJyMhgdHSUoaGhq8ZcWdaHORyOxbYvInLHMwv4jzEWfUF6ZmaG\n8fFx3n77bV566SXq6uqW1NzNYozRZAzPPfdcyntYDpO2g7aFtsX800ItOhx8Ph+PPfYYACUlJaxY\nsYJz587h9XqJxWLWfPF4HJ/Ph9frJR6P2+owdxQxODgIzIXO5OQkbrfbtqxYLHbVkYSIiNxaiw6H\nbdu28eabbwJw6tQpkskkn/nMZ6iuriYYDJJMJolGo0QiEUpLS8nOziY9PZ1wOIwxhq6uLmpqagCo\nrq6ms7MTgEOHDlFeXg5AZWUlfX19TExMMD4+zpEjR6iqqrpZ6ywiItcx7zWHhoYGfvrTnzI6OkpO\nTg7f+ta32LlzJzt37qS4uJhPf/rT1m2mRUVF1NXVUVRUhNPppL293bom0N7ezo4dO7h48SJbt25l\n8+bNAOzatYvt27fj9/txu90Eg0EAsrKy+OY3v0lJSQkAzz33HJmZmbdsI3wSBAKBVLewLGg7/J62\nxe9pWyyewyzmJNQy43A4FnUOTUTkTrfQ/aaekBYRERuFg4iI2CgcRETERuEgIiI2CgcREbFROIiI\niI3CQUREbBQOIiJio3AQEREbhYOIiNgoHERExEbhICIiNgoHERGxUTiIiIiNwkFERGwUDiIiYjPv\nN8HdDrKyclLdwrKwcmU6//u/v8TpvO1/pSKyDNz23wQHvwRWprqVlPvUp/KZmpogLS0t1a2IyDK2\n0G+C+wT8M9MLuFPdRMo5HDpDKCI3z7x7lJ07d+LxeCguLrZ99g//8A+sWLGCsbExq9ba2orf76ew\nsJC+vj6rPjAwQHFxMX6/n+bmZqs+PT1NfX09fr+fsrIyzpw5Y33W2dlJQUEBBQUFHDhw4IZWUkRE\nFsnM49ixY+b48eNmzZo1V9UHBwdNVVWVyc3NNaOjo8YYY06cOGHWrl1rksmkiUajJi8vz1y6dMkY\nY0xJSYkJh8PGGGO2bNliDh8+bIwxZt++fWb37t3GGGOCwaCpr683xhgzOjpqHnjgATM+Pm7Gx8et\n1x8GGDhnwNzxk9N5l7lw4cJ8v04REXOd3b5
l3iOHTZs2sXKl/Xz+X//1X/Piiy9eVevp6aGhoQGX\ny0Vubi75+fmEw2GGh4eZmpqitLQUgMbGRrq7uwHo7e2lqakJgNraWo4ePQrA66+/TmVlJZmZmWRm\nZlJRUUEoFLrRHBQRkQVa9DWHnp4efD4fn//856+qDw0NUVZWZr33+XwkEglcLhc+n8+qe71eEokE\nAIlEgpycubuNnE4nGRkZjI6OMjQ0dNWYK8u6theBKxdhA5cnEREB6O/vp7+/f9HjFhUOFy5c4IUX\nXuDIkSNWzaT8Zqdn0QVpEZFrCwQCBAIB6/3evXsXNG5Rt7j85je/4fTp06xdu5b777+feDzOww8/\nzMjICF6vl1gsZs0bj8fx+Xx4vV7i8bitDnNHEYODgwDMzMwwOTmJ2+22LSsWi111JCEiIrfWosKh\nuLiYkZERotEo0WgUn8/H8ePH8Xg8VFdXEwwGSSaTRKNRIpEIpaWlZGdnk56eTjgcxhhDV1cXNTU1\nAFRXV9PZ2QnAoUOHKC8vB6CyspK+vj4mJiYYHx/nyJEjVFVV3eRVFxGRjzLvaaWGhgZ++tOfMjo6\nSk5ODt/61rf46le/an0+9xDanKKiIurq6igqKsLpdNLe3m593t7ezo4dO7h48SJbt25l8+bNAOza\ntYvt27fj9/txu90Eg0EAsrKy+OY3v0lJSQkAzz33HJmZmTd3zUVE5CN9Ap6QPoeuOYDTmcb582N6\nQlpE5rXQJ6T1WK2IiNgoHERExEbhICIiNgoHERGxUTiIiIiNwkFERGwUDiIiYqNwEBERG4WDiIjY\nKBxERMRG4SAiIjYKBxERsVE4iIiIjcJBRERsFA4iImKjcBARERuFg4iI2CgcRETEZt5w2LlzJx6P\nh+LiYqv2jW98g9WrV7N27Voee+wxJicnrc9aW1vx+/0UFhbS19dn1QcGBiguLsbv99Pc3GzVp6en\nqa+vx+/3U1ZWxpkzZ6zPOjs7KSgooKCggAMHDtyUlRURkQUy8zh27Jg5fvy4WbNmjVXr6+szs7Oz\nxhhjWlpaTEtLizHGmBMnTpi1a9eaZDJpotGoycvLM5cuXTLGGFNSUmLC4bAxxpgtW7aYw4cPG2OM\n2bdvn9m9e7cxxphgMGjq6+uNMcaMjo6aBx54wIyPj5vx8XHr9YcBBs4ZMHf85HTeZS5cuDDfr1NE\nxFxnt2+Z98hh06ZNrFy58qpaRUUFK1bMDdu4cSPxeByAnp4eGhoacLlc5Obmkp+fTzgcZnh4mKmp\nKUpLSwFobGyku7sbgN7eXpqamgCora3l6NGjALz++utUVlaSmZlJZmYmFRUVhEKhmxaIIiIyP+eN\nDN6/fz8NDQ0ADA0NUVZWZn3m8/lIJBK4XC58Pp9V93q9JBIJABKJBDk5OXONOJ1kZGQwOjrK0NDQ\nVWOuLOvaXgTSLr8OXJ5ERASgv7+f/v7+RY9bcjg8//zzfPrTn+YrX/nKUhdxkzwLuFPcg4jI8hQI\nBAgEAtb7vXv3Lmjcku5WeuWVV3jttdf4l3/5F6vm9XqJxWLW+3g8js/nw+v1WqeePli/MmZwcBCA\nmZkZJicncbvdtmXFYrGrjiREROTWWnQ4hEIhXnrpJXp6erjrrrusenV1NcFgkGQySTQaJRKJUFpa\nSnZ2Nunp6YTDYYwxdHV1UVNTY43p7OwE4NChQ5SXlwNQWVlJX18fExMTjI+Pc+TIEaqqqm7G+oqI\nyELMd7X6ySefNPfdd59xuVzG5/OZjo4Ok5+fbz73uc+ZdevWmXXr1ll3GxljzPPPP2/y8vLMgw8+\naEKhkFV/5513zJo1a0xeXp75+te/btXff/9988QTT5j8/HyzceNGE41Grc/2799v8vPzTX5+vnnl\nlVc+8qq77lbS3UoisnDX2e1bHJdnvi05HA7gHLrmAE5nGufPj5GWlnb9mUXkjuVwOFjIbl9PSIuI\niI3CQUR
EbBQOIiJio3AQEREbhYOIiNgoHERExEbhICIiNgoHERGxUTiIiIiNwkFERGwUDiIiYqNw\nEBERG4WDiIjYKBxERMRG4SAiIjYKBxERsVE4iIiIjcJBRERsFA4iImIzbzjs3LkTj8dDcXGxVRsb\nG6OiooKCggIqKyuZmJiwPmttbcXv91NYWEhfX59VHxgYoLi4GL/fT3Nzs1Wfnp6mvr4ev99PWVkZ\nZ86csT7r7OykoKCAgoICDhw4cFNWVkREFsjM49ixY+b48eNmzZo1Vu0b3/iG+fa3v22MMaatrc20\ntLQYY4w5ceKEWbt2rUkmkyYajZq8vDxz6dIlY4wxJSUlJhwOG2OM2bJlizl8+LAxxph9+/aZ3bt3\nG2OMCQaDpr6+3hhjzOjoqHnggQfM+Pi4GR8ft15/GGDgnAFzx09O513mwoUL8/06RUTMdXb7lnmP\nHDZt2sTKlSuvqvX29tLU1ARAU1MT3d3dAPT09NDQ0IDL5SI3N5f8/HzC4TDDw8NMTU1RWloKQGNj\nozXmg8uqra3l6NGjALz++utUVlaSmZlJZmYmFRUVhEKhmxaIIiIyP+diB4yMjODxeADweDyMjIwA\nMDQ0RFlZmTWfz+cjkUjgcrnw+XxW3ev1kkgkAEgkEuTk5Mw14nSSkZHB6OgoQ0NDV425sqxrexFI\nu/w6cHkSERGA/v5++vv7Fz1u0eHwQQ6HA4fDcSOLuAmeBdwp7kFEZHkKBAIEAgHr/d69exc0btF3\nK3k8Hs6ePQvA8PAwq1atAuaOCGKxmDVfPB7H5/Ph9XqJx+O2+pUxg4ODAMzMzDA5OYnb7bYtKxaL\nXXUkISIit9aiw6G6uprOzk5g7o6ibdu2WfVgMEgymSQajRKJRCgtLSU7O5v09HTC4TDGGLq6uqip\nqbEt69ChQ5SXlwNQWVlJX18fExMTjI+Pc+TIEaqqqm7KCouIyALMd7X6ySefNPfdd59xuVzG5/OZ\n/fv3m9HRUVNeXm78fr+pqKi46i6i559/3uTl5ZkHH3zQhEIhq/7OO++YNWvWmLy8PPP1r3/dqr//\n/vvmiSeeMPn5+Wbjxo0mGo1an+3fv9/k5+eb/Px888orr3zkVXfdraS7lURk4a6z27c4Ls98W5q7\n3nEOXXMApzON8+fHSEtLu/7MInLHcjgcLGS3ryekRUTERuEgIiI2CgcREbFROIiIiI3CQUREbBQO\nIiJio3AQEREbhYOIiNgoHERExEbhICIiNgoHERGxUTiIiIiNwkFERGwUDiIiYqNwEBERG4WDiIjY\nKBxERMRG4SAiIjYKBxERsVlyOLS2tvLQQw9RXFzMV77yFaanpxkbG6OiooKCggIqKyuZmJi4an6/\n309hYSF9fX1WfWBggOLiYvx+P83NzVZ9enqa+vp6/H4/ZWVlnDlzZqmtiojIIi0pHE6fPs0Pf/hD\njh8/zq9+9StmZ2cJBoO0tbVRUVHBqVOnKC8vp62tDYCTJ09y8OBBTp48SSgU4umnn7a+4Hr37t10\ndHQQiUSIRCKEQiEAOjo6cLvdRCIR9uzZQ0tLy01aZRERuZ4lhUN6ejoul4sLFy4wMzPDhQsX+Oxn\nP0tvby9NTU0ANDU10d3dDUBPTw8NDQ24XC5yc3PJz88nHA4zPDzM1NQUpaWlADQ2NlpjPris2tpa\njh49esMrKyIiC+NcyqCsrCz+5m/+hs997nOkpaVRVVVFRUUFIyMjeDweADweDyMjIwAMDQ1RVlZm\njff5fCQSCVwuFz6fz6p7vV4SiQQAiUSCnJycuSadTjIyMhgbGyMrK+tD3bwIpF1+Hbg8iYgIQH9/\nP/39/Yset6Rw+M1vfsP3vvc9Tp8+TUZGBk888QQ//vGPr5rH4XDgcDiWsvhFehZwfww/R0Tk9hMI\nBAgEAtb7vXv3Lmjckk4rvfPOO3zhC1/A7XbjdDp57LHH+K//+i+ys7M5e
/YsAMPDw6xatQqYOyKI\nxWLW+Hg8js/nw+v1Eo/HbfUrYwYHBwGYmZlhcnLyGkcNIiJyKywpHAoLC3n77be5ePEixhjeeOMN\nioqKePTRR+ns7ASgs7OTbdu2AVBdXU0wGCSZTBKNRolEIpSWlpKdnU16ejrhcBhjDF1dXdTU1Fhj\nrizr0KFDlJeX34z1FRGRBVjSaaW1a9fS2NjIhg0bWLFiBX/4h3/I1772Naampqirq6Ojo4Pc3Fxe\nffVVAIqKiqirq6OoqAin00l7e7t1yqm9vZ0dO3Zw8eJFtm7dyubNmwHYtWsX27dvx+/343a7CQaD\nN2mVRUTkehzmyj2lt6G5gDmHrjmA05nG+fNjpKWlXX9mEbljORwOFrLb1xPSIiJio3AQEREbhYOI\niNgoHERExEbhICIiNgoHERGxUTiIiIiNwkFERGwUDiIiYqNwEBERG4WDiIjYKBxERMRG4SAiIjYK\nBxERsVE4iIiIjcJBRERsFA4iImKjcBARERuFg4iI2Cw5HCYmJnj88cdZvXo1RUVFhMNhxsbGqKio\noKCggMrKSiYmJqz5W1tb8fv9FBYW0tfXZ9UHBgYoLi7G7/fT3Nxs1aenp6mvr8fv91NWVsaZM2eW\n2qqIiCzSksOhubmZrVu38utf/5r/+Z//obCwkLa2NioqKjh16hTl5eW0tbUBcPLkSQ4ePMjJkycJ\nhUI8/fTT1hdc7969m46ODiKRCJFIhFAoBEBHRwdut5tIJMKePXtoaWm5CasrIiILsaRwmJyc5K23\n3mLnzp0AOJ1OMjIy6O3tpampCYCmpia6u7sB6OnpoaGhAZfLRW5uLvn5+YTDYYaHh5mamqK0tBSA\nxsZGa8wHl1VbW8vRo0dvbE1FRGTBnEsZFI1Guffee/nqV7/KL3/5Sx5++GG+973vMTIygsfjAcDj\n8TAyMgLA0NAQZWVl1nifz0cikcDlcuHz+ay61+slkUgAkEgkyMnJmWvycviMjY2RlZX1oW5eBNIu\nvw5cnkREBKC/v5/+/v5Fj1tSOMzMzHD8+HF+8IMfUFJSwjPPPGOdQrrC4XDgcDiWsvhFehZwfww/\nR0Tk9hMIBAgEAtb7vXv3Lmjckk4r+Xw+fD4fJSUlADz++OMcP36c7Oxszp49C8Dw8DCrVq0C5o4I\nYrGYNT4ej+Pz+fB6vcTjcVv9ypjBwUFgLowmJyevcdQgIiK3wpLCITs7m5ycHE6dOgXAG2+8wUMP\nPcSjjz5KZ2cnAJ2dnWzbtg2A6upqgsEgyWSSaDRKJBKhtLSU7Oxs0tPTCYfDGGPo6uqipqbGGnNl\nWYcOHaK8vPyGV1ZERBZmSaeVAL7//e/z53/+5ySTSfLy8vjRj37E7OwsdXV1dHR0kJuby6uvvgpA\nUVERdXV1FBUV4XQ6aW9vt045tbe3s2PHDi5evMjWrVvZvHkzALt27WL79u34/X7cbjfBYPAmrK6I\niCyEw1y5p/Q2NBcw59A1B3A60zh/foy0tLTrzywidyyHw8FCdvt6QlpERGwUDiIiYqNwEBERG4WD\niIjYKBxERMRG4SAiIjYKBxERsVE4iIiIjcJBRERsFA4iImKjcBARERuFg4iI2CgcRETERuEgIiI2\nCgcREbFROIiIiI3CQUREbBQOIiJio3AQERGbJYfD7Ows69ev59FHHwVgbGyMiooKCgoKqKysZGJi\nwpq3tbUVv99PYWEhfX19Vn1gYIDi4mL8fj/Nzc1WfXp6mvr6evx+P2VlZZw5c2apbYqIyBIsORxe\nfvllioqKcDgcALS1tVFRUcGpU6coLy+nra0NgJMnT3Lw4EFOnjxJKBTi6aeftr7cevfu3XR0dBCJ\nRIhEIoRCIQA6Ojpwu91EIhH27NlDS0vLja6niIgswpLCIR6P89prr/HUU09ZO/re3l6ampoAaGpq\noru7G4Cenh4aGhpwuVzk5uaSn59PO
BxmeHiYqakpSktLAWhsbLTGfHBZtbW1HD169MbWUkREFsW5\nlEF79uzhpZde4vz581ZtZGQEj8cDgMfjYWRkBIChoSHKysqs+Xw+H4lEApfLhc/ns+per5dEIgFA\nIpEgJydnrkGnk4yMDMbGxsjKyrpGNy8CaZdfBy5PIiIC0N/fT39//6LHLTocfvKTn7Bq1SrWr1//\nkT/Q4XBYp5tuvWcB98f0s0REbi+BQIBAIGC937t374LGLTocfv7zn9Pb28trr73G+++/z/nz59m+\nfTsej4ezZ8+SnZ3N8PAwq1atAuaOCGKxmDU+Ho/j8/nwer3E43Fb/cqYwcFBPvvZzzIzM8Pk5ORH\nHDWIiMitsOhrDi+88AKxWIxoNEowGOSRRx6hq6uL6upqOjs7Aejs7GTbtm0AVFdXEwwGSSaTRKNR\nIpEIpaWlZGdnk56eTjgcxhhDV1cXNTU11pgryzp06BDl5eU3a31FRGQBlnTN4YOunD7627/9W+rq\n6ujo6CA3N5dXX30VgKKiIurq6igqKsLpdNLe3m6NaW9vZ8eOHVy8eJGtW7eyefNmAHbt2sX27dvx\n+/243W6CweCNtikiIovgMFduN7oNzYXMOXTNAZzONM6fHyMtLe36M4vIHcvhcLCQ3b6ekBYRERuF\ng4iI2CgcRETERuEgIiI2CgcREbFROIiIiI3CQUREbBQOIiJio3AQEREbhYOIiNgoHERExEbhICIi\nNgoHERGxUTiIiIiNwkFERGwUDiIiYqNwEBERG4WDiIjYKBxERMRmSeEQi8X40pe+xEMPPcSaNWv4\np3/6JwDGxsaoqKigoKCAyspKJiYmrDGtra34/X4KCwvp6+uz6gMDAxQXF+P3+2lubrbq09PT1NfX\n4/f7KSsr48yZM0tdRxERWaQlhYPL5eK73/0uJ06c4O2332bfvn38+te/pq2tjYqKCk6dOkV5eTlt\nbW0AnDx5koMHD3Ly5ElCoRBPP/209QXXu3fvpqOjg0gkQiQSIRQKAdDR0YHb7SYSibBnzx5aWlpu\n0iqLiMj1LCkcsrOzWbduHQD33HMPq1evJpFI0NvbS1NTEwBNTU10d3cD0NPTQ0NDAy6Xi9zcXPLz\n8wmHwwwPDzM1NUVpaSkAjY2N1pgPLqu2tpajR4/e2JqKiMiCOW90AadPn+bdd99l48aNjIyM4PF4\nAPB4PIyMjAAwNDREWVmZNcbn85FIJHC5XPh8Pqvu9XpJJBIAJBIJcnJy5pp0OsnIyGBsbIysrKwP\ndfAikHb5deDyJCIiAP39/fT39y963A2Fw3vvvUdtbS0vv/wyf/AHf3DVZw6HA4fDcSOLX6BnAffH\n8HNERG4/gUCAQCBgvd+7d++Cxi35bqXf/e531NbWsn37drZt2wbMHS2cPXsWgOHhYVatWgXMHRHE\nYjFrbDwex+fz4fV6icfjtvqVMYODgwDMzMwwOTl5jaMGERG5FZYUDsYYdu3aRVFREc8884xVr66u\nprOzE4DOzk4rNKqrqwkGgySTSaLRKJFIhNLSUrKzs0lPTyccDmOMoauri5qaGtuyDh06RHl5+Q2t\nqIiILJzDXLltaBH+8z//kz/6oz/i85//vHXqqLW1ldLSUurq6hgcHCQ3N5dXX32VzMxMAF544QX2\n79+P0+nk5ZdfpqqqCpi7lXXHjh1cvHiRrVu3WrfFTk9Ps337dt59913cbjfBYJDc3Nyrm3c4gHPo\ntBI4nWmcPz9GWlra9WcWkTuWw+FgIbv9JYXDcqFw+D2Fg4gsxELDQU9Ii4iIjcJBRERsFA4iImKj\ncBARERuFg4iI2CgcRETERuEgIiI2CgcREbFROIiIiI3CQUREbBQOIiJio3AQEREbhYOIiNgoHERE\nxEbhICIiNgoHERGxUTiIiIiNwkFERGwUDp8QxlxKdQvLQn9/f6pbWDa0LX5P22LxlnU4hEIhCgsL\n8
fv9fPvb3051O8va7OwMd999Nw6H446eqqq2pPpXsWxoh/h72haL50x1Ax9ldnaWv/zLv+SNN97A\n6/VSUlJCdXU1q1evTnVry9Ql4PpfGv5Jl0w6Ut2CyCfCsj1y+MUvfkF+fj65ubm4XC6efPJJenp6\nUt2WiMgdYdkeOSQSCXJycqz3Pp+PcDh8jTk/8/E1tezpX80ADoe2wxV79+5NdQvLhrbF4izbcFjI\nX3BjdBpFRORWWLanlbxeL7FYzHofi8Xw+Xwp7EhE5M6xbMNhw4YNRCIRTp8+TTKZ5ODBg1RXV6e6\nLRGRO8KyPa3kdDr5wQ9+QFVVFbOzs+zatUt3KomIfEwc5jY9cR8KhXjmmWeYnZ3lqaeeoqWlJdUt\npcTOnTv593//d1atWsWvfvWrVLeTUrFYjMbGRn7729/icDj42te+xl/91V+luq2UeP/99/njP/5j\npqenSSaT1NTU0Nramuq2UmZ2dpYNGzbg8/n4t3/7t1S3k1K5ubmkp6fzqU99CpfLxS9+8Ytrzndb\nhsPs7CwPPvjgVc9A/Ou//usdeWTx1ltvcc8999DY2HjHh8PZs2c5e/Ys69at47333uPhhx+mu7v7\njvxzAXDhwgXuvvtuZmZm+OIXv8h3vvMdvvjFL6a6rZT4x3/8RwYGBpiamqK3tzfV7aTU/fffz8DA\nAFlZWfPOt2yvOcxHz0D83qZNm1i5cmWq21gWsrOzWbduHQD33HMPq1evZmhoKMVdpc7dd98NQDKZ\nZHZ29ro7g0+qeDzOa6+9xlNPPaU7HC9byHa4LcPhWs9AJBKJFHYky83p06d599132bhxY6pbSZlL\nly6xbt06PB4PX/rSlygqKkp1SymxZ88eXnrpJVasuC13dzedw+Hgy1/+Mhs2bOCHP/zhR853W24t\nPeQk83nvvfd4/PHHefnll7nnnntS3U7KrFixgv/+7/8mHo9z7NixO/L/F/rJT37CqlWrWL9+vY4a\nLvvZz37Gu+++y+HDh9m3bx9vvfXWNee7LcNBz0DIR/nd735HbW0tf/EXf8G2bdtS3c6ykJGRwZ/8\nyZ/wzjvvpLqVj93Pf/5zent7uf/++2loaODNN9+ksbEx1W2l1H333QfAvffey5/92Z995AXp2zIc\n9AyEXIsxhl27dlFUVMQzzzyT6nZS6ty5c0xMTABw8eJFjhw5wvr161Pc1cfvhRdeIBaLEY1GCQaD\nPPLIIxw4cCDVbaXMhQsXmJqaAuD//u//6Ovro7i4+Jrz3pbh8MFnIIqKiqivr79j70hpaGjgC1/4\nAqdOnSJ4dNheAAAAiUlEQVQnJ4cf/ehHqW4pZX72s5/x4x//mP/4j/9g/fr1rF+/nlAolOq2UmJ4\neJhHHnmEdevWsXHjRh599FHKy8tT3VbK3emnpEdGRti0aZP15+JP//RPqaysvOa8t+WtrCIicmvd\nlkcOIiJyaykcRETERuEgIiI2CgcREbFROIiIiI3CQUREbP4f3urraLz6icUAAAAASUVORK5CYII=\n"
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The distribution is very clearly right-skewed.  This makes the [Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution) a natural choice.  Another key aspect of the Poisson distribution is that the data's mean and variance are approximately equal.  We can test that too:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as n  # Import numpy library\n",
      "\n",
      "print n.var([float(row[\"redCards\"]) for row in data])\n",
      "print n.mean([float(row[\"redCards\"]) for row in data])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "0.0127439009136\n",
        "0.0125592352152"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can see that they're pretty close to equal.  So we know what distribution to use for the dependent variable.  Given that we have a number of independent variables to include, Poisson regression is a natural model choice.\n"
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Transforming the Data"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "\n",
      "\n",
      "What about the independent variables?  What do those look like?\n",
      "\n",
      "The independent variable of greatest interest is the skin tone ratings.  Let's make sure they're reliable.  To begin with, we apply [scipy's normality test](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html#scipy.stats.normaltest), which is based off the D'Agostino-Pearson normality test.  This will help us decide what inter-rater reliability measure to use."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import scipy.stats as s \n",
      "\n",
      "# Check IRR of ratings\n",
      "rater1 = [float(row[\"rater1\"]) for row in data if \"NA\" not in [row[\"rater1\"],row[\"rater2\"]]]\n",
      "rater2 = [float(row[\"rater2\"]) for row in data if \"NA\" not in [row[\"rater1\"],row[\"rater2\"]]]\n",
      "\n",
      "# Print results of scipy's normality test (based off D'Agostino-Pearson normality test)\n",
      "print s.stats.normaltest(rater1, axis=0)\n",
      "print s.stats.normaltest(rater2, axis=0)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "(17807.487876428768, 0.0)\n",
        "(16465.879873231894, 0.0)\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The second value is a p-value -- for this normality test, a value less than .05 means that normality cannot be assumed.  So we'll test inter-rater reliability with Spearman's rank correlation coefficient (rho), which does not assume normality."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print \"Spearman: \", s.spearmanr(rater1,rater2)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Spearman:  "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "(0.85764169209866958, 0.0)\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "86% is a reasonable correlation.  We can apply a simple transformation to average the ratings (see below).\n",
      "\n",
      "Before continuing on, I'm going to reformat this data into a pandas dataframe.  I'll do this by importing the pandas library, reading in the data again, and selecting out some important rows.  Again, we'll print out the first few rows to make sure it looks sensible."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas\n",
      "\n",
      "df = pandas.read_csv(\"./data/new_crowdstorming.csv\")\n",
      "keys = ['playerShort','refNum','games','goals','yellowCards','redCards','position','meanIAT','meanExp', 'rater1', 'rater2','club','leagueCountry']\n",
      "df = df[keys]\n",
      "print df[:3]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "     playerShort  refNum  games  goals  yellowCards  redCards  \\\n",
        "0  lucas-wilchez       1      1      0            0         0   \n",
        "1     john-utaka       2      1      0            1         0   \n",
        "2    abdon-prats       3      1      0            1         0   \n",
        "\n",
        "               position   meanIAT   meanExp  rater1  rater2             club  \\\n",
        "0  Attacking Midfielder  0.326391  0.396000    0.25    0.50    Real Zaragoza   \n",
        "1          Right Winger  0.203375 -0.204082    0.75    0.75  Montpellier HSC   \n",
        "2                   NaN  0.369894  0.588297     NaN     NaN     RCD Mallorca   \n",
        "\n",
        "  leagueCountry  \n",
        "0         Spain  \n",
        "1        France  \n",
        "2         Spain  \n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stderr",
       "text": [
        "/usr/local/lib/python2.7/dist-packages/pandas/io/excel.py:626: UserWarning: Installed openpyxl is not supported at this time. Use >=1.6.1 and <2.0.0.\n",
        "  .format(openpyxl_compat.start_ver, openpyxl_compat.stop_ver))\n"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Looks good!\n",
      "\n",
      "As mentioned above, we will average the two ratings:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Drop NA ratings and make an average\n",
      "df = df.dropna(subset=['rater1','rater2'])\n",
      "df['rating'] = (df['rater1'] + df['rater2']) / 2\n",
      "print df[:3]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "     playerShort  refNum  games  goals  yellowCards  redCards  \\\n",
        "0  lucas-wilchez       1      1      0            0         0   \n",
        "1     john-utaka       2      1      0            1         0   \n",
        "5   aaron-hughes       4      1      0            0         0   \n",
        "\n",
        "               position   meanIAT   meanExp  rater1  rater2             club  \\\n",
        "0  Attacking Midfielder  0.326391  0.396000    0.25    0.50    Real Zaragoza   \n",
        "1          Right Winger  0.203375 -0.204082    0.75    0.75  Montpellier HSC   \n",
        "5           Center Back  0.325185  0.538462    0.25    0.00        Fulham FC   \n",
        "\n",
        "  leagueCountry  rating  \n",
        "0         Spain   0.375  \n",
        "1        France   0.750  \n",
        "5       England   0.125  \n"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In addition to the independent variable of interest, there are a number of potential moderating variables.  These include:\n",
      "\n",
      "+  The number of yellow cards given (['yellowCards']).  Although it seems likely that number of yellow cards are correlated with number of red cards, the role of the yellow card as a \"caution\" or \"warning\" indicates the potential for a more complicated role.  For instance, players with lighter skin might disproportionately receive yellow cards and those with darker skin red cards for the same behavior.  This already exists as a variable, so no transformation was necessary.\n",
      "+  Number of goals by player (['goals']).  A player's success may enhance or attenuate bias.\n",
      "+  In response to feedback, I added player position as a variable.  It seemed plausible that some positions would not differ from each other in their influence on the variable redCards, for instance right midfielder vs left midfielder.  To test that intuition, I compared redCards per game per position:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "positions = df.groupby(['position'])\n",
      "positions['goals'].mean() / positions['games'].mean()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "pyout",
       "prompt_number": 8,
       "text": [
        "position\n",
        "Attacking Midfielder    0.162301\n",
        "Center Back             0.051606\n",
        "Center Forward          0.319852\n",
        "Center Midfielder       0.122160\n",
        "Defensive Midfielder    0.061550\n",
        "Goalkeeper              0.000303\n",
        "Left Fullback           0.038574\n",
        "Left Midfielder         0.126883\n",
        "Left Winger             0.231471\n",
        "Right Fullback          0.034196\n",
        "Right Midfielder        0.119648\n",
        "Right Winger            0.245183\n",
        "dtype: float64"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "From this, we can create a smaller number of positions by binning rates of redCards in the bins of: 0-.01; .01-.1; .1-.2; .2-.3; .3-.4.  This created 5 categories for this variable: goalkeeper (0.000307); defense, which included defensive midfielder (0.060537), center back (0.051464), left fullback (0.037560), and right fullback (0.033997); midfield, which included center midfielder (0.119374), left midfielder (0.125318) and right midfielder (0.120007) and attacking midfielder (0.159464); offense, which included left winger (0.228310) and right winger (0.239445); and center forward (0.316328).  If there was no position data for a dyad, it was dropped.  This new variable was called 'positionGroup':\n",
      "\n",
      "\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "before_drop = len(df)\n",
      "df = df.dropna(subset=['position'])  \n",
      "print \"rows dropped: \", before_drop - len(df)\n",
      "\n",
      "def f(x):\n",
      "    positionDict = {\"Attacking Midfielder\" : \"Midfield\", \"Center Back\" : \"Defense\", \"Center Forward\" : \"Offense\", \"Center Midfielder\" : \"Midfield\", \"Defensive Midfielder\" : \"Defense\", \"Goalkeeper\" : \"Goalkeeper\", \"Left Fullback\" : \"Defense\", \"Left Midfielder\" : \"Midfield\", \"Left Winger\" : \"Offense\", \"Right Fullback\" : \"Defense\", \"Right Midfielder\" : \"Midfield\", \"Right Winger\" : \"Offence\"}\n",
      "    return positionDict[x]\n",
      "df['positionGroup'] = df['position'].apply(f,1)\n",
      "\n",
      "print df[:10]['positionGroup']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "rows dropped:  8461\n",
        "0       Midfield\n",
        "1        Offence\n",
        "5        Defense\n",
        "6        Defense\n",
        "7        Defense\n",
        "8     Goalkeeper\n",
        "9        Defense\n",
        "10       Defense\n",
        "11       Offense\n",
        "12       Defense\n",
        "Name: positionGroup, dtype: object"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can see here that 17,726 rows were dropped, and that we successfully created the new column."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "+  Although it was suggested that I address the hierarchical nature of the data, perhaps by adding fixed effects for club and league, more information about the dataset revealed that the information given for club and league was incomplete, and represented only the first club/league the player played in.  I therefore chose not to include this variable.\n",
      "+  Measures of bias in country of origin of referee (['meanIAT'], ['meanExp']).  See question 2.  These were scaled to be more readable:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "before_drop = len(df)\n",
      "df = df.dropna(subset=['meanIAT', 'meanExp'])\n",
      "print \"rows dropped: \", before_drop - len(df)\n",
      "\n",
      "df['meanIAT'] = df['meanIAT'] * 100\n",
      "df['meanExp'] = df['meanExp'] * 100\n",
      "print df[:3]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "rows dropped:  146\n",
        "     playerShort  refNum  games  goals  yellowCards  redCards  \\\n",
        "0  lucas-wilchez       1      1      0            0         0   \n",
        "1     john-utaka       2      1      0            1         0   \n",
        "5   aaron-hughes       4      1      0            0         0   \n",
        "\n",
        "               position    meanIAT    meanExp  rater1  rater2  \\\n",
        "0  Attacking Midfielder  32.639147  39.600000    0.25    0.50   \n",
        "1          Right Winger  20.337472 -20.408163    0.75    0.75   \n",
        "5           Center Back  32.518515  53.846154    0.25    0.00   \n",
        "\n",
        "              club leagueCountry  rating positionGroup  \n",
        "0    Real Zaragoza         Spain   0.375      Midfield  \n",
        "1  Montpellier HSC        France   0.750       Offence  \n",
        "5        Fulham FC       England   0.125       Defense  \n"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "+  The number of games played (['games']).  A larger number of games provide more opportunities for a red card to be issued.  In response to feedback, this was made into an ['exposure variable'](http://www.theanalysisfactor.com/the-exposure-variable-in-poission-regression-models/).  Because it is given to the model as a one-dimensional array or vector, and not as part of the dataframe, I didn't make it into a new variable.  I calculated this last, once missing data had been dropped while exploring other variables."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "exposure_array = df['games'].values"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### How to Interpret Our Results\n",
      "\n",
      "To answer question 1, we'll look at the regression coefficient for the variable rating.\n",
      "\n",
      "To answer question 2a, we'll look at the regression coefficient of the interaction term between rating and meanIAT.\n",
      "\n",
      "To answer question 2b, we'll look at the regression coefficient of the interaction term between rating and meanExp.\n",
      "\n",
      "Okay, let's begin!\n"
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Analysis"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "import statsmodels.api as sm\n",
      "from patsy import dmatrices\n",
      "\n",
      "# Create and fit a Poisson model, then print the full regression summary\n",
      "def test_question(y, X, exposure_array):\n",
      "    poisson_mod = sm.Poisson(y, X, exposure=exposure_array)\n",
      "    poisson_res = poisson_mod.fit()\n",
      "    print(poisson_res.summary())\n",
      "    # print(np.exp(poisson_res.params))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This function takes the outcome and design matrices produced by patsy's dmatrices and fits a Poisson model to them, using exposure_array (defined above) as an exposure variable.  In the formula, we interact each of our variables with the variable of interest, rating (patsy expands `rating*x` into `rating + x + rating:x`), in order to determine the unique effects of rating when other variables are controlled for:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Define x and y \n",
      "y, X = dmatrices('redCards ~ rating + rating*goals + rating*positionGroup + rating*yellowCards + rating*meanIAT + rating*meanExp', data=df, return_type='dataframe')\n",
      "\n",
      "test_question(y, X, exposure_array)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Optimization terminated successfully.\n",
        "         Current function value: 0.062406\n",
        "         Iterations 11\n",
        "                          Poisson Regression Results                          \n",
        "==============================================================================\n",
        "Dep. Variable:               redCards   No. Observations:               116014\n",
        "Model:                        Poisson   Df Residuals:                   115996\n",
        "Method:                           MLE   Df Model:                           17\n",
        "Date:                Thu, 24 Jul 2014   Pseudo R-squ.:                 0.08945\n",
        "Time:                        12:18:53   Log-Likelihood:                -7240.0\n",
        "converged:                       True   LL-Null:                       -7951.2\n",
        "                                        LLR p-value:                2.312e-292\n",
        "======================================================================================================\n",
        "                                         coef    std err          z      P>|z|      [95.0% Conf. Int.]\n",
        "------------------------------------------------------------------------------------------------------\n",
        "Intercept                             -7.5088      0.790     -9.505      0.000        -9.057    -5.960\n",
        "positionGroup[T.Goalkeeper]            0.0383      0.133      0.288      0.774        -0.223     0.300\n",
        "positionGroup[T.Midfield]             -0.5870      0.106     -5.558      0.000        -0.794    -0.380\n",
        "positionGroup[T.Offence]              -0.1680      0.225     -0.746      0.456        -0.610     0.274\n",
        "positionGroup[T.Offense]              -0.4165      0.126     -3.314      0.001        -0.663    -0.170\n",
        "rating                                 2.1084      1.525      1.382      0.167        -0.881     5.098\n",
        "rating:positionGroup[T.Goalkeeper]    -0.9424      0.536     -1.759      0.079        -1.993     0.108\n",
        "rating:positionGroup[T.Midfield]       0.4554      0.277      1.644      0.100        -0.088     0.998\n",
        "rating:positionGroup[T.Offence]       -0.1720      0.427     -0.403      0.687        -1.009     0.664\n",
        "rating:positionGroup[T.Offense]        0.0743      0.259      0.287      0.774        -0.434     0.583\n",
        "goals                                 -0.0251      0.027     -0.926      0.355        -0.078     0.028\n",
        "rating:goals                          -0.0098      0.065     -0.149      0.881        -0.138     0.118\n",
        "yellowCards                            0.0588      0.026      2.249      0.025         0.008     0.110\n",
        "rating:yellowCards                    -0.1284      0.072     -1.779      0.075        -0.270     0.013\n",
        "meanIAT                                0.0607      0.026      2.309      0.021         0.009     0.112\n",
        "rating:meanIAT                        -0.0547      0.051     -1.071      0.284        -0.155     0.045\n",
        "meanExp                               -0.0001      0.004     -0.037      0.970        -0.007     0.007\n",
        "rating:meanExp                         0.0039      0.008      0.514      0.607        -0.011     0.019\n",
        "======================================================================================================"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Results"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The key results here are in the first column - the regression coefficients.  A regression coefficient can be interpreted as: \"for a one unit change in the predictor variable, the difference in the logs of expected counts is expected to change by the respective regression coefficient, given the other predictor variables in the model are held constant.\" ([source](http://www.ats.ucla.edu/stat/stata/output/stata_poisson_output.htm))  In order to better interpret this result, as well as to provide a standardized result that can be compared directly to other researchers' analyses, we'll transform this into an incidence rate ratio.  We can do this by exponentiating the coefficients using numpy's exp function:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# The same function as above, but displaying the exponentiated coefficients (IRRs).\n",
      "def test_question(y, X, exposure_array):\n",
      "    poisson_mod = sm.Poisson(y, X, exposure=exposure_array)\n",
      "    poisson_res = poisson_mod.fit()\n",
      "    # print(poisson_res.summary())\n",
      "    print(np.exp(poisson_res.params))\n",
      "\n",
      "test_question(y, X, exposure_array)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Optimization terminated successfully.\n",
        "         Current function value: 0.062406\n",
        "         Iterations 11\n",
        "Intercept                             0.000548\n",
        "positionGroup[T.Goalkeeper]           1.039091\n",
        "positionGroup[T.Midfield]             0.555981\n",
        "positionGroup[T.Offence]              0.845336\n",
        "positionGroup[T.Offense]              0.659383\n",
        "rating                                8.235204\n",
        "rating:positionGroup[T.Goalkeeper]    0.389701\n",
        "rating:positionGroup[T.Midfield]      1.576794\n",
        "rating:positionGroup[T.Offence]       0.841944\n",
        "rating:positionGroup[T.Offense]       1.077177\n",
        "goals                                 0.975218\n",
        "rating:goals                          0.990290\n",
        "yellowCards                           1.060577\n",
        "rating:yellowCards                    0.879504\n",
        "meanIAT                               1.062622\n",
        "rating:meanIAT                        0.946804\n",
        "meanExp                               0.999861\n",
        "rating:meanExp                        1.003863\n",
        "dtype: float64"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 14
    },
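    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Because exponentiation is monotonic, the same transformation can be applied to the confidence bounds to get an approximate interval on the IRR scale.  Here is a minimal sketch using only the standard library, with the coefficient and bounds for rating copied by hand from the summary table above.  The helper name `to_irr` is my own and is not part of statsmodels."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import math\n",
      "\n",
      "def to_irr(coef, ci_low, ci_high):\n",
      "    # Exponentiate a log-rate coefficient and both of its confidence bounds.\n",
      "    return math.exp(coef), math.exp(ci_low), math.exp(ci_high)\n",
      "\n",
      "# Values for 'rating' copied from the regression summary above.\n",
      "irr, low, high = to_irr(2.1084, -0.881, 5.098)\n",
      "print('IRR: %.3f (95%% CI: %.3f to %.3f)' % (irr, low, high))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },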
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Question 1"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can interpret these exponentiated coefficients (IRRs) as the multiplicative change in the rate of the dependent variable (redCards) given an increase of one unit in the predictor variable (rating).  Because interaction terms have been included, this is the change when all other variables are zero.\n",
      "\n",
      "This result, 8.235204, can be interpreted as, \"For each increase of 1 in the rating variable, a player is 8.235204 times more likely to receive a red card.\"  Keep in mind, though, that ratings range from 0 to 1.  An alternative interpretation is thus, \"A player whose skin was rated as darkest is 8.235204 times more likely than a player whose skin was rated lightest to receive a red card.\"\n",
      "\n",
      "It is also worth noting the standard error, confidence interval, etc. in the original summary:\n",
      "\n",
      "    coef    std err          z      P>|z|      [95.0% Conf. Int.]\n",
      "    2.1084      1.525      1.382      0.167        -0.881     5.098\n",
      "                                       \n",
      "The confidence interval includes zero (p = 0.167), so it is hard to have much confidence in this result."
     ]
    },
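    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To make the caveat about interaction terms concrete: with interactions in the model, the log-rate slope for rating at a given set of covariate values is the rating coefficient plus each interaction coefficient multiplied by the corresponding covariate value.  The sketch below uses coefficients copied by hand from the summary above and a hypothetical dyad in the reference position group.  The covariate values are made up for illustration (the example rows printed earlier show meanIAT values of roughly 20 to 33), and `rating_irr` is my own helper name."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import math\n",
      "\n",
      "# Coefficients copied from the regression summary above.\n",
      "b_rating = 2.1084\n",
      "interactions = {'goals': -0.0098, 'yellowCards': -0.1284, 'meanIAT': -0.0547, 'meanExp': 0.0039}\n",
      "\n",
      "def rating_irr(profile):\n",
      "    # Slope of the log rate with respect to rating at the given covariate values.\n",
      "    slope = b_rating + sum(interactions[name] * value for name, value in profile.items())\n",
      "    return math.exp(slope)\n",
      "\n",
      "# Hypothetical dyad in the reference position group; illustrative values only.\n",
      "profile = {'goals': 0, 'yellowCards': 0, 'meanIAT': 30.0, 'meanExp': 40.0}\n",
      "print('IRR for rating at this profile: %.3f' % rating_irr(profile))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },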
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Question 2a "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For this question, we want to look not just at the influence of ratings on redCards but the influence of meanIAT on the influence of ratings on redCards.  The interaction coefficient is therefore what we're interested in:\n",
      "\n",
      "    coef    std err          z      P>|z|      [95.0% Conf. Int.]\n",
      "    -0.0547      0.051     -1.071      0.284        -0.155     0.045\n",
      "\n",
      "The IRR is:\n",
      "\n",
      "    rating:meanIAT                        0.946804\n",
      "\n",
      "This result shows how the influence of ratings on redCards changes with a difference of 1 in meanIAT: an IRR below 1 means the estimated influence of ratings shrinks slightly as meanIAT rises, although the confidence interval includes zero."
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Question 2b"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For this question, we want to look not just at the influence of ratings on redCards but the influence of meanExp on the influence of ratings on redCards.  The interaction coefficient is therefore what we're interested in:\n",
      "\n",
      "    coef    std err          z      P>|z|      [95.0% Conf. Int.]\n",
      "    0.0039      0.008      0.514      0.607        -0.011     0.019\n",
      "\n",
      "The IRR is:\n",
      "\n",
      "    rating:meanExp                        1.003863\n",
      "\n",
      "This result shows how the influence of ratings on redCards changes with a difference of 1 in meanExp: an IRR just above 1 means the estimated influence of ratings grows very slightly as meanExp rises, although the confidence interval again includes zero."
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Conclusion"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this analysis, we asked two questions: whether soccer referees were more likely to give red cards to dark skin toned players than light skin toned players, and whether implicit or explicit skin-tone prejudice in the country of origin of the referee moderated this preference.  To test these questions, we used a dataset consisting of dyads of players and referees, the number of red cards given by referees to the players, ratings of the players' skin-tone, and several other variables of interest.  Given the heavily right-skewed count distribution of the red cards, we chose to use a Poisson regression as our model.  Number of games was used as an exposure variable, with goals, position, yellow cards, and the country-of-origin-wide bias measures meanIAT (implicit) and meanExp (explicit) included in the model.  Although the data was hierarchical, with players embedded in teams embedded in leagues, the information provided was deemed inadequate to use for a hierarchical model.  All dyads missing any of this data were excluded, leaving X total observations.\n",
      "\n",
      "We found an incidence rate ratio of 8.235204 for ratings, suggesting that a player whose skin was rated as darkest was 8.235204 times more likely than a player whose skin was rated lightest to receive a red card.  However, this difference was not significant, with a large standard error (1.525) and a wide confidence interval (-0.881, 5.098) around the regression coefficient found (2.1084).  Our analysis found a slight negative impact of meanIAT on the influence of skin-tone in the form of an IRR of 0.946804.  Again, it is difficult to have confidence in the result (coeff: -0.0547; std error: 0.051; p: 0.284; CI: -0.155, 0.045).  Finally, we found a slight positive impact of meanExp on the influence of skin-tone in the form of an IRR of 1.003863 (coeff: 0.0039; std error: 0.008; p: 0.607; CI: -0.011, 0.019).  These results do not allow us to infer an influence of skin-tone on the number of red cards given.\n",
      "\n",
      "This does not allow us to say that an effect does not exist: a different methodology, or a different dataset (perhaps one with IAT or Exp measures for individual referees, or more comprehensive team and country data allowing hierarchical modeling) might find better evidence."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}