@rschutjens
Last active March 23, 2016 15:40
Inferential Statistics Project from Udacity
{
"cells": [
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from IPython.display import IFrame\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"import warnings\n",
"warnings.filterwarnings('ignore') # to hide an ugly warning raised by the boxplot\n",
"# The link and helper functions below read in data from a Google spreadsheet or chart. Google Sheets content is\n",
"# in general displayed exactly as in the spreadsheet, except for text cells that span more than one column.\n",
"\n",
"# base link for online access to the spreadsheet\n",
"linkss = 'https://docs.google.com/spreadsheets/d/1ARcUWUsxu5GyIQr65qafsJ5zWfNhLXCIW5zV0F8tTME/'\n",
"\n",
"# download links for xlsx version of spreadsheet\n",
"downloadlink = 'https://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=1ARcUWUsxu5GyIQr65qafsJ5zWfNhLXCIW5zV0F8tTME&exportFormat=xlsx'\n",
"\n",
"#table for boxplot in pandas\n",
"table = pd.read_excel(downloadlink, sheetname = 'SB', header = 0,\n",
" index_col = 0,\n",
" parse_cols = \"A, B, C, D, E, F, G\",\n",
" convert_float = False, skip_footer = 26)\n",
"\n",
"\n",
"def print_sheet(sslink, sheet, cellrange, width, height):\n",
" # prints the table on sheet, with cellrange\n",
" link = sslink + 'pubhtml?&single=true&gid=' + sheet + '&range=' + cellrange + '&widget=false&chrome=false'\n",
" return IFrame(link, width = width, height = height)\n",
"\n",
"def print_chart(sslink, chart, width, height):\n",
" # prints chart from spreadsheet \n",
" link = sslink + 'pubchart?oid=' + chart + '&format=interactive'\n",
" return IFrame(link, width = width, height = height)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Italian soccer league statistics 2014\n",
"At [datahub.io](https://datahub.io/dataset/italian-football-data-serie-a-b) you can find the Italian soccer league data for the 2014 season. It contains the data for every match played in the season. I will use the data set to perform some statistical tests as a final project for inferential statistics on Udacity.\n",
"\n",
"The project includes research questions and hypotheses, experimental design, descriptive and inferential statistics, and conclusions from statistical tests. I used Google Sheets to investigate the soccer database above; you can find the spreadsheet [here](https://docs.google.com/spreadsheets/d/1ARcUWUsxu5GyIQr65qafsJ5zWfNhLXCIW5zV0F8tTME/pubhtml). To download the spreadsheet, use this [link](https://docs.google.com/spreadsheets/d/1ARcUWUsxu5GyIQr65qafsJ5zWfNhLXCIW5zV0F8tTME/edit?usp=sharing) instead. The latter link also lets you see the functions in the cells.\n",
"\n",
"### General info\n",
"\n",
"Let's start with some general information about the data set. 20 teams participated in the 2014 league. Every team plays every other team twice (one home and one away match), so each team plays 38 matches, for a total of 380 matches. Of these 380 matches, the home team won 181, the away team won 109, and 90 ended in a draw.\n",
"\n",
"In total 1035 goals were scored, of which 470 (45%) were scored before halftime.\n",
"\n",
"Besides match scores, the database also contains information about cards given for fouls, shots on target, crowd attendance, and a whole lot of betting data. I decided to focus on match scores, but if you want to look into other statistics, you can find all of them described [here](http://football-data.co.uk/notes.txt)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.454106280193\n"
]
},
{
"data": {
"text/html": [
"\n",
" <iframe\n",
" width=\"100%\"\n",
" height=\"250\"\n",
" src=\"https://docs.google.com/spreadsheets/d/1ARcUWUsxu5GyIQr65qafsJ5zWfNhLXCIW5zV0F8tTME/pubhtml?&single=true&gid=1601794270&range=a1:b11&widget=false&chrome=false\"\n",
" frameborder=\"0\"\n",
" allowfullscreen\n",
" ></iframe>\n",
" "
],
"text/plain": [
"<IPython.lib.display.IFrame at 0x3782e30>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(470/1035.0)\n",
"print_sheet(linkss, '1601794270', 'a1:b11', '100%', 250)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a sense of how the teams performed I created a scoreboard from the raw data. The table below shows the wins, draws, losses, goals, goals against, and total points (a win gives 3 points, a draw 1, and a loss 0) per team. The data is sorted by number of points (and then by number of goals). This is the same kind of scoreboard you see when the season is discussed on sports channels. Because it gives a nice overview of how the teams performed, it is a good way to start analyzing the data.\n",
"\n",
"You can clearly see that the number one team, Juventus, won nearly all of their games in the season: 33 out of 38.\n",
"\n",
"See the [link](https://docs.google.com/spreadsheets/d/1ARcUWUsxu5GyIQr65qafsJ5zWfNhLXCIW5zV0F8tTME/edit?usp=sharing) mentioned before for the way I implemented the functions that create the table below."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <iframe\n",
" width=\"100%\"\n",
" height=\"500\"\n",
" src=\"https://docs.google.com/spreadsheets/d/1ARcUWUsxu5GyIQr65qafsJ5zWfNhLXCIW5zV0F8tTME/pubhtml?&single=true&gid=540906560&range=a27:g47&widget=false&chrome=false\"\n",
" frameborder=\"0\"\n",
" allowfullscreen\n",
" ></iframe>\n",
" "
],
"text/plain": [
"<IPython.lib.display.IFrame at 0x3782b10>"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print_sheet(linkss, '540906560', 'a27:g47', '100%', 500)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the match statistics more visible, a bar graph of wins/draws/losses per team gives a good summary of team performance over the whole season. As Google Sheets does not support boxplots natively, I imported the above table into a pandas dataframe to create one. The boxplot of the wins, draws, and losses is shown below the bar graph. This would be a good way to start comparing different seasons to each other.\n",
"\n",
"From the boxplots you can clearly see that there are 2 outliers for the number of wins. Comparing them to the above table shows that Juventus and Roma were the two teams that greatly outperformed the others. Another observation from the boxplots is that the middle 50% of teams (between $Q_1$ and $Q_3$) lie relatively close together compared to the top and bottom performers. At the end of the season (at least in this case), the top performers had gained a large lead over the middle 50% of teams (and inversely for the worst performers).\n",
"\n",
"Now that I have introduced the dataset, it is time to dig deeper with statistical analysis."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <iframe\n",
" width=\"100%\"\n",
" height=\"550\"\n",
" src=\"https://docs.google.com/spreadsheets/d/1ARcUWUsxu5GyIQr65qafsJ5zWfNhLXCIW5zV0F8tTME/pubchart?oid=1404670815&format=interactive\"\n",
" frameborder=\"0\"\n",
" allowfullscreen\n",
" ></iframe>\n",
" "
],
"text/plain": [
"<IPython.lib.display.IFrame at 0x3782af0>"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print_chart(linkss, '1404670815', '100%', 550)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAW0AAAEACAYAAAB4ayemAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAFWFJREFUeJzt3X+sZHV5x/H3h6XZgsje3Sos1cptjLZq3VwxWhO0jiIW\nUw1UG43YsmOprX8gBqOF1urdNbYBTKi/YtoG6l5isCQkBDEKS+o98kNBCnvZRWE1kdVi3bW4qItQ\nS+XpH3Nm97o7d+/cOTP3nO/5fl7JgTnnzsx95jz3PDvnOd/5jiICMzNLwzF1B2BmZsNz0TYzS4iL\ntplZQly0zcwS4qJtZpYQF20zs4QsW7QlrZV0l6QdknZJmi23z0p6WNK95XLW5MM1M8ubhhmnLen4\niHhc0hrgDuBC4A3AgYi4YsIxmplZaaj2SEQ8Xt5cCxwL9Cu9JhGUmZkNNlTRlnSMpB3AXuCWiLi7\n/NEFkhYkXSlp3cSiNDMzYMj2yME7SycC1wPvAf4beCQiQtJHgVMi4vzJhGlmZrDCog0g6UPAzxf3\nsiWdCtwYEZsG3N+Tm5iZjSAijmhBDzN65Bn91oek44AzgQclbVx0tzcD9x/lF7d2mZ2drT0GL85d\njkvb87eUY5cr2sApwJykY8oif21EfEnS1ZJmgKeAPcBfDfFcrbNnz566Q7AROXdpyzV/yxbtiNgF\nnDZg+3kTicjMzJbkT0RW1O126w7BRuTcpS3X/K34QuSKf4EUk/4dZmZtI4kY5UKkHV1RFHWHYCNy\n7tKWa/5ctM3MEuL2iJlZA7k9MiG5nqKZWT1ctCvatm1b3SHYiPwPbtpyzZ+LtplZQob5RKQdpiiK\ng//Kz83NMT09DUCn06HT6dQWl62Mc5W2XPPnoj2Cw4vzli1baovFzPLi9khFuc5/0Aa59kTbItf8\nuWhXNDMzU3cIZpYRj9M2M2sgj9M2M2sBF+2Kcu2rtYFzl7Zc8+eibWaWEPe0zcwayD1tM7MWcNGu\nKNe+Whs4d2nLNX8u2mZmCXFP28ysgdzTNjNrARftinLtq7WBc5e2XPO3bNGWtFbSXZJ2SNolabbc\nvl7Sdkm7Jd0sad3kwzUzy9tQPW1Jx0fE45LWAHcAFwJvAX4cEZdLuhhYHxGXDHise9pmZitUqacd\nEY+XN9fSm4M7gLOBuXL7HHDOGOI0M7OjGKpoSzpG0g5gL3BLRNwNnBwR+wAiYi9w0uTCbK5c+2pt\n4NylLdf8DfXNNRHxFPASSScC10t6Eb13279yt6Ue3+12D34l19TUFDMzMwe/+aW/41NdX1hYaFQ8\nXve619NcL4ri4BeF9+vlICsepy3pQ8DjwF8AnYjYJ2kjMB8RLxhwf/e0zcxWaOSetqRn9EeGSDoO\nOBN4APgC0C3vthm4YWzRmpnZQMP0tE8B5iUtAHcBN0fEl4DLgDMl7QbOAC6dXJjN1T+9sfQ4d2nL\nNX/L9rQjYhdw2oDt+4HXTSIoMzMbzHOPmJk1kOceMTNrARftinLtq7WBc5e2XPPnom1mlhD3tM3M\nGsg9bTOzFnDRrijXvlobOHdpyzV/LtpmZglxT9vMrIHc0zYzawEX7Ypy7au1gXOXtlzz56JtZpYQ\n97TNzBrIPW0zsxZw0a4o175aGzh3acs1fy7aZmYJcU/bzKyB3NM2M2sBF+2KPv7xj9cdgo0o155o\nW+SaPxftihYWFuoOwcwy4qJd0fT0dN0h2Ig6nU7dIVgFueZv2W9jtyMVRXHw1Gzr1q0Ht3c6nWz/\nkMxsdXj0SEXdbpdt27bVHYaNoCgK/yObsLbnb+TRI5KeLekrkr4paZek95TbZyU9LOnecjlrEoGb\nmdkhy77TlrQR2BgRC5JOAO4BzgbeBhyIiCuWeXyr32m3/V97M6vHUu+0l+1pR8ReYG95+zFJDwDP\n6j/vWKNMkAu2ma2mFY0ekTQNzAB3lZsukLQg
6UpJ68YcWxJyHSvaBs5d2nLN39CjR8rWyHXAe8t3\n3J8BPhIRIemjwBXA+YMe2+12Dw6Nm5qaYmZm5uA71P6OT3W9P067KfF43ettWZeqn8jPz8/XFv9K\n14uiODio4WhDiYcaPSLpWOCLwJcj4hMDfn4qcGNEbBrws1b3tM2sHlu29Ja2WqqnPWzRvhp4JCLe\nt2jbxrLfjaSLgJdFxLkDHuuibWZjJ0GbS0uVIX+nA+8AXitpx6LhfZdL2ilpAXg1cNHYo05A//TG\n0uPcpa6oO4BaDDN65A5gzYAf3TT+cMzM7Gj8iUgzS5LbI2Zm1ngu2hW5L5ou5y5tmzcXdYdQCxdt\nM0tSt1t3BPVwT9vMrIHc0zYzawEX7YrcF02Xc5e2XPPnom1mlhB/3dhRjGPCGgD39JupP2mPpako\nOuSYQl+INLMk+cM1NpJc+2pt4Nylrqg7gFq4aFfk7/Q1s9Xk9khFbT9FM2uqth97bo+YmbWAi3Zl\nRd0B2Ijc006b5x4xM0uI5x6Z1C9wT9vMbMXc056Q2dm6IzCznLhoV9TpFHWHYCNyTzttuebPRdvM\nLCHuaZtZkrZs6S1ttVRP20XbzJLU9kEAvhA5Ibn21drAuUtdUXcAtVi2aEt6tqSvSPqmpF2SLiy3\nr5e0XdJuSTdLWjf5cJvHc4+Y2Wpatj0iaSOwMSIWJJ0A3AOcDbwT+HFEXC7pYmB9RFwy4PGtbo+0\n/RTNrKnafuyN3B6JiL0RsVDefgx4AHg2vcI9V95tDjhnfOGamdkgK+ppS5oGZoA7gZMjYh/0Cjtw\n0riDS0NRdwA2Ive005br3CNDf91Y2Rq5DnhvRDwm6fATkyVPVLrdLtPT0wBMTU0xMzNz8Kue+gdO\nquuwQFE0Jx6vez2X9W63WfFUXS+Kgm3lRbJ+vRxkqCF/ko4Fvgh8OSI+UW57AOhExL6y7z0fES8Y\n8Fj3tM3MVqjqkL9/Bb7VL9ilLwDd8vZm4IZKESbKc4+Y2WoaZsjf6cA7gNdK2iHpXklnAZcBZ0ra\nDZwBXDrZUJvJc4+kq39qamnKNX/L9rQj4g5gzRI/ft14wzEzs6Pxx9jNLEmee2Ryv9hF28zGru2D\nADz3yITk2ldrA+cudUXdAdTCRbsizz1iZqvJ7ZGK2n6KZtZUbT/23B4xM2sBF+3KiroDsBG5p12f\nDRt675SrLFBUfo4NG+reEyvnom1mq+7RR3utjSrL/Hz153j00br3xMq5p11R2/tqZpPQlOOmKXEM\n4p72hHjuETNbTS7aFXnukXS5p522XPPnom1mlhD3tM1s1TWll9yUOAZxT9vMrAVctCvKta/WBs5d\n2nLNn4t2RZ57xMxWk3vaFTW5J2bWVE05bpoSxyDuaZuZtYCLdmVF3QHYiHLtibZFrvlz0TYzS4h7\n2hU1uSdm1lRNOW6aEscgS/W0l/029jbbsGE8s3zpiN06vPXrYf/+6jGYpSQQVDhuxhfHof+mIuv2\nyHimhyyymxqyLXLtiTaBqHjgRVCMYW5WJVawYYiiLekqSfsk7Vy0bVbSw5LuLZezJhummZnBED1t\nSa8EHgOujohN5bZZ4EBEXLHsL2hwT7sJ/awmxGC22pryd9+UOAYZeZx2RNwODDqJb0BHyswsL1V6\n2hdIWpB0paR1Y4soMe6Lpsu5S1uu+Rt19MhngI9EREj6KHAFcP5Sd+52u0xPTwMwNTXFzMwMnU4H\nOLTj61gPRFGeL3TKWAtWtr6wwvsfvj4PFMV8I/ZH29ZVZVjPIvPzzs+41/tHQN3xQEFR1L8/Op0O\nRVGwrZzMqF8vBxlqnLakU4Eb+z3tYX9W/tw97YbHYLbamvJ335Q4Bqk694hY1MOWtHHRz94M3F8t\nPDMzG8YwQ/6uAb4GPF/S9yW9E7hc0k5JC8CrgYsmHGdjHTrds9Q4d2nLNX/L9rQj4twBmz87gVjM\nzGwZWc89
0oR+VhNiMFttTfm7b0ocg3g+bbPDbNlSdwRmK+eiXVGufbU22Lq1qDsEqyDXY89F28ws\nIe5pu6edLe/7+jRl3zcljkHc0zYzawEX7Ypy7au1Q1F3AFZBrseei7Zla/PmuiMwWzn3tN3TNlt1\nTfm7b0ocg7inbWbWAi7aFeXaV2sD5y5tuebPRdvMLCHuabunbbbqmvJ335Q4BnFP2+wwnnvEUuSi\nXVGufbU28Nwjacv12HPRNjNLiHva7mlny/u+Pk3Z902JYxD3tM3MWsBFu6Jc+2rtUNQdgFWQ67Hn\nom3J2rChd3o76gLVHi/1YjBbTe5pu6edrCbsuybEkKKm7LemxDGIe9pmZi3gol1Rrn21NnDu6lW1\nNSUVlZ9j/fq698LKLVu0JV0laZ+knYu2rZe0XdJuSTdLWjfZMM2sTSKqL+N4nv37690Po1i2py3p\nlcBjwNURsancdhnw44i4XNLFwPqIuGSJx7un3fAYUtWEfdeEGHLV9n0/ck87Im4HHj1s89nAXHl7\nDjincoRmZrasUXvaJ0XEPoCI2AucNL6Q0uK+aLqcu9QVdQdQi2PH9DxHPUnpdrtMT08DMDU1xczM\nDJ1OBzh04KS6vrCwUOnxUFAUzXk9Xl/ZuvPn9XGtF0XBtm3bAA7Wy0GGGqct6VTgxkU97QeATkTs\nk7QRmI+IFyzxWPe0Gx5Dqpqw75oQQ662bGn39LpVx2mrXPq+AHTL25uBGypFV6Pqw47yG3Jk1gRt\nLthHM8yQv2uArwHPl/R9Se8ELgXOlLQbOKNcT854hh0V2Q05aov+qamlKdf8LdvTjohzl/jR68Yc\ni5mZLSPruUfGwT3N+jRh3zchBmsnzz1iZtYCLtqVFXUHYCPKtSfaFt1uUXcItXDRrmjz5rojMMvT\n3Nzy92kj97QtWU3oJzchhly1fd8v1dMe1ycizVZdoF/99EAtMRz6r9lqcHukIvdF6yOqDbIv5ucr\nD9SXC3aNiroDqIWLtplZQtzTtmQ1oafZhBhy5blHbCRt/qMxa7Jcjz0X7Yq2bi3qDsFG5OsRacs1\nfy7aZmYJcU+7Ivc069OEfd+EGKyd3NM2M2sBF+3KiroDyFq1L6Eo/CUWCfPcIzYSzz1Sn+pfYFH9\nOfwlFvXx3COT+gUt72lbutyPTlvb8+eetplZC7hoV5TrWNF2KOoOwCop6g6gFi7aZmYJcdGuqNPp\n1B2CjWh2tlN3CFZBrvnzhciK2j5pjZnVwxciJ8Rzj6TL1yPSlmv+Kn1zjaQ9wE+Bp4AnI+Ll4wjK\nzMwGq9QekfRd4KUR8ehR7tPq9kjbx4qaWT0m1R7RGJ7DzMyGVLXgBnCLpLslvWscAaWnqDsAG1Gu\nc1e0Ra75q9oeOSUifijpmcAtwAURcfth94nNmzczPT0NwNTUFDMzMweHyvUvJjRxXRrPV33Pz883\n4vV4/fD8FszP05h4vJ53/oqiYNu2bQBMT0+zdevWge2RsQ35kzQLHIiIKw7b3uqetqXL1yPS1vb8\njb2nLel4SSeUt58GvB64f/QQzcxsOVV62icDt0vaAdwJ3BgR28cTVjr6pzeWoqLuAKySou4AajHy\nOO2IeAiYGWMsZma2DA/Xq6h/QcHSk+vcFW2Ra/4894iZWQN57pEJcU87Xc5d2nLNn4u2mVlC3B4x\nM2sgt0fMzFrARbuiXPtqbZDr3BVtkWv+XLQtW3NzdUdgVeSaP/e0rbXGNeGX/36bKde5Ryp9c41Z\nk7nYWhu5PVKRe9rpcu5SV9QdQC1ctM3MEuKetpklacuW3tJWS/W0XbTNzBrIH66ZEPdF0+XcNZuk\nyksbuWibWSNFxFGX+fn5Ze/TRm6PmJk1kNsjZmYt4KJdkfui6XLu0pZr/ly0zcwS4p62mVkDuadt\nZtYClYq2pLMkPSjp25IuHldQKcm1r9YGzl3acs3fyEVb0jHAp4E/BF4EvF
3S744rsFQsLCzUHYKN\nyLlLW675q/JO++XAdyLiexHxJPBvwNnjCSsdP/nJT+oOwUbk3KUt1/xVKdrPAv5z0frD5TYzM5sQ\nX4isaM+ePXWHYCNy7tKWa/5GHvIn6RXAlog4q1y/BIiIuOyw+3m8n5nZCMY6NaukNcBu4Azgh8A3\ngLdHxANVgjQzs6WN/B2REfFLSRcA2+m1Wa5ywTYzm6yJfyLSzMzGxxciRyDpi5JOrDsOG0zSrKT3\n1R2HrYykA3XHkIKR2yM5i4g31h2DrYykNRHxy7rjsKPyaf8Q/E57AEnvL/v1SPpHSf9e3n6NpM9J\nekjSBkmnSvqWpH+RdL+kmyStLe97oaRvSlqQdE2drycHkj4oabekW4Hf6W3SfJm/bwAXSnqjpDsl\n3SNpu6Rnlo/d2T9zkvSIpD8tb89JOkPSCyXdJeneMp/Pre2FZkLSxyTtknSfpLeW2zZK+mqZh52S\nTi+3nynpa5L+Q9K1ko4vt19aHpcLki6v8/WM1XJf15PjAvw+cG15+1bgTmAN8GHgXcB3gQ3AqcD/\nAi8u73stcG55+wfAr5W3T6z7NbV5AU4D7gPWAk8HvgO8D5gHPr3ofusW3T4f+Fh5+zPAG+hNx3AX\n8M/l9m8DxwGfpDcyCnpnp2vrfs1tXICflf9/C3Bzefsk4HvAyWVO/6bcLuBpwG8AXwWOK7f/NfB3\n5fH54KLnbs0x6Hfag90DvFTS04FfAF8HXga8CriN3h9M30MRsWvR46bL2/cB10h6B+DT8sl6FXB9\nRPwiIg4AN9DLUdD7h7TvtyTdLGkn8H56RRrgduDVwB8A/wS8WNJvAvsj4gl6+f+gpA8A0xHxi1V5\nVfk6Hfg8QET8CCjoHX93A38u6cPApoj4OfAK4IXAHZJ2AOcBzwF+Cjwh6UpJfww8seqvYkJctAeI\niP8D9gBd4A56hfo1wHMj4sHD7r74AP4lh64T/BG9CbVOA+4uJ9iy1bH4H9WfL7r9KeCTEbEJeDfw\n6+X2W+kV/lfSe3f+CPAn9PJORHweeBPwP8CXJHUmGbwdQQARcRu9PP0A+GzZxhKwPSJOi4iXRMTv\nRcRfRu/6xcuB64A3AjfVFPvYuZAs7TZ678ZupfdO7N3AvQPud8QnlkrPiYivApcAJwInTCJIA3o5\nOkfS2vLs6E3l9sNzcyLwX+Xtzf2NEfEw8AzgeRGxh16++7lH0m9HxEMR8Sl67+I3TeqFZK6fr9uA\nt0k6przu8CrgG5KeA/woIq4CrqL3huhO4PT+dQZJx0t6nqSnAVMRcRO9tkprcubRI0u7Dfhb4OsR\n8YSkJ8pt8KtXuY+44i3pWOBz5cUtAZ+IiJ9NOuBcRcQOSdcCO4F99D6dGxyZm63AdZL2A1/hUCsL\negd//03MbcA/0CveAG+V9GfAk/Q+/fv3E3gZVuYrIq4vp8m4D3gK+EBE/EjSecAHJD0JHADOi4hH\nJHWBz5eDAIJeT/sAcIOk/tnURav8WibGH64xM0uI2yNmZglx0TYzS4iLtplZQly0zcwS4qJtZpYQ\nF20zs4S4aJuZJcRF28wsIf8P+f36j8ssL60AAAAASUVORK5CYII=\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x37b3070>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"table[['wins','draws','loses']].boxplot();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Home team advantage?\n",
"As with almost every sport, it is generally known that the home team has an advantage from playing on its home field. In soccer this is mitigated by making sure that every team plays every other team twice, one match at home and one away. Investigating the home field advantage is important to make sure that the competition is fair.\n",
"\n",
"One way of investigating whether this advantage shows up in the data is to perform a $\\chi^2$ test for goodness of fit. The expected values can be determined by looking at the total number of matches that were not draws, 290. If there were no home team advantage, the number of home wins and away wins would be expected to be the same, 145 each.\n",
"\n",
"When a match ends in a draw, neither the home team nor the away team is favored; both receive a draw on their seasonal record for the match. For the degrees of freedom we do take draws into consideration, as a draw is a viable category (some sports always produce a winner, but not soccer).\n",
"\n",
"Taking all of the above into account results in the contingency table below. The null hypothesis is given by the expected frequencies in the table. Performing the $\\chi^2$ test ($df = 2$) gives a value of $\\chi^2 = 17.88$, and $p < .001$. Because of the low value of $p$, we reject the null hypothesis that the win rate at home and away is the same.\n",
"\n",
"It seems clear that there is a home field advantage, but how large is this effect? The $\\phi$ coefficient is an effect size measure you can calculate from the $\\chi^2$ value: $\\phi_v =\\sqrt{\\frac{\\chi^2}{n}} = 0.217$, where $n = 380$ is the total number of matches. For 2 degrees of freedom this results in a medium size effect."
]
},
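{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a worked check of the test above, the statistic, $p$ value, and effect size can be reproduced by hand. This is a sketch that assumes observed counts of 181 home wins, 109 away wins, and 90 draws (380 matches minus the 290 decided ones), with draws expected to equal their observed count. For $df = 2$ the $\\chi^2$ survival function has the closed form $e^{-\\chi^2/2}$:\n",
"\n",
"$$\\chi^2 = \\frac{(181-145)^2}{145} + \\frac{(109-145)^2}{145} + \\frac{(90-90)^2}{90} = \\frac{2 \\cdot 1296}{145} \\approx 17.88$$\n",
"\n",
"$$p = e^{-17.88/2} \\approx 1.3 \\times 10^{-4}, \\qquad \\phi = \\sqrt{\\frac{17.88}{380}} \\approx 0.217$$"
]
},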
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <iframe\n",
" width=\"100%\"\n",
" height=\"75\"\n",
" src=\"https://docs.google.com/spreadsheets/d/1ARcUWUsxu5GyIQr65qafsJ5zWfNhLXCIW5zV0F8tTME/pubhtml?&single=true&gid=1855095297&range=a1:e3&widget=false&chrome=false\"\n",
" frameborder=\"0\"\n",
" allowfullscreen\n",
" ></iframe>\n",
" "
],
"text/plain": [
"<IPython.lib.display.IFrame at 0x3782a50>"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print_sheet(linkss, '1855095297', 'a1:e3', '100%', 75)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"More insight into the home field advantage can be gained by looking at the win percentage of teams for home and away matches separately. This can be done by separating the matches of each team into home and away matches, and calculating the win rate for both groups.\n",
"\n",
"A scatter plot of win percentage at home vs win percentage away is given below. The axes are chosen in such a way that the diagonal separates home advantage from away advantage. Points below the line away_win% = home_win% indicate that the team had a higher win rate for matches played at home. Only one team had an away advantage over the complete season. Another notable point is Juventus again: they had a perfect home win record."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <iframe\n",
" width=\"100%\"\n",
" height=\"375\"\n",
" src=\"https://docs.google.com/spreadsheets/d/1ARcUWUsxu5GyIQr65qafsJ5zWfNhLXCIW5zV0F8tTME/pubchart?oid=1168919608&format=interactive\"\n",
" frameborder=\"0\"\n",
" allowfullscreen\n",
" ></iframe>\n",
" "
],
"text/plain": [
"<IPython.lib.display.IFrame at 0x3782c30>"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print_chart(linkss, '1168919608', '100%', 375)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Results of a linear regression on the home and away win % are shown in the table below. To get a better picture of how well the linear regression estimates the slope, a confidence interval (CI) can be calculated; the 95% CI for the slope is $(0.53, 0.99)$. The data therefore only give a rough estimate of the slope (a higher confidence level would include a slope of 1 in the interval).\n",
"\n",
"Another way to look at the results is to consider the intercept. One can interpret it as follows: when a team has a 0% away win percentage, we still expect it to have a 7% home win percentage.\n",
"\n",
"$r^2$ gives the proportion of the variance in away win percentage that is explained by home win percentage, which gives a sense of how well the data are correlated. As $r^2 = .65$ there is indeed a correlation, but it does not tell the whole story. One contribution to the low value could be the limited number of data points: only 20 teams are present in the league.\n",
"\n",
"From both analyses we can conclude that there is indeed a home field advantage in soccer. The regression is especially nice for getting a sense of effect size, but the parameter estimates need to be more accurate. Luckily, the home and away win % estimates could easily be made more accurate by combining the data of other seasons."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <iframe\n",
" width=\"100%\"\n",
" height=\"400\"\n",
" src=\"https://docs.google.com/spreadsheets/d/1ARcUWUsxu5GyIQr65qafsJ5zWfNhLXCIW5zV0F8tTME/pubhtml?&single=true&gid=558283501&range=q24:r41&widget=false&chrome=false\"\n",
" frameborder=\"0\"\n",
" allowfullscreen\n",
" ></iframe>\n",
" "
],
"text/plain": [
"<IPython.lib.display.IFrame at 0x37825b0>"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print_sheet(linkss, '558283501', 'q24:r41', '100%', 400)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Are matches decided before half time?\n",
"Soccer matches, unlike many sports that are popular in the USA, have very low scores. As matches are decided by only a few goals, it is possible that coming back from being behind is very hard. To get a sense of whether this is true, we can compare the state of a match at halftime and at full time. Just as a match ends in a win, a loss, or a draw, we use the halftime score to decide whether the match would have been a win, loss, or draw at that point.\n",
"\n",
"These statistics are available in the data, and the result is a 3 by 3 contingency table (H means home team win, A means away team win, D means draw). Every match is counted only once in the table (i.e. every match has exactly one of the 9 combinations). The marginal totals on the right of the contingency table are exactly the totals of the table used for home team advantage.\n",
"\n",
"When the ratio of halftime home team to away team wins is compared to the same ratio for fulltime results, one gets almost the same value ($181/109=1.66$ at full time, $136/85=1.6$ at halftime). As the home field advantage of course influences the whole match, we do expect to see such a ratio for halftime scores as well.\n",
"\n",
"From the table we see that most matches are a draw at halftime, 159 (most of which are 0-0); of these, 101 end in a win for either team at the end of the match. If we look at home team wins for both halftime and full time, we see that teams keep their score advantage throughout the full match with a high frequency. Conversely, matches where a team gave away a score advantage and lost are very rare, only 14 matches (combinations H-A and A-H). The last two observations already indicate that it is hard for teams to come back from being behind at halftime, and that halftime and fulltime results are not independent."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <iframe\n",
" width=\"100%\"\n",
" height=\"250\"\n",
" src=\"https://docs.google.com/spreadsheets/d/1ARcUWUsxu5GyIQr65qafsJ5zWfNhLXCIW5zV0F8tTME/pubhtml?&single=true&gid=724456026&range=a1:i11&widget=false&chrome=false\"\n",
" frameborder=\"0\"\n",
" allowfullscreen\n",
" ></iframe>\n",
" "
],
"text/plain": [
"<IPython.lib.display.IFrame at 0x3782db0>"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print_sheet(linkss, '724456026', 'a1:i11', '100%', 250)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### $\\chi^2$ test for independence\n",
"\n",
"To formalize the discussion from the last paragraph, a $\\chi^2$ test for independence can be performed. The expected values now depend on the marginal totals (see the spreadsheet for the calculations). Under the null hypothesis the halftime and fulltime results are independent, resulting in the expected values shown below the contingency table. The test gives $\\chi^2 = 172.48$ with $df = 4$, resulting in $p < .0001$. The null hypothesis is thus rejected, and we conclude that there is indeed a dependence between halftime and fulltime results. The large value of $\\chi^2$ also suggests that the effect size is large; calculating Cramer's V, $V = \\sqrt{\\frac{\\chi^2}{n \\cdot \\min(r-1, c-1)}} = 0.48$ with $\\min(r-1, c-1) = 2$ for the 3 by 3 table, confirms that this is indeed the case.\n",
"\n",
"When a team has the upper hand at halftime by being ahead in goals, the outcome of the full match usually does not change. In fact, of those 136+85=221 games, 112+63=175 kept the same outcome, 17+15=32 still ended in a draw, and only 7+7=14 times did the team that was behind pull off a victory by the end of the match.\n",
"\n",
"Even though for a large portion of the matches it seems that the match is already decided at halftime, especially if one team is already winning, this of course does not mean that playing the second half is pointless (unless maybe all you care about is who wins or loses). One factor is that the number of goals scored in the second half is higher than in the first. A coach could base some decisions about a match on these statistics (like deciding whether to substitute certain players), although the actual situation during the season would need to be taken into account as well.\n",
"\n",
"[This website](http://www.soccerwidow.com/football-gambling/betting-knowledge/betting-advice/betting-guidance/half-time-results-more-predictable/) provides an alternative discussion of the relation between halftime and fulltime results. Part of it is true: a large part of the matches is undecided at halftime, 159 in this season. However, I do not agree that there is a deadlock in the first half. As stated before, in 175 out of 221 matches where one team was ahead at halftime, that team kept its advantage. Looking at the matches that were a draw at halftime, it is true that a large proportion (101 out of 159) ended up being a win for either team. Both effects thus seem to be of the same order of magnitude and concern distinct categories, so a complete analysis would need to account for both.\n",
"\n",
"Above I try to give a sense of effect size by referring directly to the numbers. To gain a more formal understanding of effect sizes, one could split up the data per team as in the last section and perform various regression analyses."
]
},
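{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a worked illustration of the $\\chi^2$ test for independence above, here is a sketch using the marginal totals quoted in the text (136 halftime home leads and 181 fulltime home wins out of $n = 380$ matches). It shows the expected count for the H-H cell under independence, and Cramer's V with the factor $\\min(r-1, c-1) = 2$ for the 3 by 3 table:\n",
"\n",
"$$E_{HH} = \\frac{136 \\cdot 181}{380} \\approx 64.8, \\qquad V = \\sqrt{\\frac{172.48}{380 \\cdot \\min(3-1,\\, 3-1)}} = \\sqrt{\\frac{172.48}{760}} \\approx 0.48$$"
]
},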
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Are soccer matches independent events?\n",
"One objection to the above investigation is that you cannot assume that individual matches are independent events. Several reasons for thinking that matches are dependent seem plausible: teams play matches with relatively little time in between, players get injured, and some teams seem to go through periods of momentum or slumps. But actual evidence for such common sense reasoning seems to be elusive. The last point, momentum, is discussed for baseball and basketball in this [article](http://www.thefreelibrary.com/Winning+Streaks+in+Sports+and+the+Misperception+of+Momentum.-a062990408).\n",
"\n",
"Of course soccer is a different sport from baseball and basketball, but like home field advantage, statistical properties can apply more generally. I did not do the statistical test described in the article above (they performed a $\\chi^2$ test for goodness of fit under the assumption that the result of a match is independent of the last match played), but decided to look further into the statistical description of goals scored in a match."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <iframe\n",
" width=\"100%\"\n",
" height=\"375\"\n",
" src=\"https://docs.google.com/spreadsheets/d/1ARcUWUsxu5GyIQr65qafsJ5zWfNhLXCIW5zV0F8tTME/pubchart?oid=1476950334&format=interactive\"\n",
" frameborder=\"0\"\n",
" allowfullscreen\n",
" ></iframe>\n",
" "
],
"text/plain": [
"<IPython.lib.display.IFrame at 0x3782e10>"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print_chart(linkss, '1476950334', '100%', 375)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Above, a histogram of total goals per match is shown (adding up home and away team goals for fulltime scores). Its shape closely resembles a Poisson distribution. Compare it to this [distribution](http://www.wolframalpha.com/input/?i=poisson+distribution+2.7) on Wolfram Alpha with the same mean (see the table all the way below for descriptive statistics; the standard deviation of the distribution is also approximately the square root of the mean, which is another property of the Poisson distribution). A Poisson distribution arises when events occur independently at a constant average rate (with exponentially distributed waiting times between them), so the data are consistent with goals being independent events. More information about the Poisson distribution can be found on the [wiki page](https://en.wikipedia.org/wiki/Poisson_distribution#Law_of_rare_events).\n",
"\n",
"As goals determine the outcome of a match, and goals seem to be independent of each other within one soccer match, it seems reasonable that different matches will also be independent of each other. Of course this is not a rigorous derivation, which is outside the scope of this project. If anyone has any comments or questions about this, please let me know. One reason why the Poisson description works is, as mentioned before, that soccer matches tend to have few goals, so that goals satisfy the conditions for a rare event.\n",
"\n",
"From when I played soccer years ago, I remember that psychologically you perceive yourself to be in a positive or negative momentum. Statistically there seems to be no evidence for this in the results. For an individual player or coach, knowing this could guard against the attitude of 'letting one's guard down' or 'taking it easy' during apparent positive flows, while during slumps it could help in keeping up a positive attitude, as it seems that after any goal against one has a completely new chance to come out on top. Maybe there is some truth in the fact that fans get angry when they see their team clearly giving up in adversity during a match (hello Dutch national team!)."
]
},
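{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, here is a short sketch of the Poisson model that the histogram above is compared against. It assumes the rate $\\lambda$ is the mean number of goals per match computed from the totals quoted earlier (1035 goals in 380 matches). The share of goalless matches is what the model predicts, not an observed count:\n",
"\n",
"$$P(k) = \\frac{\\lambda^k e^{-\\lambda}}{k!}, \\qquad \\lambda = \\frac{1035}{380} \\approx 2.72, \\qquad \\sigma = \\sqrt{\\lambda} \\approx 1.65$$\n",
"\n",
"$$P(0) = e^{-2.72} \\approx 0.066 \\quad \\Rightarrow \\quad 380 \\cdot P(0) \\approx 25 \\text{ matches}$$"
]
},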
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <iframe\n",
" width=\"100%\"\n",
" height=\"200\"\n",
" src=\"https://docs.google.com/spreadsheets/d/1ARcUWUsxu5GyIQr65qafsJ5zWfNhLXCIW5zV0F8tTME/pubhtml?&single=true&gid=1559347413&range=i2:j8&widget=false&chrome=false\"\n",
" frameborder=\"0\"\n",
" allowfullscreen\n",
" ></iframe>\n",
" "
],
"text/plain": [
"<IPython.lib.display.IFrame at 0x3782fd0>"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print_sheet(linkss, '1559347413', 'i2:j8','100%', 200)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Further investigation"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"As mentioned in the section discussing halftime and fulltime results, the relationship could be investigated further by using regression methods on data per team. This will have the same limitation of only 20 data points as my investigation of win rates, but can be improved upon by adding data from more seasons. A good start would be to perform some two-sample t-tests to compare the seasons; one could for example also test which qualities distinguish Serie A from Serie B.\n",
"\n",
"Another approach could be to compare the teams using ANOVA to figure out which qualities make Juventus an outlier and distinguish it from the rest of the teams.\n",
"\n",
"Or one could analyze betting data in relation to the home field advantage, or extend the home field advantage analysis to the number of goals scored (the histogram in the last section is a start for that).\n",
"\n",
"As a last remark, statistics can be used to find interesting occurrences in a season. For example, the boxplot clearly shows the two best performers of the season, the halftime vs fulltime contingency table shows that there are 14 matches where a team came back from being behind to win, the scatter plot of win rates shows the perfect home win rate for Juventus, and the histogram of goals identifies the matches with the most goals. All of these could be identified by a script to create an automated summary of the season."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}