A Premier League data extractor that creates a dataset of all match results and team stats
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Web Scraping\n",
"\n",
"The following code scrapes data from the premier league website. There are a total of 42 stats (some of which may need to be discarded) that it collects for each team from season 2006-2007 until 2017-2018, making it an approximate total of 504 web requests performed. These stats will serve as features of a dataset that'll be fed into an ANN and RNN."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import json\n",
"import requests\n",
"import progressbar\n",
"import pandas as pd\n",
"import numpy as np\n",
"from bs4 import BeautifulSoup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Page"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"webpage = requests.get('https://www.premierleague.com/stats/top/clubs/wins?se=79')\n",
"soup = BeautifulSoup(webpage.text, 'html.parser')\n",
"\n",
"if not os.path.exists('files'):\n",
" os.makedirs('files/stats')\n",
" os.makedirs('files/results')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Links"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**select** is used instead of find_all since it allows the use of javascript selectors <br>\n",
"\n",
"------\n",
"\n",
"**General** <br>\n",
"*index 0-5* <br>\n",
"wins, losses, goals, yellow cards, red cards, substitutions on\n",
"\n",
"**Attack** <br>\n",
"*index 6-15* <br>\n",
"shots, shots on target, hit woodwork, goals from header, goals from penalty, goals from free kick, goals from inside box, goals from outside box, goals from counter attack, offsides\n",
"\n",
"**Defence** <br>\n",
"*index 16-29* <br>\n",
"clean sheets, goals conceded, saves, blocks, interceptions, tackles, last man tackles, clearances, headed clearances, caught opponent offside, own goals, penalties conceded, goals conceded from penalty, fouls\n",
"\n",
"**Team Play** <br>\n",
"*index 30-35* <br>\n",
"passes, through balls, long passes, backwards passes, crosses, corners taken\n",
"\n",
"**Others** <br>\n",
"*index 36-42 <br>\n",
"non-duplicates attributes from top i.e. don't appear in more* <br>\n",
"touches, big chances missed, clearances off line, dispossessed, penalties saved, high claims, punches"
]
},
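{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal illustration of the point above (a sketch, assuming the `soup` object built in the **Page** section), the CSS-selector query and an equivalent find/find_all chain should return the same anchors:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustration only: both queries below should return the same anchors,\n",
"# assuming the page has a single nav with class 'moreStatsMenu'\n",
"links_select = soup.select('nav.moreStatsMenu a')\n",
"\n",
"menu = soup.find('nav', class_='moreStatsMenu')\n",
"links_find = menu.find_all('a') if menu else []\n",
"\n",
"print(len(links_select), len(links_find))  # counts should match"
]
},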
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"def attributes(links):\n",
" return [link[link.rfind('/')+1:] for link in links]\n",
"\n",
"def uniques(links):\n",
" l = []\n",
" for link in links:\n",
" if link not in l:\n",
" l.append(link)\n",
" return l\n",
"\n",
"top = [link['href'] for link in soup.select('a.topStatsLink')]\n",
"more = [link['href'] for link in soup.select('nav.moreStatsMenu a')]\n",
"links = uniques(attributes(more) + attributes(top))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dates\n",
"\n",
"Data is being collected from the 2006/2007 season since detailed and constant stats were collected from then onwards.\n",
"\n",
"Dates (i.e. seasons and their corresponding ids) cannot be scraped since it doesn't appear in the html so I'll manually extract it from the html that Chrome Dev Tools displays. Note that this html often varies to the plain output html from cURL or when downloading the page since the browser is capable of executing some js and changing the document."
]
},
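{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, since the stats and fixtures below are fetched from footballapi.pulselive.com anyway, the season ids could probably also be listed from that API rather than read off Chrome Dev Tools. The next cell is an untested sketch: the `competitions/1/compseasons` endpoint and the response shape are assumptions modelled on the endpoints used later in this notebook, and the hard-coded dictionary in the cell after it remains what the notebook actually uses."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hedged sketch: the endpoint path and response shape are assumptions,\n",
"# not taken from the original notebook - verify before relying on this\n",
"api = 'https://footballapi.pulselive.com/football/competitions/1/compseasons'\n",
"headers = {'Origin': 'https://www.premierleague.com'}\n",
"response = requests.get(api, headers=headers)\n",
"if response.ok:\n",
"    # assumed shape: {'content': [{'label': ..., 'id': ...}, ...]}\n",
"    for season in response.json().get('content', []):\n",
"        print(season.get('label'), season.get('id'))\n",
"else:\n",
"    print('endpoint guess failed:', response.status_code)"
]
},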
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"dates = {'2006-2007':15, '2007-2008':16, '2008-2009':17, '2009-2010':18, \n",
" '2010-2011':19, '2011-2012':20, '2012-2013':21, '2013-2014':22, \n",
" '2014-2015':27, '2015-2016':42, '2016-2017':54, '2017-2018':79}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stats\n",
"\n",
"The data couldn't be scraped from the webpage since the site uses ajax to update the table that includes the stats i.e. when you visit the webpage for a specific attribute (one of the links), the html contains the data for 'all seasons' and when you choose a particular season then the table is updated using ajax meaning that the source doesn't reflect the new values and still (and only will) show the data for 'all seasons'.\n",
"\n",
"To hack around this, visit the Network tab of Chrome Dev Tools and filter response resources by XHR (i.e. XML and JSON documents from XMLHttpRequests) - which is responses from AJAX requests. We can see that the last document contains the data from the ajax request and is what we need, however we can't simply send a get request for the link to that JSON document because it's an API that needs to authorise the request source. This will result in a 403 forbidden response. \n",
"\n",
"To get around that last hurdle, we need to include the headers (particularly general and request) sent by the premier league website to the API to block our identity and clone the premier league's. We get the required headers from the same place in the Chrome Dev Tools, clicking on the JSON document/response needed, and then clicking on the Headers tab of the panel to the right.\n",
"\n",
"https://www.agenty.com/docs/how-to/18/how-to-crawl-an-infinite-scrolling-ajax-website <br>\n",
"https://www.codementor.io/codementorteam/how-to-scrape-an-ajax-website-using-python-qw8fuitvi <br>\n",
"https://www.quora.com/What-are-the-best-ways-to-scrape-the-AJAX-driven-websites <br>\n",
"https://stackoverflow.com/questions/44080707/web-scraping-a-strange-html-setup-with-python-beautifulsoup-urllib\n",
"\n",
"-----\n",
"\n",
"One thing that was noticed is that if there's a NaN value for all teams for a certain attribute then that stat wasn't collected, however, if it's just for some/few teams then the value for that team is meant to be 0 e.g. red cards for Burnley in 2017-2018 was 0 as per their stats page (https://www.premierleague.com/clubs/43/Burnley/stats?se=79) but the page where the data is sourced from (https://www.premierleague.com/stats/top/clubs/total_red_card?se=79) doesn't display them in the table since their value is 0."
]
},
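{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch of the authorisation point above, using the `wins` stat and the 2017-2018 season id (79) as an example: the same request should be rejected without the Origin header and accepted with it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"api = 'https://footballapi.pulselive.com/football/stats/ranked/teams/wins'\n",
"params = {'page': '0', 'pageSize': '20', 'compSeasons': 79, 'comps': '1', 'altIds': 'true'}\n",
"\n",
"# without the cloned Origin header the API refuses the request\n",
"print(requests.get(api, params=params).status_code)  # expected: 403\n",
"\n",
"# with it, the request is authorised\n",
"headers = {'Origin': 'https://www.premierleague.com'}\n",
"print(requests.get(api, params=params, headers=headers).status_code)  # expected: 200"
]
},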
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2006-2007\t[--------------------------------------------------------------] 100%\n",
"2007-2008\t[--------------------------------------------------------------] 100%\n",
"2008-2009\t[--------------------------------------------------------------] 100%\n",
"2009-2010\t[--------------------------------------------------------------] 100%\n",
"2010-2011\t[--------------------------------------------------------------] 100%\n",
"2011-2012\t[--------------------------------------------------------------] 100%\n",
"2012-2013\t[--------------------------------------------------------------] 100%\n",
"2013-2014\t[--------------------------------------------------------------] 100%\n",
"2014-2015\t[--------------------------------------------------------------] 100%\n",
"2015-2016\t[--------------------------------------------------------------] 100%\n",
"2016-2017\t[--------------------------------------------------------------] 100%\n",
"2017-2018\t[--------------------------------------------------------------] 100%\n"
]
}
],
"source": [
"for date in dates.keys():\n",
"\n",
" df = pd.DataFrame()\n",
" bar = progressbar.ProgressBar(maxval=len(links), widgets=[date + '\\t', progressbar.Bar('-', '[', ']'), ' ', progressbar.Percentage()])\n",
" bar.start()\n",
" for i, attribute in zip(range(len(links)), links):\n",
"\n",
" # setup\n",
" api = 'https://footballapi.pulselive.com/football/stats/ranked/teams/' + attribute\n",
" headers = {'Origin': 'https://www.premierleague.com'}\n",
" params = {'page': '0', 'pageSize': '20', 'compSeasons': dates[date], 'comps': '1', 'altIds': 'true'}\n",
"\n",
" # request\n",
" response = requests.get(api, params=params, headers=headers)\n",
" data = json.loads(response.text)\n",
"\n",
" # parse\n",
" teams = []; values = [];\n",
" for team in data['stats']['content']:\n",
" teams.append(team['owner']['name'])\n",
" values.append(team['value'])\n",
" series = pd.Series(values, teams, float, attribute)\n",
" if df.index.empty:\n",
" df = pd.DataFrame(series)\n",
" else:\n",
" df = df.join(series)\n",
"\n",
" # progress\n",
" bar.update(i+1)\n",
"\n",
" bar.finish()\n",
" df.dropna(axis=1, how='all', inplace=True)\n",
" df.fillna(0, inplace=True)\n",
" df.to_csv('files/stats/' + date + '.csv')"
]
},
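{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the parsing above assumes a JSON response of roughly the following shape (abridged and illustrative, with placeholder values; only the fields actually read, `stats.content[].owner.name` and `stats.content[].value`, are shown):\n",
"\n",
"```python\n",
"{\n",
"    'stats': {\n",
"        'content': [\n",
"            {'owner': {'name': '<team name>'}, 'value': 0.0},\n",
"            {'owner': {'name': '<team name>'}, 'value': 0.0}\n",
"        ]\n",
"    }\n",
"}\n",
"```"
]
},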
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Validation"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2006-2007\t35\n",
"2007-2008\t38\n",
"2008-2009\t37\n",
"2009-2010\t38\n",
"2010-2011\t40\n",
"2011-2012\t40\n",
"2012-2013\t40\n",
"2013-2014\t40\n",
"2014-2015\t40\n",
"2015-2016\t40\n",
"2016-2017\t40\n",
"2017-2018\t40\n"
]
}
],
"source": [
"datasets = []\n",
"for date in dates.keys():\n",
" dataset = pd.read_csv('files/' + date + '.csv', index_col=0)\n",
" datasets.append(dataset)\n",
" print(date + '\\t' + str(len(dataset.columns)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above shows the number of statistics collected for each team for each of the listed seasons. We can see that there only seems to be consistency from season 2010-2011 onwards because there are the same number of stats collected. To validate that, we'll see if the stats are the same ones collected."
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Valid Set\n"
]
}
],
"source": [
"mismatches = []\n",
"\n",
"for i in range(4,12):\n",
" for j in range(i+1,12):\n",
" if datasets[i].columns.tolist() != datasets[j].columns.tolist():\n",
" mismatches.append((i,j))\n",
" \n",
"if len(mismatches) == 0:\n",
" print('Valid Set')\n",
"else:\n",
" print('Invalid Set')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Results"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2017-2018\t[--------------------------------------------------------------] 100%\n"
]
}
],
"source": [
"def get_team_ids(date):\n",
" # setup\n",
" api = 'https://footballapi.pulselive.com/football/compseasons/' + str(dates[date]) + '/teams'\n",
" headers = {'Origin': 'https://www.premierleague.com'}\n",
" \n",
" # request\n",
" response = requests.get(api, headers=headers)\n",
" teams = json.loads(response.text)\n",
" \n",
" # parse\n",
" team_ids = []\n",
" for team in teams:\n",
" team_ids.append(int(team['id']))\n",
" team_ids = ','.join(map(str, team_ids))\n",
" \n",
" return team_ids\n",
" \n",
"def get_results(date, team_ids):\n",
" # setup\n",
" api = 'https://footballapi.pulselive.com/football/fixtures'\n",
" headers = {'Origin': 'https://www.premierleague.com'}\n",
" params = {'comps':'1', 'compSeasons':dates[date], 'teams':team_ids, 'page':'0', 'pageSize':'380', 'sort':'asc', 'statuses':'C', 'altIds':'true'}\n",
"\n",
" # request\n",
" response = requests.get(api, params=params, headers=headers)\n",
" results = json.loads(response.text)\n",
" \n",
" # parse\n",
" df = pd.DataFrame(columns=['home_team', 'away_team', 'home_goals', 'away_goals', 'result'])\n",
" for result in results['content']:\n",
" row = []\n",
" row.append(result['teams'][0]['team']['name'])\n",
" row.append(result['teams'][1]['team']['name'])\n",
" row.append(result['teams'][0]['score'])\n",
" row.append(result['teams'][1]['score'])\n",
" row.append(result['outcome'])\n",
" row = pd.Series(row, index=df.columns)\n",
" df = df.append(row, ignore_index=True)\n",
" \n",
" return df\n",
"\n",
"bar = progressbar.ProgressBar(maxval=len(dates), widgets=['', '\\t', progressbar.Bar('-', '[', ']'), ' ', progressbar.Percentage()])\n",
"bar.start()\n",
"for i, date in zip(range(len(dates)), dates.keys()):\n",
" bar.widgets[0] = date\n",
" team_ids = get_team_ids(date)\n",
" results = get_results(date, team_ids)\n",
" bar.update(i+1)\n",
" results.to_csv('files/results/' + date + '.csv')\n",
"bar.finish()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Concatenation\n",
"\n",
"Concatenating the discrete sets for each season into one big set that contains all seasons - one for stats and another for results"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"files = ['2006-2007.csv', '2007-2008.csv', '2008-2009.csv', '2009-2010.csv', '2010-2011.csv', '2011-2012.csv', '2012-2013.csv', '2013-2014.csv', '2014-2015.csv', '2015-2016.csv', '2016-2017.csv', '2017-2018.csv']\n",
"\n",
"stats_df = pd.DataFrame()\n",
"results_df = pd.DataFrame()\n",
"for name in files:\n",
" \n",
" # Stats\n",
" f = 'files/stats/' + name\n",
" stats_series = pd.Series([name[:-4]]*20, name='season')\n",
" stats_season = pd.concat([pd.read_csv(f, index_col=False), stats_series], axis=1)\n",
" columns = stats_season.columns.tolist(); columns[0] = 'team'; stats_season.columns = columns\n",
" if stats_df.empty:\n",
" stats_df = stats_season\n",
" else:\n",
" stats_df = pd.concat([stats_df, stats_season])\n",
" \n",
" # Results\n",
" f = 'files/results/' + name\n",
" results_series = pd.Series([name[:-4]]*380, name='season')\n",
" results_season = pd.concat([pd.read_csv(f), results_series], axis=1)\n",
" if results_df.empty:\n",
" results_df = results_season\n",
" else:\n",
" results_df = pd.concat([results_df, results_season])\n",
" \n",
"stats_df = stats_df[stats_season.columns.tolist()]\n",
"stats_df.to_csv('files/stats/stats.csv', index=False)\n",
"\n",
"results_df.drop(results_df.columns[0], axis=1, inplace=True)\n",
"results_df.to_csv('files/results/results.csv', index=False)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}