@CNuge
Last active February 20, 2020 05:16
scrape yesterday's mlb scores
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It can be helpful to have the html from www.baseball-reference.com open in a second tab.\n",
"First we import the modules we need:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from urllib.request import urlopen #this module lets us grab web pages. \n",
"#note this would be 'from urllib import urlopen' in python 2.7\n",
"from bs4 import BeautifulSoup #this module lets us easily parse the html. \n",
"#if you don't have BeautifulSoup downloaded, type 'pip install beautifulsoup4' into the command line\n",
"#the 'lxml' parser used below is a separate package: 'pip install lxml'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we define a variable for the webpage we need to scrape, and use urlopen to get the page source data."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<http.client.HTTPResponse at 0x10dc4f780>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baseball_reference = 'http://www.baseball-reference.com/'\n",
"\n",
"mlb_dat = urlopen(baseball_reference)\n",
"#let's look at what is stored here:\n",
"mlb_dat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I find looking at what is in the variables helps to visualize what is happening from one step to the next. You can use the following to look in mlb_dat to see what urlopen() returns from the page."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"scrolled": false
},
"outputs": [],
"source": [
"for x in mlb_dat:\n",
"    print(x) #note: if you run this you must rerun mlb_dat = urlopen(baseball_reference),\n",
"    #as the data in a urlopen() response can only be read once!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"mlb_dat has all the html code from the page source. That is a lot more information than we need! In order to separate the wheat from the chaff we first build a BeautifulSoup object that holds the data, and then we can use the html tags around the data we are after to separate things out. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"mlb_scores = BeautifulSoup(mlb_dat, 'lxml')"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"We know that we first need to find the 'scores' section of the page.\n",
"Recall from the html discussed earlier:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"'<div class=\"\" id=\"scores\">'\n",
"'<h2><a href=\"/boxes/?date=2017-07-25\">MLB Scores (Tuesday, July 25)</a></h2>'\n",
"'<div class=\"game_summaries\">'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NOTE: the url source changes from day to day, so the scores section and what is returned from the notebook will depend on which day you're running it!\n",
"OTHER NOTE: if you're running this yourself and the mlb season is over, be prepared for some uninteresting results!\n",
"OTHER OTHER NOTE: The single quotes around the html code are my own, this is to present them in a notebook readable format."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"game_section = mlb_scores.find(id=\"scores\")"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"The next section we were after was:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"'<div class=\"game_summary nohover\">'\n",
"' <table class=\"teams\">'\n",
"'\t\t<tbody>' \n",
"'\t\t<tr class=\"loser\">'\n",
"'\t\t\t<td><a href=\"/teams/OAK/2017.shtml\">Oakland Athletics</a></td>'\n",
"'\t\t\t<td class=\"right\">1</td>'\n",
"'\t\t\t<td class=\"right gamelink\">'\n",
"'\t\t\t\t<a href=\"/boxes/TOR/TOR201707250.shtml\">Final</a>'\n",
"'\t\t\t</td>'\n",
"'\t\t</tr>'\n",
"'\t\t<tr class=\"winner\">'\n",
"'\t\t\t<td><a href=\"/teams/TOR/2017.shtml\">Toronto Blue Jays</a></td>'\n",
"'\t\t\t<td class=\"right\">4</td>'\n",
"'\t\t\t<td class=\"right\">'\n",
"'\t\t\t</td>'\n",
"'\t\t</tr>'\n",
"'\t\t</tbody>'\n",
"'\t</table>'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that the table with class 'teams' wraps each game's data, so we want to find all of these tables so that we can search through the games for our team of interest."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<table class=\"teams\">\n",
" <tbody>\n",
" <tr class=\"winner\">\n",
" <td><a href=\"/teams/KCR/2017.shtml\">Kansas City Royals</a></td>\n",
" <td class=\"right\">4</td>\n",
" <td class=\"right gamelink\">\n",
" <a href=\"/boxes/BOS/BOS201707280.shtml\">Final</a>\n",
" </td>\n",
" </tr>\n",
" <tr class=\"loser\">\n",
" <td><a href=\"/teams/BOS/2017.shtml\">Boston Red Sox</a></td>\n",
" <td class=\"right\">2</td>\n",
" <td class=\"right\">\n",
" </td>\n",
" </tr>\n",
" </tbody>\n",
" </table>, <table class=\"teams\">\n",
" <tbody>\n",
" <tr class=\"winner\">\n",
" <td><a href=\"/teams/CLE/2017.shtml\">Cleveland Indians</a></td>\n",
" <td class=\"right\">9</td>\n",
" <td class=\"right gamelink\">\n",
" <a href=\"/boxes/CHA/CHA201707280.shtml\">Final</a>\n",
" </td>\n",
" </tr>\n",
" <tr class=\"loser\">\n",
" <td><a href=\"/teams/CHW/2017.shtml\">Chicago White Sox</a></td>\n",
" <td class=\"right\">3</td>\n",
" <td class=\"right\">\n",
" </td>\n",
" </tr>\n",
" </tbody>\n",
" </table>, <table class=\"teams\">\n",
" <tbody>\n",
" <tr class=\"winner\">\n",
" <td><a href=\"/teams/HOU/2017.shtml\">Houston Astros</a></td>\n",
" <td class=\"right\">6</td>\n",
" <td class=\"right gamelink\">\n",
" <a href=\"/boxes/DET/DET201707280.shtml\">Final</a>\n",
" </td>\n",
" </tr>\n",
" <tr class=\"loser\">\n",
" <td><a href=\"/teams/DET/2017.shtml\">Detroit Tigers</a></td>\n",
" <td class=\"right\">5</td>\n",
" <td class=\"right\">\n",
" </td>\n",
" </tr>\n",
" </tbody>\n",
" </table>, <table class=\"teams\">\n",
" <tbody>\n",
" <tr class=\"loser\">\n",
" <td><a href=\"/teams/SFG/2017.shtml\">San Francisco Giants</a></td>\n",
" <td class=\"right\">4</td>\n",
" <td class=\"right gamelink\">\n",
" <a href=\"/boxes/LAN/LAN201707280.shtml\">Final</a>\n",
" </td>\n",
" </tr>\n",
" <tr class=\"winner\">\n",
" <td><a href=\"/teams/LAD/2017.shtml\">Los Angeles Dodgers</a></td>\n",
" <td class=\"right\">6</td>\n",
" <td class=\"right\">\n",
" </td>\n",
" </tr>\n",
" </tbody>\n",
" </table>, <table class=\"teams\">\n",
" <tbody>\n",
" <tr class=\"loser\">\n",
" <td><a href=\"/teams/CIN/2017.shtml\">Cincinnati Reds</a></td>\n",
" <td class=\"right\">4</td>\n",
" <td class=\"right gamelink\">\n",
" <a href=\"/boxes/MIA/MIA201707280.shtml\">Final</a>\n",
" </td>\n",
" </tr>\n",
" <tr class=\"winner\">\n",
" <td><a href=\"/teams/MIA/2017.shtml\">Miami Marlins</a></td>\n",
" <td class=\"right\">7</td>\n",
" <td class=\"right\">\n",
" </td>\n",
" </tr>\n",
" </tbody>\n",
" </table>, <table class=\"teams\">\n",
" <tbody>\n",
" <tr class=\"loser\">\n",
" <td><a href=\"/teams/CHC/2017.shtml\">Chicago Cubs</a></td>\n",
" <td class=\"right\">1</td>\n",
" <td class=\"right gamelink\">\n",
" <a href=\"/boxes/MIL/MIL201707280.shtml\">Final</a>\n",
" </td>\n",
" </tr>\n",
" <tr class=\"winner\">\n",
" <td><a href=\"/teams/MIL/2017.shtml\">Milwaukee Brewers</a></td>\n",
" <td class=\"right\">2</td>\n",
" <td class=\"right\">\n",
" </td>\n",
" </tr>\n",
" </tbody>\n",
" </table>, <table class=\"teams\">\n",
" <tbody>\n",
" <tr class=\"loser\">\n",
" <td><a href=\"/teams/TBR/2017.shtml\">Tampa Bay Rays</a></td>\n",
" <td class=\"right\">1</td>\n",
" <td class=\"right gamelink\">\n",
" <a href=\"/boxes/NYA/NYA201707280.shtml\">Final</a>\n",
" </td>\n",
" </tr>\n",
" <tr class=\"winner\">\n",
" <td><a href=\"/teams/NYY/2017.shtml\">New York Yankees</a></td>\n",
" <td class=\"right\">6</td>\n",
" <td class=\"right\">\n",
" </td>\n",
" </tr>\n",
" </tbody>\n",
" </table>, <table class=\"teams\">\n",
" <tbody>\n",
" <tr class=\"winner\">\n",
" <td><a href=\"/teams/MIN/2017.shtml\">Minnesota Twins</a></td>\n",
" <td class=\"right\">6</td>\n",
" <td class=\"right gamelink\">\n",
" <a href=\"/boxes/OAK/OAK201707280.shtml\">Final</a>\n",
" </td>\n",
" </tr>\n",
" <tr class=\"loser\">\n",
" <td><a href=\"/teams/OAK/2017.shtml\">Oakland Athletics</a></td>\n",
" <td class=\"right\">3</td>\n",
" <td class=\"right\">\n",
" </td>\n",
" </tr>\n",
" </tbody>\n",
" </table>, <table class=\"teams\">\n",
" <tbody>\n",
" <tr class=\"loser\">\n",
" <td><a href=\"/teams/ATL/2017.shtml\">Atlanta Braves</a></td>\n",
" <td class=\"right\">3</td>\n",
" <td class=\"right gamelink\">\n",
" <a href=\"/boxes/PHI/PHI201707280.shtml\">Final</a>\n",
" </td>\n",
" </tr>\n",
" <tr class=\"winner\">\n",
" <td><a href=\"/teams/PHI/2017.shtml\">Philadelphia Phillies</a></td>\n",
" <td class=\"right\">10</td>\n",
" <td class=\"right\">\n",
" </td>\n",
" </tr>\n",
" </tbody>\n",
" </table>, <table class=\"teams\">\n",
" <tbody>\n",
" <tr class=\"loser\">\n",
" <td><a href=\"/teams/PIT/2017.shtml\">Pittsburgh Pirates</a></td>\n",
" <td class=\"right\">2</td>\n",
" <td class=\"right gamelink\">\n",
" <a href=\"/boxes/SDN/SDN201707280.shtml\">Final</a>\n",
" </td>\n",
" </tr>\n",
" <tr class=\"winner\">\n",
" <td><a href=\"/teams/SDP/2017.shtml\">San Diego Padres</a></td>\n",
" <td class=\"right\">3</td>\n",
" <td class=\"right\">\n",
" </td>\n",
" </tr>\n",
" </tbody>\n",
" </table>, <table class=\"teams\">\n",
" <tbody>\n",
" <tr class=\"winner\">\n",
" <td><a href=\"/teams/NYM/2017.shtml\">New York Mets</a></td>\n",
" <td class=\"right\">7</td>\n",
" <td class=\"right gamelink\">\n",
" <a href=\"/boxes/SEA/SEA201707280.shtml\">Final</a>\n",
" </td>\n",
" </tr>\n",
" <tr class=\"loser\">\n",
" <td><a href=\"/teams/SEA/2017.shtml\">Seattle Mariners</a></td>\n",
" <td class=\"right\">5</td>\n",
" <td class=\"right\">\n",
" </td>\n",
" </tr>\n",
" </tbody>\n",
" </table>, <table class=\"teams\">\n",
" <tbody>\n",
" <tr class=\"loser\">\n",
" <td><a href=\"/teams/ARI/2017.shtml\">Arizona Diamondbacks</a></td>\n",
" <td class=\"right\">0</td>\n",
" <td class=\"right gamelink\">\n",
" <a href=\"/boxes/SLN/SLN201707280.shtml\">Final</a>\n",
" </td>\n",
" </tr>\n",
" <tr class=\"winner\">\n",
" <td><a href=\"/teams/STL/2017.shtml\">St. Louis Cardinals</a></td>\n",
" <td class=\"right\">1</td>\n",
" <td class=\"right\">\n",
" </td>\n",
" </tr>\n",
" </tbody>\n",
" </table>, <table class=\"teams\">\n",
" <tbody>\n",
" <tr class=\"loser\">\n",
" <td><a href=\"/teams/BAL/2017.shtml\">Baltimore Orioles</a></td>\n",
" <td class=\"right\">2</td>\n",
" <td class=\"right gamelink\">\n",
" <a href=\"/boxes/TEX/TEX201707280.shtml\">Final</a>\n",
" </td>\n",
" </tr>\n",
" <tr class=\"winner\">\n",
" <td><a href=\"/teams/TEX/2017.shtml\">Texas Rangers</a></td>\n",
" <td class=\"right\">8</td>\n",
" <td class=\"right\">\n",
" </td>\n",
" </tr>\n",
" </tbody>\n",
" </table>, <table class=\"teams\">\n",
" <tbody>\n",
" <tr class=\"winner\">\n",
" <td><a href=\"/teams/LAA/2017.shtml\">Los Angeles Angels</a></td>\n",
" <td class=\"right\">7</td>\n",
" <td class=\"right gamelink\">\n",
" <a href=\"/boxes/TOR/TOR201707280.shtml\">Final</a>\n",
" </td>\n",
" </tr>\n",
" <tr class=\"loser\">\n",
" <td><a href=\"/teams/TOR/2017.shtml\">Toronto Blue Jays</a></td>\n",
" <td class=\"right\">2</td>\n",
" <td class=\"right\">\n",
" </td>\n",
" </tr>\n",
" </tbody>\n",
" </table>]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"yesterday_games = game_section.find_all('table', {'class': 'teams'}) #attrs should be a dict, not a set\n",
"yesterday_games"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we want to scan 'yesterday_games' and find the game that the team of interest was involved in, and acquire the opponent and the score."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Los Angeles Angels: 7\n",
"Toronto Blue Jays: 2\n"
]
}
],
"source": [
"query_team = 'Toronto Blue Jays'\n",
"\n",
"for game in yesterday_games: #scan the games\n",
"    winner = game.find('',{'class':'winner'}) #find the 'winner' tag\n",
"    w_team = winner.td.get_text() #pull just the name of the team\n",
"    loser = game.find('',{'class':'loser'}) #find the 'loser' tag\n",
"    l_team = loser.td.get_text() #pull just the name of the team\n",
"    if (w_team != query_team) and (l_team != query_team): # if our team isn't in the game, move on\n",
"        continue\n",
"    else: #otherwise we get the score, and report the output\n",
"        w_score = winner.find('',{'class':'right'}).get_text()\n",
"        l_score = loser.find('',{'class':'right'}).get_text()\n",
"        #so we can see the results:\n",
"        print('%s: %s' % (w_team, w_score))\n",
"        print('%s: %s' % (l_team, l_score))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"... and the Jays lost :|\n",
"\n",
"To make the above easily usable, and to hide the details in the background, we can turn the code into a function that takes a team name as the input and returns a dictionary with the two teams as the keys and their scores as the values.\n",
"\n",
"So the above score would be returned as:\n",
"{'Los Angeles Angels': '7', 'Toronto Blue Jays': '2'}\n",
"I find this easiest to work with when we are building the information into a sentence to be returned for my morning report.\n",
"\n",
"In function form the code looks like:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def get_mlb_team_score(query_team):\n",
"    \"\"\" for query_team put in the long-form name of the team of interest,\n",
"    e.g. 'Toronto Blue Jays' or 'New York Yankees' \"\"\"\n",
"    baseball_reference = 'http://www.baseball-reference.com/'\n",
"\n",
"    mlb_dat = urlopen(baseball_reference)\n",
"\n",
"    mlb_scores = BeautifulSoup(mlb_dat, 'lxml')\n",
"\n",
"    game_section = mlb_scores.find(id=\"scores\")\n",
"\n",
"    yesterday_games = game_section.find_all('table', {'class': 'teams'}) #attrs should be a dict, not a set\n",
"\n",
"    for game in yesterday_games:\n",
"        winner = game.find('',{'class':'winner'})\n",
"        w_team = winner.td.get_text()\n",
"        loser = game.find('',{'class':'loser'})\n",
"        l_team = loser.td.get_text()\n",
"        if (w_team != query_team) and (l_team != query_team):\n",
"            continue\n",
"        else:\n",
"            w_score = winner.find('',{'class':'right'}).get_text()\n",
"            l_score = loser.find('',{'class':'right'}).get_text()\n",
"\n",
"            return {w_team:w_score,l_team:l_score}\n",
"    return 'did not play yesterday' #accounts for off days when we can't find a score in yesterday's games\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So now to get the information about yesterday's Jays game, we just need to input the following:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Los Angeles Angels': '7', 'Toronto Blue Jays': '2'}"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"team = 'Toronto Blue Jays'\n",
"yesterday_game = get_mlb_team_score(team)\n",
"yesterday_game"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So the information is scraped, and now we just need some simple string building to produce the sentence for the morning report."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Yesterday, the Toronto Blue Jays lost to the Los Angeles Angels, 7-2.\n"
]
}
],
"source": [
"if yesterday_game == 'did not play yesterday':\n",
"    print('The %s %s.' % (team, yesterday_game))\n",
"else:\n",
"    other_team = [z for z in yesterday_game.keys() if z != team][0]\n",
"    team_score = yesterday_game[team]\n",
"    other_team_score = yesterday_game[other_team]\n",
"    if int(team_score) < int(other_team_score): #the dict values were strings, so we need to convert to integers!\n",
"        result = 'Yesterday, the %s lost to the %s, %s-%s.' % (team,other_team,other_team_score,team_score)\n",
"        print(result)\n",
"    else:\n",
"        result = 'Yesterday, the %s beat the %s, %s-%s.' % (team,other_team,team_score,other_team_score)\n",
"        print(result)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the above code something we can import into a different file, we wrap it in a function."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def get_team_result_text(team):\n",
"    yesterday_game = get_mlb_team_score(team)\n",
"    if yesterday_game == 'did not play yesterday':\n",
"        return 'The %s %s.' % (team, yesterday_game)\n",
"    else:\n",
"        other_team = [z for z in yesterday_game.keys() if z != team][0]\n",
"        team_score = yesterday_game[team]\n",
"        other_team_score = yesterday_game[other_team]\n",
"        if int(team_score) < int(other_team_score):\n",
"            result = 'Yesterday, the %s lost to the %s, %s-%s.' % (team,other_team,other_team_score,team_score)\n",
"            return result\n",
"        else:\n",
"            result = 'Yesterday, the %s beat the %s, %s-%s.' % (team,other_team,team_score,other_team_score)\n",
"            return result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This also means that in the future, all we need to type to get yesterday's result is:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Yesterday, the Toronto Blue Jays lost to the Los Angeles Angels, 7-2.'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_team_result_text('Toronto Blue Jays')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And there we have it! Two simple functions that will go to baseball-reference.com, grab the score of a baseball game from yesterday and return a simple one-sentence summary. These functions can now be used in my morning email report."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}