Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
Dakar Rally 2019 data grab
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dakar Rally Scraper\n",
"\n",
"This notebook provides a case study in retrieving, cleaning, organising and processing data obtained from a third party website, specifically, timing and results data from the 2019 Dakar Rally.\n",
"\n",
"Another way of thinking of it is as a series of marks created during an exploratory data analysis performance.\n",
"\n",
"Or palaver.\n",
"\n",
"Whatever.\n",
"\n",
"Shall we begin?"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"#Use requests cache so we can keep an archive of the results HTML\n",
"import requests\n",
"import requests_cache\n",
"requests_cache.install_cache('dakar_cache', backend='sqlite')\n",
"\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Timing data is provided in two forms:\n",
"\n",
"- time at waypoint / split;\n",
"- gap to leader at that waypoint;\n",
"\n",
"Ranking data for the stage and overall at end of stage is also available.\n",
"\n",
"Timing and ranking data is available for:\n",
"\n",
"- car\n",
"- moto (motorbike)\n",
"- quad\n",
"- sxs\n",
"- truck\n"
]
},
{
"cell_type": "code",
"execution_count": 467,
"metadata": {},
"outputs": [],
"source": [
"STAGE = 8"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stage Info\n",
"\n",
"Retrieve some basic information about a stage."
]
},
{
"cell_type": "code",
"execution_count": 468,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Special</th>\n",
" <th>Moto</th>\n",
" <th>Quad</th>\n",
" <th>Car</th>\n",
" <th>SxS</th>\n",
" <th>Truck</th>\n",
" <th>Semi</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Start</td>\n",
" <td>07:15</td>\n",
" <td>09:07</td>\n",
" <td>07:24</td>\n",
" <td>10:11</td>\n",
" <td>08:15</td>\n",
" <td>10:38</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Liaison</td>\n",
" <td>574km</td>\n",
" <td>574km</td>\n",
" <td>574km</td>\n",
" <td>574km</td>\n",
" <td>574km</td>\n",
" <td>574km</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Special</td>\n",
" <td>361km</td>\n",
" <td>361km</td>\n",
" <td>361km</td>\n",
" <td>361km</td>\n",
" <td>361km</td>\n",
" <td>361km</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Number of participants</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>At start</td>\n",
" <td>93</td>\n",
" <td>15</td>\n",
" <td>65</td>\n",
" <td>22</td>\n",
" <td>28</td>\n",
" <td>20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Left</td>\n",
" <td>92</td>\n",
" <td>15</td>\n",
" <td>65</td>\n",
" <td>21</td>\n",
" <td>15</td>\n",
" <td>18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Arrived</td>\n",
" <td>81</td>\n",
" <td>15</td>\n",
" <td>48</td>\n",
" <td>21</td>\n",
" <td>11</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Latest WP</td>\n",
" <td>ass</td>\n",
" <td>ass</td>\n",
" <td>ass</td>\n",
" <td>ass</td>\n",
" <td>ass</td>\n",
" <td>ass</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Leader at latest WP</td>\n",
" <td>001 WALKNER</td>\n",
" <td>240 CAVIGLIASSO</td>\n",
" <td>306 LOEB</td>\n",
" <td>360 LOPEZ CONTARDO</td>\n",
" <td>514 SOTNIKOV</td>\n",
" <td>310 TERRANOVA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>Nb at latest WP</td>\n",
" <td>81</td>\n",
" <td>15</td>\n",
" <td>48</td>\n",
" <td>21</td>\n",
" <td>11</td>\n",
" <td>10</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Special Moto Quad Car \\\n",
"1 Start 07:15 09:07 07:24 \n",
"2 Liaison 574km 574km 574km \n",
"3 Special 361km 361km 361km \n",
"4 Number of participants NaN NaN NaN \n",
"5 At start 93 15 65 \n",
"6 Left 92 15 65 \n",
"7 Arrived 81 15 48 \n",
"8 Latest WP ass ass ass \n",
"9 Leader at latest WP 001 WALKNER 240 CAVIGLIASSO 306 LOEB \n",
"10 Nb at latest WP 81 15 48 \n",
"\n",
" SxS Truck Semi \n",
"1 10:11 08:15 10:38 \n",
"2 574km 574km 574km \n",
"3 361km 361km 361km \n",
"4 NaN NaN NaN \n",
"5 22 28 20 \n",
"6 21 15 18 \n",
"7 21 11 10 \n",
"8 ass ass ass \n",
"9 360 LOPEZ CONTARDO 514 SOTNIKOV 310 TERRANOVA \n",
"10 21 11 10 "
]
},
"execution_count": 468,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def get_stage_stats(stage):\n",
" stage_stats_url='https://gaps.dakar.com/2019/dakar/index_info.php?l=ukie&s={stage}&vh=a'\n",
"\n",
" html = requests.get(stage_stats_url.format(stage=STAGE)).content\n",
" \n",
" stage_stats_df=pd.read_html(html)[0]\n",
"\n",
" return stage_stats_df.rename(columns=stage_stats_df.iloc[0]).drop(stage_stats_df.index[0])\n",
"\n",
"get_stage_stats(STAGE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Timing Data\n",
"\n",
"Typical of many rallies, the live timing pages return several sorts of data:\n",
"\n",
"- times and gaps for each stage;\n",
"- stage and overall results for each stage."
]
},
{
"cell_type": "code",
"execution_count": 469,
"metadata": {},
"outputs": [],
"source": [
"URL_PATTERN='https://gaps.dakar.com/2019/dakar/?s={stage}&c=aso&l=ukie&vi={tab}&sv={timerank}&vh={vtype}&sws=99'\n",
"#sws - selected waypoint?"
]
},
{
"cell_type": "code",
"execution_count": 470,
"metadata": {},
"outputs": [],
"source": [
"#Vehicle types\n",
"VTYPE_ ={ 'car':'a','moto':'m','quad':'q','sxs':'s','truck':'c'}\n",
"\n",
"#Screen / tab selection\n",
"TAB_ = {'timing':0,'news':1,'ranking':2}\n",
"\n",
"#Options for timing data\n",
"TIMING_ = {'gap':0,'time':1}\n",
"\n",
"#Options for ranking data\n",
"RANKING_ = {'stage':0, 'general':1}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Previewing the data\n",
"\n",
"Let's see what the data looks like...\n",
"\n",
"*Uncomment and run the following to preview / inspect the data that's avaliable.*"
]
},
{
"cell_type": "code",
"execution_count": 471,
"metadata": {},
"outputs": [],
"source": [
"#pd.read_html(URL_PATTERN.format(stage=STAGE,tab=TAB_['timing'],vtype=VTYPE_['car'],timerank='time'))"
]
},
{
"cell_type": "code",
"execution_count": 472,
"metadata": {},
"outputs": [],
"source": [
"#pd.read_html(URL_PATTERN.format(stage=STAGE,tab=TAB_['ranking'],vtype=VTYPE_['car'],timerank='stage'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By inspection, we note that:\n",
"\n",
"- several tables are returned by each page;\n",
"- there are common identifier columns (`['Pos','Bib','Crew','Brand']`);\n",
"- there are irrelevant columns (`['View details','Select']`);\n",
"- the raw timing data columns (timig data at each waypoint) include information about split rank and how it compares to the rank at the previous split / waypoint; this needs to be cleaned before we can convert timestrings to timedeltas."
]
},
{
"cell_type": "code",
"execution_count": 473,
"metadata": {},
"outputs": [],
"source": [
"TIMING = RANKING = 0\n",
"CREWTEAM = 1\n",
"BRANDS = 2\n",
"COUNTRIES = 3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrieving the Data Tables\n",
"\n",
"We can define helper functions to pull back tables associated with the timing or ranking pages."
]
},
{
"cell_type": "code",
"execution_count": 474,
"metadata": {},
"outputs": [],
"source": [
"#Retrieve a page\n",
"def _data(stage,vtype='car',tab='timing', timerank='time'):\n",
" ''' Retrieve timing or ranking HTML page and scrape HTML tables. '''\n",
" timerank = RANKING_[timerank] if tab=='ranking' else TIMING_[timerank]\n",
" \n",
" url = URL_PATTERN.format(stage=stage,tab=TAB_[tab],vtype=VTYPE_[vtype],timerank=timerank)\n",
" html = requests.get(url).content\n",
" return pd.read_html(html, na_values=['-'])\n",
"\n",
"def _fetch_timing_data(stage,vtype='car', timerank='time'):\n",
" ''' Return data tables from timing page. '''\n",
" _tmp = _data(stage,vtype=vtype, tab='timing', timerank=timerank)\n",
" _tmp[TIMING].drop(columns=['View details','Select'], inplace=True)\n",
" return _tmp\n",
"\n",
"def _fetch_ranking_data(stage,vtype='car', timerank='stage'):\n",
" ''' Return data tables from ranking page. '''\n",
" rank_cols = ['Pos','Bib','Crew','Brand','Time','Gap','Penalty']\n",
" _tmp = _data(stage,vtype=vtype, tab='ranking', timerank=timerank)\n",
" _tmp[RANKING].drop(columns=['View details','Select'], inplace=True)\n",
" return _tmp\n",
"\n"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"#cache grab - grab all the HTML pages into a SQLite cache without expiry\n",
"\n",
"#The news tab returns news items as a list rather than in a table\n",
"def _get_news(stage,vtype='car',):\n",
" _tmp = _data(stage,vtype=vtype, tab='news')\n",
"# return _tmp\n",
"\n",
"for stage in [1,2,3,4,5,6,7, 8]:\n",
" for v in VTYPE_:\n",
" _get_news(stage,vtype=v)\n",
" get_stage_stats(stage)\n",
" for timing in TIMING_:\n",
" _fetch_timing_data(stage,vtype=v, timerank=timing)\n",
" for ranking in RANKING_: \n",
" _fetch_ranking_data(stage,vtype=v, timerank=ranking)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ranking Data\n",
"\n",
"Process the ranking data...\n",
"\n",
"So what have we got to work with?"
]
},
{
"cell_type": "code",
"execution_count": 475,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Pos</th>\n",
" <th>Bib</th>\n",
" <th>Crew</th>\n",
" <th>Brand</th>\n",
" <th>Time</th>\n",
" <th>Gap</th>\n",
" <th>Penalty</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>301</td>\n",
" <td>N. AL-ATTIYAH M. BAUMEL TOYOTA GAZOO RACING SA</td>\n",
" <td>TOYOTA</td>\n",
" <td>29:15:50</td>\n",
" <td>0:00:00</td>\n",
" <td>00:00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>307</td>\n",
" <td>N. ROMA A. HARO BRAVO X-RAID TEAM</td>\n",
" <td>MINI</td>\n",
" <td>30:02:19</td>\n",
" <td>0:46:29</td>\n",
" <td>00:00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>306</td>\n",
" <td>S. LOEB D. ELENA PH-SPORT</td>\n",
" <td>PEUGEOT</td>\n",
" <td>30:02:35</td>\n",
" <td>0:46:45</td>\n",
" <td>00:00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>304</td>\n",
" <td>S. PETERHANSEL D. CASTERA X-RAID MINI JCW TEAM</td>\n",
" <td>MINI</td>\n",
" <td>30:09:20</td>\n",
" <td>0:53:30</td>\n",
" <td>00:00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>308</td>\n",
" <td>C. DESPRES JP. COTTRET X-RAID MINI JCW TEAM</td>\n",
" <td>MINI</td>\n",
" <td>31:31:17</td>\n",
" <td>2:15:27</td>\n",
" <td>00:00:00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Pos Bib Crew Brand \\\n",
"0 1 301 N. AL-ATTIYAH M. BAUMEL TOYOTA GAZOO RACING SA TOYOTA \n",
"1 2 307 N. ROMA A. HARO BRAVO X-RAID TEAM MINI \n",
"2 3 306 S. LOEB D. ELENA PH-SPORT PEUGEOT \n",
"3 4 304 S. PETERHANSEL D. CASTERA X-RAID MINI JCW TEAM MINI \n",
"4 5 308 C. DESPRES JP. COTTRET X-RAID MINI JCW TEAM MINI \n",
"\n",
" Time Gap Penalty \n",
"0 29:15:50 0:00:00 00:00:00 \n",
"1 30:02:19 0:46:29 00:00:00 \n",
"2 30:02:35 0:46:45 00:00:00 \n",
"3 30:09:20 0:53:30 00:00:00 \n",
"4 31:31:17 2:15:27 00:00:00 "
]
},
"execution_count": 475,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rdata = _fetch_ranking_data(STAGE, timerank='general')\n",
"\n",
"rdata[RANKING].head()"
]
},
{
"cell_type": "code",
"execution_count": 476,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pos int64\n",
"Bib int64\n",
"Crew object\n",
"Brand object\n",
"Time object\n",
"Gap object\n",
"Penalty object\n",
"dtype: object"
]
},
"execution_count": 476,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rdata[RANKING].dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The basic retrieval returns a table with timing data as strings, and the `Bib` identifier as an integer.\n",
"\n",
"The `Bib` identifer, We could also regard it as a string so that we aren't tempted to treat it as a number, inwhich case we should also ensure that any extraneous whitespace is stripped if the `Bib` was already a string:\n",
"\n",
"```python\n",
"rdata[RANKING]['Bib'] = rdata[RANKING]['Bib'].astype(str).str.strip()\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 477,
"metadata": {},
"outputs": [],
"source": [
"rdata[RANKING]['Bib'] = rdata[RANKING]['Bib'].astype(int)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Convert time to timedelta\n",
"\n",
"Several of the datasets return times, as strings, in the form: `HH:MM:SS`.\n",
"\n",
"We can convert these times to timedeltas.\n",
"\n",
"Timing related columns are all columns except those in `['Pos','Bib','Crew','Brand']`.\n",
"\n",
"We can also prefix timing columns in the timing data screens so we can recreate the order they should appear in:"
]
},
{
"cell_type": "code",
"execution_count": 478,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'dss': '00_dss', 'wp1': '01_wp1', 'wp2': '02_wp2', 'ass': '03_ass'}"
]
},
"execution_count": 478,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Prefix each split designator with a split count\n",
"timingcols=['dss','wp1','wp2','ass']\n",
"{ x:'{}_{}'.format('{0:02d}'.format(i), x) for i, x in enumerate(timingcols, 0) }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the things we need to handle are timing columns where the timing data may be mixed with other sorts of data in the raw data table.\n",
"\n",
"Routines for cleaner the data are included in the timing handler function but they were actually \"backfilled\" into the function after creating them (originally) later on in the notebook."
]
},
{
"cell_type": "code",
"execution_count": 479,
"metadata": {},
"outputs": [],
"source": [
"from pandas.api.types import is_string_dtype\n",
"\n",
"#At first sight, this looks quite complicated, but a lot of it is backfilled \n",
"# to take into account some of the cleaning we need to do for the full (messy) timing data\n",
"def _get_timing(df, typ=TIMING, kind='simple'):\n",
" ''' Convert times to time deltas and\n",
" prefix waypoint / timing columns with a two digit counter. '''\n",
" \n",
" #Some of the exclusion column names are backfilled into this function\n",
" # from columns introduced later in the notebook\n",
" # What we're trying to do is identify columns that aren't timing related\n",
" timingcols = [c for c in df[typ].columns if c not in ['Pos','Bib','Crew','Brand', 'Refuel', 'Road Position'] ]\n",
" \n",
" #Clean up the data in a timing column, then cast to timedelta\n",
" for col in timingcols:\n",
" #In the simple approach, we just grab the timing data and dump the mess\n",
" if kind=='simple':\n",
" df[typ][col] = df[typ][col].str.extract(r'(\\d{2}:\\d{2}:\\d{2})')\n",
" else:\n",
" #The full on extractor - try to parse out all the data\n",
" # that has been munged into a timing column\n",
" if col==timingcols[-1]:\n",
" #There's an end effect:\n",
" # the last column in the timing dataset doesn't have position embedded in it\n",
" # In this case, just pull out the position gained/maintained/lost flag\n",
" df[typ][[col,col+'_gain']] = df[typ][col].str.extract(r'(\\d{2}:\\d{2}:\\d{2})(.*)', expand=True)\n",
" else:\n",
" #In the main body of the table, position gain as well as waypoint rank position are available\n",
" df[typ][[col,col+'_gain',col+'_pos']] = df[typ][col].str.extract(r'(\\d{2}:\\d{2}:\\d{2})(.*)\\((\\d*)\\)', expand=True)\n",
" #Ideally, the pos cols would be of int type, but int doesn't support NA\n",
" df[typ][col+'_pos'] = df[typ][col+'_pos'].astype(float)\n",
" \n",
" #Cast the time string to a timedelta\n",
" df[typ][col] = pd.to_timedelta( df[typ][col] )\n",
"\n",
"\n",
" #In timing screen, rename cols with a leading two digit index\n",
" #This allows us to report splits in order\n",
" #We only want to do this for the timing data columns, not the rank timing columns...\n",
" timingcols = [c for c in timingcols if c not in ['Time','Gap','Penalty']]\n",
" timingcols_map = { x:'{}_{}'.format('{0:02d}'.format(i), x) for i, x in enumerate(timingcols, 0) }\n",
" df[typ].rename(columns=timingcols_map, inplace=True)\n",
" \n",
" return df\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ranking Data Redux\n",
"\n",
"Normalise the times as timedeltas.\n",
"\n",
"For timestrings of the form `HH:MM:SS`, this is as simple as passing the timestring column to the *pandas* `.to_timedelta()` function:\n",
"\n",
"```python\n",
"pd.to_timedelta( df[TIMESTRING_COLUMN] )\n",
"```\n",
"\n",
"We just need to ensure we pass it the correct columns..."
]
},
{
"cell_type": "code",
"execution_count": 480,
"metadata": {},
"outputs": [],
"source": [
"def get_ranking_data(stage,vtype='car', timerank='stage'):\n",
" ''' Retrieve rank timing data and return it in a form we can work directly with. '''\n",
" \n",
" df = _fetch_ranking_data(stage,vtype=vtype, timerank=timerank)\n",
" df[RANKING]['Bib'] = df[RANKING]['Bib'].astype(int)\n",
" return _get_timing(df, typ=RANKING)"
]
},
{
"cell_type": "code",
"execution_count": 481,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Pos</th>\n",
" <th>Bib</th>\n",
" <th>Crew</th>\n",
" <th>Brand</th>\n",
" <th>Time</th>\n",
" <th>Gap</th>\n",
" <th>Penalty</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>306</td>\n",
" <td>S. LOEB D. ELENA PH-SPORT</td>\n",
" <td>PEUGEOT</td>\n",
" <td>03:54:53</td>\n",
" <td>NaT</td>\n",
" <td>0 days</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>301</td>\n",
" <td>N. AL-ATTIYAH M. BAUMEL TOYOTA GAZOO RACING SA</td>\n",
" <td>TOYOTA</td>\n",
" <td>04:02:20</td>\n",
" <td>00:07:27</td>\n",
" <td>0 days</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>303</td>\n",
" <td>J. PRZYGONSKI T. COLSOUL ORLEN X-RAID TEAM</td>\n",
" <td>MINI</td>\n",
" <td>04:10:08</td>\n",
" <td>00:15:15</td>\n",
" <td>0 days</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>302</td>\n",
" <td>G. DE VILLIERS D. VON ZITZEWITZ TOYOTA GAZOO R...</td>\n",
" <td>TOYOTA</td>\n",
" <td>04:10:48</td>\n",
" <td>00:15:55</td>\n",
" <td>0 days</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>307</td>\n",
" <td>N. ROMA A. HARO BRAVO X-RAID TEAM</td>\n",
" <td>MINI</td>\n",
" <td>04:10:50</td>\n",
" <td>00:15:57</td>\n",
" <td>0 days</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Pos Bib Crew Brand \\\n",
"0 1 306 S. LOEB D. ELENA PH-SPORT PEUGEOT \n",
"1 2 301 N. AL-ATTIYAH M. BAUMEL TOYOTA GAZOO RACING SA TOYOTA \n",
"2 3 303 J. PRZYGONSKI T. COLSOUL ORLEN X-RAID TEAM MINI \n",
"3 4 302 G. DE VILLIERS D. VON ZITZEWITZ TOYOTA GAZOO R... TOYOTA \n",
"4 5 307 N. ROMA A. HARO BRAVO X-RAID TEAM MINI \n",
"\n",
" Time Gap Penalty \n",
"0 03:54:53 NaT 0 days \n",
"1 04:02:20 00:07:27 0 days \n",
"2 04:10:08 00:15:15 0 days \n",
"3 04:10:48 00:15:55 0 days \n",
"4 04:10:50 00:15:57 0 days "
]
},
"execution_count": 481,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ranking_data = get_ranking_data(STAGE)\n",
"\n",
"ranking_data[RANKING].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `Crew` data is a bit of a mishmash. If we were to normalise this table, we'd have to split that data out...\n",
"\n",
"For now, let's leave it...\n",
"\n",
"...because sometimes, it can be handy to be able to pull out a chunk of unnormalised data as a simple string."
]
},
{
"cell_type": "code",
"execution_count": 482,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pos int64\n",
"Bib int64\n",
"Crew object\n",
"Brand object\n",
"Time timedelta64[ns]\n",
"Gap timedelta64[ns]\n",
"Penalty timedelta64[ns]\n",
"dtype: object"
]
},
"execution_count": 482,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ranking_data[RANKING].dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Timing Data\n",
"\n",
"The timing data needs some processing:"
]
},
{
"cell_type": "code",
"execution_count": 483,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Pos</th>\n",
" <th>Bib</th>\n",
" <th>Crew</th>\n",
" <th>Brand</th>\n",
" <th>dss</th>\n",
" <th>wp1</th>\n",
" <th>wp2</th>\n",
" <th>ass1</th>\n",
" <th>wp4</th>\n",
" <th>wp5</th>\n",
" <th>wp6</th>\n",
" <th>ass</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>60</th>\n",
" <td>NaN</td>\n",
" <td>422</td>\n",
" <td>BA. MARTINS RJP. FERREIRA TEAM BBR</td>\n",
" <td>CAN - AM</td>\n",
" <td>11:30:15= (52)</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>61</th>\n",
" <td>NaN</td>\n",
" <td>405</td>\n",
" <td>MA. ALVAREZ PINEDA R. MENDIOLA ALVAREZ</td>\n",
" <td>TOYOTA</td>\n",
" <td>11:38:15= (55)</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>62</th>\n",
" <td>NaN</td>\n",
" <td>404</td>\n",
" <td>A. AGUIRREGAVIRIA J. COMALLONGA FN SPEED TEAM</td>\n",
" <td>TOYOTA</td>\n",
" <td>11:38:45= (56)</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>63</th>\n",
" <td>NaN</td>\n",
" <td>392</td>\n",
" <td>D. WEBER JJ. PONCE AYLWIN DURO4X4.COM</td>\n",
" <td>TOYOTA</td>\n",
" <td>11:39:15= (57)</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>64</th>\n",
" <td>NaN</td>\n",
" <td>420</td>\n",
" <td>N. FALLOUX F. GONZALEZ OVERDRIVE TOYOTA</td>\n",
" <td>TOYOTA</td>\n",
" <td>11:56:45= (62)</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Pos Bib Crew Brand \\\n",
"60 NaN 422 BA. MARTINS RJP. FERREIRA TEAM BBR CAN - AM \n",
"61 NaN 405 MA. ALVAREZ PINEDA R. MENDIOLA ALVAREZ TOYOTA \n",
"62 NaN 404 A. AGUIRREGAVIRIA J. COMALLONGA FN SPEED TEAM TOYOTA \n",
"63 NaN 392 D. WEBER JJ. PONCE AYLWIN DURO4X4.COM TOYOTA \n",
"64 NaN 420 N. FALLOUX F. GONZALEZ OVERDRIVE TOYOTA TOYOTA \n",
"\n",
" dss wp1 wp2 ass1 wp4 wp5 wp6 ass \n",
"60 11:30:15= (52) NaN NaN NaN NaN NaN NaN NaN \n",
"61 11:38:15= (55) NaN NaN NaN NaN NaN NaN NaN \n",
"62 11:38:45= (56) NaN NaN NaN NaN NaN NaN NaN \n",
"63 11:39:15= (57) NaN NaN NaN NaN NaN NaN NaN \n",
"64 11:56:45= (62) NaN NaN NaN NaN NaN NaN NaN "
]
},
"execution_count": 483,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = _fetch_timing_data(STAGE)\n",
"\n",
"data[TIMING][60:70]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A full inspection of the time data shows that some additional metadata corresponding to whether in-stage refuelling is allowed may also be recorded in the `Bib` column (for example, `403 ⛽`).\n",
"\n",
"We can extract this information into a separate dataframe / table."
]
},
{
"cell_type": "code",
"execution_count": 484,
"metadata": {},
"outputs": [],
"source": [
"def get_refuel_status(df):\n",
" ''' Parse the refuel status out of timing data Bib column.\n",
" Return extended dataframe with a clean Bib column and a new Refuel column. '''\n",
" \n",
" #The .str.extract() function allows us to match separate groups using a regex\n",
" # and return the corresponding group data as distinct columns\n",
" #Force the Bin type to a str if it isn't created as such so we can regex it...\n",
" df[['Bib','_tmp']] = df['Bib'].astype(str).str.extract(r'(\\d*)([^\\d]*)', expand=True)\n",
" \n",
" #Set the Refuel status as a Boolean\n",
" df.insert(2, 'Refuel', df['_tmp'])\n",
" df.drop('_tmp', axis=1, inplace=True)\n",
" df['Refuel'] = df['Refuel']!=''\n",
" \n",
" #Set the Bib value as an int\n",
" df['Bib'] = df['Bib'].astype(int)\n",
" \n",
" return df"
]
},
{
"cell_type": "code",
"execution_count": 485,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Pos</th>\n",
" <th>Bib</th>\n",
" <th>Refuel</th>\n",
" <th>Crew</th>\n",
" <th>Brand</th>\n",
" <th>dss</th>\n",
" <th>wp1</th>\n",
" <th>wp2</th>\n",
" <th>ass1</th>\n",
" <th>wp4</th>\n",
" <th>wp5</th>\n",
" <th>wp6</th>\n",
" <th>ass</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>60</th>\n",
" <td>NaN</td>\n",
" <td>422</td>\n",
" <td>False</td>\n",
" <td>BA. MARTINS RJP. FERREIRA TEAM BBR</td>\n",
" <td>CAN - AM</td>\n",
" <td>11:30:15= (52)</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>61</th>\n",
" <td>NaN</td>\n",
" <td>405</td>\n",
" <td>False</td>\n",
" <td>MA. ALVAREZ PINEDA R. MENDIOLA ALVAREZ</td>\n",
" <td>TOYOTA</td>\n",
" <td>11:38:15= (55)</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>62</th>\n",
" <td>NaN</td>\n",
" <td>404</td>\n",
" <td>False</td>\n",
" <td>A. AGUIRREGAVIRIA J. COMALLONGA FN SPEED TEAM</td>\n",
" <td>TOYOTA</td>\n",
" <td>11:38:45= (56)</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>63</th>\n",
" <td>NaN</td>\n",
" <td>392</td>\n",
" <td>False</td>\n",
" <td>D. WEBER JJ. PONCE AYLWIN DURO4X4.COM</td>\n",
" <td>TOYOTA</td>\n",
" <td>11:39:15= (57)</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>64</th>\n",
" <td>NaN</td>\n",
" <td>420</td>\n",
" <td>False</td>\n",
" <td>N. FALLOUX F. GONZALEZ OVERDRIVE TOYOTA</td>\n",
" <td>TOYOTA</td>\n",
" <td>11:56:45= (62)</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Pos Bib Refuel Crew Brand \\\n",
"60 NaN 422 False BA. MARTINS RJP. FERREIRA TEAM BBR CAN - AM \n",
"61 NaN 405 False MA. ALVAREZ PINEDA R. MENDIOLA ALVAREZ TOYOTA \n",
"62 NaN 404 False A. AGUIRREGAVIRIA J. COMALLONGA FN SPEED TEAM TOYOTA \n",
"63 NaN 392 False D. WEBER JJ. PONCE AYLWIN DURO4X4.COM TOYOTA \n",
"64 NaN 420 False N. FALLOUX F. GONZALEZ OVERDRIVE TOYOTA TOYOTA \n",
"\n",
" dss wp1 wp2 ass1 wp4 wp5 wp6 ass \n",
"60 11:30:15= (52) NaN NaN NaN NaN NaN NaN NaN \n",
"61 11:38:15= (55) NaN NaN NaN NaN NaN NaN NaN \n",
"62 11:38:45= (56) NaN NaN NaN NaN NaN NaN NaN \n",
"63 11:39:15= (57) NaN NaN NaN NaN NaN NaN NaN \n",
"64 11:56:45= (62) NaN NaN NaN NaN NaN NaN NaN "
]
},
"execution_count": 485,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[TIMING] = get_refuel_status(data[TIMING])\n",
"data[TIMING][60:70]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also notice that the raw timing data includes information about split rank and how it compares to the rank at the previous split / waypoint, with the raw data taking the form `08:44:00= (11)`. Which is to say, `HH:MM:DDx (NN?)` where `x` is a comparator showing whether the rank at that waypoint improved (▲), remained the same as (=), or worsened (▼) compared to the previous waypoint.\n",
"\n",
"Note that the final `ass` column does not include the rank.\n",
"\n",
"We can use a regular expression to separate the data out, with each regex group being expanded into a separate column:"
]
},
{
"cell_type": "code",
"execution_count": 486,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>10:09:30</td>\n",
" <td>=</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>07:54:00</td>\n",
" <td>=</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>08:03:00</td>\n",
" <td>=</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>08:00:00</td>\n",
" <td>=</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>07:36:00</td>\n",
" <td>=</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 10:09:30 = 11\n",
"1 07:54:00 = 4\n",
"2 08:03:00 = 7\n",
"3 08:00:00 = 6\n",
"4 07:36:00 = 2"
]
},
"execution_count": 486,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[TIMING]['dss'].str.extract(r'(\\d{2}:\\d{2}:\\d{2})(.*)\\((\\d*)\\)', expand=True).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can backfill an expression of that form into the timing data handler function above...\n",
"\n",
"Now we wrap several steps together into a function that gets us a clean set of timing data, with columns of an appropriate type:"
]
},
{
"cell_type": "code",
"execution_count": 487,
"metadata": {},
"outputs": [],
"source": [
"def get_timing_data(stage,vtype='car', timerank='time', kind='simple'):\n",
" ''' Get timing data in a form ready to use. '''\n",
" df = _fetch_timing_data(stage,vtype=vtype, timerank=timerank)\n",
"\n",
" df[TIMING] = get_refuel_status(df[TIMING])\n",
" return _get_timing(df, typ=TIMING, kind=kind)\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 488,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Pos</th>\n",
" <th>Bib</th>\n",
" <th>Refuel</th>\n",
" <th>Crew</th>\n",
" <th>Brand</th>\n",
" <th>00_dss</th>\n",
" <th>01_wp1</th>\n",
" <th>02_wp2</th>\n",
" <th>03_ass1</th>\n",
" <th>04_wp4</th>\n",
" <th>...</th>\n",
" <th>wp2_pos</th>\n",
" <th>ass1_gain</th>\n",
" <th>ass1_pos</th>\n",
" <th>wp4_gain</th>\n",
" <th>wp4_pos</th>\n",
" <th>wp5_gain</th>\n",
" <th>wp5_pos</th>\n",
" <th>wp6_gain</th>\n",
" <th>wp6_pos</th>\n",
" <th>ass_gain</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>306</td>\n",
" <td>False</td>\n",
" <td>S. LOEB D. ELENA PH-SPORT</td>\n",
" <td>PEUGEOT</td>\n",
" <td>10:09:30</td>\n",
" <td>00:31:03</td>\n",
" <td>00:59:29</td>\n",
" <td>01:29:58</td>\n",
" <td>NaT</td>\n",
" <td>...</td>\n",
" <td>6.0</td>\n",
" <td>▲</td>\n",
" <td>3.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>▲</td>\n",
" <td>1.0</td>\n",
" <td>=</td>\n",
" <td>1.0</td>\n",
" <td>=</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>301</td>\n",
" <td>False</td>\n",
" <td>N. AL-ATTIYAH M. BAUMEL TOYOTA GAZOO RACING SA</td>\n",
" <td>TOYOTA</td>\n",
" <td>07:54:00</td>\n",
" <td>00:28:49</td>\n",
" <td>00:56:57</td>\n",
" <td>01:26:33</td>\n",
" <td>01:56:13</td>\n",
" <td>...</td>\n",
" <td>3.0</td>\n",
" <td>▲</td>\n",
" <td>1.0</td>\n",
" <td>=</td>\n",
" <td>1.0</td>\n",
" <td>▼</td>\n",
" <td>2.0</td>\n",
" <td>=</td>\n",
" <td>2.0</td>\n",
" <td>=</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3.0</td>\n",
" <td>303</td>\n",
" <td>False</td>\n",
" <td>J. PRZYGONSKI T. COLSOUL ORLEN X-RAID TEAM</td>\n",
" <td>MINI</td>\n",
" <td>08:03:00</td>\n",
" <td>00:30:27</td>\n",
" <td>01:00:33</td>\n",
" <td>01:30:33</td>\n",
" <td>02:02:29</td>\n",
" <td>...</td>\n",
" <td>8.0</td>\n",
" <td>▲</td>\n",
" <td>5.0</td>\n",
" <td>▲</td>\n",
" <td>3.0</td>\n",
" <td>=</td>\n",
" <td>3.0</td>\n",
" <td>▼</td>\n",
" <td>4.0</td>\n",
" <td>▲</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4.0</td>\n",
" <td>302</td>\n",
" <td>False</td>\n",
" <td>G. DE VILLIERS D. VON ZITZEWITZ TOYOTA GAZOO R...</td>\n",
" <td>TOYOTA</td>\n",
" <td>08:00:00</td>\n",
" <td>00:31:34</td>\n",
" <td>01:00:55</td>\n",
" <td>01:30:23</td>\n",
" <td>02:02:26</td>\n",
" <td>...</td>\n",
" <td>9.0</td>\n",
" <td>▲</td>\n",
" <td>4.0</td>\n",
" <td>▲</td>\n",
" <td>2.0</td>\n",
" <td>▼</td>\n",
" <td>5.0</td>\n",
" <td>▲</td>\n",
" <td>3.0</td>\n",
" <td>▼</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5.0</td>\n",
" <td>307</td>\n",
" <td>False</td>\n",
" <td>N. ROMA A. HARO BRAVO X-RAID TEAM</td>\n",
" <td>MINI</td>\n",
" <td>07:36:00</td>\n",
" <td>00:29:48</td>\n",
" <td>00:58:13</td>\n",
" <td>01:32:02</td>\n",
" <td>02:03:26</td>\n",
" <td>...</td>\n",
" <td>4.0</td>\n",
" <td>▼</td>\n",
" <td>6.0</td>\n",
" <td>▲</td>\n",
" <td>4.0</td>\n",
" <td>=</td>\n",
" <td>4.0</td>\n",
" <td>▼</td>\n",
" <td>5.0</td>\n",
" <td>=</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 28 columns</p>\n",
"</div>"
],
"text/plain": [
" Pos Bib Refuel Crew \\\n",
"0 1.0 306 False S. LOEB D. ELENA PH-SPORT \n",
"1 2.0 301 False N. AL-ATTIYAH M. BAUMEL TOYOTA GAZOO RACING SA \n",
"2 3.0 303 False J. PRZYGONSKI T. COLSOUL ORLEN X-RAID TEAM \n",
"3 4.0 302 False G. DE VILLIERS D. VON ZITZEWITZ TOYOTA GAZOO R... \n",
"4 5.0 307 False N. ROMA A. HARO BRAVO X-RAID TEAM \n",
"\n",
" Brand 00_dss 01_wp1 02_wp2 03_ass1 04_wp4 ... wp2_pos \\\n",
"0 PEUGEOT 10:09:30 00:31:03 00:59:29 01:29:58 NaT ... 6.0 \n",
"1 TOYOTA 07:54:00 00:28:49 00:56:57 01:26:33 01:56:13 ... 3.0 \n",
"2 MINI 08:03:00 00:30:27 01:00:33 01:30:33 02:02:29 ... 8.0 \n",
"3 TOYOTA 08:00:00 00:31:34 01:00:55 01:30:23 02:02:26 ... 9.0 \n",
"4 MINI 07:36:00 00:29:48 00:58:13 01:32:02 02:03:26 ... 4.0 \n",
"\n",
" ass1_gain ass1_pos wp4_gain wp4_pos wp5_gain wp5_pos wp6_gain wp6_pos \\\n",
"0 ▲ 3.0 NaN NaN ▲ 1.0 = 1.0 \n",
"1 ▲ 1.0 = 1.0 ▼ 2.0 = 2.0 \n",
"2 ▲ 5.0 ▲ 3.0 = 3.0 ▼ 4.0 \n",
"3 ▲ 4.0 ▲ 2.0 ▼ 5.0 ▲ 3.0 \n",
"4 ▼ 6.0 ▲ 4.0 = 4.0 ▼ 5.0 \n",
"\n",
" ass_gain \n",
"0 = \n",
"1 = \n",
"2 ▲ \n",
"3 ▼ \n",
"4 = \n",
"\n",
"[5 rows x 28 columns]"
]
},
"execution_count": 488,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = get_timing_data(STAGE, kind='full')\n",
"data[TIMING].head()"
]
},
{
"cell_type": "code",
"execution_count": 489,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pos float64\n",
"Bib int64\n",
"Refuel bool\n",
"Crew object\n",
"Brand object\n",
"00_dss timedelta64[ns]\n",
"01_wp1 timedelta64[ns]\n",
"02_wp2 timedelta64[ns]\n",
"03_ass1 timedelta64[ns]\n",
"04_wp4 timedelta64[ns]\n",
"05_wp5 timedelta64[ns]\n",
"06_wp6 timedelta64[ns]\n",
"07_ass timedelta64[ns]\n",
"dss_gain object\n",
"dss_pos float64\n",
"wp1_gain object\n",
"wp1_pos float64\n",
"wp2_gain object\n",
"wp2_pos float64\n",
"ass1_gain object\n",
"ass1_pos float64\n",
"wp4_gain object\n",
"wp4_pos float64\n",
"wp5_gain object\n",
"wp5_pos float64\n",
"wp6_gain object\n",
"wp6_pos float64\n",
"ass_gain object\n",
"dtype: object"
]
},
"execution_count": 489,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[TIMING].dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Parse Metadata\n",
"\n",
"Some of the scraped tables are used to provide selection lists, but we might be able to use them as metadata tables.\n",
"\n",
"For example, here's a pretty complete set, although mangled together, set of competititor names, nationalities, and team names:"
]
},
{
"cell_type": "code",
"execution_count": 490,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Highlight</th>\n",
" <th>Filter</th>\n",
" <th>Bib</th>\n",
" <th>Names</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Highglight crew</td>\n",
" <td>Filter crew</td>\n",
" <td>1</td>\n",
" <td>M. WALKNER (Austria)RED BULL KTM FACTORY TEAM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Highglight crew</td>\n",
" <td>Filter crew</td>\n",
" <td>2</td>\n",
" <td>P. GONCALVES (Portugal)MONSTER ENERGY HONDA TE...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Highglight crew</td>\n",
" <td>Filter crew</td>\n",
" <td>3</td>\n",
" <td>T. PRICE (Australia)RED BULL KTM FACTORY TEAM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Highglight crew</td>\n",
" <td>Filter crew</td>\n",
" <td>4</td>\n",
" <td>A. VAN BEVEREN (France)YAMALUBE YAMAHA OFFICIA...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Highglight crew</td>\n",
" <td>Filter crew</td>\n",
" <td>5</td>\n",
" <td>J. BARREDA BORT (Spain)MONSTER ENERGY HONDA TE...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Highlight Filter Bib \\\n",
"0 Highglight crew Filter crew 1 \n",
"1 Highglight crew Filter crew 2 \n",
"2 Highglight crew Filter crew 3 \n",
"3 Highglight crew Filter crew 4 \n",
"4 Highglight crew Filter crew 5 \n",
"\n",
" Names \n",
"0 M. WALKNER (Austria)RED BULL KTM FACTORY TEAM \n",
"1 P. GONCALVES (Portugal)MONSTER ENERGY HONDA TE... \n",
"2 T. PRICE (Australia)RED BULL KTM FACTORY TEAM \n",
"3 A. VAN BEVEREN (France)YAMALUBE YAMAHA OFFICIA... \n",
"4 J. BARREDA BORT (Spain)MONSTER ENERGY HONDA TE... "
]
},
"execution_count": 490,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[ CREWTEAM ].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It'll probably be convenient to have the unique `Bib` values available as an index:"
]
},
{
"cell_type": "code",
"execution_count": 491,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Names</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Bib</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>M. WALKNER (Austria)RED BULL KTM FACTORY TEAM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>P. GONCALVES (Portugal)MONSTER ENERGY HONDA TE...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>T. PRICE (Australia)RED BULL KTM FACTORY TEAM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>A. VAN BEVEREN (France)YAMALUBE YAMAHA OFFICIA...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>J. BARREDA BORT (Spain)MONSTER ENERGY HONDA TE...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Names\n",
"Bib \n",
"1 M. WALKNER (Austria)RED BULL KTM FACTORY TEAM\n",
"2 P. GONCALVES (Portugal)MONSTER ENERGY HONDA TE...\n",
"3 T. PRICE (Australia)RED BULL KTM FACTORY TEAM\n",
"4 A. VAN BEVEREN (France)YAMALUBE YAMAHA OFFICIA...\n",
"5 J. BARREDA BORT (Spain)MONSTER ENERGY HONDA TE..."
]
},
"execution_count": 491,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[ CREWTEAM ] = data[ CREWTEAM ][['Bib', 'Names']].set_index('Bib')\n",
"data[ CREWTEAM ].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `Names` may have several `Name (Country)` values, followed by a team name. The original HTML uses `<span>` tags to separate out values but the *pandas* `.read_html()` function flattens cell contents.\n",
" \n",
"Let's have a go at pulling out the team names, which appear at the end of the string. If we can split each name, and the team name, into separate columns, and then metl those columns into separate rows, grouped by `Bib` number, we should be able to grab the last row, corrsponding to the team, in each group:"
]
},
{
"cell_type": "code",
"execution_count": 492,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Bib\n",
"1 RED BULL KTM FACTORY TEAM\n",
"2 MONSTER ENERGY HONDA TEAM 2019\n",
"3 RED BULL KTM FACTORY TEAM\n",
"4 YAMALUBE YAMAHA OFFICIAL RALLY TEAM\n",
"5 MONSTER ENERGY HONDA TEAM 2019\n",
"Name: value, dtype: object"
]
},
"execution_count": 492,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Perhaps split on brackets?\n",
"# At least one team has brackets in the name at the end of the name\n",
"# So let's make that case, at least, a \"not bracket\" by setting a ) at the end to a :]:\n",
"# so we don't (mistakenly) split on it as if it were a country-associated bracket.\n",
"teams = data[ CREWTEAM ]['Names'].str.replace(r'\\)$',':]:').str.split(')').apply(pd.Series).reset_index().melt(id_vars='Bib', var_name='Num').dropna()\n",
"\n",
"#Find last item in each group, which is to say: the team\n",
"teamnames = teams.groupby('Bib').last()\n",
"#Defudge any brackets at the end back\n",
"teamnames = teamnames['value'].str.replace(':]:',')')\n",
"teamnames.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's go after the competitors. These are all *but* the last row in each group:"
]
},
{
"cell_type": "code",
"execution_count": 493,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Bib</th>\n",
" <th>Num</th>\n",
" <th>Name</th>\n",
" <th>Country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>M. WALKNER</td>\n",
" <td>Austria</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>P. GONCALVES</td>\n",
" <td>Portugal</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>T. PRICE</td>\n",
" <td>Australia</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>A. VAN BEVEREN</td>\n",
" <td>France</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>J. BARREDA BORT</td>\n",
" <td>Spain</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Bib Num Name Country\n",
"0 1 0 M. WALKNER Austria\n",
"1 2 0 P. GONCALVES Portugal\n",
"2 3 0 T. PRICE Australia\n",
"3 4 0 A. VAN BEVEREN France\n",
"4 5 0 J. BARREDA BORT Spain"
]
},
"execution_count": 493,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Remove last row in group i.e. the team\n",
"personnel = teams.groupby('Bib').apply(lambda x: x.iloc[:-1]).set_index('Bib').reset_index()\n",
"\n",
"personnel[['Name','Country']] = personnel['value'].str.split('(').apply(pd.Series)\n",
"\n",
"#Strip whitespace\n",
"for c in ['Name','Country']:\n",
" personnel[c] = personnel[c].str.strip()\n",
" \n",
"personnel[['Bib','Num','Name','Country']].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For convenience, we might want to reshape this long form back to a wide form, with a single string containing all the competitor names associated with a particular `Bib` identifier:"
]
},
{
"cell_type": "code",
"execution_count": 494,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Bib\n",
"538 CG. RICKLER DEL MARE, DR. BURAN\n",
"539 F. SKROBANEK, P. LESAK, R. BACULIK\n",
"540 J. GINESTA, F. ESTER FERNANDEZ, M. DARDAILLON\n",
"541 A. BENBEKHTI, S. BENBEKHTI, R. OSMANI\n",
"542 S. BESNARD, F. DERONCE, S. LALICHE\n",
"Name: Name, dtype: object"
]
},
"execution_count": 494,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Create a single name string for each vehicle\n",
"#For each Bib number, group the rows associated with that number\n",
"# and aggregate the names in those rows into a single, comma separated, joined string\n",
"# indexed by the corresponding Bib number\n",
"personnel.groupby('Bib')['Name'].agg(lambda col: ', '.join(col)).tail()"
]
},
{
"cell_type": "code",
"execution_count": 495,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Filter</th>\n",
" <th>Names</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Filter brand</td>\n",
" <td>BETA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Filter brand</td>\n",
" <td>BMW</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Filter brand</td>\n",
" <td>BORGWARD</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Filter brand</td>\n",
" <td>BOSUER</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Filter brand</td>\n",
" <td>BRP</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Filter Names\n",
"0 Filter brand BETA\n",
"1 Filter brand BMW\n",
"2 Filter brand BORGWARD\n",
"3 Filter brand BOSUER\n",
"4 Filter brand BRP"
]
},
"execution_count": 495,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[ BRANDS ].head()"
]
},
{
"cell_type": "code",
"execution_count": 496,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Filter</th>\n",
" <th>Names</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Filter country</td>\n",
" <td>Andorra (AND)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Filter country</td>\n",
" <td>United Arab Emirates (ARE)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Filter country</td>\n",
" <td>Argentina (ARG)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Filter country</td>\n",
" <td>Australia (AUS)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Filter country</td>\n",
" <td>Austria (AUT)</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Filter Names\n",
"0 Filter country Andorra (AND)\n",
"1 Filter country United Arab Emirates (ARE)\n",
"2 Filter country Argentina (ARG)\n",
"3 Filter country Australia (AUS)\n",
"4 Filter country Austria (AUT)"
]
},
"execution_count": 496,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[ COUNTRIES ].head()"
]
},
{
"cell_type": "code",
"execution_count": 497,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Filter</th>\n",
" <th>Names</th>\n",
" <th>Country</th>\n",
" <th>CountryCode</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Filter country</td>\n",
" <td>Andorra (AND)</td>\n",
" <td>Andorra</td>\n",
" <td>AND</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Filter country</td>\n",
" <td>United Arab Emirates (ARE)</td>\n",
" <td>United Arab Emirates</td>\n",
" <td>ARE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Filter country</td>\n",
" <td>Argentina (ARG)</td>\n",
" <td>Argentina</td>\n",
" <td>ARG</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Filter country</td>\n",
" <td>Australia (AUS)</td>\n",
" <td>Australia</td>\n",
" <td>AUS</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Filter country</td>\n",
" <td>Austria (AUT)</td>\n",
" <td>Austria</td>\n",
" <td>AUT</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Filter Names Country \\\n",
"0 Filter country Andorra (AND) Andorra \n",
"1 Filter country United Arab Emirates (ARE) United Arab Emirates \n",
"2 Filter country Argentina (ARG) Argentina \n",
"3 Filter country Australia (AUS) Australia \n",
"4 Filter country Austria (AUT) Austria \n",
"\n",
" CountryCode \n",
"0 AND \n",
"1 ARE \n",
"2 ARG \n",
"3 AUS \n",
"4 AUT "
]
},
"execution_count": 497,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[ COUNTRIES ][['Country','CountryCode']] = data[ COUNTRIES ]['Names'].str.extract(r'(.*) \\((.*)\\)',expand=True)\n",
"data[ COUNTRIES ].head()"
]
},
{
"cell_type": "code",
"execution_count": 498,
"metadata": {},
"outputs": [],
"source": [
"def get_annotated_timing_data(stage,vtype='car', timerank='time'):\n",
" ''' Return a timing dataset that's ready to use. '''\n",
" \n",
" df = get_timing_data(stage, vtype, timerank)\n",
" df[TIMING].insert(2,'Road Position', df[TIMING].sort_values('00_dss',ascending=True)['00_dss'].rank())\n",
" return df"
]
},
{
"cell_type": "code",
"execution_count": 499,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Pos</th>\n",
" <th>Bib</th>\n",
" <th>Road Position</th>\n",
" <th>Refuel</th>\n",
" <th>Crew</th>\n",
" <th>Brand</th>\n",
" <th>00_dss</th>\n",
" <th>01_wp1</th>\n",
" <th>02_wp2</th>\n",
" <th>03_ass1</th>\n",
" <th>04_wp4</th>\n",
" <th>05_wp5</th>\n",
" <th>06_wp6</th>\n",
" <th>07_ass</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>306</td>\n",
" <td>11.0</td>\n",
" <td>False</td>\n",
" <td>S. LOEB D. ELENA PH-SPORT</td>\n",
" <td>PEUGEOT</td>\n",
" <td>10:09:30</td>\n",
" <td>00:31:03</td>\n",
" <td>00:59:29</td>\n",
" <td>01:29:58</td>\n",
" <td>NaT</td>\n",
" <td>02:31:06</td>\n",
" <td>03:11:08</td>\n",
" <td>03:54:53</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>301</td>\n",
" <td>4.0</td>\n",
" <td>False</td>\n",
" <td>N. AL-ATTIYAH M. BAUMEL TOYOTA GAZOO RACING SA</td>\n",
" <td>TOYOTA</td>\n",
" <td>07:54:00</td>\n",
" <td>00:28:49</td>\n",
" <td>00:56:57</td>\n",
" <td>01:26:33</td>\n",
" <td>01:56:13</td>\n",
" <td>02:35:08</td>\n",
" <td>03:16:42</td>\n",
" <td>04:02:20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3.0</td>\n",
" <td>303</td>\n",
" <td>7.0</td>\n",
" <td>False</td>\n",
" <td>J. PRZYGONSKI T. COLSOUL ORLEN X-RAID TEAM</td>\n",
" <td>MINI</td>\n",
" <td>08:03:00</td>\n",
" <td>00:30:27</td>\n",
" <td>01:00:33</td>\n",
" <td>01:30:33</td>\n",
" <td>02:02:29</td>\n",
" <td>02:39:15</td>\n",
" <td>03:22:41</td>\n",
" <td>04:10:08</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4.0</td>\n",
" <td>302</td>\n",
" <td>6.0</td>\n",
" <td>False</td>\n",
" <td>G. DE VILLIERS D. VON ZITZEWITZ TOYOTA GAZOO R...</td>\n",
" <td>TOYOTA</td>\n",
" <td>08:00:00</td>\n",
" <td>00:31:34</td>\n",
" <td>01:00:55</td>\n",
" <td>01:30:23</td>\n",
" <td>02:02:26</td>\n",
" <td>02:39:30</td>\n",
" <td>03:22:28</td>\n",
" <td>04:10:48</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5.0</td>\n",
" <td>307</td>\n",
" <td>2.0</td>\n",
" <td>False</td>\n",
" <td>N. ROMA A. HARO BRAVO X-RAID TEAM</td>\n",
" <td>MINI</td>\n",
" <td>07:36:00</td>\n",
" <td>00:29:48</td>\n",
" <td>00:58:13</td>\n",
" <td>01:32:02</td>\n",
" <td>02:03:26</td>\n",
" <td>02:39:27</td>\n",
" <td>03:22:42</td>\n",
" <td>04:10:50</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Pos Bib Road Position Refuel \\\n",
"0 1.0 306 11.0 False \n",
"1 2.0 301 4.0 False \n",
"2 3.0 303 7.0 False \n",
"3 4.0 302 6.0 False \n",
"4 5.0 307 2.0 False \n",
"\n",
" Crew Brand 00_dss \\\n",
"0 S. LOEB D. ELENA PH-SPORT PEUGEOT 10:09:30 \n",
"1 N. AL-ATTIYAH M. BAUMEL TOYOTA GAZOO RACING SA TOYOTA 07:54:00 \n",
"2 J. PRZYGONSKI T. COLSOUL ORLEN X-RAID TEAM MINI 08:03:00 \n",
"3 G. DE VILLIERS D. VON ZITZEWITZ TOYOTA GAZOO R... TOYOTA 08:00:00 \n",
"4 N. ROMA A. HARO BRAVO X-RAID TEAM MINI 07:36:00 \n",
"\n",
" 01_wp1 02_wp2 03_ass1 04_wp4 05_wp5 06_wp6 07_ass \n",
"0 00:31:03 00:59:29 01:29:58 NaT 02:31:06 03:11:08 03:54:53 \n",
"1 00:28:49 00:56:57 01:26:33 01:56:13 02:35:08 03:16:42 04:02:20 \n",
"2 00:30:27 01:00:33 01:30:33 02:02:29 02:39:15 03:22:41 04:10:08 \n",
"3 00:31:34 01:00:55 01:30:23 02:02:26 02:39:30 03:22:28 04:10:48 \n",
"4 00:29:48 00:58:13 01:32:02 02:03:26 02:39:27 03:22:42 04:10:50 "
]
},
"execution_count": 499,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"t_data = get_annotated_timing_data(STAGE,vtype='car', timerank='time')[TIMING]\n",
"t_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 500,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Pos</th>\n",
" <th>Road Position</th>\n",
" <th>Refuel</th>\n",
" <th>Bib</th>\n",
" <th>Crew</th>\n",
" <th>Brand</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>11.0</td>\n",
" <td>False</td>\n",
" <td>306</td>\n",
" <td>S. LOEB D. ELENA PH-SPORT</td>\n",
" <td>PEUGEOT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>4.0</td>\n",
" <td>False</td>\n",
" <td>301</td>\n",
" <td>N. AL-ATTIYAH M. BAUMEL TOYOTA GAZOO RACING SA</td>\n",
" <td>TOYOTA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3.0</td>\n",
" <td>7.0</td>\n",
" <td>False</td>\n",
" <td>303</td>\n",
" <td>J. PRZYGONSKI T. COLSOUL ORLEN X-RAID TEAM</td>\n",
" <td>MINI</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4.0</td>\n",
" <td>6.0</td>\n",
" <td>False</td>\n",
" <td>302</td>\n",
" <td>G. DE VILLIERS D. VON ZITZEWITZ TOYOTA GAZOO R...</td>\n",
" <td>TOYOTA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5.0</td>\n",
" <td>2.0</td>\n",
" <td>False</td>\n",
" <td>307</td>\n",
" <td>N. ROMA A. HARO BRAVO X-RAID TEAM</td>\n",
" <td>MINI</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Pos Road Position Refuel Bib \\\n",
"0 1.0 11.0 False 306 \n",
"1 2.0 4.0 False 301 \n",
"2 3.0 7.0 False 303 \n",
"3 4.0 6.0 False 302 \n",
"4 5.0 2.0 False 307 \n",
"\n",
" Crew Brand \n",
"0 S. LOEB D. ELENA PH-SPORT PEUGEOT \n",
"1 N. AL-ATTIYAH M. BAUMEL TOYOTA GAZOO RACING SA TOYOTA \n",
"2 J. PRZYGONSKI T. COLSOUL ORLEN X-RAID TEAM MINI \n",
"3 G. DE VILLIERS D. VON ZITZEWITZ TOYOTA GAZOO R... TOYOTA \n",
"4 N. ROMA A. HARO BRAVO X-RAID TEAM MINI "
]
},
"execution_count": 500,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"not_timing_cols = ['Pos','Road Position','Refuel','Bib','Crew','Brand']\n",
"\n",
"driver_data = t_data[ not_timing_cols ]\n",
"driver_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The number of waypoints differs across stages. If we cast the wide format waypoint data into a long form, we can more conveniently merge waypoint timing data from separate stages into the same dataframe."
]
},
{
"cell_type": "code",
"execution_count": 501,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Pos</th>\n",
" <th>Road Position</th>\n",
" <th>Refuel</th>\n",
" <th>Bib</th>\n",
" <th>Crew</th>\n",
" <th>Brand</th>\n",
" <th>Section</th>\n",
" <th>Time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>11.0</td>\n",
" <td>False</td>\n",
" <td>306</td>\n",
" <td>S. LOEB D. ELENA PH-SPORT</td>\n",
" <td>PEUGEOT</td>\n",
" <td>00_dss</td>\n",
" <td>10:09:30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>4.0</td>\n",
" <td>False</td>\n",
" <td>301</td>\n",
" <td>N. AL-ATTIYAH M. BAUMEL TOYOTA GAZOO RACING SA</td>\n",
" <td>TOYOTA</td>\n",
" <td>00_dss</td>\n",
" <td>07:54:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3.0</td>\n",
" <td>7.0</td>\n",
" <td>False</td>\n",
" <td>303</td>\n",
" <td>J. PRZYGONSKI T. COLSOUL ORLEN X-RAID TEAM</td>\n",
" <td>MINI</td>\n",
" <td>00_dss</td>\n",
" <td>08:03:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4.0</td>\n",
" <td>6.0</td>\n",
" <td>False</td>\n",
" <td>302</td>\n",
" <td>G. DE VILLIERS D. VON ZITZEWITZ TOYOTA GAZOO R...</td>\n",
" <td>TOYOTA</td>\n",
" <td>00_dss</td>\n",
" <td>08:00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5.0</td>\n",
" <td>2.0</td>\n",
" <td>False</td>\n",
" <td>307</td>\n",
" <td>N. ROMA A. HARO BRAVO X-RAID TEAM</td>\n",
" <td>MINI</td>\n",
" <td>00_dss</td>\n",
" <td>07:36:00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Pos Road Position Refuel Bib \\\n",
"0 1.0 11.0 False 306 \n",
"1 2.0 4.0 False 301 \n",
"2 3.0 7.0 False 303 \n",
"3 4.0 6.0 False 302 \n",
"4 5.0 2.0 False 307 \n",
"\n",
" Crew Brand Section Time \n",
"0 S. LOEB D. ELENA PH-SPORT PEUGEOT 00_dss 10:09:30 \n",
"1 N. AL-ATTIYAH M. BAUMEL TOYOTA GAZOO RACING SA TOYOTA 00_dss 07:54:00 \n",
"2 J. PRZYGONSKI T. COLSOUL ORLEN X-RAID TEAM MINI 00_dss 08:03:00 \n",
"3 G. DE VILLIERS D. VON ZITZEWITZ TOYOTA GAZOO R... TOYOTA 00_dss 08:00:00 \n",
"4 N. ROMA A. HARO BRAVO X-RAID TEAM MINI 00_dss 07:36:00 "
]
},
"execution_count": 501,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.melt(t_data.head(),\n",
" id_vars=not_timing_cols,\n",
" var_name='Section', value_name='Time').head()"
]
},
{
"cell_type": "code",
"execution_count": 502,
"metadata": {},
"outputs": [],
"source": [
"def _timing_long(df, nodss=True):\n",
" ''' Cast timing data to long data frame. '''\n",
" \n",
" df = pd.melt(df,\n",
" id_vars=[c for c in df.columns if not any(_c in c for _c in ['dss','wp','ass'])],\n",
" var_name='Section', value_name='Time')\n",
"\n",
" if nodss:\n",
" return df[df['Section']!='00_dss']\n",
" return df"
]
},
{
"cell_type": "code",
"execution_count": 503,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Pos</th>\n",
" <th>Bib</th>\n",
" <th>Road Position</th>\n",
" <th>Refuel</th>\n",
" <th>Crew</th>\n",
" <th>Brand</th>\n",
" <th>Section</th>\n",
" <th>Time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>65</th>\n",
" <td>1.0</td>\n",
" <td>306</td>\n",
" <td>11.0</td>\n",
" <td>False</td>\n",
" <td>S. LOEB D. ELENA PH-SPORT</td>\n",
" <td>PEUGEOT</td>\n",
" <td>01_wp1</td>\n",
" <td>00:31:03</td>\n",
" </tr>\n",
" <tr>\n",
" <th>66</th>\n",
" <td>2.0</td>\n",
" <td>301</td>\n",
" <td>4.0</td>\n",
" <td>False</td>\n",
" <td>N. AL-ATTIYAH M. BAUMEL TOYOTA GAZOO RACING SA</td>\n",
" <td>TOYOTA</td>\n",
" <td>01_wp1</td>\n",
" <td>00:28:49</td>\n",
" </tr>\n",
" <tr>\n",
" <th>67</th>\n",
" <td>3.0</td>\n",
" <td>303</td>\n",
" <td>7.0</td>\n",
" <td>False</td>\n",
" <td>J. PRZYGONSKI T. COLSOUL ORLEN X-RAID TEAM</td>\n",
" <td>MINI</td>\n",
" <td>01_wp1</td>\n",
" <td>00:30:27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>68</th>\n",
" <td>4.0</td>\n",
" <td>302</td>\n",
" <td>6.0</td>\n",
" <td>False</td>\n",
" <td>G. DE VILLIERS D. VON ZITZEWITZ TOYOTA GAZOO R...</td>\n",
" <td>TOYOTA</td>\n",
" <td>01_wp1</td>\n",
" <td>00:31:34</td>\n",
" </tr>\n",
" <tr>\n",
" <th>69</th>\n",
" <td>5.0</td>\n",
" <td>307</td>\n",
" <td>2.0</td>\n",
" <td>False</td>\n",
" <td>N. ROMA A. HARO BRAVO X-RAID TEAM</td>\n",
" <td>MINI</td>\n",
" <td>01_wp1</td>\n",
" <td>00:29:48</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Pos Bib Road Position Refuel \\\n",
"65 1.0 306 11.0 False \n",
"66 2.0 301 4.0 False \n",
"67 3.0 303 7.0 False \n",
"68 4.0 302 6.0 False \n",
"69 5.0 307 2.0 False \n",
"\n",
" Crew Brand Section \\\n",
"65 S. LOEB D. ELENA PH-SPORT PEUGEOT 01_wp1 \n",
"66 N. AL-ATTIYAH M. BAUMEL TOYOTA GAZOO RACING SA TOYOTA 01_wp1 \n",
"67 J. PRZYGONSKI T. COLSOUL ORLEN X-RAID TEAM MINI 01_wp1 \n",
"68 G. DE VILLIERS D. VON ZITZEWITZ TOYOTA GAZOO R... TOYOTA 01_wp1 \n",
"69 N. ROMA A. HARO BRAVO X-RAID TEAM MINI 01_wp1 \n",
"\n",
" Time \n",
"65 00:31:03 \n",
"66 00:28:49 \n",
"67 00:30:27 \n",
"68 00:31:34 \n",
"69 00:29:48 "
]
},
"execution_count": 503,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"t_data_long = _timing_long(t_data)\n",
"t_data_long.head()"
]
},
{
"cell_type": "code",
"execution_count": 504,
"metadata": {},
"outputs": [],
"source": [
"def get_long_annotated_timing_data(stage,vtype='car', timerank='time'):\n",
" ''' Get annotated timing dataframe and convert it to long format. '''\n",
" _tmp = get_annotated_timing_data(stage,vtype, timerank)\n",
" _tmp[TIMING] = _timing_long(_tmp[TIMING])\n",
" \n",
" #Find the total seconds for each split / waypoint duration\n",
" _tmp[TIMING]['TimeInS'] = _tmp[TIMING]['Time'].dt.total_seconds()\n",