@franloza
Created December 10, 2018 14:49
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Preprocessing dataset: ATP Tennis Rankings, Results, and Stats\n",
"## Source: https://github.com/JeffSackmann/tennis_atp\n",
"### Predictive modeling. Master in Big Data Analysis. 2018/2019\n",
"### Authors: Francisco J. Lozano, Antonio Miranda, Diego Suárez"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tourney_id 2016-M020\n",
"tourney_name Brisbane\n",
"surface Hard\n",
"draw_size 32\n",
"tourney_level A\n",
"tourney_date 2.01601e+07\n",
"match_num 300\n",
"winner_id 105683\n",
"winner_seed 4\n",
"winner_entry NaN\n",
"winner_name Milos Raonic\n",
"winner_hand R\n",
"winner_ht 196\n",
"winner_ioc CAN\n",
"winner_age 25.0212\n",
"winner_rank 14\n",
"winner_rank_points 2170\n",
"loser_id 103819\n",
"loser_seed 1\n",
"loser_entry NaN\n",
"loser_name Roger Federer\n",
"loser_hand R\n",
"loser_ht 185\n",
"loser_ioc SUI\n",
"loser_age 34.4066\n",
"loser_rank 3\n",
"loser_rank_points 8265\n",
"score 6-4 6-4\n",
"best_of 3\n",
"round F\n",
"minutes 87\n",
"w_ace 6\n",
"w_df 6\n",
"w_svpt 60\n",
"w_1stIn 34\n",
"w_1stWon 28\n",
"w_2ndWon 14\n",
"w_SvGms 10\n",
"w_bpSaved 1\n",
"w_bpFaced 1\n",
"l_ace 7\n",
"l_df 3\n",
"l_svpt 61\n",
"l_1stIn 34\n",
"l_1stWon 25\n",
"l_2ndWon 14\n",
"l_SvGms 10\n",
"l_bpSaved 3\n",
"l_bpFaced 5\n",
"Name: 0, dtype: object"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import warnings\n",
"from tqdm import tqdm\n",
"warnings.filterwarnings('ignore')\n",
"data = pd.concat([pd.read_csv(\"data/atp_matches_2016.csv\"),pd.read_csv(\"data/atp_matches_2017.csv\")])\n",
"data.iloc[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- tourney_id. A character id that uniquely identifies each tournament\n",
"- tourney_name. A character tournament name\n",
"- surface. A character description of the court surface (Carpet, Clay, Grass, or Hard)\n",
"- draw_size. A numeric value indicating the draw size\n",
"- tourney_level. A character description of the tournament level (A, C, D, F, G, M)\n",
"- match_num. A numeric indicating the order of matches\n",
"- winner_id. A numeric id identifying the player who won the match\n",
"- winner_seed. A numeric value for the winner's seeding\n",
"- winner_entry. A character value indicating the winner's entry type (WC = Wild card, Q = Qualifier, LL = Lucky loser, or PR = Protected ranking)\n",
"- winner_name. A character of the winner's name\n",
"- winner_hand. A character value indicated the handedness of the winner\n",
"- winner_ht. A numeric value of the winner's height in cm\n",
"- winner_ioc. A character of the winner's country of origin\n",
"- winner_age. A numeric of the winner's age at the time of the match\n",
"- winner_rank. A numeric of the winner's rank at the time of the match\n",
"- winner_rank_points. A numeric of the winner's 52-week ranking points at the time of the match\n",
"- loser_id. A numeric id identifying the player who won the match\n",
"- loser_seed. A numeric value for the loser's seeding\n",
"- loser_entry. A character value indicating the loser's entry type (WC = Wild card, Q = Qualifier, LL = Lucky loser, or PR = Protected ranking)\n",
"- loser_name. A character of the loser's name\n",
"- loser_hand. A character value indicated the handedness of the loser\n",
"- loser_ht. A numeric value of the loser's height in cm\n",
"- loser_ioc. A character of the loser's country of origin\n",
"- loser_age. A numeric of the loser's age at the time of the match\n",
"- loser_rank. A numeric of the loser's rank at the time of the match\n",
"- loser_rank_points. A numeric of the loser's 52-week ranking points at the time of the match\n",
"- score. A character of the match score\n",
"- best_of. A numeric value indicating the match format (3 or 5)\n",
"- round. A character indicating the round of the match\n",
"- minutes. A numeric value for the duration of the match in minutes\n",
"- w_ace. A numeric value for the winner's number of aces\n",
"- w_df. A numeric value for the winner's number of double faults\n",
"- w_svpt. A numeric value for the winner's number of service points\n",
"- w_1stIn. A numeric value for the winner's number of first serves in\n",
"- w_1stWon. A numeric value for the winner's number of first service points won\n",
"- w_2ndWon. A numeric value for the winner's number of second service points won\n",
"- w_SvGms. A numeric value for the winner's number of service games\n",
"- w_bpSaved. A numeric value for the winner's number of breakpoints saves\n",
"- w_bpFaced. A numeric value for the winner's number of breakpoints faced\n",
"- l_ace. A numeric value for the loser's number of aces\n",
"- l_df. A numeric value for the loser's number of double faults\n",
"- l_svpt. A numeric value for the loser's number of service points\n",
"- l_1stIn. A numeric value for the loser's number of first serves in\n",
"- l_1stWon. A numeric value for the loser's number of first service points won\n",
"- l_2ndWon. A numeric value for the loser's number of second service points won\n",
"- l_SvGms. A numeric value for the loser's number of service games\n",
"- l_bpSaved. A numeric value for the loser's number of breakpoints saves\n",
"- l_bpFaced. A numeric value for the loser's number of breakpoints faced"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Number of rows: 5890'"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"Number of rows: \" + str(data.shape[0])"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tourney_id 63\n",
"tourney_name 63\n",
"surface 63\n",
"draw_size 63\n",
"tourney_level 63\n",
"tourney_date 63\n",
"match_num 63\n",
"winner_id 63\n",
"winner_seed 3309\n",
"winner_entry 5224\n",
"winner_name 63\n",
"winner_hand 67\n",
"winner_ht 1422\n",
"winner_ioc 63\n",
"winner_age 71\n",
"winner_rank 96\n",
"winner_rank_points 96\n",
"loser_id 63\n",
"loser_seed 4476\n",
"loser_entry 4802\n",
"loser_name 63\n",
"loser_hand 80\n",
"loser_ht 1912\n",
"loser_ioc 63\n",
"loser_age 78\n",
"loser_rank 147\n",
"loser_rank_points 147\n",
"score 63\n",
"best_of 63\n",
"round 63\n",
"minutes 135\n",
"w_ace 120\n",
"w_df 120\n",
"w_svpt 120\n",
"w_1stIn 120\n",
"w_1stWon 120\n",
"w_2ndWon 120\n",
"w_SvGms 120\n",
"w_bpSaved 120\n",
"w_bpFaced 120\n",
"l_ace 120\n",
"l_df 120\n",
"l_svpt 120\n",
"l_1stIn 120\n",
"l_1stWon 120\n",
"l_2ndWon 120\n",
"l_SvGms 120\n",
"l_bpSaved 120\n",
"l_bpFaced 120\n",
"dtype: int64"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"new_data = data.drop([\"winner_seed\", \"winner_entry\", \"winner_ht\",\n",
" \"loser_seed\", \"loser_entry\", \"loser_ht\"], axis=1).dropna()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Number of rows: 5606'"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"Number of rows: \" + str(new_data.shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Firstly, we need to get rid of information of winner and losers from columns as set it as a new column (label). To do it, we are going to set as player 1, the one as higher ranking, and the second player as the one with lower ranking"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'tourney_id, tourney_name, surface, drap1_size, tourney_level, tourney_date, match_num, p1_id, p1_name, p1_hand, p1_ioc, p1_age, p1_rank, p1_rank_points, p2_id, p2_name, p2_hand, p2_ioc, p2_age, p2_rank, p2_rank_points, score, best_of, round, minutes, p1_ace, p1_df, p1_svpt, p1_1stIn, p1_1stWon, p1_2ndWon, p1_SvGms, p1_bpSaved, p1_bpFaced, p2_ace, p2_df, p2_svpt, p2_1stIn, p2_1stWon, p2_2ndWon, p2_SvGms, p2_bpSaved, p2_bpFaced, p1_win'"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_data[\"p1_win\"] = True\n",
"new_columns = [col.replace(\"winner_\",\"p1_\").replace(\"w_\",\"p1_\").replace(\"loser_\",\"p2_\").replace(\"l_\",\"p2_\")\n",
" for col in new_data.columns]\n",
"new_data.columns = new_columns\n",
"\", \". join(new_columns)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"p1_stats_columns = [\"p1_id\", \"p1_name\", \"p1_hand\", \"p1_ioc\", \"p1_age\", \"p1_rank\", \"p1_rank_points\",\n",
" \"p1_ace\", \"p1_df\", \"p1_svpt\", \"p1_1stIn\", \"p1_1stWon\", \"p1_2ndWon\", \"p1_SvGms\",\n",
" \"p1_bpSaved\", \"p1_bpFaced\"]\n",
"p2_stats_columns = [\"p2_id\", \"p2_name\", \"p2_hand\", \"p2_ioc\", \"p2_age\", \"p2_rank\", \"p2_rank_points\",\n",
" \"p2_ace\", \"p2_df\", \"p2_svpt\", \"p2_1stIn\", \"p2_1stWon\", \"p2_2ndWon\", \"p2_SvGms\",\n",
" \"p2_bpSaved\", \"p2_bpFaced\"]\n",
"\n",
"for idx, match in new_data.iterrows():\n",
" if match[\"p1_rank\"] > match[\"p2_rank\"]:\n",
" #Swap player\n",
" new_data.loc[idx, \"p1_win\"] = False\n",
" p1_stats = new_data.loc[idx, p1_stats_columns]\n",
" new_data.loc[idx, p1_stats_columns] = new_data.loc[idx, p2_stats_columns].values\n",
" new_data.loc[idx, p2_stats_columns] = p1_stats.values"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>tourney_id</th>\n",
" <th>tourney_name</th>\n",
" <th>surface</th>\n",
" <th>drap1_size</th>\n",
" <th>tourney_level</th>\n",
" <th>tourney_date</th>\n",
" <th>match_num</th>\n",
" <th>p1_id</th>\n",
" <th>p1_name</th>\n",
" <th>p1_hand</th>\n",
" <th>...</th>\n",
" <th>p2_ace</th>\n",
" <th>p2_df</th>\n",
" <th>p2_svpt</th>\n",
" <th>p2_1stIn</th>\n",
" <th>p2_1stWon</th>\n",
" <th>p2_2ndWon</th>\n",
" <th>p2_SvGms</th>\n",
" <th>p2_bpSaved</th>\n",
" <th>p2_bpFaced</th>\n",
" <th>p1_win</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2016-M020</td>\n",
" <td>Brisbane</td>\n",
" <td>Hard</td>\n",
" <td>32.0</td>\n",
" <td>A</td>\n",
" <td>20160104.0</td>\n",
" <td>300.0</td>\n",
" <td>105683.0</td>\n",
" <td>Milos Raonic</td>\n",
" <td>R</td>\n",
" <td>...</td>\n",
" <td>7.0</td>\n",
" <td>3.0</td>\n",
" <td>61.0</td>\n",
" <td>34.0</td>\n",
" <td>25.0</td>\n",
" <td>14.0</td>\n",
" <td>10.0</td>\n",
" <td>3.0</td>\n",
" <td>5.0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2017-M020</td>\n",
" <td>Brisbane</td>\n",
" <td>Hard</td>\n",
" <td>32.0</td>\n",
" <td>A</td>\n",
" <td>20170102.0</td>\n",
" <td>300.0</td>\n",
" <td>105777.0</td>\n",
" <td>Grigor Dimitrov</td>\n",
" <td>R</td>\n",
" <td>...</td>\n",
" <td>4.0</td>\n",
" <td>0.0</td>\n",
" <td>69.0</td>\n",
" <td>49.0</td>\n",
" <td>36.0</td>\n",
" <td>9.0</td>\n",
" <td>12.0</td>\n",
" <td>2.0</td>\n",
" <td>5.0</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2 rows × 44 columns</p>\n",
"</div>"
],
"text/plain": [
" tourney_id tourney_name surface drap1_size tourney_level tourney_date \\\n",
"0 2016-M020 Brisbane Hard 32.0 A 20160104.0 \n",
"0 2017-M020 Brisbane Hard 32.0 A 20170102.0 \n",
"\n",
" match_num p1_id p1_name p1_hand ... p2_ace p2_df \\\n",
"0 300.0 105683.0 Milos Raonic R ... 7.0 3.0 \n",
"0 300.0 105777.0 Grigor Dimitrov R ... 4.0 0.0 \n",
"\n",
" p2_svpt p2_1stIn p2_1stWon p2_2ndWon p2_SvGms p2_bpSaved p2_bpFaced \\\n",
"0 61.0 34.0 25.0 14.0 10.0 3.0 5.0 \n",
"0 69.0 49.0 36.0 9.0 12.0 2.0 5.0 \n",
"\n",
" p1_win \n",
"0 False \n",
"0 False \n",
"\n",
"[2 rows x 44 columns]"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_data.loc[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We add a new label for creating a regression problem: Difference in points"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"new_data.loc[:, \"diff_points\"] =\\\n",
" abs((new_data[\"p1_1stWon\"] + new_data[\"p1_2ndWon\"]) - (new_data[\"p2_1stWon\"] + new_data[\"p1_2ndWon\"]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we sort the dataset by date yo have our stats dataset ready to explore. We will get 5-matches and 20-matches\n",
"rolling statistics for each player to construct the final dataset"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"new_data = new_data.sort_values(by=[\"tourney_date\", \"match_num\"], ascending=True).reset_index(drop=True)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>tourney_name</th>\n",
" <th>tourney_date</th>\n",
" <th>match_num</th>\n",
" <th>p1_name</th>\n",
" <th>p2_name</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Doha</td>\n",
" <td>20160104.0</td>\n",
" <td>270.0</td>\n",
" <td>Rafael Nadal</td>\n",
" <td>Pablo Carreno Busta</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Brisbane</td>\n",
" <td>20160104.0</td>\n",
" <td>271.0</td>\n",
" <td>Denis Istomin</td>\n",
" <td>Mikhail Kukushkin</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Chennai</td>\n",
" <td>20160104.0</td>\n",
" <td>271.0</td>\n",
" <td>Ramkumar Ramanathan</td>\n",
" <td>Daniel Gimeno Traver</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Doha</td>\n",
" <td>20160104.0</td>\n",
" <td>271.0</td>\n",
" <td>Aslan Karatsev</td>\n",
" <td>Robin Haase</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Brisbane</td>\n",
" <td>20160104.0</td>\n",
" <td>272.0</td>\n",
" <td>Dusan Lajovic</td>\n",
" <td>Radek Stepanek</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" tourney_name tourney_date match_num p1_name \\\n",
"0 Doha 20160104.0 270.0 Rafael Nadal \n",
"1 Brisbane 20160104.0 271.0 Denis Istomin \n",
"2 Chennai 20160104.0 271.0 Ramkumar Ramanathan \n",
"3 Doha 20160104.0 271.0 Aslan Karatsev \n",
"4 Brisbane 20160104.0 272.0 Dusan Lajovic \n",
"\n",
" p2_name \n",
"0 Pablo Carreno Busta \n",
"1 Mikhail Kukushkin \n",
"2 Daniel Gimeno Traver \n",
"3 Robin Haase \n",
"4 Radek Stepanek "
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_data[[\"tourney_name\", \"tourney_date\", \"match_num\", \"p1_name\", \"p2_name\"]].head(5)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'tourney_id, tourney_name, surface, drap1_size, tourney_level, tourney_date, match_num, p1_id, p1_name, p1_hand, p1_ioc, p1_age, p1_rank, p1_rank_points, p2_id, p2_name, p2_hand, p2_ioc, p2_age, p2_rank, p2_rank_points, score, best_of, round, minutes, p1_ace, p1_df, p1_svpt, p1_1stIn, p1_1stWon, p1_2ndWon, p1_SvGms, p1_bpSaved, p1_bpFaced, p2_ace, p2_df, p2_svpt, p2_1stIn, p2_1stWon, p2_2ndWon, p2_SvGms, p2_bpSaved, p2_bpFaced, p1_win, diff_points'"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\", \".join(new_data.columns)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"# A priori stats\n",
"a_priori_columns = ['tourney_id', 'tourney_name', 'surface', 'drap1_size', 'tourney_level', 'tourney_date', 'match_num',\n",
" 'p1_id', 'p1_name', 'p1_hand', 'p1_ioc', 'p1_age', 'p1_rank', 'p1_rank_points',\n",
" 'p2_id', 'p2_name', 'p2_hand', 'p2_ioc', 'p2_age', 'p2_rank', 'p2_rank_points']\n",
"final_dataset = new_data[a_priori_columns].copy()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"def get_players_stats(new_data, match_id, window_sizes):\n",
" stats = {}\n",
" match = new_data.loc[match_id]\n",
" \n",
" # Get windows for player 1 and player 2 (unbounded yet)\n",
" mask_p1 = (new_data.index < match_id) & ((new_data.p1_id == match.p1_id) | (new_data.p2_id == match.p1_id))\n",
" p1_window = new_data[mask_p1]\n",
" p1_window.loc[:, \"p2_win\"] = ~p1_window.loc[:, \"p1_win\"].values\n",
" mask_p2 = (new_data.index < match_id) & ((new_data.p1_id == match.p2_id) | (new_data.p2_id == match.p2_id))\n",
" p2_window = new_data[mask_p2]\n",
" p2_window.loc[:, \"p2_win\"] = ~p2_window.loc[:, \"p1_win\"].values\n",
" \n",
" # Set stats for windows\n",
" stats_columns = [\"id\", \"name\", \"hand\", \"ioc\", \"age\", \"rank\", \"rank_points\",\n",
" \"ace\", \"df\", \"svpt\", \"1stIn\", \"1stWon\", \"2ndWon\", \"SvGms\",\n",
" \"bpSaved\", \"bpFaced\",\"win\"]\n",
" p1_window_stats = pd.DataFrame(index=p1_window.index, columns=stats_columns)\n",
" p1_window_stats.loc[p1_window[match.p1_id == p1_window.p1_id].index,:]= \\\n",
" p1_window.loc[match.p1_id == p1_window.p1_id, map(lambda x: \"p1_\"+x, stats_columns)].values\n",
" p1_window_stats.loc[p1_window[match.p1_id == p1_window.p2_id].index,:]= \\\n",
" p1_window.loc[match.p1_id == p1_window.p2_id, map(lambda x: \"p2_\"+x, stats_columns)].values\n",
" \n",
" p2_window_stats = pd.DataFrame(index=p2_window.index, columns=stats_columns)\n",
" p2_window_stats.loc[p2_window[match.p2_id == p2_window.p1_id].index,:]= \\\n",
" p2_window.loc[match.p2_id == p2_window.p1_id, map(lambda x: \"p1_\"+x, stats_columns)].values\n",
" p2_window_stats.loc[p2_window[match.p2_id == p2_window.p2_id].index,:]= \\\n",
" p2_window.loc[match.p2_id == p2_window.p2_id, map(lambda x: \"p2_\"+x, stats_columns)].values\n",
" \n",
" for window_size in window_sizes:\n",
" # Stats for player 1\n",
" p1_last_matches = p1_window_stats.tail(window_size)\n",
" if p1_last_matches.empty:\n",
" stats[\"p1_win_prob_{}w\".format(window_size)] = np.nan\n",
" stats[\"p1_ace_prob_{}w\".format(window_size)] = np.nan\n",
" stats[\"p1_df_prob_{}w\".format(window_size)] = np.nan\n",
" stats[\"p1_svptWon_prob_{}w\".format(window_size)] = np.nan\n",
" #stats[\"p1_bpSaved_prob_{}w\".format(window_size)] = np.nan\n",
" else:\n",
" # Get Percetage of matches won in last windows_size matches\n",
" stats[\"p1_win_prob_{}w\".format(window_size)] = p1_last_matches.win.sum() / p1_last_matches.shape[0]\n",
" # Get percentage of aces/point served in last windows_size matches (aces / svpt)\n",
" stats[\"p1_ace_prob_{}w\".format(window_size)] = p1_last_matches.ace.sum() / p1_last_matches.svpt.sum()\n",
" # Get percentage of double faults/point served in last windows size matches (df / svpt)\n",
" stats[\"p1_df_prob_{}w\".format(window_size)] = p1_last_matches.df.sum() / p1_last_matches.svpt.sum()\n",
" # Get percentage of points won/point served in last windows size matches ((1stWon + 2ndWon) / svpt)\n",
" stats[\"p1_svptWon_prob_{}w\".format(window_size)] = \\\n",
" (p1_last_matches[\"1stWon\"] + p1_last_matches[\"2ndWon\"]).sum() / p1_last_matches.svpt.sum()\n",
" # Get percentage of breakpoint saved (bpSaved / bpFaced)\n",
" #stats[\"p1_bpSaved_prob_{}w\".format(window_size)] = p1_last_matches.bpSaved.sum() / p1_last_matches.bpFaced.sum()\n",
"\n",
" # Stats for player 2\n",
" p2_last_matches = p2_window_stats.tail(window_size)\n",
" if p2_last_matches.empty:\n",
" stats[\"p2_win_prob_{}w\".format(window_size)] = np.nan\n",
" stats[\"p2_ace_prob_{}w\".format(window_size)] = np.nan\n",
" stats[\"p2_df_prob_{}w\".format(window_size)] = np.nan\n",
" stats[\"p2_svptWon_prob_{}w\".format(window_size)] = np.nan\n",
" #stats[\"p2_bpSaved_prob_{}w\".format(window_size)] = np.nan\n",
" else:\n",
" # Get Percetage of matches won in last windows_size matches\n",
" stats[\"p2_win_prob_{}w\".format(window_size)] = p2_last_matches.win.sum() / p2_last_matches.shape[0]\n",
" # Get percentage of aces/point served in last windows_size matches (aces / svpt)\n",
" stats[\"p2_ace_prob_{}w\".format(window_size)] = p2_last_matches.ace.sum() / p2_last_matches.svpt.sum()\n",
" # Get percentage of double faults/point served in last windows size matches (df / svpt)\n",
" stats[\"p2_df_prob_{}w\".format(window_size)] = p2_last_matches.df.sum() / p2_last_matches.svpt.sum()\n",
" # Get percentage of points won/point served in last windows size matches ((1stWon + 2ndWon) / svpt)\n",
" stats[\"p2_svptWon_prob_{}w\".format(window_size)] = \\\n",
" (p2_last_matches[\"1stWon\"] + p2_last_matches[\"2ndWon\"]).sum() / p2_last_matches.svpt.sum()\n",
" # Get percentage of breakpoint saved (bpSaved / bpFaced)\n",
" #stats[\"p2_bpSaved_prob_{}w\".format(window_size)] = p2_last_matches.bpSaved.sum() / p2_last_matches.bpFaced.sum()\n",
"\n",
" # Get Percentage of matches won in surface in last windows_size matches (played in that surface)\n",
" p1_surface_matches = p1_window_stats.loc[new_data.loc[p1_window_stats.index, \"surface\"] == match.surface].tail(window_size)\n",
" if p1_surface_matches.empty:\n",
" stats[\"p1_surface_win_prob_{}w\".format(window_size)] = pd.np.nan\n",
" else:\n",
" stats[\"p1_surface_win_prob_{}w\".format(window_size)] = \\\n",
" p1_surface_matches.win.sum() / p1_surface_matches.shape[0]\n",
" p2_surface_matches = p2_window_stats.loc[new_data.loc[p2_window_stats.index, \"surface\"] == match.surface].tail(window_size)\n",
" if p2_surface_matches.empty:\n",
" stats[\"p2_surface_win_prob_{}w\".format(window_size)] = pd.np.nan\n",
" else:\n",
" stats[\"p2_surface_win_prob_{}w\".format(window_size)] = \\\n",
" p2_surface_matches.win.sum() / p2_surface_matches.shape[0]\n",
"\n",
" return pd.Series(stats)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"p1_win_prob_20w 0.850000\n",
"p1_ace_prob_20w 0.126904\n",
"p1_df_prob_20w 0.038434\n",
"p1_svptWon_prob_20w 0.689630\n",
"p2_win_prob_20w 0.700000\n",
"p2_ace_prob_20w 0.066336\n",
"p2_df_prob_20w 0.024179\n",
"p2_svptWon_prob_20w 0.642901\n",
"p1_surface_win_prob_20w 0.850000\n",
"p2_surface_win_prob_20w 0.700000\n",
"dtype: float64"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"match_id = 2760-1\n",
"get_players_stats(new_data, match_id, [20])"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"5606it [3:16:46, 1.48it/s] \n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>p1_win_prob_20w</th>\n",
" <th>p1_ace_prob_20w</th>\n",
" <th>p1_df_prob_20w</th>\n",
" <th>p1_svptWon_prob_20w</th>\n",
" <th>p2_win_prob_20w</th>\n",
" <th>p2_ace_prob_20w</th>\n",
" <th>p2_df_prob_20w</th>\n",
" <th>p2_svptWon_prob_20w</th>\n",
" <th>p1_surface_win_prob_20w</th>\n",
" <th>p2_surface_win_prob_20w</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5576</th>\n",
" <td>0.500000</td>\n",
" <td>0.123810</td>\n",
" <td>0.041905</td>\n",
" <td>0.654603</td>\n",
" <td>0.800000</td>\n",
" <td>0.106942</td>\n",
" <td>0.052533</td>\n",
" <td>0.653533</td>\n",
" <td>0.450000</td>\n",
" <td>0.80</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5577</th>\n",
" <td>0.700000</td>\n",
" <td>0.032805</td>\n",
" <td>0.016402</td>\n",
" <td>0.610169</td>\n",
" <td>0.450000</td>\n",
" <td>0.082393</td>\n",
" <td>0.030474</td>\n",
" <td>0.645598</td>\n",
" <td>0.700000</td>\n",
" <td>0.55</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5578</th>\n",
" <td>0.125000</td>\n",
" <td>0.082852</td>\n",
" <td>0.030829</td>\n",
" <td>0.595376</td>\n",
" <td>0.750000</td>\n",
" <td>0.168580</td>\n",
" <td>0.045921</td>\n",
" <td>0.672508</td>\n",
" <td>0.200000</td>\n",
" <td>0.75</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5579</th>\n",
" <td>0.550000</td>\n",
" <td>0.061159</td>\n",
" <td>0.025751</td>\n",
" <td>0.634657</td>\n",
" <td>0.550000</td>\n",
" <td>0.052599</td>\n",
" <td>0.048215</td>\n",
" <td>0.631183</td>\n",
" <td>0.550000</td>\n",
" <td>0.40</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5580</th>\n",
" <td>0.350000</td>\n",
" <td>0.050960</td>\n",
" <td>0.050960</td>\n",
" <td>0.614163</td>\n",
" <td>0.900000</td>\n",
" <td>0.060876</td>\n",
" <td>0.024624</td>\n",
" <td>0.715458</td>\n",
" <td>0.300000</td>\n",
" <td>0.90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5581</th>\n",
" <td>0.600000</td>\n",
" <td>0.106911</td>\n",
" <td>0.050756</td>\n",
" <td>0.667387</td>\n",
" <td>0.700000</td>\n",
" <td>0.089120</td>\n",
" <td>0.053241</td>\n",
" <td>0.646412</td>\n",
" <td>0.600000</td>\n",
" <td>0.65</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5582</th>\n",
" <td>0.450000</td>\n",
" <td>0.064821</td>\n",
" <td>0.041536</td>\n",
" <td>0.660793</td>\n",
" <td>0.600000</td>\n",
" <td>0.079812</td>\n",
" <td>0.039645</td>\n",
" <td>0.661450</td>\n",
" <td>0.450000</td>\n",
" <td>0.55</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5583</th>\n",
" <td>0.850000</td>\n",
" <td>0.151185</td>\n",
" <td>0.039718</td>\n",
" <td>0.678411</td>\n",
" <td>0.750000</td>\n",
" <td>0.038462</td>\n",
" <td>0.031805</td>\n",
" <td>0.661243</td>\n",
" <td>0.750000</td>\n",
" <td>0.75</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5584</th>\n",
" <td>0.450000</td>\n",
" <td>0.092008</td>\n",
" <td>0.056414</td>\n",
" <td>0.642713</td>\n",
" <td>0.900000</td>\n",
" <td>0.095176</td>\n",
" <td>0.029987</td>\n",
" <td>0.679270</td>\n",
" <td>0.450000</td>\n",
" <td>0.85</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5585</th>\n",
" <td>0.700000</td>\n",
" <td>0.238825</td>\n",
" <td>0.033206</td>\n",
" <td>0.712005</td>\n",
" <td>0.800000</td>\n",
" <td>0.116556</td>\n",
" <td>0.050331</td>\n",
" <td>0.691391</td>\n",
" <td>0.700000</td>\n",
" <td>0.80</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5586</th>\n",
" <td>0.750000</td>\n",
" <td>0.106184</td>\n",
" <td>0.021237</td>\n",
" <td>0.675203</td>\n",
" <td>0.450000</td>\n",
" <td>0.121136</td>\n",
" <td>0.044164</td>\n",
" <td>0.656782</td>\n",
" <td>0.750000</td>\n",
" <td>0.40</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5587</th>\n",
" <td>0.111111</td>\n",
" <td>0.077586</td>\n",
" <td>0.029310</td>\n",
" <td>0.603448</td>\n",
" <td>0.450000</td>\n",
" <td>0.081574</td>\n",
" <td>0.031945</td>\n",
" <td>0.641187</td>\n",
" <td>0.166667</td>\n",
" <td>0.55</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5588</th>\n",
" <td>0.900000</td>\n",
" <td>0.057026</td>\n",
" <td>0.024440</td>\n",
" <td>0.709437</td>\n",
" <td>0.550000</td>\n",
" <td>0.052300</td>\n",
" <td>0.047889</td>\n",
" <td>0.632640</td>\n",
" <td>0.900000</td>\n",
" <td>0.40</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5589</th>\n",
" <td>0.450000</td>\n",
" <td>0.067288</td>\n",
" <td>0.039973</td>\n",
" <td>0.661559</td>\n",
" <td>0.700000</td>\n",
" <td>0.088585</td>\n",
" <td>0.055291</td>\n",
" <td>0.639715</td>\n",
" <td>0.400000</td>\n",
" <td>0.65</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5590</th>\n",
" <td>0.850000</td>\n",
" <td>0.144371</td>\n",
" <td>0.043046</td>\n",
" <td>0.665563</td>\n",
" <td>0.450000</td>\n",
" <td>0.097120</td>\n",
" <td>0.056932</td>\n",
" <td>0.649699</td>\n",
" <td>0.750000</td>\n",
" <td>0.40</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5591</th>\n",
" <td>0.650000</td>\n",
" <td>0.232242</td>\n",
" <td>0.033972</td>\n",
" <td>0.708462</td>\n",
" <td>0.750000</td>\n",
" <td>0.108073</td>\n",
" <td>0.020182</td>\n",
" <td>0.679688</td>\n",
" <td>0.650000</td>\n",
" <td>0.75</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5592</th>\n",
" <td>0.700000</td>\n",
" <td>0.085645</td>\n",
" <td>0.056092</td>\n",
" <td>0.636912</td>\n",
" <td>0.500000</td>\n",
" <td>0.099791</td>\n",
" <td>0.055129</td>\n",
" <td>0.640614</td>\n",
" <td>0.700000</td>\n",
" <td>0.40</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5593</th>\n",
" <td>0.200000</td>\n",
" <td>0.080916</td>\n",
" <td>0.027481</td>\n",
" <td>0.609160</td>\n",
" <td>0.700000</td>\n",
" <td>0.225707</td>\n",
" <td>0.033825</td>\n",
" <td>0.716482</td>\n",
" <td>0.285714</td>\n",
" <td>0.70</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5594</th>\n",
" <td>0.700000</td>\n",
" <td>0.086562</td>\n",
" <td>0.054479</td>\n",
" <td>0.638015</td>\n",
" <td>0.181818</td>\n",
" <td>0.084306</td>\n",
" <td>0.025940</td>\n",
" <td>0.626459</td>\n",
" <td>0.700000</td>\n",
" <td>0.25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5595</th>\n",
" <td>0.800000</td>\n",
" <td>0.104332</td>\n",
" <td>0.051861</td>\n",
" <td>0.643075</td>\n",
" <td>0.750000</td>\n",
" <td>0.089382</td>\n",
" <td>0.054589</td>\n",
" <td>0.646071</td>\n",
" <td>0.800000</td>\n",
" <td>0.75</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5596</th>\n",
" <td>0.850000</td>\n",
" <td>0.147901</td>\n",
" <td>0.041972</td>\n",
" <td>0.668221</td>\n",
" <td>0.900000</td>\n",
" <td>0.125860</td>\n",
" <td>0.027510</td>\n",
" <td>0.712517</td>\n",
" <td>0.750000</td>\n",
" <td>0.90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5597</th>\n",
" <td>0.600000</td>\n",
" <td>0.077766</td>\n",
" <td>0.039144</td>\n",
" <td>0.653967</td>\n",
" <td>0.900000</td>\n",
" <td>0.088599</td>\n",
" <td>0.032967</td>\n",
" <td>0.670330</td>\n",
" <td>0.550000</td>\n",
" <td>0.90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5598</th>\n",
" <td>0.700000</td>\n",
" <td>0.031008</td>\n",
" <td>0.019678</td>\n",
" <td>0.606440</td>\n",
" <td>0.850000</td>\n",
" <td>0.117147</td>\n",
" <td>0.050393</td>\n",
" <td>0.700262</td>\n",
" <td>0.700000</td>\n",
" <td>0.85</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5599</th>\n",
" <td>0.900000</td>\n",
" <td>0.129913</td>\n",
" <td>0.027315</td>\n",
" <td>0.712858</td>\n",
" <td>0.900000</td>\n",
" <td>0.082517</td>\n",
" <td>0.035664</td>\n",
" <td>0.669930</td>\n",
" <td>0.900000</td>\n",
" <td>0.90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5600</th>\n",
" <td>0.850000</td>\n",
" <td>0.114381</td>\n",
" <td>0.051505</td>\n",
" <td>0.698997</td>\n",
" <td>0.750000</td>\n",
" <td>0.088411</td>\n",
" <td>0.054958</td>\n",
" <td>0.637993</td>\n",
" <td>0.850000</td>\n",
" <td>0.75</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5601</th>\n",
" <td>0.900000</td>\n",
" <td>0.085592</td>\n",
" <td>0.034237</td>\n",
" <td>0.673324</td>\n",
" <td>0.850000</td>\n",
" <td>0.108911</td>\n",
" <td>0.050825</td>\n",
" <td>0.695710</td>\n",
" <td>0.900000</td>\n",
" <td>0.85</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5602</th>\n",
" <td>0.850000</td>\n",
" <td>0.085417</td>\n",
" <td>0.035417</td>\n",
" <td>0.665278</td>\n",
" <td>0.600000</td>\n",
" <td>0.106011</td>\n",
" <td>0.051366</td>\n",
" <td>0.665027</td>\n",
" <td>0.850000</td>\n",
" <td>0.60</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5603</th>\n",
" <td>0.400000</td>\n",
" <td>0.042774</td>\n",
" <td>0.020739</td>\n",
" <td>0.604666</td>\n",
" <td>0.650000</td>\n",
" <td>0.131547</td>\n",
" <td>0.033496</td>\n",
" <td>0.690621</td>\n",
" <td>0.550000</td>\n",
" <td>0.65</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5604</th>\n",
" <td>0.850000</td>\n",
" <td>0.087695</td>\n",
" <td>0.033311</td>\n",
" <td>0.669613</td>\n",
" <td>0.650000</td>\n",
" <td>0.132064</td>\n",
" <td>0.033784</td>\n",
" <td>0.691032</td>\n",
" <td>0.850000</td>\n",
" <td>0.65</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5605</th>\n",
" <td>0.550000</td>\n",
" <td>0.104794</td>\n",
" <td>0.050725</td>\n",
" <td>0.655518</td>\n",
" <td>0.350000</td>\n",
" <td>0.041746</td>\n",
" <td>0.021505</td>\n",
" <td>0.597090</td>\n",
" <td>0.550000</td>\n",
" <td>0.55</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5606 rows × 10 columns</p>\n",
"</div>"
],
"text/plain": [
" p1_win_prob_20w p1_ace_prob_20w p1_df_prob_20w p1_svptWon_prob_20w \\\n",
"0 NaN NaN NaN NaN \n",
"1 NaN NaN NaN NaN \n",
"2 NaN NaN NaN NaN \n",
"3 NaN NaN NaN NaN \n",
"4 NaN NaN NaN NaN \n",
"5 NaN NaN NaN NaN \n",
"6 NaN NaN NaN NaN \n",
"7 NaN NaN NaN NaN \n",
"8 NaN NaN NaN NaN \n",
"9 NaN NaN NaN NaN \n",
"10 NaN NaN NaN NaN \n",
"11 NaN NaN NaN NaN \n",
"12 NaN NaN NaN NaN \n",
"13 NaN NaN NaN NaN \n",
"14 NaN NaN NaN NaN \n",
"15 NaN NaN NaN NaN \n",
"16 NaN NaN NaN NaN \n",
"17 NaN NaN NaN NaN \n",
"18 NaN NaN NaN NaN \n",
"19 NaN NaN NaN NaN \n",
"20 NaN NaN NaN NaN \n",
"21 NaN NaN NaN NaN \n",
"22 NaN NaN NaN NaN \n",
"23 NaN NaN NaN NaN \n",
"24 NaN NaN NaN NaN \n",
"25 NaN NaN NaN NaN \n",
"26 NaN NaN NaN NaN \n",
"27 NaN NaN NaN NaN \n",
"28 NaN NaN NaN NaN \n",
"29 NaN NaN NaN NaN \n",
"... ... ... ... ... \n",
"5576 0.500000 0.123810 0.041905 0.654603 \n",
"5577 0.700000 0.032805 0.016402 0.610169 \n",
"5578 0.125000 0.082852 0.030829 0.595376 \n",
"5579 0.550000 0.061159 0.025751 0.634657 \n",
"5580 0.350000 0.050960 0.050960 0.614163 \n",
"5581 0.600000 0.106911 0.050756 0.667387 \n",
"5582 0.450000 0.064821 0.041536 0.660793 \n",
"5583 0.850000 0.151185 0.039718 0.678411 \n",
"5584 0.450000 0.092008 0.056414 0.642713 \n",
"5585 0.700000 0.238825 0.033206 0.712005 \n",
"5586 0.750000 0.106184 0.021237 0.675203 \n",
"5587 0.111111 0.077586 0.029310 0.603448 \n",
"5588 0.900000 0.057026 0.024440 0.709437 \n",
"5589 0.450000 0.067288 0.039973 0.661559 \n",
"5590 0.850000 0.144371 0.043046 0.665563 \n",
"5591 0.650000 0.232242 0.033972 0.708462 \n",
"5592 0.700000 0.085645 0.056092 0.636912 \n",
"5593 0.200000 0.080916 0.027481 0.609160 \n",
"5594 0.700000 0.086562 0.054479 0.638015 \n",
"5595 0.800000 0.104332 0.051861 0.643075 \n",
"5596 0.850000 0.147901 0.041972 0.668221 \n",
"5597 0.600000 0.077766 0.039144 0.653967 \n",
"5598 0.700000 0.031008 0.019678 0.606440 \n",
"5599 0.900000 0.129913 0.027315 0.712858 \n",
"5600 0.850000 0.114381 0.051505 0.698997 \n",
"5601 0.900000 0.085592 0.034237 0.673324 \n",
"5602 0.850000 0.085417 0.035417 0.665278 \n",
"5603 0.400000 0.042774 0.020739 0.604666 \n",
"5604 0.850000 0.087695 0.033311 0.669613 \n",
"5605 0.550000 0.104794 0.050725 0.655518 \n",
"\n",
" p2_win_prob_20w p2_ace_prob_20w p2_df_prob_20w p2_svptWon_prob_20w \\\n",
"0 NaN NaN NaN NaN \n",
"1 NaN NaN NaN NaN \n",
"2 NaN NaN NaN NaN \n",
"3 NaN NaN NaN NaN \n",
"4 NaN NaN NaN NaN \n",
"5 NaN NaN NaN NaN \n",
"6 NaN NaN NaN NaN \n",
"7 NaN NaN NaN NaN \n",
"8 NaN NaN NaN NaN \n",
"9 NaN NaN NaN NaN \n",
"10 NaN NaN NaN NaN \n",
"11 NaN NaN NaN NaN \n",
"12 NaN NaN NaN NaN \n",
"13 NaN NaN NaN NaN \n",
"14 NaN NaN NaN NaN \n",
"15 NaN NaN NaN NaN \n",
"16 NaN NaN NaN NaN \n",
"17 NaN NaN NaN NaN \n",
"18 NaN NaN NaN NaN \n",
"19 NaN NaN NaN NaN \n",
"20 NaN NaN NaN NaN \n",
"21 NaN NaN NaN NaN \n",
"22 NaN NaN NaN NaN \n",
"23 NaN NaN NaN NaN \n",
"24 NaN NaN NaN NaN \n",
"25 NaN NaN NaN NaN \n",
"26 NaN NaN NaN NaN \n",
"27 NaN NaN NaN NaN \n",
"28 NaN NaN NaN NaN \n",
"29 NaN NaN NaN NaN \n",
"... ... ... ... ... \n",
"5576 0.800000 0.106942 0.052533 0.653533 \n",
"5577 0.450000 0.082393 0.030474 0.645598 \n",
"5578 0.750000 0.168580 0.045921 0.672508 \n",
"5579 0.550000 0.052599 0.048215 0.631183 \n",
"5580 0.900000 0.060876 0.024624 0.715458 \n",
"5581 0.700000 0.089120 0.053241 0.646412 \n",
"5582 0.600000 0.079812 0.039645 0.661450 \n",
"5583 0.750000 0.038462 0.031805 0.661243 \n",
"5584 0.900000 0.095176 0.029987 0.679270 \n",
"5585 0.800000 0.116556 0.050331 0.691391 \n",
"5586 0.450000 0.121136 0.044164 0.656782 \n",
"5587 0.450000 0.081574 0.031945 0.641187 \n",
"5588 0.550000 0.052300 0.047889 0.632640 \n",
"5589 0.700000 0.088585 0.055291 0.639715 \n",
"5590 0.450000 0.097120 0.056932 0.649699 \n",
"5591 0.750000 0.108073 0.020182 0.679688 \n",
"5592 0.500000 0.099791 0.055129 0.640614 \n",
"5593 0.700000 0.225707 0.033825 0.716482 \n",
"5594 0.181818 0.084306 0.025940 0.626459 \n",
"5595 0.750000 0.089382 0.054589 0.646071 \n",
"5596 0.900000 0.125860 0.027510 0.712517 \n",
"5597 0.900000 0.088599 0.032967 0.670330 \n",
"5598 0.850000 0.117147 0.050393 0.700262 \n",
"5599 0.900000 0.082517 0.035664 0.669930 \n",
"5600 0.750000 0.088411 0.054958 0.637993 \n",
"5601 0.850000 0.108911 0.050825 0.695710 \n",
"5602 0.600000 0.106011 0.051366 0.665027 \n",
"5603 0.650000 0.131547 0.033496 0.690621 \n",
"5604 0.650000 0.132064 0.033784 0.691032 \n",
"5605 0.350000 0.041746 0.021505 0.597090 \n",
"\n",
" p1_surface_win_prob_20w p2_surface_win_prob_20w \n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"5 NaN NaN \n",
"6 NaN NaN \n",
"7 NaN NaN \n",
"8 NaN NaN \n",
"9 NaN NaN \n",
"10 NaN NaN \n",
"11 NaN NaN \n",
"12 NaN NaN \n",
"13 NaN NaN \n",
"14 NaN NaN \n",
"15 NaN NaN \n",
"16 NaN NaN \n",
"17 NaN NaN \n",
"18 NaN NaN \n",
"19 NaN NaN \n",
"20 NaN NaN \n",
"21 NaN NaN \n",
"22 NaN NaN \n",
"23 NaN NaN \n",
"24 NaN NaN \n",
"25 NaN NaN \n",
"26 NaN NaN \n",
"27 NaN NaN \n",
"28 NaN NaN \n",
"29 NaN NaN \n",
"... ... ... \n",
"5576 0.450000 0.80 \n",
"5577 0.700000 0.55 \n",
"5578 0.200000 0.75 \n",
"5579 0.550000 0.40 \n",
"5580 0.300000 0.90 \n",
"5581 0.600000 0.65 \n",
"5582 0.450000 0.55 \n",
"5583 0.750000 0.75 \n",
"5584 0.450000 0.85 \n",
"5585 0.700000 0.80 \n",
"5586 0.750000 0.40 \n",
"5587 0.166667 0.55 \n",
"5588 0.900000 0.40 \n",
"5589 0.400000 0.65 \n",
"5590 0.750000 0.40 \n",
"5591 0.650000 0.75 \n",
"5592 0.700000 0.40 \n",
"5593 0.285714 0.70 \n",
"5594 0.700000 0.25 \n",
"5595 0.800000 0.75 \n",
"5596 0.750000 0.90 \n",
"5597 0.550000 0.90 \n",
"5598 0.700000 0.85 \n",
"5599 0.900000 0.90 \n",
"5600 0.850000 0.75 \n",
"5601 0.900000 0.85 \n",
"5602 0.850000 0.60 \n",
"5603 0.550000 0.65 \n",
"5604 0.850000 0.65 \n",
"5605 0.550000 0.55 \n",
"\n",
"[5606 rows x 10 columns]"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
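"# Rolling-window sizes, in number of previous matches\n",
"window_sizes = [20]\n",
"new_stats = []\n",
"# Only the index label is needed, so iterate over the index instead of iterrows()\n",
"for idx in tqdm(new_data.index):\n",
"    new_stats.append(get_players_stats(new_data, idx, window_sizes))\n",
"pd.DataFrame(new_stats)"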
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
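"# Join the match columns, the rolling statistics and the two targets side by side, then export to CSV\n",
"pd.concat([final_dataset, pd.DataFrame(new_stats), new_data[[\"p1_win\", \"diff_points\"]].astype(int)], axis=1).to_csv(\"atp_matches_with_stats_2016_17.csv\", index=False)"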
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Columns description\n",
"- tourney_id. A character id that uniquely identifies each tournament\n",
"- tourney_name. A character tournament name\n",
"- surface. A character description of the court surface (Carpet, Clay, Grass, or Hard)\n",
"- draw_size. A numeric value indicating the draw size\n",
"- tourney_level. A character description of the tournament level (A, C, D, F, G, M)\n",
"- tourney_date. A numeric indicating the starting date of the tournament, in YYYYMMDD format\n",
"- match_num. A numeric indicating the order of matches\n",
"- p1_id. A numeric id identifying the player with higher ranking\n",
"- p1_name. A character value with the name of the player with higher ranking\n",
"- p1_hand. A character value indicating the handedness of the player with higher ranking\n",
"- p1_ioc. A character value with the country of origin of the player with higher ranking\n",
"- p1_age. A numeric with the age of the player with higher ranking at the time of the match\n",
"- p1_rank. A numeric with the rank of the player with higher ranking at the time of the match\n",
"- p1_rank_points. A numeric with the 52-week ranking points of the player with higher ranking at the time of the match\n",
"- p2_id. A numeric id identifying the player with lower ranking\n",
"- p2_name. A character value with the name of the player with lower ranking\n",
"- p2_hand. A character value indicating the handedness of the player with lower ranking\n",
"- p2_ioc. A character value with the country of origin of the player with lower ranking\n",
"- p2_age. A numeric with the age of the player with lower ranking at the time of the match\n",
"- p2_rank. A numeric with the rank of the player with lower ranking at the time of the match\n",
"- p2_rank_points. A numeric with the 52-week ranking points of the player with lower ranking at the time of the match\n",
"- p1_win_prob_20w. Fraction (0 to 1) of matches won by the player with higher ranking in the last 20 matches\n",
"- p1_ace_prob_20w. Fraction of service points that were aces for the player with higher ranking in the last 20 matches\n",
"- p1_df_prob_20w. Fraction of service points that were double faults for the player with higher ranking in the last 20 matches\n",
"- p1_svptWon_prob_20w. Fraction of service points won by the player with higher ranking in the last 20 matches\n",
"- p2_win_prob_20w. Fraction (0 to 1) of matches won by the player with lower ranking in the last 20 matches\n",
"- p2_ace_prob_20w. Fraction of service points that were aces for the player with lower ranking in the last 20 matches\n",
"- p2_df_prob_20w. Fraction of service points that were double faults for the player with lower ranking in the last 20 matches\n",
"- p2_svptWon_prob_20w. Fraction of service points won by the player with lower ranking in the last 20 matches\n",
"- p1_surface_win_prob_20w. Fraction of matches won by the player with higher ranking in the last 20 matches played on the same surface\n",
"- p2_surface_win_prob_20w. Fraction of matches won by the player with lower ranking in the last 20 matches played on the same surface\n",
"- p1_win. 1 if the player with higher ranking won the match, 0 otherwise\n",
"- diff_points. Difference in the number of service points won by the two players"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}