Skip to content

Instantly share code, notes, and snippets.

@czsheets
Last active October 9, 2018 12:58
Show Gist options
  • Save czsheets/6a1885fc2dc3596df1405a9af845a562 to your computer and use it in GitHub Desktop.
Save czsheets/6a1885fc2dc3596df1405a9af845a562 to your computer and use it in GitHub Desktop.
How many playoff games can you win without winning the World Series (and vice versa)?

From the Riddler Express:

The Major League Baseball playoffs are about to begin. Based on the current playoff format, what is the best possible winning percentage a team can have in the playoffs without winning the World Series? And what is the worst possible winning percentage a team can have in the playoffs and still win the World Series?

The current format is:

  • one play-in game for two non-division-winning wild card teams
  • best 3-out-of 5 division series
  • best 4-out-of-7 league championship series
  • best 4-out-of-7 World Series

The minimum number of wins (if you win your division and don't need the play-in game) is 11, so if you end up playing the maximum in each series, the minimum winning percentage is:

(min_possible = (3 + 4 + 4)/(5 + 7 + 7))
## [1] 0.5789474

Conversely, the maximum winning percentage without winning the final best of seven World Series is

(max_possible = (1 + 3 + 4 + 3)/(1 + 3 + 4 + 7))
## [1] 0.7333333

The bonus question asks how close any actual teams have come. For that we'll use SQL through the SQLDF packages, baseball data from the Lahman database, and dplyr for some quick calculations.

library(Lahman)
library(sqldf)
library(dplyr)

Unfortunately my Lahman data only goes to 2016:

tail(SeriesPost[,-8])
##      yearID round teamIDwinner lgIDwinner teamIDloser lgIDloser wins ties
##  311   2016 ALDS2          CLE         AL         BOS        AL    3    0
##  312   2016  NLWC          SFN         NL         NYN        NL    1    0
##  313   2016  NLCS          CHN         NL         LAN        NL    4    0
##  314   2016 NLDS1          CHN         NL         SFN        NL    3    0
##  315   2016 NLDS2          LAN         NL         WAS        NL    3    0
##  316   2016    WS          CHN         NL         CLE        AL    4    0

So we'll use 2017 from Baseball Reference

dat_2017 <- data.frame(yearID = rep(2017, 9), 
                       round = c('WS','ALCS', 'NLCS', 'ALDS1','ALDS2','NLDS1','NLDS2','ALWC','NLWC'),
                       teamIDwinner = c('HOU','HOU','LAN','NYA','HOU','LAN','CHN','NYA','ARI'),
                       lgIDwinner = c('AL','AL','NL','AL','AL','NL','NL','AL','NL'),
                       teamIDloser = c('LAN','NYA','CHN','CLE','BOS','ARI','WAS','MIN','COL'),
                       lgIDloser = c('NL','AL','NL','AL','AL','NL','NL','AL','NL'),
                       wins =   c(4,4,4,3,3,3,3,1,1),
                       losses = c(3,3,1,2,1,0,2,0,0),
                       ties = 0)

The data is in 'wide' format, so we'll make it long and bring in the 2017 results. I also add a marker for the World Series winner for each year.

playoffs <- sqldf("
            select yearID, round, teamIDwinner as team, 
            cast(wins as float) wins, 
            cast(losses as float) losses,
            case when round = 'WS' then 1 else 0 end WS_win
            from SeriesPost
            where yearID >= 1903
         union all
            select yearID, round, teamIDloser as team, 
            cast(losses as float) wins, cast(wins as float) losses,  0 as WS_win
            from SeriesPost
            where yearID >= 1903
         union all
            select yearID, round, teamIDwinner as team, 
            cast(wins as float) wins, 
            cast(losses as float) losses,
            case when round = 'WS' then 1 else 0 end WS_win
            from dat_2017
         union all
            select yearID, round, teamIDloser as team, 
            cast(losses as float) wins, cast(wins as float) losses,  0 as WS_win
            from dat_2017
            order by yearID desc, WS_win desc, wins desc") 

head(playoffs)
##    yearID round team wins losses WS_win
##  1   2017    WS  HOU    4      3      1
##  2   2017  ALCS  HOU    4      3      0
##  3   2017  NLCS  LAN    4      1      0
##  4   2017 ALDS1  NYA    3      2      0
##  5   2017 ALDS2  HOU    3      1      0
##  6   2017 NLDS1  LAN    3      0      0

With that we can calculate the total winning percentage for the post-season for each team that year. I also included the team wins and losses for the season (from the Teams table) for later comparison, which again required entering the 2017 values manually.

win_pct <- sqldf("
           select a.yearID, a.team, sum(a.wins) wins, sum(a.losses) losses, 
              round(sum(a.wins)/cast((sum(a.wins)+sum(a.losses)) as float),3) win_pct, 
              b.WS_win, c.seas_wins, c.seas_losses
           from playoffs a
           inner join (select yearID, team, max(WS_win) WS_win 
           from playoffs
           group by yearID, team) b
           on a.yearID = b.yearID
              and a.team = b.team
           left join (select yearID, teamID, lgID, 
                      cast(W as float) seas_wins, cast(L as float) as seas_losses from Teams) c
           on a.yearID = c.yearID
              and a.team = c.teamID
           group by a.yearID, a.team
           order by a.yearID desc")

win_pct$seas_wins[1:10] <- c(93,93,92,102,87,101,104,85,91,97)
win_pct$seas_losses[1:10] <- 162 - win_pct$seas_wins[1:10]

win_pct <- win_pct %>% mutate(seas_w_pct = round(seas_wins/(seas_wins + seas_losses),3))

head(win_pct)
##   yearID team wins losses win_pct WS_win seas_wins seas_losses seas_w_pct
## 1   2017  ARI    1      3   0.250      0        93          69      0.574
## 2   2017  BOS    1      3   0.250      0        93          69      0.574
## 3   2017  CHN    4      6   0.400      0        92          70      0.568
## 4   2017  CLE    2      3   0.400      0       102          60      0.630
## 5   2017  COL    0      1   0.000      0        87          75      0.537
## 6   2017  HOU   11      7   0.611      1       101          61      0.623

If we look at all time playoffs (including every year since 1903, where the only playoffs were the World Series until 1969), three things stand out:

  • one team alone has the highest winning percentage without winning the World Series (2014 Royals)
  • that team had a lower in-season winning percentage than any of the other teams on the list
  • the minimum winning percentage is shared by all teams that won a full (seven game) World Series
sqldf("select *
      from win_pct
      where (win_pct = (select max(win_pct) from win_pct where WS_win= 0)
        and WS_win = 0)
      or (win_pct = (select min(win_pct) from win_pct where WS_win= 1)
        and WS_win = 1)
      order by WS_win, seas_w_pct desc")
##    yearID team wins losses win_pct WS_win seas_wins seas_losses seas_w_pct
## 1    2014  KCA   11      4   0.733      0        89          73      0.549
## 2    1909  PIT    4      3   0.571      1       110          42      0.724
## 3    1912  BOS    4      3   0.571      1       105          47      0.691
## 4    1931  SLN    4      3   0.571      1       101          53      0.656
## 5    1940  CIN    4      3   0.571      1       100          53      0.654
## 6    1955  BRO    4      3   0.571      1        98          55      0.641
## 7    1968  DET    4      3   0.571      1       103          59      0.636
## 8    1956  NYA    4      3   0.571      1        97          57      0.630
## 9    1947  NYA    4      3   0.571      1        97          57      0.630
## 10   1946  SLN    4      3   0.571      1        98          58      0.628
## 11   1967  SLN    4      3   0.571      1       101          60      0.627
## 12   1934  SLN    4      3   0.571      1        95          58      0.621
## 13   1925  PIT    4      3   0.571      1        95          58      0.621
## 14   1960  PIT    4      3   0.571      1        95          59      0.617
## 15   1957  ML1    4      3   0.571      1        95          59      0.617
## 16   1952  NYA    4      3   0.571      1        95          59      0.617
## 17   1965  LAN    4      3   0.571      1        97          65      0.599
## 18   1958  NYA    4      3   0.571      1        92          62      0.597
## 19   1924  WS1    4      3   0.571      1        92          62      0.597
## 20   1962  NYA    4      3   0.571      1        96          66      0.593
## 21   1926  SLN    4      3   0.571      1        89          65      0.578
## 22   1945  DET    4      3   0.571      1        88          65      0.575
## 23   1964  SLN    4      3   0.571      1        93          69      0.574
## 24   1985  KCA    8      6   0.571      1        91          71      0.562

If we focus on the Royals that year, we can see that they had a remarkable playoff run. Starting with the play-in game, they won every single game until they got to the World Series. And even there they lost the 7th and final game by one run.

sqldf("select yearID, round, team, wins, losses 
      from playoffs
      where yearID = 2014 and team = 'KCA' 
      order by losses, wins ")
##    yearID round team wins losses
##  1   2014  ALWC  KCA    1      0
##  2   2014 ALDS1  KCA    3      0
##  3   2014  ALCS  KCA    4      0
##  4   2014    WS  KCA    3      4

This wasn't how they played all year: with a .55 season winning percentage and only 89 wins on the year, they clearly got hot at the right time.

sqldf("select *
      from win_pct
      where (win_pct = (select max(win_pct) from win_pct where WS_win= 0)
        and WS_win = 0)")
##    yearID team wins losses win_pct WS_win seas_wins seas_losses seas_w_pct
##  1   2014  KCA   11      4   0.733      0        89          73      0.549

Ironically, in the current playoff format, one of the two teams with the lowest winning percentage among World Series winners is the 2011 St. Louis Cardinals, the local rival for the Royals. Their seasons were remarkably similar, separated by one game in the regular season. The difference was that while the Cardinals lost 3 more games during the playoffs, they won the one extra game they needed:

sqldf("select *
      from win_pct
      where (win_pct = (select max(win_pct) from win_pct where WS_win= 0)
        and WS_win = 0)
      or (win_pct = (select min(win_pct) from win_pct where WS_win= 1 and wins > 8)
        and WS_win = 1)
      order by WS_win, seas_w_pct")
##    yearID team wins losses win_pct WS_win seas_wins seas_losses seas_w_pct
##  1   2014  KCA   11      4   0.733      0        89          73      0.549
##  2   2011  SLN   11      7   0.611      1        90          72      0.556
##  3   2017  HOU   11      7   0.611      1       101          61      0.623

The following is a plot of the outcome. The graph plots regular season winning percentage against post-season winning percentage. The features are as follows:

  • color: green = World Series winner, red = non-winner (World Series or earlier)
  • size: total number of games played
  • shading: total wins

The 2014 Royals stand out in the top left corner, as the large solid red circle. Given their playoff winning percentage, by pretty much any nearest-neighbors algorithm they'd be mistakenly classified as World Series winners.

All of the teams that won zero games line up lightly along the bottom; those who swept the World Series (with no prior games) make up the small, lightly colored top row. In the middle we can see that teams with a higher winning percentage during the year are a little more likely to win the World Series, but there's plenty of variation.

Code is on Github

ggplot image

published = True
From the [Riddler Express](https://fivethirtyeight.com/features/two-paths-diverged-in-a-city-and-i-i-took-the-block-less-traveled-by/):
*The Major League Baseball playoffs are about to begin. Based on the current playoff format, what is the best possible winning percentage a team can have in the playoffs without winning the World Series? And what is the worst possible winning percentage a team can have in the playoffs and still win the World Series?*
The current format is:
* one play-in game for two non-division-winning wild card teams
* best 3-out-of 5 division series
* best 4-out-of-7 league championship series
* best 4-out-of-7 World Series
The minimum number of wins (if you win your division and don't need the play-in game) is 11, so if you end up playing the maximum in each series, the minimum winning percentage is:
```{r}
(min_possible = (3 + 4 + 4)/(5 + 7 + 7))
```
Conversely, the maximum winning percentage without winning the final best of seven World Series is
```{r}
(max_possible = (1 + 3 + 4 + 3)/(1 + 3 + 4 + 7))
```
The bonus question asks how close any actual teams have come. For that we'll use SQL through the `SQLDF` packages, baseball data from the `Lahman` database, and `dplyr` for some quick calculations.
```{r include = FALSE}
library(Lahman)
library(sqldf)
library(dplyr)
```
Unfortunately my `Lahman` data only goes to 2016:
```{r}
tail(SeriesPost[,-8])
```
So we'll use 2017 from [Baseball Reference](https://www.baseball-reference.com/postseason/)
```{r}
dat_2017 <- data.frame(yearID = rep(2017, 9),
round = c('WS','ALCS', 'NLCS', 'ALDS1','ALDS2','NLDS1','NLDS2','ALWC','NLWC'),
teamIDwinner = c('HOU','HOU','LAN','NYA','HOU','LAN','CHN','NYA','ARI'),
lgIDwinner = c('AL','AL','NL','AL','AL','NL','NL','AL','NL'),
teamIDloser = c('LAN','NYA','CHN','CLE','BOS','ARI','WAS','MIN','COL'),
lgIDloser = c('NL','AL','NL','AL','AL','NL','NL','AL','NL'),
wins = c(4,4,4,3,3,3,3,1,1),
losses = c(3,3,1,2,1,0,2,0,0),
ties = 0)
```
The data is in 'wide' format, so we'll make it long and bring in the 2017 results:
```{r}
playoffs <- sqldf("
select yearID, round, teamIDwinner as team,
cast(wins as float) wins,
cast(losses as float) losses,
case when round = 'WS' then 1 else 0 end WS_win
from SeriesPost
where yearID >= 1903
union all
select yearID, round, teamIDloser as team,
cast(losses as float) wins, cast(wins as float) losses, 0 as WS_win
from SeriesPost
where yearID >= 1903
union all
select yearID, round, teamIDwinner as team,
cast(wins as float) wins,
cast(losses as float) losses,
case when round = 'WS' then 1 else 0 end WS_win
from dat_2017
union all
select yearID, round, teamIDloser as team,
cast(losses as float) wins, cast(wins as float) losses, 0 as WS_win
from dat_2017
order by yearID desc, WS_win desc, wins desc")
head(playoffs)
```
With that we can calculate the total winning percentage for the post-season for each team that year. I also included the team wins and losses for the season (from the `Teams` table) for later comparison, which required again entering the 2017 values manually.
```{r}
win_pct <- sqldf("
select a.yearID, a.team, sum(a.wins) wins, sum(a.losses) losses,
round(sum(a.wins)/cast((sum(a.wins)+sum(a.losses)) as float),3) win_pct,
b.WS_win, c.seas_wins, c.seas_losses
from playoffs a
inner join (select yearID, team, max(WS_win) WS_win
from playoffs
group by yearID, team) b
on a.yearID = b.yearID
and a.team = b.team
left join (select yearID, teamID, lgID,
cast(W as float) seas_wins, cast(L as float) as seas_losses from Teams) c
on a.yearID = c.yearID
and a.team = c.teamID
group by a.yearID, a.team
order by a.yearID desc")
win_pct$seas_wins[1:10] <- c(93,93,92,102,87,101,104,85,91,97)
win_pct$seas_losses[1:10] <- 162 - win_pct$seas_wins[1:10]
win_pct <- win_pct %>% mutate(seas_w_pct = round(seas_wins/(seas_wins + seas_losses),3))
head(win_pct)
```
If we look at all time playoffs (including every year since 1903, where the only playoffs were the World Series until 1969), three things stand out:
* one team alone has the highest winning percentage without winning the World Series (2014 Royals)
* that team had a lower in-season winning percentage than any of the other teams on the list
* the minimum winning percentage is shared by all teams that won a full (seven game) World Series
```{r}
sqldf("select *, seas_w_pct
from win_pct
where (win_pct = (select max(win_pct) from win_pct where WS_win= 0)
and WS_win = 0)
or (win_pct = (select min(win_pct) from win_pct where WS_win= 1)
and WS_win = 1)
order by WS_win, seas_w_pct desc")
```
If we focus on the Royals that year, we can see that they had a remarkable playoff run. Starting with the play-in game, they won every single game until they got to the World Series. And even there they lost the 7th and final game by one run.
```{r}
sqldf("select yearID, round, team, wins, losses
from playoffs
where yearID = 2014 and team = 'KCA'
order by losses, wins ")
```
This wasn't how they played all year: with a .55 season winning percentage and only 89 wins on the year, they clearly got hot at the right time.
```{r}
sqldf("select *, round(seas_wins/(seas_wins + seas_losses),3) seas_w_pct
from win_pct
where (win_pct = (select max(win_pct) from win_pct where WS_win= 0)
and WS_win = 0)")
```
Ironically, in the current playoff format, one of the two teams with the lowest winning percentage among World Series winners is the 2011 St. Louis Cardinals, the local rival for the Royals. Their seasons were remarkably similar, separated by one game in the regular season. The difference was that while the Cardinals lost 3 more games during the playoffs, they won the one extra game they needed:
```{r}
sqldf("select *, round(seas_wins/(seas_wins + seas_losses),3) seas_w_pct
from win_pct
where (win_pct = (select max(win_pct) from win_pct where WS_win= 0)
and WS_win = 0)
or (win_pct = (select min(win_pct) from win_pct where WS_win= 1 and wins > 8)
and WS_win = 1)
order by WS_win")
```
```{r}
library(ggplot2)
win_pct %>% ggplot(.,aes(y = win_pct, x = I(seas_wins/(seas_wins + seas_losses)), col = as.factor(WS_win),
size = I((wins + losses)/2), alpha = wins)) +
geom_jitter() +
xlab('Season win %') + ylab('Postseason win %')
```
```{r}
win_pct %>% filter(yearID >= 2012) %>%
ggplot(.,aes(y = win_pct, x = I(seas_wins/(seas_wins + seas_losses)), col = as.factor(WS_win),
size = I((wins + losses)/2), alpha = wins)) +
geom_jitter() +
xlab('Season win %') + ylab('Postseason win %')
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment