One of my favorite baseball websites, baseballmusings.com, recently had a post about the chances of a player winning the triple crown (leading the league in home runs, RBIs and batting average).
The author provided calculations for these, and estimated that probability of achieving the triple crown by multiplying the individual probabilities together. This led me to assess the association between these statistics, as it seemed that there was a fairly strong association between leading the league in home runs and RBIs (both generally signs of power hitters, who probably get lineup spots with opportunities to drive in runners), and that perhaps they shouldn't be considered as independent. The analysis using RStudio and intepretation is below.
Using Lahman
package for baseball data, sqldf
for data manipulation
library(Lahman)
library(sqldf)
HR and RBI are available in Batting dataframe, but batting average is more difficult to calculate. We'll use calculated stats from battingStats() table, convert to recognizable format for sqldf
package
batting_stats <- battingStats()
Bring all stats together up to 2016 season, limit to seasons starting in 1919: play before the "live ball era" was very different, with few HRs and different relationship between HR and RBI
batting_data <- sqldf("select a.playerID, a.yearID, a.teamID, a.lgID, a.HR, a.RBI, b.BA
from Batting a
inner join batting_stats b
on a.playerID = b.playerID
and a.yearID = b.yearID
and a.lgID = b.lgID
and a.teamID = b.teamID
inner join (select playerID, yearID, lgID, sum(AB) AB_tot
from Batting
group by playerID, yearID, lgID) c
on a.playerID = c.playerID
and a.yearID = c.yearID
and a.lgID = c.lgID
where a.yearID >= 1919
and a.lgID in ('NL','AL')
and c.AB_tot >=400")
Identify the leading values for HR, RBI and BA for each league in each year
HR_RBI_BA_lead <- sqldf("select yearID, lgID, max(HR) HR_max, max(RBI) RBI_max, max(BA) BA_max
from batting_data
group by yearID, lgID
order by yearID, lgID")
Identify the seasons where the batter led in both HR and RBI
HR_RBI_leaders <- sqldf("select a.playerID, a.yearID, a.lgID, a.HR, a.RBI, a.BA,
case when a.BA = b.BA_max then 'Triple Crown' else '' end Batting
from batting_data a
inner join HR_RBI_BA_lead b
on a.yearID = b.yearID
and a.lgID = b.lgID
and a.HR = b.HR_max
and a.RBI = b.RBI_max
order by a.yearID, a.lgID")
Identify seasons where the batter led in HR and BA
HR_BA_leaders <- sqldf("select a.playerID, a.yearID, a.lgID, a.HR, a.RBI, a.BA,
case when a.RBI = b.RBI_max then 'Triple Crown' else '' end RBI
from batting_data a
inner join HR_RBI_BA_lead b
on a.yearID = b.yearID
and a.lgID = b.lgID
and a.HR = b.HR_max
and a.BA = b.BA_max
order by a.yearID, a.lgID")
Identify seasons where the batter led in RBI and BA
RBI_BA_leaders <- sqldf("select a.playerID, a.yearID, a.lgID, a.HR, a.RBI, a.BA,
case when a.HR = b.HR_max then 'Triple Crown' else '' end HR
from batting_data a
inner join HR_RBI_BA_lead b
on a.yearID = b.yearID
and a.lgID = b.lgID
and a.RBI = b.RBI_max
and a.BA = b.BA_max
order by a.yearID, a.lgID")
Print out the correlation between leading in HR, RBI, or BA. Much stronger association between leading in HR and RBI (nearly 50%) than for associations with batting average (8-12%).
paste0("The league leader in HRs and BA was the same ",nrow(HR_RBI_leaders),"/",nrow(HR_RBI_BA_lead),' times (%',
as.integer(nrow(HR_RBI_leaders)/nrow(HR_RBI_BA_lead)*100),")")
paste0("The league leader in HRs and BA was the same ",nrow(HR_BA_leaders),"/",nrow(HR_RBI_BA_lead),' times (%',
as.integer(nrow(HR_BA_leaders)/nrow(HR_RBI_BA_lead)*100),")")
paste0("The league leader in RBIs and BA was the same ",nrow(RBI_BA_leaders),"/",nrow(HR_RBI_BA_lead),' times (%',
as.integer(nrow(RBI_BA_leaders)/nrow(HR_RBI_BA_lead)*100),")")
Starting in 1919, the results are:
- The league leader in HRs and BA was the same 96/196 times (%48)
- The league leader in HRs and BA was the same 17/196 times (%8)
- The league leader in RBIs and BA was the same 24/196 times (%12)
We see a strong association between leading the league in home runs and RBIs. In contrast, a batter has nearly the same chance of leading the league in only HR and BA as they do of winning the overall triple crown. It may not be appropriate to consider the chances of leading the league in HR and RBI as independent events.