Skip to content

Instantly share code, notes, and snippets.

@czsheets
Last active September 9, 2018 11:23
Show Gist options
  • Save czsheets/ab6ab4cbb3a8cb3dd21ea86b54167ff4 to your computer and use it in GitHub Desktop.
Save czsheets/ab6ab4cbb3a8cb3dd21ea86b54167ff4 to your computer and use it in GitHub Desktop.
Checking assumptions about baseball's triple crown statistics
published: true

One of my favorite baseball websites, baseballmusings.com, recently had a post about the chances of a player winning the triple crown (leading the league in home runs, RBIs and batting average).

The author provided calculations for these, and estimated that probability of achieving the triple crown by multiplying the individual probabilities together. This led me to assess the association between these statistics, as it seemed that there was a fairly strong association between leading the league in home runs and RBIs (both generally signs of power hitters, who probably get lineup spots with opportunities to drive in runners), and that perhaps they shouldn't be considered as independent. The analysis using RStudio and intepretation is below.

Using Lahman package for baseball data, sqldf for data manipulation

library(Lahman)
library(sqldf)

HR and RBI are available in Batting dataframe, but batting average is more difficult to calculate. We'll use calculated stats from battingStats() table, convert to recognizable format for sqldf package

batting_stats <- battingStats()

Bring all stats together up to 2016 season, limit to seasons starting in 1919: play before the "live ball era" was very different, with few HRs and different relationship between HR and RBI

batting_data <- sqldf("select a.playerID, a.yearID, a.teamID, a.lgID, a.HR, a.RBI, b.BA
                    from Batting a
                    inner join batting_stats b
                    on a.playerID = b.playerID
                      and a.yearID = b.yearID
                      and a.lgID = b.lgID
                      and a.teamID = b.teamID
                    inner join (select playerID, yearID, lgID, sum(AB) AB_tot
                                from Batting
                                group by playerID, yearID, lgID) c
                    on a.playerID = c.playerID
                      and a.yearID = c.yearID
                      and a.lgID = c.lgID
                    where a.yearID >= 1919
                      and a.lgID in ('NL','AL')
                      and c.AB_tot >=400")

Identify the leading values for HR, RBI and BA for each league in each year

HR_RBI_BA_lead <- sqldf("select yearID, lgID, max(HR) HR_max, max(RBI) RBI_max, max(BA) BA_max 
                     from batting_data 
                     group by yearID, lgID 
                     order by yearID, lgID")

Identify the seasons where the batter led in both HR and RBI

HR_RBI_leaders <- sqldf("select a.playerID, a.yearID, a.lgID, a.HR, a.RBI, a.BA,
                    case when a.BA = b.BA_max then 'Triple Crown' else '' end Batting
                  from batting_data a
                  inner join HR_RBI_BA_lead b
                  on a.yearID = b.yearID
                    and a.lgID = b.lgID
                    and a.HR = b.HR_max
                    and a.RBI = b.RBI_max
                  order by a.yearID, a.lgID")

Identify seasons where the batter led in HR and BA

HR_BA_leaders <- sqldf("select a.playerID, a.yearID, a.lgID, a.HR, a.RBI, a.BA,
                    case when a.RBI = b.RBI_max then 'Triple Crown' else '' end RBI
                        from batting_data a
                        inner join HR_RBI_BA_lead b
                        on a.yearID = b.yearID
                        and a.lgID = b.lgID
                        and a.HR = b.HR_max
                        and a.BA = b.BA_max
                        order by a.yearID, a.lgID")

Identify seasons where the batter led in RBI and BA

RBI_BA_leaders <- sqldf("select a.playerID, a.yearID, a.lgID, a.HR, a.RBI, a.BA,
                    case when a.HR = b.HR_max then 'Triple Crown' else '' end HR
                       from batting_data a
                       inner join HR_RBI_BA_lead b
                       on a.yearID = b.yearID
                       and a.lgID = b.lgID
                       and a.RBI = b.RBI_max
                       and a.BA = b.BA_max
                       order by a.yearID, a.lgID")

Print out the correlation between leading in HR, RBI, or BA. Much stronger association between leading in HR and RBI (nearly 50%) than for associations with batting average (8-12%).

paste0("The league leader in HRs and BA was the same ",nrow(HR_RBI_leaders),"/",nrow(HR_RBI_BA_lead),' times (%',
       as.integer(nrow(HR_RBI_leaders)/nrow(HR_RBI_BA_lead)*100),")")

paste0("The league leader in HRs and BA was the same ",nrow(HR_BA_leaders),"/",nrow(HR_RBI_BA_lead),' times (%',
       as.integer(nrow(HR_BA_leaders)/nrow(HR_RBI_BA_lead)*100),")")

paste0("The league leader in RBIs and BA was the same ",nrow(RBI_BA_leaders),"/",nrow(HR_RBI_BA_lead),' times (%',
      as.integer(nrow(RBI_BA_leaders)/nrow(HR_RBI_BA_lead)*100),")")

Starting in 1919, the results are:

  • The league leader in HRs and BA was the same 96/196 times (%48)
  • The league leader in HRs and BA was the same 17/196 times (%8)
  • The league leader in RBIs and BA was the same 24/196 times (%12)

We see a strong association between leading the league in home runs and RBIs. In contrast, a batter has nearly the same chance of leading the league in only HR and BA as they do of winning the overall triple crown. It may not be appropriate to consider the chances of leading the league in HR and RBI as independent events.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment