@bayesball
Last active June 10, 2020 02:38
---
title: "Welcome to R - Part 3"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
<!-- wp:heading {"level":3} -->
### Introduction
<!-- /wp:heading -->
<!-- wp:paragraph -->
In Welcome to R - Part 2, I imported a FanGraphs table that listed a number of statistics for the group of 2019 starting pitchers. I will continue with this example to illustrate some graphs available in base R. We'll see which variables among BABIP, K9, BB9, and HR9 appear to explain most of the variation in the WAR values of these 62 starters.
<!-- /wp:paragraph -->
<!-- wp:heading {"level":3} -->
### Questions
<!-- /wp:heading -->
<!-- wp:paragraph -->
We are working with a FanGraphs data frame of 62 starting pitchers for the 2019 season. This data frame includes (1) BABIP, batting average on balls in play, (2) K9, average number of strikeouts per nine innings, (3) BB9, average number of walks per nine innings, (4) HR9, average number of home runs allowed per nine innings, and (5) WAR, the total wins above replacement. Which variables among BABIP, K9, BB9, and HR9 are most helpful for explaining the variability in the WAR values for these 62 pitchers?
<!-- /wp:paragraph -->
### Reading in Data and Some Data Cleaning
```{r}
library(readr)
FG <- read_csv("~/Downloads/FanGraphs_Leaderboard.csv")
```
```{r}
names(FG)[c(9, 10, 11, 13, 14, 15)] <-
c("K9", "BB9", "HR9", "LOB", "GB", "HR_FB")
```
Convert the percentage variables, which are read in as strings such as "75.2%", to numeric values.
```{r}
FG$LOB <- as.numeric(gsub("%", "", FG$LOB))
FG$GB <- as.numeric(gsub("%", "", FG$GB))
FG$HR_FB <- as.numeric(gsub("%", "", FG$HR_FB))
```
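To see what the gsub() and as.numeric() steps do, here is a minimal sketch on a made-up vector of percentage strings. The vector pct_strings and its values are hypothetical and are not taken from the FanGraphs table.

```r
# Hypothetical percentage strings, similar in form to the LOB, GB, and HR_FB columns
pct_strings <- c("72.5%", "45.1%", "9.8%")

# Remove the literal "%" character, then convert the remaining text to numbers
pct_numbers <- as.numeric(gsub("%", "", pct_strings, fixed = TRUE))

print(pct_numbers)
print(class(pct_numbers))
```

Using fixed = TRUE treats "%" as a plain character rather than a regular expression, which is a small safeguard and not something the original code requires.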
<!-- wp:heading {"level":3} -->
### Comparing Groups
<!-- /wp:heading -->
<!-- wp:paragraph -->
As in the previous post, I define a new variable Group, which is "Top" if the WAR value is larger than 3.3 and "Bottom" otherwise.
<!-- /wp:paragraph -->
```{r}
FG$Group <- ifelse(FG$WAR > 3.3, "Top", "Bottom")
```
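As a quick check on how ifelse() assigns the groups, here is a small sketch with a made-up WAR vector. The name war_toy and its values are hypothetical. Note that a WAR of exactly 3.3 falls in the "Bottom" group because the comparison is strict.

```r
# Hypothetical WAR values, including one exactly at the 3.3 cutoff
war_toy <- c(1.2, 3.3, 4.5, 6.0)

# Same rule as in the post: strictly larger than 3.3 is "Top"
group_toy <- ifelse(war_toy > 3.3, "Top", "Bottom")

# Count how many values land in each group
group_counts <- table(group_toy)
print(group_counts)
```

Running table(FG$Group) on the real data frame would give the sizes of the two groups in the same way.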
Which variables among BABIP, K9, BB9, HR9 are helpful in distinguishing the Top and Bottom WAR pitchers?
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
Let's focus on the strikeout variable K9. One basic graph, created by the R function stripchart(), plots parallel one-dimensional scatterplots of K9 for the Bottom and Top WAR groups. In the code, the data = FG argument indicates that the variables K9 and Group are in the data frame FG, and the method = "jitter" argument jitters the plotted points. Note that the Top WAR group tends to strike out more batters than the Bottom WAR group.
```{r}
stripchart(K9 ~ Group,
data = FG,
method = "jitter")
```
<!-- wp:paragraph -->
One can get a better idea how K9 is related to Group by plotting parallel boxplots of K9 for the two groups. We use the function boxplot() which has a similar syntax to stripchart(). By use of the horizontal = TRUE argument, the boxes are displayed in a horizontal fashion.
<!-- /wp:paragraph -->
```{r}
boxplot(K9 ~ Group,
data = FG,
horizontal = TRUE)
```
Each box shows the location of the median (solid vertical line) and the quartiles (the left and right edges of the box) for the K9 values of a particular group. The dotted lines extend to the extreme values (LO and HI) of K9 that are not flagged as outliers, and the outliers are plotted separately as dots. We see that the median K9 values for the Top WAR and Bottom WAR groups are 10 and 8, respectively. So the Top WAR group tends to strike out 2 more batters per 9 innings than the Bottom WAR group. There are actually three outliers -- these correspond to pitchers in the Bottom WAR group who strike out a lot of batters. (It would be interesting to identify these pitchers.)
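One possible way to identify the outlying pitchers is sketched below with boxplot.stats(), which applies the same 1.5 times the box length rule that boxplot() uses by default. The data frame toy, its pitcher names, and its K9 values are all made up for illustration.

```r
# Hypothetical pitchers and K9 values; one value is far above the rest
toy <- data.frame(
  Name = paste("Pitcher", LETTERS[1:8]),
  K9 = c(7.0, 7.2, 7.4, 7.5, 7.6, 7.8, 8.0, 12.5),
  stringsAsFactors = FALSE
)

# boxplot.stats() returns the values that would be drawn as separate dots
out_values <- boxplot.stats(toy$K9)$out

# Keep the rows whose K9 value was flagged as an outlier
outlier_rows <- toy[toy$K9 %in% out_values, ]
print(outlier_rows)
```

With the real data, one would apply this within each group, for example to FG$K9[FG$Group == "Bottom"], and then look up the matching rows of FG.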
### Relating Numeric Variables
<!-- /wp:heading -->
<!-- wp:paragraph -->
A scatterplot is a basic graph where one plots ordered pairs of two numeric variables. Since we are dealing with 5 variables (BABIP, K9, BB9, HR9, WAR), one might be interested in constructing scatterplots for all pairs among these variables. This is conveniently done by using the plot() function where the argument is the data frame with only these variables. There are 20 plots shown -- note that each plot of two variables is shown twice.
<!-- /wp:paragraph -->
```{r}
plot(FG[, c("BABIP", "K9", "BB9", "HR9", "WAR")])
```
What variable among BABIP, K9, BB9 and HR9 has the strongest relationship with WAR? To answer this question, we focus on the bottom row of scatterplots. As expected K9 has a relatively strong positive relationship with WAR, both BB9 and HR9 have negative relationships with WAR, and BABIP and WAR appear to have a weak relationship. One can quantify these relationships by computing correlations by use of the cor() function with the same argument.
```{r}
cor(FG[, c("BABIP", "K9", "BB9", "HR9", "WAR")])
```
Correlations fall between -1 and 1, and the strongest relationships correspond to correlation values close to -1 or 1. Again looking at the bottom row, K9 has the strongest relationship with WAR with a 0.7 correlation, followed by HR9 (-0.51), BB9 (-0.42), and BABIP (-0.10).
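Instead of reading the bottom row of the correlation matrix by eye, one could pull out the WAR column and sort it by absolute value. The sketch below does this on simulated data. The data frame toy and the coefficients used to generate it are assumptions for illustration only and say nothing about the actual FanGraphs values.

```r
# Simulated stand-in for the pitcher data (not the real FanGraphs table)
set.seed(1)
n <- 62
toy <- data.frame(
  K9 = rnorm(n, mean = 9, sd = 1.5),
  BB9 = rnorm(n, mean = 2.8, sd = 0.6)
)
toy$WAR <- 0.6 * toy$K9 - 0.8 * toy$BB9 + rnorm(n, mean = 0, sd = 1)

# Correlations of every variable with WAR, dropping WAR with itself
cors <- cor(toy)[, "WAR"]
cors <- cors[names(cors) != "WAR"]

# Rank the variables from strongest to weakest linear relationship with WAR
sorted_cors <- cors[order(abs(cors), decreasing = TRUE)]
print(round(sorted_cors, 2))
```

Sorting by abs() matters here because a correlation of -0.5 indicates a stronger linear relationship than a correlation of 0.4.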
### Describing the Relationship
<!-- /wp:heading -->
<!-- wp:paragraph -->
Let's focus on the relationship between K9 and WAR -- we produce a scatterplot by the plot() function.
<!-- /wp:paragraph -->
```{r}
plot(FG$K9, FG$WAR)
```
Since the pattern of the relationship looks pretty linear, let's fit a line. We can use the workhorse function lm() that we use for fitting regression models. The basic syntax is the formula y ~ x, where y is the response and x is the predictor, and both are variables in a data frame. We store all of the calculations in the variable fit.
```{r}
fit <- lm(WAR ~ K9, data = FG)
```
We can add the least-squares fit to the scatterplot by the abline() function with argument fit. I also indicate by the argument col = "red" that I want the plotted line to be red.
```{r}
plot(FG$K9, FG$WAR)
abline(fit, col = "red")
```
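The fitted line can also be inspected numerically. The sketch below, on a small made-up data set, extracts the intercept and slope with coef() and uses predict() to get the fitted WAR at a chosen strikeout rate. The data frame toy and the value 9.5 are hypothetical.

```r
# Hypothetical K9 and WAR values with a roughly linear pattern
toy <- data.frame(
  K9 = c(6, 7, 8, 9, 10, 11),
  WAR = c(1.0, 1.9, 3.1, 3.9, 5.2, 5.9)
)

fit_toy <- lm(WAR ~ K9, data = toy)

# Intercept and slope of the least-squares line
line_coef <- coef(fit_toy)
print(line_coef)

# Predicted WAR for a pitcher who strikes out 9.5 batters per nine innings
pred <- predict(fit_toy, newdata = data.frame(K9 = 9.5))
print(pred)
```

The slope estimates how much WAR changes, on average, for each additional strikeout per nine innings. Calling coef(fit) on the object from the post would give the corresponding values for the 62 starters.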
### Looking Deeper
What is interesting to me about this scatterplot is not the obvious positive relationship, but the particular points (pitchers) that deviate far from the line. One can measure a deviation by a residual, which is the vertical distance of the point from the line (the observed WAR minus the WAR predicted by the line). The lm() object stored in the variable fit has many components -- one of them is a vector of residuals. We plot the residuals against the K9 values and add a horizontal line at 0.
<!-- /wp:paragraph -->
```{r}
# Residuals from the WAR ~ K9 fit, plotted against the predictor K9
plot(FG$K9, fit$residuals)
abline(a = 0, b = 0, col = "red")
```
I've drawn a box around two points that correspond to large negative residuals. These two pitchers have much smaller WAR values than would be predicted by their high strikeout rates. Again, if we looked further, I'd be interested in the identities of these two pitchers.
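One way to find the identities of pitchers with large negative residuals is to attach the residuals to the data frame and sort. This is only a sketch on made-up data: the data frame toy, the pitcher names, and the values are hypothetical.

```r
# Hypothetical pitchers; "Pitcher G" has a high K9 but a low WAR
toy <- data.frame(
  Name = paste("Pitcher", LETTERS[1:8]),
  K9 = c(6.5, 7, 8, 9, 9.5, 10.5, 11.5, 12),
  WAR = c(1.5, 2.0, 2.8, 3.6, 4.1, 5.0, 2.0, 6.1),
  stringsAsFactors = FALSE
)

fit_toy <- lm(WAR ~ K9, data = toy)

# Store each pitcher's residual (observed WAR minus fitted WAR)
toy$Residual <- residuals(fit_toy)

# The two rows with the most negative residuals
lowest <- toy[order(toy$Residual), ][1:2, ]
print(lowest)
```

The same steps with FG and fit in place of toy and fit_toy would list the two pitchers in question, assuming the data frame has a column of pitcher names and no rows were dropped for missing values.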
### Summary and What's Next
- The purpose of this post was to introduce popular base R functions for studying relationships. Once the reader has a handle on reading in data and working with data frames, it is pretty easy to use functions like hist(), stripchart(), boxplot(), and plot() to create these graphs.
- Currently, I am planning a couple more posts in this Welcome to R series. I generally use the ggplot2 package in my plots, so it would be worthwhile to introduce this package and show how one can graph three or four variables at once. Also, I'd like to talk about the data.table package, which really is a nice extension of the bracket notation that we used for vectors and data frames.