R Markdown file -- Welcome to R, Part 1
---
title: "Welcome to R - Part 1"
author: "Jim Albert"
date: "6/2/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
### The RStudio Interface
I describe the RStudio interface in my blog. You can knit this R Markdown document to produce HTML output by pressing the Knit button above.
### Vectors
Consider the famous pitcher Roy Halladay, who was recently inducted into the Baseball Hall of Fame.
His season WAR values are available from Baseball-Reference.
I collected his WAR statistics from 1999 through 2013, placed them in a vector, and stored the vector in a variable called WAR.
```{r}
WAR <- c(2.6, -2.8, 3.0, 7.3, 8.1, 2.4,
5.5, 5.3, 3.5, 6.2, 6.9, 8.5,
8.8, 0.8, -1.1)
```
I create another vector of corresponding seasons. The function seq() is helpful in creating a sequence of seasons from 1999 to 2013.
```{r}
Season <- seq(1999, 2013)
```
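As a small aside, the colon operator produces the same sequence, and seq() accepts a by argument for other step sizes. This is only an illustrative sketch; the names Season2 and EveryOther are made up here and are not used elsewhere in this document.

```{r}
# Season is repeated so that this chunk runs on its own
Season <- seq(1999, 2013)

# The colon operator gives the same 15 seasons
Season2 <- 1999:2013
all(Season == Season2)

# seq() with a step of 2 picks out every other season
EveryOther <- seq(1999, 2013, by = 2)
EveryOther

# Number of seasons in the vector
length(Season)
```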
I create a character vector Team containing the teams that Halladay played for in those 15 seasons. The function rep() repeats its argument a given number of times.
```{r}
Team <- c(rep("TOR", 11), rep("PHI", 4))
```
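The same vector can also be built with a single call to rep() by passing a vector to its times argument, so that each team is repeated its own number of times. This is an alternative sketch rather than a replacement; the name Team2 is made up for the comparison.

```{r}
# Team is repeated so that this chunk runs on its own
Team <- c(rep("TOR", 11), rep("PHI", 4))

# One call: "TOR" repeated 11 times, then "PHI" repeated 4 times
Team2 <- rep(c("TOR", "PHI"), times = c(11, 4))

# Check that both constructions agree
identical(Team, Team2)
```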
### Vector Indices
To illustrate different operations with vectors, we define a new vector GS containing the number of games started by Halladay in each of these 15 seasons.
```{r}
GS <- c(18, 13, 16, 34, 36, 21, 19, 32,
31, 33, 32, 33, 32, 25, 13)
```
One uses square bracket notation (`[]`) to choose particular values of a vector.
For example, the 3rd value of GS is
```{r}
GS[3]
```
and the 2nd, 3rd, and 7th values of GS are
```{r}
GS[c(2, 3, 7)]
```
A convenient way to find the subset of a vector that satisfies a condition is to use logical operators.
The following expression produces a logical vector that is TRUE at the positions of GS where Halladay started more than 30 games and FALSE elsewhere.
```{r}
GS > 30
```
This is helpful when you want to find the seasons where Halladay started more than 30 games.
```{r}
Season[GS > 30]
```
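Two related base R idioms may be useful here: which() returns the positions where a logical vector is TRUE, and sum() applied to a logical vector counts the TRUE values (since TRUE is treated as 1). The sketch below repeats the GS and Season vectors so that it runs on its own.

```{r}
# Vectors repeated from above so that this chunk is self-contained
GS <- c(18, 13, 16, 34, 36, 21, 19, 32,
        31, 33, 32, 33, 32, 25, 13)
Season <- seq(1999, 2013)

# Positions (indices) of the seasons with more than 30 starts
which(GS > 30)

# Number of seasons with more than 30 starts
sum(GS > 30)

# Indexing by position gives the same seasons as indexing by the logical vector
Season[which(GS > 30)]
```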
In which season did Halladay obtain his maximum WAR?
First we construct a logical vector that is TRUE where the WAR value equals the maximum:
```{r}
WAR == max(WAR)
```
(Note the use of a double equals sign to indicate logical equality.)
Then we use this logical vector to obtain the corresponding season:
```{r}
Season[WAR == max(WAR)]
```
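A shortcut for the same question is the base R function which.max(), which returns the position of the first maximum value. The sketch below repeats the WAR and Season vectors so that it runs on its own; the name best is made up for illustration.

```{r}
# Vectors repeated from above so that this chunk is self-contained
WAR <- c(2.6, -2.8, 3.0, 7.3, 8.1, 2.4,
         5.5, 5.3, 3.5, 6.2, 6.9, 8.5,
         8.8, 0.8, -1.1)
Season <- seq(1999, 2013)

# Position of the largest WAR value
best <- which.max(WAR)
best

# Season and WAR value at that position
Season[best]
WAR[best]
```

Note that which.max() reports only the first maximum, while Season[WAR == max(WAR)] would return every season tied for the maximum.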
### Operations with Vectors
One can perform vector arithmetic in R: given two vectors of the same length, arithmetic is performed element by element.
For example, suppose you want to compute a new measure, WAR_per_GS, defined as WAR divided by GS. We can do this easily from the vectors WAR and GS:
```{r}
WAR_per_GS <- WAR / GS
WAR_per_GS
```
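To see what element-by-element arithmetic means, here is a tiny sketch using only the first three seasons. The names war3, gs3, and ratio3 are made up for this illustration. Each element of the result is the corresponding element of the first vector divided by the corresponding element of the second; a single number, such as 100, is reused for every element.

```{r}
# First three WAR and GS values, copied from the vectors above
war3 <- c(2.6, -2.8, 3.0)
gs3 <- c(18, 13, 16)

# Element-by-element division: 2.6 / 18, -2.8 / 13, 3.0 / 16
ratio3 <- war3 / gs3
ratio3

# A single number is applied to every element (here, WAR per 100 starts)
100 * ratio3
```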
Many R functions take a vector as an argument and return a single value. For example, we can obtain Halladay's cumulative WAR for these 15 seasons with the sum() function.
```{r}
sum(WAR)
```
Let's compare Halladay's cumulative WAR for his two teams:
```{r}
sum(WAR[Team == "TOR"])
sum(WAR[Team == "PHI"])
```
There are many similar functions like max(), min(), mean(), sd(), etc.
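As a sketch of how these summary functions combine with a grouping vector, the base R function tapply() applies a function to the values of one vector within each level of another. This goes slightly beyond the logical indexing shown above and is offered only as an alternative; the name by_team is made up for illustration.

```{r}
# Vectors repeated from above so that this chunk is self-contained
WAR <- c(2.6, -2.8, 3.0, 7.3, 8.1, 2.4,
         5.5, 5.3, 3.5, 6.2, 6.9, 8.5,
         8.8, 0.8, -1.1)
Team <- c(rep("TOR", 11), rep("PHI", 4))

# Other single-value summaries of the WAR vector
max(WAR)
min(WAR)
mean(WAR)
sd(WAR)

# Cumulative WAR for each team in one call
by_team <- tapply(WAR, Team, sum)
by_team

# Average WAR per season for each team
tapply(WAR, Team, mean)
```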
### Graphs of Vector Data
A basic graph of the WAR values is a one-dimensional scatterplot created by the function stripchart(). The argument pch = 19 selects a solid circle as the plotting symbol.
```{r}
stripchart(WAR, pch = 19)
```
Another common graph of numeric data is a histogram. We create a histogram of Halladay's WAR values with the hist() function.
```{r}
hist(WAR)
```
How did Halladay's WAR values change over his career? We can answer this with a scatterplot that has Season on the horizontal axis and WAR on the vertical axis.
```{r}
plot(Season, WAR)
```
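The same plot can be made easier to read with a few standard arguments to plot(). The sketch below is one possible variation, not the only way to draw it: type = "b" draws both points and connecting lines, xlab, ylab, and main set the labels, and abline(h = 0) adds a reference line at zero WAR. The label text is my own choice.

```{r}
# Vectors repeated from above so that this chunk is self-contained
WAR <- c(2.6, -2.8, 3.0, 7.3, 8.1, 2.4,
         5.5, 5.3, 3.5, 6.2, 6.9, 8.5,
         8.8, 0.8, -1.1)
Season <- seq(1999, 2013)

# Points joined by lines, with axis labels and a title
plot(Season, WAR, type = "b", pch = 19,
     xlab = "Season", ylab = "WAR",
     main = "Roy Halladay: WAR by Season")

# Dashed horizontal reference line at WAR = 0
abline(h = 0, lty = 2)
```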
What teams did Halladay play for? We first create a table of frequencies using the table() function and then construct a bar graph of the table using the barplot() function.
```{r}
table(Team)
```
```{r}
barplot(table(Team))
```
Did Halladay play better for Philly or Toronto? We'll answer this question by constructing a stripchart of WAR by Team.
```{r}
stripchart(WAR ~ Team, pch = 19)
```
### Things to Try on Your Own
1. Collect some Mike Trout data. Create three vectors containing the numbers of AB, numbers of HR, and Season for his nine seasons from 2011 to 2019.
2. Using logical variables, find the seasons when Trout hit over 35 home runs.
3. Using logical variables, find the season when Trout hit the most home runs.
4. Use vector operations to create a new variable HR_Rate equal to the number of home runs divided by the number of at-bats.
5. Construct a one-dimensional scatterplot (R calls this a stripchart) of HR_Rate.
6. Construct a scatterplot of HR_Rate (vertical) against Season (horizontal).
7. Find Trout's career home runs by use of the sum() function.
8. Find the sum of Trout's home runs after 2015 (use logical operators) and the sum of home runs 2015 or earlier.
9. Find Trout's cumulative home run rate (total HR divided by total AB).