perthr/hbda-r-intro.Rmd

## hbda-r-intro.Rmd
---
title: "Beginner R Workshop"
output: html_document
date: October 27, 2016
author: Keegan Korthauer and Joseph N. Paulson
---

Today's workshop will be a gentle introduction to the
[R programming language](https://cran.r-project.org/).  We have provided RStudio Server instances that you can access on the web, but we'll also briefly cover how you can install R and RStudio on your own computer.  We'll start with some basics about programming and then get some hands-on experience analyzing a real-world, messy dataset that we'll collect from everyone live.  We hope that you'll get a feel for what R can do and find out where to learn more so you can use it on your own (great resources are listed at the end).

## Getting Started with R

Although R is technically a programming language, it was developed specifically for analyzing data.  It has many built-in tools for common analysis tasks, as well as countless more available as add-ons developed by the R community.  Since it is a language, however, you are not limited to carrying out tasks or analyses that someone has already implemented.  R is widely used, free, and open-source, with great community support.

## Launching RStudio Server

During this workshop, you can use RStudio Server, which provides a browser-based interface to R running in the cloud.  To launch RStudio Server, open a browser and navigate to the IP address provided to you.  This should lead you to a login page, where you will enter the username and password provided to you.

![Login screen for RStudio Server](rstudio-login.png)

## Getting R/RStudio set up on your own computer

You can skip this step for the purposes of today's workshop if you are using the RStudio Server provided.

### Installing R
To set up your own computer, the first step is to install R. You can download and install R from
the [Comprehensive R Archive Network](https://cran.r-project.org/)
(CRAN). It is relatively straightforward, but if you need further help
you can try the following resources:

* [Installing R on Windows](https://github.com/genomicsclass/windows#installing-r)
* [Installing R on Mac](http://youtu.be/Icawuhf0Yqo)
* [Installing R on Ubuntu](http://cran.r-project.org/bin/linux/ubuntu/README)

### Installing RStudio

The next step is to install RStudio, a program for viewing and running R scripts. Technically you can run all the code shown here without installing RStudio, but we highly recommend this integrated
development environment (IDE). Instructions are
[here](http://www.rstudio.com/products/rstudio/download/) and for
Windows we have special
[instructions](https://github.com/genomicsclass/windows).

## The Console

Now that you have opened up RStudio Server (or downloaded and installed R on your own computer), you are ready to start working with data. Whichever approach you are using to interact with R, you should identify the console.

![RStudio with console](console-screen-shot.png)

When you type a line of code into the console and hit enter, the command gets _executed_. For example, try using R as a calculator by typing:

```{r}
2+3
```

We can also assign values to variables. Try the following:

```{r}
x <- 2
y <- 3
x + y
```

Note also the window above the console.  This is where you can store lines of code to be executed in a particular sequence, and these can be saved in a script (a text file with a ".R" extension) so that you can reproduce your results later or run your code on another dataset.

## The R environment

When you download R from CRAN you get what we call _base_ R. This includes several _functions_ that are considered fundamental for data analysis. It also includes several example datasets. These datasets are particularly useful as examples when we are learning to use the available functions. You can see all the available datasets by executing the function `data` like this:

```{r,eval=FALSE}
data()
```

Because in R functions are objects, we need the two parentheses to let R know that we want the function to be executed as opposed to showing us the code for the function. Type the following and note the difference:

```{r,eval=FALSE}
data
```
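
Here is a small sketch of the same distinction, using the base function `sum` instead of `data` (the name `add_up` is made up for illustration). Because functions are ordinary objects, they can be inspected, copied to a new name, and then called.

```{r}
# a function name without parentheses refers to the function object itself
is.function(sum)

# functions can be assigned to a new name like any other object
add_up <- sum

# adding parentheses (with arguments) actually calls the function
add_up(1, 2, 3)
```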

To see an example of functions at work, we will use the `co2` dataset to illustrate the function `plot`, one of the base functions. We can plot the Mauna Loa atmospheric CO2 concentration over time like this:

```{r}
plot(co2)
```

Note that R's base functionality is bare bones. Data science applications are broad, the statistical toolbox is extensive, and most users need only a small fraction of all the available functionality. Therefore, a better approach is to make specific functionality available _on demand_, just like apps for smartphones.  R does this using _packages_, also called _libraries_.

Some packages are considered popular enough that they are included with the base download.
For example, the software implementing methods for survival analysis is in the `survival` package. To bring that functionality to your current session we type

```{r,eval=FALSE}
library(survival)
```

However, CRAN hosts many thousands of packages that are not included in the base installation. You can install these using the `install.packages` function.

### Installing Packages

To use an add-on package that is not included with base R, you'll first need to install it with `install.packages`.  Packages can be retrieved from several different repositories.  One popular repository is CRAN, where
packages are vetted: they are checked for common errors and they must have a dedicated maintainer. There are other repositories, some with more vetting, such as [Bioconductor](http://www.bioconductor.org), and some with no vetting, such as GitHub. You can easily install CRAN packages from within R if you know the name of the package. As an example, if you want to install the package `dplyr`, you would use:

```{r,eval=FALSE}
install.packages("dplyr")
```

We can then load the package into our R sessions using the `library` function:

```{r, warning=FALSE}
library(dplyr)
```

Installation only needs to be carried out once on your machine: once you install a package, it
remains in place and only needs to be loaded with `library` in each new R session. If you
try to load a package and get an error, it probably means you need to
install it first.  Note that there are reasons to reinstall packages that already exist in your library (e.g., to
receive updated versions of packages).
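
One way to check whether a package is already installed before loading it is `requireNamespace`, which returns `TRUE` or `FALSE` rather than stopping with an error. The sketch below uses the `stats` package, which ships with R, purely as a stand-in; substitute the name of the package you care about.

```{r}
pkg <- "stats"  # stand-in package name, chosen only for illustration

if (requireNamespace(pkg, quietly = TRUE)) {
  message(pkg, " is installed; load it with library(", pkg, ")")
} else {
  message(pkg, " is not installed; run install.packages(\"", pkg, "\") first")
}
```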

### Getting help

A key feature you need to know about R is that you can get help for a function using `help` or `?`, like this:
```{r,eval=FALSE}
?install.packages
help("install.packages")
```

These pages are quite detailed and also include examples at the end.
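
If you do not remember the exact name of a function, a few other base tools can help. As a short sketch: `apropos` lists objects whose names match a pattern, `help.search` searches the help pages by keyword, and `example` runs the examples from a help page.

```{r, eval=FALSE}
# list objects whose names contain "read.csv"
apropos("read.csv")

# search help page titles and keywords for a phrase
help.search("linear model")

# run the examples from the bottom of a help page
example(mean)
```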

### Comments
The hash character (`#`) marks a comment, so any text following it on the same
line is not interpreted:

```{r}
##This is just a comment
```

When writing your own R scripts, it is strongly recommended that you write out comments
that explain what each section of code is doing. This is very helpful both for collaborators, and for
your future self who may have to review, run, or edit your code.

## General Programming Principles

Although there are different styles and languages of programming, in essence a piece of code is just a very detailed set of instructions.  Each language has its own set of rules and syntax.  According to Wikipedia, syntax is

>"the set of rules that defines the combinations of symbols that are considered to be a correctly structured document or fragment in that language."

Here are some general tips, and pitfalls to avoid, that will be useful when writing R code:

#### **1. Case matters**: variable names, keywords, functions, and package names are all case-sensitive

```{r, error=TRUE}
x <- 2
X + 8
```

#### **2. Avoid using spaces**: variable names cannot contain spaces

```{r, error=TRUE}
my variable <- 10
```

#### **3. Use comments liberally**: your future self and others will thank you


```{r}
# define scalar variables x and y
x <- 2
y <- 3

# add variables x and y
x + y
```

#### **4. Pay attention to classes**: character strings, numerics, factors, matrices, lists, data.frames, etc., all behave differently in R

```{r, error=TRUE}
myNumber <- factor(10)
str(myNumber)
myNumber^2
as.numeric(myNumber)^2
as.character(myNumber)^2
as.numeric(as.character(myNumber))^2
```
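
The surprising result of `as.numeric(myNumber)` comes from how factors are stored: R keeps an integer code for each level, and `as.numeric` returns those codes rather than the labels. A small sketch with made-up values (the name `sizes` is only for illustration):

```{r}
sizes <- factor(c("10", "20", "10"))

levels(sizes)                      # the labels: "10" "20"
as.numeric(sizes)                  # the internal codes: 1 2 1
as.numeric(as.character(sizes))    # the values we wanted: 10 20 10
```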

#### **5. Search the documentation for answers**: when something unexpected happens, try to find out why by reading the documentation

```{r}
mean(c(3,4,5,NA))
?mean
mean(c(3,4,5,NA), na.rm=TRUE)
```

#### **6. It's OK to make mistakes**: expert R programmers run into (and learn from) errors all the time

Don't panic about those error messages!


## Hands-on Example: Heights and Shoe Sizes

To demonstrate some of the basic principles of reading in data, getting it in a format for analysis (often called data "wrangling"), and carrying out some basic analyses, we'll study the relationship between heights and shoe sizes.  Rather than use an existing dataset, we'll make things more exciting by collecting our own data live.

### Data collection

To add yourself to our live dataset of heights and shoe sizes, go to [http://tinyurl.com/introRsurvey](http://tinyurl.com/introRsurvey) and enter your height (in inches), your shoe size (US), and your sex.  These will be aggregated into a .csv spreadsheet, which we will analyze in the next few sections.

![Data collection survey](survey-screen-shot.png)

### Importing Data into R

The first step when preparing to analyze data is to read the data into R. There are several ways to do this, but we are going to focus on reading in data stored in an external Comma-Separated Value (CSV) file. This can be done with the help of the `read.csv` function.

Small datasets such as the one used as an example here are often stored as Excel files.  Although there are R packages designed to read Excel (xls) format, you generally want
to avoid this and save files as comma delimited (Comma-Separated
Value/CSV) or tab delimited (Tab-Separated Value/TSV/TXT) files.
These plain-text formats are often easier for sharing, as commercial software is not required for viewing or
working with the data.

If your data is not in CSV format, there are other helpful functions that will read your data into R, such as `read.table` and `read.delim`, and `download.file` will save a copy of a file from the web to your computer.  Check out their help pages to learn more.

If you are reading in a file stored on your computer, the first step is to find the file containing your data and know its *path*.

#### Paths and the Working Directory

When you are working in R it is useful to know your _working directory_. This is the directory or folder in which R will save or look for files by default. You can see your working directory by typing:

```{r, eval=FALSE}
getwd()
```

You can also change your working directory using the function `setwd`, or through RStudio by clicking on "Session" and then "Set Working Directory".

The functions that read and write files (there are several in R) assume you mean to look for files or write files in the working directory. Our recommended approach for beginners will have you reading and writing to the working directory. However, you can also type the [full path](http://www.computerhope.com/jargon/a/absopath.htm), which will work independently of the working directory.
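
As a small sketch of the difference between relative and full paths, `file.path` builds a path from its pieces and `file.exists` checks whether a file is there. The file name `my_data.csv` is hypothetical.

```{r}
# a relative path is looked up inside the working directory
rel_path <- "my_data.csv"

# the equivalent full path, built from the working directory
full_path <- file.path(getwd(), rel_path)
full_path

# both refer to the same location, so these two checks always agree
file.exists(rel_path) == file.exists(full_path)
```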

#### Reading in files from the web

Since we will analyze a dataset from the web instead of one that is saved on our computer, we have to carry out an additional step to download the file.  Here we make use of the `getURL` function from the `RCurl` package to download the contents of the file located at [http://tinyurl.com/introRsurveyresults](http://tinyurl.com/introRsurveyresults).  This produces a character string, which we send to `read.csv` to convert the Comma-Separated Value file to a data.frame.

```{r, eval=TRUE}
library(RCurl)
tmp <- getURL("http://tinyurl.com/introRsurveyresults", followlocation=1L, ssl.verifypeer=0L)
dat <- read.csv(text=tmp)
```

We have put the content of what comes out of `read.csv` into an _object_. We picked the object name `dat`.  But what exactly is in `dat`? To check its contents, we use the `str()` function (which stands for 'structure'):

```{r, eval=TRUE}
str(dat)
```

Here we see that this object is a `data.frame`. Data frames are one of the most widely used data types in R. They are particularly useful for storing tables.  We can also print out the top of the data frame using the `head()` function:

```{r, eval=TRUE}
head(dat)
```
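
To see what `read.csv(text = ...)`, `str`, and `head` do without relying on a network connection, here is a sketch that parses a small made-up CSV string. The object names, column names, and values are invented for illustration.

```{r}
csv_text <- "height,shoe.size,sex
70,10,M
64,7.5,F"

toy <- read.csv(text = csv_text)
str(toy)    # a data.frame with 2 rows and 3 columns
head(toy)
```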

#### Class types

There are many different data types in R, but some of the more common ones include:

- `data.frame`
- `vector`
- `matrix`
- `list`
- `factor`
- `character`
- `numeric`
- `integer`
- `double`

Each has its own properties, and reading up on them will give you a better understanding of the underlying
R infrastructure. See the respective help files for additional information. To see the _class_ of an object,
use the `class` function.

```{r,eval=TRUE}
class(dat)
```
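
A quick sketch of `class` applied to a few hand-made values shows how some of the types listed above are reported:

```{r}
class(3.14)              # "numeric"
class(2L)                # "integer" (the L suffix requests an integer)
class("seventy")         # "character"
class(TRUE)              # "logical"
class(factor("M"))       # "factor"
class(list(1, "a"))      # "list"
```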

#### Renaming columns

Before we continue, it will be helpful to change the names of our columns to something more convenient.

```{r}
names(dat) <- c("time","height", "shoe.size", "sex")
```

#### Extracting columns

To extract columns from the data.frame we use the `$` character like this:

```{r, eval=FALSE}
dat$sex
```

This now gives us a vector. We can access elements of the vector using the `[` symbol:

```{r}
dat$sex[2]
```
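
The `$` operator is one of several ways to pull values out of a data frame. The sketch below uses a small made-up data frame (`toy_df` and its values are invented) to show `[[`, row and column indexing with `[`, and filtering rows with a logical condition.

```{r}
toy_df <- data.frame(height = c(70, 64, 68),
                     sex = c("M", "F", "F"))

toy_df[["height"]]              # same as toy_df$height
toy_df[2, "height"]             # row 2 of the height column: 64
toy_df[toy_df$sex == "F", ]     # only the rows where sex is "F"
```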

#### Vectors

A vector is a sequence of data elements of the same type (class). Many of the operations used to analyze data are applied to vectors. In R, vectors can be numeric, character, or logical.

The most basic way to create a vector is with the function `c`:
```{r}
x <- c(1,2,3,4,5)
```

Two very common ways of generating vectors are using `:` or the `seq` function:

```{r}
x <- 1:5
x <- seq(1,5)
```

Vectors can have names:

```{r}
names(x) <- letters[1:5]
x
```
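
Because so many operations in R work on whole vectors at once, it is worth seeing a few in action. This sketch uses a made-up named vector of heights (`h` is a name chosen for illustration) to show element-wise arithmetic, logical indexing, and indexing by name.

```{r}
h <- c(a = 60, b = 72, c = 66)

h * 2.54          # arithmetic is applied to every element (inches to cm)
h > 65            # comparisons return a logical vector
h[h > 65]         # keep only the elements where the condition is TRUE
h["b"]            # elements can also be looked up by name
```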

### Data Wrangling

In the real world, data science projects rarely involve data that can be easily imported ready for analysis. According to Wikipedia:

>Data munging or data wrangling is loosely the process of manually converting or mapping data from one "raw" form into another format that allows for more convenient consumption of the data with the help of semi-automated tools.

Now we want to describe the heights and shoe sizes. We could simply report the list of numbers. But there is a problem. Take a look at the entries:
```{r,eval=FALSE}
View(dat)
```

Notice not all the height entries are numbers. Furthermore, they are not all in inches. This will cause problems if we try to, for example, make a simple boxplot of the heights.

```{r,error=TRUE}
boxplot(dat$height)
```

Also note that there was variation in how respondents entered "Male" or "Female" responses. This will cause problems if we want to stratify by sex.

```{r,error=TRUE}
table(dat$sex)
```

So what to do? We need to wrangle.

#### Coercion

What happened in our example to cause the height data to be stored as something other than numeric values?  Vectors need to be homogeneous (each element within the vector needs to have the same type). But when R is instructed to create a vector of different types, it does not give an error. Instead it tries to _coerce_ values to be the same. Here is an example:

```{r}
height <- c(60, 59, 55, "5'5", 70)
height
```

Note that no warning or error was given. It simply changed everything to a character. This is important to know because sometimes we make a mistake in entering data and receive no error message.

Examine our data frame again using `str` to see that coercion occurred.

```{r}
str(dat)
```
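
Going in the other direction, `as.numeric` converts character strings back to numbers, but any string that does not look like a number becomes `NA` (with a warning). A minimal sketch with made-up entries:

```{r}
entries <- c("60", "59", "5'5", "70")

# "5'5" cannot be parsed as a number, so it becomes NA;
# suppressWarnings() hides the "NAs introduced by coercion" warning
converted <- suppressWarnings(as.numeric(entries))
converted
is.na(converted)
```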

#### Converting Height variable to numeric

In order to proceed with analysis, we need to address the issues with our height and shoe size variables being non-numeric.  Unfortunately, there is no prebuilt function that we can run or one-size-fits-all approach to solving this problem.  So we will write our own function to convert these values.

#### Functions

Up to now we have used prebuilt functions. However, many times we have to construct our own. We can do this in R using the `function` keyword:

```{r}
avg <- function(x){
  return( sum(x) / length(x) )
}
avg( 1:5 )
```
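
Functions can take more than one argument, and arguments can have default values. As a sketch, here is a variant of `avg` (called `avg_narm`, a name made up for this example) with an `na.rm` argument modeled on the one in `mean`.

```{r}
avg_narm <- function(x, na.rm = FALSE){
  # optionally drop missing values before averaging
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  sum(x) / length(x)   # the last expression evaluated is the return value
}

avg_narm(c(3, 4, 5, NA))                 # NA, because of the missing value
avg_narm(c(3, 4, 5, NA), na.rm = TRUE)   # 4
```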

Here we construct a more complicated function that changes 5'4 or 5'4" to `5*12+4`:
```{r}
fixheight <- function(x){
  x <- as.character(x)
  # first remove \" if present (gsub handles every element, not just the first)
  x <- gsub("\"", "", x, fixed = TRUE)
  y <- strsplit(x, "'")
  ret <- sapply(y, function(z){
    ifelse( length(z)>1, as.numeric(z[1])*12 + as.numeric(z[2]) ,
            as.numeric(z[1]))
  })
  # convert any values greater than 84 'inches' (assume it's cm)
  ret <- ifelse(ret > 84,yes = ret/2.54,no = ret)
  # convert any values less than 3 'inches' (assume it's in m)
  ret <- ifelse(ret < 3, yes = (100*ret)/2.54,no = ret)
  return(ret)
}
```

We can now test the function to make sure it outputs numeric values (in inches):
```{r}
fixheight( "70")
fixheight( "5'10")
fixheight( "5'10\"")
sapply(c("5'9","70","5'11","6'2"), fixheight)
fixheight("1.89")
fixheight("189")
sapply(c("5'9","70","5'11","6'2","1.89",1.89,189), fixheight)
```

Finally we can apply this function to our data.  We use the `sapply` function to apply our `fixheight` function to each value of `dat$height` and store this in a new column in our data frame.

```{r}
dat$height_numeric <- sapply(dat$height, fixheight)
```


#### Collapsing categories of sex variable

Now we need to write a custom function to collapse the values of `sex` into two categories, Male (`"M"`) and Female (`"F"`), with anything else set to NA (missing).

```{r}
fixsex <- function(x){
   # convert to character
   x <- as.character(x)
   # strip whitespace
   x <- trimws(x)

   # return plain character values; the conversion to a factor happens once,
   # after all values are processed, so that "F", "M", and NA combine correctly
   if(substr(x,1,1) %in% c("f", "F")){
     return("F")
   }else if(substr(x,1,1) %in% c("m", "M")){
     return("M")
   }else{
     return(NA_character_)
   }
}
```

We can now test the function to make sure it outputs only the two sex categories or NA:
```{r}
fixsex("female")
fixsex("Female")
fixsex("M")
fixsex("Prefer not to answer")
```

Finally we can apply this function to our data.  We use the `sapply` function to apply our `fixsex` function to each value of `dat$sex`, convert the result to a factor, and store this in a new column in our data frame.

```{r}
dat$sex_factor <- factor(sapply(dat$sex, fixsex, USE.NAMES = FALSE))
```

Check that this produces the desired results.

```{r}
table(dat$sex_factor)
sum(is.na(dat$sex_factor))
```
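
By default `table` silently drops missing values, which is why the missing values are counted separately with `is.na`. As a sketch on a made-up vector, the `useNA` argument asks `table` to report them as well.

```{r}
answers <- factor(c("F", "M", NA, "F"))

table(answers)                    # NA is not shown
table(answers, useNA = "ifany")   # NA gets its own column
```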

### Exploratory data analysis

Now that our data is in a standard format, we are ready to make some exploratory plots.

#### Boxplot

A boxplot is useful to explore univariate characteristics of a numeric variable.  Here we'll create a boxplot to visualize the distribution of height values.  Do you see any outliers?

```{r}
boxplot(dat$height_numeric, ylab="Height (inches)")
```

We can also make a boxplot to look at the distribution of shoe sizes.

```{r}
boxplot(dat$shoe.size, ylab="Shoe Size")
```
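
The points that a boxplot draws individually are, by default, those lying more than 1.5 times the length of the box beyond the box. The `boxplot.stats` function returns the same numbers without drawing anything. A sketch on made-up heights, one of which is deliberately extreme:

```{r}
toy_heights <- c(60, 62, 65, 66, 68, 70, 120)

bstats <- boxplot.stats(toy_heights)
bstats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bstats$out     # values flagged as outliers: 120
```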

#### Scatterplot

The methods described above relate to a single variable (_univariate_ data). In the
biomedical sciences, it is common to be interested in the relationship
between two or more variables.

A scatterplot is useful to explore the relationship between two numeric variables.  Here we'll create a scatterplot to visualize the relationship between shoe size and height.

```{r}
plot(dat$shoe.size, dat$height_numeric, xlab="Shoe Size", ylab="Height (inches)")
```

How would you describe the relationship between shoe size and height?

#### -side note- apply functions

The `*apply` family of functions is often preferred over _for loops_ for a number of reasons that we won't go into, but do read the help files later on for `lapply`, `sapply`, `tapply`, `mapply`, and `apply`. In short, they are functions that loop over values for you, without an explicit loop.
In the previous sections we used the `sapply` function repeatedly. It is perhaps the most often used of the `apply` functions and quite useful as a for loop replacement. We will describe the `apply` and `sapply` functions and recommend that the reader learn more about the others. These functions are nice in that they do not require pre-allocation of a vector to store results.

- `apply`: Apply Functions Over Array Margins

The `apply` function applies a function over the rows or columns (the _margins_) of an array and returns a vector of results. For example, if we
have a matrix of numbers and want the column or row sums we could use the `apply` function, where the second argument selects rows (`1`) or columns (`2`).

```{r}
x <- matrix(rnorm(100),nrow = 10,ncol=10)
row.sums <- apply(x,1,sum)
col.sums <- apply(x,2,sum)
head(row.sums)
head(col.sums)
```

- `sapply`: Apply a Function over a List or Vector

The `sapply` function loops through a set of values provided and returns a vector, matrix (if appropriately sized), or list.
```{r}
row.sums2 = sapply(1:10,function(i){
  sum(x[i,])
})
# notice the values are the same
head(row.sums2 - row.sums)
```

- `lapply`
- `tapply`
- `mapply`
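
As a brief sketch of the remaining three functions, using made-up values (all object names here are invented for illustration): `lapply` always returns a list, `tapply` applies a function within groups, and `mapply` steps through several vectors in parallel.

```{r}
# lapply: like sapply, but the result is always a list
lapply(list(a = 1:3, b = 4:6), sum)

# tapply: apply a function to values grouped by a second vector
toy_height <- c(70, 64, 68, 72)
toy_sex <- c("M", "F", "F", "M")
tapply(toy_height, toy_sex, mean)     # F: 66, M: 71

# mapply: call a function on the first elements, then the second, and so on
mapply(function(feet, inches) feet * 12 + inches, c(5, 6), c(10, 2))   # 70 74
```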

#### Stratification

We know that men and women tend to have very different heights. It's often useful
to see how values differ between classes of measurements (for example M/F). To start
we will show two ways to split values that can then be either plotted or for which
summary statistics can be calculated.

The `by` function is simply a wrapper to the `tapply` function and takes a `data frame`, `matrix` or `vector` and will stratify by an index and perform a defined function. For example, we can calculate the average height or shoe size split by sex.

```{r}
by(dat$height_numeric,dat$sex_factor,mean)
by(dat$shoe.size,dat$sex_factor,mean)
```

Another option is to `split` the data by a set of indices or factors. The `split`
function divides a `vector` or `data.frame` into the groups of the index or set of factors.

```{r}
vals <- split(dat$height_numeric,dat$sex_factor)
sapply(vals,head)
```

### Plotting by factor

We can now replot the boxplots and scatterplots by sex using the formula operator `~` and the `with` and `subset` functions. Don't forget to look them up.

The two calls in each chunk below are equivalent, so each pair produces the same plot twice. The `~` (formula) notation is accepted by certain functions; here `shoe.size ~ sex_factor` is read as "shoe size grouped by sex".

Plotting the shoe sizes:
```{r}
boxplot(split(dat$shoe.size,dat$sex_factor),main="Shoe size")
boxplot(shoe.size~sex_factor,main="Shoe size",data=dat)
```

Plotting the heights by sex:
```{r}
boxplot(split(dat$height_numeric,dat$sex_factor),main="Height")
boxplot(height_numeric~sex_factor,main="Height",data=dat)
```

We next plot the subsetted data using the `with` function. We could
do this any number of ways.

```{r}
with(data = subset(dat,subset = (sex_factor=="F")),expr = {
  plot(x = shoe.size,y = height_numeric)
  abline(lm(height_numeric~shoe.size))
  })
with(data = subset(dat,subset = (sex_factor=="M")),expr = {
  plot(x = shoe.size,y = height_numeric)
  abline(lm(height_numeric~shoe.size))
  })

# this is equivalent to first subsetting the data and plotting
maledat <- subset(dat,subset=(sex_factor=="M"))
plot(x = maledat$shoe.size,y = maledat$height_numeric)
abline(lm(height_numeric~shoe.size,data=maledat))
```

### Linear regression

With this information, we might perform a statistical test and ask if
there is a significant relationship between shoe size and height. We can either subset by sex or include sex as a covariate if we want.

```{r}
results <- with(dat,
    list(fit1 = lm(height_numeric ~ shoe.size),
         fit2 = lm(height_numeric ~ shoe.size + sex_factor)
         )
    )
results
```

The output above simply shows the _Call_ made by the function _(formula = ...)_ and the fitted coefficient estimates. Using the `summary` function we can see many other fitted values and test for significance, including goodness-of-fit statistics and residuals. Applying the `plot` function to the output of `lm` will provide several diagnostic plots as well: residuals against fitted values, a Scale-Location plot of sqrt(| residuals |) against fitted values, a Normal Q-Q plot, a plot of Cook's distances versus row labels, a plot of residuals against leverages, and a plot of Cook's distances against leverage/(1-leverage). Because the output of `lm` is an object of the _S3_ class `lm` (S3 is a simple class system used throughout R), the plotting options are documented under `?plot.lm`.

_broom_ is one particularly useful package for extracting the statistics from the `lm` or `glm` fitted results. Other useful functions include `t.test`, `kruskal.test`, `aov`, `anova`, and `wilcox.test`.
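
As a sketch of what `summary` adds to a fitted model, the example below uses the built-in `cars` dataset (stopping distance versus speed) so that it runs without the workshop data; `toy_fit` and `toy_summary` are names chosen for illustration.

```{r}
toy_fit <- lm(dist ~ speed, data = cars)

toy_summary <- summary(toy_fit)
coef(toy_summary)        # estimate, std. error, t value, and p-value per term
toy_summary$r.squared    # proportion of variance explained
```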

### Correlation

Correlations are ubiquitous in biological and other applications. If we want to describe one variable in terms of another, or are simply curious whether two variables are correlated, we can make use of the `cor` function. However, correlation is often misunderstood for a number of reasons. For example, it does not detect large shifts in average between two variables, so caution is advised.

Let us generate two independent sets of 100 random draws from a normal distribution with `rnorm`. We do not expect them to be correlated, and we see that they are not.
```{r}
set.seed(1)
x <- rnorm(100)
y <- rnorm(100)
cor(x,y)
cor.test(x,y)
```

What if one pair of observations is an extreme outlier, though?
Notice what happens to the _Pearson_ correlation.

```{r}
y[1] <- x[1] <- 10
plot(x,y)
cor.test(x,y)
cor.test(x,y,method="spearman")
```
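
The Spearman correlation is less affected by the outlier because it is the Pearson correlation computed on the ranks of the values rather than on the values themselves. A small sketch with freshly simulated data (the names `a` and `b` are only for illustration) confirms that the two calculations agree:

```{r}
set.seed(2)
a <- rnorm(100)
b <- rnorm(100)
a[1] <- b[1] <- 10   # add one extreme pair, as above

# Spearman correlation equals Pearson correlation of the ranks
cor(a, b, method = "spearman")
cor(rank(a), rank(b))
```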

## Writing out data

We've cleaned up the data. Now it's time to save all our work. We can write it out with either the `write` or `write.table` function. For simplicity we highlight `write.table`. Notice we write the file out and then read it back in and our cleaned columns have been preserved!

```{r}
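# write to the working directory; write.table separates fields with spaces
# by default, so we use a .txt extension rather than .csv
tmp <- "tmp_file.txt"
write.table(dat,file=tmp)
x <- read.table(tmp)
head(x)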
```
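
If you want a file that spreadsheet programs open directly, `write.csv` and `read.csv` are convenient wrappers that use commas as separators. Here is a sketch of a round trip using a made-up data frame and a temporary file (`tempfile` picks a location that is safe to write to; all object names are invented for illustration):

```{r}
toy_out <- data.frame(height = c(70, 64), sex = c("M", "F"))

csv_path <- tempfile(fileext = ".csv")
write.csv(toy_out, file = csv_path, row.names = FALSE)

toy_in <- read.csv(csv_path)
toy_in
```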

## Resources

Much of the material in this workshop was drawn from Rafael Irizarry's _Introduction to Data Science_ course.

A good place to start learning more about R for those with limited programming is to complete one of the following R programming courses:

* DataCamp's [R course](https://www.datacamp.com/courses/free-introduction-to-r)
* edX's [Introduction to R Programming](https://www.edx.org/course/introduction-r-programming-microsoft-dat204x-0)

Another great tutorial is the [swirl](http://swirlstats.com/) tutorial, which teaches you R programming interactively, at your own pace and in the R console. Once you have R installed, you can install `swirl` and run it the following way:

```{r, eval=FALSE}
install.packages("swirl")
library(swirl)
swirl()
```

Another option is the interactive [try R](http://tryr.codeschool.com/) class from Code School.

There are also many open and free resources and reference
guides for R. Two examples are:

* [Quick-R](http://www.statmethods.net/): a quick online reference for data input, basic statistics and plots
* [R reference card (PDF)](https://cran.r-project.org/doc/contrib/Short-refcard.pdf) by Tom Short

#### More advanced R Resources (from Roger Peng)

Available from CRAN (http://cran.r-project.org)

-   An Introduction to R

-   Writing R Extensions

-   R Data Import/Export

-   R Installation and Administration (mostly for building R from
    sources)

-   R Internals (not for the faint of heart)


#### Some Useful Books on S/R

Standard texts

-   Chambers (2008). *Software for Data Analysis*, Springer. (your
    textbook)

-   Chambers (1998). *Programming with Data*, Springer.

-   Venables & Ripley (2002). *Modern Applied Statistics with S*,
    Springer.

-   Venables & Ripley (2000). *S Programming*, Springer.

-   Pinheiro & Bates (2000). *Mixed-Effects Models in S and S-PLUS*,
    Springer.

-   Murrell (2005). *R Graphics*, Chapman & Hall/CRC Press.

Other resources

-   Springer has a series of books called *Use R!*.

-   A longer list of books is at
    http://www.r-project.org/doc/bib/R-books.html