marioa/R-lesson.md

## R-lesson.md

      
    Raw
  

              R-lesson.md
            
          
    R Ecology Lesson

Contents


Getting started
Introducing R and programming basics
Introduction to R
Managing data in R
Data frames
dplyr
ggplot2

Getting started

Learning Objectives


R projects

Organising files and directories related to a particular set of analyses.


Define the following (as they apply to R):

Script,
function,
working directory,
assign,
object,
variable


Describe the purpose of:

an RStudio script,
console,
environment, and
plot windows


Assigning values to variables
Call functions with zero or more named or unnamed arguments
Use the built-in RStudio help interface to search for more information on R functions
Ask for help from the R user community, providing sufficient information for the problem to be reproduced and troubleshooted

Introducing R and programming basics

RStudio


working directories

use relative paths thereafter


Start R-studio,

File > New Project > New directory-> data-carpentry
Files tab > New folder > data
File > New File > R script > data-carpentry-script.R


General organisation of your working directory

data
scripts
documents


Move data-carpentry-script.R to scripts
R-studio

IDE
Distributed for free under the Affero GNU Public Licence & Commercial licences
Four main panes:

editor

Best for reproducibility


console
environment/history
files/plots/packages/help/viewer


Programming is like baking a case - have a series of instructions to follow

# This is a comment
print("Hello World!")


Use of the text editor and console
Use of Cnt-1 and Cnt-2
Use of Cnt-Enter
Use of the + continuation marker

Use of Esc to get out of the continuation mode


Basics of R


Open Source, released under GPL Version 2
Widely used, almost 9k contributed packages (see CRAN)
Cross platform
Not just for statistics, can be used for general programming
R works as an object-oriented & functional programming language.
Big community

R syntax


Use of # for comments in R
Use of assignment in R, x<-3 (ALT+- in Studio gives you the assignment)

Can use = but best for labelling options in functions


Use of functions sqrt(4),x<-sqrt(4), myadd<-function(x,y){return(x+y)}
Arguments can be numbers, characters, objects, optional (take default values)

round(3.14159)
args(round)
?round
help("round")
round(3.14159,2) # Using positional arguments
round(digits=2,x=3.14159)

Seeking help


?barplot or help("barplot")
Search ??barplot
Googling; Stackoverflow with [r] tag
Asking for help;

Try to come up with the smallest use case that replicates your problem, including sample data.
Useful to add the type of framework that you are using: sessionInfo().


Introduction to R

Learning objectives


Familiarize participants with R syntax
Understand the concepts of objects and assignment
Understand the concepts of vector and data types
Get exposed to a few functions

Creating objects


Can do maths on the console

3+5
12/7
weight_kg <- 55    # picking object names: start with a letter, 
                   # cannot be a reserved name (`?reserved`), 
                   # function names (lead to confusion)
(weight_kg <- 55)  # Will print the value
weight_kg          # This will print it too
2.2 * weight_kg
weight_kg <- 57.5
2.2 * weight_kg
weight_lb <- 2.2 * weight_kg
weight_kg <- 100
weight_lb           # What will this give?

# Challenge
mass <- 47.5            # mass?
age  <- 122             # age?
mass <- mass * 2.0      # mass?
age  <- age - 20        # age?
mass_index <- mass/age  # mass_index?

Vectors and data types

weight_g <- c(50, 60, 65, 82)    # elements of the same type
weight_g

animals <- c("mouse", "rat", "dog")
animals

length(weight_g)
length(animals)

class(weight_g)   # find out the class/type of the elements
class(animals)

weight_g <- c(weight_g, 90) # adding at the end of the vector
weight_g <- c(30, weight_g) # adding at the beginning of the vector
weight_g

There are six atomic vector types:

character and numeric which we have met
logical for TRUE and FALSE (the boolean data type)
integer for integer numbers (e.g., 2L, the L indicates to R that it’s an integer)
complex to represent complex numbers with real and imaginary parts (e.g., 1+4i) and that’s all we’re going to say about them
raw that we won’t discuss further (storing raw bytes)

Data types in R: vector, matrix, data.frame, list.
# Challenge
num_char <- c(1, 2, 3, 'a')
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c('a', 'b', 'c', TRUE)
tricky <- c(1, 2, 3, '4')


logical > integer > complex > character

Subsetting vectors

animals <- c("mouse", "rat", "dog", "cat")
animals[2] # Indices start at 1

animals[c(3, 2)]

more_animals <- animals[c(1, 2, 3, 2, 1, 4)]
more_animals


Conditional subsetting

weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)]

weight_g > 50    # returns logicals for the indices that meet the condition

weight_g[weight_g > 50] # can use to select values above 50

weight_g[weight_g < 30 | weight_g > 50] # Combining tests with | or &

weight_g[weight_g >= 30 & weight_g == 21] # Oops

animals <- c("mouse", "rat", "dog", "cat")   # Use of %>%
animals[animals == "cat" | animals == "rat"] # returns both rat and cat

animals %in% c("rat", "cat", "dog", "duck")

animals[animals %in% c("rat", "cat", "dog", "duck")]


Challenges


Can you figure out why "four" > "five" returns TRUE?
if ! is the negation operator how would you return everything that is not a rat,cat, dog or duck?

Missing data

alphabet <- c("a", "b", "c", NA, "d", "e", "f")

heights <- c(2, 4, 4, NA, 6)
mean(heights)
max(heights)
mean(heights, na.rm = TRUE)
max(heights, na.rm = TRUE)

is.na(heights)
heights[!is.na(heights)]

complete.cases(heights)
heights[complete.cases(heights)] # Missing out na.omit(heights)

Challenge


Why does the following give an error:

sample <- c(2, 4, 4, "NA", 6)
mean(sample, na.rm = TRUE)


Why does the error message say the argument is not numeric?

Managing data in R

Learning objectives


Load external data (CSV files) in memory using the survey table (surveys.csv) as an example
explore the structure and the content of a data frame in R
understand what factors are and how to manipulate them

Presentation of Survey data


Study species and weight of animals caught in plots in a survey area.
Rows represent animals
13 Columns give:

Column	     Description
record_id	 Unique id for the observation
month	     month of observation
day	         day of observation
year	     year of observation
plot_id	     ID of a particular plot
species_id	 2-letter code
sex	         sex of animal (“M”, “F”)
hindfoot_    length	length of the hindfoot in mm
weight	     weight of the animal in grams
genus	     genus of animal
species	     species of animal
taxa	     e.g. Rodent, Reptile, Bird, Rabbit
plot_type	 type of plot

Download data and add read it into Rstudio:
download.file("https://ndownloader.figshare.com/files/2292169",
              "data/portal_data_joined.csv")

surveys <- read.csv('data/portal_data_joined.csv')

surveys

head(surveys)

Have a data.frame - table, columns are vectors (the same data type) and all of the same length.
str(surveys) # Look at the structure

Challenges

Based on the output of str(surveys), can you answer the following questions?

What is the class of the object surveys?
How many rows and how many columns are in this object?
How many species have been recorded during these surveys?

Factors


Factors represent categorical data.
Labels with integers.
Look like character vectors but have integers under the hood.
Can be ordered or unordered.

sex <- factor(c("male", "female", "female", "male"))
sex   # female level 1, male level 2 - alphabetical

levels(sex) 
nlevels(sex)

food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(food) # you want an order

food <- factor(food, levels=c("low", "medium", "high"))
levels(food)

min(food) ## doesn't work

food <- factor(food, levels=c("low", "medium", "high"), ordered=TRUE)
levels(food)
min(food) ## works!

# Converting factors
f <- factor(c(1, 5, 10, 2))
f
as.numeric(f)               # wrong! position of levels - dput(f)
as.numeric(as.character(f)) # works...
as.numeric(levels(f))[f]    # The recommended way.

Challenge

exprmt <- factor(c("treat1", "treat2", "treat1", "treat3", "treat1", "control",
                   "treat1", "treat2", "treat3"))
table(exprmt)
barplot(table(exprmt))

Want you to:

put control last.
colour in the bars red
label the axes "frequency","experiment"

exprmt <- factor(exprmt,levels=c("treat1","treat2","treat3","control"))
barplot(table(exprmt),col="red",xlab="experiment",ylab="frequency")

Data frames

Learning objectives


understand the concept of a data.frame
use sequences
know how to access any element of a data.frame

What are data.frames


tabular structure
collection of identical length vectors, can be of different types
can use str to look at the structure
can be constructed by hand but usually the default with read.table, read.csv
by default character data read as factors, sometimes usually don't want that

some_data <- read.csv("data/some_file.csv", stringsAsFactors=FALSE)

# can create data frame
example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                           feel=c("furry", "furry", "squishy", "spiny"),
                           weight=c(45, 8, 1.1, 0.8))
str(example_data)

example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                           feel=c("furry", "furry", "squishy", "spiny"),
                           weight=c(45, 8, 1.1, 0.8),stringsAsFactors = FALSE)
str(example_data)

color <- c("red", "green", "blue", "yellow")
counts <- c(50, 60, 65, 82)
new_dataframe <- data.frame(colors = color, counts = counts)

class(new_dataframe)

Inspecting data.frame Objects

Use surveys to demonstrate:

Size:

dim() - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
nrow() - returns the number of rows
ncol() - returns the number of columns


Content:

head() - shows the first 6 rows
tail() - shows the last 6 rows


Names:

names() - returns the column names (synonym of colnames() for data.frame objects)
rownames() - returns the row names


Summary:

str() - structure of the object and information about the class, length and content of each column


Indexing, Sequences, and Subsetting

1:10
10:1

seq(1, 10, by=2) # More complex sequencing
seq(5, 10, length.out=3)
seq(50, by=5, length.out=10)
seq(1, 8, by=3) # sequence stops to stay below upper limit

surveys[1]      # first column in the data frame (as a data.frame)
surveys[1, 1]   # first element in the first column of the data frame (as a vector)
surveys[1, 6]   # first element in the 6th column (as a vector)
surveys[1:3, 7] # first three elements in the 7th column (as a vector)
surveys[3, ]    # the 3rd element for all columns (as a data.frame)
surveys[, 8]    # the entire 8th column (as a vector)
head_surveys <- surveys[1:6, ] # equivalent to head(surveys)

surveys["species_id"]       # Result is a data.frame
surveys[, "species_id"]     # Result is a vector
surveys[["species_id"]]     # Result is a vector
surveys$species_id          # Result is a vector

surveys$d  # partial matching - DANGEROUS

Challenges


The function nrow() on a data.frame returns the number of rows. Use it, in conjuction with seq() to create a new data.frame called surveys_by_10 that includes every 10th row of the survey data frame starting at row 10 (10, 20, 30, …)


Create a data.frame containing only the observations from row 1999 of the surveys dataset.


Notice how nrow() gave you the number of rows in a data.frame? Use nrow() instead of a row number to make a data.frame with observations from only the last row of the surveys dataset.


Now that you’ve seen how nrow() can be used to stand in for a row index, let’s combine that behaviour with the - notation above to reproduce the behaviour of head(surveys) excluding the 7th through to the final row of the surveys dataset.


dplyr

Learning objectives


Learn basic utilities of the dplyr package
Select and filter data
Be able to use magrittr pipes
Create new columns with mutate()
Use the split-apply-combine paradigm to summarize data
Export data with write.csv()

Data Manipulation using dplyr

dplyr is a package. Loaded with the library command. Need to install it first. Uses C++ under the bonnet. Can work with databases.
install.packages("dplyr")  # May have be asked to use a mirror
library("dplyr")           # load the package

select(surveys, plot_id, species_id, weight) # select columns

filter(surveys, year == 1995) # select rows


What about selecting and filtering at the same time?

use intermediate steps,
nested functions,
or pipes

surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)
  
surveys_sml <- surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)  # or use ->

surveys_sml

Challenge


Using pipes, subset the data to include individuals collected before 1995, and retain the columns year, sex, and weight.

surveys %>% filter(year < 1995) %>% select(year, sex, weight)

Mutate

Create new or change columns use mutate.
surveys %>%            # Change from g to kg
  mutate(weight_kg = weight/1000)
  
surveys %>%       # Limit the output
  mutate(weight_kg = weight / 1000) %>%
  head            # Note the NAs
  
surveys %>%             # remove the NAs
  filter(!is.na(weight)) %>%
  mutate(weight_kg = weight / 1000) %>%
  head

Challenge

Create a new dataframe from the survey data that meets the following criteria:

contains only the species_id column and a column that contains values that are half the hindfoot_length values (e.g. a new column hindfoot_half).
In this hindfoot_half column, there are no NA values and all values are < 30.

Hint: think about how the commands should be ordered to produce this data frame!
surveys %>%
  filter(!is.na(hindfoot_length) & hindfoot_length < 60) %>%
  mutate(hindfoot_half=0.5*hindfoot_length) %>%
  select(species_id, hindfoot_half) %>%
  head

Split-apply-combine data analysis and the summarize() function

A lot of data analysis tasks can be approached using a split-apply-combine paradigm:

split the data into groups,
apply some analysis to each group, and then
combine the results.

dplyr makes this very easy through the use of the group_by() and summarize() function.

group_by() - splits categorical data into groups.
summarize() - collapses each group into a single-row summary of that group.

surveys %>%
  group_by(sex) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE))
  

surveys %>%  # Group by sex and species
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE))
  
surveys %>%       # Remove missing weights (note na.rm removed as well)
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight))  # data.frame -> tbl_df
  

surveys %>%    # Print out more rows & remove missing sex
  filter(!is.na(weight) & sex != "") %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight)) %>%
  print(n=15)
  
surveys %>%   # Summarize by one more than one variable
  filter(!is.na(weight) & sex != "") %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight),
            min_weight = min(weight),count=n()) # n() can be used 
                                                # in mutate, 
                                                # filter, 
                                                # summarise
surveys %>%   # Can use tally
  group_by(sex) %>%
  tally()
  

Challenges


How many individuals were caught in each plot_type surveyed?

surveys %>% group_by(plot_type,species) %>% summarize(number=n())

surveys %>% group_by(plot_type,species) %>% tally()


Use group_by() and summarize() to find the mean, min, and max hindfoot length for each species (using species_id).

surveys %>% 
        filter(!is.na(hindfoot_length)) %>% 
        group_by(species_id) %>% 
        summarize(mean=mean(hindfoot_length),
                  max=max(hindfoot_length),
                  min=min(hindfoot_length))


What was the heaviest animal measured in each year? Return the columns year, genus, species_id, and weight.

surveys %>% 
        filter(!is.na(weight)) %>% 
        group_by(year) %>%
        filter(weight==max(weight)) %>%
        select(year,genus,species_id,weight) 
        # pipe through arrange() to sort

Exporting data

Export data to a data_out directory.

Create the data_output directory.

Going to use write.csv() clean the data first.
surveys_complete <- surveys %>%    # remove missing values
  filter(species_id != "",         # remove missing species_id
         !is.na(weight),           # remove missing weight
         !is.na(hindfoot_length),  # remove missing hindfoot_length
         sex != "")                # remove missing sex
         
species_counts <- surveys_complete %>%   # extract the most common species
                  group_by(species_id) %>%
                  tally %>%
                  filter(n >= 50) %>%
                  select(species_id)

surveys_complete <- surveys_complete %>% # keep most common species
                 filter(species_id %in% species_counts$species_id)
                 
dim(surveys_complete) # Should have 30463 rows and 13 columns 

# Write to CSV
write.csv(surveys_complete, file="data_output/surveys_complete.csv",
          row.names=FALSE)

ggplot2

Learning Objectives


Visualize some of the mammals data from Figshare surveys.csv.
Understand how to plot these data using R ggplot2 package.
Building step by step complex plots with the ggplot2 package.

Load the required packages:
library(ggplot2)
library(dplyr)

Plotting with ggplot2

ggplot2 can generate complex publication ready plots from data frames.
ggplot(data=surveys_complete) + # bind plot to specific data frame

# add aesthetics
ggplot(data=surveys_complete,aes(x=weight,y=hindfoot_length)))

# specify a geometry
ggplot(data=surveys_complete,aes(x=weight,y=hindfoot_length)) +
geom_point()

# Create
surveys_plot <- ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length))

# Draw the plot
surveys_plot + geom_point()


Typically build plots iteratively:

define a data set
choose a geometry
play with it until it is right

# add transparency to avoid over plotting
base <- ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length))
 
base + geom_point(alpha = 0.1)
    
# add colour
base + geom_point(alpha = 0.1, color = "blue")

# colour by species
base + geom_point(alpha = 0.1, aes(color=species_id))
    
# Boxplot
# Visualising the distribution of weight within each species.
ggplot(data = surveys_complete, aes(x = species_id, y = hindfoot_length)) +
    geom_boxplot()
    
   
# Add jitter
ggplot(data = surveys_complete, aes(x = species_id, y = hindfoot_length)) +
    geom_boxplot(alpha = 0) +
    geom_jitter(alpha = 0.3, color = "tomato")     

Challenges


Replace the box plot with a violin plot; see geom_violin().
Represent weight on the log10 scale; see scale_y_log10().
Create boxplot for hindfoot_length.

Plotting time series data

Plot the counts per year for each species.
# Group by year and species_id and count
yearly_counts <- surveys_complete %>%
                 group_by(year, species_id) %>%
                 tally

# plot                 
ggplot(data = yearly_counts, aes(x = year, y = n)) +
     geom_line()
     
# do it by species by grouping
ggplot(data = yearly_counts, aes(x = year, y = n, group = species_id)) +
    geom_line()

# Distinguish species by colour
ggplot(data = yearly_counts, 
       aes(x = year, y = n, group = species_id, 
       colour = species_id)) +
    geom_line()

Faceting

Break a single plots into multiple plots based on a factor included in the dataset.
ggplot(data = yearly_counts, aes(x = year, y = n, group = species_id, colour = species_id)) +
    geom_line() +
    facet_wrap(~ species_id)

Want to split each plot by gender. To do that we need to make counts in data frame grouped by year, species_id, and sex.
yearly_sex_counts <- surveys_complete %>%
                      group_by(year, species_id, sex) %>%
                      tally
                      
ggplot(data = yearly_sex_counts, 
       aes(x = year, y = n, color = species_id, group = sex)) +
       geom_line() +
       facet_wrap(~ species_id)

# Set background to white (easer to read) 
ggplot(data = yearly_sex_counts, 
       aes(x = year, y = n, color = species_id, group = sex)) +
       geom_line() +
       facet_wrap(~ species_id) +
       theme_bw()

# Colour by sex and not gender
ggplot(data = yearly_sex_counts, 
       aes(x = year, y = n, color = sex, group = sex)) +
       geom_line() +
       facet_wrap(~ species_id) +
       theme_bw()                       

Challenge


Use what you just learned to create a plot that depicts how the average weight of each species changes through the years.

yearly_sex_weight <- surveys_complete %>%
    group_by(year, sex, species_id) %>%
    summarize(avg_weight = mean(weight))
    
ggplot(data = yearly_sex_weight, 
       aes(x=year, y=avg_weight, color = species_id, 
       group = species_id)) +
       geom_line() +
       facet_grid(sex ~ .)  # rows ~ columns; a . can be used as a 
                            # placeholder that indicates only one 
                            # row or column

# One row, facet by column
ggplot(data = yearly_sex_weight, aes(x=year, y=avg_weight, color = species_id, group = species_id)) +
    geom_line() +
    facet_grid(. ~ sex)

Customisation

Look at:

https://www.rstudio.com/wp-content/uploads/2015/08/ggplot2-cheatsheet.pdf

# Add a title and labels
ggplot(data = yearly_sex_counts, 
       aes(x = year, y = n, color = sex, group = sex)) +
       geom_line() +
       facet_wrap(~ species_id) +
       labs(title = 'Observed species in time',
            x = 'Year of observation',
            y = 'Number of species') +
    theme_bw()
    
# Change font size and font
ggplot(data = yearly_sex_counts, 
       aes(x = year, y = n, color = sex, group = sex)) +
       geom_line() +
       facet_wrap(~ species_id) +
       labs(title = 'Observed species in time',
            x = 'Year of observation',
            y = 'Number of species') +
       theme_bw() +
       theme(text=element_text(size=16, family="Arial"))

# Change the orientation of the labels
ggplot(data = yearly_sex_counts, 
       aes(x = year, y = n, color = sex, group = sex)) +
       geom_line() +
       facet_wrap(~ species_id) +
       labs(title = 'Observed species in time',
           x = 'Year of observation',
           y = 'Number of species') +
       theme_bw() +
       theme(axis.text.x = element_text(colour="grey20", size=12, 
                                    angle=90, hjust=.5, vjust=.5),
             axis.text.y = element_text(colour="grey20", size=12),
             text=element_text(size=16, family="Arial"))
             
# Save the theme
arial_grey_theme <- theme(
     axis.text.x = element_text(colour="grey20", size=12, 
                                angle=90, hjust=.5, vjust=.5),
     axis.text.y = element_text(colour="grey20", size=12),
     text=element_text(size=16, family="Arial"))

# Reuse
ggplot(surveys_complete, 
       aes(x = species_id, y = hindfoot_length)) +
       geom_boxplot() +
       arial_grey_theme
       
# Saving a plot
my_plot <- ggplot(data = yearly_sex_counts, 
                 aes(x = year, y = n, color = sex, group = sex)) +
            geom_line() +
            facet_wrap(~ species_id) +
            labs(title = 'Observed species in time',
                 x = 'Year of observation',
                 y = 'Number of species') +
            theme_bw() +
            theme(axis.text.x = element_text(colour="grey20", 
                  size=12, angle=90, hjust=.5, vjust=.5),
            axis.text.y = element_text(colour="grey20", size=12),
            text=element_text(size=16, family="Arial"))

# Now save the plot            
ggsave("name_of_file.png", my_plot, width=15, height=10)       


This work is licensed under a 
Creative Commons Attribution 4.0 International License.