Skip to content

Instantly share code, notes, and snippets.

@marioa
Last active June 13, 2019 08:26
Show Gist options
  • Save marioa/d1627fb8b720491a591ac132a2eccf79 to your computer and use it in GitHub Desktop.
Save marioa/d1627fb8b720491a591ac132a2eccf79 to your computer and use it in GitHub Desktop.
Data Carpentry ecology R lesson instructor script

R Ecology Lesson

Contents

Getting started

Learning Objectives

  • R projects
    • Organising files and directories related to a particular set of analyses.
  • Define the following (as they apply to R):
    • Script,
    • function,
    • working directory,
    • assign,
    • object,
    • variable
  • Describe the purpose of:
    • an RStudio script,
    • console,
    • environment, and
    • plot windows
  • Assigning values to variables
  • Call functions with zero or more named or unnamed arguments
  • Use the built-in RStudio help interface to search for more information on R functions
  • Ask for help from the R user community, providing sufficient information for the problem to be reproduced and troubleshooted

Introducing R and programming basics

RStudio

  1. working directories
    • use relative paths thereafter
  2. Start R-studio,
    • File > New Project > New directory-> data-carpentry
    • Files tab > New folder > data
    • File > New File > R script > data-carpentry-script.R
  3. General organisation of your working directory
    • data
    • scripts
    • documents
  4. Move data-carpentry-script.R to scripts
  5. R-studio
    • IDE
    • Distributed for free under the Affero GNU Public Licence & Commercial licences
    • Four main panes:
      • editor
        • Best for reproducibility
      • console
      • environment/history
      • files/plots/packages/help/viewer
  6. Programming is like baking a case - have a series of instructions to follow
# This is a comment
print("Hello World!")
  • Use of the text editor and console
  • Use of Cnt-1 and Cnt-2
  • Use of Cnt-Enter
  • Use of the + continuation marker
    • Use of Esc to get out of the continuation mode

Basics of R

  • Open Source, released under GPL Version 2
  • Widely used, almost 9k contributed packages (see CRAN)
  • Cross platform
  • Not just for statistics, can be used for general programming
  • R works as an object-oriented & functional programming language.
  • Big community

R syntax

  • Use of # for comments in R
  • Use of assignment in R, x<-3 (ALT+- in Studio gives you the assignment)
    • Can use = but best for labelling options in functions
  • Use of functions sqrt(4),x<-sqrt(4), myadd<-function(x,y){return(x+y)}
  • Arguments can be numbers, characters, objects, optional (take default values)
round(3.14159)
args(round)
?round
help("round")
round(3.14159,2) # Using positional arguments
round(digits=2,x=3.14159)

Seeking help

  • ?barplot or help("barplot")
  • Search ??barplot
  • Googling; Stackoverflow with [r] tag
  • Asking for help;
    • Try to come up with the smallest use case that replicates your problem, including sample data.
    • Useful to add the type of framework that you are using: sessionInfo().

Introduction to R

Learning objectives

  • Familiarize participants with R syntax
  • Understand the concepts of objects and assignment
  • Understand the concepts of vector and data types
  • Get exposed to a few functions

Creating objects

  • Can do maths on the console
3+5
12/7
weight_kg <- 55    # picking object names: start with a letter, 
                   # cannot be a reserved name (`?reserved`), 
                   # function names (lead to confusion)
(weight_kg <- 55)  # Will print the value
weight_kg          # This will print it too
2.2 * weight_kg
weight_kg <- 57.5
2.2 * weight_kg
weight_lb <- 2.2 * weight_kg
weight_kg <- 100
weight_lb           # What will this give?

# Challenge
mass <- 47.5            # mass?
age  <- 122             # age?
mass <- mass * 2.0      # mass?
age  <- age - 20        # age?
mass_index <- mass/age  # mass_index?

Vectors and data types

weight_g <- c(50, 60, 65, 82)    # elements of the same type
weight_g

animals <- c("mouse", "rat", "dog")
animals

length(weight_g)
length(animals)

class(weight_g)   # find out the class/type of the elements
class(animals)

weight_g <- c(weight_g, 90) # adding at the end of the vector
weight_g <- c(30, weight_g) # adding at the beginning of the vector
weight_g

There are six atomic vector types:

  • character and numeric which we have met
  • logical for TRUE and FALSE (the boolean data type)
  • integer for integer numbers (e.g., 2L, the L indicates to R that it’s an integer)
  • complex to represent complex numbers with real and imaginary parts (e.g., 1+4i) and that’s all we’re going to say about them
  • raw that we won’t discuss further (storing raw bytes)

Data types in R: vector, matrix, data.frame, list.

# Challenge
num_char <- c(1, 2, 3, 'a')
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c('a', 'b', 'c', TRUE)
tricky <- c(1, 2, 3, '4')
  • logical > integer > complex > character

Subsetting vectors

animals <- c("mouse", "rat", "dog", "cat")
animals[2] # Indices start at 1

animals[c(3, 2)]

more_animals <- animals[c(1, 2, 3, 2, 1, 4)]
more_animals

Conditional subsetting

weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)]

weight_g > 50    # returns logicals for the indices that meet the condition

weight_g[weight_g > 50] # can use to select values above 50

weight_g[weight_g < 30 | weight_g > 50] # Combining tests with | or &

weight_g[weight_g >= 30 & weight_g == 21] # Oops

animals <- c("mouse", "rat", "dog", "cat")   # Use of %>%
animals[animals == "cat" | animals == "rat"] # returns both rat and cat

animals %in% c("rat", "cat", "dog", "duck")

animals[animals %in% c("rat", "cat", "dog", "duck")]


Challenges

  • Can you figure out why "four" > "five" returns TRUE?
  • if ! is the negation operator how would you return everything that is not a rat,cat, dog or duck?

Missing data

alphabet <- c("a", "b", "c", NA, "d", "e", "f")

heights <- c(2, 4, 4, NA, 6)
mean(heights)
max(heights)
mean(heights, na.rm = TRUE)
max(heights, na.rm = TRUE)

is.na(heights)
heights[!is.na(heights)]

complete.cases(heights)
heights[complete.cases(heights)] # Missing out na.omit(heights)

Challenge

  • Why does the following give an error:
sample <- c(2, 4, 4, "NA", 6)
mean(sample, na.rm = TRUE)
  • Why does the error message say the argument is not numeric?

Managing data in R

Learning objectives

  • Load external data (CSV files) in memory using the survey table (surveys.csv) as an example
  • explore the structure and the content of a data frame in R
  • understand what factors are and how to manipulate them

Presentation of Survey data

  • Study species and weight of animals caught in plots in a survey area.
  • Rows represent animals
  • 13 Columns give:
Column	     Description
record_id	 Unique id for the observation
month	     month of observation
day	         day of observation
year	     year of observation
plot_id	     ID of a particular plot
species_id	 2-letter code
sex	         sex of animal (“M”, “F”)
hindfoot_    length	length of the hindfoot in mm
weight	     weight of the animal in grams
genus	     genus of animal
species	     species of animal
taxa	     e.g. Rodent, Reptile, Bird, Rabbit
plot_type	 type of plot

Download data and add read it into Rstudio:

download.file("https://ndownloader.figshare.com/files/2292169",
              "data/portal_data_joined.csv")

surveys <- read.csv('data/portal_data_joined.csv')

surveys

head(surveys)

Have a data.frame - table, columns are vectors (the same data type) and all of the same length.

str(surveys) # Look at the structure

Challenges

Based on the output of str(surveys), can you answer the following questions?

  • What is the class of the object surveys?
  • How many rows and how many columns are in this object?
  • How many species have been recorded during these surveys?

Factors

  • Factors represent categorical data.
  • Labels with integers.
  • Look like character vectors but have integers under the hood.
  • Can be ordered or unordered.
sex <- factor(c("male", "female", "female", "male"))
sex   # female level 1, male level 2 - alphabetical

levels(sex) 
nlevels(sex)

food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(food) # you want an order

food <- factor(food, levels=c("low", "medium", "high"))
levels(food)

min(food) ## doesn't work

food <- factor(food, levels=c("low", "medium", "high"), ordered=TRUE)
levels(food)
min(food) ## works!

# Converting factors
f <- factor(c(1, 5, 10, 2))
f
as.numeric(f)               # wrong! position of levels - dput(f)
as.numeric(as.character(f)) # works...
as.numeric(levels(f))[f]    # The recommended way.

Challenge

exprmt <- factor(c("treat1", "treat2", "treat1", "treat3", "treat1", "control",
                   "treat1", "treat2", "treat3"))
table(exprmt)
barplot(table(exprmt))

Want you to:

  • put control last.
  • colour in the bars red
  • label the axes "frequency","experiment"
exprmt <- factor(exprmt,levels=c("treat1","treat2","treat3","control"))
barplot(table(exprmt),col="red",xlab="experiment",ylab="frequency")

Data frames

Learning objectives

  • understand the concept of a data.frame
  • use sequences
  • know how to access any element of a data.frame

What are data.frames

  • tabular structure
  • collection of identical length vectors, can be of different types
  • can use str to look at the structure
  • can be constructed by hand but usually the default with read.table, read.csv
  • by default character data read as factors, sometimes usually don't want that
some_data <- read.csv("data/some_file.csv", stringsAsFactors=FALSE)

# can create data frame
example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                           feel=c("furry", "furry", "squishy", "spiny"),
                           weight=c(45, 8, 1.1, 0.8))
str(example_data)

example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                           feel=c("furry", "furry", "squishy", "spiny"),
                           weight=c(45, 8, 1.1, 0.8),stringsAsFactors = FALSE)
str(example_data)

color <- c("red", "green", "blue", "yellow")
counts <- c(50, 60, 65, 82)
new_dataframe <- data.frame(colors = color, counts = counts)

class(new_dataframe)

Inspecting data.frame Objects

Use surveys to demonstrate:

  • Size:
    • dim() - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
    • nrow() - returns the number of rows
    • ncol() - returns the number of columns
  • Content:
    • head() - shows the first 6 rows
    • tail() - shows the last 6 rows
  • Names:
    • names() - returns the column names (synonym of colnames() for data.frame objects)
    • rownames() - returns the row names
  • Summary:
    • str() - structure of the object and information about the class, length and content of each column

Indexing, Sequences, and Subsetting

1:10
10:1

seq(1, 10, by=2) # More complex sequencing
seq(5, 10, length.out=3)
seq(50, by=5, length.out=10)
seq(1, 8, by=3) # sequence stops to stay below upper limit

surveys[1]      # first column in the data frame (as a data.frame)
surveys[1, 1]   # first element in the first column of the data frame (as a vector)
surveys[1, 6]   # first element in the 6th column (as a vector)
surveys[1:3, 7] # first three elements in the 7th column (as a vector)
surveys[3, ]    # the 3rd element for all columns (as a data.frame)
surveys[, 8]    # the entire 8th column (as a vector)
head_surveys <- surveys[1:6, ] # equivalent to head(surveys)

surveys["species_id"]       # Result is a data.frame
surveys[, "species_id"]     # Result is a vector
surveys[["species_id"]]     # Result is a vector
surveys$species_id          # Result is a vector

surveys$d  # partial matching - DANGEROUS

Challenges

  1. The function nrow() on a data.frame returns the number of rows. Use it, in conjuction with seq() to create a new data.frame called surveys_by_10 that includes every 10th row of the survey data frame starting at row 10 (10, 20, 30, …)

  2. Create a data.frame containing only the observations from row 1999 of the surveys dataset.

  3. Notice how nrow() gave you the number of rows in a data.frame? Use nrow() instead of a row number to make a data.frame with observations from only the last row of the surveys dataset.

  4. Now that you’ve seen how nrow() can be used to stand in for a row index, let’s combine that behaviour with the - notation above to reproduce the behaviour of head(surveys) excluding the 7th through to the final row of the surveys dataset.

dplyr

Learning objectives

  • Learn basic utilities of the dplyr package
  • Select and filter data
  • Be able to use magrittr pipes
  • Create new columns with mutate()
  • Use the split-apply-combine paradigm to summarize data
  • Export data with write.csv()

Data Manipulation using dplyr

dplyr is a package. Loaded with the library command. Need to install it first. Uses C++ under the bonnet. Can work with databases.

install.packages("dplyr")  # May have be asked to use a mirror
library("dplyr")           # load the package

select(surveys, plot_id, species_id, weight) # select columns

filter(surveys, year == 1995) # select rows

What about selecting and filtering at the same time?

  • use intermediate steps,
  • nested functions,
  • or pipes
surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)
  
surveys_sml <- surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)  # or use ->

surveys_sml

Challenge

  • Using pipes, subset the data to include individuals collected before 1995, and retain the columns year, sex, and weight.
surveys %>% filter(year < 1995) %>% select(year, sex, weight)

Mutate

Create new or change columns use mutate.

surveys %>%            # Change from g to kg
  mutate(weight_kg = weight/1000)
  
surveys %>%       # Limit the output
  mutate(weight_kg = weight / 1000) %>%
  head            # Note the NAs
  
surveys %>%             # remove the NAs
  filter(!is.na(weight)) %>%
  mutate(weight_kg = weight / 1000) %>%
  head

Challenge

Create a new dataframe from the survey data that meets the following criteria:

  • contains only the species_id column and a column that contains values that are half the hindfoot_length values (e.g. a new column hindfoot_half).
  • In this hindfoot_half column, there are no NA values and all values are < 30.

Hint: think about how the commands should be ordered to produce this data frame!

surveys %>%
  filter(!is.na(hindfoot_length) & hindfoot_length < 60) %>%
  mutate(hindfoot_half=0.5*hindfoot_length) %>%
  select(species_id, hindfoot_half) %>%
  head

Split-apply-combine data analysis and the summarize() function

A lot of data analysis tasks can be approached using a split-apply-combine paradigm:

  • split the data into groups,
  • apply some analysis to each group, and then
  • combine the results.

dplyr makes this very easy through the use of the group_by() and summarize() function.

  • group_by() - splits categorical data into groups.
  • summarize() - collapses each group into a single-row summary of that group.
surveys %>%
  group_by(sex) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE))
  

surveys %>%  # Group by sex and species
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE))
  
surveys %>%       # Remove missing weights (note na.rm removed as well)
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight))  # data.frame -> tbl_df
  

surveys %>%    # Print out more rows & remove missing sex
  filter(!is.na(weight) & sex != "") %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight)) %>%
  print(n=15)
  
surveys %>%   # Summarize by one more than one variable
  filter(!is.na(weight) & sex != "") %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight),
            min_weight = min(weight),count=n()) # n() can be used 
                                                # in mutate, 
                                                # filter, 
                                                # summarise
surveys %>%   # Can use tally
  group_by(sex) %>%
  tally()
  

Challenges

  • How many individuals were caught in each plot_type surveyed?
surveys %>% group_by(plot_type,species) %>% summarize(number=n())

surveys %>% group_by(plot_type,species) %>% tally()
  • Use group_by() and summarize() to find the mean, min, and max hindfoot length for each species (using species_id).
surveys %>% 
        filter(!is.na(hindfoot_length)) %>% 
        group_by(species_id) %>% 
        summarize(mean=mean(hindfoot_length),
                  max=max(hindfoot_length),
                  min=min(hindfoot_length))
  • What was the heaviest animal measured in each year? Return the columns year, genus, species_id, and weight.
surveys %>% 
        filter(!is.na(weight)) %>% 
        group_by(year) %>%
        filter(weight==max(weight)) %>%
        select(year,genus,species_id,weight) 
        # pipe through arrange() to sort

Exporting data

Export data to a data_out directory.

  • Create the data_output directory.

Going to use write.csv() clean the data first.

surveys_complete <- surveys %>%    # remove missing values
  filter(species_id != "",         # remove missing species_id
         !is.na(weight),           # remove missing weight
         !is.na(hindfoot_length),  # remove missing hindfoot_length
         sex != "")                # remove missing sex
         
species_counts <- surveys_complete %>%   # extract the most common species
                  group_by(species_id) %>%
                  tally %>%
                  filter(n >= 50) %>%
                  select(species_id)

surveys_complete <- surveys_complete %>% # keep most common species
                 filter(species_id %in% species_counts$species_id)
                 
dim(surveys_complete) # Should have 30463 rows and 13 columns 

# Write to CSV
write.csv(surveys_complete, file="data_output/surveys_complete.csv",
          row.names=FALSE)

ggplot2

Learning Objectives

  • Visualize some of the mammals data from Figshare surveys.csv.
  • Understand how to plot these data using R ggplot2 package.
  • Building step by step complex plots with the ggplot2 package.

Load the required packages:

library(ggplot2)
library(dplyr)

Plotting with ggplot2

ggplot2 can generate complex publication ready plots from data frames.

ggplot(data=surveys_complete) + # bind plot to specific data frame

# add aesthetics
ggplot(data=surveys_complete,aes(x=weight,y=hindfoot_length)))

# specify a geometry
ggplot(data=surveys_complete,aes(x=weight,y=hindfoot_length)) +
geom_point()

# Create
surveys_plot <- ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length))

# Draw the plot
surveys_plot + geom_point()

Typically build plots iteratively:

  • define a data set
  • choose a geometry
  • play with it until it is right
# add transparency to avoid over plotting
base <- ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length))
 
base + geom_point(alpha = 0.1)
    
# add colour
base + geom_point(alpha = 0.1, color = "blue")

# colour by species
base + geom_point(alpha = 0.1, aes(color=species_id))
    
# Boxplot
# Visualising the distribution of weight within each species.
ggplot(data = surveys_complete, aes(x = species_id, y = hindfoot_length)) +
    geom_boxplot()
    
   
# Add jitter
ggplot(data = surveys_complete, aes(x = species_id, y = hindfoot_length)) +
    geom_boxplot(alpha = 0) +
    geom_jitter(alpha = 0.3, color = "tomato")     

Challenges

  • Replace the box plot with a violin plot; see geom_violin().
  • Represent weight on the log10 scale; see scale_y_log10().
  • Create boxplot for hindfoot_length.

Plotting time series data

Plot the counts per year for each species.

# Group by year and species_id and count
yearly_counts <- surveys_complete %>%
                 group_by(year, species_id) %>%
                 tally

# plot                 
ggplot(data = yearly_counts, aes(x = year, y = n)) +
     geom_line()
     
# do it by species by grouping
ggplot(data = yearly_counts, aes(x = year, y = n, group = species_id)) +
    geom_line()

# Distinguish species by colour
ggplot(data = yearly_counts, 
       aes(x = year, y = n, group = species_id, 
       colour = species_id)) +
    geom_line()

Faceting

Break a single plots into multiple plots based on a factor included in the dataset.

ggplot(data = yearly_counts, aes(x = year, y = n, group = species_id, colour = species_id)) +
    geom_line() +
    facet_wrap(~ species_id)

Want to split each plot by gender. To do that we need to make counts in data frame grouped by year, species_id, and sex.

yearly_sex_counts <- surveys_complete %>%
                      group_by(year, species_id, sex) %>%
                      tally
                      
ggplot(data = yearly_sex_counts, 
       aes(x = year, y = n, color = species_id, group = sex)) +
       geom_line() +
       facet_wrap(~ species_id)

# Set background to white (easer to read) 
ggplot(data = yearly_sex_counts, 
       aes(x = year, y = n, color = species_id, group = sex)) +
       geom_line() +
       facet_wrap(~ species_id) +
       theme_bw()

# Colour by sex and not gender
ggplot(data = yearly_sex_counts, 
       aes(x = year, y = n, color = sex, group = sex)) +
       geom_line() +
       facet_wrap(~ species_id) +
       theme_bw()                       

Challenge

  • Use what you just learned to create a plot that depicts how the average weight of each species changes through the years.
yearly_sex_weight <- surveys_complete %>%
    group_by(year, sex, species_id) %>%
    summarize(avg_weight = mean(weight))
    
ggplot(data = yearly_sex_weight, 
       aes(x=year, y=avg_weight, color = species_id, 
       group = species_id)) +
       geom_line() +
       facet_grid(sex ~ .)  # rows ~ columns; a . can be used as a 
                            # placeholder that indicates only one 
                            # row or column

# One row, facet by column
ggplot(data = yearly_sex_weight, aes(x=year, y=avg_weight, color = species_id, group = species_id)) +
    geom_line() +
    facet_grid(. ~ sex)

Customisation

Look at:

# Add a title and labels
ggplot(data = yearly_sex_counts, 
       aes(x = year, y = n, color = sex, group = sex)) +
       geom_line() +
       facet_wrap(~ species_id) +
       labs(title = 'Observed species in time',
            x = 'Year of observation',
            y = 'Number of species') +
    theme_bw()
    
# Change font size and font
ggplot(data = yearly_sex_counts, 
       aes(x = year, y = n, color = sex, group = sex)) +
       geom_line() +
       facet_wrap(~ species_id) +
       labs(title = 'Observed species in time',
            x = 'Year of observation',
            y = 'Number of species') +
       theme_bw() +
       theme(text=element_text(size=16, family="Arial"))

# Change the orientation of the labels
ggplot(data = yearly_sex_counts, 
       aes(x = year, y = n, color = sex, group = sex)) +
       geom_line() +
       facet_wrap(~ species_id) +
       labs(title = 'Observed species in time',
           x = 'Year of observation',
           y = 'Number of species') +
       theme_bw() +
       theme(axis.text.x = element_text(colour="grey20", size=12, 
                                    angle=90, hjust=.5, vjust=.5),
             axis.text.y = element_text(colour="grey20", size=12),
             text=element_text(size=16, family="Arial"))
             
# Save the theme
arial_grey_theme <- theme(
     axis.text.x = element_text(colour="grey20", size=12, 
                                angle=90, hjust=.5, vjust=.5),
     axis.text.y = element_text(colour="grey20", size=12),
     text=element_text(size=16, family="Arial"))

# Reuse
ggplot(surveys_complete, 
       aes(x = species_id, y = hindfoot_length)) +
       geom_boxplot() +
       arial_grey_theme
       
# Saving a plot
my_plot <- ggplot(data = yearly_sex_counts, 
                 aes(x = year, y = n, color = sex, group = sex)) +
            geom_line() +
            facet_wrap(~ species_id) +
            labs(title = 'Observed species in time',
                 x = 'Year of observation',
                 y = 'Number of species') +
            theme_bw() +
            theme(axis.text.x = element_text(colour="grey20", 
                  size=12, angle=90, hjust=.5, vjust=.5),
            axis.text.y = element_text(colour="grey20", size=12),
            text=element_text(size=16, family="Arial"))

# Now save the plot            
ggsave("name_of_file.png", my_plot, width=15, height=10)       

Creative Commons Licence
This work is licensed under a Creative Commons Attribution 4.0 International License.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment