- Getting started
- Introducing R and programming basics
- Introduction to R
- Managing data in R
- Data frames
- dplyr
- ggplot2
- R projects
- Organising files and directories related to a particular set of analyses.
- Define the following (as they apply to R):
- Script,
- function,
- working directory,
- assign,
- object,
- variable
- Describe the purpose of:
- an RStudio script,
- console,
- environment, and
- plot windows
- Assigning values to variables
- Call functions with zero or more named or unnamed arguments
- Use the built-in RStudio help interface to search for more information on R functions
- Ask for help from the R user community, providing sufficient information for the problem to be reproduced and troubleshooted
- working directories
- use relative paths thereafter
- Start R-studio,
- File > New Project > New directory->
data-carpentry
- Files tab > New folder >
data
- File > New File > R script >
data-carpentry-script.R
- File > New Project > New directory->
- General organisation of your working directory
data
scripts
documents
- Move
data-carpentry-script.R
toscripts
- R-studio
- IDE
- Distributed for free under the Affero GNU Public Licence & Commercial licences
- Four main panes:
- editor
- Best for reproducibility
- console
- environment/history
- files/plots/packages/help/viewer
- editor
- Programming is like baking a case - have a series of instructions to follow
# This is a comment
print("Hello World!")
- Use of the text editor and console
- Use of
Cnt-1
andCnt-2
- Use of
Cnt-Enter
- Use of the
+
continuation marker- Use of
Esc
to get out of the continuation mode
- Use of
- Open Source, released under GPL Version 2
- Widely used, almost 9k contributed packages (see CRAN)
- Cross platform
- Not just for statistics, can be used for general programming
- R works as an object-oriented & functional programming language.
- Big community
- Use of # for comments in R
- Use of assignment in R,
x<-3
(ALT+-
in Studio gives you the assignment)- Can use
=
but best for labelling options in functions
- Can use
- Use of functions
sqrt(4)
,x<-sqrt(4)
,myadd<-function(x,y){return(x+y)}
- Arguments can be numbers, characters, objects, optional (take default values)
round(3.14159)
args(round)
?round
help("round")
round(3.14159,2) # Using positional arguments
round(digits=2,x=3.14159)
?barplot
orhelp("barplot")
- Search
??barplot
- Googling; Stackoverflow with
[r]
tag - Asking for help;
- Try to come up with the smallest use case that replicates your problem, including sample data.
- Useful to add the type of framework that you are using:
sessionInfo()
.
- Familiarize participants with R syntax
- Understand the concepts of objects and assignment
- Understand the concepts of vector and data types
- Get exposed to a few functions
- Can do maths on the console
3+5
12/7
weight_kg <- 55 # picking object names: start with a letter,
# cannot be a reserved name (`?reserved`),
# function names (lead to confusion)
(weight_kg <- 55) # Will print the value
weight_kg # This will print it too
2.2 * weight_kg
weight_kg <- 57.5
2.2 * weight_kg
weight_lb <- 2.2 * weight_kg
weight_kg <- 100
weight_lb # What will this give?
# Challenge
mass <- 47.5 # mass?
age <- 122 # age?
mass <- mass * 2.0 # mass?
age <- age - 20 # age?
mass_index <- mass/age # mass_index?
weight_g <- c(50, 60, 65, 82) # elements of the same type
weight_g
animals <- c("mouse", "rat", "dog")
animals
length(weight_g)
length(animals)
class(weight_g) # find out the class/type of the elements
class(animals)
weight_g <- c(weight_g, 90) # adding at the end of the vector
weight_g <- c(30, weight_g) # adding at the beginning of the vector
weight_g
There are six atomic vector types:
character
andnumeric
which we have metlogical
for TRUE and FALSE (the boolean data type)integer
for integer numbers (e.g.,2L
, theL
indicates to R that it’s an integer)complex
to represent complex numbers with real and imaginary parts (e.g., 1+4i) and that’s all we’re going to say about themraw
that we won’t discuss further (storing raw bytes)
Data types in R: vector
, matrix
, data.frame
, list
.
# Challenge
num_char <- c(1, 2, 3, 'a')
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c('a', 'b', 'c', TRUE)
tricky <- c(1, 2, 3, '4')
logical
>integer
>complex
>character
animals <- c("mouse", "rat", "dog", "cat")
animals[2] # Indices start at 1
animals[c(3, 2)]
more_animals <- animals[c(1, 2, 3, 2, 1, 4)]
more_animals
weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)]
weight_g > 50 # returns logicals for the indices that meet the condition
weight_g[weight_g > 50] # can use to select values above 50
weight_g[weight_g < 30 | weight_g > 50] # Combining tests with | or &
weight_g[weight_g >= 30 & weight_g == 21] # Oops
animals <- c("mouse", "rat", "dog", "cat") # Use of %>%
animals[animals == "cat" | animals == "rat"] # returns both rat and cat
animals %in% c("rat", "cat", "dog", "duck")
animals[animals %in% c("rat", "cat", "dog", "duck")]
- Can you figure out why "four" > "five" returns TRUE?
- if
!
is the negation operator how would you return everything that is not a rat,cat, dog or duck?
alphabet <- c("a", "b", "c", NA, "d", "e", "f")
heights <- c(2, 4, 4, NA, 6)
mean(heights)
max(heights)
mean(heights, na.rm = TRUE)
max(heights, na.rm = TRUE)
is.na(heights)
heights[!is.na(heights)]
complete.cases(heights)
heights[complete.cases(heights)] # Missing out na.omit(heights)
- Why does the following give an error:
sample <- c(2, 4, 4, "NA", 6)
mean(sample, na.rm = TRUE)
- Why does the error message say the argument is not numeric?
- Load external data (CSV files) in memory using the survey table (surveys.csv) as an example
- explore the structure and the content of a data frame in R
- understand what factors are and how to manipulate them
- Study species and weight of animals caught in plots in a survey area.
- Rows represent animals
- 13 Columns give:
Column Description
record_id Unique id for the observation
month month of observation
day day of observation
year year of observation
plot_id ID of a particular plot
species_id 2-letter code
sex sex of animal (“M”, “F”)
hindfoot_ length length of the hindfoot in mm
weight weight of the animal in grams
genus genus of animal
species species of animal
taxa e.g. Rodent, Reptile, Bird, Rabbit
plot_type type of plot
Download data and add read it into Rstudio:
download.file("https://ndownloader.figshare.com/files/2292169",
"data/portal_data_joined.csv")
surveys <- read.csv('data/portal_data_joined.csv')
surveys
head(surveys)
Have a data.frame
- table, columns are vectors (the same data type) and all of the same length.
str(surveys) # Look at the structure
Based on the output of str(surveys), can you answer the following questions?
- What is the class of the object surveys?
- How many rows and how many columns are in this object?
- How many species have been recorded during these surveys?
- Factors represent categorical data.
- Labels with integers.
- Look like character vectors but have integers under the hood.
- Can be ordered or unordered.
sex <- factor(c("male", "female", "female", "male"))
sex # female level 1, male level 2 - alphabetical
levels(sex)
nlevels(sex)
food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(food) # you want an order
food <- factor(food, levels=c("low", "medium", "high"))
levels(food)
min(food) ## doesn't work
food <- factor(food, levels=c("low", "medium", "high"), ordered=TRUE)
levels(food)
min(food) ## works!
# Converting factors
f <- factor(c(1, 5, 10, 2))
f
as.numeric(f) # wrong! position of levels - dput(f)
as.numeric(as.character(f)) # works...
as.numeric(levels(f))[f] # The recommended way.
exprmt <- factor(c("treat1", "treat2", "treat1", "treat3", "treat1", "control",
"treat1", "treat2", "treat3"))
table(exprmt)
barplot(table(exprmt))
Want you to:
- put control last.
- colour in the bars red
- label the axes "frequency","experiment"
exprmt <- factor(exprmt,levels=c("treat1","treat2","treat3","control"))
barplot(table(exprmt),col="red",xlab="experiment",ylab="frequency")
- understand the concept of a
data.frame
- use sequences
- know how to access any element of a
data.frame
- tabular structure
- collection of identical length vectors, can be of different types
- can use
str
to look at the structure - can be constructed by hand but usually the default with
read.table
,read.csv
- by default character data read as
factors
, sometimes usually don't want that
some_data <- read.csv("data/some_file.csv", stringsAsFactors=FALSE)
# can create data frame
example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
feel=c("furry", "furry", "squishy", "spiny"),
weight=c(45, 8, 1.1, 0.8))
str(example_data)
example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
feel=c("furry", "furry", "squishy", "spiny"),
weight=c(45, 8, 1.1, 0.8),stringsAsFactors = FALSE)
str(example_data)
color <- c("red", "green", "blue", "yellow")
counts <- c(50, 60, 65, 82)
new_dataframe <- data.frame(colors = color, counts = counts)
class(new_dataframe)
Use surveys
to demonstrate:
- Size:
dim()
- returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)nrow()
- returns the number of rowsncol()
- returns the number of columns
- Content:
head()
- shows the first 6 rowstail()
- shows the last 6 rows
- Names:
names()
- returns the column names (synonym ofcolnames()
for data.frame objects)rownames()
- returns the row names
- Summary:
str()
- structure of the object and information about the class, length and content of each column
1:10
10:1
seq(1, 10, by=2) # More complex sequencing
seq(5, 10, length.out=3)
seq(50, by=5, length.out=10)
seq(1, 8, by=3) # sequence stops to stay below upper limit
surveys[1] # first column in the data frame (as a data.frame)
surveys[1, 1] # first element in the first column of the data frame (as a vector)
surveys[1, 6] # first element in the 6th column (as a vector)
surveys[1:3, 7] # first three elements in the 7th column (as a vector)
surveys[3, ] # the 3rd element for all columns (as a data.frame)
surveys[, 8] # the entire 8th column (as a vector)
head_surveys <- surveys[1:6, ] # equivalent to head(surveys)
surveys["species_id"] # Result is a data.frame
surveys[, "species_id"] # Result is a vector
surveys[["species_id"]] # Result is a vector
surveys$species_id # Result is a vector
surveys$d # partial matching - DANGEROUS
-
The function
nrow()
on adata.frame
returns the number of rows. Use it, in conjuction withseq()
to create a new data.frame calledsurveys_by_10
that includes every 10th row of the survey data frame starting at row 10 (10, 20, 30, …) -
Create a
data.frame
containing only the observations from row 1999 of thesurveys
dataset. -
Notice how
nrow()
gave you the number of rows in a data.frame? Usenrow()
instead of a row number to make a data.frame with observations from only the last row of the surveys dataset. -
Now that you’ve seen how
nrow()
can be used to stand in for a row index, let’s combine that behaviour with the - notation above to reproduce the behaviour ofhead(surveys)
excluding the 7th through to the final row of the surveys dataset.
- Learn basic utilities of the
dplyr
package - Select and filter data
- Be able to use
magrittr
pipes - Create new columns with
mutate()
- Use the split-apply-combine paradigm to summarize data
- Export data with
write.csv()
dplyr
is a package. Loaded with the library
command. Need to install it first. Uses C++ under the bonnet. Can work with databases.
install.packages("dplyr") # May have be asked to use a mirror
library("dplyr") # load the package
select(surveys, plot_id, species_id, weight) # select columns
filter(surveys, year == 1995) # select rows
What about selecting and filtering at the same time?
- use intermediate steps,
- nested functions,
- or pipes
surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight)
surveys_sml <- surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight) # or use ->
surveys_sml
- Using pipes, subset the data to include individuals collected before 1995, and retain the columns year, sex, and weight.
surveys %>% filter(year < 1995) %>% select(year, sex, weight)
Create new or change columns use mutate
.
surveys %>% # Change from g to kg
mutate(weight_kg = weight/1000)
surveys %>% # Limit the output
mutate(weight_kg = weight / 1000) %>%
head # Note the NAs
surveys %>% # remove the NAs
filter(!is.na(weight)) %>%
mutate(weight_kg = weight / 1000) %>%
head
Create a new dataframe from the survey data that meets the following criteria:
- contains only the
species_id
column and a column that contains values that are half thehindfoot_length
values (e.g. a new columnhindfoot_half
). - In this
hindfoot_half
column, there are no NA values and all values are< 30
.
Hint: think about how the commands should be ordered to produce this data frame!
surveys %>%
filter(!is.na(hindfoot_length) & hindfoot_length < 60) %>%
mutate(hindfoot_half=0.5*hindfoot_length) %>%
select(species_id, hindfoot_half) %>%
head
A lot of data analysis tasks can be approached using a split-apply-combine paradigm:
- split the data into groups,
- apply some analysis to each group, and then
- combine the results.
dplyr
makes this very easy through the use of the group_by()
and summarize()
function.
group_by()
- splits categorical data into groups.summarize()
- collapses each group into a single-row summary of that group.
surveys %>%
group_by(sex) %>%
summarize(mean_weight = mean(weight, na.rm = TRUE))
surveys %>% # Group by sex and species
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight, na.rm = TRUE))
surveys %>% # Remove missing weights (note na.rm removed as well)
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight)) # data.frame -> tbl_df
surveys %>% # Print out more rows & remove missing sex
filter(!is.na(weight) & sex != "") %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight)) %>%
print(n=15)
surveys %>% # Summarize by one more than one variable
filter(!is.na(weight) & sex != "") %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
min_weight = min(weight),count=n()) # n() can be used
# in mutate,
# filter,
# summarise
surveys %>% # Can use tally
group_by(sex) %>%
tally()
- How many individuals were caught in each
plot_type
surveyed?
surveys %>% group_by(plot_type,species) %>% summarize(number=n())
surveys %>% group_by(plot_type,species) %>% tally()
- Use
group_by()
andsummarize()
to find themean
,min
, andmax
hindfoot
length for each species (usingspecies_id
).
surveys %>%
filter(!is.na(hindfoot_length)) %>%
group_by(species_id) %>%
summarize(mean=mean(hindfoot_length),
max=max(hindfoot_length),
min=min(hindfoot_length))
- What was the heaviest animal measured in each year? Return the columns
year
,genus
,species_id
, andweight
.
surveys %>%
filter(!is.na(weight)) %>%
group_by(year) %>%
filter(weight==max(weight)) %>%
select(year,genus,species_id,weight)
# pipe through arrange() to sort
Export data to a data_out
directory.
- Create the
data_output
directory.
Going to use write.csv()
clean the data first.
surveys_complete <- surveys %>% # remove missing values
filter(species_id != "", # remove missing species_id
!is.na(weight), # remove missing weight
!is.na(hindfoot_length), # remove missing hindfoot_length
sex != "") # remove missing sex
species_counts <- surveys_complete %>% # extract the most common species
group_by(species_id) %>%
tally %>%
filter(n >= 50) %>%
select(species_id)
surveys_complete <- surveys_complete %>% # keep most common species
filter(species_id %in% species_counts$species_id)
dim(surveys_complete) # Should have 30463 rows and 13 columns
# Write to CSV
write.csv(surveys_complete, file="data_output/surveys_complete.csv",
row.names=FALSE)
- Visualize some of the mammals data from Figshare
surveys.csv
. - Understand how to plot these data using R
ggplot2
package. - Building step by step complex plots with the
ggplot2
package.
Load the required packages:
library(ggplot2)
library(dplyr)
ggplot2
can generate complex publication ready plots from data frames.
ggplot(data=surveys_complete) + # bind plot to specific data frame
# add aesthetics
ggplot(data=surveys_complete,aes(x=weight,y=hindfoot_length)))
# specify a geometry
ggplot(data=surveys_complete,aes(x=weight,y=hindfoot_length)) +
geom_point()
# Create
surveys_plot <- ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length))
# Draw the plot
surveys_plot + geom_point()
Typically build plots iteratively:
- define a data set
- choose a geometry
- play with it until it is right
# add transparency to avoid over plotting
base <- ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length))
base + geom_point(alpha = 0.1)
# add colour
base + geom_point(alpha = 0.1, color = "blue")
# colour by species
base + geom_point(alpha = 0.1, aes(color=species_id))
# Boxplot
# Visualising the distribution of weight within each species.
ggplot(data = surveys_complete, aes(x = species_id, y = hindfoot_length)) +
geom_boxplot()
# Add jitter
ggplot(data = surveys_complete, aes(x = species_id, y = hindfoot_length)) +
geom_boxplot(alpha = 0) +
geom_jitter(alpha = 0.3, color = "tomato")
- Replace the box plot with a violin plot; see
geom_violin()
. - Represent weight on the log10 scale; see
scale_y_log10()
. - Create boxplot for
hindfoot_length
.
Plot the counts per year for each species.
# Group by year and species_id and count
yearly_counts <- surveys_complete %>%
group_by(year, species_id) %>%
tally
# plot
ggplot(data = yearly_counts, aes(x = year, y = n)) +
geom_line()
# do it by species by grouping
ggplot(data = yearly_counts, aes(x = year, y = n, group = species_id)) +
geom_line()
# Distinguish species by colour
ggplot(data = yearly_counts,
aes(x = year, y = n, group = species_id,
colour = species_id)) +
geom_line()
Break a single plots into multiple plots based on a factor included in the dataset.
ggplot(data = yearly_counts, aes(x = year, y = n, group = species_id, colour = species_id)) +
geom_line() +
facet_wrap(~ species_id)
Want to split each plot by gender. To do that we need to make counts in data frame grouped by year
, species_id
, and sex
.
yearly_sex_counts <- surveys_complete %>%
group_by(year, species_id, sex) %>%
tally
ggplot(data = yearly_sex_counts,
aes(x = year, y = n, color = species_id, group = sex)) +
geom_line() +
facet_wrap(~ species_id)
# Set background to white (easer to read)
ggplot(data = yearly_sex_counts,
aes(x = year, y = n, color = species_id, group = sex)) +
geom_line() +
facet_wrap(~ species_id) +
theme_bw()
# Colour by sex and not gender
ggplot(data = yearly_sex_counts,
aes(x = year, y = n, color = sex, group = sex)) +
geom_line() +
facet_wrap(~ species_id) +
theme_bw()
- Use what you just learned to create a plot that depicts how the average weight of each species changes through the years.
yearly_sex_weight <- surveys_complete %>%
group_by(year, sex, species_id) %>%
summarize(avg_weight = mean(weight))
ggplot(data = yearly_sex_weight,
aes(x=year, y=avg_weight, color = species_id,
group = species_id)) +
geom_line() +
facet_grid(sex ~ .) # rows ~ columns; a . can be used as a
# placeholder that indicates only one
# row or column
# One row, facet by column
ggplot(data = yearly_sex_weight, aes(x=year, y=avg_weight, color = species_id, group = species_id)) +
geom_line() +
facet_grid(. ~ sex)
Look at:
# Add a title and labels
ggplot(data = yearly_sex_counts,
aes(x = year, y = n, color = sex, group = sex)) +
geom_line() +
facet_wrap(~ species_id) +
labs(title = 'Observed species in time',
x = 'Year of observation',
y = 'Number of species') +
theme_bw()
# Change font size and font
ggplot(data = yearly_sex_counts,
aes(x = year, y = n, color = sex, group = sex)) +
geom_line() +
facet_wrap(~ species_id) +
labs(title = 'Observed species in time',
x = 'Year of observation',
y = 'Number of species') +
theme_bw() +
theme(text=element_text(size=16, family="Arial"))
# Change the orientation of the labels
ggplot(data = yearly_sex_counts,
aes(x = year, y = n, color = sex, group = sex)) +
geom_line() +
facet_wrap(~ species_id) +
labs(title = 'Observed species in time',
x = 'Year of observation',
y = 'Number of species') +
theme_bw() +
theme(axis.text.x = element_text(colour="grey20", size=12,
angle=90, hjust=.5, vjust=.5),
axis.text.y = element_text(colour="grey20", size=12),
text=element_text(size=16, family="Arial"))
# Save the theme
arial_grey_theme <- theme(
axis.text.x = element_text(colour="grey20", size=12,
angle=90, hjust=.5, vjust=.5),
axis.text.y = element_text(colour="grey20", size=12),
text=element_text(size=16, family="Arial"))
# Reuse
ggplot(surveys_complete,
aes(x = species_id, y = hindfoot_length)) +
geom_boxplot() +
arial_grey_theme
# Saving a plot
my_plot <- ggplot(data = yearly_sex_counts,
aes(x = year, y = n, color = sex, group = sex)) +
geom_line() +
facet_wrap(~ species_id) +
labs(title = 'Observed species in time',
x = 'Year of observation',
y = 'Number of species') +
theme_bw() +
theme(axis.text.x = element_text(colour="grey20",
size=12, angle=90, hjust=.5, vjust=.5),
axis.text.y = element_text(colour="grey20", size=12),
text=element_text(size=16, family="Arial"))
# Now save the plot
ggsave("name_of_file.png", my_plot, width=15, height=10)
This work is licensed under a Creative Commons Attribution 4.0 International License.