@gustavoalbuquerquebr
Last active September 6, 2020 23:18
#r

Ubuntu Install

  • sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
  • sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/'
  • sudo apt update
  • sudo apt install r-base

General observations

  • get help by prefixing a function name, data set or symbol with ?, e.g. ?mean
  • printing is implicit, so you usually don't need to write print()
  • counting starts at 1, so the first item of a vector x is x[1]
  • in function calls, you can specify the arguments by name or order, e.g.
    • plot(iris$Species, iris$Petal.Width), plot(x = iris$Species, y = iris$Petal.Width)
  • vectorized language, therefore there's much less need to iterate over objects, e.g.
    • to get double of each element of this vector x <- c(3, 5, 8), you only need x * 2
  • AND, OR comparisons can be made with single or double signs (&, &&, |, ||); the two forms behave differently
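
A minimal sketch of two of the observations above, vectorization and argument matching. The variable names (doubled, looped, by_position, by_name) are only illustrative, and seq() is used here simply as a convenient base function with named arguments.

```r
x <- c(3, 5, 8)

# vectorized: the multiplication is applied to every element at once
doubled <- x * 2

# the same result written as an explicit loop, for comparison
looped <- numeric(length(x))
for (i in seq_along(x)) {
  looped[i] <- x[i] * 2
}

identical(doubled, looped) # TRUE

# arguments can be matched by position or by name;
# named arguments may be given in any order
by_position <- seq(1, 10, 3)
by_name <- seq(by = 3, to = 10, from = 1)

identical(by_position, by_name) # TRUE
```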

Basics

help

  • ? = prefix a function, data set or library name with a question mark to get info about it, e.g. ?iris

Comments

  • # = single-line comments
  • R doesn't support multi-line comments; prefix each line with #

clear console

  • cat("\014") is the code to send "CTRL + L" to the console

list available data sets

  • data() = lists all available data sets, including from libraries (if they're loaded)

view data set as table

  • View(data set)

Declare variables and functions

NOTE: variables and functions can also be assigned with the equal sign (=), but <- is the usual convention

  • variables: x <- 20
  • functions: myF <- function() {...}
  • vectors: x <- c(5, 8, 12)

function

functions with ellipsis

add <- function(...) {
  args <- list(...)
  sum <- 0
  
  for (n in args) {
    sum <- sum + n
  }
  
  return(sum)
}

add(2, 3, 5, 4)

variable info

  • class() or typeof() = for a plain number, class() reports 'numeric' and typeof() reports 'double'; for other objects they can differ more (e.g. a data frame has class 'data.frame' but type 'list')
  • str() = short for structure, displays the internal structure of the given object
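
A short sketch comparing the three functions on a few objects. The variable names n, i and s are hypothetical; iris is the built-in data set used elsewhere in these notes.

```r
n <- 20      # plain number
i <- 20L     # explicit integer
s <- "text"  # character string

class(n)  # "numeric"
typeof(n) # "double"

class(i)  # "integer"
typeof(i) # "integer"

class(s)  # "character"

# for compound objects, class and type describe different things
class(iris)  # "data.frame"
typeof(iris) # "list"

# str() prints a compact overview: dimensions, column names and types
str(iris)
```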

Operators

  • %%, %/% = remainder and integer quotient of a division
  • : = creates a vector of consecutive integers, e.g. 1:5
  • %in% = tests whether an element belongs to a vector
  • & |, && || =
    • single operators compare the vectors element by element and return a vector of logical values (TRUE or FALSE)
    • double operators expect operands of length 1 and return a single logical value; before R 4.3.0 they silently used only the first element of longer vectors, since R 4.3.0 longer vectors raise an error
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, FALSE, FALSE)

print(x & y) # TRUE FALSE FALSE
print(x[1] && y[1]) # TRUE
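
The arithmetic and membership operators listed above can be tried out in the same way. This is only an illustrative sketch; the values are arbitrary.

```r
17 %% 5  # remainder: 2
17 %/% 5 # integer quotient: 3

1:5 # 1 2 3 4 5

3 %in% c(1, 3, 5)       # TRUE
c(2, 3) %in% c(1, 3, 5) # FALSE TRUE (checked element by element)
```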

if

ifelse() function

  • ifelse() is a vector equivalent form of the if...else statement
x <- c(3, 5, 8, 12)

# ifelse(test, yes, no)
# returns a value with the same shape as test, usually a vector
# filled with elements selected from either yes or no
# depending on whether the element of test is TRUE or FALSE
ifelse(x %% 2 == 0, "even", "odd")

for loop

# loop 1 through 10 (inclusive)
for (n in 1:10) {
  print(n)
}

# loop vector elements
x <- c(5, 8, 12, 15)
for (n in x) {
  print(n)
}

while

x <- 1

while (x <= 10) {
  print(x)
  x <- x + 1
}

switch

color <- "b"
switch(color, "r" = "red", "b" = "blue", "unknown")

get user input

  • x <- scan()

filter data

  • iris$Petal.Width[iris$Species == "setosa"]
  • plot(iris$Petal.Width[iris$Species == "setosa" | iris$Species == "virginica"])

apply

double <- function(x) {
  return(x * 2)
}
x <- matrix(c(3, 5, 8, 12), nrow = 2)

# apply(X, MARGIN, FUN, ...)
# X = matrix; MARGIN = 1 for rows, 2 for cols; FUN = function to apply
apply(x, 2, double)

Data Types

  • In R, everything is an object

vectors

vector of single value

R doesn't have primitive data types in the way that other languages do. In R even the simplest numeric value is an example of a vector.

  • used often:
    • logical = TRUE, FALSE
    • numeric/double = can be a whole number or contain a decimal value
    • character = enclosed with quotes (single or double)
  • not used often:
    • integer = declare explicitly with x <- 10L
    • complex = numbers, e.g. 3 + 2i
    • raw = created with charToRaw()

vector of multiple values

  • must contain only one data type
  • created with c()
  • e.g. numeric vectors x <- c(5, 8, 12)
  • index starts at 1
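
Because a vector can hold only one data type, c() converts mixed inputs to the most general type among them. A brief sketch of that coercion (the variable names are illustrative):

```r
mixed_chr <- c(1, "a", TRUE) # everything becomes character
mixed_num <- c(1, TRUE, FALSE) # logicals become numbers (TRUE = 1, FALSE = 0)
mixed_dbl <- c(1L, 2.5) # integers become doubles

typeof(mixed_chr) # "character"
typeof(mixed_num) # "double"
typeof(mixed_dbl) # "double"

mixed_chr # "1" "a" "TRUE"
mixed_num # 1 1 0
```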

list

  • can contain many different types of elements
  • like vectors, functions, lists...
  • l <- list(c(3, 5, 8), "my string...", TRUE, list("a", "b"), myFunction)
  • x <- list(a = "aaa", b = "bbb") = can have named elements

matrix

  • matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
  • like vectors, matrices store information of the same data type
  • two-dimensional
  • e.g. m <- matrix(c(3, 5, 8, 12, 15, 18), 2, 3)
  • m[1,2] = access specific element (row 1, column 2)
  • m[1,] = access all of the row 1, m[,2] = access all of the column 2
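
One detail worth seeing in action is that matrix() fills column by column unless byrow = TRUE is given. A small sketch using the same values as above (m2 is a hypothetical second matrix added for comparison):

```r
# default: values fill the matrix column by column
m <- matrix(c(3, 5, 8, 12, 15, 18), 2, 3)
m[1, 2] # 8
m[1, ]  # 3 8 15
m[, 2]  # 8 12

# byrow = TRUE: values fill the matrix row by row
m2 <- matrix(c(3, 5, 8, 12, 15, 18), 2, 3, byrow = TRUE)
m2[1, 2] # 5
m2[1, ]  # 3 5 8

dim(m) # 2 3
```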

array

  • while matrices are two-dimensional, arrays can be any number of dimensions
  • store only one data type
  • array(c('green','yellow'), dim = c(2,3,2)) = creates 2 matrices with 2 rows and 3 columns each
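
Indexing an array takes one index per dimension. The sketch below uses the numbers 1 to 12 instead of the colour names so that each position is easy to identify; arr is a hypothetical name.

```r
# 2 rows, 3 columns, 2 "layers"
arr <- array(1:12, dim = c(2, 3, 2))

dim(arr)     # 2 3 2
arr[1, 2, 1] # row 1, column 2, first layer: 3
arr[2, 3, 2] # row 2, column 3, second layer: 12

# leaving the first two indices blank returns a whole layer as a matrix
layer2 <- arr[, , 2]
dim(layer2) # 2 3
```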

factor

  • created from a vector of categorical values; it stores the vector along with its distinct values (levels)
v <- c("pineapple", "banana", "banana", "apple", "pineapple", "banana")
f <- factor(v)

print(f) # print vector and levels (each distinct vector value)
print(nlevels(f)) # print the number of distinct values
print(levels(f)) # print each distinct vector value

dataFrame

  • is a tabular structure similar to a matrix, but each column can hold a different data type
  • columns are variables and rows are observations
df <- data.frame(
  Name = c("John", "Matt"),
  Age = c(25, 27),
  City = c("Boston", "NY")
)

print(nrow(df))
print(ncol(df))
print(dim(df)) # get both nrow and ncol

Subsetting

  • Subsetting in R is a useful indexing feature for accessing object elements,
  • it can be used to select and filter variables and observations.

subsetting symbols

single brackets

  • [] = get a subset of length 1 or more
    • usually, an object and its subset are of the same type; therefore, a subset of a vector will be a vector, a subset of a data frame will be a data frame...
      • however, there's one inconsistency: when subsetting a matrix or data frame with two indices, if the result has only one row or column, R reduces it to the lowest dimension, so the subset and the container may have different types (use drop = FALSE to prevent this)
    • both names and indices can be used
    • negative integers indicate exclusion
    • variables holding names or indices can be used inside the brackets

double brackets

  • [[]] = extract only one element (not necessarily just one value); i.e. vectors yield single value, data frames yield column vector
    • names or indices can be used
    • variables holding a name or index can be used inside the brackets
    • usually, not the same type as the object container
    • dimension of returned value isn't necessarily 1

dollar sign

  • $ = special case of [[ in which you access a single item by a name
    • therefore, iris$Species and iris[["Species"]] are equivalent
    • cannot use integer indices
    • if the name contains special characters, it must be enclosed in backticks
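
The differences between the three subsetting symbols are easiest to see on a data frame. A minimal sketch using the built-in iris data set; col and df are hypothetical names introduced only for this illustration.

```r
# single brackets keep the container type
class(iris["Species"])   # "data.frame"

# double brackets and $ extract the element itself
class(iris[["Species"]]) # "factor"
identical(iris$Species, iris[["Species"]]) # TRUE

# a variable works inside brackets, but not with $
col <- "Species"
head(iris[[col]])
iris$col # NULL, because $ looks for a column literally named "col"

# two-index subsetting drops to a vector unless drop = FALSE
class(iris[, "Species"])               # "factor"
class(iris[, "Species", drop = FALSE]) # "data.frame"

# names with special characters need backticks when used with $
df <- data.frame(`my col` = 1:3, check.names = FALSE)
df$`my col` # 1 2 3
```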

atomic vectors

a <- c(3, 5, 8,12)

# accessing with numbers
a[1]
a[c(1, 3)] # positive get multiple specified elements
a[-c(2, 4)] # negative exclude elements

# accessing with logical values
a[c(TRUE,FALSE,TRUE,FALSE)] # select elements where the value is TRUE
a[a > 5] # therefore, this is possible

recycling rule

  • if two vectors are of unequal length, the shorter one will be recycled in order to match the longer vector
  • if longer object length is not a multiple of shorter object length, the program will throw a warning but it'll still return a result
a <- c(2, 3, 5, 8)
b <- c(1, 2)
a * b # result: 2, 6, 5, 16

lists

x <- list("a", "b", "c")

# single bracket returns an object of class 'list'
class(x[2]) # list
# double brackets returns a single element (not of class 'list')
class(x[[2]]) # character

# named lists
y <- list(f = 1:3, s = "a", t = 4:6)
y$f
y[["f"]]

matrices and arrays

m <- matrix(c(3, 5, 8, 12, 15, 18, 21, 25, 30), nrow = 3, byrow = TRUE)

m[1,] # entire first row
m[, 1] # blank subsetting selects all rows/column; here entire first column
m[2, 1] # element at second row, first column
m[1:2, 2:3] # get rows 1 to 2, columns 2 to 3
m[c(1, 3), c(1, 3)] # get rows 1 and 3, their columns 1 and 3

# using a 2 column matrix to subset a matrix
# each row of the matrix will specify a row and a column
select <- matrix(c(1, 1, 1, 3, 3, 1, 3, 3), ncol = 2, byrow = TRUE)
m[select] # result: 3 8 21 30

Data frames and tibbles

mtcars[3] # single index will return specified column(s)
mtcars[3, 1] # two indices will behave like matrices, first is row and second is column

mtcars$mpg # access a column by name; same as mtcars[["mpg"]]
mtcars[3, "mpg"] # access by both, index and name; third row, column named "mpg"
mtcars$mpg[3] # access by both, name and index

# filtering rows by column values
# the column index (second argument) is left blank to return all columns
iris[iris$Species == "setosa", ]
iris[iris$Petal.Width > 0.5 & iris$Species == "setosa", ] # multiple filters

regular expression

# grepl returns a vector of logical values
g <- grepl("Toyota", rownames(mtcars))
mtcars[g, ]

# grep returns a vector with the indices that contain a match
g <- grep("Toyota", rownames(mtcars))
mtcars[g, ]

# using grepl together with dplyr
library(tidyverse)
iris %>%
  filter(grepl("setosa", Species))

Tidyverse library

  • tidyverse is a set of packages that make everyday data analysis easier and work in harmony (the packages share a common API)

installation

  • sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev = Ubuntu system packages needed before installing
    • then install.packages("tidyverse") = to install from an R script or the console
  • library(tidyverse) = to load the library

use/load

  • library(tidyverse)
    • from now on, any tidyverse function (like dplyr::filter) can be called without the dplyr:: prefix
    • you only need the dplyr:: prefix if there are name collisions and you need to call a function that was masked

pipe operator

  • %>% = simplifies chaining, that is, passing the same data through several functions in sequence
library(tidyverse)

# without pipe ('.data')
f <- filter(.data = mpg, model == "a4")
s <- select(.data = f, manufacturer, model, year)
s

# using pipe
mpg %>%
  filter(model == "a4") %>%
  select(manufacturer, model, year)

dplyr

  • manipulate data sets
library(tidyverse)

mtcars %>%
  filter(
    mpg > 20,
    cyl == 4,
    wt < 2.5,
    grepl("Toyota", rownames(mtcars))
  ) %>%
  arrange(mpg) %>%
  select(mpg, cyl, wt)

tidyr

  • helps create tidy data, that is:
    • every column is a variable
    • every row is an observation
    • every cell is a single value

gather, pivot_longer()

  • lengthens data, increasing the number of rows and decreasing the number of columns
  • gather() is retired; the recommendation is to use pivot_longer() instead
library(tidyverse)

df <- data.frame(
  name = c("John", "Mary", "Jake"),
  a = c(7, 9, 18),
  b = c(18, 5, 3),
  c = c(32, 17, 35)
)

# 'key' and 'value' will be the names of the new cols 
# 'key' will be a categorical variable holding the 'multiple columns' names
# and 'value' will hold the 'multiple columns' values

df %>%
  # gather(key, value, ...multiple columns)
  gather("drug", "volume", a, b, c)

df %>%
  # pivot_longer(columns vector, names_to, values_to)
  pivot_longer(
    cols = c(a, b, c),
    names_to = "drug",
    values_to = "volume")

spread, pivot_wider()

  • widens data, increasing the number of columns and decreasing the number of rows
  • spread() is retired; the recommendation is to use pivot_wider() instead
library(tidyverse)

df <- data.frame(
  name = c("John", "John", "Mary", "Mary"),
  drug = c("a", "b", "a", "b"),
  volume = c(7, 18, 9, 5)
)

# spread
df %>%
  # each individual value in key
  # will be converted to a column
  spread(key = "drug", value = "volume")

# pivot_wider()
df %>%
  # each individual value in names_from
  # will be converted to a column
  pivot_wider(names_from = "drug", values_from = "volume")

separate

  • splits a single column into multiple columns
library(tidyverse)

df <- data.frame(
  Name = c("John", "Mary"),
  Job = c("Teacher, Designer", "Manager, Developer")
)

df %>%
  separate(
    col = "Job",
    into = c("Job 1", "Job 2"), # names of the new columns to be created
    sep = ", "
  )

unite

  • combines multiple columns into one
library(tidyverse)

df <- data.frame(
  Name = c("John", "Mary"),
  Job1 = c("Teacher", "Manager"),
  Job2 = c("Designer", "Developer")
)

df %>%
  # unite(col = name of new column, ...columns to unite, sep = separator)
  unite(
    col = "Jobs",
    "Job1",
    "Job2",
    sep = ", "
  )

extract

  • given a regular expression with capturing groups, extract() turns each group into a new column
library(tidyverse)

df <- data.frame(
  Name = c(
    "John Edwards Smith",
    "Mary Kate Miller Brown",
    "Matt Richards"
  )
)

df %>%
  extract(
    col = "Name",
    into = c("First name", "Last name"),
    # [A-Za-z] matches letters only; [A-z] would also match some punctuation
    regex = "([A-Za-z]*).*\\s([A-Za-z]*)"
  )

readr

Import

RStudio
  • in the bottom-right panel, click on the file name and select 'Import', 'From text (readr)'
  • as you configure your data, the corresponding line of code is shown
code
# read_csv is from readr (included in tidyverse)
library(tidyverse)

setwd("~/Dev/r/")
hp <- read_csv("hp.csv")
hp

Export

# write_csv is from readr (included in tidyverse)
library(tidyverse)

setwd("~/Dev/r/")
write_csv(iris, "iris.csv")

Descriptive statistics

  • summarize and describe a given data set

table()

  • create a frequency table from a categorical variable (column)
  • table(iris$Species)

min, median, mean, max, quantile

min(mtcars$cyl)
median(mtcars$cyl)
mean(mtcars$cyl)
max(mtcars$cyl)
quantile(mtcars$cyl)

# get all at once
summary(mtcars$cyl)

summary()

  • summary(iris), summary(iris$Petal.Width) = details about an object
    • if variable is categorical, result is a frequency table
    • if variable is quantitative, result is a table containing measures of center (mean, median) and measures of spread (min, 1st qu., 3rd qu., max)
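
A short sketch of the two cases described above, using one categorical and one quantitative column of iris. The names by_species and width_summary are only illustrative.

```r
# categorical variable: summary() returns a frequency table
by_species <- summary(iris$Species)
by_species # 50 observations for each of the three species

# quantitative variable: summary() returns measures of center and spread
width_summary <- summary(iris$Petal.Width)
names(width_summary) # "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."
width_summary
```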

cor()

  • correlation
# correlation between weight and miles per gallon
cor(mtcars$wt, mtcars$mpg) # result: approximately -0.87

Statistical model

  • is a set of mathematical equations based on probabilities and used to describe the relationship between two or more variables
  • purpose: description, inference (estimates the parameters of a larger population), comparison (compare if two sets of data are different in a statistically significant way) and prediction (about new, unknown observations)

linear regression

  • describes the relationship between two variables, that is, how changes in one variable affect the other
  • it is called a linear model because it assumes the relationship follows a straight line
  • both variables must be continuous numeric values
  • the variable in the x axis is called 'explanatory variable', and the one in the y axis is called 'outcome variable'
  • linear predictor function - y = m * x + b
    • m is the slope of the line (for each unit increase in x, how much does y increase)
    • b is the y intercept (the y value when x is equal to 0)
plot(iris$Petal.Length, iris$Petal.Width)

# lm is the R function to create linear models
model <- lm(
  formula = Petal.Width ~ Petal.Length,
  data = iris
)

# draw straight line on top of the plot
lines(
  x = iris$Petal.Length,
  y = model$fitted.values,
  col = "red",
  lwd = 3
)

# predict new values from model
predict(
  object = model,
  newdata = data.frame(
    Petal.Length = c(2, 5, 7) # arbitrary values
  )
)
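
To connect the fitted model back to y = m * x + b, the slope and intercept can be read with coef(). This is a sketch under the same assumptions as the example above (iris, Petal.Width explained by Petal.Length); the names co, m, b, manual and predicted are hypothetical.

```r
model <- lm(Petal.Width ~ Petal.Length, data = iris)

# coef() returns the estimated intercept (b) and slope (m)
co <- coef(model)
b <- co[["(Intercept)"]]
m <- co[["Petal.Length"]]

# applying y = m * x + b by hand ...
new_x <- c(2, 5, 7) # arbitrary values
manual <- m * new_x + b

# ... gives the same numbers as predict()
predicted <- predict(model, newdata = data.frame(Petal.Length = new_x))
all.equal(manual, unname(predicted)) # TRUE
```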

Plot

  • a plot is a graphical technique for representing a data set
  • usually a graph showing the distribution of one variable or the relationship between two or more variables
  • in R, plotting is usually done
    • with base R, that is, without any third-party library
    • with a library called ggplot2 (included with tidyverse)

base R vs ggplot

  • base R mostly uses the plot(x, y) function
    • but there are also specialized functions such as barplot() and hist()
  • ggplot always starts with the
    • ggplot(data = data, mapping = aes()) function,
    • followed by the + operator (not the %>% pipe)
    • and then layers, scales, facets and/or coordinates

basic ggplot

save plot to variable and then transforming it

library(tidyverse)

# save plot to variable
# only save it, don't display it
p <- ggplot(mtcars, aes(x = cyl)) +
  geom_bar()

# this won't save the flipped plot into the variable
# only displays it
p + coord_flip()

customization

library(tidyverse)

# needed for third variable in aes()
f <- factor(mtcars$am)
levels(f) <- c("Automatic", "Manual")

ggplot(mtcars, aes( x = wt, y = mpg, shape = f, color = f )) +
  geom_point() +
  labs(
    title = "WT VS MPG",
    x = "weight",
    y = "miles per gallon",
    # change legend title with the aes names
    shape = "Transmission",
    color = "Transmission"
  ) +
  theme( # theme() customizes non-data components
    plot.title =
      element_text( face = "bold",
                    hjust = 0.5,
                    margin = margin(8, 0, 16, 0)),
    axis.title =
      element_text( face = "italic"),
    axis.title.x = 
      element_text( margin = margin(8, 0, 4, 0) ),
    axis.title.y = 
      element_text( margin = margin(0, 8, 0, 4) ),
    axis.ticks = element_blank() # remove ticks
  )

zoom, coord_cartesian

library(tidyverse)

ggplot(ChickWeight, aes(x = weight)) +
  geom_histogram() +
  coord_cartesian(xlim = c(200, 300)) # zoom

fill areas under plot

base R
df <- data.frame(
  Month = 1:12,
  Num = as.vector(AirPassengers)[1:12]
)

plot(df$Num, type = "l")

polygon(c(min(df$Month), df$Month, max(df$Month)), c(0, df$Num, 0), col = "steelblue")
ggplot
df <- data.frame(
  Month = 1:12,
  Num = as.vector(AirPassengers)[1:12]
)

ggplot(df, aes(x = Month, y = Num)) +
  # geom_area() + # ymin fixed to 0, which would make plot very high
  geom_ribbon(aes(ymin = 100, ymax = Num)) +
  geom_line()

Categorical univariable analysis

frequency bar chart

  • x axis: categorical variable
  • y axis: frequency/count
base R
# plot()
plot(iris$Species)

# barplot()
t <- table(iris$Species) # creates frequency table
barplot(t)
ggplot
ggplot(iris, aes(x = Species)) +
  geom_bar()

Cleveland dot plot

base R
dotchart(table(mtcars$cyl))
ggplot
ggplot(mtcars, aes(x = cyl)) +
  # stat = the statistical transformation to use on the data for this layer
  geom_point(stat = "count") +
  coord_flip()

pie chart

base R
pie(table(mtcars$cyl))
ggplot
ggplot(
  mtcars, aes(x = "", fill = as.factor(cyl))) +
  geom_bar() +
  coord_polar(theta = "y")

Quantitative univariable analysis

histogram

base R
hist(mtcars$mpg)
ggplot
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5) # binwidth = bar widths

density plot

base R
plot(density(mtcars$mpg))
ggplot
ggplot(mtcars, aes(x = mpg)) +
  geom_density()

Categorical bivariable analysis

percent, grouped and stacked frequency bar chart

base R
t <- table(mtcars$cyl, mtcars$am)

barplot(t, beside = TRUE) # grouped

barplot(t) # stacked

# percent: convert each column of counts to percentages of its column total
percentage <- apply(t, 2, function(x) { x * 100 / sum(x, na.rm = TRUE) })
barplot(percentage)
ggplot
# grouped
ggplot(
  data = mtcars,
  aes(x = factor(am), fill = factor(cyl))) +
  geom_bar(position = "dodge")

# stacked
ggplot(
  data = mtcars,
  aes(x = factor(am), fill = factor(cyl))) +
  geom_bar()

# percent
ggplot(
  data = mtcars,
  aes(x = factor(am), fill = factor(cyl))) +
  geom_bar(position = "fill")

Categ. & quant. bivariable analysis

box plot

base R
plot(ChickWeight$Diet, ChickWeight$weight)
ggplot
ggplot(ChickWeight, aes( x = Diet, y = weight)) +
  geom_boxplot()

Quantitative bivariable analysis

scatter plot

base R
plot(mtcars$wt, mtcars$mpg)
ggplot
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

3 variables (by size, color, shape; facet)

ggplot

library(tidyverse)

ggplot(mtcars, aes(x = wt, y = mpg, size = hp)) +
  geom_point()

# both col and shape will need a categorical variable
f <- as.factor(mtcars$am)
levels(f) <- c("Automatic", "Manual") # rename levels

ggplot(mtcars, aes(x = wt, y = mpg, col = f)) +
  geom_point()

ggplot(mtcars, aes(x = wt, y = mpg, shape = f)) +
  geom_point()

# both col and shape
ggplot(mtcars, aes(x = wt, y = mpg, col = f, shape = f)) +
  geom_point()

# multi-panel
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_grid(. ~ cyl)