gustavoalbuquerquebr/r.md

## r.md

      
            
          
Ubuntu Install
General observations
Basics

help
Comments
clear console
list available data sets
view data set as table
Declare variables and functions
function

functions with ellipsis


variable info
Operators
if

ifelse() function


for loop
while
switch
get user input
filter data
apply


Data Types

vectors

vector of single value
vector of multiple values


list
matrix
array
factor
dataFrame


Subsetting

subsetting symbols

single brackets
double brackets
dollar sign


atomic vectors

recycling rule


lists
matrices and arrays
Data frames and tibbles


regular expression
Tidyverse library

installation
use/load
pipe operator
dplyr
tidyr

gather, pivot_longer()
spread, pivot_wider()
separate
unite
extract


readr

Import

RStudio
code


Export


Descriptive statistics

table()
min, median, mean, max, quantile

summary()


cor()


Statistical model

linear regression


Plot

base R vs ggplot
basic ggplot

save plot to variable and then transforming it
customization
zoom, coord_cartesian
fill areas under plot

base R
ggplot


Categorical univariable analysis

frequency bar chart

base R
ggplot


Cleveland dot plot

base R
ggplot


pie chart

base R
ggplot


Quantitative univariable analysis

histogram

base R
ggplot


density plot

base R
ggplot


Categorical bivariable analysis

percent, grouped and stacked frequency bar chart

base R
ggplot


Categ. & quant. bivariable analysis

box plot

base R
ggplot


Quantitative bivariable analysis

scatter plot

base R
ggplot


3 variables (by size, color, shape; facet)

ggplot


Ubuntu Install


sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/'
sudo apt update
sudo apt install r-base

General observations


get help by prefixing a function name, data set, or symbol with ?, e.g. ?mean
printing is implicit, so you usually don't need to write print()
counting starts at 1, so the first item of a vector is x[1]
in function calls, you can specify the arguments by name or order, e.g.

plot(iris$Species, iris$Petal.Width), plot(x = iris$Species, y = iris$Petal.Width)


vectorized language, therefore there's much less need to iterate over objects, e.g.

to get double of each element of this vector x <- c(3, 5, 8), you only need x * 2


AND and OR comparisons can be written with single or double signs (&, &&, |, ||), which behave differently: the single forms work element by element, the double forms work on single values

Basics

help


? = prefix a function, data set, or library name with a question mark to get info about it, e.g. ?plot

Comments


# = single-line comments
R doesn’t support multi-line comments

clear console


cat("\014") = prints the form feed character, which has the same effect as pressing "CTRL + L" in the console

list available data sets


data() = lists all available data sets, including from libraries (if they're loaded)

view data set as table


View(data set)

Declare variables and functions

NOTE: it's also permissible to assign variables and functions with the equal sign (=)

variables: x <- 20
functions: myF <- function() {...}
vectors: x <- c(5, 8, 12)

function

functions with ellipsis

add <- function(...) {
  args <- list(...)
  sum <- 0
  
  for (n in args) {
    sum <- sum + n
  }
  
  return(sum)
}

add(2, 3, 5, 4)
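
The ellipsis can also be forwarded as-is to another function. A minimal sketch, where add2 and count_args are hypothetical names used only for illustration; sum() and ...length() are base R (the latter assumes R 3.5 or newer):

```r
# hypothetical variant of add(): forward every argument to base sum()
add2 <- function(...) {
  sum(...)
}

add2(2, 3, 5, 4) # 14

# ...length() reports how many arguments were passed through the ellipsis
count_args <- function(...) {
  ...length()
}

count_args("a", "b", "c") # 3
```
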
variable info


class() or typeof() = class() returns the class of the object and typeof() its internal storage type; for a plain number, class() says 'numeric' and typeof() says 'double'
str() = short for structure, displays the internal structure of the given object
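
A short sketch of how these three functions answer for a few objects; the variable names x and y are arbitrary, and iris is the built-in data set:

```r
x <- 20   # numbers are stored as double unless stated otherwise
y <- 20L  # the L suffix requests an integer

class(x)  # "numeric"
typeof(x) # "double"
class(y)  # "integer"
typeof(y) # "integer"

# for compound objects the two functions answer different questions
class(iris)  # "data.frame" (how the object behaves)
typeof(iris) # "list"       (how it is stored internally)

str(y)    # int 20
```
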

Operators


%%, %/% = remainder and integer quotient
: = creates a sequence of consecutive numbers, e.g. 1:5
%in% = tests whether an element belongs to a vector
& |, && || =

single operators compare the vectors element by element and return a vector of logical values (TRUE or FALSE)
double operators expect single (length-one) values and return a single logical value; older versions of R silently used only the first element of longer vectors, but since R 4.3.0 that is an error


x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, FALSE, FALSE)

print(x & y) # TRUE FALSE FALSE
print(x[1] && y[1]) # TRUE
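
The arithmetic, sequence, and membership operators from this section can be tried the same way. A small sketch with arbitrary values; the variable names are only for illustration:

```r
r <- 17 %% 5    # 2, remainder of the division
q <- 17 %/% 5   # 3, integer quotient
s <- 3:7        # 3 4 5 6 7

one <- 5 %in% c(3, 5, 8)        # TRUE
many <- c(1, 5) %in% c(3, 5, 8) # FALSE TRUE, one answer per element on the left
```
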
if

ifelse() function


ifelse() is a vector equivalent form of the if...else statement

x <- c(3, 5, 8, 12)

# ifelse(test, yes, no)
# returns a value with the same shape as test, usually a vector
# filled with elements selected from either yes or no
# depending on whether the element of test is TRUE or FALSE
ifelse(x %% 2 == 0, "even", "odd")
for loop

# loop 1 through 10 (inclusive)
for (n in 1:10) {
  print(n)
}

# loop vector elements
x <- c(5, 8, 12, 15)
for (n in x) {
  print(n)
}
while

x <- 1

while (x <= 10) {
  print(x)
  x <- x + 1
}
switch

color <- "b"
switch(color, "r" = "red", "b" = "blue", "unknown")
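
When the first argument is a string, the unnamed last argument acts as the default. A minimal sketch wrapping the call in a helper; color_name is a hypothetical name, not something defined elsewhere in these notes:

```r
# hypothetical helper around the switch() call shown above
color_name <- function(code) {
  switch(code, "r" = "red", "b" = "blue", "unknown")
}

color_name("b") # "blue"
color_name("x") # "unknown" (no name matched, so the default is returned)
```
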
get user input


x <- scan() = reads values typed in the console (numbers by default) until an empty line is entered

filter data


iris$Petal.Width[iris$Species == "setosa"]
plot(iris$Petal.Width[iris$Species == "setosa" | iris$Species == "virginica"])

apply

double <- function(x) {
  return(x * 2)
}
x <- matrix(c(3, 5, 8, 12), nrow = 2)

# apply(X, MARGIN, FUN, …)
# x = matrix; MARGIN = 1 for rows, 2 for cols; FUN = function to apply 
apply(x, 2, double)
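
The effect of MARGIN is easier to see with a function that collapses each row or column to one value. A small sketch using base sum() on the same numbers; m, row_sums and col_sums are illustrative names:

```r
m <- matrix(c(3, 5, 8, 12), nrow = 2)
# m is filled column by column:
#      [,1] [,2]
# [1,]    3    8
# [2,]    5   12

row_sums <- apply(m, 1, sum) # MARGIN = 1, one result per row: 11 17
col_sums <- apply(m, 2, sum) # MARGIN = 2, one result per column: 8 20
```
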
Data Types


In R, everything is an object

vectors

vector of single value

R doesn't have primitive data types in the way that other languages do. In R even the simplest numeric value is an example of a vector.

used often:

logical = TRUE, FALSE
numeric/double = numbers with or without a decimal part (stored as double by default)
character = enclosed with quotes (single or double)


not used often:

integer = declare explicitly with x <- 10L
complex = complex numbers, e.g. 3 + 2i
raw = created with charToRaw()


vector of multiple values


must contain only one data type
created with c()
e.g. numeric vectors x <- c(5, 8, 12)
index starts at 1
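
Because a vector holds a single data type, c() converts mixed input to the most general type present. A brief sketch; v1 and v2 are arbitrary names:

```r
v1 <- c(1, TRUE, FALSE) # logical values become numbers: 1 1 0
v2 <- c(1, "a", TRUE)   # everything becomes text: "1" "a" "TRUE"

class(v1) # "numeric"
class(v2) # "character"
```
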

list


can contain many different types of elements
like vectors, functions, lists...
l <- list(c(3, 5, 8), "my string...", TRUE, list("a", "b"), myFunction)
x <- list(a = "aaa", b = "bbb") = can have named elements

matrix


matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
like vectors, matrices store information of the same data type
two-dimensional
e.g. m <- matrix(c(3, 5, 8, 12, 15, 18), 2, 3)
m[1, 2] = access a specific element (row 1, column 2)
m[1,] = access all of the row 1, m[,2] = access all of the column 2
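
Since byrow defaults to FALSE, the data fills the matrix column by column. A sketch comparing both fill orders with the same numbers; m_by_row is an illustrative name:

```r
m <- matrix(c(3, 5, 8, 12, 15, 18), nrow = 2, ncol = 3)
#      [,1] [,2] [,3]
# [1,]    3    8   15
# [2,]    5   12   18
m[1, 2] # 8

m_by_row <- matrix(c(3, 5, 8, 12, 15, 18), nrow = 2, ncol = 3, byrow = TRUE)
#      [,1] [,2] [,3]
# [1,]    3    5    8
# [2,]   12   15   18
m_by_row[1, 2] # 5
```
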

array


while matrices are two-dimensional, arrays can be any number of dimensions
store only one data type
array(c('green','yellow'), dim = c(2,3,2)) = creates 2 matrices with 2 rows and 3 columns each
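
Array elements are accessed with one index per dimension. A minimal sketch of the array above; the two values are recycled to fill all 12 cells, and a is an arbitrary name:

```r
a <- array(c("green", "yellow"), dim = c(2, 3, 2))

dim(a)     # 2 3 2
a[1, 1, 1] # "green"
a[2, 3, 2] # "yellow"
a[, , 1]   # the first 2 x 3 matrix
```
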

factor


created from a vector of categorical values; it stores the vector along with its distinct values (levels)

v <- c("pineapple", "banana", "banana", "apple", "pineapple", "banana")
f <- factor(v)

print(f) # print the vector and its levels (each distinct value)
print(nlevels(f)) # print how many distinct values there are
print(levels(f)) # print each distinct value
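
Internally a factor stores one integer code per element, pointing into the sorted levels. A short sketch that makes this visible; it redefines v and f so it runs on its own:

```r
v <- c("pineapple", "banana", "banana", "apple", "pineapple", "banana")
f <- factor(v)

levels(f)              # "apple" "banana" "pineapple" (sorted alphabetically)
codes <- as.integer(f) # 3 2 2 1 3 2, the position of each value in levels(f)
counts <- table(f)     # apple 1, banana 3, pineapple 2
```
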
dataFrame


is a tabular structure, similar to a matrix, whose columns can hold different data types
columns are variables and rows are observations

df <- data.frame(
  Name = c("John", "Matt"),
  Age = c(25, 27),
  City = c("Boston", "NY")
)

print(nrow(df))
print(ncol(df))
print(dim(df)) # get both nrow and ncol
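
Each column keeps its own type, which can be checked column by column. A sketch on the same data; stringsAsFactors = FALSE is passed explicitly as an assumption so that text columns stay character in older R versions too:

```r
df <- data.frame(
  Name = c("John", "Matt"),
  Age = c(25, 27),
  City = c("Boston", "NY"),
  stringsAsFactors = FALSE # keep text as character in every R version
)

col_types <- sapply(df, class)
col_types    # Name "character", Age "numeric", City "character"
mean(df$Age) # 26
```
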
Subsetting


Subsetting in R is a useful indexing feature for accessing object elements,
it can be used to select and filter variables and observations.

subsetting symbols

single brackets


[] = get a subset of length 1 or more

usually, object and its subset are of the same type; therefore, subset of vector will be a vector, subset of a data frame will be a data frame...

however, there's one inconsistency: if the subset of a matrix or data frame can be reduced (e.g. a single column), R simplifies the result to the lowest possible dimension, so subset and container may have different types; pass drop = FALSE to prevent this


both names and indices can be used
negative integers indicate exclusion
variables are interpolated


double brackets


[[]] = extract only one element (not necessarily just one value); i.e. vectors yield single value, data frames yield column vector

names or indices can be used
variables are interpolated
usually, not the same type as the object container
dimension of returned value isn't necessarily 1


dollar sign


$ = special case of [[ in which you access a single item by a name

therefore, iris$Species and iris[["Species"]] are equivalent
cannot use integer indices
if the name contains special characters, it must be enclosed in backticks
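
The three symbols can be compared on the built-in iris data set. A minimal sketch showing the type each one returns; the variable names are only for illustration:

```r
single <- class(iris["Species"])   # "data.frame", [ keeps the container type
double <- class(iris[["Species"]]) # "factor", [[ extracts the column itself
dollar <- class(iris$Species)      # "factor", same as [[ with a name

# with row, column syntax a single column is simplified to a vector by default
dropped <- class(iris[, "Species"])             # "factor"
kept <- class(iris[, "Species", drop = FALSE])  # "data.frame"
```
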


atomic vectors

a <- c(3, 5, 8, 12)

# accessing with numbers
a[1]
a[c(1, 3)] # positive indices select the specified elements
a[-c(2, 4)] # negative indices exclude the specified elements

# accessing with logical values
a[c(TRUE,FALSE,TRUE,FALSE)] # select elements where the value is TRUE
a[a > 5] # therefore, this is possible
recycling rule


if two vectors are of unequal length, the shorter one will be recycled in order to match the longer vector
if longer object length is not a multiple of shorter object length, the program will throw a warning but it'll still return a result

a <- c(2, 3, 5, 8)
b <- c(1, 2)
a * b # result: 2, 6, 5, 16
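
When the lengths do not divide evenly, the warning can be captured to confirm that a result is still produced. A sketch using base tryCatch() and suppressWarnings(); msg and res are illustrative names:

```r
a <- c(2, 3, 5, 8, 13)
b <- c(1, 2)

# capture the warning text instead of printing it
msg <- tryCatch(a * b, warning = function(w) conditionMessage(w))
msg

# the result is still computed; b is reused as 1 2 1 2 1
res <- suppressWarnings(a * b) # 2 6 5 16 13
```
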
lists

x <- list("a", "b", "c")

# single bracket returns an object of class 'list'
class(x[2]) # list
# double brackets return a single element (not of class 'list')
class(x[[2]]) # character

# named lists
y <- list(f = 1:3, s = "a", t = 4:6)
y$f
y[["f"]]
matrices and arrays

m <- matrix(c(3, 5, 8, 12, 15, 18, 21, 25, 30), nrow = 3, byrow = TRUE)

m[1,] # entire first row
m[, 1] # a blank index selects all rows or columns; here, the entire first column
m[2, 1] # element at second row, first column
m[1:2, 2:3] # get rows 1 to 2, columns 2 to 3
m[c(1, 3), c(1, 3)] # get rows 1 and 3, their columns 1 and 3

# using a 2 column matrix to subset a matrix
# each row of the matrix will specify a row and a column
select <- matrix(c(1, 1, 1, 3, 3, 1, 3, 3), ncol = 2, byrow = TRUE)
m[select] # result: 3 8 21 30
Data frames and tibbles

mtcars[3] # single index will return specified column(s)
mtcars[3, 1] # two indices will behave like matrices, first is row and second is column

mtcars$mpg # access a column by name
mtcars[["mpg"]] # same column, using double brackets
mtcars[3, "mpg"] # access by both, index and name; third row, column named "mpg"
mtcars$mpg[3] # access by both, name and index

# filtering rows by a column's values
# the column index (second argument) is left blank to return all columns
iris[iris$Species == "setosa", ]
iris[iris$Petal.Width > 0.5 & iris$Species == "setosa", ] # multiple filters
regular expression

# grepl returns a vector of logical values
g <- grepl("Toyota", rownames(mtcars))
mtcars[g, ]

# grep returns a vector with the indices that contain a match
g <- grep("Toyota", rownames(mtcars))
mtcars[g, ]

# using grepl together with dplyr
library(tidyverse)
iris %>%
  filter(grepl("setosa", Species))
Tidyverse library


tidyverse is a set of packages that work in harmony (they share a common API) and make everyday data analysis easier

installation


sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev = Ubuntu system packages needed first

install.packages("tidyverse") = then install the package from within R


library(tidyverse) = to load a library

use/load


library(tidyverse)

from now on, any tidyverse function (like dplyr::filter) can be called without the dplyr:: prefix
you only need the dplyr:: prefix if there are name collisions and you need to call the function that was masked


pipe operator


%>% = simplifies chaining, that is, passing the same data through several functions

library(tidyverse)

# without pipe ('.data')
f <- filter(.data = mpg, model == "a4")
s <- select(.data = f, manufacturer, model, year)
s

# using pipe
mpg %>%
  filter(model == "a4") %>%
  select(manufacturer, model, year)
dplyr


manipulate data sets

library(tidyverse)

mtcars %>%
  filter(
    mpg > 20,
    cyl == 4,
    wt < 2.5,
    grepl("Toyota", rownames(mtcars))
  ) %>%
  arrange(mpg) %>%
  select(mpg, cyl, wt)
tidyr


helps create tidy data, that is:

every column is a variable
every row is an observation
every cell is a single value


gather, pivot_longer()


lengthens data, increasing the number of rows and decreasing the number of columns
gather() is retired; the recommendation is to use pivot_longer() instead

library(tidyverse)

df <- data.frame(
  name = c("John", "Mary", "Jake"),
  a = c(7, 9, 18),
  b = c(18, 5, 3),
  c = c(32, 17, 35)
)

# 'key' and 'value' will be the names of the new cols 
# 'key' will be a categorical variable holding the 'multiple columns' names
# and 'value' will hold the 'multiple columns' values

df %>%
  # gather(key, value, ...multiple columns)
  gather("drug", "volume", a, b, c)

df %>%
  # pivot_longer(columns vector, names_to, values_to)
  pivot_longer(
    cols = c(a, b, c),
    names_to = "drug",
    values_to = "volume")
spread, pivot_wider()


widens data, increasing the number of columns and decreasing the number of rows
spread() is retired; the recommendation is to use pivot_wider() instead

library(tidyverse)

df <- data.frame(
  name = c("John", "John", "Mary", "Mary"),
  drug = c("a", "b", "a", "b"),
  volume = c(7, 18, 9, 5)
)

# spread
df %>%
  # each individual value in key
  # will be converted to a column
  spread(key = "drug", value = "volume")

# pivot_wider()
df %>%
  # each individual value in names_from
  # will be converted to a column
  pivot_wider(names_from = "drug", values_from = "volume")
separate


splits a single column into multiple columns

library(tidyverse)

df <- data.frame(
  Name = c("John", "Mary"),
  Job = c("Teacher, Designer", "Manager, Developer")
)

df %>%
  separate(
    col = "Job",
    into = c("Job 1", "Job 2"), # names of the new columns to be created
    sep = ", "
  )
unite


combines multiple columns into one

library(tidyverse)

df <- data.frame(
  Name = c("John", "Mary"),
  Job1 = c("Teacher", "Manager"),
  Job2 = c("Designer", "Developer")
)

df %>%
  # unite(col = name of new column, ...columns to unite, sep = separator)
  unite(
    col = "Jobs",
    "Job1",
    "Job2",
    sep = ", "
  )
extract


given a regular expression with capturing groups, extract() turns each group into a new column

library(tidyverse)

df <- data.frame(
  Name = c(
    "John Edwards Smith",
    "Mary Kate Miller Brown",
    "Matt Richards"
  )
)

df %>%
  extract(
    col = "Name",
    into = c("First name", "Last name"),
    regex = "([A-Za-z]*).*\\s([A-Za-z]*)" # [A-Za-z] matches letters only; [A-z] would also match some punctuation
  )
readr

Import

RStudio


in the bottom-right panel, click on the file name and select 'Import', 'From text(readr)'
as you configure your data, the corresponding line of code is shown

code

# read_csv is from readr (included in tidyverse)
library(tidyverse)

setwd("~/Dev/r/")
hp <- read_csv("hp.csv")
hp
Export

# write_csv is from readr (included in tidyverse)
library(tidyverse)

setwd("~/Dev/r/")
write_csv(iris, "iris.csv")
Descriptive statistics


summarize and describe a given data set

table()


create a frequency table from a categorical variable (column)
table(iris$Species)
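
Counts can be turned into relative frequencies with base prop.table(). A small sketch on the same column; counts and shares are illustrative names:

```r
counts <- table(iris$Species)
counts                       # 50 observations per species

shares <- prop.table(counts) # relative frequencies, summing to 1
round(shares * 100, 1)       # 33.3 for each species
```
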

min, median, mean, max, quantile

min(mtcars$cyl)
median(mtcars$cyl)
mean(mtcars$cyl)
max(mtcars$cyl)
quantile(mtcars$cyl)

# get all at once
summary(mtcars$cyl)
summary()


summary(iris), summary(iris$Petal.Width) = details about an object

if variable is categorical, result is a frequency table
if variable is quantitative, result is a table containing measures of center (mean, median) and measures of spread (min, 1st qu., 3rd qu., max)


cor()


correlation

# correlation between weight and miles per gallon
cor(mtcars$wt, mtcars$mpg) # result: about -0.87 (strong negative correlation)
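
Given a data frame of numeric columns, cor() returns the whole correlation matrix at once. A sketch on three mtcars columns picked arbitrarily for illustration:

```r
vars <- mtcars[, c("mpg", "wt", "hp")]

cm <- cor(vars) # 3 x 3 matrix, one row and one column per variable
round(cm, 2)

cm["wt", "mpg"] # same value as cor(mtcars$wt, mtcars$mpg)
```
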
Statistical model


is a set of mathematical equations based on probabilities and used to describe the relationship between two or more variables
purpose: description, inference (estimates the parameters of a larger population), comparison (compare if two sets of data are different in a statistically significant way) and prediction (about new, unknown observations)

linear regression


describes the relationship between two variables: how changes in one variable affect the other
it is a linear model because it assumes the relationship follows a straight line
in this simple form, both variables are continuous numeric values
the variable in the x axis is called 'explanatory variable', and the one in the y axis is called 'outcome variable'
linear predictor function - y = m * x + b

m is the slope of the line (for each unit increase in x, how much does y increase)
b is the y intercept (the y value when x is equal to 0)


plot(iris$Petal.Length, iris$Petal.Width)

# lm is the R function to create linear models
model <- lm(
  formula = Petal.Width ~ Petal.Length,
  data = iris
)

# draw straight line on top of the plot
lines(
  x = iris$Petal.Length,
  y = fitted(model), # fitted values, one per observation
  col = "red",
  lwd = 3
)

# predict new values from model
predict(
  object = model,
  newdata = data.frame(
    Petal.Length = c(2, 5, 7) # arbitrary values
  )
)
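
The slope m and intercept b from the formula above can be read off the fitted model with coef(), and the predictions reproduced by hand. A minimal sketch that refits the same model so it runs on its own; manual and predicted are illustrative names:

```r
model <- lm(Petal.Width ~ Petal.Length, data = iris)

b <- coef(model)[["(Intercept)"]]  # y intercept
m <- coef(model)[["Petal.Length"]] # slope

new_x <- c(2, 5, 7) # arbitrary values
manual <- m * new_x + b
predicted <- predict(model, newdata = data.frame(Petal.Length = new_x))

all.equal(manual, unname(predicted)) # TRUE
```
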
Plot


a plot is a graphical technique for representing a data set
usually a graph showing one variable or the relationship between two or more variables
in R, plotting is usually done

with base R, that is, without any third-party library
with a library called ggplot2 (included with tidyverse)


base R vs ggplot


base R mostly uses the plot(x, y) function

but there are also the barplot() and hist() functions


ggplot always uses the

ggplot(data = data, mapping = aes()) function,
followed by the + operator
and then layers, scales, facets and/or coordinates


basic ggplot

save plot to variable and then transforming it

library(tidyverse)

# save plot to variable
# only save it, don't display it
p <- ggplot(mtcars, aes(x = cyl)) +
  geom_bar()

# won't save the flipped plot into the variable
# only displays it
p + coord_flip()
customization

library(tidyverse)

# needed for third variable in aes()
f <- factor(mtcars$am)
levels(f) <- c("Automatic", "Manual")

ggplot(mtcars, aes( x = wt, y = mpg, shape = f, color = f )) +
  geom_point() +
  labs(
    title = "WT VS MPG",
    x = "weight",
    y = "miles per gallon",
    # change legend title with the aes names
    shape = "Transmission",
    color = "Transmission"
  ) +
  theme( # theme() customize non-data components
    plot.title =
      element_text( face = "bold",
                    hjust = 0.5,
                    margin = margin(8, 0, 16, 0)),
    axis.title =
      element_text( face = "italic"),
    axis.title.x = 
      element_text( margin = margin(8, 0, 4, 0) ),
    axis.title.y = 
      element_text( margin = margin(0, 8, 0, 4) ),
    axis.ticks = element_blank() # remove ticks
  )
zoom, coord_cartesian

library(tidyverse)

ggplot(ChickWeight, aes(x = weight)) +
  geom_histogram() +
  coord_cartesian(xlim = c(200, 300)) # zoom
fill areas under plot

base R

df <- data.frame(
  Month = 1:12,
  Num = as.vector(AirPassengers)[1:12]
)

plot(df$Num, type = "l")

polygon(c(min(df$Month), df$Month, max(df$Month)), c(0, df$Num, 0), col = "steelblue")
ggplot

df <- data.frame(
  Month = 1:12,
  Num = as.vector(AirPassengers)[1:12]
)

ggplot(df, aes(x = Month, y = Num)) +
  # geom_area() + # ymin fixed to 0, which would make plot very high
  geom_ribbon(aes(ymin = 100, ymax = Num)) +
  geom_line()
Categorical univariable analysis

frequency bar chart


x axis: categorical variable
y axis: frequency/count

base R

# plot()
plot(iris$Species)

# barplot()
t <- table(iris$Species) # creates frequency table
barplot(t)
ggplot

ggplot(iris, aes(x = Species)) +
  geom_bar()
Cleveland dot plot

base R

dotchart(table(mtcars$cyl))
ggplot

ggplot(mtcars, aes(x = cyl)) +
  # stat = the statistical transformation to use on the data for this layer
  geom_point(stat = "count") +
  coord_flip()
pie chart

base R

pie(table(mtcars$cyl))
ggplot

ggplot(
  mtcars, aes(x = "", fill = as.factor(cyl))) +
  geom_bar() +
  coord_polar(theta = "y")
Quantitative univariable analysis

histogram

base R

hist(mtcars$mpg)
ggplot

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5) # binwidth = bar widths
density plot

base R

plot(density(mtcars$mpg))
ggplot

ggplot(mtcars, aes(x = mpg)) +
  geom_density()
Categorical bivariable analysis

percent, grouped and stacked frequency bar chart

base R

t <- table(mtcars$cyl, mtcars$am)

barplot(t, beside = TRUE) # grouped

barplot(t) # stacked

# percent
percentage <- apply(t, 2, function(x) x * 100 / sum(x, na.rm = TRUE))
barplot(percentage)
ggplot

# grouped
ggplot(
  data = mtcars,
  aes(x = factor(am), fill = factor(cyl))) +
  geom_bar(position = "dodge")

# stacked
ggplot(
  data = mtcars,
  aes(x = factor(am), fill = factor(cyl))) +
  geom_bar()

# percent
ggplot(
  data = mtcars,
  aes(x = factor(am), fill = factor(cyl))) +
  geom_bar(position = "fill")
Categ. & quant. bivariable analysis

box plot

base R

plot(ChickWeight$Diet, ChickWeight$weight)
ggplot

ggplot(ChickWeight, aes( x = Diet, y = weight)) +
  geom_boxplot()
Quantitative bivariable analysis

scatter plot

base R

plot(mtcars$wt, mtcars$mpg)
ggplot

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
3 variables (by size, color, shape; facet)

ggplot

library(tidyverse)

ggplot(mtcars, aes(x = wt, y = mpg, size = hp)) +
  geom_point()

# both col and shape will need a categorical variable
f <- as.factor(mtcars$am)
levels(f) <- c("Automatic", "Manual") # rename levels

ggplot(mtcars, aes(x = wt, y = mpg, col = f)) +
  geom_point()

ggplot(mtcars, aes(x = wt, y = mpg, shape = f)) +
  geom_point()

# both col and shape
ggplot(mtcars, aes(x = wt, y = mpg, col = f, shape = f)) +
  geom_point()

# multi-panel
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_grid(. ~ cyl)