#R cheatsheet

Created while studying for the Machine Learning course at FIB at Universitat Politècnica de Catalunya (BarcelonaTech), this cheatsheet contains the material taught during the first half of this course.

##R language


for loops:

for (a in 1:10) { print(a) } #the body runs once for each value of a

function definitions:

myFunction <- function(a, b, c = 0) {
  tmp <- a + b;
  tmp - c; #the value of the last expression is automatically returned
}
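
As a quick illustration of the two constructs above, the following sketch calls a function inside a loop. The names `add.minus` and `totals` are made up for this example.

```r
# hypothetical helper with the same shape as myFunction above
add.minus <- function(a, b, c = 0) {
  tmp <- a + b
  tmp - c  # the value of the last expression is returned
}

totals <- numeric(0)  # empty numeric vector
for (i in 1:3) {
  # c is omitted, so its default value 0 is used
  totals <- c(totals, add.minus(i, 10))
}
totals                  # 11 12 13
add.minus(1, 2, c = 5)  # -2
```

Arguments with a default value can be left out or passed by name, as in the last call.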

###Array handling

A vector can be created using c(). A sequence of integers can be created using a:b.

test.vector <- c("a", "b", "c", "d");
1:10 #equivalent to c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

An element of this vector can be accessed using square brackets. Elements are counted starting from 1. Negative indices can be used to exclude elements. Vectors can be used as indices.

test.vector[1] #returns "a"
test.vector[2] #returns "b"
test.vector[-2] #returns "a" "c" "d"
test.vector[c(1,3)] #returns "a" "c"
test.vector[-c(1,3)] #returns "b" "d"
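
Besides positions, a logical vector can also serve as an index. A small sketch (the variable `keep` is only illustrative):

```r
test.vector <- c("a", "b", "c", "d")
keep <- test.vector != "b"               # logical vector: TRUE FALSE TRUE TRUE
selected <- test.vector[keep]            # "a" "c" "d"
last <- test.vector[length(test.vector)] # last element: "d"
test.vector[2] <- "z"                    # elements can be replaced in place
test.vector                              # "a" "z" "c" "d"
```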

An empty matrix can be created using matrix(data, nrow, ncol). Fields are accessed using square brackets with indices or column names. The $ operator only works on data frames and lists, not on matrices.

test.matrix <- matrix(NA, 5, 4); #4 columns, matching the 4 column names below
colnames(test.matrix) <- c("col1", "col2", "col3", "result");
test.matrix[c(1,3), c(2,4)] #the submatrix composed of rows 1,3 and columns 2,4
test.matrix[,c("col1", "col2")] #the submatrix composed of all rows and columns 1 and 2
test.matrix[,"result"] #returns only the result column
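
The difference between matrix and data frame access can be sketched as follows. The object names `m` and `df` are assumptions made for this example.

```r
m <- matrix(1:8, nrow = 2, ncol = 4)  # filled column by column
colnames(m) <- c("col1", "col2", "col3", "result")
m[, "result"]                         # 7 8
m[1, c("col1", "col2")]               # named vector: col1 = 1, col2 = 3

df <- as.data.frame(m)                # column names are carried over
df$result                             # 7 8; `$` works on data frames
```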

##Basic functions

###Data related functions

function | description
--- | ---
`a:b` | equivalent to `seq(a, b)`
`seq(a, b, by)` | creates a sequence of numbers from a to b using step size by
`rep(x, times)` | repeats x the given number of times
`matrix(data, nrow, ncol)` | creates a new matrix
`data.frame(...)` | creates a data frame using all the parameters as columns
`diag(matrix)` | gets/sets the diagonal of this matrix
`diag(vector)` | returns a matrix with the given diagonal
`diag(scalar)` | returns an identity matrix with the given size
`diag(scalar, nrow)` | returns an nrow x nrow matrix with the scalar on the diagonal
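
A few of these constructors in action, as a minimal sketch with arbitrary values:

```r
steps <- seq(0, 1, by = 0.25)          # 0.00 0.25 0.50 0.75 1.00
reps  <- rep(c("x", "y"), times = 2)   # "x" "y" "x" "y"
I3 <- diag(3)                          # 3 x 3 identity matrix
D  <- diag(c(1, 2, 3))                 # diagonal matrix with 1, 2, 3 on the diagonal
diag(D)                                # reads the diagonal back: 1 2 3
S  <- diag(2, nrow = 2)                # 2 x 2 matrix with 2 on the diagonal
```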


function | description
--- | ---
`dim(x)` | returns the size in all dimensions
`nrow(x)` | returns the number of rows
`ncol(x)` | returns the number of columns
`names(x)` | gets/sets the names of an object
`colnames(x)` | gets/sets the column names of an object
`rownames(x)` | gets/sets the row names of an object
`summary(x)` | prints a summary

###Type handling

function | description
--- | ---
`typeof(x)` | gives information about the type of a variable
`as.integer(x)` | casts to integer
`as.double(x)` | casts to double
`as.factor(x)` | casts to factor
`as.ordered(x)` | casts to ordered factor
`as.character(x)` | casts to string
`as.data.frame(x)` | casts to a data frame
`is.*(x)` | tests if x is of the specified type, e.g. `is.numeric(x)`
`levels(factor)` | gets/sets the available levels of a factor
`droplevels(factor)` | drops unused levels of a factor
`cut(x, breaks, labels = NULL)` | creates a factor by cutting x into slices specified by breaks
`unclass(x)` | removes the class attribute; for a factor this yields the integer codes, with the levels kept as an attribute
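
The factor-related functions can be sketched on a small made-up vector (the values and labels below are assumptions for illustration only):

```r
ages <- c(3, 17, 42, 70)
# intervals are (0,18], (18,65], (65,Inf]
groups <- cut(ages, breaks = c(0, 18, 65, Inf),
              labels = c("child", "adult", "senior"))
as.character(groups)        # "child" "child" "adult" "senior"
levels(groups)              # "child" "adult" "senior"
codes <- as.integer(groups) # 1 1 2 3 (the internal codes)

young <- groups[groups != "senior"]
levels(young)               # still lists "senior"
levels(droplevels(young))   # "child" "adult"
```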

###File handling

function | description
--- | ---
`setwd(path)` | sets the working directory
`getwd()` | gets the working directory
`list.files(path = ".")` | lists the contents of a directory
`read.csv(filename, ...)` | reads a CSV file into a data frame
`save(data, file=...)` | writes data into file (binary)
`load(file)` | loads the variables saved using save

###String operations

function | description
--- | ---
`paste(...)` | concatenates strings, separated by a space by default (see sep)
`paste0(...)` | concatenates strings without separator
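
Both functions are vectorized, which the following sketch shows on arbitrary strings:

```r
folds  <- paste("fold", 1:3)                 # "fold 1" "fold 2" "fold 3"
vars   <- paste0("x", 1:3)                   # "x1" "x2" "x3"
dashed <- paste("a", "b", sep = "-")         # "a-b"
joined <- paste(c("a", "b"), collapse = "+") # "a+b" (one single string)
```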


function | description
--- | ---
`apply(x, margin, fun)` | applies function fun to every row (margin 1) or column (margin 2) of x
`sum(x)` | calculates the sum of a vector/matrix
`mean(x)` | calculates the mean of a vector/matrix
`median(x)` | calculates the median
`quantile(x, probs = seq(0,1,0.25))` | returns the quantiles of a variable
`range(x)` | returns the minimum and maximum of x
`cor(x)` | calculates the correlation matrix of the columns of a data frame
`choose(n, k)` | calculates binomial coefficients
`ginv(x)` | calculates the Moore-Penrose generalized inverse. library: MASS
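
A short sketch of the margin argument of apply and some of the other helpers, using a small made-up matrix:

```r
m <- matrix(1:6, nrow = 2)       # rows: (1, 3, 5) and (2, 4, 6)
row.sums  <- apply(m, 1, sum)    # margin 1 = rows: 9 12
col.means <- apply(m, 2, mean)   # margin 2 = columns: 1.5 3.5 5.5
range(m)                         # 1 6
median(c(1, 3, 10))              # 3
choose(5, 2)                     # 10
```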

###Other useful functions

function | description
--- | ---
`which(x)` | returns the indices of the TRUE values in a boolean vector
`replicate(n, expr)` | evaluates expr n times and returns the results combined into a vector
`rm(x)` | deletes a variable
`ls()` | enumerates all defined variables

Remove all variables: rm(list = ls())

##Handling missing data

function | description
--- | ---
`na.omit(x)` | removes rows with NAs from a data frame
`addNA(x)` | for factors: adds NA as a new level
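
A minimal sketch of both functions on made-up data (the names `df`, `complete` and `f` are assumptions for this example):

```r
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
is.na(df$x)               # FALSE TRUE FALSE
complete <- na.omit(df)   # only row 1 has no NA
nrow(complete)            # 1

f <- addNA(factor(c("a", NA, "b")))
levels(f)                 # "a" "b" NA
```

With NA as its own level, missing values are treated as a regular category instead of being dropped.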


###Random Numbers

function | description
--- | ---
`set.seed(x)` | sets the random seed
`runif(n, min = 0, max = 1)` | samples the uniform distribution
`rbinom(n, size, prob)` | samples the binomial distribution
`rnorm(n, mean, sd)` | samples the normal distribution
`rmvnorm(n, mean, sigma)` | samples a multivariate normal distribution. library: mvtnorm
`rpois(n, lambda)` | samples the Poisson distribution
`sample(vector, n)` | draws n samples from vector (without replacement by default)
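
Setting the seed makes random draws reproducible. A small sketch (the seed value 42 is arbitrary):

```r
set.seed(42)
a <- rnorm(5, mean = 0, sd = 1)
set.seed(42)                       # same seed again
b <- rnorm(5, mean = 0, sd = 1)
identical(a, b)                    # TRUE: same seed, same numbers

u <- runif(100, min = 2, max = 3)  # all values lie between 2 and 3
s <- sample(1:10, 3)               # 3 distinct values from 1..10
```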


function | description
--- | ---
`chisq.test(x, y)` | performs a chi-squared test
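
Instead of two factors, chisq.test also accepts a contingency table. The sketch below uses invented counts; the object names are assumptions for illustration.

```r
# hypothetical 2 x 2 contingency table of counts
counts <- matrix(c(20, 30, 30, 20), nrow = 2,
                 dimnames = list(group = c("A", "B"),
                                 outcome = c("yes", "no")))
res <- chisq.test(counts)
res$p.value    # small values speak against independence
res$expected   # expected counts under independence (25 in every cell here)
```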


function | description
--- | ---
`plot(x)` | plots x (in a hopefully useful way)
`hist(x)` | draws a histogram
`boxplot(x)` | creates a box plot
`barplot(x)` | creates a bar plot
`pie(x)` | creates a pie chart
`pairs(x)` | draws pairwise scatter plots of all variables
`abline(h=...)` | draws a horizontal line
`abline(v=..., lty="dashed")` | draws a dashed vertical line
`curve(f)` | draws a function
`title(new.title)` | sets the title of a plot
`text(x, y=NULL, labels)` | adds text to a plot
`legend(...)` | adds a legend
`graphics.off()` | resets/closes all graphic devices
`par(...)` | sets graphical parameters

Graphical parameters:

  • lty: line type
  • col: color, e.g. blue, red, ..., #ffcc00 (hex RGB)
  • bg: background color
  • cex: font size (scaling factor)
  • cex.axis: font size on axes
  • xlog, ylog: use logarithmic axes
  • xlab, ylab: label of x-/y-axis
  • las: orientation of axis labels: parallel (0), horizontal (1), perpendicular (2), vertical (3)
  • mfrow=c(rows, cols): subdivides the area into a grid for multiple plots
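
Several of these parameters combined, as a minimal sketch with arbitrary data. par returns the previous settings, which makes it easy to restore them afterwards.

```r
old.par <- par(mfrow = c(1, 2), cex = 0.8)  # 1 row, 2 columns of plots
x <- seq(-3, 3, by = 0.1)
plot(x, x^2, type = "l", col = "blue", lty = "dashed", las = 1,
     xlab = "x", ylab = "x squared")
abline(v = 0, col = "#ffcc00")              # vertical line at x = 0
hist(x, main = "", xlab = "x")              # goes into the second panel
par(old.par)                                # restore the previous settings
```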

Plot histogram together with normal estimation:

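hist.with.normal <- function (x, xlabel=deparse(substitute(x)), ...) {
  h <- hist(x,plot=FALSE, ...) #compute the histogram without drawing it
  s <- sd(x)
  m <- mean(x)
  ylim <- range(0,h$density,dnorm(0,sd=s)) #leave room for the peak of the normal density
  hist(x,freq=FALSE,ylim=ylim,xlab=xlabel, main="", ...)
  curve(dnorm(x,m,s),add=TRUE) #overlay the estimated normal density
}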


library: cclust

function | description
--- | ---
`cclust(coordinates, K, iter.max=100, method="kmeans", dist="euclidean")` | performs a k-means clustering
`clustIndex(clustering, coordinates, index="calinski")` | calculates the Calinski index for a clustering obtained using cclust

##Model fitting

function | description
--- | ---
`table(...)` | creates a cross table of counts
`prop.table(table)` | converts a cross table into proportions
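
Cross tables are handy for comparing predictions with the true classes. A sketch with invented labels (`truth` and `pred` are made-up vectors):

```r
truth <- c("a", "a", "b", "b", "b")
pred  <- c("a", "b", "b", "b", "a")
ct <- table(truth, pred)   # rows: truth, columns: pred
ct["a", "a"]               # 1 case of class a predicted as a
prop.table(ct)             # every cell divided by the total (5)
prop.table(ct, 1)          # proportions within each row
accuracy <- sum(diag(prop.table(ct)))   # 0.6
```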


function | description
--- | ---
`lsfit(x, y)` | linear least squares fit y = a*x + b
`lda(x, grouping, prior=..., CV=...)` | performs LDA; x: input data, grouping: group assignments, CV: if TRUE, LOOCV is performed and the predictions are returned instead of the model. library: MASS
`qda(x, grouping, prior=..., CV=...)` | performs QDA with the same parameters as lda. library: MASS
`partimat(x, grouping, method)` | applies LDA/QDA in pairs of dimensions and plots the decision regions. library: klaR

draw the linear regression line:
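
One possible way to do this, sketched on simulated data. The data and the names `fit` and `coefs` are assumptions for illustration, not taken from the course material.

```r
set.seed(1)
x <- 1:20
y <- 2 * x + 1 + rnorm(20)   # made-up data around the line y = 2x + 1
fit <- lsfit(x, y)
coefs <- fit$coefficients    # intercept first, slope second
plot(x, y)
# abline(a, b) draws the line y = a + b * x
abline(a = coefs[[1]], b = coefs[[2]], col = "red")
```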


LDA example:

library(MASS) #provides lda
lda.model <- lda(x=Crabs, grouping=Crabs.class)
lda.model #we can directly inspect the model
plot(lda.model) #we can plot it
#project the data into the new space
loadings <- as.matrix(Crabs) %*% as.matrix(lda.model$scaling)
ct <- table(Crabs.class, predict(lda.model, Crabs)$class)
sum(diag(prop.table(ct))) #total proportion of correct predictions
prediction <- predict(lda.model, newdata=...)
prediction$class #the predicted classes
prediction$posterior #the posteriors
prediction$x #the discriminants


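function | description
--- | ---
`knn(train, test, cl, k=1)` | performs a k-nearest-neighbour classification (used here for imputation); train: training inputs, test: inputs to classify, cl: classes of the training inputs. Read the note below! library: class
`knn.cv(train, cl, k=1)` | performs a LOOCV for kNN. library: class

knn returns the imputed values as a factor. Casting it directly to numbers does not work correctly, because that yields the internal level codes.
Instead, the results must be cast to strings first:

results <- as.double(as.character(knn(train, test, cl)))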
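
The pitfall can be reproduced without knn. In the following sketch, the factor `f` merely stands in for a knn result:

```r
f <- factor(c("10", "20", "10", "5"))   # levels are sorted as strings: "10" "20" "5"
wrong <- as.double(f)                   # 1 2 1 3: the internal level codes
right <- as.double(as.character(f))     # 10 20 10 5
same  <- as.double(levels(f))[f]        # 10 20 10 5, an equivalent idiom
```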

###Naive Bayes

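library(e1071) #provides naiveBayes
model <- naiveBayes(Class ~ ., data=..., laplace=...)
predict(model, newdata) #returns the predicted classes
predict(model, newdata, type = "raw") #returns the posterior probabilities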

###(Generalized) Linear Models

function | description
--- | ---
`lm(formula)` | fits a linear model
`lm.ridge(formula, lambda=...)` | fits a linear model using ridge regression. library: MASS
`glm(formula, family=...)` | fits a generalized linear model
`glm(formula, family=binomial(link=logit))` | performs a logistic regression
`glm(formula, family=poisson(link=log))` | performs a Poisson regression
`glm(formula, family=gaussian)` | performs an ordinary linear regression (same fit as lm)

An additional data argument is necessary if the variables in the formula are not already defined in the workspace.

ridge regression
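
A minimal sketch of a ridge regression with lm.ridge. The mtcars data set (shipped with R), the chosen predictors and the lambda grid are arbitrary choices for illustration.

```r
library(MASS)
# fit one ridge model per lambda value on the grid
ridge <- lm.ridge(mpg ~ wt + hp + disp, data = mtcars,
                  lambda = seq(0, 10, by = 0.5))
select(ridge)                                # prints the HKB, L-W and GCV choices of lambda
best <- ridge$lambda[which.min(ridge$GCV)]   # lambda with the smallest GCV score
best.coef <- coef(ridge)[which.min(ridge$GCV), ]  # intercept and slopes at that lambda
```

With lambda = 0 the fit coincides with ordinary least squares; larger values shrink the coefficients.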


GLM Example:

glm.res <- glm(target ~ x+y+z, family = binomial(link=logit)) #target: the binary response
summary(glm.res) #shows more information than just "glm.res"
glm.res$coefficients #access the coefficients
exp(glm.res$coefficients["x"]) #factor by which the odds change when x increases by 1
exp(glm.res$coefficients) #same for all coefficients
ord <- predict(glm.res, data.frame(x=newx, y=newy, z=newz),type="response") #returns the predicted probabilities; type="link" returns the log-odds
step(glm.res) #stepwise selection by AIC: tries to simplify the model by removing the least important variables
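
The relation between the two prediction types can be checked on a built-in data set. The sketch below assumes mtcars and the model am ~ wt purely for illustration.

```r
# mtcars ships with R; am (0 = automatic, 1 = manual) is the binary response
fit <- glm(am ~ wt, data = mtcars, family = binomial(link = logit))
odds.ratio <- exp(coef(fit)["wt"])   # factor applied to the odds per unit of wt

new.cars <- data.frame(wt = c(2.5, 3.5))
log.odds <- predict(fit, new.cars, type = "link")      # linear predictor
probs    <- predict(fit, new.cars, type = "response")  # probabilities in (0, 1)
all.equal(probs, plogis(log.odds))   # TRUE: response = inverse logit of link
```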
