@hanbzu
Last active December 24, 2015 21:19
A reference card of useful R and data analysis (with R) snippets. I used the Rmd extension, which RStudio compiles into a didactic visual sheet with the outcome of each command.
Data analysis with R -- It's better and better now
==================================================
A nice R resource, the “cookbook for R”: http://wiki.stdout.org/rcookbook/
Installing
----------
```{r installing}
# For Linux: install a package from CRAN for all users
# by running R as root (from a shell: sudo R), then:
install.packages("mypkg")
# For a single user (from R or the RStudio console):
install.packages("mypkg", lib = "/my/own/R-packages/")
# (the lib argument is optional: it sets a personal library path)
```
Getting help
------------
First of all:
```{r help}
help.search("rnorm") # Search the help pages for occurrences of "rnorm"
args(rnorm) # Get the arguments of rnorm
```
Then Google, and as a last resort post a well-formed question on a forum. If it's about an R command:
1. What steps will reproduce the problem?
2. What is the expected output?
3. What do you see instead?
4. What version of the product (e.g. R, packages) are you using? What O.S.?
If it's about data analysis (there are more specialised places than Stack Overflow):
1. What is the question you are trying to answer?
2. What steps/tools did you use to answer it? (even code)
3. What did you expect to see? What do you see instead?
4. What other solutions have you thought about?
For reproducibility, which is not only important for getting help:
```{r reproducibility}
set.seed(1234)
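# Under the same seed, the same random draws repeat exactly:
set.seed(1234); a <- rnorm(3)
set.seed(1234); b <- rnorm(3)
identical(a, b) # TRUE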
```
When you need to debug:
```{r debugging}
traceback() # Stack trace after an error
stop("Why did I stop?") # Signal a fatal error; see also stopifnot() for assertions
```
More on that at http://stackoverflow.com/questions/4442518/general-suggestions-for-debugging-r
Useful R stuff
--------------
```{r useful}
# Create a matrix (filled column-wise)
matrix(1:20, nrow = 5)
#      [,1] [,2] [,3] [,4]
# [1,]    1    6   11   16
# [2,]    2    7   12   17
# [3,]    3    8   13   18
# [4,]    4    9   14   19
# [5,]    5   10   15   20
# Coerce a character vector to numeric
as.numeric(c("1", "7", "9"))
# Get the unique (distinct) values of a vector
x <- c(1, 1, 2, 3, 4, 4, 4)
unique(x)
# [1] 1 2 3 4
# http://stackoverflow.com/questions/7755240/list-distinct-values-in-a-vector-in-r
# Another way:
tab <- table(x)
xu <- as.numeric(names(tab))
xn <- as.vector(tab)
xu; xn
# Check if a vector contains an element
# http://stackoverflow.com/questions/1169248/r-function-for-testing-if-a-vector-contains-a-given-element
v <- c('a','b','c','e')
'b' %in% v # returns TRUE
match('b',v) # returns the first location of 'b', in this case: 2
v <- c('a','b','c')
any(v=='b')
# [1] TRUE
# Removing missing values in vector:
x <- c(1, 2, NA, 3, 4)
bad <- is.na(x)
x[!bad]
# Selecting only complete values in different vectors
# (or different columns in a data frame / matrix)
x <- c(1, 2, NA, 3, 4)
y <- c("a", NA, "b", "c", "d")
good <- complete.cases(x, y)
x[good]
y[good]
# Combine two equal-length vectors into a data frame:
data.frame(starters = starters, enders = enders)
# Count the number of TRUE values in a logical vector with TRUE, FALSE, NA:
length(which(z))
table(z)["TRUE"]
sum(z, na.rm = TRUE) # Another way
# Keep column names when adding rows through rbind
# Don't forget that there are more efficient ways than using rbind!
# This "should work", but it doesn't:
# Create an empty data.frame with the correct names and types
# !!!! the stringsAsFactors part is VERY important
df <- data.frame(A=numeric(), B=character(), C=character(), stringsAsFactors=FALSE)
rbind(df, list(42, 'foo', 'bar')) # Messes up names!
rbind(df, list(A=42, B='foo', C='bar')) # OK...
# If you have at least one row, names are kept...
df <- data.frame(A=0, B="", C="", stringsAsFactors=FALSE)
rbind(df, list(42, 'foo', 'bar')) # Names work now...
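# A sketch of the "more efficient way" hinted at above: build the pieces
# as a list first and call rbind once, instead of growing row by row
rows <- lapply(1:3, function(i) data.frame(A = i, B = letters[i], stringsAsFactors = FALSE))
df3 <- do.call(rbind, rows) # Names and column types are kept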
# Concatenate two vectors
c(vector1, vector2)
```
Summarising data
----------------
Before using it, data should be examined and cleaned (missing values, incorrect or inconsistent values, columns not correctly parsed, etc.). That is to say, data needs to be summarised. *Look at your data!*
```{r summarising}
dim(data) # First look at the dimensions. Shows: [1] NB_ROWS NB_COLUMNS
nrow(data); ncol(data) # Another way to see row and column numbers
names(data) # Find out the variable names (column names)
quantile(data$col1) # See the ranges (identify if any value is outside of the normal range)
summary(data) # For each variable: quartiles (numeric) or counts (factors), and other info
sapply(data[1, ], class) # Apply class() to each column (find out the class of each column)
unique(data$col2) # Are the values shown the ones expected for this variable?
length(unique(data$col2)) # Is the count of unique values something we expect?
table(data$col2) # We see how many occurrences have each of the unique values
table(data$col1, data$col2) # 2 dimensional table for each of the pair of values
table(c(0, 1, 2, NA, 3, 3), useNA = "ifany") # Show missing values too (hidden by default)
any(data$col1[1:10] > 40) # Will tell me if ANY of the selected rows is greater than 40
all(data$col1[1:10] > 40) # Will tell me if ALL of them are greater than 40
sum(is.na(reviews$time_left[1:10])) # How many missing values?
# Looking at subsets
data[data$lat > 0 & data$lon > 0, c("lat", "lon")] # AND: Make sure you use a single &
data[data$lat > 0 | data$lon > 0, c("lat", "lon")] # OR: Make sure you use a single |
# Means and sums of rows/columns
rowSums(); rowMeans(); colSums(); colMeans()
colMeans(reviews, na.rm = TRUE) # Column means over ALL columns, ignoring NAs
```
Data munging or wrangling
-------------------------
Data munging or data wrangling is loosely the process of manually converting or mapping data from one "raw" form into another format that allows for more convenient consumption of the data with the help of semi-automated tools.
Mung or munge is computer jargon for "to make repeated changes which individually may be reversible, yet which ultimately result in an unintentional, irreversible destruction of large portions of the original item".
```{r munging}
# Fixing character vectors
tolower(names(cameraData)) # Or toupper()
strsplit(names(cameraData), "\\.") # Uses two backslashes because . is a special regex char
sapply(strsplit(names(cameraData), "\\."), function(x) { x[1] }) # Split & take first
sub("_", "", names(reviews)) # Substitution. BUT: only the first instance!
gsub("_", "", names(reviews)) # Substitution: every instance
# Quantitative variables in ranges
# It replaces each of the values with the range it fits in
timeRanges <- cut(reviews$time_left, seq(0, 3600, by = 600))
table(timeRanges, useNA = "ifany")
# Another way
library(Hmisc)
timeRanges <- cut2(reviews$time_left, g = 6) # Six ranges are used
# Adding an extra variable
reviews$timeRanges <- cut2(reviews$time_left, g = 6)
# Merging data when variables have common names. We must choose what to merge _BY_
merge(reviews, solutions, all = TRUE) # By default
merge(reviews, solutions, by.x = "solution_id", by.y = "id", all = TRUE) # This way it will try to match solution_id with id
# Sorting values
sort(data$reviewer_id) # A column or single vector
order(data$reviewer_id) # Returns a vector with the ordered positions (the indexes)
data[order(data$reviewer_id), ] # Use those ordering indexes to order the whole data frame
data[order(data$reviewer_id, data$id), ] # Use two criteria for ordering
# Reshaping data (melt is in the reshape2 package)
library(reshape2)
melt(misShaped, id.vars = "people", variable.name = "treatment", value.name = "value")
```
Exploratory graphics
--------------------
```{r boxplots}
boxplot(pData$AGEP, col = "blue")
# Show me the distribution of AGEP broken down by DDRS
# I will see as many boxes as DDRS has distinct values
# That is useful to see the distribution of one variable depending on
# the different values another variable can take
boxplot(pData$AGEP ~ as.factor(pData$DDRS), col = "blue")
# I can also use, x axis names (groups), colors and variable widths.
# The width of the box is proportional to the number of observations we have in each group
boxplot(pData$AGEP ~ as.factor(pData$DDRS), col = c("blue", "orange"), names = c("yes", "no"), varwidth = TRUE)
```
```{r barplots}
# Number of observations of each single occurrent value
barplot(table(pData$CIT), col = "blue")
```
```{r histograms}
# Histograms are in the frequency scale
# Take all values, slice in a scale and count how many fall into each slice
hist(pData$AGEP, col = "blue")
# The number of breakpoints can be set
hist(pData$AGEP, col = "blue", breaks = 100, main = "Age")
```
```{r density}
# It looks like a smoothed histogram
# It has the same shape, but some error appears at the boundaries
dens <- density(pData$AGEP)
plot(dens, lwd = 3, col = "blue")
# The reason why we would use densities instead of histograms is to overlay several densities
dens <- density(pData$AGEP)
densMales <- density(pData$AGEP[which(pData$SEX==1)])
plot(dens, lwd = 3, col = "blue")
lines(densMales, lwd = 3, col = "orange") # Adds more lines to the plot
```
```{r scatterplots}
# pch sets the point character (shape), cex the point size
# You can see particular problems in scatterplots
plot(pData$JWMNP, pData$WAGP, pch = 19, cex = 0.5, col = "blue")
# Use color to divide by sex (in this case pData$SEX is either 1 or 2)
plot(pData$JWMNP, pData$WAGP, pch = 19, cex = 0.5, col = pData$SEX)
# Using size
percentMaxAge <- pData$AGEP/max(pData$AGEP)
plot(pData$JWMNP, pData$WAGP, pch = 19, cex = percentMaxAge * 0.5, col = "blue")
# Overlaying lines/points
plot(pData$JWMNP, pData$WAGP, pch = 19, cex = 0.5, col = "blue")
lines(rep(100, dim(pData)[1]), pData$WAGP, col = "grey", lwd = 5) # rep repeats a value
points(seq(0, 200, length = 100), seq(0, 20e5, length = 100), col = "red", pch = 19)
# Numeric variables as factors
library(Hmisc)
# cut2 takes a continuous variable and breaks it into 5 groups with
# roughly equal numbers of observations, then assigns each point
# to one of those groups as a factor
ageGroups <- cut2(pData$AGEP, g = 5)
# We color the plot according to the 5 different age groups created
plot(pData$JWMNP, pData$WAGP, pch = 19, col = ageGroups, cex = 0.5)
# If you have a lot of points..
x <- rnorm(1e5)
y <- rnorm(1e5)
plot(x, y, pch = 19)
# Sample the values!
sampledValues <- sample(1:1e5, size = 1000, replace = FALSE)
plot(x[sampledValues], y[sampledValues], pch = 19)
# Use a smoothScatter!
# (changes color according to density)
smoothScatter(x, y)
# Use the hexbin package!
library(hexbin)
hbo <- hexbin(x, y)
plot(hbo) # Color means how many points fall in each hexagonal bin
# You could also use transparency (not in Jeff Leek video)
```
```{r qqplots}
# Quantile-quantile plots
# It is useful for comparing the distributions of two variables
x <- rnorm(20); y <- rnorm(20)
qqplot(x, y)
# We can also plot a line
abline(c(0, 1))
# If you want to see if your data are normally distributed
# you can make a QQ-plot with normal data and your data
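# For example, with base R's qqnorm/qqline:
qqnorm(y) # Your data against theoretical normal quantiles
qqline(y) # Points close to this line suggest approximate normality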
```
```{r matplot_and_spaghetti}
X <- matrix(rnorm(20 * 5), nrow = 20)
# This takes each column of the matrix as one specific line
matplot(X, type = "b")
```
```{r heatmaps}
# A sort of 2D histogram
image(1:10, 161:236, as.matrix(pData[1:10, 161:236]))
# image() transposes the data, so we need to transpose it ourselves first
newMatrix <- as.matrix(pData[1:10, 161:236])
newMatrix <- t(newMatrix)[,nrow(newMatrix):1]
image(161:236, 1:10, newMatrix)
```
```{r maps}
library(maps)
map("world") # Get map of the world
lat <- runif(40, -90, 90); lon <- runif(40, -180, 180) # Random lat and lon values
points(lon, lat, col = "blue", pch = 19) # x is longitude, y is latitude
```
```{r missing_values}
# R does not plot a point if any of its coordinates is missing
x <- c(NA, NA, NA, 4, 5, 6, 7, 8, 9, 10)
y <- 1:10
plot(x, y, pch = 19, xlim = c(0, 11), ylim = c(0,11)) # First three not shown
# See the distribution of NAs
x <- rnorm(100)
y <- rnorm(100)
y[x < 0] <- NA
boxplot(x ~ is.na(y)) # You can see where the NA values may be located
```
Expository graphics
-------------------
The idea is to choose a few among the exploratory graphics, based on the information they convey. After that, titles, axes, labels and colors must be fine-tuned in order to achieve the desired effects. A legend is also important sometimes.
```{r titles_legends }
# Axes
plot(pData$JWMNP, pData$WAGP, pch = 19, col = "blue", cex = 0.5, xlab = "Travel time (min)", ylab = "Last 12 months wages (dollars)")
# To change axis or label text size use cex.axis = 1.5 or cex.lab = 2
# Legends, single
legend(100, 20000, legend = "All surveyed", col = "blue", pch = 19, cex = 0.5)
# Legends, several variables
plot(pData$JWMNP, pData$WAGP, pch = 19, col = pData$SEX, cex = 0.5, xlab = "TT (min)", ylab = "Wages (dollars)")
legend(100, 20000, legend = c("men", "women"), col = c("black", "red"), pch = c(19, 19), cex = c(0.5, 0.5))
# For a title use, main = "title"
```
Using multiple panels:
```{r multiple_panels }
par(mfrow = c(1, 2)) # Layout of the plot grid (here: one row, two columns)
# ... and then plot
hist()
plot()
# Add text with the mtext command
mtext(text = "(a)", side = 3, line = 1)
```
Adding a figure caption
-----------------------
*Figure 1. Bolded text which describes the purpose of the whole plot.* If the figure has different components, use (a), (b), etc. to describe each of the subfigures. These captions should be enough for people to understand the graphs without reading the whole text.
Saving graphics in different formats
------------------------------------
PDF:
```{r pdf }
pdf(file = "twoPanel.pdf", height = 4, width = 8) # Height and width in inches
# Draw plot...
dev.off() # Close device (file is saved)
```
PNG
```{r png }
png(file = "twoPanel.png", height = 480, width = 600) # Height and width in pixels
# Draw plot...
dev.off() # Close device (file is saved)
```
Sometimes it's better to fine-tune the figure interactively and then save it as PDF. You can do that by plotting on the screen and then:
```{r copy2pdf }
dev.copy2pdf(file = "twoPanel.pdf")
```
Hierarchical clustering
-----------------------
Find the two observations that are closest together, merge them, and then repeat the same process at a higher level, and so on.
How do we define "close"? It is very important to define an appropriate distance measure.
continuous vars -> Euclidean distance
binary vars -> Manhattan distance
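For intuition, a minimal sketch of both measures with base R's `dist()` (the two example points are made up):

```{r distance_measures}
df <- data.frame(x = c(0, 3), y = c(0, 4))
dist(df) # Euclidean by default: sqrt(3^2 + 4^2) = 5
dist(df, method = "manhattan") # |3| + |4| = 7
```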
```{r clustering_example}
set.seed(1234); par(mar = c(0, 0, 0, 0))
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
plot(x, y, col = "blue", pch = 19, cex = 2)
text(x + 0.05, y + 0.05, labels = as.character(1:12))
# Calculate distance
dataFrame <- data.frame(x = x, y = y)
distxy <- dist(dataFrame) # Euclidean distance by default, but others can be specified (e.g. Manhattan)
hClustering <- hclust(distxy)
plot(hClustering) # It generates a 'dendrogram', a tree
```
We can cut the tree at a certain height. Choosing where to do it determines which groups we get.
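That cut can be done with base R's `cutree()` (a sketch, reusing `hClustering` from the chunk above):

```{r cut_tree}
cutree(hClustering, k = 3) # Ask for 3 groups: returns a cluster label per point
cutree(hClustering, h = 1.5) # Or cut at a given height instead
```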
And heatmaps (heatmap() clusters with complete linkage by default):
```{r heatmaps_hclust}
dataFrame <- data.frame(x = x, y = y)
set.seed(143)
dataMatrix <- as.matrix(dataFrame)[sample(1:12),]
heatmap(dataMatrix)
```
K-means clustering
------------------
A partitioning approach: fix the number of clusters in advance and calculate centroids.
```{r kdistance}
dataFrame <- data.frame(x, y)
kmeansObj <- kmeans(dataFrame, centers = 3) # I can also set the number of iterations (iter.max) and propose the initial centers
names(kmeansObj)
kmeansObj$cluster # Tells which cluster each point belongs to
par(mar = rep(0.2, 4))
plot(x, y, col = kmeansObj$cluster, pch = 19, cex = 2)
points(kmeansObj$centers, col = 1:3, pch = 3, cex = 3, lwd = 3)
```
And heatmaps
```{r heatmaps2}
set.seed(143)
dataMatrix <- as.matrix(dataFrame)[sample(1:12),]
kmeansObj2 <- kmeans(dataMatrix, centers = 3)
par(mfrow = c(1, 2), mar = rep(0.2, 4))
image(t(dataMatrix)[, nrow(dataMatrix):1], yaxt = "n") # Scrambled version
image(t(dataMatrix)[, order(kmeansObj2$cluster)], yaxt = "n") # Ordered by cluster
```
Dimension reduction
-------------------
Matrix data
```{r matrix_data}
set.seed(12345); par(mar = rep(0.2, 4))
dataMatrix <- matrix(rnorm(400), nrow = 40)
image(1:10, 1:40, t(dataMatrix)[, nrow(dataMatrix):1])
```
We can also cluster it
```{r matrix_clustered}
par(mar = rep(0.2, 4))
heatmap(dataMatrix)
```
If there were a pattern, we would see it in the clustering.
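One way this section could continue (a sketch, not from the original notes): base R's `svd()` shows how much variance each component explains; with pure noise, no single component dominates.

```{r svd_sketch}
svd1 <- svd(scale(dataMatrix))
# Proportion of variance explained by each component
plot(svd1$d^2 / sum(svd1$d^2), xlab = "Component", ylab = "Prop. of variance explained", pch = 19)
```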
Uncompleted