A reference card of useful R and data-analysis snippets. I used the Rmd extension, which RStudio compiles into a didactic visual sheet showing the outcome of each command.
Data analysis with R -- It's better and better now
==================================================
A nice R resource, the "cookbook for R": http://wiki.stdout.org/rcookbook/
Installing
----------
```{r installing}
# For Linux: install a package from CRAN so it is available to all users
sudo R # (run in a shell) start R as root
install.packages("mypkg", lib = "/my/own/R-packages/")
# For a single user
R # Or in the RStudio R console
install.packages("mypkg", lib = "/my/own/R-packages/")
# (the second argument is optional)
```
Getting help
------------
First of all:
```{r help}
help.search("rnorm") # Search the help pages for occurrences of "rnorm"
args(rnorm)          # Get the arguments of rnorm
```
Then Google, and last, post a useful question on the forums. If it's about an R command:
1. What steps will reproduce the problem?
2. What is the expected output?
3. What do you see instead?
4. What version of the product (e.g. R, packages) are you using? What O.S.?
If it's about data analysis (more specialised places than Stack Overflow):
1. What is the question you are trying to answer?
2. What steps/tools did you use to answer it? (include code)
3. What did you expect to see? What do you see instead?
4. What other solutions have you considered?
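For point 4 of the first list, R can print the version information for you; a quick sketch:

```{r session_info}
# Version info to paste into a question
R.version.string  # e.g. "R version 4.x.x (...)"
sessionInfo()     # R version, OS and loaded package versions
```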
For reproducibility, which is not only important for getting help:
```{r reproducibility}
set.seed(1234)
```
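As a minimal check, re-setting the same seed reproduces the same random draws:

```{r seed_check}
set.seed(1234)
a <- rnorm(5)
set.seed(1234)  # Same seed again...
b <- rnorm(5)   # ...so the same draws come out
identical(a, b) # TRUE
```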
When you need to debug:
```{r debugging}
traceback()             # Stack trace after an error
stop("Why did I stop?") # Assertions, reporting fatal errors
```
More on that at http://stackoverflow.com/questions/4442518/general-suggestions-for-debugging-r
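Errors can also be caught and inspected with tryCatch() (a minimal sketch, beyond the commands listed above):

```{r trycatch}
msg <- tryCatch({
  stop("Why did I stop?") # The error raised here...
}, error = function(e) {
  conditionMessage(e)     # ...is caught and its message returned
})
msg # "Why did I stop?"
```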
Useful R stuff
--------------
```{r useful}
# Build a matrix (filled column by column)
matrix(1:20, nrow = 5)
#      [,1] [,2] [,3] [,4]
# [1,]    1    6   11   16
# [2,]    2    7   12   17
# [3,]    3    8   13   18
# [4,]    4    9   14   19
# [5,]    5   10   15   20
# Coerce a character vector to numeric
as.numeric(c("1", "7", "9"))
# Get unique values (distinct values)
# http://stackoverflow.com/questions/7755240/list-distinct-values-in-a-vector-in-r
x <- c(1, 1, 2, 3, 4, 4, 4)
unique(x)
# [1] 1 2 3 4
# Another way:
tab <- table(x)
xu <- as.numeric(names(tab)) # The unique values
xn <- as.vector(tab)         # How many times each one appears
xu; xn
# Check if a vector contains an element
# http://stackoverflow.com/questions/1169248/r-function-for-testing-if-a-vector-contains-a-given-element
v <- c('a', 'b', 'c', 'e')
'b' %in% v    # Returns TRUE
match('b', v) # Returns the first location of 'b', in this case: 2
v <- c('a', 'b', 'c')
any(v == 'b')
# [1] TRUE
# Removing missing values from a vector:
x <- c(1, 2, NA, 3, 4)
bad <- is.na(x)
x[!bad]
# Selecting only complete cases across several vectors
# (or across columns of a data frame / matrix)
x <- c(1, 2, NA, 3, 4)
y <- c("a", NA, "b", "c", "d")
good <- complete.cases(x, y)
x[good]
y[good]
# Combine two vectors of equal length into a data frame:
data.frame(starters = starters, enders = enders)
# Count the TRUE values in a logical vector z of TRUE, FALSE and NA:
length(which(z))
table(z)["TRUE"]
# Keep column names when adding data through rbind
# Don't forget that there are more efficient ways than using rbind!
# The following "should work", but it doesn't keep the names:
# Create an empty data.frame with the correct names and types
# !!!! the stringsAsFactors part is VERY important
df <- data.frame(A = numeric(), B = character(), C = character(), stringsAsFactors = FALSE)
rbind(df, list(42, 'foo', 'bar'))             # Messes up names!
rbind(df, list(A = 42, B = 'foo', C = 'bar')) # OK...
# If there is at least one row, names are kept...
df <- data.frame(A = 0, B = "", C = "", stringsAsFactors = FALSE)
rbind(df, list(42, 'foo', 'bar')) # Names work now...
# Concatenate two vectors
c(vector1, vector2)
```
Summarising data
----------------
Before using it, data should be examined and cleaned (missing values, incorrect or inconsistent values, columns not parsed correctly, etc.). That is to say, data needs to be summarised. *Look at the data!*
```{r summarising}
dim(data)                 # First look at the dimensions. Shows: [1] NB_ROWS NB_COLUMNS
nrow(data); ncol(data)    # Another way to see row and column counts
names(data)               # Find out the variable names (column names)
quantile(data$col1)       # See the ranges (identify any value outside of the normal range)
summary(data)             # For each variable, the number of occurrences and other info
sapply(data[1, ], class)  # Apply class() to each element of the first row (find out the class of each column)
unique(data$col2)         # Are the values shown the ones expected for this variable?
length(unique(data$col2)) # Is the count of unique values something we expect?
table(data$col2)          # How many occurrences of each of the unique values
table(data$col1, data$col2) # 2-dimensional table for each pair of values
table(c(0, 1, 2, NA, 3, 3), useNA = "ifany") # Show missing values (hidden by default)
any(data$col1[1:10] > 40) # Tells me whether ANY of the selected rows is greater than 40
all(data$col1[1:10] > 40) # Tells me whether ALL are greater than 40
sum(is.na(reviews$time_left[1:10])) # How many missing values?
# Looking at subsets
data[data$lat > 0 & data$lon > 0, c("lat", "lon")] # AND: make sure you use a single &
data[data$lat > 0 | data$lon > 0, c("lat", "lon")] # OR: make sure you use a single |
# Means and sums of rows/columns
rowSums(); rowMeans(); colSums(); colMeans()
colMeans(reviews, na.rm = TRUE) # The mean of ALL columns, ignoring NAs
```
Data munging or wrangling
-------------------------
Data munging or data wrangling is loosely the process of manually converting or mapping data from one "raw" form into another format that allows for more convenient consumption of the data with the help of semi-automated tools.
Mung or munge is computer jargon for "to make repeated changes which individually may be reversible, yet which ultimately result in an unintentional, irreversible destruction of large portions of the original item".
```{r munging}
# Fixing character vectors
tolower(names(cameraData)) # Or toupper()
strsplit(names(cameraData), "\\.") # Uses two backslashes because . is a special regex char
sapply(strsplit(names(cameraData), "\\."), function(x) { x[1] }) # Split & take first
sub("_", "", names(reviews))  # Substitution. BUT: only the first instance!
gsub("_", "", names(reviews)) # Substitution. Every instance.
# Quantitative variables in ranges
# It replaces each of the values with the range it falls into
timeRanges <- cut(reviews$time_left, seq(0, 3600, by = 600))
table(timeRanges, useNA = "ifany")
# Another way
library(Hmisc)
timeRanges <- cut2(reviews$time_left, g = 6) # Six ranges are used
# Adding an extra variable
reviews$timeRanges <- cut2(reviews$time_left, g = 6)
# Merging data, when variables have common names. We must choose what to merge _BY_
merge(reviews, solutions, all = TRUE) # By default, merges by all common column names
merge(reviews, solutions, by.x = "solution_id", by.y = "id", all = TRUE) # This way it will try to merge solution ids with ids
# Sorting values
sort(data$reviewer_id)  # A column or single vector
order(data$reviewer_id) # Returns a vector with the ordered positions (the indexes)
data[order(data$reviewer_id), ] # Use those ordering indexes to order the whole data frame
data[order(data$reviewer_id, data$id), ] # Use two criteria for ordering
# Reshaping data (melt comes from the reshape2 package)
library(reshape2)
melt(misShaped, id.vars = "people", variable.name = "treatment", value.name = "value")
```
Exploratory graphics
--------------------
```{r boxplots}
boxplot(pData$AGEP, col = "blue")
# Show me the distribution of AGEP broken down by DDRS
# I will see as many boxes as DDRS has distinct values
# That is useful to see the distribution of one variable depending on
# the different values another variable can take
boxplot(pData$AGEP ~ as.factor(pData$DDRS), col = "blue")
# I can also use x-axis names (groups), colors and variable widths.
# The width of each box is proportional to the number of observations in that group
boxplot(pData$AGEP ~ as.factor(pData$DDRS), col = c("blue", "orange"), names = c("yes", "no"), varwidth = TRUE)
```
```{r barplots}
# Number of observations of each distinct value
barplot(table(pData$CIT), col = "blue")
```
```{r histograms}
# Histograms are in the frequency scale
# Take all values, slice the range and count how many fall into each slice
hist(pData$AGEP, col = "blue")
# The number of breakpoints can be set
hist(pData$AGEP, col = "blue", breaks = 100, main = "Age")
```
```{r density}
# It looks like a smoothed histogram
# It has the same shape, but some error appears at the boundaries
dens <- density(pData$AGEP)
plot(dens, lwd = 3, col = "blue")
# The reason to use densities instead of histograms is that several densities can be overlaid
dens <- density(pData$AGEP)
densMales <- density(pData$AGEP[which(pData$SEX == 1)])
plot(dens, lwd = 3, col = "blue")
lines(densMales, lwd = 3, col = "orange") # Adds more lines to the plot
```
```{r scatterplots}
# pch is the point character type, cex is the size of the points
# You can see particular problems in scatterplots
plot(pData$JWMNP, pData$WAGP, pch = 19, cex = 0.5, col = "blue")
# Use color to divide by sex (in this case pData$SEX is either 1 or 2)
plot(pData$JWMNP, pData$WAGP, pch = 19, cex = 0.5, col = pData$SEX)
# Using size
percentMaxAge <- pData$AGEP / max(pData$AGEP)
plot(pData$JWMNP, pData$WAGP, pch = 19, cex = percentMaxAge * 0.5, col = "blue")
# Overlaying lines/points
plot(pData$JWMNP, pData$WAGP, pch = 19, cex = 0.5, col = "blue")
lines(rep(100, dim(pData)[1]), pData$WAGP, col = "grey", lwd = 5) # rep repeats a value
points(seq(0, 200, length = 100), seq(0, 20e5, length = 100), col = "red", pch = 19)
# Numeric variables as factors
library(Hmisc)
# cut2 takes a continuous variable, breaks it up into 5 equal intervals,
# then assigns each point to one of those intervals as a factor
ageGroups <- cut2(pData$AGEP, g = 5)
# We color the plot according to the 5 age groups created
plot(pData$JWMNP, pData$WAGP, pch = 19, col = ageGroups, cex = 0.5)
# If you have a lot of points...
x <- rnorm(1e5)
y <- rnorm(1e5)
plot(x, y, pch = 19)
# Sample the values!
sampledValues <- sample(1:1e5, size = 1000, replace = FALSE)
plot(x[sampledValues], y[sampledValues], pch = 19)
# Use smoothScatter!
# (changes color according to density)
smoothScatter(x, y)
# Use the hexbin package!
library(hexbin)
hbo <- hexbin(x, y)
plot(hbo) # Color indicates how many points fall in each bin
# You could also use transparency (not in Jeff Leek's video)
```
```{r qqplots}
# Quantile-quantile plots
# Useful for comparing the distributions of two variables
x <- rnorm(20); y <- rnorm(20)
qqplot(x, y)
# We can also plot a line
abline(c(0, 1))
# If you want to see whether your data are normally distributed,
# you can make a QQ-plot of normal data against your data
```
```{r matplot_and_spaghetti}
X <- matrix(rnorm(20 * 5), nrow = 20)
# This plots each column of the matrix as one line
matplot(X, type = "b")
```
```{r heatmaps_image}
# A sort of 2D histogram
image(1:10, 161:236, as.matrix(pData[1:10, 161:236]))
# image() transposes the data, so we need to transpose it first
newMatrix <- as.matrix(pData[1:10, 161:236])
newMatrix <- t(newMatrix)[, nrow(newMatrix):1]
image(161:236, 1:10, newMatrix)
```
```{r maps}
library(maps)
map("world") # Get a map of the world
lat <- runif(40, -90, 90); lon <- runif(40, -180, 180) # Random lat and lon values
points(lon, lat, col = "blue", pch = 19) # x is longitude, y is latitude
```
```{r missing_values}
# R does not plot a point when any of its coordinates is missing
x <- c(NA, NA, NA, 4, 5, 6, 7, 8, 9, 10)
y <- 1:10
plot(x, y, pch = 19, xlim = c(0, 11), ylim = c(0, 11)) # First three not shown
# See the distribution of the NAs
x <- rnorm(100)
y <- rnorm(100)
y[x < 0] <- NA
boxplot(x ~ is.na(y)) # You can see where the NA values may be located
```
Expository graphics
-------------------
The idea is to choose a few among the exploratory graphics, based on the information they convey. After that, titles, axes, labels and colors must be fine-tuned in order to achieve the desired effects. A legend is sometimes important too.
```{r titles_legends}
# Axes
plot(pData$JWMNP, pData$WAGP, pch = 19, col = "blue", cex = 0.5, xlab = "Travel time (min)", ylab = "Last 12 months wages (dollars)")
# To change the axis or label size use cex.lab = 2 or cex.axis = 1.5
# Legends, single variable
legend(100, 20000, legend = "All surveyed", col = "blue", pch = 19, cex = 0.5)
# Legends, several variables
plot(pData$JWMNP, pData$WAGP, pch = 19, col = pData$SEX, cex = 0.5, xlab = "TT (min)", ylab = "Wages (dollars)")
legend(100, 20000, legend = c("men", "women"), col = c("black", "red"), pch = c(19, 19), cex = c(0.5, 0.5))
# For a title, use main = "title"
```
Using multiple panels:
```{r multiple_panels}
par(mfrow = c(1, 2)) # Panel layout: 1 row, 2 columns
# ... and then plot, e.g.:
# hist(...)
# plot(...)
# Add text with the mtext command
mtext(text = "(a)", side = 3, line = 1)
```
Adding a figure caption
-----------------------
*Figure 1. Bolded text which describes the purpose of the whole plot.* If the figure has different components, use (a), (b), etc. to describe each of them. These captions should be enough for people to understand the graphs without reading the whole text.
Saving graphics in different formats
------------------------------------
PDF:
```{r pdf}
pdf(file = "twoPanel.pdf", height = 4, width = 8) # Height and width in inches
# Draw plot...
dev.off() # Close the device (the file is saved)
```
PNG:
```{r png}
png(file = "twoPanel.png", height = 480, width = 600) # Height and width in pixels
# Draw plot...
dev.off() # Close the device (the file is saved)
```
Sometimes it's better to fine-tune the figure interactively and then save it as PDF. You can do that by plotting on the screen and then:
```{r copy2pdf}
dev.copy2pdf(file = "twoPanel.pdf")
```
Hierarchical clustering
-----------------------
Find the two observations that are closest together, then merge them, and repeat the process at a higher level, and so on.
How do we define "close"? It is very important to define an appropriate distance measure.
continuous vars -> Euclidean distance
binary vars -> Manhattan distance
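The two measures can be compared directly with dist(); a small sketch with two points:

```{r distance_measures}
p <- rbind(c(0, 0), c(3, 4))  # Two points in the plane
dist(p)                       # Euclidean: sqrt(3^2 + 4^2) = 5
dist(p, method = "manhattan") # Manhattan: |3| + |4| = 7
```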
```{r clustering_example}
set.seed(1234); par(mar = c(0, 0, 0, 0))
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
plot(x, y, col = "blue", pch = 19, cex = 2)
text(x + 0.05, y + 0.05, labels = as.character(1:12))
# Calculate distances
dataFrame <- data.frame(x = x, y = y)
distxy <- dist(dataFrame) # The default distance is Euclidean, but others can be specified (e.g. Manhattan)
hClustering <- hclust(distxy)
plot(hClustering) # It generates a 'dendrogram', or tree
```
We can cut the tree at a certain height. Choosing where to do it determines which groups we get.
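The cut itself is done with cutree(), either by number of groups (k) or by height (h); a sketch re-simulating the x and y from the example above:

```{r cutree}
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
hClustering <- hclust(dist(data.frame(x = x, y = y)))
groups <- cutree(hClustering, k = 3) # One group label (1..3) per point
table(groups)                        # Group sizes
```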
And heatmaps (hclust's default linkage is "complete"):
```{r heatmaps_hclust}
dataFrame <- data.frame(x = x, y = y)
set.seed(143)
dataMatrix <- as.matrix(dataFrame)[sample(1:12), ]
heatmap(dataMatrix)
```
K-means clustering
------------------
A partitioning approach: fix the number of clusters in advance and calculate centroids.
```{r kmeans}
dataFrame <- data.frame(x, y)
kmeansObj <- kmeans(dataFrame, centers = 3) # The number of iterations and the initial centers can also be given
names(kmeansObj)
kmeansObj$cluster # Tells which cluster each point belongs to
par(mar = rep(0.2, 4))
plot(x, y, col = kmeansObj$cluster, pch = 19, cex = 2)
points(kmeansObj$centers, col = 1:3, pch = 3, cex = 3, lwd = 3)
```
And heatmaps
```{r heatmaps2}
set.seed(143)
dataMatrix <- as.matrix(dataFrame)[sample(1:12), ]
kmeansObj2 <- kmeans(dataMatrix, centers = 3)
par(mfrow = c(1, 2), mar = rep(0.2, 4))
image(t(dataMatrix)[, nrow(dataMatrix):1], yaxt = "n") # Scrambled version
image(t(dataMatrix)[, order(kmeansObj2$cluster)], yaxt = "n") # Ordered by cluster
```
Dimension reduction
-------------------
Matrix data
```{r matrix_data}
set.seed(12345); par(mar = rep(0.2, 4))
dataMatrix <- matrix(rnorm(400), nrow = 40)
image(1:10, 1:40, t(dataMatrix)[, nrow(dataMatrix):1])
```
We can also cluster it
```{r matrix_clustered}
par(mar = rep(0.2, 4))
heatmap(dataMatrix)
```
If there was a pattern, we would see it in the clustering.
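To see what that looks like, a pattern can be added to the random matrix (a sketch; the column range and the +3 shift are arbitrary choices):

```{r matrix_with_pattern}
set.seed(12345)
dataMatrix <- matrix(rnorm(400), nrow = 40)
shifted <- sample(1:40, size = 20)                         # Pick half of the rows at random
dataMatrix[shifted, 6:10] <- dataMatrix[shifted, 6:10] + 3 # Shift the mean of half the columns
heatmap(dataMatrix) # The shifted rows and columns now cluster together
```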
Uncompleted