A reference card of useful R and data-analysis snippets. I used the Rmd extension, which RStudio compiles into a didactic visual sheet showing the outcome of each command.
Data analysis with R -- It's better and better now
==================================================
A nice R resource, the "cookbook for R": http://wiki.stdout.org/rcookbook/
Installing
----------
```{r installing}
# For Linux: install a package from CRAN so it is available to all users
sudo R # (run in a shell) start R as root
install.packages("mypkg", lib = "/my/own/R-packages/")
# For a single user
R # Or in the RStudio R console
install.packages("mypkg", lib = "/my/own/R-packages/")
# (the second argument is optional)
```
Getting help
------------
First of all:
```{r help}
help.search("rnorm") # Search the help pages for occurrences of "rnorm"
args(rnorm)          # Get the arguments of rnorm
```
Then Google, and last, post a useful question on the forums. If it's about an R command:
1. What steps will reproduce the problem?
2. What is the expected output?
3. What do you see instead?
4. What version of the product (e.g. R, packages) are you using? What O.S.?
If it's about data analysis (more specialised places than Stack Overflow):
1. What is the question you are trying to answer?
2. What steps/tools did you use to answer it? (include code)
3. What did you expect to see? What do you see instead?
4. What other solutions have you considered?
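For point 4 of the first list, R can print the version information for you; a quick sketch:

```{r session_info}
# Version info to paste into a question
R.version.string  # e.g. "R version 4.x.x (...)"
sessionInfo()     # R version, OS and loaded package versions
```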
For reproducibility, which is not only important for getting help:
```{r reproducibility}
set.seed(1234)
```
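As a minimal check, re-setting the same seed reproduces the same random draws:

```{r seed_check}
set.seed(1234)
a <- rnorm(5)
set.seed(1234)  # Same seed again...
b <- rnorm(5)   # ...so the same draws come out
identical(a, b) # TRUE
```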
When you need to debug:
```{r debugging}
traceback()             # Stack trace after an error
stop("Why did I stop?") # Assertions, reporting fatal errors
```
More on that at http://stackoverflow.com/questions/4442518/general-suggestions-for-debugging-r
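Errors can also be caught and inspected with tryCatch() (a minimal sketch, beyond the commands listed above):

```{r trycatch}
msg <- tryCatch({
  stop("Why did I stop?") # The error raised here...
}, error = function(e) {
  conditionMessage(e)     # ...is caught and its message returned
})
msg # "Why did I stop?"
```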
Useful R stuff
--------------
```{r useful}
# Build a matrix (filled column by column)
matrix(1:20, nrow = 5)
#      [,1] [,2] [,3] [,4]
# [1,]    1    6   11   16
# [2,]    2    7   12   17
# [3,]    3    8   13   18
# [4,]    4    9   14   19
# [5,]    5   10   15   20
# Coerce a character vector to numeric
as.numeric(c("1", "7", "9"))
# Get unique values (distinct values)
# http://stackoverflow.com/questions/7755240/list-distinct-values-in-a-vector-in-r
x <- c(1, 1, 2, 3, 4, 4, 4)
unique(x)
# [1] 1 2 3 4
# Another way:
tab <- table(x)
xu <- as.numeric(names(tab)) # The unique values
xn <- as.vector(tab)         # How many times each one appears
xu; xn
# Check if a vector contains an element
# http://stackoverflow.com/questions/1169248/r-function-for-testing-if-a-vector-contains-a-given-element
v <- c('a', 'b', 'c', 'e')
'b' %in% v    # Returns TRUE
match('b', v) # Returns the first location of 'b', in this case: 2
v <- c('a', 'b', 'c')
any(v == 'b')
# [1] TRUE
# Removing missing values from a vector:
x <- c(1, 2, NA, 3, 4)
bad <- is.na(x)
x[!bad]
# Selecting only complete cases across several vectors
# (or across columns of a data frame / matrix)
x <- c(1, 2, NA, 3, 4)
y <- c("a", NA, "b", "c", "d")
good <- complete.cases(x, y)
x[good]
y[good]
# Combine two vectors of equal length into a data frame:
data.frame(starters = starters, enders = enders)
# Count the TRUE values in a logical vector z of TRUE, FALSE and NA:
length(which(z))
table(z)["TRUE"]
# Keep column names when adding data through rbind
# Don't forget that there are more efficient ways than using rbind!
# The following "should work", but it doesn't keep the names:
# Create an empty data.frame with the correct names and types
# !!!! the stringsAsFactors part is VERY important
df <- data.frame(A = numeric(), B = character(), C = character(), stringsAsFactors = FALSE)
rbind(df, list(42, 'foo', 'bar'))             # Messes up names!
rbind(df, list(A = 42, B = 'foo', C = 'bar')) # OK...
# If there is at least one row, names are kept...
df <- data.frame(A = 0, B = "", C = "", stringsAsFactors = FALSE)
rbind(df, list(42, 'foo', 'bar')) # Names work now...
# Concatenate two vectors
c(vector1, vector2)
```
Summarising data
----------------
Before using it, data should be examined and cleaned (missing values, incorrect or inconsistent values, columns not parsed correctly, etc.). That is to say, data needs to be summarised. *Look at the data!*
```{r summarising}
dim(data)                 # First look at the dimensions. Shows: [1] NB_ROWS NB_COLUMNS
nrow(data); ncol(data)    # Another way to see row and column counts
names(data)               # Find out the variable names (column names)
quantile(data$col1)       # See the ranges (identify any value outside of the normal range)
summary(data)             # For each variable, the number of occurrences and other info
sapply(data[1, ], class)  # Apply class() to each element of the first row (find out the class of each column)
unique(data$col2)         # Are the values shown the ones expected for this variable?
length(unique(data$col2)) # Is the count of unique values something we expect?
table(data$col2)          # How many occurrences of each of the unique values
table(data$col1, data$col2) # 2-dimensional table for each pair of values
table(c(0, 1, 2, NA, 3, 3), useNA = "ifany") # Show missing values (hidden by default)
any(data$col1[1:10] > 40) # Tells me whether ANY of the selected rows is greater than 40
all(data$col1[1:10] > 40) # Tells me whether ALL are greater than 40
sum(is.na(reviews$time_left[1:10])) # How many missing values?
# Looking at subsets
data[data$lat > 0 & data$lon > 0, c("lat", "lon")] # AND: make sure you use a single &
data[data$lat > 0 | data$lon > 0, c("lat", "lon")] # OR: make sure you use a single |
# Means and sums of rows/columns
rowSums(); rowMeans(); colSums(); colMeans()
colMeans(reviews, na.rm = TRUE) # The mean of ALL columns, ignoring NAs
```
Data munging or wrangling
-------------------------
Data munging or data wrangling is loosely the process of manually converting or mapping data from one "raw" form into another format that allows for more convenient consumption of the data with the help of semi-automated tools.
Mung or munge is computer jargon for "to make repeated changes which individually may be reversible, yet which ultimately result in an unintentional, irreversible destruction of large portions of the original item".
```{r munging}
# Fixing character vectors
tolower(names(cameraData)) # Or toupper()
strsplit(names(cameraData), "\\.") # Uses two backslashes because . is a special regex char
sapply(strsplit(names(cameraData), "\\."), function(x) { x[1] }) # Split & take first
sub("_", "", names(reviews))  # Substitution. BUT: only the first instance!
gsub("_", "", names(reviews)) # Substitution. Every instance.
# Quantitative variables in ranges
# It replaces each of the values with the range it falls into
timeRanges <- cut(reviews$time_left, seq(0, 3600, by = 600))
table(timeRanges, useNA = "ifany")
# Another way
library(Hmisc)
timeRanges <- cut2(reviews$time_left, g = 6) # Six ranges are used
# Adding an extra variable
reviews$timeRanges <- cut2(reviews$time_left, g = 6)
# Merging data, when variables have common names. We must choose what to merge _BY_
merge(reviews, solutions, all = TRUE) # By default, merges by all common column names
merge(reviews, solutions, by.x = "solution_id", by.y = "id", all = TRUE) # This way it will try to merge solution ids with ids
# Sorting values
sort(data$reviewer_id)  # A column or single vector
order(data$reviewer_id) # Returns a vector with the ordered positions (the indexes)
data[order(data$reviewer_id), ] # Use those ordering indexes to order the whole data frame
data[order(data$reviewer_id, data$id), ] # Use two criteria for ordering
# Reshaping data (melt comes from the reshape2 package)
library(reshape2)
melt(misShaped, id.vars = "people", variable.name = "treatment", value.name = "value")
```
Exploratory graphics
--------------------
```{r boxplots}
boxplot(pData$AGEP, col = "blue")
# Show me the distribution of AGEP broken down by DDRS
# I will see as many boxes as DDRS has distinct values
# That is useful to see the distribution of one variable depending on
# the different values another variable can take
boxplot(pData$AGEP ~ as.factor(pData$DDRS), col = "blue")
# I can also use x-axis names (groups), colors and variable widths.
# The width of each box is proportional to the number of observations in that group
boxplot(pData$AGEP ~ as.factor(pData$DDRS), col = c("blue", "orange"), names = c("yes", "no"), varwidth = TRUE)
```
```{r barplots}
# Number of observations of each distinct value
barplot(table(pData$CIT), col = "blue")
```
```{r histograms}
# Histograms are in the frequency scale
# Take all values, slice the range and count how many fall into each slice
hist(pData$AGEP, col = "blue")
# The number of breakpoints can be set
hist(pData$AGEP, col = "blue", breaks = 100, main = "Age")
```
```{r density}
# It looks like a smoothed histogram
# It has the same shape, but some error appears at the boundaries
dens <- density(pData$AGEP)
plot(dens, lwd = 3, col = "blue")
# The reason to use densities instead of histograms is that several densities can be overlaid
dens <- density(pData$AGEP)
densMales <- density(pData$AGEP[which(pData$SEX == 1)])
plot(dens, lwd = 3, col = "blue")
lines(densMales, lwd = 3, col = "orange") # Adds more lines to the plot
```
```{r scatterplots}
# pch is the point character type, cex is the size of the points
# You can see particular problems in scatterplots
plot(pData$JWMNP, pData$WAGP, pch = 19, cex = 0.5, col = "blue")
# Use color to divide by sex (in this case pData$SEX is either 1 or 2)
plot(pData$JWMNP, pData$WAGP, pch = 19, cex = 0.5, col = pData$SEX)
# Using size
percentMaxAge <- pData$AGEP / max(pData$AGEP)
plot(pData$JWMNP, pData$WAGP, pch = 19, cex = percentMaxAge * 0.5, col = "blue")
# Overlaying lines/points
plot(pData$JWMNP, pData$WAGP, pch = 19, cex = 0.5, col = "blue")
lines(rep(100, dim(pData)[1]), pData$WAGP, col = "grey", lwd = 5) # rep repeats a value
points(seq(0, 200, length = 100), seq(0, 20e5, length = 100), col = "red", pch = 19)
# Numeric variables as factors
library(Hmisc)
# cut2 takes a continuous variable, breaks it up into 5 equal intervals,
# then assigns each point to one of those intervals as a factor
ageGroups <- cut2(pData$AGEP, g = 5)
# We color the plot according to the 5 age groups created
plot(pData$JWMNP, pData$WAGP, pch = 19, col = ageGroups, cex = 0.5)
# If you have a lot of points...
x <- rnorm(1e5)
y <- rnorm(1e5)
plot(x, y, pch = 19)
# Sample the values!
sampledValues <- sample(1:1e5, size = 1000, replace = FALSE)
plot(x[sampledValues], y[sampledValues], pch = 19)
# Use smoothScatter!
# (changes color according to density)
smoothScatter(x, y)
# Use the hexbin package!
library(hexbin)
hbo <- hexbin(x, y)
plot(hbo) # Color indicates how many points fall in each bin
# You could also use transparency (not in Jeff Leek's video)
```
```{r qqplots}
# Quantile-quantile plots
# Useful for comparing the distributions of two variables
x <- rnorm(20); y <- rnorm(20)
qqplot(x, y)
# We can also plot a line
abline(c(0, 1))
# If you want to see whether your data are normally distributed,
# you can make a QQ-plot of normal data against your data
```
```{r matplot_and_spaghetti}
X <- matrix(rnorm(20 * 5), nrow = 20)
# This plots each column of the matrix as one line
matplot(X, type = "b")
```
```{r heatmaps_image}
# A sort of 2D histogram
image(1:10, 161:236, as.matrix(pData[1:10, 161:236]))
# image() transposes the data, so we need to transpose it first
newMatrix <- as.matrix(pData[1:10, 161:236])
newMatrix <- t(newMatrix)[, nrow(newMatrix):1]
image(161:236, 1:10, newMatrix)
```
```{r maps}
library(maps)
map("world") # Get a map of the world
lat <- runif(40, -90, 90); lon <- runif(40, -180, 180) # Random lat and lon values
points(lon, lat, col = "blue", pch = 19) # x is longitude, y is latitude
```
```{r missing_values}
# R does not plot a point when any of its coordinates is missing
x <- c(NA, NA, NA, 4, 5, 6, 7, 8, 9, 10)
y <- 1:10
plot(x, y, pch = 19, xlim = c(0, 11), ylim = c(0, 11)) # First three not shown
# See the distribution of the NAs
x <- rnorm(100)
y <- rnorm(100)
y[x < 0] <- NA
boxplot(x ~ is.na(y)) # You can see where the NA values may be located
```
Expository graphics
-------------------
The idea is to choose a few among the exploratory graphics, based on the information they convey. After that, titles, axes, labels and colors must be fine-tuned in order to achieve the desired effects. A legend is sometimes important too.
```{r titles_legends}
# Axes
plot(pData$JWMNP, pData$WAGP, pch = 19, col = "blue", cex = 0.5, xlab = "Travel time (min)", ylab = "Last 12 months wages (dollars)")
# To change the axis or label size use cex.lab = 2 or cex.axis = 1.5
# Legends, single variable
legend(100, 20000, legend = "All surveyed", col = "blue", pch = 19, cex = 0.5)
# Legends, several variables
plot(pData$JWMNP, pData$WAGP, pch = 19, col = pData$SEX, cex = 0.5, xlab = "TT (min)", ylab = "Wages (dollars)")
legend(100, 20000, legend = c("men", "women"), col = c("black", "red"), pch = c(19, 19), cex = c(0.5, 0.5))
# For a title, use main = "title"
```
Using multiple panels:
```{r multiple_panels}
par(mfrow = c(1, 2)) # Panel layout: 1 row, 2 columns
# ... and then plot, e.g.:
# hist(...)
# plot(...)
# Add text with the mtext command
mtext(text = "(a)", side = 3, line = 1)
```
Adding a figure caption
-----------------------
*Figure 1. Bolded text which describes the purpose of the whole plot.* If the figure has different components, use (a), (b), etc. to describe each of them. These captions should be enough for people to understand the graphs without reading the whole text.
Saving graphics in different formats
------------------------------------
PDF:
```{r pdf}
pdf(file = "twoPanel.pdf", height = 4, width = 8) # Height and width in inches
# Draw plot...
dev.off() # Close the device (the file is saved)
```
PNG:
```{r png}
png(file = "twoPanel.png", height = 480, width = 600) # Height and width in pixels
# Draw plot...
dev.off() # Close the device (the file is saved)
```
Sometimes it's better to fine-tune the figure interactively and then save it as PDF. You can do that by plotting on the screen and then:
```{r copy2pdf}
dev.copy2pdf(file = "twoPanel.pdf")
```
Hierarchical clustering
-----------------------
Find the two observations that are closest together, then merge them, and repeat the process at a higher level, and so on.
How do we define "close"? It is very important to define an appropriate distance measure.
continuous vars -> Euclidean distance
binary vars -> Manhattan distance
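The two measures can be compared directly with dist(); a small sketch with two points:

```{r distance_measures}
p <- rbind(c(0, 0), c(3, 4))  # Two points in the plane
dist(p)                       # Euclidean: sqrt(3^2 + 4^2) = 5
dist(p, method = "manhattan") # Manhattan: |3| + |4| = 7
```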
```{r clustering_example}
set.seed(1234); par(mar = c(0, 0, 0, 0))
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
plot(x, y, col = "blue", pch = 19, cex = 2)
text(x + 0.05, y + 0.05, labels = as.character(1:12))
# Calculate distances
dataFrame <- data.frame(x = x, y = y)
distxy <- dist(dataFrame) # The default distance is Euclidean, but others can be specified (e.g. Manhattan)
hClustering <- hclust(distxy)
plot(hClustering) # It generates a 'dendrogram', or tree
```
We can cut the tree at a certain height. Choosing where to do it determines which groups we get.
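The cut itself is done with cutree(), either by number of groups (k) or by height (h); a sketch re-simulating the x and y from the example above:

```{r cutree}
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
hClustering <- hclust(dist(data.frame(x = x, y = y)))
groups <- cutree(hClustering, k = 3) # One group label (1..3) per point
table(groups)                        # Group sizes
```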
And heatmaps (hclust's default linkage is "complete"):
```{r heatmaps_hclust}
dataFrame <- data.frame(x = x, y = y)
set.seed(143)
dataMatrix <- as.matrix(dataFrame)[sample(1:12), ]
heatmap(dataMatrix)
```
K-means clustering
------------------
A partitioning approach: fix the number of clusters in advance and calculate centroids.
```{r kmeans}
dataFrame <- data.frame(x, y)
kmeansObj <- kmeans(dataFrame, centers = 3) # The number of iterations and the initial centers can also be given
names(kmeansObj)
kmeansObj$cluster # Tells which cluster each point belongs to
par(mar = rep(0.2, 4))
plot(x, y, col = kmeansObj$cluster, pch = 19, cex = 2)
points(kmeansObj$centers, col = 1:3, pch = 3, cex = 3, lwd = 3)
```
And heatmaps
```{r heatmaps2}
set.seed(143)
dataMatrix <- as.matrix(dataFrame)[sample(1:12), ]
kmeansObj2 <- kmeans(dataMatrix, centers = 3)
par(mfrow = c(1, 2), mar = rep(0.2, 4))
image(t(dataMatrix)[, nrow(dataMatrix):1], yaxt = "n") # Scrambled version
image(t(dataMatrix)[, order(kmeansObj2$cluster)], yaxt = "n") # Ordered by cluster
```
Dimension reduction
-------------------
Matrix data
```{r matrix_data}
set.seed(12345); par(mar = rep(0.2, 4))
dataMatrix <- matrix(rnorm(400), nrow = 40)
image(1:10, 1:40, t(dataMatrix)[, nrow(dataMatrix):1])
```
We can also cluster it
```{r matrix_clustered}
par(mar = rep(0.2, 4))
heatmap(dataMatrix)
```
If there was a pattern, we would see it in the clustering.
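To see what that looks like, a pattern can be added to the random matrix (a sketch; the column range and the +3 shift are arbitrary choices):

```{r matrix_with_pattern}
set.seed(12345)
dataMatrix <- matrix(rnorm(400), nrow = 40)
shifted <- sample(1:40, size = 20)                         # Pick half of the rows at random
dataMatrix[shifted, 6:10] <- dataMatrix[shifted, 6:10] + 3 # Shift the mean of half the columns
heatmap(dataMatrix) # The shifted rows and columns now cluster together
```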
Uncompleted