@nikolay-shenkov
Last active August 29, 2015 14:18
How to plot a categorical variable against a continuous (numerical) one and deal with overplotting.
library(ggplot2)
# some background about the data
# http://www.diamondse.info/
?diamonds
summary(diamonds)
str(diamonds)
is.factor(diamonds$clarity)
is.factor(diamonds$color)
# clarity and color are already factor variables
# In many cases you will need to convert categorical variables to factors
# when you read in your own dataset.
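# A minimal sketch of that conversion, using a small made-up data frame.
# (my_data and its columns are hypothetical and not part of the diamonds data.)
my_data <- data.frame(grade = c("low", "high", "mid", "low"),
                      value = c(1.2, 3.4, 2.2, 0.9),
                      stringsAsFactors = FALSE)
# Passing levels explicitly fixes the order used on the plot axis;
# without it, factor() sorts the levels alphabetically.
my_data$grade <- factor(my_data$grade, levels = c("low", "mid", "high"))
is.factor(my_data$grade)
levels(my_data$grade)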
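# A plain scatterplot: all points of one color fall on a single vertical
# line, so most of them are hidden behind each other (overplotting)
ggplot(aes(x=color, y=price), data=diamonds) + geom_point()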
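# let's try jittering: spread the points horizontally within each category
?geom_jitter
# width=0.5 jitters by up to 0.5 in each direction, so neighbouring
# categories touch
ggplot(aes(x=color, y=price), data=diamonds) +
  geom_jitter(position=position_jitter(width=0.5, height=0))
# a narrower jitter plus transparency, so that dense regions appear darker
ggplot(aes(x=color, y=price), data=diamonds) +
  geom_jitter(position=position_jitter(width=0.3, height=0), alpha=0.1)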
# Even with transparency the plot is crowded, so we may need to subsample:
# draw about 3000 observations at random (without replacement)
sample_data <- function(sample_size=3000) {
  index <- sample(nrow(diamonds), sample_size)
  return(diamonds[index, ])
}
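# Note that sample() draws a different subset on every call, so each plot
# that takes a fresh sample shows different points. One possible way to make
# a draw repeatable (an addition here, not something the gist itself does)
# is to fix the random seed first. The seed value 42 is arbitrary.
library(ggplot2)  # already loaded above; repeated so this snippet runs on its own
set.seed(42)
first_draw <- sample(nrow(diamonds), 5)
set.seed(42)
second_draw <- sample(nrow(diamonds), 5)
# the same seed gives the same row indices
identical(first_draw, second_draw)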
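# use a sample of the data and add box plots
# outlier.shape=NA hides the box-plot outliers, which would otherwise be
# drawn twice (the jitter layer already shows every sampled point)
ggplot(aes(x=color, y=price), data=sample_data()) +
  geom_boxplot(outlier.shape=NA) +
  geom_jitter(position=position_jitter(width=0.3, height=0), alpha=0.4,
              color="steelblue2")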
# we will add a third variable, clarity, mapped to the point color
ggplot(aes(x=color, y=price, color=clarity), data=sample_data()) +
  geom_jitter(position=position_jitter(width=0.3, height=0), alpha=0.8)
# Instead of using color to distinguish the different levels of
# clarity, we can use facets.
# We need a larger sample so that each facet of the plot
# still has enough data.
ggplot(aes(x=color, y=price), data=sample_data(5000)) +
  geom_boxplot(outlier.shape=NA, color="blue") +
  geom_jitter(position=position_jitter(width=0.3, height=0), alpha=0.2) +
  facet_wrap(~clarity)
# Not a perfect plot: in what ways could we improve it?
# It shows the trade-off between drawing box plots (a compact summary)
# and drawing the individual points (the full distribution).
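# One possible direction, offered only as a sketch and not as the author's
# answer: replace the jittered points with violins, which show the shape of
# each distribution without drawing individual points, so the full data set
# can be used instead of a sample. The log scale on price and the narrow box
# plot inside each violin are further assumptions of this sketch.
# Groups with fewer than two observations get no violin (ggplot2 warns).
library(ggplot2)  # already loaded above; repeated so this snippet runs on its own
violin_plot <- ggplot(aes(x=color, y=price), data=diamonds) +
  geom_violin(fill="steelblue2", alpha=0.4) +
  geom_boxplot(width=0.15, outlier.shape=NA) +
  scale_y_log10() +
  facet_wrap(~clarity)
violin_plot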