Skip to content

Instantly share code, notes, and snippets.

@zross
Last active November 8, 2016 14:56
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save zross/84ef7f61a1ba87421dd95e8b292cd804 to your computer and use it in GitHub Desktop.
Save zross/84ef7f61a1ba87421dd95e8b292cd804 to your computer and use it in GitHub Desktop.
# **********************************************************************************
# Data Science with R, Part 2 - Predictive modeling and machine learning: Exercise 1
# **********************************************************************************
# This exercise is designed to remind you of key R
# concepts and code that you will need to be familiar
# with before participating in Part 2 of the R workshop.
# Please feel free to use Google, stackoverflow etc to help
# you answer questions, treat them as if they were real-world
# challenges
# Answer the questions below. Your answers will not be reviewed
# by me. I recommend you write your answer below the question.
# load the package with this
library(datascienceRworkshop2)
# ***********************************************
# ----- R refresher: Question 1 ----
# ***********************************************
# How do you load packages? Please load dplyr, tidyr,
# ggplot2 and mlbench
# ***** Answer below **********
# ***********************************************
# ----- R refresher: Question 2 ----
# ***********************************************
# Please load the BostonHousing2 dataset (not
# BostonHousing).Then review the documentation
# on this set. Please note that this is tract-
# level data from the 1970 Census! Not recent.
data("BostonHousing2")
?BostonHousing2
# To reduce typing let's rename with
bh2 <- BostonHousing2
# A reminder that if you accidentally damage the dataset
# you can reload by rerunning:
data("BostonHousing2")
bh2 <- BostonHousing2
# What function might you use to look at the first few
# records. What function would tell you the field types?
# Are all of the fields numeric?
# You can use the Part 1 cheat sheet if needed. Open it
# with:
cheat_sheet("part1")
# ***** Answer below **********
# ***********************************************
# ----- R refresher: Question 3 ----
# ***********************************************
# How would you count how many observations there
# are per town. You have several options here.
# ***** Answer below **********
# ***********************************************
# ----- R refresher: Question 4 ----
# ***********************************************
# How would you use base functions to create a subset
# with only those records from the town Belmont? Call
# this new table belmont.
# ***** Answer below **********
# ***********************************************
# ----- R refresher: Question 5 ----
# ***********************************************
# In all questions related to median housing values
# you should use the corrected variable (cmedv)
# and not the original variable (medv)
# You will need to know how to use dplyr and tidyr
# in Part 2. If these are relatively new to you
# you might review my blog post on the
# topic here: http://bit.ly/17HltG8
# Using dplyr on the bh2 set (not your subset)
# compute the average crime rate (variable crim)
# and average median home value (cmedv) by town.
# Call the new table avgvals and call the new
# variables avg.crim and avg.cmedv
# What is the average crime rate for Arlington?
# How about for Westwood?
# ***** Answer below **********
# ***********************************************
# ----- R refresher: Question 6 ----
# *************************************************************
# How would you do a quick histogram (using base R) of
# of the average crime (using your new avgvals dataset)?
# What is the minimum crime rate?
# ***** Answer below **********
# ***********************************************
# ----- R refresher: Question 7 ----
# ***********************************************
# How would you use ggplot2 to plot crime (on the x-axis)
# against median house values?
# ***** Answer below **********
# ***********************************************
# ----- R refresher: Question 8 ----
# ***********************************************
# Seems like the high values of crime might be
# making the plot harder to read. How would
# you add a log scale to the crime axis and
# a smooth to the ggplot? The log axis was not
# covered in Part 1 so you may need to do
# a Google search. Does there appear to be
# a trend between crime and median housing values?
# ***** Answer below **********
# ***********************************************
# ----- R refresher: Question 9 ----
# ***********************************************
# Here is pseudo-code for computing a three-level
# factor of median housing values
cut(cmedv, breaks = c(0, 20, 35, 55),
labels = c("lowHs", "medHs", "highHs")) # will give error
# How would you use dplyr to add a variable to
# bh2 with a three-level factor of median housing?
# Call the new variable cmedv_cat
# ***** Answer below **********
# ***********************************************
# ----- R refresher: Question 10 ----
# ***********************************************
# If you want to add a column with the total number of
# tracts per town you have two basic options in
# dplyr. Here is one:
# cnt <- group_by(bh2, town) %>%
# summarise(total_tracts = n())
# bh2 <- inner_join(bh2, cnt, by = "town")
# But you can simplify this and skip the join
# by just replacing the summarise function with
# mutate
# Do this, replace summarise with mutate in the
# above code, does it add a new column to your
# dataset as expected?
# ***********************************************
# ----- R refresher: Question 11 ----
# ***********************************************
# This one is very difficult (!). You might even want to
# take a little break before doing it :) But it's hard
# because it is a real-world type of coding challenge.
# I want to know what proportion of tracts in the town
# are each category of housing value. So what proportion
# of tracts are in lowHs from Arlington Ashland etc. How
# would you create a table that looks like this (where
# lowHs would be proportion of tracts in the lowest
# median housing category):
# town lowHs medHs highHs
# Arlington 0.0 1.00 0.00
# Ashland 0.0 1.00 0.00
# Bedford 0.0 1.00 0.00
# Belmont 0.0 0.38 0.62
# Beverly 0.5 0.50 0.00
# Here are the steps you might use. I would use dplyr and tidyr
# Create a new table with tract count by town and median housing
# category:
# 1. group by town, cmedv_cat and total tracts (total tracts just so output table has this variable)
# 2. summarize with n() to get count of tracts
# 3. use mutate to add a new column with newcount/total tracts. Result:
# town cmedv_cat total_tracts count proportion
# Arlington medHs 7 7 1.00
# Ashland medHs 2 2 1.00
# Bedford medHs 2 2 1.00
# Belmont medHs 8 3 0.38
# Belmont highHs 8 5 0.62
# Make into wide format
# 1. use select to drop total tracts and count (only need proportions)
# 2. use tidyr spread function to make into wide format
# town lowHs medHs highHs
# Arlington 0.0 1.00 0.00
# Ashland 0.0 1.00 0.00
# Bedford 0.0 1.00 0.00
# Belmont 0.0 0.38 0.62
# Beverly 0.5 0.50 0.00
# ***** Answer below **********
# ***********************************************
# ----- R refresher: Question 12 ----
# ***********************************************
# Single and double brackets do slightly different
# things when you apply to lists. Create a list with
lst <- as.list(as.character(unique(bh2$town)[1:5]))
# What is the difference between these two lines
# of code?
lst[[1]]
lst[1]
# If you want to pull out the 1st through 3rd list element
# with 1:3 do you use single or double brackets?
# ***** Answer below **********
# ***********************************************
# ----- R refresher: Question 13 ----
# ***********************************************
# How would you create a function that takes data
# as input and returns the mean and standard deviation?
# ***********************************************
# ----- R refresher: Question 14 ----
# ***********************************************
# How would you use sapply to compute the average
# and standard deviation of these list elements
# using the function you created above
lst <- list(norm = rnorm(20), unif = runif(10), seq = 1:100)
# ***********************************************
# ----- R refresher: Question 15 ----
# ***********************************************
# In Part 1 we did not discuss the function set.seed()
# but in Part 2 we will use it regularly so that we
# can compare answers.
# The help for set.seed is not easy to follow but
# the basic idea is that the "random" number generator
# is actually a "pseudo-random" number generator and
# if it gets the same starting value (seed) it will
# produce the same results.
# Run the following lines of code. Do you get the same
# results?
sample(1:100, 3) # I get 75 10 53
set.seed(2)
sample(1:100, 3) # I get 19 70 57
# ***********************************************
# ----- R refresher: Question 16 ----
# ***********************************************
# Part 1 did not cover any statistics. If you're taking
# Part 2 I am assuming that you have run statistical
# models (for example, linear or logistic regression)
# in another system (SAS/stata etc). But I want to you
# to be familiar with the basic syntax in R
# Please review the code below to see how a simple
# linear regression model is run. Note, in particular,
# how the ~ (tilda) is used to separate the left
# and right side of the model (y and x)
# average median housing value, average tax by town
dat <- group_by(bh2, town) %>%
summarise(avg.tax = mean(tax), avg.cmedv = mean(cmedv))
# split into training and test dataset. Normally I
# would not do this with so few observations but
# we will for example purposes.
set.seed(100)
part <- sample(1:92, 75)
training <- dat[part,]
testing <- dat[-part,]
plot(dat$avg.tax, dat$avg.cmedv)
# more appropriate model would use the raw
# data but this is an example
lm1 <- lm(avg.cmedv ~ avg.tax, data = training)
summary(lm1)
par(mfrow=c(2,2))
plot(lm1) # definitely not normally distributed resids
# get residuals
head(lm1$residuals)
# get fitted values
head(lm1$fitted.values)
# do prediction
par(mfrow = c(1,1))
preds1 <- predict(lm1, testing)
plot(preds1, testing$avg.cmedv)
abline(0,1) # not good!
# ***********************************************
# ----- R refresher: Question 17 ----
# ***********************************************
# In Part 1 we did not discuss the expand.grid
# function but it will be useful in Part 2
# See if you can figure out what expand grid does.
expand.grid(a = 1:3, b = 1:2)
# ***********************************************
# ----- R refresher: Question 18 ----
# ***********************************************
# Here is a dataset
dat <- data.frame(a = 1:3, b = 1:3, c = 1:3)
# What is the difference between these two
# lines of code for selecting variables? What
# might be a reason to use the first line (with ::)?
dplyr::select(dat, a,c)
select(dat, a, c)
# ***********************************************
# ----- R refresher: Question 19 ----
# ***********************************************
# In Part 1 we talked briefly about using the RDS
# format for saving. What are the functions to
# read and write RDS files? If you had a list like
# the one below why might you use RDS instead of,
# say, CSV or Excel.
list_for_saving <- list(a=rnorm(5), b=matrix(1:10, nrow=2),
c = lm(1:10~rnorm(10)))
# ***********************************************
# ----- R refresher: Question 20 ----
# ***********************************************
# The split function is not something we talked
# about in Part 1 but comes in handy once in awhile
# (and will be useful in a discussion of clusters)
# look at the help for split and split this
# data.frame using the letter variable. What class
# of object does it create and how long is it?
dat <- data.frame(rand = rnorm(30),
letter = sample(c("a", "b", "c"),
size = 30, replace = TRUE))
# Solutions to exercise-1
# *********************************
# ----- Solution 1 ----
# *********************************
# This one should be easy!
library(dplyr)
library(tidyr)
library(mlbench)
# *********************************
# ----- Solution 2 ----
# *********************************
# A couple of options here
head(bh2) # but no info on field types
str(bh2) # not all numeric, two factors
library(dplyr)
glimpse(bh2)
# town and chas are factors
# *********************************
# ----- Solution 3 ----
# *********************************
# Easiest would be
table(bh2$town)
# More useful maybe could be one of these from dplyr
group_by(bh2, town) %>% summarize(cnt = n())
count(bh2, town)
# *********************************
# ----- Solution 4 ----
# *********************************
belmont <- bh2[bh2$town == "Belmont", ]
# *********************************
# ----- Solution 5 ----
# *********************************
avgvals <- group_by(bh2, town) %>%
summarize(avg.crim = mean(crim),
avg.cmedv = mean(cmedv))
# > head(avgvals)
# # A tibble: 6 × 3
# town avg.crim avg.cmedv
# <fctr> <dbl> <dbl>
# 1 Arlington 0.08184571 25.2000
# 2 Ashland 0.04183500 21.4000
# 3 Bedford 0.01813000 30.1000
# 4 Belmont 0.07290500 36.2000
# > tail(avgvals)
# # A tibble: 6 × 3
# town avg.crim avg.cmedv
# <fctr> <dbl> <dbl>
# 1 Westwood 0.0500100 31.23333
# 2 Weymouth 0.0427550 20.03750
# 3 Wilmington 0.1071967 20.10000
# *********************************
# ----- Solution 6 ----
# *********************************
min(avgvals$avg.crim) #0.00632
hist(avgvals$avg.crim)
# *********************************
# ----- Solution 7 ----
# *********************************
library(ggplot2)
ggplot(avgvals, aes(avg.crim, avg.cmedv)) + geom_point()
# *********************************
# ----- Solution 8 ----
# *********************************
# there definitely seems to be a trend with higher
# crime associated with lower median housing values
ggplot(avgvals, aes(avg.crim, avg.cmedv)) +
geom_point() + scale_x_log10() + geom_smooth()
# *********************************
# ----- Solution 9 ----
# *********************************
bh2 <- mutate(bh2, cmedv_cat =
cut(cmedv, breaks = c(0, 20, 35, 55),
labels = c("lowHs", "medHs", "highHs")))
# *********************************
# ----- Solution 10 ----
# *********************************
# Yes a new field is added
bh2 <- group_by(bh2, town) %>%
mutate(total_tracts = n())
# *********************************
# ----- Solution 11 ----
# *********************************
propmed <- group_by(bh2, town, cmedv_cat, total_tracts) %>%
summarize(count = n()) %>%
mutate(proportion = round(count/total_tracts,2))
propmedW <- select(propmed, -total_tracts, -count) %>%
spread(cmedv_cat, proportion, fill = 0)
# *********************************
# ----- Solution 12 ----
# *********************************
# Double brackets will select a single element
# and the object type will be the same as that element.
# In other words, if list element 1 is a data.frame
# then using lst[[1]] will return a data.frame
#
# Single brackets returns a list of elements. So if you
# have a list with a data.frame and then a vector and use
# lst[1:2] a list with a data.frame and vector is returned.
#
# You can only select multiple list elements using single brackets
# lst[1:3]
# *********************************
# ----- Solution 13 ----
# *********************************
f <- function(x){
c(mean = mean(x, na.rm=T), sd = sd(x, na.rm=T))
}
# *********************************
# ----- Solution 14 ----
# *********************************
sapply(lst, f)
# *********************************
# ----- Solution 15 ----
# *********************************
# Unlikely that you get the same results without set.seed
# you should get the same results with set.seed.
# *********************************
# ----- Solution 16 ----
# *********************************
# There was no question here
# *********************************
# ----- Solution 17 ----
# *********************************
# expand.grid will create a dataframe with all
# possible combinations of the input values
# *********************************
# ----- Solution 18 ----
# *********************************
# The double colon can be used to specify the
# package a function comes from. I can see two
# reasons you might use the colon:
# 1) You might have two packages that you loaded that
# each have the same function
# (like MASS and dplyr have select). The package
# loaded last takes precendence so you can use the ::
# to use the masked version of the function. Sometimes
# it's useful to use the :: just to be clear about
# the source of a function even it's not required.
# 2) The other situation where you might use the :: is when
# you don't want to load an entire package and just
# want to use a single function
# *********************************
# ----- Solution 19 ----
# *********************************
# You can use saveRDS() and readRDS() to save and
# read. RDS is useful for saving R objects that
# are not conveniently converted to standard
# tables if you intend to use in R again. So
# the list example I showed would not fit
# easily into a single table and if you wanted
# to save it for later you could save as RDS.
# *********************************
# ----- Solution 20 ----
# *********************************
dat <- data.frame(rand = rnorm(30),
letter = sample(c("a", "b", "c"),
size = 30, replace = TRUE))
datsplit <- split(dat, dat$letter)
class(datsplit) # list
length(datsplit)# length 3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment