Last active
November 8, 2016 14:56
-
-
Save zross/84ef7f61a1ba87421dd95e8b292cd804 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# ********************************************************************************** | |
# Data Science with R, Part 2 - Predictive modeling and machine learning: Exercise 1 | |
# ********************************************************************************** | |
# This exercise is designed to remind you of key R | |
# concepts and code that you will need to be familiar | |
# with before participating in Part 2 of the R workshop. | |
# Please feel free to use Google, stackoverflow etc to help | |
# you answer questions, treat them as if they were real-world | |
# challenges | |
# Answer the questions below. Your answers will not be reviewed | |
# by me. I recommend you write your answer below the question. | |
# load the package with this | |
library(datascienceRworkshop2) | |
# *********************************************** | |
# ----- R refresher: Question 1 ---- | |
# *********************************************** | |
# How do you load packages? Please load dplyr, tidyr, | |
# ggplot2 and mlbench | |
# ***** Answer below ********** | |
# *********************************************** | |
# ----- R refresher: Question 2 ---- | |
# *********************************************** | |
# Please load the BostonHousing2 dataset (not | |
# BostonHousing).Then review the documentation | |
# on this set. Please note that this is tract- | |
# level data from the 1970 Census! Not recent. | |
data("BostonHousing2") | |
?BostonHousing2 | |
# To reduce typing let's rename with | |
bh2 <- BostonHousing2 | |
# A reminder that if you accidentally damage the dataset | |
# you can reload by rerunning: | |
data("BostonHousing2") | |
bh2 <- BostonHousing2 | |
# What function might you use to look at the first few | |
# records. What function would tell you the field types? | |
# Are all of the fields numeric? | |
# You can use the Part 1 cheat sheet if needed. Open it | |
# with: | |
cheat_sheet("part1") | |
# ***** Answer below ********** | |
# *********************************************** | |
# ----- R refresher: Question 3 ---- | |
# *********************************************** | |
# How would you count how many observations there | |
# are per town. You have several options here. | |
# ***** Answer below ********** | |
# *********************************************** | |
# ----- R refresher: Question 4 ---- | |
# *********************************************** | |
# How would you use base functions to create a subset | |
# with only those records from the town Belmont? Call | |
# this new table belmont. | |
# ***** Answer below ********** | |
# *********************************************** | |
# ----- R refresher: Question 5 ---- | |
# *********************************************** | |
# In all questions related to median housing values | |
# you should use the corrected variable (cmedv) | |
# and not the original variable (medv) | |
# You will need to know how to use dplyr and tidyr | |
# in Part 2. If these are relatively new to you | |
# you might review my blog post on the | |
# topic here: http://bit.ly/17HltG8 | |
# Using dplyr on the bh2 set (not your subset) | |
# compute the average crime rate (variable crim) | |
# and average median home value (cmedv) by town. | |
# Call the new table avgvals and call the new | |
# variables avg.crim and avg.cmedv | |
# What is the average crime rate for Arlington? | |
# How about for Westwood? | |
# ***** Answer below ********** | |
# *********************************************** | |
# ----- R refresher: Question 6 ---- | |
# ************************************************************* | |
# How would you do a quick histogram (using base R) of | |
# of the average crime (using your new avgvals dataset)? | |
# What is the minimum crime rate? | |
# ***** Answer below ********** | |
# *********************************************** | |
# ----- R refresher: Question 7 ---- | |
# *********************************************** | |
# How would you use ggplot2 to plot crime (on the x-axis) | |
# against median house values? | |
# ***** Answer below ********** | |
# *********************************************** | |
# ----- R refresher: Question 8 ---- | |
# *********************************************** | |
# Seems like the high values of crime might be | |
# making the plot harder to read. How would | |
# you add a log scale to the crime axis and | |
# a smooth to the ggplot? The log axis was not | |
# covered in Part 1 so you may need to do | |
# a Google search. Does there appear to be | |
# a trend between crime and median housing values? | |
# ***** Answer below ********** | |
# *********************************************** | |
# ----- R refresher: Question 9 ---- | |
# *********************************************** | |
# Here is pseudo-code for computing a three-level | |
# factor of median housing values | |
cut(cmedv, breaks = c(0, 20, 35, 55), | |
labels = c("lowHs", "medHs", "highHs")) # will give error | |
# How would you use dplyr to add a variable to | |
# bh2 with a three-level factor of median housing? | |
# Call the new variable cmedv_cat | |
# ***** Answer below ********** | |
# *********************************************** | |
# ----- R refresher: Question 10 ---- | |
# *********************************************** | |
# If you want to add a column with the total number of | |
# tracts per town you have two basic options in | |
# dplyr. Here is one: | |
# cnt <- group_by(bh2, town) %>% | |
# summarise(total_tracts = n()) | |
# bh2 <- inner_join(bh2, cnt, by = "town") | |
# But you can simplify this and skip the join | |
# by just replacing the summarise function with | |
# mutate | |
# Do this, replace summarise with mutate in the | |
# above code, does it add a new column to your | |
# dataset as expected? | |
# *********************************************** | |
# ----- R refresher: Question 11 ---- | |
# *********************************************** | |
# This one is very difficult (!). You might even want to | |
# take a little break before doing it :) But it's hard | |
# because it is a real-world type of coding challenge. | |
# I want to know what proportion of tracts in the town | |
# are each category of housing value. So what proportion | |
# of tracts are in lowHs from Arlington Ashland etc. How | |
# would you create a table that looks like this (where | |
# lowHs would be proportion of tracts in the lowest | |
# median housing category): | |
# town lowHs medHs highHs | |
# Arlington 0.0 1.00 0.00 | |
# Ashland 0.0 1.00 0.00 | |
# Bedford 0.0 1.00 0.00 | |
# Belmont 0.0 0.38 0.62 | |
# Beverly 0.5 0.50 0.00 | |
# Here are the steps you might use. I would use dplyr and tidyr | |
# Create a new table with tract count by town and median housing | |
# category: | |
# 1. group by town, cmedv_cat and total tracts (total tracts just so output table has this variable) | |
# 2. summarize with n() to get count of tracts | |
# 3. use mutate to add a new column with newcount/total tracts. Result: | |
# town cmedv_cat total_tracts count proportion | |
# Arlington medHs 7 7 1.00 | |
# Ashland medHs 2 2 1.00 | |
# Bedford medHs 2 2 1.00 | |
# Belmont medHs 8 3 0.38 | |
# Belmont highHs 8 5 0.62 | |
# Make into wide format | |
# 1. use select to drop total tracts and count (only need proportions) | |
# 2. use tidyr spread function to make into wide format | |
# town lowHs medHs highHs | |
# Arlington 0.0 1.00 0.00 | |
# Ashland 0.0 1.00 0.00 | |
# Bedford 0.0 1.00 0.00 | |
# Belmont 0.0 0.38 0.62 | |
# Beverly 0.5 0.50 0.00 | |
# ***** Answer below ********** | |
# *********************************************** | |
# ----- R refresher: Question 12 ---- | |
# *********************************************** | |
# Single and double brackets do slightly different | |
# things when you apply to lists. Create a list with | |
lst <- as.list(as.character(unique(bh2$town)[1:5])) | |
# What is the difference between these two lines | |
# of code? | |
lst[[1]] | |
lst[1] | |
# If you want to pull out the 1st through 3rd list element | |
# with 1:3 do you use single or double brackets? | |
# ***** Answer below ********** | |
# *********************************************** | |
# ----- R refresher: Question 13 ---- | |
# *********************************************** | |
# How would you create a function that takes data | |
# as input and returns the mean and standard deviation? | |
# *********************************************** | |
# ----- R refresher: Question 14 ---- | |
# *********************************************** | |
# How would you use sapply to compute the average | |
# and standard deviation of these list elements | |
# using the function you created above | |
lst <- list(norm = rnorm(20), unif = runif(10), seq = 1:100) | |
# *********************************************** | |
# ----- R refresher: Question 15 ---- | |
# *********************************************** | |
# In Part 1 we did not discuss the function set.seed() | |
# but in Part 2 we will use it regularly so that we | |
# can compare answers. | |
# The help for set.seed is not easy to follow but | |
# the basic idea is that the "random" number generator | |
# is actually a "pseudo-random" number generator and | |
# if it gets the same starting value (seed) it will | |
# produce the same results. | |
# Run the following lines of code. Do you get the same | |
# results? | |
sample(1:100, 3) # I get 75 10 53 | |
set.seed(2) | |
sample(1:100, 3) # I get 19 70 57 | |
# *********************************************** | |
# ----- R refresher: Question 16 ---- | |
# *********************************************** | |
# Part 1 did not cover any statistics. If you're taking | |
# Part 2 I am assuming that you have run statistical | |
# models (for example, linear or logistic regression) | |
# in another system (SAS/stata etc). But I want to you | |
# to be familiar with the basic syntax in R | |
# Please review the code below to see how a simple | |
# linear regression model is run. Note, in particular, | |
# how the ~ (tilda) is used to separate the left | |
# and right side of the model (y and x) | |
# average median housing value, average tax by town | |
dat <- group_by(bh2, town) %>% | |
summarise(avg.tax = mean(tax), avg.cmedv = mean(cmedv)) | |
# split into training and test dataset. Normally I | |
# would not do this with so few observations but | |
# we will for example purposes. | |
set.seed(100) | |
part <- sample(1:92, 75) | |
training <- dat[part,] | |
testing <- dat[-part,] | |
plot(dat$avg.tax, dat$avg.cmedv) | |
# more appropriate model would use the raw | |
# data but this is an example | |
lm1 <- lm(avg.cmedv ~ avg.tax, data = training) | |
summary(lm1) | |
par(mfrow=c(2,2)) | |
plot(lm1) # definitely not normally distributed resids | |
# get residuals | |
head(lm1$residuals) | |
# get fitted values | |
head(lm1$fitted.values) | |
# do prediction | |
par(mfrow = c(1,1)) | |
preds1 <- predict(lm1, testing) | |
plot(preds1, testing$avg.cmedv) | |
abline(0,1) # not good! | |
# *********************************************** | |
# ----- R refresher: Question 17 ---- | |
# *********************************************** | |
# In Part 1 we did not discuss the expand.grid | |
# function but it will be useful in Part 2 | |
# See if you can figure out what expand grid does. | |
expand.grid(a = 1:3, b = 1:2) | |
# *********************************************** | |
# ----- R refresher: Question 18 ---- | |
# *********************************************** | |
# Here is a dataset | |
dat <- data.frame(a = 1:3, b = 1:3, c = 1:3) | |
# What is the difference between these two | |
# lines of code for selecting variables? What | |
# might be a reason to use the first line (with ::)? | |
dplyr::select(dat, a,c) | |
select(dat, a, c) | |
# *********************************************** | |
# ----- R refresher: Question 19 ---- | |
# *********************************************** | |
# In Part 1 we talked briefly about using the RDS | |
# format for saving. What are the functions to | |
# read and write RDS files? If you had a list like | |
# the one below why might you use RDS instead of, | |
# say, CSV or Excel. | |
list_for_saving <- list(a=rnorm(5), b=matrix(1:10, nrow=2), | |
c = lm(1:10~rnorm(10))) | |
# *********************************************** | |
# ----- R refresher: Question 20 ---- | |
# *********************************************** | |
# The split function is not something we talked | |
# about in Part 1 but comes in handy once in awhile | |
# (and will be useful in a discussion of clusters) | |
# look at the help for split and split this | |
# data.frame using the letter variable. What class | |
# of object does it create and how long is it? | |
dat <- data.frame(rand = rnorm(30), | |
letter = sample(c("a", "b", "c"), | |
size = 30, replace = TRUE)) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# Solutions to exercise-1 | |
# ********************************* | |
# ----- Solution 1 ---- | |
# ********************************* | |
# This one should be easy! | |
library(dplyr) | |
library(tidyr) | |
library(mlbench) | |
# ********************************* | |
# ----- Solution 2 ---- | |
# ********************************* | |
# A couple of options here | |
head(bh2) # but no info on field types | |
str(bh2) # not all numeric, two factors | |
library(dplyr) | |
glimpse(bh2) | |
# town and chas are factors | |
# ********************************* | |
# ----- Solution 3 ---- | |
# ********************************* | |
# Easiest would be | |
table(bh2$town) | |
# More useful maybe could be one of these from dplyr | |
group_by(bh2, town) %>% summarize(cnt = n()) | |
count(bh2, town) | |
# ********************************* | |
# ----- Solution 4 ---- | |
# ********************************* | |
belmont <- bh2[bh2$town == "Belmont", ] | |
# ********************************* | |
# ----- Solution 5 ---- | |
# ********************************* | |
avgvals <- group_by(bh2, town) %>% | |
summarize(avg.crim = mean(crim), | |
avg.cmedv = mean(cmedv)) | |
# > head(avgvals) | |
# # A tibble: 6 × 3 | |
# town avg.crim avg.cmedv | |
# <fctr> <dbl> <dbl> | |
# 1 Arlington 0.08184571 25.2000 | |
# 2 Ashland 0.04183500 21.4000 | |
# 3 Bedford 0.01813000 30.1000 | |
# 4 Belmont 0.07290500 36.2000 | |
# > tail(avgvals) | |
# # A tibble: 6 × 3 | |
# town avg.crim avg.cmedv | |
# <fctr> <dbl> <dbl> | |
# 1 Westwood 0.0500100 31.23333 | |
# 2 Weymouth 0.0427550 20.03750 | |
# 3 Wilmington 0.1071967 20.10000 | |
# ********************************* | |
# ----- Solution 6 ---- | |
# ********************************* | |
min(avgvals$avg.crim) #0.00632 | |
hist(avgvals$avg.crim) | |
# ********************************* | |
# ----- Solution 7 ---- | |
# ********************************* | |
library(ggplot2) | |
ggplot(avgvals, aes(avg.crim, avg.cmedv)) + geom_point() | |
# ********************************* | |
# ----- Solution 8 ---- | |
# ********************************* | |
# there definitely seems to be a trend with higher | |
# crime associated with lower median housing values | |
ggplot(avgvals, aes(avg.crim, avg.cmedv)) + | |
geom_point() + scale_x_log10() + geom_smooth() | |
# ********************************* | |
# ----- Solution 9 ---- | |
# ********************************* | |
bh2 <- mutate(bh2, cmedv_cat = | |
cut(cmedv, breaks = c(0, 20, 35, 55), | |
labels = c("lowHs", "medHs", "highHs"))) | |
# ********************************* | |
# ----- Solution 10 ---- | |
# ********************************* | |
# Yes a new field is added | |
bh2 <- group_by(bh2, town) %>% | |
mutate(total_tracts = n()) | |
# ********************************* | |
# ----- Solution 11 ---- | |
# ********************************* | |
propmed <- group_by(bh2, town, cmedv_cat, total_tracts) %>% | |
summarize(count = n()) %>% | |
mutate(proportion = round(count/total_tracts,2)) | |
propmedW <- select(propmed, -total_tracts, -count) %>% | |
spread(cmedv_cat, proportion, fill = 0) | |
# ********************************* | |
# ----- Solution 12 ---- | |
# ********************************* | |
# Double brackets will select a single element | |
# and the object type will be the same as that element. | |
# In other words, if list element 1 is a data.frame | |
# then using lst[[1]] will return a data.frame | |
# | |
# Single brackets returns a list of elements. So if you | |
# have a list with a data.frame and then a vector and use | |
# lst[1:2] a list with a data.frame and vector is returned. | |
# | |
# You can only select multiple list elements using single brackets | |
# lst[1:3] | |
# ********************************* | |
# ----- Solution 13 ---- | |
# ********************************* | |
f <- function(x){ | |
c(mean = mean(x, na.rm=T), sd = sd(x, na.rm=T)) | |
} | |
# ********************************* | |
# ----- Solution 14 ---- | |
# ********************************* | |
sapply(lst, f) | |
# ********************************* | |
# ----- Solution 15 ---- | |
# ********************************* | |
# Unlikely that you get the same results without set.seed | |
# you should get the same results with set.seed. | |
# ********************************* | |
# ----- Solution 16 ---- | |
# ********************************* | |
# There was no question here | |
# ********************************* | |
# ----- Solution 17 ---- | |
# ********************************* | |
# expand.grid will create a dataframe with all | |
# possible combinations of the input values | |
# ********************************* | |
# ----- Solution 18 ---- | |
# ********************************* | |
# The double colon can be used to specify the | |
# package a function comes from. I can see two | |
# reasons you might use the colon: | |
# 1) You might have two packages that you loaded that | |
# each have the same function | |
# (like MASS and dplyr have select). The package | |
# loaded last takes precendence so you can use the :: | |
# to use the masked version of the function. Sometimes | |
# it's useful to use the :: just to be clear about | |
# the source of a function even it's not required. | |
# 2) The other situation where you might use the :: is when | |
# you don't want to load an entire package and just | |
# want to use a single function | |
# ********************************* | |
# ----- Solution 19 ---- | |
# ********************************* | |
# You can use saveRDS() and readRDS() to save and | |
# read. RDS is useful for saving R objects that | |
# are not conveniently converted to standard | |
# tables if you intend to use in R again. So | |
# the list example I showed would not fit | |
# easily into a single table and if you wanted | |
# to save it for later you could save as RDS. | |
# ********************************* | |
# ----- Solution 20 ---- | |
# ********************************* | |
dat <- data.frame(rand = rnorm(30), | |
letter = sample(c("a", "b", "c"), | |
size = 30, replace = TRUE)) | |
datsplit <- split(dat, dat$letter) | |
class(datsplit) # list | |
length(datsplit)# length 3 | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment