zross/exercise1.R

## exercise1.R
# **********************************************************************************
# Data Science with R, Part 2 - Predictive modeling and machine learning: Exercise 1
# **********************************************************************************

# This exercise is designed to remind you of key R
# concepts and code that you will need to be familiar
# with before participating in Part 2 of the R workshop.
# Please feel free to use Google, stackoverflow etc to help
# you answer questions, treat them as if they were real-world
# challenges


# Answer the questions below. Your answers will not be reviewed
# by me. I recommend you write your answer below the question.

# load the package with this
library(datascienceRworkshop2)


# ***********************************************
# ----- R refresher: Question 1 ----
# ***********************************************


# How do you load packages? Please load dplyr, tidyr,
# ggplot2 and mlbench


# ***** Answer below **********


# ***********************************************
# ----- R refresher: Question 2 ----
# ***********************************************


# Please load the BostonHousing2 dataset (not
# BostonHousing).Then review the documentation
# on this set. Please note that this is tract-
# level data from the 1970 Census! Not recent.


data("BostonHousing2")
?BostonHousing2

# To reduce typing let's rename with

bh2 <- BostonHousing2


# A reminder that if you accidentally damage the dataset
# you can reload by rerunning:

data("BostonHousing2")
bh2 <- BostonHousing2

# What function might you use to look at the first few
# records. What function would tell you the field types?
# Are all of the fields numeric?


# You can use the Part 1 cheat sheet if needed. Open it
# with:

cheat_sheet("part1")


# ***** Answer below **********


# ***********************************************
# ----- R refresher: Question 3 ----
# ***********************************************

# How would you count how many observations there
# are per town. You have several options here.


# ***** Answer below **********


# ***********************************************
# ----- R refresher: Question 4 ----
# ***********************************************

# How would you use base functions to create a subset
# with only those records from the town Belmont? Call
# this new table belmont.


# ***** Answer below **********


# ***********************************************
# ----- R refresher: Question 5 ----
# ***********************************************

# In all questions related to median housing values
# you should use the corrected variable (cmedv)
# and not the original variable (medv)

# You will need to know how to use dplyr and tidyr
# in Part 2. If these are relatively new to you
# you might review my blog post on the
# topic here: http://bit.ly/17HltG8

# Using dplyr on the bh2 set (not your subset)
# compute the average crime rate (variable crim)
# and average median home value (cmedv) by town.
# Call the new table avgvals and call the new
# variables avg.crim and avg.cmedv

# What is the average crime rate for Arlington?
# How about for Westwood?


# ***** Answer below **********


# ***********************************************
# ----- R refresher: Question 6 ----
# *************************************************************

# How would you do a quick histogram (using base R) of
# of the average crime (using your new avgvals dataset)?
# What is the minimum crime rate?


# ***** Answer below **********


# ***********************************************
# ----- R refresher: Question 7 ----
# ***********************************************

# How would you use ggplot2 to plot crime (on the x-axis)
# against median house values?


# ***** Answer below **********


# ***********************************************
# ----- R refresher: Question 8 ----
# ***********************************************

# Seems like the high values of crime might be
# making the plot harder to read. How would
# you add a log scale to the crime axis and
# a smooth to the ggplot? The log axis was not
# covered in Part 1 so you may need to do
# a Google search. Does there appear to be
# a trend between crime and median housing values?


# ***** Answer below **********


# ***********************************************
# ----- R refresher: Question 9 ----
# ***********************************************

# Here is pseudo-code for computing a three-level
# factor of median housing values

cut(cmedv, breaks = c(0, 20, 35, 55),
    labels = c("lowHs", "medHs", "highHs")) # will give error

# How would you use dplyr to add a variable to
# bh2 with a three-level factor of median housing?
# Call the new variable cmedv_cat


# ***** Answer below **********


# ***********************************************
# ----- R refresher: Question 10 ----
# ***********************************************

# If you want to add a column with the total number of
# tracts per town you have two basic options in
# dplyr. Here is one:

# cnt <- group_by(bh2, town) %>%
#       summarise(total_tracts = n())
# bh2 <- inner_join(bh2, cnt, by = "town")

# But you can simplify this and skip the join
# by just replacing the summarise function with
# mutate

# Do this, replace summarise with mutate in the
# above code, does it add a new column to your
# dataset as expected?


# ***********************************************
# ----- R refresher: Question 11 ----
# ***********************************************

# This one is very difficult (!). You might even want to
# take a little break before doing it :) But it's hard
# because it is a real-world type of coding challenge.

# I want to know what proportion of tracts in the town
# are each category of housing value. So what proportion
# of tracts are in lowHs from Arlington Ashland etc. How
# would you create a table that looks like this (where
# lowHs would be proportion of tracts in the lowest
# median housing category):

 #     town lowHs medHs highHs
 # Arlington   0.0  1.00   0.00
 #   Ashland   0.0  1.00   0.00
 #   Bedford   0.0  1.00   0.00
 #   Belmont   0.0  0.38   0.62
 #   Beverly   0.5  0.50   0.00


# Here are the steps you might use. I would use dplyr and tidyr

# Create a new table with tract count by town and median housing
# category:

#  1. group by town, cmedv_cat and total tracts (total tracts just so output table has this variable)
#  2. summarize with n() to get count of tracts
#  3. use mutate to add a new column with newcount/total tracts. Result:

 #      town cmedv_cat total_tracts count proportion
 # Arlington     medHs            7     7       1.00
 #   Ashland     medHs            2     2       1.00
 #   Bedford     medHs            2     2       1.00
 #   Belmont     medHs            8     3       0.38
 #   Belmont    highHs            8     5       0.62

# Make into wide format

#  1.  use select to drop total tracts and count (only need proportions)
#  2.  use tidyr spread function to make into wide format

 #     town lowHs medHs highHs
 # Arlington   0.0  1.00   0.00
 #   Ashland   0.0  1.00   0.00
 #   Bedford   0.0  1.00   0.00
 #   Belmont   0.0  0.38   0.62
 #   Beverly   0.5  0.50   0.00


# ***** Answer below **********


# ***********************************************
# ----- R refresher: Question 12 ----
# ***********************************************


# Single and double brackets do slightly different
# things when you apply to lists. Create a list with

lst <- as.list(as.character(unique(bh2$town)[1:5]))

# What is the difference between these two lines
# of code?


lst[[1]]
lst[1]


# If you want to pull out the 1st through 3rd list element
# with 1:3 do you use single or double brackets?


# ***** Answer below **********


# ***********************************************
# ----- R refresher: Question 13 ----
# ***********************************************

# How would you create a function that takes data
# as input and returns the mean and standard deviation?


# ***********************************************
# ----- R refresher: Question 14 ----
# ***********************************************

# How would you use sapply to compute the average
# and standard deviation of these list elements
# using the function you created above

lst <- list(norm = rnorm(20), unif = runif(10), seq = 1:100)


# ***********************************************
# ----- R refresher: Question 15 ----
# ***********************************************

# In Part 1 we did not discuss the function set.seed()
# but in Part 2 we will use it regularly so that we
# can compare answers.

# The help for set.seed is not easy to follow but
# the basic idea is that the "random" number generator
# is actually a "pseudo-random" number generator and
# if it gets the same starting value (seed) it will
# produce the same results.

# Run the following lines of code. Do you get the same
# results?

sample(1:100, 3) # I get 75 10 53

set.seed(2)
sample(1:100, 3) # I get 19 70 57


# ***********************************************
# ----- R refresher: Question 16 ----
# ***********************************************

# Part 1 did not cover any statistics. If you're taking
# Part 2 I am assuming that you have run statistical
# models (for example, linear or logistic regression)
# in another system (SAS/stata etc). But I want to you
# to be familiar with the basic syntax in R

# Please review the code below to see how a simple
# linear regression model is run. Note, in particular,
# how the ~ (tilda) is used to separate the left
# and right side of the model (y and x)


# average median housing value, average tax by town
dat <- group_by(bh2, town) %>%
  summarise(avg.tax = mean(tax), avg.cmedv = mean(cmedv))


# split into training and test dataset. Normally I
# would not do this with so few observations but
# we will for example purposes.

set.seed(100)
part <- sample(1:92, 75)
training <- dat[part,]
testing <- dat[-part,]

plot(dat$avg.tax, dat$avg.cmedv)

# more appropriate model would use the raw
# data but this is an example
lm1 <- lm(avg.cmedv ~ avg.tax, data = training)
summary(lm1)
par(mfrow=c(2,2))
plot(lm1) # definitely not normally distributed resids

# get residuals
head(lm1$residuals)

# get fitted values
head(lm1$fitted.values)

# do prediction
par(mfrow = c(1,1))
preds1 <- predict(lm1, testing)
plot(preds1, testing$avg.cmedv)
abline(0,1) # not good!


# ***********************************************
# ----- R refresher: Question 17 ----
# ***********************************************

# In Part 1 we did not discuss the expand.grid
# function but it will be useful in Part 2

# See if you can figure out what expand grid does.
expand.grid(a = 1:3, b = 1:2)


# ***********************************************
# ----- R refresher: Question 18 ----
# ***********************************************

# Here is a dataset

dat <- data.frame(a = 1:3, b = 1:3, c = 1:3)


# What is the difference between these two
# lines of code for selecting variables? What
# might be a reason to use the first line (with ::)?

dplyr::select(dat, a,c)
select(dat, a, c)


# ***********************************************
# ----- R refresher: Question 19 ----
# ***********************************************

# In Part 1 we talked briefly about using the RDS
# format for saving. What are the functions to
# read and write RDS files? If you had a list like
# the one below why might you use RDS instead of,
# say, CSV or Excel.

list_for_saving <- list(a=rnorm(5), b=matrix(1:10, nrow=2),
                           c = lm(1:10~rnorm(10)))


# ***********************************************
# ----- R refresher: Question 20 ----
# ***********************************************

# The split function is not something we talked
# about in Part 1 but comes in handy once in awhile
# (and will be useful in a discussion of clusters)
# look at the help for split and split this
# data.frame using the letter variable. What class
# of object does it create and how long is it?
dat <- data.frame(rand = rnorm(30),
                  letter = sample(c("a", "b", "c"),
                      size = 30, replace = TRUE))

## solution1.R
# Solutions to exercise-1


# *********************************
# ----- Solution 1 ----
# *********************************

# This one should be easy!

library(dplyr)
library(tidyr)
library(mlbench)


# *********************************
# ----- Solution 2 ----
# *********************************

# A couple of options here

head(bh2) # but no info on field types
str(bh2) # not all numeric, two factors
library(dplyr)
glimpse(bh2)

# town and chas are factors


# *********************************
# ----- Solution 3 ----
# *********************************

# Easiest would be

table(bh2$town)


# More useful maybe could be one of these from dplyr

group_by(bh2, town) %>% summarize(cnt = n())
count(bh2, town)


# *********************************
# ----- Solution 4 ----
# *********************************


belmont <- bh2[bh2$town == "Belmont", ]


# *********************************
# ----- Solution 5 ----
# *********************************


avgvals <- group_by(bh2, town) %>%
  summarize(avg.crim = mean(crim),
            avg.cmedv = mean(cmedv))

# > head(avgvals)
# # A tibble: 6 × 3
#                      town   avg.crim avg.cmedv
#                    <fctr>      <dbl>     <dbl>
# 1               Arlington 0.08184571   25.2000
# 2                 Ashland 0.04183500   21.4000
# 3                 Bedford 0.01813000   30.1000
# 4                 Belmont 0.07290500   36.2000

# > tail(avgvals)
# # A tibble: 6 × 3
#         town  avg.crim avg.cmedv
#       <fctr>     <dbl>     <dbl>
# 1   Westwood 0.0500100  31.23333
# 2   Weymouth 0.0427550  20.03750
# 3 Wilmington 0.1071967  20.10000


# *********************************
# ----- Solution 6 ----
# *********************************

min(avgvals$avg.crim) #0.00632
hist(avgvals$avg.crim)


# *********************************
# ----- Solution 7 ----
# *********************************

library(ggplot2)
ggplot(avgvals, aes(avg.crim, avg.cmedv)) + geom_point()


# *********************************
# ----- Solution 8 ----
# *********************************

# there definitely seems to be a trend with higher
# crime associated with lower median housing values

ggplot(avgvals, aes(avg.crim, avg.cmedv)) +
  geom_point() + scale_x_log10() + geom_smooth()


# *********************************
# ----- Solution 9 ----
# *********************************


bh2 <- mutate(bh2, cmedv_cat =
                cut(cmedv, breaks = c(0, 20, 35, 55),
                    labels = c("lowHs", "medHs", "highHs")))


# *********************************
# ----- Solution 10 ----
# *********************************

# Yes a new field is added

bh2 <- group_by(bh2, town) %>%
  mutate(total_tracts = n())


# *********************************
# ----- Solution 11 ----
# *********************************

propmed <- group_by(bh2, town, cmedv_cat, total_tracts) %>%
  summarize(count = n()) %>%
  mutate(proportion = round(count/total_tracts,2))

propmedW <- select(propmed, -total_tracts, -count) %>%
  spread(cmedv_cat, proportion, fill = 0)


# *********************************
# ----- Solution 12 ----
# *********************************

# Double brackets will select a single element
# and the object type will be the same as that element.
# In other words, if list element 1 is a data.frame
# then using lst[[1]] will return a data.frame
#
# Single brackets returns a list of elements. So if you
# have a list with a data.frame and then a vector and use
# lst[1:2] a list with a data.frame and vector is returned.
#
# You can only select multiple list elements using single brackets
# lst[1:3]


# *********************************
# ----- Solution 13 ----
# *********************************

f <- function(x){

  c(mean = mean(x, na.rm=T), sd = sd(x, na.rm=T))
}


# *********************************
# ----- Solution 14 ----
# *********************************


sapply(lst, f)


# *********************************
# ----- Solution 15 ----
# *********************************

# Unlikely that you get the same results without set.seed
# you should get the same results with set.seed.


# *********************************
# ----- Solution 16 ----
# *********************************

# There was no question here


# *********************************
# ----- Solution 17 ----
# *********************************

# expand.grid will create a dataframe with all
# possible combinations of the input values


# *********************************
# ----- Solution 18 ----
# *********************************


# The double colon can be used to specify the
# package a function comes from. I can see two
# reasons you might use the colon:

# 1) You might have two packages that you loaded that
#    each have the same function
#    (like MASS and dplyr have select). The package
#    loaded last takes precendence so you can use the ::
#    to use the masked version of the function. Sometimes
#    it's useful to use the :: just to be clear about
#    the source of a function even it's not required.

# 2) The other situation where you might use the :: is when
#    you don't want to load an entire package and just
#    want to use a single function


# *********************************
# ----- Solution 19 ----
# *********************************

# You can use saveRDS() and readRDS() to save and
# read. RDS is useful for saving R objects that
# are not conveniently converted to standard
# tables if you intend to use in R again. So
# the list example I showed would not fit
# easily into a single table and if you wanted
# to save it for later you could save as RDS.


# *********************************
# ----- Solution 20 ----
# *********************************


dat <- data.frame(rand = rnorm(30),
                  letter = sample(c("a", "b", "c"),
                                  size = 30, replace = TRUE))
datsplit <- split(dat, dat$letter)
class(datsplit) # list
length(datsplit)# length 3
	# **********************************************************************************
	# Data Science with R, Part 2 - Predictive modeling and machine learning: Exercise 1
	# **********************************************************************************

	# This exercise is designed to remind you of key R
	# concepts and code that you will need to be familiar
	# with before participating in Part 2 of the R workshop.
	# Please feel free to use Google, stackoverflow etc to help
	# you answer questions, treat them as if they were real-world
	# challenges


	# Answer the questions below. Your answers will not be reviewed
	# by me. I recommend you write your answer below the question.

	# load the package with this
	library(datascienceRworkshop2)


	# ***********************************************
	# ----- R refresher: Question 1 ----
	# ***********************************************


	# How do you load packages? Please load dplyr, tidyr,
	# ggplot2 and mlbench



	# *** Answer below ********





	# ***********************************************
	# ----- R refresher: Question 2 ----
	# ***********************************************


	# Please load the BostonHousing2 dataset (not
	# BostonHousing).Then review the documentation
	# on this set. Please note that this is tract-
	# level data from the 1970 Census! Not recent.


	data("BostonHousing2")
	?BostonHousing2

	# To reduce typing let's rename with

	bh2 <- BostonHousing2


	# A reminder that if you accidentally damage the dataset
	# you can reload by rerunning:

	data("BostonHousing2")
	bh2 <- BostonHousing2

	# What function might you use to look at the first few
	# records. What function would tell you the field types?
	# Are all of the fields numeric?


	# You can use the Part 1 cheat sheet if needed. Open it
	# with:

	cheat_sheet("part1")



	# *** Answer below ********






	# ***********************************************
	# ----- R refresher: Question 3 ----
	# ***********************************************

	# How would you count how many observations there
	# are per town. You have several options here.



	# *** Answer below ********




	# ***********************************************
	# ----- R refresher: Question 4 ----
	# ***********************************************

	# How would you use base functions to create a subset
	# with only those records from the town Belmont? Call
	# this new table belmont.





	# *** Answer below ********




	# ***********************************************
	# ----- R refresher: Question 5 ----
	# ***********************************************

	# In all questions related to median housing values
	# you should use the corrected variable (cmedv)
	# and not the original variable (medv)

	# You will need to know how to use dplyr and tidyr
	# in Part 2. If these are relatively new to you
	# you might review my blog post on the
	# topic here: http://bit.ly/17HltG8

	# Using dplyr on the bh2 set (not your subset)
	# compute the average crime rate (variable crim)
	# and average median home value (cmedv) by town.
	# Call the new table avgvals and call the new
	# variables avg.crim and avg.cmedv

	# What is the average crime rate for Arlington?
	# How about for Westwood?



	# *** Answer below ********



	# ***********************************************
	# ----- R refresher: Question 6 ----
	# *************************************************************

	# How would you do a quick histogram (using base R) of
	# of the average crime (using your new avgvals dataset)?
	# What is the minimum crime rate?




	# *** Answer below ********



	# ***********************************************
	# ----- R refresher: Question 7 ----
	# ***********************************************

	# How would you use ggplot2 to plot crime (on the x-axis)
	# against median house values?




	# *** Answer below ********



	# ***********************************************
	# ----- R refresher: Question 8 ----
	# ***********************************************

	# Seems like the high values of crime might be
	# making the plot harder to read. How would
	# you add a log scale to the crime axis and
	# a smooth to the ggplot? The log axis was not
	# covered in Part 1 so you may need to do
	# a Google search. Does there appear to be
	# a trend between crime and median housing values?






	# *** Answer below ********






	# ***********************************************
	# ----- R refresher: Question 9 ----
	# ***********************************************

	# Here is pseudo-code for computing a three-level
	# factor of median housing values

	cut(cmedv, breaks = c(0, 20, 35, 55),
	labels = c("lowHs", "medHs", "highHs")) # will give error

	# How would you use dplyr to add a variable to
	# bh2 with a three-level factor of median housing?
	# Call the new variable cmedv_cat




	# *** Answer below ********





	# ***********************************************
	# ----- R refresher: Question 10 ----
	# ***********************************************

	# If you want to add a column with the total number of
	# tracts per town you have two basic options in
	# dplyr. Here is one:

	# cnt <- group_by(bh2, town) %>%
	# summarise(total_tracts = n())
	# bh2 <- inner_join(bh2, cnt, by = "town")

	# But you can simplify this and skip the join
	# by just replacing the summarise function with
	# mutate

	# Do this, replace summarise with mutate in the
	# above code, does it add a new column to your
	# dataset as expected?






	# ***********************************************
	# ----- R refresher: Question 11 ----
	# ***********************************************

	# This one is very difficult (!). You might even want to
	# take a little break before doing it :) But it's hard
	# because it is a real-world type of coding challenge.

	# I want to know what proportion of tracts in the town
	# are each category of housing value. So what proportion
	# of tracts are in lowHs from Arlington Ashland etc. How
	# would you create a table that looks like this (where
	# lowHs would be proportion of tracts in the lowest
	# median housing category):

	# town lowHs medHs highHs
	# Arlington 0.0 1.00 0.00
	# Ashland 0.0 1.00 0.00
	# Bedford 0.0 1.00 0.00
	# Belmont 0.0 0.38 0.62
	# Beverly 0.5 0.50 0.00


	# Here are the steps you might use. I would use dplyr and tidyr

	# Create a new table with tract count by town and median housing
	# category:

	# 1. group by town, cmedv_cat and total tracts (total tracts just so output table has this variable)
	# 2. summarize with n() to get count of tracts
	# 3. use mutate to add a new column with newcount/total tracts. Result:

	# town cmedv_cat total_tracts count proportion
	# Arlington medHs 7 7 1.00
	# Ashland medHs 2 2 1.00
	# Bedford medHs 2 2 1.00
	# Belmont medHs 8 3 0.38
	# Belmont highHs 8 5 0.62

	# Make into wide format

	# 1. use select to drop total tracts and count (only need proportions)
	# 2. use tidyr spread function to make into wide format

	# town lowHs medHs highHs
	# Arlington 0.0 1.00 0.00
	# Ashland 0.0 1.00 0.00
	# Bedford 0.0 1.00 0.00
	# Belmont 0.0 0.38 0.62
	# Beverly 0.5 0.50 0.00




	# *** Answer below ********






	# ***********************************************
	# ----- R refresher: Question 12 ----
	# ***********************************************


	# Single and double brackets do slightly different
	# things when you apply to lists. Create a list with

	lst <- as.list(as.character(unique(bh2$town)[1:5]))

	# What is the difference between these two lines
	# of code?


	lst[[1]]
	lst[1]


	# If you want to pull out the 1st through 3rd list element
	# with 1:3 do you use single or double brackets?




	# *** Answer below ********





	# ***********************************************
	# ----- R refresher: Question 13 ----
	# ***********************************************

	# How would you create a function that takes data
	# as input and returns the mean and standard deviation?




	# ***********************************************
	# ----- R refresher: Question 14 ----
	# ***********************************************

	# How would you use sapply to compute the average
	# and standard deviation of these list elements
	# using the function you created above

	lst <- list(norm = rnorm(20), unif = runif(10), seq = 1:100)


	# ***********************************************
	# ----- R refresher: Question 15 ----
	# ***********************************************

	# In Part 1 we did not discuss the function set.seed()
	# but in Part 2 we will use it regularly so that we
	# can compare answers.

	# The help for set.seed is not easy to follow but
	# the basic idea is that the "random" number generator
	# is actually a "pseudo-random" number generator and
	# if it gets the same starting value (seed) it will
	# produce the same results.

	# Run the following lines of code. Do you get the same
	# results?

	sample(1:100, 3) # I get 75 10 53

	set.seed(2)
	sample(1:100, 3) # I get 19 70 57



	# ***********************************************
	# ----- R refresher: Question 16 ----
	# ***********************************************

	# Part 1 did not cover any statistics. If you're taking
	# Part 2 I am assuming that you have run statistical
	# models (for example, linear or logistic regression)
	# in another system (SAS/stata etc). But I want to you
	# to be familiar with the basic syntax in R

	# Please review the code below to see how a simple
	# linear regression model is run. Note, in particular,
	# how the ~ (tilda) is used to separate the left
	# and right side of the model (y and x)


	# average median housing value, average tax by town
	dat <- group_by(bh2, town) %>%
	summarise(avg.tax = mean(tax), avg.cmedv = mean(cmedv))


	# split into training and test dataset. Normally I
	# would not do this with so few observations but
	# we will for example purposes.

	set.seed(100)
	part <- sample(1:92, 75)
	training <- dat[part,]
	testing <- dat[-part,]

	plot(dat$avg.tax, dat$avg.cmedv)

	# more appropriate model would use the raw
	# data but this is an example
	lm1 <- lm(avg.cmedv ~ avg.tax, data = training)
	summary(lm1)
	par(mfrow=c(2,2))
	plot(lm1) # definitely not normally distributed resids

	# get residuals
	head(lm1$residuals)

	# get fitted values
	head(lm1$fitted.values)

	# do prediction
	par(mfrow = c(1,1))
	preds1 <- predict(lm1, testing)
	plot(preds1, testing$avg.cmedv)
	abline(0,1) # not good!


	# ***********************************************
	# ----- R refresher: Question 17 ----
	# ***********************************************

	# In Part 1 we did not discuss the expand.grid
	# function but it will be useful in Part 2

	# See if you can figure out what expand grid does.
	expand.grid(a = 1:3, b = 1:2)




	# ***********************************************
	# ----- R refresher: Question 18 ----
	# ***********************************************

	# Here is a dataset

	dat <- data.frame(a = 1:3, b = 1:3, c = 1:3)


	# What is the difference between these two
	# lines of code for selecting variables? What
	# might be a reason to use the first line (with ::)?

	dplyr::select(dat, a,c)
	select(dat, a, c)


	# ***********************************************
	# ----- R refresher: Question 19 ----
	# ***********************************************

	# In Part 1 we talked briefly about using the RDS
	# format for saving. What are the functions to
	# read and write RDS files? If you had a list like
	# the one below why might you use RDS instead of,
	# say, CSV or Excel.

	list_for_saving <- list(a=rnorm(5), b=matrix(1:10, nrow=2),
	c = lm(1:10~rnorm(10)))



	# ***********************************************
	# ----- R refresher: Question 20 ----
	# ***********************************************

	# The split function is not something we talked
	# about in Part 1 but comes in handy once in awhile
	# (and will be useful in a discussion of clusters)
	# look at the help for split and split this
	# data.frame using the letter variable. What class
	# of object does it create and how long is it?
	dat <- data.frame(rand = rnorm(30),
	letter = sample(c("a", "b", "c"),
	size = 30, replace = TRUE))
	# Solutions to exercise-1





	# *********************************
	# ----- Solution 1 ----
	# *********************************

	# This one should be easy!

	library(dplyr)
	library(tidyr)
	library(mlbench)


	# *********************************
	# ----- Solution 2 ----
	# *********************************

	# A couple of options here

	head(bh2) # but no info on field types
	str(bh2) # not all numeric, two factors
	library(dplyr)
	glimpse(bh2)

	# town and chas are factors




	# *********************************
	# ----- Solution 3 ----
	# *********************************

	# Easiest would be

	table(bh2$town)


	# More useful maybe could be one of these from dplyr

	group_by(bh2, town) %>% summarize(cnt = n())
	count(bh2, town)



	# *********************************
	# ----- Solution 4 ----
	# *********************************


	belmont <- bh2[bh2$town == "Belmont", ]


	# *********************************
	# ----- Solution 5 ----
	# *********************************


	avgvals <- group_by(bh2, town) %>%
	summarize(avg.crim = mean(crim),
	avg.cmedv = mean(cmedv))

	# > head(avgvals)
	# # A tibble: 6 × 3
	# town avg.crim avg.cmedv
	# <fctr> <dbl> <dbl>
	# 1 Arlington 0.08184571 25.2000
	# 2 Ashland 0.04183500 21.4000
	# 3 Bedford 0.01813000 30.1000
	# 4 Belmont 0.07290500 36.2000

	# > tail(avgvals)
	# # A tibble: 6 × 3
	# town avg.crim avg.cmedv
	# <fctr> <dbl> <dbl>
	# 1 Westwood 0.0500100 31.23333
	# 2 Weymouth 0.0427550 20.03750
	# 3 Wilmington 0.1071967 20.10000


	# *********************************
	# ----- Solution 6 ----
	# *********************************

	min(avgvals$avg.crim) #0.00632
	hist(avgvals$avg.crim)





	# *********************************
	# ----- Solution 7 ----
	# *********************************

	library(ggplot2)
	ggplot(avgvals, aes(avg.crim, avg.cmedv)) + geom_point()



	# *********************************
	# ----- Solution 8 ----
	# *********************************

	# there definitely seems to be a trend with higher
	# crime associated with lower median housing values

	ggplot(avgvals, aes(avg.crim, avg.cmedv)) +
	geom_point() + scale_x_log10() + geom_smooth()




	# *********************************
	# ----- Solution 9 ----
	# *********************************


	bh2 <- mutate(bh2, cmedv_cat =
	cut(cmedv, breaks = c(0, 20, 35, 55),
	labels = c("lowHs", "medHs", "highHs")))



	# *********************************
	# ----- Solution 10 ----
	# *********************************

	# Yes a new field is added

	bh2 <- group_by(bh2, town) %>%
	mutate(total_tracts = n())


	# *********************************
	# ----- Solution 11 ----
	# *********************************

	propmed <- group_by(bh2, town, cmedv_cat, total_tracts) %>%
	summarize(count = n()) %>%
	mutate(proportion = round(count/total_tracts,2))

	propmedW <- select(propmed, -total_tracts, -count) %>%
	spread(cmedv_cat, proportion, fill = 0)




	# *********************************
	# ----- Solution 12 ----
	# *********************************

	# Double brackets will select a single element
	# and the object type will be the same as that element.
	# In other words, if list element 1 is a data.frame
	# then using lst[[1]] will return a data.frame
	#
	# Single brackets returns a list of elements. So if you
	# have a list with a data.frame and then a vector and use
	# lst[1:2] a list with a data.frame and vector is returned.
	#
	# You can only select multiple list elements using single brackets
	# lst[1:3]



	# *********************************
	# ----- Solution 13 ----
	# *********************************

	f <- function(x){

	c(mean = mean(x, na.rm=T), sd = sd(x, na.rm=T))
	}



	# *********************************
	# ----- Solution 14 ----
	# *********************************


	sapply(lst, f)


	# *********************************
	# ----- Solution 15 ----
	# *********************************

	# Unlikely that you get the same results without set.seed
	# you should get the same results with set.seed.


	# *********************************
	# ----- Solution 16 ----
	# *********************************

	# There was no question here


	# *********************************
	# ----- Solution 17 ----
	# *********************************

	# expand.grid will create a dataframe with all
	# possible combinations of the input values



	# *********************************
	# ----- Solution 18 ----
	# *********************************


	# The double colon can be used to specify the
	# package a function comes from. I can see two
	# reasons you might use the colon:

	# 1) You might have two packages that you loaded that
	# each have the same function
	# (like MASS and dplyr have select). The package
	# loaded last takes precendence so you can use the ::
	# to use the masked version of the function. Sometimes
	# it's useful to use the :: just to be clear about
	# the source of a function even it's not required.

	# 2) The other situation where you might use the :: is when
	# you don't want to load an entire package and just
	# want to use a single function


	# *********************************
	# ----- Solution 19 ----
	# *********************************

	# You can use saveRDS() and readRDS() to save and
	# read. RDS is useful for saving R objects that
	# are not conveniently converted to standard
	# tables if you intend to use in R again. So
	# the list example I showed would not fit
	# easily into a single table and if you wanted
	# to save it for later you could save as RDS.


	# *********************************
	# ----- Solution 20 ----
	# *********************************


	dat <- data.frame(rand = rnorm(30),
	letter = sample(c("a", "b", "c"),
	size = 30, replace = TRUE))
	datsplit <- split(dat, dat$letter)
	class(datsplit) # list
	length(datsplit)# length 3