knitr::opts_chunk$set(echo = TRUE, fig.width = 12, fig.height = 9, warning = FALSE)
options(width = 116)
Exploration of caret with linear models
Packages, global vars.
suppressPackageStartupMessages(library(plyr))
suppressPackageStartupMessages(library(tidyverse))
## Conflicts with tidy packages ---------------------------------------------------------------------------------------
suppressPackageStartupMessages(library(scales))
suppressPackageStartupMessages(library(broom))
#suppressPackageStartupMessages(library(formula.tools))
suppressPackageStartupMessages(library(caret))
def.par <- par(no.readonly = TRUE) # save default, for resetting...
nobs <- 50
train_prob <- 0.7
b1 <- 5
x1_min <- 0
x1_max <- 1
noise_sd <- 1
set.seed(123)
dat_1p <- data_frame(x1=runif(nobs, x1_min, x1_max)
, y= b1 * x1 + rnorm(nobs, 0, noise_sd))
inTrain <- caret::createDataPartition(dat_1p$y, p=train_prob, list=FALSE)
dat_1p$is_train <- seq_along(dat_1p$y) %in% inTrain
dat_1p %>% print()
## # A tibble: 50 x 3
## x1 y is_train
## <dbl> <dbl> <lgl>
## 1 0.2875775 -0.2488057 TRUE
## 2 0.7883051 4.7793127 FALSE
## 3 0.4089769 2.1982577 FALSE
## 4 0.8830174 3.2769501 FALSE
## 5 0.9404673 5.9561513 TRUE
## 6 0.0455565 0.6542467 TRUE
## 7 0.5281055 2.3454560 TRUE
## 8 0.8924190 5.3572209 TRUE
## 9 0.5514350 3.6353086 TRUE
## 10 0.4566147 3.1046548 TRUE
## # ... with 40 more rows
qplot(x1, y, data=dat_1p, color=is_train, main="y vs. x1")
Note that caret doesn't build an lm model on dat_train any differently than lm does. What it does do is estimate performance on a holdout, like dat_test.
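A quick self-contained check of that claim, on made-up toy data (not the dat_1p set above): with method = "lm", caret delegates the fit to stats::lm(), so the coefficients match a direct lm() call.

```r
## Sketch with hypothetical data: caret's method = "lm" just wraps lm(),
## so the two sets of coefficients should be identical.
suppressPackageStartupMessages(library(caret))
set.seed(1)
d <- data.frame(x1 = runif(30))
d$y <- 2 * d$x1 + rnorm(30)
fit_plain <- lm(y ~ x1, data = d)
fit_caret <- train(y ~ x1, data = d, method = "lm",
                   trControl = trainControl(method = "none"))  # fit once, no resampling
all.equal(unname(coef(fit_plain)), unname(coef(fit_caret$finalModel)))
## should be TRUE
```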
Some relevant questions:
- http://stats.stackexchange.com/questions/195478/in-r-package-caret-how-is-linear-regression-model-trained-by-using-resampling
- http://stats.stackexchange.com/questions/236629/caret-extracting-lm-parameter-estimates-from-different-cross-validation-runs
The second link is my question on parameter estimates.
Note that the straight lm and caret lm regressions are identical in output.
dat_train <- dat_1p %>% filter(is_train)
dat_test <- dat_1p %>% filter(!is_train)
## Normal lm on train
lmFit <- lm(y ~ x1, dat_train)
summary(lmFit)
##
## Call:
## lm(formula = y ~ x1, data = dat_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2994 -0.5175 -0.1330 0.4954 2.1485
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.08407 0.30517 -0.275 0.785
## x1 5.31872 0.52924 10.050 5.44e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9508 on 36 degrees of freedom
## Multiple R-squared: 0.7372, Adjusted R-squared: 0.7299
## F-statistic: 101 on 1 and 36 DF, p-value: 5.44e-12
## leave-one-out CV via caret
fitControl <- trainControl(method = "LOOCV")
## Perform fit with cross-validation
set.seed(825)
cvFit <- train(y ~ x1
, data = dat_train,
method = "lm",
trControl = fitControl)
## fit that is returned from caret
summary(cvFit)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2994 -0.5175 -0.1330 0.4954 2.1485
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.08407 0.30517 -0.275 0.785
## x1 5.31872 0.52924 10.050 5.44e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9508 on 36 degrees of freedom
## Multiple R-squared: 0.7372, Adjusted R-squared: 0.7299
## F-statistic: 101 on 1 and 36 DF, p-value: 5.44e-12
## fit in $finalModel
summary(cvFit$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2994 -0.5175 -0.1330 0.4954 2.1485
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.08407 0.30517 -0.275 0.785
## x1 5.31872 0.52924 10.050 5.44e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9508 on 36 degrees of freedom
## Multiple R-squared: 0.7372, Adjusted R-squared: 0.7299
## F-statistic: 101 on 1 and 36 DF, p-value: 5.44e-12
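What caret adds over the plain lm fit is the resampled performance estimate. A sketch (using the objects defined above) of pulling that out, plus a holdout check on dat_test:

```r
## LOOCV performance summary collected by caret (RMSE, Rsquared, MAE):
cvFit$results
## Holdout performance on dat_test; both fits should score identically,
## since the underlying models are the same:
postResample(predict(lmFit, newdata = dat_test), dat_test$y)
postResample(predict(cvFit, newdata = dat_test), dat_test$y)
```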
Let's loop over the different LOOCV sets, fit lm on each set, extract the b1 estimate for the x1 coefficient, and look at the mean and s.d. of the estimates. I want to see how they compare to the overall estimate and std. error from the whole of the training set.
cv_indexes <- cvFit$control$index
dat_cv_est <- ldply(cv_indexes, function(e) {
datl <- dat_train[e,]
llmfit <- lm(y ~ x1, datl)
tidy(llmfit)
})
b1_ests <- dat_cv_est %>%
filter(term == "x1") %>%
.$estimate
b1_ests
## [1] 5.203913 5.162665 5.395016 5.322409 5.226866 5.305220 5.328623 5.245441 5.325665 5.329937 5.328399 5.275226
## [13] 5.439832 5.302202 5.133323 5.435753 5.164865 5.492166 5.352659 5.345585 5.329286 5.311122 5.320465 5.323287
## [25] 5.410194 5.296835 5.290819 5.323169 5.366054 5.374727 5.226830 5.358944 5.329310 5.355308 5.556754 5.265015
## [37] 5.120783 5.423633
## Mean of CV parameter estimates
b1_ests %>% mean()
## [1] 5.318376
## Std. Err of parameter estimates
b1_ests %>% {sd(.)/sqrt(length(.))}
## [1] 0.014961
I note that the mean of the LOOCV x1 coefficient estimates is close to what was found with lm on the whole dat_train data frame, but the std. error is not.
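That discrepancy is expected: sd/sqrt(n) treats the 38 LOOCV estimates as independent draws, but each leave-one-out fit shares 37 of its 38 observations with every other fit, so the estimates are almost perfectly correlated. A jackknife-style SE, which is designed for exactly this overlap, lands much closer to lm's model-based SE. A sketch using the objects above:

```r
## Spread of the LOOCV estimates vs. lm's model-based SE for x1:
sd(b1_ests)                                      # small: the fits are correlated
summary(lmFit)$coefficients["x1", "Std. Error"]  # ~0.53 from the full training fit
## Jackknife SE: sqrt((n - 1)/n * sum((theta_i - theta_bar)^2))
n <- length(b1_ests)
sqrt((n - 1) / n * sum((b1_ests - mean(b1_ests))^2))
```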