knitr::opts_chunk$set(echo = TRUE, fig.width = 12, fig.height = 9, warning = FALSE)
options(width = 116)
Exploration of caret with linear models
Packages, global vars.
suppressPackageStartupMessages(library(plyr))
suppressPackageStartupMessages(library(tidyverse))
## Conflicts with tidy packages ---------------------------------------------------------------------------------------
suppressPackageStartupMessages(library(scales))
suppressPackageStartupMessages(library(broom))
#suppressPackageStartupMessages(library(formula.tools))
suppressPackageStartupMessages(library(caret))
def.par <- par(no.readonly = TRUE) # save default, for resetting...
nobs <- 50
train_prob <- 0.7
b1 <- 5
x1_min <- 0
x1_max <- 1
noise_sd <- 1
set.seed(123)
dat_1p <- data_frame(x1=runif(nobs, x1_min, x1_max)
, y= b1 * x1 + rnorm(nobs, 0, noise_sd))
inTrain <- caret::createDataPartition(dat_1p$y, p=train_prob, list=FALSE)
dat_1p$is_train <- seq_along(dat_1p$y) %in% inTrain
dat_1p %>% print()
## # A tibble: 50 x 3
## x1 y is_train
## <dbl> <dbl> <lgl>
## 1 0.2875775 -0.2488057 TRUE
## 2 0.7883051 4.7793127 FALSE
## 3 0.4089769 2.1982577 FALSE
## 4 0.8830174 3.2769501 FALSE
## 5 0.9404673 5.9561513 TRUE
## 6 0.0455565 0.6542467 TRUE
## 7 0.5281055 2.3454560 TRUE
## 8 0.8924190 5.3572209 TRUE
## 9 0.5514350 3.6353086 TRUE
## 10 0.4566147 3.1046548 TRUE
## # ... with 40 more rows
qplot(x1, y, data=dat_1p, color=is_train, main="y vs. x1")
Note that caret doesn't build an lm model on dat_train any differently than lm does. What it does do is estimate performance on a holdout, like dat_test.
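A quick self-contained check of that claim, on made-up toy data (not the dat_1p set above): with method = "lm", caret delegates the fit to stats::lm(), so the coefficients match a direct lm() call.

```r
## Sketch with hypothetical data: caret's method = "lm" just wraps lm(),
## so the two sets of coefficients should be identical.
suppressPackageStartupMessages(library(caret))
set.seed(1)
d <- data.frame(x1 = runif(30))
d$y <- 2 * d$x1 + rnorm(30)
fit_plain <- lm(y ~ x1, data = d)
fit_caret <- train(y ~ x1, data = d, method = "lm",
                   trControl = trainControl(method = "none"))  # fit once, no resampling
all.equal(unname(coef(fit_plain)), unname(coef(fit_caret$finalModel)))
## should be TRUE
```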
Some relevant questions:
- http://stats.stackexchange.com/questions/195478/in-r-package-caret-how-is-linear-regression-model-trained-by-using-resampling
- http://stats.stackexchange.com/questions/236629/caret-extracting-lm-parameter-estimates-from-different-cross-validation-runs
The second link is my question on parameter estimates.
Note that the straight lm and caret lm regressions are identical in output.
dat_train <- dat_1p %>% filter(is_train)
dat_test <- dat_1p %>% filter(!is_train)
## Normal lm on train
lmFit <- lm(y ~ x1, dat_train)
summary(lmFit)
##
## Call:
## lm(formula = y ~ x1, data = dat_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2994 -0.5175 -0.1330 0.4954 2.1485
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.08407 0.30517 -0.275 0.785
## x1 5.31872 0.52924 10.050 5.44e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9508 on 36 degrees of freedom
## Multiple R-squared: 0.7372, Adjusted R-squared: 0.7299
## F-statistic: 101 on 1 and 36 DF, p-value: 5.44e-12
## leave-one-out CV via caret
fitControl <- trainControl(method = "LOOCV")
## Perform fit with cross-validation
set.seed(825)
cvFit <- train(y ~ x1
, data = dat_train,
method = "lm",
trControl = fitControl)
## fit that is returned from caret
summary(cvFit)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2994 -0.5175 -0.1330 0.4954 2.1485
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.08407 0.30517 -0.275 0.785
## x1 5.31872 0.52924 10.050 5.44e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9508 on 36 degrees of freedom
## Multiple R-squared: 0.7372, Adjusted R-squared: 0.7299
## F-statistic: 101 on 1 and 36 DF, p-value: 5.44e-12
## fit in $finalModel
summary(cvFit$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2994 -0.5175 -0.1330 0.4954 2.1485
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.08407 0.30517 -0.275 0.785
## x1 5.31872 0.52924 10.050 5.44e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9508 on 36 degrees of freedom
## Multiple R-squared: 0.7372, Adjusted R-squared: 0.7299
## F-statistic: 101 on 1 and 36 DF, p-value: 5.44e-12
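What caret adds over the plain lm fit is the resampled performance estimate. A sketch (using the objects defined above) of pulling that out, plus a holdout check on dat_test:

```r
## LOOCV performance summary collected by caret (RMSE, Rsquared, MAE):
cvFit$results
## Holdout performance on dat_test; both fits should score identically,
## since the underlying models are the same:
postResample(predict(lmFit, newdata = dat_test), dat_test$y)
postResample(predict(cvFit, newdata = dat_test), dat_test$y)
```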
Let's loop over the different LOOCV sets, fit lm on each set, extract the b1 estimate for the x1 coefficient, and look at the mean and s.d. of the estimates. I want to see how they compare to the overall estimate and std. error from the whole of the training set.
cv_indexes <- cvFit$control$index
dat_cv_est <- ldply(cv_indexes, function(e) {
datl <- dat_train[e,]
llmfit <- lm(y ~ x1, datl)
tidy(llmfit)
})
b1_ests <- dat_cv_est %>%
filter(term == "x1") %>%
.$estimate
b1_ests
## [1] 5.203913 5.162665 5.395016 5.322409 5.226866 5.305220 5.328623 5.245441 5.325665 5.329937 5.328399 5.275226
## [13] 5.439832 5.302202 5.133323 5.435753 5.164865 5.492166 5.352659 5.345585 5.329286 5.311122 5.320465 5.323287
## [25] 5.410194 5.296835 5.290819 5.323169 5.366054 5.374727 5.226830 5.358944 5.329310 5.355308 5.556754 5.265015
## [37] 5.120783 5.423633
## Mean of CV parameter estimates
b1_ests %>% mean()
## [1] 5.318376
## Std. Err of parameter estimates
b1_ests %>% {sd(.)/sqrt(length(.))}
## [1] 0.014961
I note that the mean of the LOOCV x1 coefficient estimates is close to what was found with lm on the whole dat_train data frame, but the std. error is not.
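That discrepancy is expected: sd/sqrt(n) treats the 38 LOOCV estimates as independent draws, but each leave-one-out fit shares 37 of its 38 observations with every other fit, so the estimates are almost perfectly correlated. A jackknife-style SE, which is designed for exactly this overlap, lands much closer to lm's model-based SE. A sketch using the objects above:

```r
## Spread of the LOOCV estimates vs. lm's model-based SE for x1:
sd(b1_ests)                                      # small: the fits are correlated
summary(lmFit)$coefficients["x1", "Std. Error"]  # ~0.53 from the full training fit
## Jackknife SE: sqrt((n - 1)/n * sum((theta_i - theta_bar)^2))
n <- length(b1_ests)
sqrt((n - 1) / n * sum((b1_ests - mean(b1_ests))^2))
```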