Skip to content

Instantly share code, notes, and snippets.

@AaronRendahl
Created September 20, 2016 21:08
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save AaronRendahl/eccad43578fb699a5c3481b798ae5957 to your computer and use it in GitHub Desktop.
Save AaronRendahl/eccad43578fb699a5c3481b798ae5957 to your computer and use it in GitHub Desktop.
Constructing function calls for more optimal use of optim, and more (TCRUG Sep 2016)

Constructing function calls for more optimal use of optim, and more (TCRUG-2016-09)

Aaron Rendahl
September 22, 2016

About myself...

  • Manager of the Statistical Consulting Center at the University of Minnesota
  • Help researchers design their studies and analyze their data
  • Also work with external companies, data analytics, etc.
  • PhD in Statistics (so not a programmer...)
  • But have used R since around 2005

how does one call a function programatically?

with call and do.call

say I have a function to add two numbers

myf <- function(x, y) x + y
myf(1, 1)
## [1] 2

do.call

do.call evaluates the function "what", with the arguments "args" (supplied as a list)

args(do.call)
## function (what, args, quote = FALSE, envir = parent.frame()) 
## NULL
do.call(myf, list(1, 1))
## [1] 2

call

call prepares it but does not evaluate it, and instead takes arguments separately through the ... argument

args(call)
## function (name, ...) 
## NULL
## <bytecode: 0x181a800>
(c1 <- call("myf", 1, 1))
## myf(1, 1)
eval(c1)
## [1] 2

I'll focus on do.call.

For better or for worse, I also will ignore the "envir" option which allows the call to be executed in a particular environment.

  • For better, because it would take longer than I have to address this in enough detail to be meaningful.
  • For worse because every so often really weird things can happen if you don't understand what this means.

for split-apply-combine

The most common use (perhaps) is as part of split-apply-combine, as so named and made famous by Hadley Wickham. This is what we used to do before he wrote the plyr package, and still is occassionally useful.

Say we have data that we want to split by group and apply a function to.

Here's a data frame with five groups, 10 observations of a and b for each group

set.seed(5)
d <- data.frame(group=rep(LETTERS[1:5], each=10), a=10 + round(rnorm(50, 1)), b=11 + round(rnorm(50,1)))
str(d)
## 'data.frame':	50 obs. of  3 variables:
##  $ group: Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ a    : num  10 12 10 11 13 10 11 10 11 11 ...
##  $ b    : num  13 12 13 11 12 11 13 12 12 12 ...
## split it by group
ds <- split(d, d$group)
## apply a function to each part that gets mean a, mean b, confidence interval of difference for paired t test
out <- lapply(ds, function(di) with(di, 
        c(mean.a=mean(a), 
          mean.b=mean(b), 
          t.confint=t.test(a, b, paired=TRUE)$conf.int)))
out
## $A
##     mean.a     mean.b t.confint1 t.confint2 
## 10.9000000 12.1000000 -2.1418111 -0.2581889 
## 
## $B
##     mean.a     mean.b t.confint1 t.confint2 
## 10.5000000 11.5000000 -1.8922068 -0.1077932 
## 
## $C
##       mean.a       mean.b   t.confint1   t.confint2 
## 11.500000000 12.000000000 -1.005833719  0.005833719 
## 
## $D
##     mean.a     mean.b t.confint1 t.confint2 
## 11.1000000 12.0000000 -2.2677272  0.4677272 
## 
## $E
##     mean.a     mean.b t.confint1 t.confint2 
##  11.200000  12.200000  -1.674445  -0.325555
do.call(rbind, out)
##   mean.a mean.b t.confint1   t.confint2
## A   10.9   12.1  -2.141811 -0.258188870
## B   10.5   11.5  -1.892207 -0.107793183
## C   11.5   12.0  -1.005834  0.005833719
## D   11.1   12.0  -2.267727  0.467727151
## E   11.2   12.2  -1.674445 -0.325555041

more on how it works, with lm

(because it shows you the call itself in the output)

let's make and plot some data (avoiding dollar signs...)

set.seed(1)
n <- 50
d <- data.frame(x=round(runif(n, 0, 10), 1))
d <- within(d, {
    y <- round(1 + 0.9*x + rnorm(n, 0, 4), 1)
})
with(d, plot(x, y))

normally would fit model like this

lm(y ~ x, data=d)
## 
## Call:
## lm(formula = y ~ x, data = d)
## 
## Coefficients:
## (Intercept)            x  
##      1.2246       0.9304

so with do.call, let's try this...

do.call(lm, list(y~x, data=d))
## 
## Call:
## (function (formula, data, subset, weights, na.action, method = "qr", 
##     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
##     contrasts = NULL, offset, ...) 
## {
##     ret.x <- x
##     ret.y <- y
##     cl <- match.call()
##     mf <- match.call(expand.dots = FALSE)
##     m <- match(c("formula", "data", "subset", "weights", "na.action", 
##         "offset"), names(mf), 0L)
##     mf <- mf[c(1L, m)]
##     mf$drop.unused.levels <- TRUE
##     mf[[1L]] <- quote(stats::model.frame)
##     mf <- eval(mf, parent.frame())
##     if (method == "model.frame") 
##         return(mf)
##     else if (method != "qr") 
##         warning(gettextf("method = '%s' is not supported. Using 'qr'", 
##             method), domain = NA)
##     mt <- attr(mf, "terms")
##     y <- model.response(mf, "numeric")
##     w <- as.vector(model.weights(mf))
##     if (!is.null(w) && !is.numeric(w)) 
##         stop("'weights' must be a numeric vector")
##     offset <- as.vector(model.offset(mf))
##     if (!is.null(offset)) {
##         if (length(offset) != NROW(y)) 
##             stop(gettextf("number of offsets is %d, should equal %d (number of observations)", 
##                 length(offset), NROW(y)), domain = NA)
##     }
##     if (is.empty.model(mt)) {
##         x <- NULL
##         z <- list(coefficients = if (is.matrix(y)) matrix(, 0, 
##             3) else numeric(), residuals = y, fitted.values = 0 * 
##             y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w != 
##             0) else if (is.matrix(y)) nrow(y) else length(y))
##         if (!is.null(offset)) {
##             z$fitted.values <- offset
##             z$residuals <- y - offset
##         }
##     }
##     else {
##         x <- model.matrix(mt, mf, contrasts)
##         z <- if (is.null(w)) 
##             lm.fit(x, y, offset = offset, singular.ok = singular.ok, 
##                 ...)
##         else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok, 
##             ...)
##     }
##     class(z) <- c(if (is.matrix(y)) "mlm", "lm")
##     z$na.action <- attr(mf, "na.action")
##     z$offset <- offset
##     z$contrasts <- attr(x, "contrasts")
##     z$xlevels <- .getXlevels(mt, mf)
##     z$call <- cl
##     z$terms <- mt
##     if (model) 
##         z$model <- mf
##     if (ret.x) 
##         z$x <- x
##     if (ret.y) 
##         z$y <- y
##     if (!qr) 
##         z$qr <- NULL
##     z
## })(formula = y ~ x, data = structure(list(x = c(2.7, 3.7, 5.7, 
## 9.1, 2, 9, 9.4, 6.6, 6.3, 0.6, 2.1, 1.8, 6.9, 3.8, 7.7, 5, 7.2, 
## 9.9, 3.8, 7.8, 9.3, 2.1, 6.5, 1.3, 2.7, 3.9, 0.1, 3.8, 8.7, 3.4, 
## 4.8, 6, 4.9, 1.9, 8.3, 6.7, 7.9, 1.1, 7.2, 4.1, 8.2, 6.5, 7.8, 
## 5.5, 5.3, 7.9, 0.2, 4.8, 7.3, 6.9), y = c(3.2, 3.7, 0.2, 7.3, 
## 4.5, 14.5, 9, 8.5, 6.5, -4, 1.2, 1, 7, 8.8, 11, 4.8, 6.5, 12.7, 
## 6.6, 5.3, 6.5, 4.3, 9.9, 1.7, 7, 6.1, -1.4, 5.8, 4.3, 9.8, 13.2, 
## 4.9, 1.2, 5, 7.9, 16.6, 8, 4.7, 7.6, 1.7, 9.1, -0.4, 13.9, 6.6, 
## 14.5, 10, -1.7, 7.8, 3.8, 2.2)), row.names = c(NA, -50L), .Names = c("x", 
## "y"), class = "data.frame"))
## 
## Coefficients:
## (Intercept)            x  
##      1.2246       0.9304

wow, what was that!

two things: 1) function call is passed in as value of argument, and 2) parameters are evaluated before sending to do.call

instead send the function as a character and wrap parameters we don't want evaluated first in "quote", which simply returns its argument when evaluated

do.call("lm", list(y~x, data=quote(d)))
## 
## Call:
## lm(formula = y ~ x, data = d)
## 
## Coefficients:
## (Intercept)            x  
##      1.2246       0.9304

Both, however, give the correct result and can be used later with summary, coef, etc, as needed. The fact that parameters are evaluated first, though, is potentially really handy!

Programatically fitting models

This is my second main use for do.call, to programatically fit different formulas.

Can store formula in a character variable, it could be a vector of options that we loop through.

f <- "y ~ x"
## will work without do.call, but call refers to "f"
(m1 <- lm(f, data=d))
## 
## Call:
## lm(formula = f, data = d)
## 
## Coefficients:
## (Intercept)            x  
##      1.2246       0.9304
## do.call instead evaluates the f first, so call is as expected
(m2 <- do.call("lm", list(as.formula(f), data=quote(d))))
## 
## Call:
## lm(formula = y ~ x, data = d)
## 
## Coefficients:
## (Intercept)            x  
##      1.2246       0.9304

doesn't matter in most contexts, but makes me a little nervous sometimes say we change f, then update the model (not that I use update a lot...)

f <- "y ~ 1"
update(m1)
## 
## Call:
## lm(formula = f, data = d)
## 
## Coefficients:
## (Intercept)  
##       6.178
update(m2)
## 
## Call:
## lm(formula = y ~ x, data = d)
## 
## Coefficients:
## (Intercept)            x  
##      1.2246       0.9304

There are situations when you might want either, I guess, but it feels cleaner to me to have the actual formula I used stored rather than a generic "f". Though if you think about it, I prefer the opposite for the data argument, so...

fitting all possible (additive linear) models

set.seed(16)
v <- paste0("x", 1:5)
for(vi in v) {
  d[[vi]] <- round(rnorm(nrow(d), mean=10, sd=2), 1)
}
out <- do.call(expand.grid, setNames(rep(list(c(TRUE, FALSE)), length=5), v))
head(out)
##      x1    x2    x3   x4   x5
## 1  TRUE  TRUE  TRUE TRUE TRUE
## 2 FALSE  TRUE  TRUE TRUE TRUE
## 3  TRUE FALSE  TRUE TRUE TRUE
## 4 FALSE FALSE  TRUE TRUE TRUE
## 5  TRUE  TRUE FALSE TRUE TRUE
## 6 FALSE  TRUE FALSE TRUE TRUE
f <- apply(out, 1, function(x) paste(c(1, names(out)[x]), collapse=" + "))
f <- paste("y ~", f)
head(f)
## [1] "y ~ 1 + x1 + x2 + x3 + x4 + x5" "y ~ 1 + x2 + x3 + x4 + x5"     
## [3] "y ~ 1 + x1 + x3 + x4 + x5"      "y ~ 1 + x3 + x4 + x5"          
## [5] "y ~ 1 + x1 + x2 + x4 + x5"      "y ~ 1 + x2 + x4 + x5"
ms <- lapply(f, function(fi) {
  do.call("lm", list(as.formula(fi), data=quote(d)))
})
out$AIC <- sapply(ms, AIC)
summary(ms[[which.min(out$AIC)]])
## 
## Call:
## lm(formula = y ~ 1 + x1 + x3, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2497 -2.1521 -0.1565  2.0469  9.1151 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.2044     4.1090   4.430  5.6e-05 ***
## x1           -0.6016     0.3396  -1.771   0.0830 .  
## x3           -0.5869     0.2941  -1.995   0.0518 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.183 on 47 degrees of freedom
## Multiple R-squared:  0.1604,	Adjusted R-squared:  0.1246 
## F-statistic: 4.488 on 2 and 47 DF,  p-value: 0.01646

Careful with interpreting results from using the "best" model... Here it comes out statistically significant at the 0.05 level (though I cherry picked the random seed). How often does this happen by random chance? I used replicate to repeat the above 1000 times; here are the proportion of best models and full models with p<0.05.

##  best  full 
## 0.224 0.048

using do.call for optimal use of optim

See also: stats4::mle and bbmle::mle2, which makes some (possibly all) of this unneeded. Still, handy to see how parameters can be passed around from function to function.

making data in global environment in real life, this would be in a data frame could explore how to write functions that use variables from within data frame but not today...

rm(list=ls())
set.seed(1)
n <- 50
x <- round(runif(n, 0, 10), 1)
y <- round(1 + 0.9*x + rnorm(n, 0, 4), 1)
plot(x, y)

say we want to find the best linear fit, and standard deviation, assuming data is normally distributed around linear fit with constant variance

$y_i \sim N(a + bx_i, s^2)$

want to get parameters a, b, and s that maximize corresponding likelihood

m1 <- lm(y ~ x)
summary(m1)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6721 -2.2880 -0.5557  2.0841  9.1418 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.2246     1.1680   1.048      0.3    
## x             0.9304     0.1958   4.752 1.87e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.725 on 48 degrees of freedom
## Multiple R-squared:  0.3199,	Adjusted R-squared:  0.3057 
## F-statistic: 22.58 on 1 and 48 DF,  p-value: 1.868e-05

I'll report log(s) instead of s itself, so that range can be over whole real line

setNames(c(coef(m1), log(sqrt(sum(residuals(m1)^2)/n))), c("int", "slope", "logsd"))
##       int     slope     logsd 
## 1.2246150 0.9303879 1.2946861

say we want to do this "by hand" as an example of what to do if had more complex likelihood function we want to minimize the negative log likelihood

negloglik <- function(int, slope, logsd, x, y) {
    -sum(dnorm(y, mean=int + slope * x, sd=exp(logsd), log=TRUE))
}

using optim is an option to do this but optim wants the parameters as one parameter at the beginning

args(optim)
## function (par, fn, gr = NULL, ..., method = c("Nelder-Mead", 
##     "BFGS", "CG", "L-BFGS-B", "SANN", "Brent"), lower = -Inf, 
##     upper = Inf, control = list(), hessian = FALSE) 
## NULL

method one

positional, lengthy

nll1 <- function(par, x, y) {
    negloglik(int=par[1], slope=par[2], logsd=par[3], x=x, y=y)
}
optim(par=c(int=1, slope=1, logsd=log(4)), nll1, x=x, y=y)$par
##       int     slope     logsd 
## 1.2232355 0.9306589 1.2946277

method two

uses do.call to put them in call automatically

nll2 <- function(par, x, y) {
    do.call(negloglik, c(as.list(par), list(x=x, y=y)))
}
optim(par=c(int=1, slope=1, logsd=log(4)), nll2, x=x, y=y)$par
##       int     slope     logsd 
## 1.2232355 0.9306589 1.2946277

Aside: combining lists

combining lists seems like it should be straightforward here are two lists, let's combine them

a1 <- list(x=1, y=2)
a2 <- list(a=c(10,11), b=c(11,12))
c(a1, a2)
## $x
## [1] 1
## 
## $y
## [1] 2
## 
## $a
## [1] 10 11
## 
## $b
## [1] 11 12

but this only works if both of the things are lists, if not, it doesn't do what we want

d <- 1:3
c(a2, d=d)
## $a
## [1] 10 11
## 
## $b
## [1] 11 12
## 
## $d1
## [1] 1
## 
## $d2
## [1] 2
## 
## $d3
## [1] 3
list(a2, d=d)
## [[1]]
## [[1]]$a
## [1] 10 11
## 
## [[1]]$b
## [1] 11 12
## 
## 
## $d
## [1] 1 2 3
c(a2, list(d=d))
## $a
## [1] 10 11
## 
## $b
## [1] 11 12
## 
## $d
## [1] 1 2 3

the other function I used is as.list, which takes a vector and makes it into a list instead. Contrast with just list.

a <- c(x=1, y=2)
as.list(a)
## $x
## [1] 1
## 
## $y
## [1] 2
list(a)
## [[1]]
## x y 
## 1 2

method three

uses do.call to optimize over some, not others this is really why I learned about this, to be able to profile the likelihood more easily

nll3 <- function(par, x, y, ...) {
    do.call(negloglik, c(as.list(par), list(x=x, y=y), ...))
}
optim(par=c(int=1, slope=1), nll3, x=x, y=y, logsd=log(4))$par
##       int     slope 
## 1.2243274 0.9305031

method four

more generic, as all parameters could get wrapped into the ...

nll4 <- function(par, ...) {
    do.call(negloglik, c(as.list(par), list(...)))
}
optim(par=c(int=1, slope=1), nll4, x=x, y=y, logsd=log(4))$par
##       int     slope 
## 1.2243274 0.9305031

more...

OK, now we're going to "show off..." not that you would choose these for this example but might be handy in other kinds of circumstances

method five

a helper function for optim

optimhelp <- function(par, FUN, ...) {
    do.call(FUN, c(as.list(par), list(...)))
}
optim(par=c(int=1, slope=1), optimhelp, FUN=negloglik, x=x, y=y, logsd=log(4))$par
##       int     slope 
## 1.2243274 0.9305031

method six: factory

a factory to make really generic ignoring issues with environments and closures right now...

optfactory <- function(FUN) {
    function(par, ...) { do.call(FUN, c(as.list(par), list(...))) }
}

nllpar <- optfactory(negloglik)
optim(par=c(int=1, slope=1), nllpar, x=x, y=y, logsd=log(4))$par
##       int     slope 
## 1.2243274 0.9305031

method seven, a new version of optim

this automatically allows extra parameters to be sent to optim as well!

optim2 <- function(par, fn, ...) {
    ff <- function(par, ...) { do.call(fn, c(as.list(par), list(...))) }
    optim(par=par, fn=ff, ...)
}
optim2(par=c(int=1, slope=1), fn=negloglik, x=x, y=y, logsd=log(4))$par
##       int     slope 
## 1.2243274 0.9305031

I suppose you could oneliner this too...

optim(par=c(int=1, slope=1), fn=function(par, ...) { do.call(negloglik, c(as.list(par), list(...))) }, x=x, y=y, logsd=log(4))$par
##       int     slope 
## 1.2243274 0.9305031

or, can just include them as parameters, probably the clearest way to write the code for users

optim2b <- function(par, fn, gr=NULL, ...,
                    method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN", "Brent"),
                    lower = -Inf, upper = Inf, control = list(), hessian = FALSE) {
    ff <- function(par, ...) { do.call(fn, c(as.list(par), list(...))) }
    optim(par=par, fn=ff, gr=gr, ..., method=method, lower=lower, upper=upper, control=control, hessian=hessian)
}
optim2b(par=c(int=1, slope=1), fn=negloglik, x=x, y=y, logsd=log(4), hessian=TRUE)[c("par", "hessian")]
## $par
##       int     slope 
## 1.2243274 0.9305031 
## 
## $hessian
##           int    slope
## int    3.1250  16.6375
## slope 16.6375 111.1987

or, for kicks, could show how to separate parameters for the two functions

optim2c <- function(par, fn, extra, ...) {
    ff <- function(par, extra) { do.call(fn, c(as.list(par), extra)) }
    optim(par=par, fn=ff, extra=extra, ...)
}
optim2c(par=c(int=1, slope=1), fn=negloglik, extra=list(x=x, y=y, logsd=log(4)), hessian=TRUE)[c("par", "hessian")]
## $par
##       int     slope 
## 1.2243274 0.9305031 
## 
## $hessian
##           int    slope
## int    3.1250  16.6375
## slope 16.6375 111.1987

Conclusion

Knowing how to put together functions programatically with do.call can be useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment