cjgeyer/regression.md

## regression.md

      
    Raw
  

              regression.md
            
          
    Introduction

Suppose we want to write an R package to do regression (and don't know how).  Here are some things we need to know.
My qualifications for writing this are that I have two regression-like packages on CRAN and I have also taught statistical computing.  But that doesn't make me a wizard.  I don't know all the incantations, only some of them.
Here are the things this page talks about.


How to turn an R formula into a model matrix (the right way).


How to write methods of the generic functions in the R "core" packages (mostly in the stats package).  This involves stuff in the NAMESPACE file of your package and also stuff in Rd files (documentation) of these methods.


How to write a method that prints (or seems to) like summary or anova.


More about summary: printing tables with and without "signif.stars".


More about anova: checking for nested models.


Formulas

The right way to handle basic formulas (the kind that the R functions lm and glm understand) is to have the R function lm do it for you.  This is a bit tricky.  So I wrote a toy R package bar that illustrates it.
What you do is hand the formula and data frame (if any) to lm with the optional argument method = "model.frame", which tells lm we don't want to do linear regression, we just want it to turn the formula and the data (if any) into a model frame (which, if you don't know what that is, it is an R thingummy that is useful in this context, meaning I don't even know what it really is, but do know enough to use it for this purpose).
Having the model frame, the R function model.matrix returns the model matrix and the R function model.response returns the response vector, as shown in the example bar/package/bar/R/bar.R.
The point of doing things this way is that is variables in the formula can come from either the R global environment (where R puts stuff when you do top-level assignments (not inside functions)) or from an R data frame supplied in the argument to your regression function that you should call data to follow the examples of the R functions lm and glm.  You need to find variables in either place.  In fact, if your function is called inside another function, variables could be found anywhere in the call stack or search path (Section 3.5 of The R Language Definition is the definitive reference on where R finds stuff when it looks up names of objects, but the whole point of using lm with method = "model.frame" is that you don't need to know, it knows).
Now you are on your own.  Presumably you know what to do with a model matrix and response vector.
My way of converting formulas to x and y is a way of R, but not the only way. Instead you can follow http://developer.r-project.org/model-fitting-functions.html if you like.
Generics

Basics

By "generics" here we are talking about so-called S3 generic functions, which include summary, predict, and anova.
You can tell if a function is generic by looking at the definition, for example,
> summary
function (object, ...) 
UseMethod("summary")

or by looking at the documentation (the help for summary starts off with "summary is a generic function").
When such a function is called, R looks at the class of the first argument, pastes the name of the class on the end of the name of the function with a dot as a separator, and redoes the call with the new name, for example, if fred is an R object
of class "lm" that (presumably) is the result of a call to the R function lm, then
summary(fred)

is equivalent to
summary.lm(fred)

if there is a function summary.lm.  If there is no such function, then it is equivalent to
summary.default(fred)

if there is such a function summary.default.  Otherwise it is an error.  This is the mechanism that allows R generic functions to do different things for different classes of objects, something appropriate for each.
If there are more arguments, then only the class of the first argument counts.
predict(fred, newdata = data.frame(x = x))

is equivalent to
predict.lm(fred, newdata = data.frame(x = x))

The class of an object can be a character vector, for example, if fred has class c("bar", "baz"), and foo is generic,
then foo(fred) results in


a call foo.bar(fred) if there is a function foo.bar,


otherwise a call foo.baz(fred) if there is a function foo.baz,


otherwise a call foo.default(fred) if there is a function foo.default,


otherwise an error.


When the generic function already exists (as it does for all the ones we are talking about here), all you have to do to add a method is write the method, unless your method is in an R package you are writing, which is what this section is about.
Generics for lm

Checking in the R package stats for version 4.0.4 there are the following functions.


generic
method


add1
lm


alias
lm


anova
lm


anova
lmlist


case.names
lm


confint
lm


cooks.distance
lm


deviance
lm


dfbeta
lm


dfbetas
lm


drop1
lm


dummy.coef
lm


effects
lm


extractAIC
lm


family
lm


formula
lm


hatvalues
lm


influence
lm


labels
lm


logLik
lm


model.frame
lm


model.matrix
lm


nobs
lm


plot
lm


predict
lm


print
lm


print
summary.lm


proj
lm


qr
lm


residuals
lm


rstandard
lm


rstudent
lm


simulate
lm


summary
lm


variable.names
lm


vcov
lm


vcov
summary.lm


And there are also the following default methods that do something useful with objects of class "lm"


generic
method


AIC
default


BIC
default


coef
default


df.residual
default


fitted
default


getCall
default


na.action
default


proj
default


sigma
default


terms
default


update
default


weights
default


You might not want to implement methods for all of the generic functions in these tables. In fact, you almost certainly don't. But you should consider them all and whether you want them.
You definitely do want to implement any for which you want functionality similar to their methods for classes "lm" and "glm".  Users understand, for example, how to use summary(foo) when foo is an R object that is the result of a call to the R function lm or glm, and you want users to be able to transfer that knowledge to your package if the functionality is similar and your method for your package will not surprise and befuddle them.
A Rant About Misapplied OOP

The odd man out here is anova.lmlist.  The idea is that anova.lmand anova.glm and anova.mlm do rather different things when given one model fit or several.  Hence it makes sense to have two different functions to do these two different things.  But this is an implementation detail that users should not be bothered with.  When called with more than one model fit, all three of anova.lm, anova.glm, and anova.mlm immediately call the function with list tacked onto the end of its name.  So far there is nothing crazy about this.  But what is IMHO crazy is that anova.lmlist and anova.glmlist are generic functions, whereas anova.mlmlist is not.  Moreover, anova.lmlist and anova.glmlist have different kinds of arguments and so work differently.
anova(out1, out2, out3)

is equivalent to
anova(structure(out1, class = c("lmlist", "lm")), out2, out3)

when out1, out2, and out3 are objects of class "lm".  But
anova(out1, out2, out3)

is equivalent to
anova(structure(list(out1, out2, out3), class = "glmlist"))

when out1, out2, and out3 are objects of class "glm".


IMHO no user is going to call anova.lmlist or anova.glmlist this way.


Hence anova.lmlist and anova.glmlist should not be generic methods.


To have them show up when users do methods("anova") just confuses users.


This shows the main problem with OOP (object-oriented programming): programmers use it just because they can, often when they shouldn't.  It can, and often does, do more harm than good.


anova.lmlist and anova.glmlist follow an OOD (object-oriented design) anti-pattern.


You can have a function like anova.lmlist or anova.glmlist if you like. Just don't make it a method of the generic.


NAMESPACE

In the NAMESPACE file of your package, you have a line of the form
S3method(foo, bar)

where foo is the generic function and bar is the method.
This means that in the R "core", presumably in package stats, there is a generic function foo, your regression-like method is going to return an object of class "bar", and your S3 method is called foo.bar.
Section 1.5.2 of Writing R Extensions is the authoritative reference for this.
Documentation

In the Rd file for your method, you document the signature of the function differently than usual.  Instead of having a section
\usage{foo.bar(arg1, arg2, \ldots)}

you should have
\usage{\method{foo}{bar}(arg1, arg2, \dots)}

Otherwise the documentation is the same as for any other function.
Section 2.1.1 of Writing R Extensions is the authoritative reference for this.
Argument Agreement

The arguments of the method you write must agree with the generic.  If the generic has a ... argument (and most do), then the method can have more arguments, but must still have the named arguments of the generic in the same order.  Otherwise R CMD check will complain about your package, and rightly so.  If the arguments don't match, it's a bug.
Methods that (Appear to) Print

Methods that (appear to) print, like summary and anova should not.
Instead they should just return an R object.  Just like in every other R command, an R command involving such a function will autoprint (show the value of the expression if the command is not an assignment, and print nothing if the command is an assignment).  And the function that autoprints S3 objects is the generic function print.  So instead of having print commands in summary.fred (for various values of "fred"), you just return an object of a class that you can write a new print method for.  The traditional choice of the class is "summary.fred" so you write the method print.summary.fred.
And similarly for anova or any other class that (appears to) print.
Except that you may not need to write a method of print at all.  For example, if the print.anova method in the stats package does what you want, then make the result returned by your anova method (say anova.fred) have class "anova" and it will get printed by print.anova.
How does R know what the generic is when there are two dots in the name of the method?  Because, of course, you have
S3method("print", "summary.fred")

in your NAMESPACE file.
The reason for this is that sometimes users don't want to see the printing, they want to use the stuff calculated in there
(P-values, for example) for further calculations.  Then they can do
bar <- summary(foo)
pval <- bar$coefficients[ , "Pr(>|t|)"]

and stuff like that.
Unless summary.fred does not print, so the printing is done by print.summary.fred, you are not following the way of R.
More on summary

In your summary methods you should


pay attention to options("show.signif.stars") and


use the R function printCoefmat to format your output.


Rather than blather more about this, I merely cite
getS3method("summary", "lm")
getS3method("print", "summary.lm")

as examples.
More on anova

In your anova methods you should check that models are nested (this advice is controversial).
IMHO methods of the generic function anova should not be footguns (useful for users to shoot themselves in the foot). When the models fit are not nested, most such functions do GIGO (garbage in garbage out) rather then GIEMO (garbage in, error messages out), which is following the way of R.  And those are footguns.  In particular, anova.lm and anova.glm are footguns.
The toy R package bar has an R function
anova.bar that illustrates how this checking can be done.
The R functions in the CRAN packages aster and glmm do a pretty good job of checking for nesting, even for random effects models.  What the bar package does is copied from them, except it doesn't do random effects.
generic	method
add1	lm
alias	lm
anova	lm
anova	lmlist
case.names	lm
confint	lm
cooks.distance	lm
deviance	lm
dfbeta	lm
dfbetas	lm
drop1	lm
dummy.coef	lm
effects	lm
extractAIC	lm
family	lm
formula	lm
hatvalues	lm
influence	lm
labels	lm
logLik	lm
model.frame	lm
model.matrix	lm
nobs	lm
plot	lm
predict	lm
print	lm
print	summary.lm
proj	lm
qr	lm
residuals	lm
rstandard	lm
rstudent	lm
simulate	lm
summary	lm
variable.names	lm
vcov	lm
vcov	summary.lm
generic	method
AIC	default
BIC	default
coef	default
df.residual	default
fitted	default
getCall	default
na.action	default
proj	default
sigma	default
terms	default
update	default
weights	default