Suppose we want to write an R package to do regression (and don't know how). Here are some things we need to know.
My qualifications for writing this are that I have two regression-like packages on CRAN and I have also taught statistical computing. But that doesn't make me a wizard. I don't know all the incantations, only some of them.
Here are the things this page talks about.
-
How to turn an R formula into a model matrix (the right way).
-
How to write methods of the generic functions in the R "core" packages (mostly in the
stats
package). This involves stuff in the NAMESPACE file of your package and also stuff in Rd files (documentation) of these methods. -
How to write a method that prints (or seems to) like
summary
oranova
. -
More about
summary
: printing tables with and without "signif.stars
". -
More about
anova
: checking for nested models.
The right way to handle basic formulas (the kind that the R functions lm
and glm
understand) is to have the R function lm
do it for you. This is a bit tricky. So I wrote a toy R package bar
that illustrates it.
What you do is hand the formula and data frame (if any) to lm with the optional argument method = "model.frame"
, which tells lm
we don't want to do linear regression, we just want it to turn the formula and the data (if any) into a model frame (which, if you don't know what that is, it is an R thingummy that is useful in this context, meaning I don't even know what it really is, but do know enough to use it for this purpose).
Having the model frame, the R function model.matrix
returns the model matrix and the R function model.response
returns the response vector, as shown in the example bar/package/bar/R/bar.R
.
The point of doing things this way is that is variables in the formula can come from either the R global environment (where R puts stuff when you do top-level assignments (not inside functions)) or from an R data frame supplied in the argument to your regression function that you should call data
to follow the examples of the R functions lm
and glm
. You need to find variables in either place. In fact, if your function is called inside another function, variables could be found anywhere in the call stack or search path (Section 3.5 of The R Language Definition is the definitive reference on where R finds stuff when it looks up names of objects, but the whole point of using lm
with method = "model.frame"
is that you don't need to know, it knows).
Now you are on your own. Presumably you know what to do with a model matrix and response vector.
My way of converting formulas to x and y is a way of R, but not the only way. Instead you can follow http://developer.r-project.org/model-fitting-functions.html if you like.
By "generics" here we are talking about so-called S3 generic functions, which include summary
, predict
, and anova
.
You can tell if a function is generic by looking at the definition, for example,
> summary
function (object, ...)
UseMethod("summary")
or by looking at the documentation (the help for summary
starts off with "summary
is a generic function").
When such a function is called, R looks at the class of the first argument, pastes the name of the class on the end of the name of the function with a dot as a separator, and redoes the call with the new name, for example, if fred
is an R object
of class "lm"
that (presumably) is the result of a call to the R function lm
, then
summary(fred)
is equivalent to
summary.lm(fred)
if there is a function summary.lm
. If there is no such function, then it is equivalent to
summary.default(fred)
if there is such a function summary.default
. Otherwise it is an error. This is the mechanism that allows R generic functions to do different things for different classes of objects, something appropriate for each.
If there are more arguments, then only the class of the first argument counts.
predict(fred, newdata = data.frame(x = x))
is equivalent to
predict.lm(fred, newdata = data.frame(x = x))
The class of an object can be a character vector, for example, if fred
has class c("bar", "baz")
, and foo
is generic,
then foo(fred)
results in
-
a call
foo.bar(fred)
if there is a functionfoo.bar
, -
otherwise a call
foo.baz(fred)
if there is a functionfoo.baz
, -
otherwise a call
foo.default(fred)
if there is a functionfoo.default
, -
otherwise an error.
When the generic function already exists (as it does for all the ones we are talking about here), all you have to do to add a method is write the method, unless your method is in an R package you are writing, which is what this section is about.
Checking in the R package stats
for version 4.0.4 there are the following functions.
generic | method |
---|---|
add1 | lm |
alias | lm |
anova | lm |
anova | lmlist |
case.names | lm |
confint | lm |
cooks.distance | lm |
deviance | lm |
dfbeta | lm |
dfbetas | lm |
drop1 | lm |
dummy.coef | lm |
effects | lm |
extractAIC | lm |
family | lm |
formula | lm |
hatvalues | lm |
influence | lm |
labels | lm |
logLik | lm |
model.frame | lm |
model.matrix | lm |
nobs | lm |
plot | lm |
predict | lm |
lm | |
summary.lm | |
proj | lm |
qr | lm |
residuals | lm |
rstandard | lm |
rstudent | lm |
simulate | lm |
summary | lm |
variable.names | lm |
vcov | lm |
vcov | summary.lm |
And there are also the following default methods that do something useful with objects of class "lm"
generic | method |
---|---|
AIC | default |
BIC | default |
coef | default |
df.residual | default |
fitted | default |
getCall | default |
na.action | default |
proj | default |
sigma | default |
terms | default |
update | default |
weights | default |
You might not want to implement methods for all of the generic functions in these tables. In fact, you almost certainly don't. But you should consider them all and whether you want them.
You definitely do want to implement any for which you want functionality similar to their methods for classes "lm"
and "glm"
. Users understand, for example, how to use summary(foo)
when foo
is an R object that is the result of a call to the R function lm
or glm
, and you want users to be able to transfer that knowledge to your package if the functionality is similar and your method for your package will not surprise and befuddle them.
The odd man out here is anova.lmlist
. The idea is that anova.lm
and anova.glm
and anova.mlm
do rather different things when given one model fit or several. Hence it makes sense to have two different functions to do these two different things. But this is an implementation detail that users should not be bothered with. When called with more than one model fit, all three of anova.lm
, anova.glm
, and anova.mlm
immediately call the function with list
tacked onto the end of its name. So far there is nothing crazy about this. But what is IMHO crazy is that anova.lmlist
and anova.glmlist
are generic functions, whereas anova.mlmlist
is not. Moreover, anova.lmlist
and anova.glmlist
have different kinds of arguments and so work differently.
anova(out1, out2, out3)
is equivalent to
anova(structure(out1, class = c("lmlist", "lm")), out2, out3)
when out1
, out2
, and out3
are objects of class "lm"
. But
anova(out1, out2, out3)
is equivalent to
anova(structure(list(out1, out2, out3), class = "glmlist"))
when out1
, out2
, and out3
are objects of class "glm"
.
-
IMHO no user is going to call
anova.lmlist
oranova.glmlist
this way. -
Hence
anova.lmlist
andanova.glmlist
should not be generic methods. -
To have them show up when users do
methods("anova")
just confuses users. -
This shows the main problem with OOP (object-oriented programming): programmers use it just because they can, often when they shouldn't. It can, and often does, do more harm than good.
-
anova.lmlist
andanova.glmlist
follow an OOD (object-oriented design) anti-pattern. -
You can have a function like
anova.lmlist
oranova.glmlist
if you like. Just don't make it a method of the generic.
In the NAMESPACE
file of your package, you have a line of the form
S3method(foo, bar)
where foo
is the generic function and bar
is the method.
This means that in the R "core", presumably in package stats
, there is a generic function foo
, your regression-like method is going to return an object of class "bar"
, and your S3 method is called foo.bar
.
Section 1.5.2 of Writing R Extensions is the authoritative reference for this.
In the Rd file for your method, you document the signature of the function differently than usual. Instead of having a section
\usage{foo.bar(arg1, arg2, \ldots)}
you should have
\usage{\method{foo}{bar}(arg1, arg2, \dots)}
Otherwise the documentation is the same as for any other function.
Section 2.1.1 of Writing R Extensions is the authoritative reference for this.
The arguments of the method you write must agree with the generic. If the generic has a ...
argument (and most do), then the method can have more arguments, but must still have the named arguments of the generic in the same order. Otherwise R CMD check
will complain about your package, and rightly so. If the arguments don't match, it's a bug.
Methods that (appear to) print, like summary
and anova
should not.
Instead they should just return an R object. Just like in every other R command, an R command involving such a function will autoprint (show the value of the expression if the command is not an assignment, and print nothing if the command is an assignment). And the function that autoprints S3 objects is the generic function print
. So instead of having print commands in summary.fred
(for various values of "fred
"), you just return an object of a class that you can write a new print method for. The traditional choice of the class is "summary.fred"
so you write the method print.summary.fred
.
And similarly for anova
or any other class that (appears to) print.
Except that you may not need to write a method of print
at all. For example, if the print.anova
method in the stats
package does what you want, then make the result returned by your anova method (say anova.fred
) have class "anova"
and it will get printed by print.anova
.
How does R know what the generic is when there are two dots in the name of the method? Because, of course, you have
S3method("print", "summary.fred")
in your NAMESPACE
file.
The reason for this is that sometimes users don't want to see the printing, they want to use the stuff calculated in there (P-values, for example) for further calculations. Then they can do
bar <- summary(foo)
pval <- bar$coefficients[ , "Pr(>|t|)"]
and stuff like that.
Unless summary.fred
does not print, so the printing is done by print.summary.fred
, you are not following the way of R.
In your summary
methods you should
-
pay attention to
options("show.signif.stars")
and -
use the R function
printCoefmat
to format your output.
Rather than blather more about this, I merely cite
getS3method("summary", "lm")
getS3method("print", "summary.lm")
as examples.
In your anova
methods you should check that models are nested (this advice is controversial).
IMHO methods of the generic function anova
should not be footguns (useful for users to shoot themselves in the foot). When the models fit are not nested, most such functions do GIGO (garbage in garbage out) rather then GIEMO (garbage in, error messages out), which is following the way of R. And those are footguns. In particular, anova.lm
and anova.glm
are footguns.
The toy R package bar
has an R function
anova.bar
that illustrates how this checking can be done.
The R functions in the CRAN packages aster
and glmm
do a pretty good job of checking for nesting, even for random effects models. What the bar
package does is copied from them, except it doesn't do random effects.