Skip to content

Instantly share code, notes, and snippets.

@cjgeyer
Last active February 11, 2021 01:10
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save cjgeyer/2056ae760b6cf696b551a9542b73d21d to your computer and use it in GitHub Desktop.
Save cjgeyer/2056ae760b6cf696b551a9542b73d21d to your computer and use it in GitHub Desktop.
R Packages that Do Regression of Some Sort

Introduction

Suppose we want to write an R package to do regression (and don't know how). Here are some things we need to know.

My qualifications for writing this are that I have two regression-like packages on CRAN and I have also taught statistical computing. But that doesn't make me a wizard. I don't know all the incantations, only some of them.

Here are the things this page talks about.

  1. How to turn an R formula into a model matrix (the right way).

  2. How to write methods of the generic functions in the R "core" packages (mostly in the stats package). This involves stuff in the NAMESPACE file of your package and also stuff in Rd files (documentation) of these methods.

  3. How to write a method that prints (or seems to) like summary or anova.

  4. More about summary: printing tables with and without "signif.stars".

  5. More about anova: checking for nested models.

Formulas

The right way to handle basic formulas (the kind that the R functions lm and glm understand) is to have the R function lm do it for you. This is a bit tricky. So I wrote a toy R package bar that illustrates it. What you do is hand the formula and data frame (if any) to lm with the optional argument method = "model.frame", which tells lm we don't want to do linear regression, we just want it to turn the formula and the data (if any) into a model frame (which, if you don't know what that is, it is an R thingummy that is useful in this context, meaning I don't even know what it really is, but do know enough to use it for this purpose). Having the model frame, the R function model.matrix returns the model matrix and the R function model.response returns the response vector, as shown in the example bar/package/bar/R/bar.R.

The point of doing things this way is that is variables in the formula can come from either the R global environment (where R puts stuff when you do top-level assignments (not inside functions)) or from an R data frame supplied in the argument to your regression function that you should call data to follow the examples of the R functions lm and glm. You need to find variables in either place. In fact, if your function is called inside another function, variables could be found anywhere in the call stack or search path (Section 3.5 of The R Language Definition is the definitive reference on where R finds stuff when it looks up names of objects, but the whole point of using lm with method = "model.frame" is that you don't need to know, it knows).

Now you are on your own. Presumably you know what to do with a model matrix and response vector.

My way of converting formulas to x and y is a way of R, but not the only way. Instead you can follow http://developer.r-project.org/model-fitting-functions.html if you like.

Generics

Basics

By "generics" here we are talking about so-called S3 generic functions, which include summary, predict, and anova. You can tell if a function is generic by looking at the definition, for example,

> summary
function (object, ...) 
UseMethod("summary")

or by looking at the documentation (the help for summary starts off with "summary is a generic function").

When such a function is called, R looks at the class of the first argument, pastes the name of the class on the end of the name of the function with a dot as a separator, and redoes the call with the new name, for example, if fred is an R object of class "lm" that (presumably) is the result of a call to the R function lm, then

summary(fred)

is equivalent to

summary.lm(fred)

if there is a function summary.lm. If there is no such function, then it is equivalent to

summary.default(fred)

if there is such a function summary.default. Otherwise it is an error. This is the mechanism that allows R generic functions to do different things for different classes of objects, something appropriate for each.

If there are more arguments, then only the class of the first argument counts.

predict(fred, newdata = data.frame(x = x))

is equivalent to

predict.lm(fred, newdata = data.frame(x = x))

The class of an object can be a character vector, for example, if fred has class c("bar", "baz"), and foo is generic, then foo(fred) results in

  • a call foo.bar(fred) if there is a function foo.bar,

  • otherwise a call foo.baz(fred) if there is a function foo.baz,

  • otherwise a call foo.default(fred) if there is a function foo.default,

  • otherwise an error.

When the generic function already exists (as it does for all the ones we are talking about here), all you have to do to add a method is write the method, unless your method is in an R package you are writing, which is what this section is about.

Generics for lm

Checking in the R package stats for version 4.0.4 there are the following functions.

generic method
add1 lm
alias lm
anova lm
anova lmlist
case.names lm
confint lm
cooks.distance lm
deviance lm
dfbeta lm
dfbetas lm
drop1 lm
dummy.coef lm
effects lm
extractAIC lm
family lm
formula lm
hatvalues lm
influence lm
labels lm
logLik lm
model.frame lm
model.matrix lm
nobs lm
plot lm
predict lm
print lm
print summary.lm
proj lm
qr lm
residuals lm
rstandard lm
rstudent lm
simulate lm
summary lm
variable.names lm
vcov lm
vcov summary.lm

And there are also the following default methods that do something useful with objects of class "lm"

generic method
AIC default
BIC default
coef default
df.residual default
fitted default
getCall default
na.action default
proj default
sigma default
terms default
update default
weights default

You might not want to implement methods for all of the generic functions in these tables. In fact, you almost certainly don't. But you should consider them all and whether you want them.

You definitely do want to implement any for which you want functionality similar to their methods for classes "lm" and "glm". Users understand, for example, how to use summary(foo) when foo is an R object that is the result of a call to the R function lm or glm, and you want users to be able to transfer that knowledge to your package if the functionality is similar and your method for your package will not surprise and befuddle them.

A Rant About Misapplied OOP

The odd man out here is anova.lmlist. The idea is that anova.lmand anova.glm and anova.mlm do rather different things when given one model fit or several. Hence it makes sense to have two different functions to do these two different things. But this is an implementation detail that users should not be bothered with. When called with more than one model fit, all three of anova.lm, anova.glm, and anova.mlm immediately call the function with list tacked onto the end of its name. So far there is nothing crazy about this. But what is IMHO crazy is that anova.lmlist and anova.glmlist are generic functions, whereas anova.mlmlist is not. Moreover, anova.lmlist and anova.glmlist have different kinds of arguments and so work differently.

anova(out1, out2, out3)

is equivalent to

anova(structure(out1, class = c("lmlist", "lm")), out2, out3)

when out1, out2, and out3 are objects of class "lm". But

anova(out1, out2, out3)

is equivalent to

anova(structure(list(out1, out2, out3), class = "glmlist"))

when out1, out2, and out3 are objects of class "glm".

  • IMHO no user is going to call anova.lmlist or anova.glmlist this way.

  • Hence anova.lmlist and anova.glmlist should not be generic methods.

  • To have them show up when users do methods("anova") just confuses users.

  • This shows the main problem with OOP (object-oriented programming): programmers use it just because they can, often when they shouldn't. It can, and often does, do more harm than good.

  • anova.lmlist and anova.glmlist follow an OOD (object-oriented design) anti-pattern.

  • You can have a function like anova.lmlist or anova.glmlist if you like. Just don't make it a method of the generic.

NAMESPACE

In the NAMESPACE file of your package, you have a line of the form

S3method(foo, bar)

where foo is the generic function and bar is the method.

This means that in the R "core", presumably in package stats, there is a generic function foo, your regression-like method is going to return an object of class "bar", and your S3 method is called foo.bar.

Section 1.5.2 of Writing R Extensions is the authoritative reference for this.

Documentation

In the Rd file for your method, you document the signature of the function differently than usual. Instead of having a section

\usage{foo.bar(arg1, arg2, \ldots)}

you should have

\usage{\method{foo}{bar}(arg1, arg2, \dots)}

Otherwise the documentation is the same as for any other function.

Section 2.1.1 of Writing R Extensions is the authoritative reference for this.

Argument Agreement

The arguments of the method you write must agree with the generic. If the generic has a ... argument (and most do), then the method can have more arguments, but must still have the named arguments of the generic in the same order. Otherwise R CMD check will complain about your package, and rightly so. If the arguments don't match, it's a bug.

Methods that (Appear to) Print

Methods that (appear to) print, like summary and anova should not.

Instead they should just return an R object. Just like in every other R command, an R command involving such a function will autoprint (show the value of the expression if the command is not an assignment, and print nothing if the command is an assignment). And the function that autoprints S3 objects is the generic function print. So instead of having print commands in summary.fred (for various values of "fred"), you just return an object of a class that you can write a new print method for. The traditional choice of the class is "summary.fred" so you write the method print.summary.fred. And similarly for anova or any other class that (appears to) print.

Except that you may not need to write a method of print at all. For example, if the print.anova method in the stats package does what you want, then make the result returned by your anova method (say anova.fred) have class "anova" and it will get printed by print.anova.

How does R know what the generic is when there are two dots in the name of the method? Because, of course, you have

S3method("print", "summary.fred")

in your NAMESPACE file.

The reason for this is that sometimes users don't want to see the printing, they want to use the stuff calculated in there (P-values, for example) for further calculations. Then they can do

bar <- summary(foo)
pval <- bar$coefficients[ , "Pr(>|t|)"]

and stuff like that.

Unless summary.fred does not print, so the printing is done by print.summary.fred, you are not following the way of R.

More on summary

In your summary methods you should

  • pay attention to options("show.signif.stars") and

  • use the R function printCoefmat to format your output.

Rather than blather more about this, I merely cite

getS3method("summary", "lm")
getS3method("print", "summary.lm")

as examples.

More on anova

In your anova methods you should check that models are nested (this advice is controversial).

IMHO methods of the generic function anova should not be footguns (useful for users to shoot themselves in the foot). When the models fit are not nested, most such functions do GIGO (garbage in garbage out) rather then GIEMO (garbage in, error messages out), which is following the way of R. And those are footguns. In particular, anova.lm and anova.glm are footguns.

The toy R package bar has an R function anova.bar that illustrates how this checking can be done. The R functions in the CRAN packages aster and glmm do a pretty good job of checking for nesting, even for random effects models. What the bar package does is copied from them, except it doesn't do random effects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment