@mobeets
Last active August 29, 2015 14:04
Regularization

Introduction

Regularization, in the context of Bayesian regression or fitting, refers to placing a prior on the parameters that you're fitting. The prior encodes assumptions about what the parameters ought to look like: e.g. that they are smooth, or sparse (i.e. only a few of them are non-zero).

The model

The methods discussed below may apply more generally, but for now I'm going to restrict the models to be linear Gaussian models, which means that the data is a linear function of the regressors, plus Gaussian noise. Another way of saying this is that I'm assuming that if you fit the data as a linear function, your errors will be normally distributed.

Y = Xk + ε, with ε ~ N(0, σ²I)

So now you've got a set of data, X, and some observations, Y, and what you want to find is the set of weights k that best describes the data. We can use Bayes's rule to write the posterior distribution for k:
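To make this concrete, here is a minimal sketch (using NumPy and simulated data, since the note doesn't specify a dataset) of the setup: data X, observations Y generated by a linear Gaussian model, and a least-squares fit for k, which is what maximizing the posterior reduces to under a flat prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a linear Gaussian model: y = X k + noise, noise ~ N(0, sigma^2 I).
n, d = 100, 5
X = rng.normal(size=(n, d))
k_true = rng.normal(size=d)
sigma = 0.5
y = X @ k_true + sigma * rng.normal(size=n)

# With a flat prior on k, maximizing the posterior is just least squares.
k_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(k_ols - k_true, 2))  # errors should be small with this much data
```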

P(k | X, Y) ∝ P(Y | X, k) · P(k)

Maximizing the posterior will give us a Bayesian estimate (i.e. the maximum a posteriori, or MAP, estimate) of k. We can take the log of this to make fitting easier:

k_MAP = argmax_k [ log P(Y | X, k) + log P(k) ]

It's the second term, the P(k), that regularization refers to.
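For the simplest Gaussian prior (the ridge prior discussed below), maximizing this log posterior has a closed form. A minimal sketch, again on simulated data, with λ standing in for the prior's strength relative to the likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 10
X = rng.normal(size=(n, d))
k_true = rng.normal(size=d)
y = X @ k_true + 0.5 * rng.normal(size=n)

lam = 1.0  # hyperparameter: strength of the prior relative to the data

# MAP estimate: maximize log P(Y|X,k) + log P(k).  For Gaussian noise and a
# zero-mean isotropic Gaussian prior, this is penalized least squares with
# closed-form solution (X'X + lam*I)^{-1} X'y.
k_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# The prior pulls the estimate toward zero, so its norm is smaller than
# that of the plain least-squares estimate.
k_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.linalg.norm(k_map) < np.linalg.norm(k_ols))  # True
```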

Choosing the prior

So when you fit one of these linear Gaussian models what you end up with is a linear filter, or a set of weights, k. The assumption that regularization makes, by way of the choice of prior, is that the distribution of these weights is a zero-mean multivariate Gaussian. The weights may be independent of one another, or they may not. In any case, this is called a Gaussian prior.

Here are some common regularization methods, distinguished by their assumptions about the covariance of the weights:

  • ridge regression: weights are independent of one another, and identically distributed with a shared variance σ².
  • automatic relevance determination (ARD): weights are independent of one another, and each is distributed with its own variance σᵢ².
  • automatic smoothness determination (ASD): the strength of correlations between weights is proportional to their closeness (i.e. assuming that weights nearby in the matrix refer to weights nearby in stimulus space)

The different types of regularization, then, all assume the weights are drawn from a Gaussian distribution with mean zero.

P(k) ∝ exp( −(λ/2) kᵀAᵀAk )

They are distinguished only by their choice of matrix A, which then determines the covariance matrix of the weights k. In fact, any A such that AᵀA is positive definite would make a valid Gaussian prior.
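As an illustration of how A shapes the prior, here is a sketch comparing a ridge-style A (the identity, giving independent weights) with a smoothness-encouraging A built from a first-difference operator, in the spirit of ASD; the specific construction is my own choice for illustration, not the one from any particular paper.

```python
import numpy as np

d = 6

# Ridge: A = I, so the prior covariance (A'A)^{-1} is the identity --
# weights are independent and identically distributed.
A_ridge = np.eye(d)
C_ridge = np.linalg.inv(A_ridge.T @ A_ridge)

# A smoothness prior: penalize differences between neighboring weights,
# plus a small ridge term so that A'A is positive definite (invertible).
D = np.diff(np.eye(d), axis=0)          # first-difference operator, (d-1) x d
A_smooth = np.vstack([D, 0.1 * np.eye(d)])
C_smooth = np.linalg.inv(A_smooth.T @ A_smooth)

# Under the smoothness prior, nearby weights are strongly correlated,
# with correlation decaying as the weights get farther apart.
corr = C_smooth / np.sqrt(np.outer(np.diag(C_smooth), np.diag(C_smooth)))
print(np.round(corr[0, :3], 2))
```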

Hyperparameters

k_MAP = argmax_k [ log P(Y | X, k) + log P(k) ]

It's important to note that the influence of your prior, the P(k), trades off against the influence of your actual data: if you have only a few data points, the prior will strongly influence the resulting parameters; as your dataset gets larger, though, the data will dominate. The relative influence of your prior, with respect to the likelihood (i.e. the data), is controlled by what are called the hyperparameters. These are the λ in the definition of the prior above, as well as any parameters in the matrix A (e.g. the σ² in ridge regression or ARD).

The hyperparameters of a model are the parameters controlling your prior on the model's parameters. If you have enough data, you can fit these using cross-validation: divide your data into a training set and a testing set, pick the best hyperparameters using the training set, and then assess performance by using these hyperparameters to fit the testing set.
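That cross-validation recipe, for the single ridge hyperparameter λ, can be sketched as follows (again on simulated data, with an arbitrary grid of candidate λ values):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 60, 15
X = rng.normal(size=(n, d))
k_true = rng.normal(size=d)
y = X @ k_true + rng.normal(size=n)

# Hold out a test set; choose lambda on the training set by 5-fold CV.
X_tr, y_tr, X_te, y_te = X[:40], y[:40], X[40:], y[40:]

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(lam, folds=5):
    idx = np.arange(len(y_tr))
    errs = []
    for f in range(folds):
        val = idx % folds == f                      # validation fold
        k = ridge(X_tr[~val], y_tr[~val], lam)      # fit on the rest
        errs.append(np.mean((X_tr[val] @ k - y_tr[val]) ** 2))
    return np.mean(errs)

lams = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(lams, key=cv_error)

# Finally, assess performance on the held-out test set.
k_hat = ridge(X_tr, y_tr, best)
print(best, np.mean((X_te @ k_hat - y_te) ** 2))
```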

However, if you have more than a few hyperparameters, as is often the case for more sophisticated regularization methods, cross-validation becomes impractical: the grid of hyperparameter settings you'd need to search grows exponentially with the number of hyperparameters, and you won't have enough data (or time) to evaluate them all.

Some methods for fitting hyperparameters without cross-validation include:

  • fixed-point methods
  • expectation-maximization
  • variational methods
  • empirical Bayes (aka evidence optimization, Type II maximum likelihood, maximum marginal likelihood, or even "cheating")
  • full Bayes, involving a "hyperprior" for the hyperparameters and a sampling method, e.g. Markov Chain Monte Carlo (MCMC)
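Of these, empirical Bayes is perhaps the most common for linear Gaussian models, because the marginal likelihood (the "evidence") is available in closed form: integrating k out gives Y ~ N(0, σ²I + (1/λ) XXᵀ) for a ridge-style prior k ~ N(0, (1/λ)I). A minimal sketch, maximizing the evidence over a grid (a real implementation would use gradients or fixed-point updates instead):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 8
X = rng.normal(size=(n, d))
k_true = 0.5 * rng.normal(size=d)
sigma = 1.0
y = X @ k_true + sigma * rng.normal(size=n)

def log_evidence(lam):
    # With k ~ N(0, (1/lam) I) and Gaussian noise, integrating k out gives
    # y ~ N(0, sigma^2 I + (1/lam) X X'); this is its log density at y.
    S = sigma**2 * np.eye(n) + (1.0 / lam) * (X @ X.T)
    sign, logdet = np.linalg.slogdet(S)
    return -0.5 * (logdet + y @ np.linalg.solve(S, y) + n * np.log(2 * np.pi))

lams = np.logspace(-2, 2, 9)
best = max(lams, key=log_evidence)
print(best)  # the hyperparameter that maximizes the marginal likelihood
```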

More information

Here are some papers with more (precise) information:
