Regularization, in the context of Bayesian regression or fitting, refers to placing a prior on the parameters that you're fitting. The prior encodes an assumption about what the parameters should look like: e.g. that they are smooth, or sparse (i.e. only a few of them are non-zero).
The methods discussed below may apply more generally, but for now I'm going to restrict the models to be linear Gaussian models, which means that the data is a linear function of the regressors, plus Gaussian noise. Another way of saying this is that I'm assuming that if you fit the data as a linear function, your errors will be normally distributed.
So now you've got a set of data, X, and some observations, Y, and what you want to find is the set of weights k that best describes the data. We can use Bayes's rule to write the posterior distribution for k:

P(k | X, Y) ∝ P(Y | X, k) P(k)
Maximizing the posterior will give us a Bayesian estimate (i.e. the maximum a posteriori, or MAP, estimate) of k. We can take the log to make fitting easier:

log P(k | X, Y) = log P(Y | X, k) + log P(k) + const.

It's the second term, the log P(k), that regularization refers to.
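As a concrete sketch (the data, dimensions, and prior covariance here are all made up for illustration): in a linear Gaussian model with a zero-mean Gaussian prior on k, both terms of the log posterior are quadratic in k, so the MAP estimate has a closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 100 observations of a 5-dimensional linear model.
n, d = 100, 5
X = rng.normal(size=(n, d))
k_true = rng.normal(size=d)
sigma2 = 0.25                          # noise variance, assumed known here
Y = X @ k_true + np.sqrt(sigma2) * rng.normal(size=n)

# Zero-mean Gaussian prior on the weights with covariance C
# (the identity, i.e. a standard normal prior, just for illustration).
C = np.eye(d)

# Maximizing log P(Y|X,k) + log P(k) gives the closed-form MAP estimate:
#   k_map = (X'X / sigma2 + C^{-1})^{-1} X'Y / sigma2
k_map = np.linalg.solve(X.T @ X / sigma2 + np.linalg.inv(C), X.T @ Y / sigma2)
```

With a flat prior (C⁻¹ → 0) this reduces to ordinary least squares; the prior's job is to shrink the estimate toward zero.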
So when you fit one of these linear Gaussian models, what you end up with is a linear filter, or set of weights, k. The assumption in this world of regularization, encoded in the choice of prior, is that these weights are drawn from a zero-mean multivariate Gaussian. The weights may be independent of one another, or they may not. In either case, this is called a Gaussian prior.
Here are some common regularization methods, distinguished by their assumptions about the covariance of the weights:
- ridge regression: the weights are independent of one another and identically distributed, sharing a single variance σ².
- automatic relevance determination (ARD): the weights are independent of one another, and each is distributed with its own variance σᵢ².
- automatic smoothness determination (ASD): the correlation between two weights falls off with the distance between them (i.e. assuming that weights nearby in the weight vector correspond to points nearby in stimulus space).
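To make these covariance assumptions concrete, here's a sketch of the prior covariance matrix each method implies (the dimensions and hyperparameter values are arbitrary, and ASD is illustrated with a squared-exponential fall-off, one common choice):

```python
import numpy as np

d = 6  # number of weights

# Ridge: independent weights with a single shared variance sigma2.
sigma2 = 1.0
C_ridge = sigma2 * np.eye(d)

# ARD: independent weights, each with its own variance (illustrative values).
variances = np.array([1.0, 0.5, 2.0, 0.1, 1.5, 0.8])
C_ard = np.diag(variances)

# ASD: correlations fall off with distance between weights, here via a
# squared-exponential kernel with length scale ell.
ell = 2.0
idx = np.arange(d)
C_asd = np.exp(-((idx[:, None] - idx[None, :]) ** 2) / (2.0 * ell**2))
```

In practice a small multiple of the identity is often added to a covariance like `C_asd` to keep it well-conditioned.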
The different types of regularization, then, all assume the weights are drawn from a Gaussian distribution with mean zero:

P(k) ∝ exp(−(λ/2) kᵀAᵀA k)

They are distinguished only by their choice of matrix A, which determines the covariance matrix of the weights k: cov(k) = (λAᵀA)⁻¹. In fact, any A such that AᵀA is positive definite gives a valid Gaussian prior.
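For instance (a sketch; the difference operator and the small ridge term below are my own illustrative choices, not any particular paper's), ridge corresponds to A proportional to the identity, while a smoothness-encouraging prior can be built from first differences of neighboring weights:

```python
import numpy as np

d, lam = 5, 0.1

# Ridge: A = sqrt(lam) * I, so A'A = lam * I.
A_ridge = np.sqrt(lam) * np.eye(d)

# Smoothness: penalize first differences between neighboring weights.
# First differences alone leave constant weight vectors unpenalized
# (A'A is only positive semi-definite), so stack on a small ridge term.
D = np.diff(np.eye(d), axis=0)            # (d-1) x d difference operator
A_smooth = np.vstack([D, 1e-3 * np.eye(d)])

# Both choices give a valid Gaussian prior: A'A is positive definite.
for A in (A_ridge, A_smooth):
    assert np.all(np.linalg.eigvalsh(A.T @ A) > 0)
```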
It's important to note that the influence of your prior, the P(k), trades off with the influence of your actual data: if you have only a few data points, the prior will influence the resulting parameters; as your dataset gets larger, though, the data will dominate. The relative influence of your prior, with respect to the likelihood (i.e. the data), is controlled by what are called hyperparameters. These are the λ in the definition of the prior above, as well as any parameters in the matrix A (e.g. the σ² in ridge regression or ARD).
The hyperparameters of a model are the parameters controlling your prior on the model's parameters. If you have enough data, you can fit these using cross-validation: divide your data into a training set and a testing set, fit the model on the training set for each candidate setting of the hyperparameters, and pick the setting whose fit predicts the held-out testing set best.
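A minimal sketch of that procedure for ridge regression, on a made-up dataset and using a single hold-out split for brevity (a full n-fold scheme would rotate the split and average the errors):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dataset.
n, d = 200, 8
X = rng.normal(size=(n, d))
k_true = rng.normal(size=d)
Y = X @ k_true + rng.normal(size=n)

# Hold out the last quarter of the data for testing.
n_train = 150
Xtr, Ytr, Xte, Yte = X[:n_train], Y[:n_train], X[n_train:], Y[n_train:]

def ridge_fit(X, Y, lam):
    """MAP estimate under a ridge prior: k = (X'X + lam*I)^{-1} X'Y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

# Fit on the training set for each candidate lambda, score on the test set.
lams = [0.01, 0.1, 1.0, 10.0, 100.0]
errs = [np.mean((Yte - Xte @ ridge_fit(Xtr, Ytr, lam)) ** 2) for lam in lams]
best_lam = lams[int(np.argmin(errs))]
```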
However, if you have more than a few hyperparameters, as is often the case for more sophisticated regularization methods, cross-validation becomes impractical: the grid of candidate hyperparameter settings you'd need to search grows exponentially with the number of hyperparameters, and you quickly run out of data to evaluate every combination reliably.
Some methods for fitting hyperparameters without cross-validation include:
- fixed-point methods
- expectation-maximization
- variational methods
- empirical Bayes (aka evidence optimization, Type II maximum likelihood, maximum marginal likelihood, or even "cheating")
- full Bayes, involving a "hyperprior" for the hyperparameters and a sampling method, e.g. Markov Chain Monte Carlo (MCMC)
Here are some papers with more (precise) information:
- Wu et al's "Complete Functional Characterization of Sensory Neurons by System Identification" (2006)
- Mineault et al's "Improved classification images with sparse priors in a smooth basis" (2009)
- Park et al's "Receptive Field Inference with Localized Priors" (2011)