Implementation of Gaussian Process Regression with ChainerX

The purpose of this notebook is to demonstrate the ChainerX linear algebra routines using the example of Gaussian Process Regression.

import chainerx
import chainerx as np  # use ChainerX through the familiar `np` alias
import matplotlib.pyplot as plt
%matplotlib inline
chainerx.set_default_device('cuda:0')

Unknown functions can be represented using Gaussian Processes (GPs). GPs can be used for regression and classification tasks. Here, we focus on a simple GP regression case. For details, see the book Gaussian Processes for Machine Learning [1].

We are interested in learning a function $f(\mathbf{x})$ from data $\{(\mathbf{x}_i, y_i) \mid i = 1, \dots, n\}$, where $y_i \in \mathbb{R}$ is a noisy observation of $f(\mathbf{x}_i)$.

A Gaussian process is a random process where any point $\mathbf{x} \in \mathbb{R}^d$ is assigned a random variable $f(\mathbf{x})$ and where the joint distribution of a finite number of these variables $p(f(\mathbf{x}_1), \dots, f(\mathbf{x}_N))$ is itself Gaussian:


$$p(\mathbf{f} \mid \mathbf{X}) = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, \mathbf{K})$$

Here, $\mathbf{f} = (f(\mathbf{x}_1), \dots, f(\mathbf{x}_N))$, $\boldsymbol{\mu} = (m(\mathbf{x}_1), \dots, m(\mathbf{x}_N))$ and $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$. $m$ is the mean function, usually taken to be the zero function. $\kappa$ is a positive definite kernel function or covariance function. Thus, a Gaussian process is a distribution over functions whose shape (i.e. smoothness) is defined by $\mathbf{K}$. If points $\mathbf{x}_i$ and $\mathbf{x}_j$ are considered to be similar by the kernel, the function values at these points, $f(\mathbf{x}_i)$ and $f(\mathbf{x}_j)$, can be expected to be similar too. The kernel function $\kappa$ has free hyper-parameters $\theta$.

A GP prior $p(\mathbf{f} \mid \mathbf{X})$ can be converted into a GP posterior $p(\mathbf{f} \mid \mathbf{X}, \mathbf{y})$ after having observed some data $\mathbf{y}$. The posterior can then be used to make predictions $\mathbf{f}_*$ given new inputs $\mathbf{X}_*$:


$$p(\mathbf{f}_* \mid \mathbf{X}_*, \mathbf{X}, \mathbf{y}) = \int p(\mathbf{f}_* \mid \mathbf{X}_*, \mathbf{f})\, p(\mathbf{f} \mid \mathbf{X}, \mathbf{y})\, d\mathbf{f} = \mathcal{N}(\mathbf{f}_* \mid \boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)$$

So the posterior predictive distribution is also a Gaussian, with mean $\boldsymbol{\mu}_*$ and covariance $\boldsymbol{\Sigma}_*$. By the definition of the GP, the joint distribution of the observed data $\mathbf{y}$ and the predictions $\mathbf{f}_*$ is

$$\begin{aligned} \begin{pmatrix}\mathbf{y} \\ \mathbf{f}_*\end{pmatrix} \sim \mathcal{N} \left(\boldsymbol{0}, \begin{pmatrix}\mathbf{K}_y & \mathbf{K}_* \\ \mathbf{K}_*^T & \mathbf{K}_{**}\end{pmatrix} \right) \end{aligned}$$

With $N$ training data points and $N_*$ new input points, $\mathbf{K}_y = \kappa(\mathbf{X}, \mathbf{X}) + \sigma_y^2 \mathbf{I} = \mathbf{K} + \sigma_y^2 \mathbf{I}$ is $N \times N$, $\mathbf{K}_* = \kappa(\mathbf{X}, \mathbf{X}_*)$ is $N \times N_*$ and $\mathbf{K}_{**} = \kappa(\mathbf{X}_*, \mathbf{X}_*)$ is $N_* \times N_*$. $\sigma_y^2$ is the noise term in the diagonal of $\mathbf{K}_y$. It is the noise of the target data $\mathbf{y}$, so it is set to zero if the training targets are noise-free and to a value greater than zero if the observations are noisy. The mean is set to $\boldsymbol{0}$. The sufficient statistics of the posterior predictive distribution, $\boldsymbol{\mu}_*$ and $\boldsymbol{\Sigma}_*$, can be computed with


$$\boldsymbol{\mu}_* = \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{y}$$

$$\boldsymbol{\Sigma}_* = \mathbf{K}_{**} - \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{K}_*$$
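
These expressions follow from the standard identity for conditioning a joint Gaussian. If

$$\begin{pmatrix}\mathbf{a} \\ \mathbf{b}\end{pmatrix} \sim \mathcal{N}\left(\boldsymbol{0}, \begin{pmatrix}\mathbf{A} & \mathbf{C} \\ \mathbf{C}^T & \mathbf{B}\end{pmatrix}\right),$$

then $\mathbf{b} \mid \mathbf{a} \sim \mathcal{N}\left(\mathbf{C}^T \mathbf{A}^{-1} \mathbf{a},\; \mathbf{B} - \mathbf{C}^T \mathbf{A}^{-1} \mathbf{C}\right)$. Setting $\mathbf{a} = \mathbf{y}$, $\mathbf{b} = \mathbf{f}_*$, $\mathbf{A} = \mathbf{K}_y$, $\mathbf{C} = \mathbf{K}_*$ and $\mathbf{B} = \mathbf{K}_{**}$ in the joint distribution above recovers exactly $\boldsymbol{\mu}_*$ and $\boldsymbol{\Sigma}_*$.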

Given this model, there are two basic problems to be solved. First, we need to learn the hyper-parameters $\theta$ and $\sigma_y^2$. Second, we need to predict the mean and variance of $f(\mathbf{x}_*)$ at test input points $\mathbf{X}_*$.

The squared exponential covariance function is given by

$$k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|_2^2}{2\ell^2}\right)$$

where $\sigma_f > 0$ and $\ell > 0$ are hyperparameters of the kernel. There are many other kernels that can be used for Gaussian processes. This specific covariance function is perhaps the most common one used in statistics and machine learning. It is also known as the radial basis function (RBF) kernel, the Gaussian kernel, or the exponentiated quadratic kernel.

def gaussian_rbf(x1, x2, l=1, sigma_f=1):
    """Returns the NxM kernel matrix between the two sets of inputs x1 and x2.

    Args:
        x1: NxD matrix
        x2: MxD matrix
        l: length-scale parameter
        sigma_f: vertical variation parameter

    Returns:
        NxM kernel matrix
    """
    # squared Euclidean distance between every pair of rows of x1 and x2
    dist_matrix = np.sum(np.square(x1), axis=1).reshape(-1, 1) + np.sum(np.square(x2), axis=1) - 2 * np.dot(x1, x2.T)
    return sigma_f**2 * np.exp(-1 / (2 * l**2) * dist_matrix)
kernel = gaussian_rbf

Let’s visualize the entries of the kernel matrix.

# create an Nx1 vector of equidistant points in [-3, 9]
N = 50
X = np.linspace(-3, 9, N).reshape(-1, 1)

cov = kernel(X, X)
plt.pcolor(chainerx.to_numpy(cov))
plt.colorbar()

<matplotlib.colorbar.Colorbar at 0x7f04781b4470>

[figure: heatmap of the kernel matrix]

Prior

Let’s generate samples from a Gaussian process prior. Specifically, we will consider a zero-mean Gaussian process prior for functions of the form $f : \mathbb{R} \to \mathbb{R}$ using the squared exponential kernel. That is,


$$f(x) \sim \mathcal{GP}(0, k(x, x')).$$

Let $f_n = f(x_n) \in \mathbb{R}$ be the value of the function $f$ evaluated at a point $x_n \in \mathbb{R}$. Furthermore, let $\mathbf{f} \in \mathbb{R}^N$ be the vector of function values for each of the points of $\mathbf{X}$.

The Gaussian process prior for f becomes


$$\mathbf{f} \sim \mathcal{N}(\boldsymbol{0}, \mathbf{K}),$$

where K is the kernel matrix.

To draw random functions from the GP we draw random samples from the corresponding multivariate normal. The common way of drawing such samples is via the Cholesky decomposition; the algorithm is described on Wikipedia.
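
To see why this works, note that if $\mathbf{K} = \mathbf{L}\mathbf{L}^T$ is the Cholesky factorization and $\mathbf{z} \sim \mathcal{N}(\boldsymbol{0}, \mathbf{I})$, then

$$\operatorname{Cov}(\boldsymbol{\mu} + \mathbf{L}\mathbf{z}) = \mathbf{L}\,\operatorname{Cov}(\mathbf{z})\,\mathbf{L}^T = \mathbf{L}\mathbf{L}^T = \mathbf{K},$$

so $\boldsymbol{\mu} + \mathbf{L}\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K})$, which is exactly what the function below computes (with a small jitter added to the diagonal of $\mathbf{K}$ for numerical stability).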

def generate_samples(mean, K, S):
    """Returns S samples from a Gaussian process with kernel matrix K.

    Args:
        mean: Nx1 mean vector
        K: NxN kernel matrix
        S: number of samples

    Returns:
        NxS matrix of samples
    """

    z = np.random.normal(size=(mean.shape[0], S))
    L = np.linalg.cholesky(K + 1e-6*np.eye(mean.shape[0]))  # add `1e-6*eye(N)` to the covariance matrix to avoid numerical non-positive-definiteness
    w = mean + L.dot(z)
    return w

Next, let’s define a function for plotting a Gaussian process with specified mean and 95.45% confidence interval (2σ).

def plot_gp(Xp, mu, Sigma, title="", num_samples=0):
    """Plots the GP mean, the 2-sigma confidence interval and, optionally, samples drawn from the GP."""
    mean, std = mu.ravel(), np.sqrt(np.diag(Sigma))

    # plot distribution
    plt.plot(Xp, mean, color='k', label='Mean')
    plt.plot(Xp, mean + 2*std, color='k', linestyle='--')
    plt.plot(Xp, mean - 2*std, color='k', linestyle='--')
    # plt.fill_between does not work well with chainerx arrays so we are converting the input to numpy arrays
    plt.fill_between(*map(chainerx.to_numpy, (Xp.ravel(), mean - 2*std, mean + 2*std)), alpha=0.1)

    # generate samples
    if num_samples > 0:
        fs = generate_samples(mu, Sigma, num_samples)
        plt.plot(Xp, fs, color='b', alpha=.25)

    plt.title(title)
    plt.legend()
# Finite number of points
X = np.arange(-1, 7, 0.025).reshape(-1, 1).astype('float64')

# Mean and covariance of the prior
mu = np.zeros(X.shape, dtype='float64')
cov = kernel(X, X)

# Plot GP mean, confidence interval and samples 
plot_gp(X, mu, cov, num_samples = 5)

[figure: GP prior mean, 2σ confidence interval and 5 samples]

Prediction from noise-free training data

Let’s implement the computation of $\boldsymbol{\mu}_*$ and $\boldsymbol{\Sigma}_*$ using the previously defined formulas.

def posterior_predictive(X, X_train, Y_train, l=1.0, sigma_f=1.0, sigma_y=1e-8):
    '''Computes the sufficient statistics of the GP posterior predictive distribution
    from M training data points X_train and Y_train and N new inputs X.

    Args:
        X: New input locations (N x D).
        X_train: Training locations (M x D).
        Y_train: Training targets (M x 1).
        l: Kernel length parameter.
        sigma_f: Kernel vertical variation parameter.
        sigma_y: Noise parameter.

    Returns:
        Posterior mean vector (N x 1) and covariance matrix (N x N).
    '''
    K = kernel(X_train, X_train, l, sigma_f) + sigma_y**2 * np.eye(len(X_train))
    K_s = kernel(X_train, X, l, sigma_f)
    K_ss = kernel(X, X, l, sigma_f)

    # z = K_s^T K^{-1}, computed with a linear solve instead of an explicit inverse
    z = np.linalg.solve(K, K_s).T

    mu_s = z.dot(Y_train)
    cov_s = K_ss - z.dot(K_s)

    return mu_s, cov_s

and apply them to the noise-free training data X_train and Y_train. The following example draws five samples from the posterior predictive distribution and plots them along with the mean, confidence interval and training data. In a noise-free model, the variance at the training points is zero and all random functions drawn from the posterior pass through the training points.

# Noise free training data
X_train = np.linspace(0,6,5).reshape(-1,1)
Y_train = np.sin(0.07*X_train**3)

# Compute mean and covariance of the posterior predictive distribution
mu_s, cov_s = posterior_predictive(X, X_train, Y_train)

plot_gp(X, mu_s, cov_s, num_samples = 5)
plt.plot(X_train, Y_train, 'ro', label='Training data', markersize=5)
plt.legend()

<matplotlib.legend.Legend at 0x7f0478005518>

[figure: GP posterior predictive for noise-free training data, with samples]

Prediction from noisy training data

Now let’s add some noise to the model. The training points are then only approximated and the variance at the training points is non-zero.

# Noisy training data
noise = 0.3
X_train = np.linspace(0,6,100).reshape(-1,1)
Y_train = np.sin(0.07*X_train**3)  + noise * np.random.normal(size=X_train.shape)

# Compute mean and covariance of the posterior predictive distribution
mu_s, cov_s = posterior_predictive(X, X_train, Y_train, sigma_y=noise)

plot_gp(X, mu_s, cov_s, num_samples = 10)
plt.plot(X_train, Y_train, 'rx')

[<matplotlib.lines.Line2D at 0x7f047801ba58>]

[figure: GP posterior predictive for noisy training data, with samples]

Effect of kernel parameters and noise parameter

The following example shows the effect of the kernel parameters $l$ and $\sigma_f$ as well as the noise parameter $\sigma_y$. Higher $l$ values lead to smoother functions and therefore to coarser approximations of the training data. Lower $l$ values make functions more wiggly, with wide confidence intervals between training data points. $\sigma_f$ controls the vertical variation of functions drawn from the GP. This can be seen in the wide confidence intervals outside the training data region in the right figure of the second row. $\sigma_y$ represents the amount of noise in the training data. Higher $\sigma_y$ values lead to coarser approximations, which avoids overfitting to noisy data.

params = [
    (0.3, 1.0, 0.2),
    (3.0, 1.0, 0.2),
    (1.0, 0.3, 0.2),
    (1.0, 3.0, 0.2),
    (1.0, 1.0, 0.05),
    (1.0, 1.0, 1.5),
]

plt.figure(figsize=(12, 5))

for i, (l, sigma_f, sigma_y) in enumerate(params):
    mu_s, cov_s = posterior_predictive(X, X_train, Y_train, l=l, 
                                       sigma_f=sigma_f, 
                                       sigma_y=sigma_y)
    plt.subplot(3, 2, i + 1)
    plt.subplots_adjust(top=2)
    plt.title(f'l = {l}, sigma_f = {sigma_f}, sigma_y = {sigma_y}')
    plot_gp(X, mu_s, cov_s)
    plt.plot(X_train, Y_train, 'rx')
    plt.legend()

[figure: 3x2 grid of GP posteriors for different l, sigma_f and sigma_y values]

We can define a learning criterion and estimate optimal values for these parameters. Optimal values can be estimated by maximizing the marginal log-likelihood, which is given by

$$\log p(\mathbf{y} | \mathbf{X}) = \log \mathcal{N}(\mathbf{y} | \boldsymbol{0},\mathbf{K}_y) = -\frac{1}{2} \mathbf{y}^T \mathbf{K}_y^{-1} \mathbf{y} -\frac{1}{2} \log \begin{vmatrix}\mathbf{K}_y\end{vmatrix} -\frac{N}{2} \log(2\pi)$$

Instead of maximizing, we turn this into a minimization problem: we minimize the negative marginal log-likelihood w.r.t. the parameters $l$ and $\sigma_f$, while $\sigma_y$ is set to the known noise level of the data. If the noise level is unknown, $\sigma_y$ can be estimated as well along with the other parameters. The minimization problem is solved with a gradient-based algorithm.

We avoid computing the determinant $|\mathbf{K}_y|$ directly by using the fact that the sum of the logarithms of the diagonal entries of the Cholesky factor of a matrix S equals half of the logarithm of the determinant of S, i.e. 0.5 * log(det(S)) and sum(log(diag(chol(S)))) are equal up to numerical precision.
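
As a quick sanity check of this identity, here is a small illustrative snippet (it uses plain NumPy rather than ChainerX and a randomly generated positive definite matrix, and is not part of the original notebook):

import numpy as onp  # plain NumPy, used only for this sanity check

# build a random symmetric positive definite matrix S
A = onp.random.randn(5, 5)
S = A @ A.T + 5 * onp.eye(5)

lhs = 0.5 * onp.log(onp.linalg.det(S))
rhs = onp.sum(onp.log(onp.diag(onp.linalg.cholesky(S))))
print(lhs, rhs)  # the two values agree up to numerical precision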

from scipy.optimize import minimize

def nll_fn(X_train, Y_train, noise):
    '''Returns a function that computes the negative marginal log-
    likelihood for training data X_train and Y_train and given 
    noise level.

    Args:
        X_train: training locations (M x D).
        Y_train: training targets (M x 1).
        noise: known noise level of Y_train. 

    Returns:
        Minimization objective.
    '''
    def step(theta):
        # K_y = K + sigma_y^2 * I with kernel parameters l = theta[0], sigma_f = theta[1]
        K = kernel(X_train, X_train, l=theta[0], sigma_f=theta[1]) + \
            noise**2 * np.eye(len(X_train))
        # negative log-likelihood: 0.5*log|K| + 0.5*y^T K^{-1} y + 0.5*N*log(2*pi),
        # where 0.5*log|K| is computed as sum(log(diag(chol(K))))
        return np.sum(np.log(np.diag(np.linalg.cholesky(K)))) + \
               0.5 * Y_train.T.dot(np.linalg.inv(K)).dot(Y_train) + \
               0.5 * len(X_train) * np.log(np.array(2*np.pi, dtype='float64'))
    return step

def objective_and_grad(params):
    """Returns both the objective value and its gradient. Suitable for use
    in scipy.optimize with jac=True."""
    # wrap the current parameter vector in a ChainerX array that records gradients
    theta = np.array(params, dtype='float64').require_grad()
    fun = nll_fn(X_train, Y_train, noise)
    out = fun(theta)
    out.backward()
    return chainerx.to_numpy(out), chainerx.to_numpy(theta.grad)

init_params = 0.1*np.random.normal(size=2)

# Minimize the negative log-likelihood w.r.t. parameters l and sigma_f.
res = minimize(objective_and_grad, init_params, method='L-BFGS-B', jac=True, options={'gtol': 1e-9, 'ftol': 0})

# Store the optimization results in global variables so that we can
# compare them later with the results from other implementations.
l_opt, sigma_f_opt = res.x

# Compute the posterior predictive statistics with the optimized kernel parameters and plot the results
mu_s, cov_s = posterior_predictive(X, X_train, Y_train, l=l_opt, sigma_f=sigma_f_opt, sigma_y=noise)
plot_gp(X, mu_s, cov_s)
plt.plot(X_train, Y_train, 'rx')

[<matplotlib.lines.Line2D at 0x7f04716afbe0>]

[figure: GP posterior predictive with optimized kernel parameters]

res

      fun: array([[66.3643472]])
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([ 76.02868139, 160.05121638])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F<=_FACTR*EPSMCH'
     nfev: 8
      nit: 3
   status: 0
  success: True
        x: array([0.40666028, 0.90399203])

plt.figure(figsize=(12, 8))
plt.plot(X_train, Y_train, 'rx')
plt.plot(X, np.sin(0.07*X**3), label='true')
plot_gp(X, mu_s, cov_s)
plt.legend()

<matplotlib.legend.Legend at 0x7f0471976400>

[figure: GP fit with optimized parameters compared to the true function]

References

[1] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning.
[2] Martin Krasser. Gaussian Processes.