Implementation of Gaussian Process Regression with ChainerX

The purpose of this notebook is to demonstrate the ChainerX linear algebra routines using the example of Gaussian Process Regression.

import chainerx
import chainerx as np  # use ChainerX through the familiar `np` alias
import matplotlib.pyplot as plt
%matplotlib inline
chainerx.set_default_device('cuda:0')

Unknown functions can be represented using Gaussian Processes (GPs). GPs can be used for regression and classification tasks. Here, we focus on a simple GP regression case. For details, see the book Gaussian Processes for Machine Learning [1].

We are interested in learning a function $f(\mathbf{x})$ from data $\{(\mathbf{x}_i, y_i) \mid i = 1, \dots, n\}$, where $y_i \in \mathbb{R}$ is a noisy observation of $f(\mathbf{x}_i)$.

A Gaussian process is a random process where any point $\mathbf{x} \in \mathbb{R}^d$ is assigned a random variable $f(\mathbf{x})$ and where the joint distribution of a finite number of these variables $p(f(\mathbf{x}_1), \dots, f(\mathbf{x}_N))$ is itself Gaussian:


$$p(\mathbf{f} \mid \mathbf{X}) = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, \mathbf{K})$$

Here, $\mathbf{f} = (f(\mathbf{x}_1), \dots, f(\mathbf{x}_N))$, $\boldsymbol{\mu} = (m(\mathbf{x}_1), \dots, m(\mathbf{x}_N))$ and $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$. $m$ is the mean function, usually taken to be the zero function. $\kappa$ is a positive definite kernel function or covariance function. Thus, a Gaussian process is a distribution over functions whose shape (i.e. smoothness) is defined by $\mathbf{K}$. If points $\mathbf{x}_i$ and $\mathbf{x}_j$ are considered to be similar by the kernel, the function values at these points, $f(\mathbf{x}_i)$ and $f(\mathbf{x}_j)$, can be expected to be similar too. The kernel function $\kappa$ has free hyper-parameters $\theta$.

A GP prior $p(\mathbf{f} \mid \mathbf{X})$ can be converted into a GP posterior $p(\mathbf{f} \mid \mathbf{X}, \mathbf{y})$ after having observed some data $\mathbf{y}$. The posterior can then be used to make predictions $\mathbf{f}_*$ given new inputs $\mathbf{X}_*$:


$$p(\mathbf{f}_* \mid \mathbf{X}_*, \mathbf{X}, \mathbf{y}) = \int p(\mathbf{f}_* \mid \mathbf{X}_*, \mathbf{f})\, p(\mathbf{f} \mid \mathbf{X}, \mathbf{y})\, d\mathbf{f} = \mathcal{N}(\mathbf{f}_* \mid \boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)$$

So the posterior predictive distribution is also a Gaussian, with mean $\boldsymbol{\mu}_*$ and covariance $\boldsymbol{\Sigma}_*$. By the definition of the GP, the joint distribution of the observed data $\mathbf{y}$ and the predictions $\mathbf{f}_*$ is

$$\begin{aligned} \begin{pmatrix}\mathbf{y} \\ \mathbf{f}_*\end{pmatrix} \sim \mathcal{N} \left(\boldsymbol{0}, \begin{pmatrix}\mathbf{K}_y & \mathbf{K}_* \\ \mathbf{K}_*^T & \mathbf{K}_{**}\end{pmatrix} \right) \end{aligned}$$

With $N$ training data points and $N_*$ new input points, $\mathbf{K}_y = \kappa(\mathbf{X}, \mathbf{X}) + \sigma_y^2 \mathbf{I} = \mathbf{K} + \sigma_y^2 \mathbf{I}$ is $N \times N$, $\mathbf{K}_* = \kappa(\mathbf{X}, \mathbf{X}_*)$ is $N \times N_*$ and $\mathbf{K}_{**} = \kappa(\mathbf{X}_*, \mathbf{X}_*)$ is $N_* \times N_*$. $\sigma_y^2$ is the noise term in the diagonal of $\mathbf{K}_y$. It is the noise of the target data $\mathbf{y}$, so it is set to zero if the training targets are noise-free and to a value greater than zero if the observations are noisy. The mean is set to $\boldsymbol{0}$. The sufficient statistics of the posterior predictive distribution, $\boldsymbol{\mu}_*$ and $\boldsymbol{\Sigma}_*$, can be computed with


$$\boldsymbol{\mu}_* = \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{y}$$

$$\boldsymbol{\Sigma}_* = \mathbf{K}_{**} - \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{K}_*$$
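
These expressions follow from the standard identity for conditioning a joint Gaussian. If

$$\begin{pmatrix}\mathbf{a} \\ \mathbf{b}\end{pmatrix} \sim \mathcal{N}\left(\boldsymbol{0}, \begin{pmatrix}\mathbf{A} & \mathbf{C} \\ \mathbf{C}^T & \mathbf{B}\end{pmatrix}\right),$$

then $\mathbf{b} \mid \mathbf{a} \sim \mathcal{N}\left(\mathbf{C}^T \mathbf{A}^{-1} \mathbf{a},\; \mathbf{B} - \mathbf{C}^T \mathbf{A}^{-1} \mathbf{C}\right)$. Setting $\mathbf{a} = \mathbf{y}$, $\mathbf{b} = \mathbf{f}_*$, $\mathbf{A} = \mathbf{K}_y$, $\mathbf{C} = \mathbf{K}_*$ and $\mathbf{B} = \mathbf{K}_{**}$ in the joint distribution above recovers exactly $\boldsymbol{\mu}_*$ and $\boldsymbol{\Sigma}_*$.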

Given this model, there are two basic problems to be solved. First, we need to learn the hyper-parameters $\theta$ and $\sigma_y^2$. Second, we need to predict the mean and variance of $f(\mathbf{x}_*)$ at test input points $\mathbf{X}_*$.

The squared exponential covariance function is given by

$$k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|_2^2}{2\ell^2}\right)$$

where $\sigma_f > 0$ and $\ell > 0$ are hyperparameters of the kernel. There are many other kernels that can be used for Gaussian processes. This specific covariance function is perhaps the most common one used in statistics and machine learning. It is also known as the radial basis function (RBF) kernel, the Gaussian kernel, or the exponentiated quadratic kernel.

def gaussian_rbf(x1, x2, l=1, sigma_f=1):
    """Returns the NxM kernel matrix between the two sets of inputs x1 and x2.

    Args:
        x1: NxD matrix
        x2: MxD matrix
        l: length-scale parameter
        sigma_f: vertical variation parameter

    Returns:
        NxM kernel matrix
    """
    # squared Euclidean distance between every pair of rows of x1 and x2
    dist_matrix = np.sum(np.square(x1), axis=1).reshape(-1, 1) + np.sum(np.square(x2), axis=1) - 2 * np.dot(x1, x2.T)
    return sigma_f**2 * np.exp(-1 / (2 * l**2) * dist_matrix)
kernel = gaussian_rbf

Let’s visualize the entries of the kernel matrix.

# create an Nx1 vector of equidistant points in [-3, 9]
N = 50
X = np.linspace(-3, 9, N).reshape(-1, 1)

cov = kernel(X, X)
plt.pcolor(chainerx.to_numpy(cov))
plt.colorbar()

<matplotlib.colorbar.Colorbar at 0x7f04781b4470>

[figure: heatmap of the kernel matrix]

Prior

Let’s generate samples from a Gaussian process prior. Specifically, we will consider a zero-mean Gaussian process prior for functions of the form $f : \mathbb{R} \to \mathbb{R}$ using the squared exponential kernel. That is,


$$f(x) \sim \mathcal{GP}(0, k(x, x')).$$

Let $f_n = f(x_n) \in \mathbb{R}$ be the value of the function $f$ evaluated at a point $x_n \in \mathbb{R}$. Furthermore, let $\mathbf{f} \in \mathbb{R}^N$ be the vector of function values for each of the points of $\mathbf{X}$.

The Gaussian process prior for f becomes


$$\mathbf{f} \sim \mathcal{N}(\boldsymbol{0}, \mathbf{K}),$$

where K is the kernel matrix.

To draw random functions from the GP we draw random samples from the corresponding multivariate normal. The common way of drawing such samples is via the Cholesky decomposition; the algorithm is described on Wikipedia.
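
To see why this works, note that if $\mathbf{K} = \mathbf{L}\mathbf{L}^T$ is the Cholesky factorization and $\mathbf{z} \sim \mathcal{N}(\boldsymbol{0}, \mathbf{I})$, then

$$\operatorname{Cov}(\boldsymbol{\mu} + \mathbf{L}\mathbf{z}) = \mathbf{L}\,\operatorname{Cov}(\mathbf{z})\,\mathbf{L}^T = \mathbf{L}\mathbf{L}^T = \mathbf{K},$$

so $\boldsymbol{\mu} + \mathbf{L}\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K})$, which is exactly what the function below computes (with a small jitter added to the diagonal of $\mathbf{K}$ for numerical stability).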

def generate_samples(mean, K, S):
    """Returns S samples from a Gaussian process with kernel matrix K.

    Args:
        mean: Nx1 mean vector
        K: NxN kernel matrix
        S: number of samples

    Returns:
        NxS matrix of samples
    """

    z = np.random.normal(size=(mean.shape[0], S))
    L = np.linalg.cholesky(K + 1e-6*np.eye(mean.shape[0]))  # add `1e-6*eye(N)` to the covariance matrix to avoid numerical non-positive-definiteness
    w = mean + L.dot(z)
    return w

Next, let’s define a function for plotting a Gaussian process with specified mean and 95.45% confidence interval (2σ).

def plot_gp(Xp, mu, Sigma, title="", num_samples=0):
    """Plots the GP mean, the 2-sigma confidence interval and, optionally, samples drawn from the GP."""
    mean, std = mu.ravel(), np.sqrt(np.diag(Sigma))

    # plot distribution
    plt.plot(Xp, mean, color='k', label='Mean')
    plt.plot(Xp, mean + 2*std, color='k', linestyle='--')
    plt.plot(Xp, mean - 2*std, color='k', linestyle='--')
    # plt.fill_between does not work well with chainerx arrays so we are converting the input to numpy arrays
    plt.fill_between(*map(chainerx.to_numpy, (Xp.ravel(), mean - 2*std, mean + 2*std)), alpha=0.1)

    # generate samples
    if num_samples > 0:
        fs = generate_samples(mu, Sigma, num_samples)
        plt.plot(Xp, fs, color='b', alpha=.25)

    plt.title(title)
    plt.legend()
# Finite number of points
X = np.arange(-1, 7, 0.025).reshape(-1, 1).astype('float64')

# Mean and covariance of the prior
mu = np.zeros(X.shape, dtype='float64')
cov = kernel(X, X)

# Plot GP mean, confidence interval and samples 
plot_gp(X, mu, cov, num_samples = 5)

[figure: GP prior mean, 2σ confidence interval and 5 samples]

Prediction from noise-free training data

Let’s implement the computation of $\boldsymbol{\mu}_*$ and $\boldsymbol{\Sigma}_*$ using the previously defined formulas.

def posterior_predictive(X, X_train, Y_train, l=1.0, sigma_f=1.0, sigma_y=1e-8):
    '''Computes the sufficient statistics of the GP posterior predictive distribution
    from M training data points X_train and Y_train and N new inputs X.

    Args:
        X: New input locations (N x D).
        X_train: Training locations (M x D).
        Y_train: Training targets (M x 1).
        l: Kernel length parameter.
        sigma_f: Kernel vertical variation parameter.
        sigma_y: Noise parameter.

    Returns:
        Posterior mean vector (N x 1) and covariance matrix (N x N).
    '''
    K = kernel(X_train, X_train, l, sigma_f) + sigma_y**2 * np.eye(len(X_train))
    K_s = kernel(X_train, X, l, sigma_f)
    K_ss = kernel(X, X, l, sigma_f)

    # z = K_s^T K^{-1}, computed with a linear solve instead of an explicit inverse
    z = np.linalg.solve(K, K_s).T

    mu_s = z.dot(Y_train)
    cov_s = K_ss - z.dot(K_s)

    return mu_s, cov_s

and apply them to the noise-free training data X_train and Y_train. The following example draws five samples from the posterior predictive distribution and plots them along with the mean, confidence interval and training data. In a noise-free model, the variance at the training points is zero and all random functions drawn from the posterior pass through the training points.

# Noise free training data
X_train = np.linspace(0,6,5).reshape(-1,1)
Y_train = np.sin(0.07*X_train**3)

# Compute mean and covariance of the posterior predictive distribution
mu_s, cov_s = posterior_predictive(X, X_train, Y_train)

plot_gp(X, mu_s, cov_s, num_samples = 5)
plt.plot(X_train, Y_train, 'ro', label='Training data', markersize=5)
plt.legend()

<matplotlib.legend.Legend at 0x7f0478005518>

[figure: GP posterior predictive for noise-free training data, with samples]

Prediction from noisy training data

Now let’s add some noise to the model. The training points are then only approximated and the variance at the training points is non-zero.

# Noisy training data
noise = 0.3
X_train = np.linspace(0,6,100).reshape(-1,1)
Y_train = np.sin(0.07*X_train**3)  + noise * np.random.normal(size=X_train.shape)

# Compute mean and covariance of the posterior predictive distribution
mu_s, cov_s = posterior_predictive(X, X_train, Y_train, sigma_y=noise)

plot_gp(X, mu_s, cov_s, num_samples = 10)
plt.plot(X_train, Y_train, 'rx')

[<matplotlib.lines.Line2D at 0x7f047801ba58>]

[figure: GP posterior predictive for noisy training data, with samples]

Effect of kernel parameters and noise parameter

The following example shows the effect of the kernel parameters $l$ and $\sigma_f$ as well as the noise parameter $\sigma_y$. Higher $l$ values lead to smoother functions and therefore to coarser approximations of the training data. Lower $l$ values make functions more wiggly, with wide confidence intervals between training data points. $\sigma_f$ controls the vertical variation of functions drawn from the GP. This can be seen in the wide confidence intervals outside the training data region in the right figure of the second row. $\sigma_y$ represents the amount of noise in the training data. Higher $\sigma_y$ values lead to coarser approximations, which avoids overfitting to noisy data.

params = [
    (0.3, 1.0, 0.2),
    (3.0, 1.0, 0.2),
    (1.0, 0.3, 0.2),
    (1.0, 3.0, 0.2),
    (1.0, 1.0, 0.05),
    (1.0, 1.0, 1.5),
]

plt.figure(figsize=(12, 5))

for i, (l, sigma_f, sigma_y) in enumerate(params):
    mu_s, cov_s = posterior_predictive(X, X_train, Y_train, l=l, 
                                       sigma_f=sigma_f, 
                                       sigma_y=sigma_y)
    plt.subplot(3, 2, i + 1)
    plt.subplots_adjust(top=2)
    plt.title(f'l = {l}, sigma_f = {sigma_f}, sigma_y = {sigma_y}')
    plot_gp(X, mu_s, cov_s)
    plt.plot(X_train, Y_train, 'rx')
    plt.legend()

[figure: 3x2 grid of GP posteriors for different l, sigma_f and sigma_y values]

We can define a learning criterion and estimate optimal values for these parameters. Optimal values can be estimated by maximizing the marginal log-likelihood, which is given by

$$\log p(\mathbf{y} | \mathbf{X}) = \log \mathcal{N}(\mathbf{y} | \boldsymbol{0},\mathbf{K}_y) = -\frac{1}{2} \mathbf{y}^T \mathbf{K}_y^{-1} \mathbf{y} -\frac{1}{2} \log \begin{vmatrix}\mathbf{K}_y\end{vmatrix} -\frac{N}{2} \log(2\pi)$$

Instead of maximizing, we turn this into a minimization problem: we minimize the negative marginal log-likelihood w.r.t. the parameters $l$ and $\sigma_f$, while $\sigma_y$ is set to the known noise level of the data. If the noise level is unknown, $\sigma_y$ can be estimated as well along with the other parameters. The minimization problem is solved with a gradient-based algorithm.

We avoid computing the determinant $|\mathbf{K}_y|$ directly by using the fact that the sum of the logarithms of the diagonal entries of the Cholesky factor of a matrix S equals half of the logarithm of the determinant of S, i.e. 0.5 * log(det(S)) and sum(log(diag(chol(S)))) are equal up to numerical precision.
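
As a quick sanity check of this identity, here is a small illustrative snippet (it uses plain NumPy rather than ChainerX and a randomly generated positive definite matrix, and is not part of the original notebook):

import numpy as onp  # plain NumPy, used only for this sanity check

# build a random symmetric positive definite matrix S
A = onp.random.randn(5, 5)
S = A @ A.T + 5 * onp.eye(5)

lhs = 0.5 * onp.log(onp.linalg.det(S))
rhs = onp.sum(onp.log(onp.diag(onp.linalg.cholesky(S))))
print(lhs, rhs)  # the two values agree up to numerical precision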

from scipy.optimize import minimize

def nll_fn(X_train, Y_train, noise):
    '''Returns a function that computes the negative marginal log-
    likelihood for training data X_train and Y_train and given 
    noise level.

    Args:
        X_train: training locations (M x D).
        Y_train: training targets (M x 1).
        noise: known noise level of Y_train. 

    Returns:
        Minimization objective.
    '''
    def step(theta):
        # K_y = K + sigma_y^2 * I with kernel parameters l = theta[0], sigma_f = theta[1]
        K = kernel(X_train, X_train, l=theta[0], sigma_f=theta[1]) + \
            noise**2 * np.eye(len(X_train))
        # negative log-likelihood: 0.5*log|K| + 0.5*y^T K^{-1} y + 0.5*N*log(2*pi),
        # where 0.5*log|K| is computed as sum(log(diag(chol(K))))
        return np.sum(np.log(np.diag(np.linalg.cholesky(K)))) + \
               0.5 * Y_train.T.dot(np.linalg.inv(K)).dot(Y_train) + \
               0.5 * len(X_train) * np.log(np.array(2*np.pi, dtype='float64'))
    return step

def objective_and_grad(params):
    """Returns both the objective value and its gradient. Suitable for use
    in scipy.optimize with jac=True."""
    # wrap the current parameter vector in a ChainerX array that records gradients
    theta = np.array(params, dtype='float64').require_grad()
    fun = nll_fn(X_train, Y_train, noise)
    out = fun(theta)
    out.backward()
    return chainerx.to_numpy(out), chainerx.to_numpy(theta.grad)

init_params = 0.1*np.random.normal(size=2)

# Minimize the negative log-likelihood w.r.t. parameters l and sigma_f.
res = minimize(objective_and_grad, init_params, method='L-BFGS-B', jac=True, options={'gtol': 1e-9, 'ftol': 0})

# Store the optimization results in global variables so that we can
# compare them later with the results from other implementations.
l_opt, sigma_f_opt = res.x

# Compute the posterior predictive statistics with the optimized kernel parameters and plot the results
mu_s, cov_s = posterior_predictive(X, X_train, Y_train, l=l_opt, sigma_f=sigma_f_opt, sigma_y=noise)
plot_gp(X, mu_s, cov_s)
plt.plot(X_train, Y_train, 'rx')

[<matplotlib.lines.Line2D at 0x7f04716afbe0>]

[figure: GP posterior predictive with optimized kernel parameters]

res

      fun: array([[66.3643472]])
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([ 76.02868139, 160.05121638])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F<=_FACTR*EPSMCH'
     nfev: 8
      nit: 3
   status: 0
  success: True
        x: array([0.40666028, 0.90399203])

plt.figure(figsize=(12, 8))
plt.plot(X_train, Y_train, 'rx')
plt.plot(X, np.sin(0.07*X**3), label='true')
plot_gp(X, mu_s, cov_s)
plt.legend()

<matplotlib.legend.Legend at 0x7f0471976400>

[figure: GP fit with optimized parameters compared to the true function]

References

[1] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning.
[2] Martin Krasser. Gaussian Processes.