boennecd/dmvnorm_arma.Rmd

## dmvnorm_arma.Rmd
---
title: Faster Multivariate Normal densities with RcppArmadillo and OpenMP
author: Nino Hardt, Dicko Ahmadou
license: GPL (>= 2)
tags: armadillo openmp featured
summary: Fast implementation of Multivariate Normal density using RcppArmadillo and OpenMP.
---

The Multivariate Normal density function is used frequently
for a number of problems. Especially for MCMC problems, fast
evaluation is important. Multivariate Normal Likelihoods,
Priors and mixtures of Multivariate Normals require numerous
evaluations, thus speed of computation is vital.
We show dramatic increases in speed by using efficient algorithms,
RcppArmadillo, and some extra gain by using OpenMP.
The code is based on the latest version of RcppArmadillo (0.3.910.0).

While the `dmvnorm()` function from the `mvtnorm` package is quite popular,
and in an earlier version of this article we demonstrated that an
Rcpp implementation would lead to faster computation.

Peter Rossi, author of `bayesm`, called our attention to the `bayesm` pure R
implementation which is much faster than `dmvnorm()`.
The function `dMvn()` is used internally by the mixture of normals model in
`bayesm`. It is the matrix-equivalent version of `lndMvn`:


```{r}
dMvn <- function(X,mu,Sigma) {
    k <- ncol(X)
    rooti <- backsolve(chol(Sigma),diag(k))
    quads <- colSums((crossprod(rooti,(t(X)-mu)))^2)
    return(exp(-(k/2)*log(2*pi) + sum(log(diag(rooti))) - .5*quads))
}
```


Translating the vectorized approach into RcppArmadillo,
we precompute the inverse of the Cholesky decomposition of the covariance
matrix ahead of the main loop over the rows of `X`.
It is worth remarking that multiplying with a precomputed the inverse of the
Cholesky decomposition of
the covariance matrix is faster but less numerically stable compared to
a backsolve as in the `Mahalanobis` function we will define later.
The loop can easily be parallelized, and the code is easy to read and
manipulate. For instance, the inverse Cholesky decomposition can be put
inside the main loop, if varying covariance matrices are necessary.
The use of `trimatu` allows to exploit that of the Cholesky
decomposition the covariance matrix is an upper triangular matrix in the
inversion.


```{r, engine='Rcpp'}
#include <RcppArmadillo.h>

static double const log2pi = std::log(2.0 * M_PI);

// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::export]]
arma::vec dmvnrm_arma(arma::mat const &x,
                      arma::rowvec const &mean,
                      arma::mat const &sigma,
                      bool logd = false) {
    int n = x.n_rows;
    int xdim = x.n_cols;
    arma::vec out(n);
    arma::mat const rooti = arma::inv(trimatu(arma::chol(sigma)));
    double const rootisum = arma::sum(log(rooti.diag())),
                constants = -xdim/2.0 * log2pi,
              other_terms = rootisum + constants;

    arma::rowvec z;
    for (int i=0; i < n; i++) {
        z      = (x.row(i) - mean) * rooti;
        out(i) = other_terms - 0.5 * arma::dot(z, z);
    }

    if (logd == false)
        out = exp(out);
    return(out);
}
```

Additionally, we can make use of the OpenMP library to use multiple
cores. For the OpenMP implementation, we need to enable OpenMP support.
One way of doing so is by adding the required compiler and linker
flags as follows:

```{r}
Sys.setenv("PKG_CXXFLAGS"="-fopenmp")
Sys.setenv("PKG_LIBS"="-fopenmp")
```

Rcpp version 0.10.5 and later will also provide a plugin to set these
variables for us:

```{r, engine="Rcpp"}
// [[Rcpp::plugins(openmp)]]
```

We also need to set the number of cores to be used before running the
compiled functions. One way is to use `detectCores()` from the `parallel`
package.

```{r}
cores <- parallel::detectCores(logical = FALSE)
```

Only two additional lines are needed to enable multicore processing.
In this example, a dynamic schedule is used for OpenMP.
A static schedule might be faster in some cases. However,this is
left to further experimentation.


```{r, engine='Rcpp'}
#include <RcppArmadillo.h>
#include <omp.h>

static double const log2pi = std::log(2.0 * M_PI);

// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::export]]
arma::vec dmvnrm_arma_mc(arma::mat const &x,
                         arma::rowvec const &mean,
                         arma::mat const &sigma,
                         bool const logd = false,
                         int const cores = 1) {
    omp_set_num_threads(cores);
    int n = x.n_rows;
    int xdim = x.n_cols;
    arma::vec out(n);
    arma::mat const rooti = arma::inv(trimatu(arma::chol(sigma)));
    double const rootisum = arma::sum(log(rooti.diag())),
                constants = -xdim/2.0 * log2pi,
              other_terms = rootisum + constants;

    arma::rowvec z;
    #pragma omp parallel for schedule(static) private(z)
    for (int i=0; i < n; i++) {
        z      = (x.row(i) - mean) * rooti;
        out(i) = other_terms - 0.5 * arma::dot(z, z);
    }

    if (logd==false)
        out=exp(out);
    return(out);
}
```


Likewise, it is easy to translate `dmvnorm` from the `mvtnorm`
package into Rcpp:

```{r, engine='Rcpp'}
#include <RcppArmadillo.h>
static double const log2pi = std::log(2.0 * M_PI);
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::export]]
arma::vec Mahalanobis(arma::mat const &x,
                      arma::vec const &center,
                      arma::mat const &cov) {
    arma::mat x_cen = x.t();
    x_cen.each_col() -= center;
    arma::solve(x_cen, arma::trimatl(chol(cov).t()), x_cen);
    x_cen.for_each( [](arma::mat::elem_type& val) { val = val * val; } );
    return arma::sum(x_cen, 0).t();
}
// [[Rcpp::export]]
arma::vec dmvnorm_arma(arma::mat const &x,
                       arma::vec const &mean,
                       arma::mat const &sigma,
                       bool const log = false) {
    arma::vec const distval = Mahalanobis(x,  mean, sigma);
    double const logdet = sum(arma::log(arma::eig_sym(sigma)));
    arma::vec const logretval =
      -( (x.n_cols * log2pi + logdet + distval)/2  ) ;

    if (log)
        return(logretval);
    return(exp(logretval));
}
```


Now we simulate some data for benchmarking:

```{r}
set.seed(123)
sigma <- bayesm::rwishart(10,diag(8))$IW
means <- rnorm(8)
X     <- mvtnorm::rmvnorm(900000, means, sigma)
```


And run the benchmark:

```{r}
print(paste0("Using ",cores," cores for _mc versions"))
require(rbenchmark)
benchmark(mvtnorm::dmvnorm(X,means,sigma,log=F),
          dmvnorm_arma(X,means,sigma,F),
          dmvnrm_arma(X,means,sigma,F) ,
          dmvnrm_arma_mc(X,means,sigma,F,cores),
          dMvn(X,means,sigma),
          order="relative", replications=100)[,1:4]
```


Lastly, we show that the functions yield the same results:

```{r}
all.equal(mvtnorm::dmvnorm(X,means,sigma,log=FALSE),
          dmvnorm_arma    (X,means,sigma,FALSE)[,1],
	      dmvnrm_arma     (X,means,sigma,FALSE)[,1],
	      dMvn            (X,means,sigma))
```


The use of RcppArmadillo brings about a significant increase
in speed. The addition of OpenMP leads to only little
additional performance.

This example also illustrates that Rcpp does not completely
substitute the need to look for faster algorithms. Basing the
code of `lndMvn` instead of `dmvnorm` leads to a significantly
faster function.
	---
	title: Faster Multivariate Normal densities with RcppArmadillo and OpenMP
	author: Nino Hardt, Dicko Ahmadou
	license: GPL (>= 2)
	tags: armadillo openmp featured
	summary: Fast implementation of Multivariate Normal density using RcppArmadillo and OpenMP.
	---

	The Multivariate Normal density function is used frequently
	for a number of problems. Especially for MCMC problems, fast
	evaluation is important. Multivariate Normal Likelihoods,
	Priors and mixtures of Multivariate Normals require numerous
	evaluations, thus speed of computation is vital.
	We show dramatic increases in speed by using efficient algorithms,
	RcppArmadillo, and some extra gain by using OpenMP.
	The code is based on the latest version of RcppArmadillo (0.3.910.0).

	While the `dmvnorm()` function from the `mvtnorm` package is quite popular,
	and in an earlier version of this article we demonstrated that an
	Rcpp implementation would lead to faster computation.

	Peter Rossi, author of `bayesm`, called our attention to the `bayesm` pure R
	implementation which is much faster than `dmvnorm()`.
	The function `dMvn()` is used internally by the mixture of normals model in
	`bayesm`. It is the matrix-equivalent version of `lndMvn`:


	```{r}
	dMvn <- function(X,mu,Sigma) {
	k <- ncol(X)
	rooti <- backsolve(chol(Sigma),diag(k))
	quads <- colSums((crossprod(rooti,(t(X)-mu)))^2)
	return(exp(-(k/2)log(2pi) + sum(log(diag(rooti))) - .5*quads))
	}
	```


	Translating the vectorized approach into RcppArmadillo,
	we precompute the inverse of the Cholesky decomposition of the covariance
	matrix ahead of the main loop over the rows of `X`.
	It is worth remarking that multiplying with a precomputed the inverse of the
	Cholesky decomposition of
	the covariance matrix is faster but less numerically stable compared to
	a backsolve as in the `Mahalanobis` function we will define later.
	The loop can easily be parallelized, and the code is easy to read and
	manipulate. For instance, the inverse Cholesky decomposition can be put
	inside the main loop, if varying covariance matrices are necessary.
	The use of `trimatu` allows to exploit that of the Cholesky
	decomposition the covariance matrix is an upper triangular matrix in the
	inversion.


	```{r, engine='Rcpp'}
	#include <RcppArmadillo.h>

	static double const log2pi = std::log(2.0 * M_PI);

	// [[Rcpp::depends("RcppArmadillo")]]
	// [[Rcpp::export]]
	arma::vec dmvnrm_arma(arma::mat const &x,
	arma::rowvec const &mean,
	arma::mat const &sigma,
	bool logd = false) {
	int n = x.n_rows;
	int xdim = x.n_cols;
	arma::vec out(n);
	arma::mat const rooti = arma::inv(trimatu(arma::chol(sigma)));
	double const rootisum = arma::sum(log(rooti.diag())),
	constants = -xdim/2.0 * log2pi,
	other_terms = rootisum + constants;

	arma::rowvec z;
	for (int i=0; i < n; i++) {
	z = (x.row(i) - mean) * rooti;
	out(i) = other_terms - 0.5 * arma::dot(z, z);
	}

	if (logd == false)
	out = exp(out);
	return(out);
	}
	```

	Additionally, we can make use of the OpenMP library to use multiple
	cores. For the OpenMP implementation, we need to enable OpenMP support.
	One way of doing so is by adding the required compiler and linker
	flags as follows:

	```{r}
	Sys.setenv("PKG_CXXFLAGS"="-fopenmp")
	Sys.setenv("PKG_LIBS"="-fopenmp")
	```

	Rcpp version 0.10.5 and later will also provide a plugin to set these
	variables for us:

	```{r, engine="Rcpp"}
	// [[Rcpp::plugins(openmp)]]
	```

	We also need to set the number of cores to be used before running the
	compiled functions. One way is to use `detectCores()` from the `parallel`
	package.

	```{r}
	cores <- parallel::detectCores(logical = FALSE)
	```

	Only two additional lines are needed to enable multicore processing.
	In this example, a dynamic schedule is used for OpenMP.
	A static schedule might be faster in some cases. However,this is
	left to further experimentation.


	```{r, engine='Rcpp'}
	#include <RcppArmadillo.h>
	#include <omp.h>

	static double const log2pi = std::log(2.0 * M_PI);

	// [[Rcpp::depends("RcppArmadillo")]]
	// [[Rcpp::export]]
	arma::vec dmvnrm_arma_mc(arma::mat const &x,
	arma::rowvec const &mean,
	arma::mat const &sigma,
	bool const logd = false,
	int const cores = 1) {
	omp_set_num_threads(cores);
	int n = x.n_rows;
	int xdim = x.n_cols;
	arma::vec out(n);
	arma::mat const rooti = arma::inv(trimatu(arma::chol(sigma)));
	double const rootisum = arma::sum(log(rooti.diag())),
	constants = -xdim/2.0 * log2pi,
	other_terms = rootisum + constants;

	arma::rowvec z;
	#pragma omp parallel for schedule(static) private(z)
	for (int i=0; i < n; i++) {
	z = (x.row(i) - mean) * rooti;
	out(i) = other_terms - 0.5 * arma::dot(z, z);
	}

	if (logd==false)
	out=exp(out);
	return(out);
	}
	```


	Likewise, it is easy to translate `dmvnorm` from the `mvtnorm`
	package into Rcpp:

	```{r, engine='Rcpp'}
	#include <RcppArmadillo.h>
	static double const log2pi = std::log(2.0 * M_PI);
	// [[Rcpp::depends("RcppArmadillo")]]
	// [[Rcpp::export]]
	arma::vec Mahalanobis(arma::mat const &x,
	arma::vec const &center,
	arma::mat const &cov) {
	arma::mat x_cen = x.t();
	x_cen.each_col() -= center;
	arma::solve(x_cen, arma::trimatl(chol(cov).t()), x_cen);
	x_cen.for_each( [](arma::mat::elem_type& val) { val = val * val; } );
	return arma::sum(x_cen, 0).t();
	}
	// [[Rcpp::export]]
	arma::vec dmvnorm_arma(arma::mat const &x,
	arma::vec const &mean,
	arma::mat const &sigma,
	bool const log = false) {
	arma::vec const distval = Mahalanobis(x, mean, sigma);
	double const logdet = sum(arma::log(arma::eig_sym(sigma)));
	arma::vec const logretval =
	-( (x.n_cols * log2pi + logdet + distval)/2 ) ;

	if (log)
	return(logretval);
	return(exp(logretval));
	}
	```



	Now we simulate some data for benchmarking:

	```{r}
	set.seed(123)
	sigma <- bayesm::rwishart(10,diag(8))$IW
	means <- rnorm(8)
	X <- mvtnorm::rmvnorm(900000, means, sigma)
	```



	And run the benchmark:

	```{r}
	print(paste0("Using ",cores," cores for _mc versions"))
	require(rbenchmark)
	benchmark(mvtnorm::dmvnorm(X,means,sigma,log=F),
	dmvnorm_arma(X,means,sigma,F),
	dmvnrm_arma(X,means,sigma,F) ,
	dmvnrm_arma_mc(X,means,sigma,F,cores),
	dMvn(X,means,sigma),
	order="relative", replications=100)[,1:4]
	```



	Lastly, we show that the functions yield the same results:

	```{r}
	all.equal(mvtnorm::dmvnorm(X,means,sigma,log=FALSE),
	dmvnorm_arma (X,means,sigma,FALSE)[,1],
	dmvnrm_arma (X,means,sigma,FALSE)[,1],
	dMvn (X,means,sigma))
	```


	The use of RcppArmadillo brings about a significant increase
	in speed. The addition of OpenMP leads to only little
	additional performance.

	This example also illustrates that Rcpp does not completely
	substitute the need to look for faster algorithms. Basing the
	code of `lndMvn` instead of `dmvnorm` leads to a significantly
	faster function.