{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# Efficiently evaluating log-likelihood gradients while keeping a clean design"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a problem I've been having for quite a while in my (purely personal) science projects and have swept under the rug so far."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Backstory"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I'll keep this brief. My research is / was concerned with Bayesian methods for biomolecular structure determination. We're looking for the posterior probability distribution of a structure $X$ of a molecule (in my case, a part of a chromosome, a whole chromosome, or a whole genome), given some experimental data $D$ and prior information about the structure, such as the fact that it is a polymer. \n",
"So we're looking to characterize the probability distribution\n",
"$$\n",
"p(X|D,I) \\propto p(D|X,I) p(X|I) \\ \\mbox .\n",
"$$\n",
"Here I used Bayes' theorem ($P(A|B) = \\frac{P(B|A) P(A)}{P(B)}$) to decompose the posterior probability $p(X|D,I)$ into a _likelihood_ $p(D|X,I)$ and a _prior_ $p(X|I)$. We will, in the following, only be concerned with the likelihood, or, more precisely, the logarithm of the _likelihood function_ $\\mathcal L(X)$. The latter is the likelihood, but viewed as a function of the structure $X$ and with the data $D$ kept fixed:\n",
"$$\n",
"l(X) = \\log \\mathcal L(X) = \\log p(D|X,I)\n",
"$$\n",
"with $D, I = \\mathrm{const.}$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To efficiently draw samples from the posterior distribution, we use an _en vogue_ MCMC algorithm called [Hamiltonian Monte Carlo](https://en.wikipedia.org/wiki/Hamiltonian_Monte_Carlo) (HMC). It, in turn, requires the gradient of the log-posterior probability and thus of the log-likelihood function. That's exactly where the dog lies buried, as we say in German."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Efficiently, and beautifully, evaluating the gradient of $l(X)$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We often decompose the likelihood further into a _forward model_ $f(X)$, which back-calculates idealized (or _mock_ ) data $\\hat D$ from a the structure $X$ and an _error model_ $g(\\hat D;D)$, which models by which the measured data $D$ deviates from the mock data $\\hat D$. The log-likelihood function is then given by\n",
"$$\n",
"l(X) = \\log \\mathcal L(X) = \\log g(f(X);D) \\ \\mbox .\n",
"$$\n",
"Now it's important to know the approximate dimensionality of the problem. In my current project, $X$ is a vector with $\\sim 2\\times10^4$ entries and there are $\\sim 10^3$ data points. Let's stick with these numbers for this example, but keep in mind that good scaling with both of these numbers is desired. That means that\n",
"$$\n",
"\\begin{align*}\n",
"f: \\mathbb{R}^{2\\times10^4} &\\rightarrow \\mathbb{R}^{10^3} \\ \\mbox , \\\\\n",
"g: \\mathbb{R}^{10^3} &\\rightarrow \\mathbb{R} \\mbox .\n",
"\\end{align*}\n",
"$$"
]
},
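{
"cell_type": "markdown",
"metadata": {},
"source": [
"In code, the forward model and the error model naturally become two separate objects. A minimal sketch of their interfaces (the method names `evaluate`, `jacobian` and `gradient` are the ones used further below; everything else is schematic):\n",
"```\n",
"class ForwardModel:\n",
"    def evaluate(self, X):\n",
"        # back-calculate mock data f(X) from the structure X\n",
"        raise NotImplementedError\n",
"\n",
"    def jacobian(self, X):\n",
"        # derivative of f with respect to X, as some matrix-like object\n",
"        raise NotImplementedError\n",
"\n",
"\n",
"class ErrorModel:\n",
"    def gradient(self, mock_data):\n",
"        # gradient of the log-error model with respect to the mock data\n",
"        raise NotImplementedError\n",
"```"
]
},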
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, the issue at hand is efficiently evaluating the gradient of $l(X)$, or $\\nabla l(X)$. Multivariate calculus tells us that\n",
"$$\n",
"\\nabla l(X) = \\nabla l(f(X)) = D_f(X) \\cdot \\nabla_y \\left.l(y)\\right|_{y=f(X)} \\ \\mbox .\n",
"$$"
]
},
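{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just to convince ourselves that this formula (and the layout convention for $D_f(X)$ explained below) is right, here is a toy example, completely unrelated to the actual forward and error models, that checks the chain-rule gradient against finite differences:\n",
"```\n",
"import numpy as np\n",
"\n",
"def f(X):\n",
"    # toy forward model mapping 4 structure parameters to 2 mock data points\n",
"    return np.array([X[0] * X[1], np.sin(X[2]) + X[3] ** 2])\n",
"\n",
"def jacobian_f(X):\n",
"    # rows indexed by components of X, columns by data points\n",
"    J = np.zeros((4, 2))\n",
"    J[0, 0] = X[1]\n",
"    J[1, 0] = X[0]\n",
"    J[2, 1] = np.cos(X[2])\n",
"    J[3, 1] = 2 * X[3]\n",
"    return J\n",
"\n",
"def log_g(y):\n",
"    # toy Gaussian log-error model with made-up data\n",
"    return -0.5 * np.sum((y - np.array([1.0, 2.0])) ** 2)\n",
"\n",
"def grad_log_g(y):\n",
"    return -(y - np.array([1.0, 2.0]))\n",
"\n",
"X = np.array([0.3, -1.2, 0.7, 2.0])\n",
"chain_rule = jacobian_f(X).dot(grad_log_g(f(X)))\n",
"h, eye = 1e-6, np.eye(4)\n",
"finite_diff = np.array([(log_g(f(X + h * eye[i])) - log_g(f(X - h * eye[i]))) / (2 * h)\n",
"                        for i in range(4)])\n",
"print(np.allclose(chain_rule, finite_diff, atol=1e-5))\n",
"```"
]
},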
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That means the gradient we're looking for is a matrix-vector product between the Jacobian matrix $D_f(X)$ of the forward model $f(X)$ and the gradient $\\nabla_y l(y)$ of the log-error model, evaluated at $y=f(X)$. In terms of dimensions, $D_f(X)$ is $(10^3, 2\\times10^4)$ and $\\nabla_y l(y)$ is of length $10^3$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It also turns out that the Jacobian $D_f(X)$ is relatively sparse. Entries in a given column of $D_f(X)$ are non-zero only for $2\\times3\\times20 = 120$ out of $2\\times10^4$ rows."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Solution 1: calculate $D_f(X)$ explicitly and save it in a sparse matrix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All this happens in Python with some Cython in the mix. So the most obvious solution would be use Cython to efficiently create a sparse matrix as implemented in `scipy.sparse`. The most convenient way to do this is to build a sparse matrix in coordinate format, meaning, you create a list of row- and column indices for which the matrix entries are non-zero and a third list containing the values at the specified matrix coordinates."
]
},
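{
"cell_type": "markdown",
"metadata": {},
"source": [
"Schematically (with tiny made-up index and value arrays standing in for what the Cython code actually computes), this looks as follows:\n",
"```\n",
"import numpy as np\n",
"from scipy.sparse import coo_matrix\n",
"\n",
"# row / column indices of the non-zero Jacobian entries and their values;\n",
"# in reality, these arrays are filled efficiently in Cython\n",
"rows = np.array([0, 0, 1, 3])\n",
"cols = np.array([0, 2, 1, 2])\n",
"vals = np.array([0.5, -1.3, 2.0, 0.7])\n",
"\n",
"# one row per component of X, one column per data point\n",
"jacobian = coo_matrix((vals, (rows, cols)), shape=(4, 3))\n",
"\n",
"error_model_gradient = np.array([0.1, -0.2, 0.3])\n",
"gradient = jacobian.dot(error_model_gradient)  # dense vector of length 4\n",
"```"
]
},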
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Advantages**:\n",
"- clear design: I have a `Likelihood` class, which has instance attributes `error_model` and `forward_model`. `error_model` has an instance attribute `gradient` and `forward model` has an instance attribute `jacobian`. Then, in `Likelihood`, I just have the gradient as\n",
"```\n",
"def gradient(self, X):\n",
" mock_data = self.forward_model.evaluate(X)\n",
" return self.forward_model.jacobian(X).dot(self.error_model.gradient(mock_data))\n",
"```\n",
"Easy peasy, clean and readable.\n",
"\n",
"**Disadvantages**:\n",
"- it's slow: according to the `scipy.sparse` documentation, the COO sparse matrix (`scipy.sparse.coo_matrix`) has to be converted to a, in my case, a sparse row matrix (`scipy.sparse.csr_matrix`) for efficient matrix-vector multiplication. Interestingly enough, the matrix-vector multiplication using the COO matrix is about twice as fast as converting to the CSR matrix and then doing the matrix-vector multiplication.\n",
"\n",
"So this would be my preferred solution, because it's very clean: it's very readable (if you have the maths background knowledge) and there's a clear separation of concerns in the three classes `Likelihood`, `ErrorModel` and `ForwardModel`. **But**, unfortunately, it's still around two times slower than..."
]
},
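{
"cell_type": "markdown",
"metadata": {},
"source": [
"The COO-versus-CSR timing claim is easy to check in the notebook itself; with `jacobian` and `error_model_gradient` as in the sketch above (just at realistic sizes), the comparison boils down to:\n",
"```\n",
"%timeit jacobian.dot(error_model_gradient)\n",
"%timeit jacobian.tocsr().dot(error_model_gradient)\n",
"```"
]
},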
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Solution 2: explicitly code up the matrix-vector product in Cython"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's what I've been doing in \"production\" so far. Each combination of error model and gradient would lead to a separate, derived `Likelihood` class, with a `gradient` function coded up in Cython which just evaluates the matrix-vector product directly. By that I mean that I don't separately calculate the Jacobi matrix, but code up each scalar product of Jacobi matrix row and error model gradient vector directly.\n",
"\n",
"**Advantages**:\n",
"- avoids building the sparse matrix\n",
"- avoids creating a bunch of temporary `numpy` arrays\n",
"- all in all: about two times as fast as Solution 1\n",
"\n",
"**Disadvantages**:\n",
"- mixes concerns: the derived likelihood class contains error model- and forward model-specific code in `gradient`\n",
"- less readable: the code to calculate the gradient has to be extensively commented to make clear what's going on\n",
"- more analytical calculation involved: although it's only additional scalar products one has to figure out with pen and paper when implementing a new error model / forward model combination, it's an additional burden on the developer\n",
"\n",
"A dirty detail: evaluation of the Jacobi matrix requires quite a few expensive operations (evaluation of euclidic distances, sigmoidal functions) which are also needed for evaluating the forward model $f(X)$ itself, at which, in turn, the error model gradient $\\nabla_y l(y)$ is evaluated. So when calculating the Jacobian matrix , I calculate $f(X)$, too, cache it as an instance attribute of `forward_model`and reuse it when evaluating `Likelihood.gradient()`."
]
},
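{
"cell_type": "markdown",
"metadata": {},
"source": [
"To give the flavor of what such a fused gradient looks like, here is a sketch in plain Python (the real thing is Cython) for a toy forward model, distances between consecutive monomers with a Gaussian error model, which is not my actual model:\n",
"```\n",
"import numpy as np\n",
"\n",
"def fused_gradient(X, data):\n",
"    # X: flattened 3D monomer positions; data: measured distances\n",
"    # between consecutive monomers; Gaussian error model\n",
"    positions = X.reshape(-1, 3)\n",
"    grad = np.zeros_like(positions)\n",
"    for j in range(len(positions) - 1):\n",
"        diff = positions[j + 1] - positions[j]\n",
"        distance = np.sqrt(np.sum(diff ** 2))  # mock data point f_j(X)\n",
"        w = -(distance - data[j])              # j-th component of the error model gradient\n",
"        contribution = w * diff / distance     # column j of the Jacobian times w,\n",
"        grad[j + 1] += contribution            # accumulated directly into the gradient\n",
"        grad[j] -= contribution                # (note the +/- symmetry)\n",
"    return grad.ravel()\n",
"\n",
"X = np.random.standard_normal(5 * 3)\n",
"print(fused_gradient(X, np.ones(4)))\n",
"```\n",
"Each data point here only touches the six coordinates of the two monomers involved, which is exactly the kind of sparsity discussed above."
]
},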
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Profiling results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's a rudimentary profiling of both solutions. Only the first few entries, ordered by total time, are shown:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Solution 1\n",
"```\n",
"In [4]: %prun for _ in range(1000): L.gradient(**init.variables)\n",
" 113003 function calls (112003 primitive calls) in 1.817 seconds\n",
"\n",
" Ordered by: internal time\n",
"\n",
" ncalls tottime percall cumtime percall filename:lineno(function)\n",
" 1000 1.203 0.001 1.203 0.001 {biochrose.forward_models.ensemble_contacts_c.jacobian}\n",
" 1000 0.311 0.000 0.311 0.000 {built-in method scipy.sparse._sparsetools.coo_matvec}\n",
" 1000 0.027 0.000 1.790 0.002 __init__.py:50(_evaluate_gradient)\n",
" 1000 0.021 0.000 0.021 0.000 {biochrose.error_models.poisson_c.gradient}\n",
" 1000 0.019 0.000 0.019 0.000 {method 'reduce' of 'numpy.ufunc' objects}\n",
" 1000 0.018 0.000 1.336 0.001 ensemble_contacts.py:100(_evaluate_jacobi_matrix)\n",
" 4000 0.014 0.000 0.014 0.000 {built-in method numpy.array}\n",
" 1000 0.012 0.000 0.064 0.000 coo.py:128(__init__)\n",
" 1000 0.011 0.000 0.011 0.000 {built-in method numpy.zeros}\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Solution 2\n",
"Much less interesting:\n",
"```\n",
"In [3]: %prun for _ in range(1000): L.gradient(**init.variables)\n",
" 23038 function calls in 0.874 seconds\n",
"\n",
" Ordered by: internal time\n",
"\n",
" ncalls tottime percall cumtime percall filename:lineno(function)\n",
" 1000 0.825 0.001 0.825 0.001 {biochrose.likelihoods.ensemble_contacts_c.calculate_gradient}\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So we find that the generation of the sparse matrix is already too slow."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What I ended up doing\n",
"While writing this all up and, most of all, while doing some more profiling, it occured to me that there might be a better solution. I noticed that the most expensive thing is actually filling the coordinate and value arrays for the sparse matrices. The fact that, in my model, one data point stems from only one (actually, in this case, 20 similar) pairwise interactions between two monomers (units of which the polymer is made of) gives rise to symmetries I have been exploiting already before (basically, Newton's third law—when one monomer exerts a force on another, the other one exerts exactly the same force back). \n",
"\n",
"For this type of data-generating process, and all my model so far are of this type, I now implemented a mock matrix object with a function `.dot()`, which, given the gradient of the error model as an argument, basically performs the sparse matrix-vector multiplication, but requires only the array of values and not the coordinates. Furthermore, because of the afore-mentioned symmetry, the vector of values needs to contain only the results for one set of monomers, because the other values are just their negatives."
]
},
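{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the toy distance model from the Solution 2 sketch, such a mock matrix object (with hypothetical names; my actual class handles the 20-membered ensembles and the real forward model) boils down to something like:\n",
"```\n",
"import numpy as np\n",
"\n",
"class MockJacobian:\n",
"    # emulates just enough of the matrix interface for the likelihood\n",
"    # gradient: a dot() method performing the matrix-vector product\n",
"    def __init__(self, values):\n",
"        # values[j] is the derivative of data point j with respect to the\n",
"        # position of monomer j + 1; the derivative with respect to monomer j\n",
"        # is just -values[j], so neither it nor any coordinates are stored\n",
"        self.values = values\n",
"\n",
"    def dot(self, error_model_gradient):\n",
"        grad = np.zeros((len(self.values) + 1, 3))\n",
"        for j, value in enumerate(self.values):\n",
"            contribution = error_model_gradient[j] * value\n",
"            grad[j + 1] += contribution\n",
"            grad[j] -= contribution\n",
"        return grad.ravel()\n",
"```\n",
"The forward model's `jacobian()` then returns such an object, and `Likelihood.gradient` from Solution 1 stays exactly as it is."
]
},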
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This means I do not need to create, fill and pass around the coordinate arrays of a COO matrix and the number of values being passed around is cut by half. Thanks to the matrix-emulating interface (using `.dot()`), the likelihood gradient can be agnostic of what kind of matrix $D_f(X)$ is. For tests or easy to evaluate models, it can just be a two-dimensional `numpy` array. Or it can take a `scipy.sparse` matrix or, well, my mock matrix object, all of which expose the `.dot()` interface for matrix(-vector) multiplication."
]
},
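{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick way to see that this duck typing works out, reusing the hypothetical `MockJacobian` from the sketch above:\n",
"```\n",
"import numpy as np\n",
"from scipy.sparse import random as sparse_random\n",
"\n",
"error_model_gradient = np.ones(4)                     # 4 data points\n",
"candidates = [\n",
"    np.random.standard_normal((15, 4)),               # dense array, e.g. for tests\n",
"    sparse_random(15, 4, density=0.1, format='coo'),  # generic sparse matrix\n",
"    MockJacobian(np.random.standard_normal((4, 3))),  # symmetry-exploiting object\n",
"]\n",
"for jacobian in candidates:\n",
"    print(jacobian.dot(error_model_gradient).shape)   # (15,) in every case\n",
"```"
]
},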
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Epilogue\n",
"If you actually read until here, think this is interesting and you want to know more about what this is about, let me know—I'm happy to chat about that stuff!\n",
"\n",
"Or, you can read the following preprint: https://www.biorxiv.org/content/10.1101/493676v1. There's an updated version of that paper, too, but it's paywalled."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}