A small note on bits back coding
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Bits back coding"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Assume we wish to use a VAE for compression. Our generative model is\n",
"\n",
"$$\n",
"p(x,z) = p(x|z)p(z)\n",
"$$\n",
"\n",
"And we approximate the true posterior $p(z|x)$ with $q(z|x)$\n",
"\n",
"Assume we have trained our model and sent all the parameters to the receiver.\n",
"\n",
"We now need to send $\\hat{x}$ to the receiver.\n",
"\n",
"We'll use two arithmetic coders to sample z from $q(z|\\hat{x})$.\n",
"\n",
"The first arithmetic coder uses $q(z|\\hat{x})$ to some precision (e.g. 32 bits floats) and runs in decode mode: given enough random bits it outputs a sample from $q(z|\\hat{x})$.\n",
"\n",
"The second arithmetic coder uses a probability model of something else we also wish to send, say text. It uses a $p(c_t|c_{t-1})$ conditional distribution of the frequencies of characters depending on the previous character. We run it in the encoder mode and input some text, say \"bits back coding is super easy to understand\", and it spits out a string of \"random\" bits, e.g. \"10100010101010010010\". \n",
"\n",
"We input text in the second encoder and feed the resulting bits into the first arithmetic coder. Once the first coder has enough bits it spits out a sample $\\hat{z} \\sim q(z|\\hat{x})$ and we stop.\n",
"\n",
"We transmit this $\\hat{z}$ to the receiver (using an arithmetic coder again), incurring $-log_2p(\\hat{z})$ bits cost. This cost will most likely be high since we chose a high precision $\\hat{z}$, e.g. 32 bit.\n",
"\n",
"We calculate $p(x|\\hat{z})$ and transmit $\\hat{x}$ using this distribution and arithmetic coding, incurring $-log_2 p(\\hat{x}|\\hat{z})$ cost. We've now transmitted $\\hat{x}$ losslessly at a cost of $-log_2 p(\\hat{z}) -log_2 p(\\hat{x}|\\hat{z})$.\n",
"\n",
"Now we calculate the posterior $p(z|\\hat{x})$ on the receiver side.\n",
"\n",
"Now we can run the second arithmetic coder in decode mode, using the posterior $p(z|\\hat{x})$ and inputting $\\hat{z}$. This will recover the message: \"bits back coding is super easy to understand\".\n",
"\n",
"We have now \"gotten some bits back\", the information content of the message. On average this will be $H(q(z|x))$ bits."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Unanswered questions\n",
"\n",
"Not all $\\hat{z}$ sampled from the posterior will be equally good. You might sample a highly unlikely $\\hat{z}$, which would make $\\hat{x}$ very unlikely under $p(x|\\hat{z})$, thus incurring an extra cost on the $-log_2 p(\\hat{x}|\\hat{z})$ term.\n",
"\n",
"However, an unlikely $z$ means you can send a bigger message.\n",
"\n",
"How much and when these two terms cancel out I don't know. Intuitively it feels like a zero sum game, and that sending the mode of the posterior instead of sampling from it should be as good as sending a sample + an extra message encoded in the sample."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}