Peter O'Connor (petered)

  • eagleeyessearch.com
@petered
petered / unbiased-online-recurrent-optimization
Last active August 25, 2017 12:57
2017-08-25 Unbiased Online Recurrent Optimization
$\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}$
$\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}$
$\newcommand{\pderivsq}[2]{\frac{\partial^2 #1}{\partial #2^2}}$
$\newcommand{\numel}[1]{|#1|}$
$\newcommand{\pderivdim}[2]{\overset{\big[\numel {#1} \times \numel {#2} \big]}{\frac{\partial #1}{\partial #2}}}$
$\newcommand{\pderivdimg}[4]{\overset{\big[#3 \times #4 \big]}{\frac{\partial #1}{\partial #2}}}$
@petered
petered / Distributed Parameter Tuning
Created August 29, 2017 12:26
2017-08-29 Parameter Tuning
# Distributed Low-Bit Computation
Suppose we're trying to communicate a scalar parameter $\theta$ from a worker $W$ to a server $S$.
$\theta$ changes with time $t$. The worker communicates bits of $\theta$ asynchronously: if it sends a bit $b \in \{0, 1\}$ at time $t \in \mathbb I^+$, we say the worker communicated a message $(b, t)$. If the worker sends $M$ messages between times $t_1$ and $t_2$, we write $N_{t_1}^{t_2} = M$.
The server takes in these bits and uses them to build a distribution $p(\hat \theta)$ over the current value of $\theta$.
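A minimal sketch of this setup (the decoding rule below, which reads the bits as a most-significant-first binary expansion of $\theta \in [0, 1)$ and treats $\theta$ as static, is an illustrative assumption, not the encoding the note asks for):

```python
from dataclasses import dataclass

@dataclass
class Message:
    b: int  # a single bit, 0 or 1
    t: int  # the integer time at which the bit was sent

class Server:
    """Builds an estimate of theta in [0, 1) from a stream of bit messages.

    Illustrative assumption: bits arrive most-significant-first, so after k
    bits the true theta lies in an interval of width 2**-k.
    """
    def __init__(self):
        self.bits = []

    def receive(self, msg: Message):
        self.bits.append(msg.b)

    def estimate(self):
        # Midpoint of the interval consistent with the received bits.
        lo = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(self.bits))
        return lo + 2.0 ** -len(self.bits) / 2

# Worker sends the first 4 bits of theta = 0.6875 = 0.1011 (binary).
server = Server()
for t, b in enumerate([1, 0, 1, 1], start=1):
    server.receive(Message(b=b, t=t))
print(server.estimate())  # 0.71875, within 2**-4 of the true theta
```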
**Can we create an encoding with the following properties?:**
@petered
petered / iterated-matrix-decomposition
Last active September 27, 2017 06:51
2017-09-26 Iterated Matrix Decomposition
$\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}$
$\newcommand{\pderivsq}[2]{\frac{\partial^2 #1}{\partial #2^2}}$
$\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}$
$\newcommand{\norm}[1]{\frac12\| #1 \|_2^2}$
$\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}$
$\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$
$\newcommand{\blue}[1]{\color{blue}{#1}}$
$\newcommand{\red}[1]{\color{red}{#1}}$
@petered
petered / kasper-project
Last active October 12, 2017 15:37
2017-10-12 Kasper
# 1) Simple Maximum Likelihood
$F \rightarrow X$
$$
p(F=1 | X=x) = \frac{p(X=x|F=1) p(F=1)}{p(X=x)} = \frac{p(X=x|F=1) p(F=1)}{p(X=x|F=0)p(F=0) + p(X=x|F=1)p(F=1)}
$$
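A quick numerical check of this formula (the likelihood and prior values below are made-up, purely for illustration):

```python
def posterior_f1(px_given_f1, px_given_f0, p_f1):
    """Bayes' rule: p(F=1 | X=x) from the two likelihoods and the prior p(F=1)."""
    p_f0 = 1.0 - p_f1
    evidence = px_given_f0 * p_f0 + px_given_f1 * p_f1  # p(X=x)
    return px_given_f1 * p_f1 / evidence

# Made-up numbers: x is four times likelier under F=1, but F=1 has prior 0.1.
print(posterior_f1(px_given_f1=0.8, px_given_f0=0.2, p_f1=0.1))  # ~0.308
```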
@petered
petered / testgist
Created October 20, 2017 08:13
Temporal Networks
# Temporal Networks
$\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}$
# The idea
Let
$(x, y)$ be the input and target data, and
$u_1, ..., u_L$ be the pre-nonlinearity activations of a neural network, and
$w_1, ..., w_L$ be the parameters, with $\cdot w(x) \triangleq x \cdot w$, and
$h_l(\cdot)$ be the nonlinearity of the $l$-th layer, and
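A minimal forward-pass sketch under these definitions (the layer composition $u_1 = x \cdot w_1$, $u_l = h_{l-1}(u_{l-1}) \cdot w_l$ is a standard convention assumed here, not stated in the note):

```python
import numpy as np

def forward(x, weights, nonlinearities):
    """Return the pre-nonlinearity activations u_1, ..., u_L.

    Assumed composition (standard convention, not from the note):
        u_1 = x . w_1,    u_l = h_{l-1}(u_{l-1}) . w_l
    """
    us = []
    h = x  # the input plays the role of h_0
    for w_l, h_l in zip(weights, nonlinearities):
        u_l = h @ w_l   # pre-nonlinearity activation of this layer
        us.append(u_l)
        h = h_l(u_l)    # post-nonlinearity output, fed to the next layer
    return us

# Tiny example: a 2-layer net with a tanh hidden nonlinearity.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 2))]
us = forward(x, weights, nonlinearities=[np.tanh, lambda u: u])
print([u.shape for u in us])  # [(1, 8), (1, 2)]
```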
@petered
petered / generative-models-assignment
Created October 20, 2017 14:27
2017-10-20 DL Assignment: Generative Models
# Generative Models
## Introduction
Generative models are models that learn the *distribution* of the data.
Suppose we have a collection of $N$ $D$-dimensional points $\{x_1, ..., x_N\}$. Each $x_i$ might represent a vector of pixels in an image, or the words in a sentence.
In generative modeling, we imagine that these points are samples from a $D$-dimensional probability distribution. The distribution represents whatever real-world process was used to generate the data. Our objective is to learn the parameters of this distribution. This allows us to do things like
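As a concrete instance of learning such a distribution, here is a minimal sketch that fits a multivariate Gaussian by maximum likelihood (an illustrative model choice, not the one the assignment prescribes):

```python
import numpy as np

# Toy data: N = 1000 points in D = 2 dimensions (stand-ins for real data).
rng = np.random.default_rng(0)
X = rng.normal(loc=[1.0, -2.0], scale=[0.5, 2.0], size=(1000, 2))

# Maximum-likelihood estimates of the Gaussian's parameters.
mu = X.mean(axis=0)                          # MLE mean
sigma = np.cov(X, rowvar=False, bias=True)   # MLE covariance (1/N normalization)

# The fitted distribution can now generate new points resembling the data.
new_samples = rng.multivariate_normal(mu, sigma, size=5)
print(mu, sigma, new_samples.shape, sep="\n\n")
```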
@petered
petered / dl-course-a1-math
Last active November 6, 2017 11:46
2017-11-03 Matrix Math
$$
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\pderivsq}[2]{\frac{\partial^2 #1}{\partial #2^2}}
\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}
\newcommand{\pderivgiven}[3]{\left.\frac{\partial #1}{\partial #2}\right|_{#3}}
\newcommand{\norm}[1]{\frac12\| #1 \|_2^2}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}
\newcommand{\blue}[1]{\color{blue}{#1}}
\newcommand{\red}[1]{\color{red}{#1}}
@petered
petered / kasper-em-on-graph
Created December 12, 2017 23:41
2017-12-12 Kasper EM
$$
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\pderivsq}[2]{\frac{\partial^2 #1}{\partial #2^2}}
\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}
\newcommand{\pderivgiven}[3]{\left.\frac{\partial #1}{\partial #2}\right|_{#3}}
\newcommand{\norm}[1]{\frac12\| #1 \|_2^2}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}
\newcommand{\blue}[1]{\color{blue}{#1}}
\newcommand{\red}[1]{\color{red}{#1}}
@petered
petered / fewfds
Created January 4, 2018 15:36
2017-11-20 Online Learning Update
$$
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\pderivsq}[2]{\frac{\partial^2 #1}{\partial #2^2}}
\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}
\newcommand{\pderivgiven}[3]{\left.\frac{\partial #1}{\partial #2}\right|_{#3}}
\newcommand{\norm}[1]{\frac12\| #1 \|_2^2}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}
\newcommand{\blue}[1]{\color{blue}{#1}}
\newcommand{\red}[1]{\color{red}{#1}}
@petered
petered / low-var-online-learning
Last active February 15, 2018 15:22
2018-01-17 Lower-Variance Online Gradient Estimates
$$
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\pderivsq}[2]{\frac{\partial^2 #1}{\partial #2^2}}
\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}
\newcommand{\pderivgiven}[3]{\left.\frac{\partial #1}{\partial #2}\right|_{#3}}
\newcommand{\norm}[1]{\frac12\| #1 \|_2^2}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}
\newcommand{\blue}[1]{\color{blue}{#1}}
\newcommand{\red}[1]{\color{red}{#1}}