@eamartin
Created August 27, 2013 20:19
Notes from a summer of working with the LISA group. Mostly about what I learned about training supervised GSNs.
Feel free to contact me with any questions about this material:
email: eric@ericmart.in
Skype: eric.a.martin
Here are some notes from a summer of work. I'm primarily writing these down
so the knowledge doesn't get lost. Most of this document describes what
did and did not work for training supervised GSNs. I'm also including a couple
of curious things I found over the summer.
Supervised GSN training experience
----------------------------------
These supervised GSNs attempted to learn the joint distribution between two vectors.
One vector was placed at the bottom of the GSN, and the other was placed on top
of the GSN (as the top layer), above the hidden layers.
For my work, I was primarily trying to get results on MNIST. The 784-component image
was the bottom layer of the GSN, and the 10-component prediction vector was the top layer.
I'll refer to the bottom layer (the images) as x and the top layer (predictions) as y.
All experiments were done with binary cross entropy costs on both the top and
bottom layers. Training was done by SGD with a momentum term. The
MonitorBasedLRAdjuster of pylearn2 was used to adjust the learning rate. Various
combinations of network depths, layer sizes, and noise types and magnitudes were
explored. The cross entropy costs on all layers were normalized by the vector size
(so the cost was averaged over the components of the vector).
I was primarily looking at the classification accuracy on MNIST, while trying to
maintain models that appeared to both sample (and mix) well. Training was done with
noise, but prediction was done without any noise. Classification was done by running
the GSN, averaging the computed prediction vectors, and then taking the
argmax. Classification performance was best when only the first computed prediction
vector was used. Classification also involved clamping (i.e. fixing at each iteration)
the x vector.
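In pseudocode, the prediction procedure looked roughly like the sketch below. `gsn_step` is a hypothetical stand-in for one noiseless step of the trained GSN; the real code lived in pylearn2/Theano.

```python
import numpy as np

def classify(gsn_step, x, n_steps=5):
    """Clamp x at every iteration, run the GSN without noise, average the
    predicted y vectors, and take the argmax.
    `gsn_step` is a hypothetical function (x, y) -> (x_recon, y_pred)
    performing one noiseless step of the trained GSN."""
    y = np.full(10, 0.1)          # uninformative initial prediction (assumed)
    preds = []
    for _ in range(n_steps):
        _, y = gsn_step(x, y)     # x is re-clamped, so the reconstruction is discarded
        preds.append(y)
    # in practice, using only the first prediction (n_steps=1) worked best
    return np.argmax(np.mean(preds, axis=0))
```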
Note that there are quite a few different ways to train a joint model such as a GSN.
Possibilities include setting both x and y, setting just x and predicting y, and
vice versa.
Key takeaways:
Setting both x and y during training resulted in bad classification performance.
I was never able to get under ~3.8% error with this sort of training. The 3.8% error
was on a network with a single hidden layer, and the error became much worse
as the network became deeper (50% for a network with 3 hidden layers). I believe this
happened because x and y each learned to autoencode themselves separately and the
network never learned to communicate between layers. Adding large amounts of noise
to the top layer did not help with this problem. I believe this is because of
the very small amount of information at the top layer (it's just a 10-component
one-hot vector, so log_2(10) ~= 3.3 bits) and the relatively large layers (at least
100 neurons) next to the top layer.
The non-communicating layers hypothesis was also supported by the relative success
of training where x was given and y was predicted (with backprop on both x and y). This
sort of training forced information from the bottom layer to reach the top layer in order
for the top layer to have predictive power. I achieved 1.25% error (125 errors) on MNIST
using this sort of training on a network with a single hidden layer. Notably, there
seemed to be a trade-off between the quantity of noise (and how well the model mixed)
and classification accuracy. Again, networks with just a single hidden layer
performed better than networks with multiple hidden layers, but the difference wasn't
nearly as large in this case. Results varied greatly for networks with 2 or 3 hidden layers,
but I had a trick that got about 1.6% error.
This trick involved doing training very similarly to Ian Goodfellow's paper on
jointly training deep Boltzmann machines. I would set some subset of the inputs,
run the network, and compute the cost function on the complement of that subset.
I would generally keep about 75% of the x units and only 20% of the y units (the
one-hot vector). Computing costs on all of the elements rather than just the
complement had no significant impact on performance (a loss of 0.05%, probably not
meaningful). All of these trials were done with no noise anywhere within the
network (adding noise hurt classification). Running with costs evaluated on
all elements seems like it should be identical to just applying dropout to the
top and bottom layers, but the results were considerably worse when I just applied
dropout to the top and bottom layers. The only difference between these two approaches
is that the standard dropout solution applies the corruption at every iteration of
the model, while my DBM-inspired approach only applied the corruption before
initializing the network.
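A rough NumPy sketch of this DBM-inspired corruption, with `run_gsn` and `cost` as hypothetical stand-ins for the real model and cost function:

```python
import numpy as np

def dbm_style_cost(x, y, run_gsn, cost, rng=np.random):
    """Mask a random subset of the visible units ONCE, before the network is
    initialized, then run the GSN and score only the masked-out (complement)
    units. `run_gsn(x, y) -> (x_out, y_out)` and `cost(target, output)` are
    hypothetical stand-ins for the actual model and cost."""
    keep_x = rng.rand(x.shape[0]) < 0.75   # keep ~75% of the x units
    keep_y = rng.rand(y.shape[0]) < 0.20   # keep ~20% of the y (one-hot) units
    x_in, y_in = x * keep_x, y * keep_y    # corruption applied once, up front
    x_out, y_out = run_gsn(x_in, y_in)     # no further corruption per iteration
    # cost on the complement of the kept subset (scoring all units made
    # almost no difference in practice)
    return cost(x[~keep_x], x_out[~keep_x]) + cost(y[~keep_y], y_out[~keep_y])
```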
Other things of note:
I generally applied noise only on the input, the top hidden layer, and the prediction
layer, because this appeared to work best.
To corrupt the y (one-hot) layer, I added Gaussian noise of magnitude 0.75 and then
took the softmax of the vector (to give a probability vector whose components sum to 1).
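A small sketch of that corruption, assuming "magnitude" means the standard deviation of the Gaussian noise:

```python
import numpy as np

def corrupt_y(y_onehot, noise_std=0.75, rng=np.random):
    """Corrupt the one-hot y layer: add Gaussian noise, then renormalize with
    a softmax so the result is again a probability vector."""
    z = y_onehot + noise_std * rng.randn(*y_onehot.shape)
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()
```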
Training a model with both x and y set and then doing more training predicting y
given x didn't give the absolute best classification results, but it did produce models
that both sampled well and classified fairly well.
Other things I found throughout the summer
------------------------------------------
One interesting thing happened while I was debugging my GSN. I reduced the case until
I was dealing with a single autoencoder. I was attempting to learn the identity
map with an autoencoder with no noise, linear activations, tied weights, a
mean squared error cost function, and SGD. The input data was uniformly distributed
within the n=10 dimensional unit hypercube, and the hidden and output layers each had n units.
Ideally, this autoencoder should easily learn the identity function. However, when
I attempted to train this autoencoder with bias terms included, the weights
went to 0 and the biases just learned the center of the hypercube (the mean squared error
was equal to the variance of the distribution). The output with biases is
W'(Wx + b_1) + b_2, where W' is the transpose of W. When I ran the same network
without bias terms (or with the biases constrained to 0, which is the same thing), the network fairly
quickly learned a W such that W'Wx = x => W'W = I => W is an orthogonal matrix. This
was an interesting case of adding power to the model (via the bias terms) resulting in
worse performance due to the creation of a new local minimum (at W = 0).
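Here is a self-contained NumPy sketch of that setup (full-batch gradient descent rather than SGD, for brevity); whether it lands in the W = 0 solution or the orthogonal one may depend on the initialization scale and learning rate:

```python
import numpy as np

rng = np.random.RandomState(0)
n, m, lr, epochs = 10, 512, 0.05, 3000
X = rng.rand(m, n)                        # uniform on the n-dim unit hypercube

W = 0.1 * rng.randn(n, n)                 # tied weights: decoder uses W.T
b1, b2 = np.zeros(n), np.zeros(n)
use_bias = True                           # set False for the bias-free run

for _ in range(epochs):
    H = X @ W.T + b1                      # hidden code: Wx + b1
    O = H @ W + b2                        # reconstruction: W'(Wx + b1) + b2
    E = O - X                             # reconstruction error
    grad_W = (H.T @ E + W @ (E.T @ X)) / m
    W -= lr * grad_W
    if use_bias:
        b2 -= lr * E.mean(axis=0)
        b1 -= lr * (E @ W.T).mean(axis=0)

print("reconstruction MSE:", np.mean(E ** 2))
print("||W'W - I||_F:", np.linalg.norm(W.T @ W - np.eye(n)))
```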
Also, I experimented with Radford Neal's funnel distribution (code at
https://github.com/lightcatcher/funnel_gsn ). This distribution is known for being
difficult to sample from. My various attempts to model this distribution were
unsuccessful (I couldn't get the GSN to capture any structure). I could get correct
marginal distributions for some of the variables, but the joint distribution would
be all wrong.
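For reference, exact samples from the funnel can be drawn directly (assuming the standard parameterization, v ~ N(0, 3^2) and x_i | v ~ N(0, e^v)), which is handy as ground truth when checking what a GSN has learned:

```python
import numpy as np

def sample_funnel(n_samples, n_x=9, rng=np.random):
    """Exact samples from Neal's funnel: v ~ N(0, 9), x_i | v ~ N(0, exp(v))."""
    v = 3.0 * rng.randn(n_samples)                       # "neck" variable
    x = np.exp(v / 2.0)[:, None] * rng.randn(n_samples, n_x)
    return v, x
```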