Deep Q-Learning (Space Invaders)
09 MAR 2016
Ever since I learned about neural networks playing Atari games, I have wanted to reimplement the technique and learn how it works. Below you can see an AI playing Space Invaders. I trained it during my batch at Recurse Center on a little over 50M frames.
It is even more impressive when you realize that the AI was trained in a way similar to how a human would learn: the only inputs are the screen and the number of points gained (or lost) after each action taken by the AI.
DQN does much better than a best-action strategy (do nothing but shoot) and a random strategy.
| Algorithm   | Points |
|-------------|--------|
| DQN         | 550    |
| Best action | 240    |
| Random      | 150    |
You can play with my implementation here: Deep Q-Learning. This is the first post on the topic; stay tuned for the next ones!
[Plot] Average game reward (over 600 games) after N games played. The blue line is the random-strategy baseline; the red line is the best-action-strategy baseline.
# Algorithm

So what is Deep Q-Learning (DQN)? Below you will find a gentle introduction. I omit certain details for the sake of simplicity and I encourage you to read the original paper.
The task of the neural network in DQN is to learn which action to take based on the screen and previous experience. Obviously the neural network should choose the best action, but how does it learn which one is best?
It turns out the neural network can be pretty simple: the input is the game screen, and the hidden layers consist of 3 convolutional layers and a single fully connected layer. The number of neurons in the last layer corresponds to the number of actions that can be taken. In the case of Space Invaders there were 4 actions (do nothing, shoot, go left, go right), therefore there were 4 neurons in the output layer.
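For concreteness, here is a minimal sketch of such a network in PyTorch. This is not the original Theano implementation; the filter sizes below are the standard DeepMind ones, which the post does not spell out.

```python
import torch.nn as nn

class DQN(nn.Module):
    """Sketch of the network shape described above: 3 conv layers plus one
    fully connected hidden layer, with one output neuron per action."""

    def __init__(self, n_actions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),  # input: 4 stacked 84x84 frames
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),                             # one Q-value estimate per action
        )

    def forward(self, screens):        # screens: (batch, 4, 84, 84), scaled to [0, 1]
        return self.head(self.features(screens))
```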
The tricky and crucial part is the loss function. In ‘normal’ neural networks the loss function is straightforward, as for each training example $x$ there is a known expected outcome $y$. Nothing like that is available in our case, but we can deal with it thanks to some insights from Q-Learning!
## Loss function
The best measure of the quality of an action is the accumulated future reward

$$\sum_{i=t_0} r_i$$

where the sum is taken over time from $t_0$ until the end of the game and $r_i$ is the reward gained at time $i$. There are a couple of problems with that simplified definition, and we'll deal with them one by one.
For one, there is no way to calculate that sum, as we don't know the future. We'll deal with this at the end, though.
Intuitively speaking, the immediate reward $r_{t_0}$ should be more valuable than a very distant one. After all, the future is uncertain and we might never get that distant reward at all.
Let's introduce the discounted accumulated future reward:

$$\sum_{i=t_0} \gamma^{i} r_i$$

where $\gamma$ is between 0 and 1. We'll set $\gamma$ to 0.99, though, as distant rewards are still very important.
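In code, the discounted accumulated reward of a finished game is just a weighted sum over the observed rewards; a tiny example:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted accumulated reward: sum_i gamma**i * r_i over a reward sequence."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# A distant reward is worth slightly less than an immediate one:
discounted_return([0, 0, 0, 10])   # ~9.70
discounted_return([10, 0, 0, 0])   # 10.0
```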
Finally, our game is stochastic (we don't know when an enemy will shoot a laser beam), so we should think in terms of expected values. Ideally, what we want the neural network to learn is the function Q defined as:
$$Q(s)(a) = \mathbb{E}\left(\sum_{i=t_0} \gamma^{i} r_i\right) \qquad \text{(expected discounted accumulated future reward)}$$

where $s$ is the input game screen at time $t_0$, $a$ indicates the neuron corresponding to action $a$, and $r_i$ is the reward obtained after the action taken at time $i$. That is what we want each neuron of the output layer to learn.
To do that efficiently we need to realise that $Q(s)(a) = r + \gamma \max_{a'} Q(s')(a')$, where $s'$ is the game screen experienced after taking action $a$ upon seeing game screen $s$.
Now, if $Q^*$ is our neural network, we can treat $Q^*(s)(a) - \left(r + \gamma \max_{a'} Q^*(s')(a')\right)$ as a measure of surprise and therefore as a loss function (after squaring).
Note that the loss depends on the neural network itself in an atypical way.
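A sketch of that squared surprise as a batched loss, again in PyTorch rather than the original Theano. The `done` mask for terminal states and the use of a single network (instead of the stale copy discussed below) are simplifications of my own for the example.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, screens, actions, rewards, next_screens, done, gamma=0.99):
    """Squared surprise Q*(s)(a) - (r + gamma * max_a' Q*(s')(a')), averaged over a batch."""
    # Q-value the network predicts for the action that was actually taken.
    q_taken = q_net(screens).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    # Bootstrapped target; detached so gradients only flow through Q*(s)(a),
    # even though the target is computed with the same network.
    with torch.no_grad():
        target = rewards + gamma * q_net(next_screens).max(dim=1).values * (1 - done)
    return F.mse_loss(q_taken, target)
```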
## Correlation
Since we play the game online, it is tempting to simply update the network after each action taken, or in mini-batches of, say, the 32 most recent actions. Such mini-batches would be highly correlated, and any stochastic update algorithm would fail on them.
To deal with that issue, we keep previous experiences in memory and, after each action taken, we draw a mini-batch of experiences from that memory to perform the update step.
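A minimal sketch of such an experience memory (my own illustration, not the author's code):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size store of past (screen, action, reward, next_screen, done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive frames.
        return random.sample(self.buffer, batch_size)
```

After each action, the new transition is pushed and a mini-batch is sampled for the update step described above.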
## Other technicalities
- provide only every 4th frame to the neural network
- stack 4 frames on top of one another to make the neural network aware of time (this could be avoided by using an LSTM); a sketch of these two points follows this list
- keep a stale copy of the network at hand and calculate the loss with regard to that stale network
- clip gradients to avoid them blowing up
- use the "DeepMind" RMSProp instead of the normal one; it improved performance by 40% in my case (more on this in the lessons learned below)
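Roughly, the first two points could look like this. This is a framework-agnostic sketch, not the original implementation; the 84x84 preprocessing and episode boundaries are omitted.

```python
from collections import deque

import numpy as np

class FrameStacker:
    """Skip frames and stack the most recent 4, so the network can perceive motion."""

    def __init__(self, n_frames=4, skip=4):
        self.n_frames, self.skip = n_frames, skip
        self.frames = deque(maxlen=n_frames)
        self.count = 0

    def observe(self, frame):
        """Feed every raw frame; returns a (n_frames, H, W) stack every `skip`-th frame, else None."""
        self.count += 1
        if self.count % self.skip != 0:
            return None                       # frames in between are skipped
        self.frames.append(frame)
        while len(self.frames) < self.n_frames:
            self.frames.append(frame)         # pad with copies at the start of an episode
        return np.stack(self.frames)          # input for the network described above
```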
I used the Arcade Learning Environment to play Space Invaders.
# Lessons Learned
That was my first exposure to training non-trivial neural networks, so there are plenty of things that I learned.
NaNs as weights are no good. That seems obvious, but it does not mean such problems are easy to track down. Theano provides means of doing that efficiently. Of course, a NaN usually means that you divided $\infty$ by $0$.
Test your Theano code. It's not so hard! And here is the relevant documentation.
If your network converges or diverges to $\infty$ very quickly, it's probably caused by a suboptimal learning rate in your update function. Little is known about how to correctly choose a network's hyperparameters, so trial, error and verification is what's left.
The update method can play a gigantic role in the performance of your neural network. In my case, the learning curve of my DQN implementation flattened out around 400 points after 10M frames with traditional RMSProp. The “DeepMind” RMSProp learned more slowly but boosted performance to over 550 points on average after 50M frames, and one can clearly see that it kept learning the whole time.
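For reference, the “DeepMind” RMSProp is usually understood as the centered variant from the DQN paper, which also tracks a running mean of the gradient and normalizes by an estimate of its variance. A NumPy sketch of that reading of the update follows; it is not the author's Theano code, and the hyperparameter values are illustrative.

```python
import numpy as np

def deepmind_rmsprop_step(param, grad, state, lr=0.00025, decay=0.95, eps=0.01):
    """One update step of a centered RMSProp variant (a sketch).

    `state` holds running averages of the gradient ("g") and its square ("g2"),
    initialized to zeros of the same shape as `param`.
    """
    state["g"] = decay * state["g"] + (1 - decay) * grad
    state["g2"] = decay * state["g2"] + (1 - decay) * grad ** 2
    # Normalize by an estimate of the gradient's standard deviation rather
    # than its raw second moment, as plain RMSProp would.
    return param - lr * grad / np.sqrt(state["g2"] - state["g"] ** 2 + eps)
```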