@neph1
Last active May 26, 2017 18:39
Write-up for my evaluation of OpenAI's CartPole-v0
Brief:
I trained a logistic neural network with two hidden layers. Input nodes were the observations of the last 5 steps.
I believe I'm using a policy gradient approach.
Introduction and background:
I'm a novice when it comes to Reinforcement Learning but it's a subject I've become obsessed with lately. Pretty much all of the 'literature' on the subject I've read is Andrej Karpathy’s Pong from pixels (which is also how I found the AI Gym, incidentally).
I very much believe in the openness of the platform, and even if I'm not schooled in this matter (or schooled much at all), if there is even a grain of usefulness in this material, I would like to contribute it back.
Feedback is appreciated, especially since I see this as a learning experience. I’m also open to improving the documentation if you see anything you think needs clarification.
I should also mention I had some serious issues with crashes in the simulator. This is why the evaluation is so short and probably why it converges so early (it forced me to optimize so it would succeed before crashing).
Neural Network Layout:
The network had to be carefully designed. Deviating from the described network by only one or two nodes can cause severe degradation in performance. See “Alternatives and Red Herrings” for information about other designs.
20 input nodes (observations from 5 latest steps)
26 hidden nodes in first layer
14 hidden nodes in second layer
2 output nodes
All nodes are logistic. Layers are fully interconnected.
Initial Weights:
All weights are randomly initialized to a value between -1.0 and 1.0.
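As a rough sketch of the shape and initialization described above (the class and names here are illustrative, not my actual implementation):

import java.util.Random;

// Illustrative sketch of the network described above; not the actual code from the evaluation.
public class CartPoleNetwork {

    // 4 observation values x 5 latest steps = 20 inputs, two hidden layers, 2 outputs.
    static final int[] LAYER_SIZES = {20, 26, 14, 2};

    final float[][][] weights = new float[LAYER_SIZES.length - 1][][];
    final Random random = new Random();

    CartPoleNetwork() {
        // All weights are randomly initialized between -1.0 and 1.0.
        for (int layer = 0; layer < weights.length; layer++) {
            weights[layer] = new float[LAYER_SIZES[layer + 1]][LAYER_SIZES[layer]];
            for (float[] node : weights[layer]) {
                for (int i = 0; i < node.length; i++) {
                    node[i] = random.nextFloat() * 2f - 1f;
                }
            }
        }
    }

    // Fully interconnected forward pass; every node uses the logistic (sigmoid) function.
    float[] forward(float[] input) {
        float[] activation = input;
        for (float[][] layer : weights) {
            float[] next = new float[layer.length];
            for (int node = 0; node < layer.length; node++) {
                float sum = 0f;
                for (int i = 0; i < activation.length; i++) {
                    sum += layer[node][i] * activation[i];
                }
                next[node] = 1f / (1f + (float) Math.exp(-sum));
            }
            activation = next;
        }
        return activation;
    }
}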
Learning Rate:
Learning rate matters, but it is also quite forgiving: anything between 0.15 and 0.7 works, though keeping it around 0.25 seems slightly better. This evaluation used 0.28, I believe.
Input:
The input each step is divided by the (observed) bounds of that range. The exceptions are values 2 and 4, where I found that blowing these up (dividing by 0.05) led to a significant improvement in learning performance. I expect this is due to enforcing a low velocity. Those values are then also capped to between -5 and 5.
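A minimal sketch of that scaling for one raw observation; the bound values below are placeholders for the observed bounds, not the exact numbers from the evaluation:

// Divide each value by its (observed) bound; values 2 and 4 are instead divided by 0.05
// and capped to [-5, 5]. The bounds below are illustrative placeholders.
static final float[] OBSERVED_BOUNDS = {2.4f, 0.05f, 0.21f, 0.05f};

static float[] normalize(float[] rawObservation) {
    float[] normalized = new float[rawObservation.length];
    for (int i = 0; i < rawObservation.length; i++) {
        normalized[i] = rawObservation[i] / OBSERVED_BOUNDS[i];
        if (i == 1 || i == 3) { // the two velocity values, blown up and then capped
            normalized[i] = Math.max(-5f, Math.min(5f, normalized[i]));
        }
    }
    return normalized;
}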
Scoring:
I tried a number of different scoring methods.
What worked best in the end was:
Compare the position of each observed input with the one from the previous step.
If better (closer to zero), set gradients to positive (1.0), otherwise negative (-1.0). I’ve tried more variable gradients, like using the score directly, but it didn’t work as well.
public void score() {
    float score = 0;
    // Compare each of the four observation values with the same value from the previous step.
    for (int i = 0; i < 4; i++) {
        float delta = Math.abs(observation[i + 4]) - Math.abs(observation[i]);
        score += delta;
    }
    // Better (closer to zero) counts as a positive gradient (1.0), otherwise negative (-1.0).
    if (score > 0) {
        score = 1f;
    } else {
        score = -1f;
    }
    float[] scores = lastAction == 0 ? TEMP_SCORE_LEFT : TEMP_SCORE_RIGHT;
    decisions.add(new float[][]{lastInput, scores});
    updateScore(3, score);
}
Then the previous two inputs are also updated with a diminishing value in the updateScore method. The factor seems forgiving: 0.55 to 0.95 have been tried, and 0.8 seems to be a good level.
private void updateScore(int turnsAgo, float scoreChange) {
    int size = decisions.size();
    int min = Math.min(size, turnsAgo);
    // Walk backwards over the most recent decisions.
    for (int i = 0; i < min; i++) {
        float[] scores = decisions.get(size - i - 1)[1];
        // increase scores by scoreChange
        scoreChange *= 0.8f; // diminish the change for each step further back
    }
}
Training:
When it came to training and using the training data, I tried two different models:
1. Keep neural network between episodes. Train using only the data from the latest episode.
2. Reinitialize network each episode (keep initial weights). Keep all training data and train on whole set (all episodes).
In both cases the training data is shuffled before use and then backpropagated.
Apart from the obvious benefit of less total training time, model 1 also led to much better learning performance.
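As a rough sketch, model 1's training step amounts to something like this (Backpropagator stands in for my own network class; the names are illustrative):

import java.util.Collections;
import java.util.List;

// Sketch of training model 1: keep the network between episodes and train only on the
// decisions recorded during the latest episode. Backpropagator is an assumed stand-in
// for the network's own back-propagation routine.
interface Backpropagator {
    void backpropagate(float[] input, float[] target, float learningRate);
}

class EpisodeTrainer {
    private final Backpropagator network;

    EpisodeTrainer(Backpropagator network) {
        this.network = network;
    }

    // Each decision holds {input, scores}, matching what score() stores in 'decisions'.
    void trainOnLatestEpisode(List<float[][]> episodeDecisions, float learningRate) {
        Collections.shuffle(episodeDecisions); // shuffle before use
        for (float[][] decision : episodeDecisions) {
            network.backpropagate(decision[0], decision[1], learningRate);
        }
    }
}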
Soft Max:
For a long time I used softmax to determine the action. Once the network improved, it became obvious that a deterministic approach led to a better outcome (just use whichever output is higher).
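In code, that simply means taking whichever of the two outputs is higher, something like:

// Deterministic action selection: pick the output node with the higher activation
// instead of sampling from a softmax distribution.
static int chooseAction(float[] output) {
    return output[0] >= output[1] ? 0 : 1;
}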
Framework:
I used my own neural network implementation, designed for backpropagation and written in Java.
Hence, I used deeplearning4j's Java integration with the AI Gym.
Debugging:
To observe performance I printed out the following information after each episode:
Episode number
Score for episode
Average over all episodes so far
Rolling average (the window varied between 10 and 100 episodes)
Highest value so far (less interesting once reaching the highest score became frequent, but still interesting to see when it first reaches 200)
Total reward
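Roughly, the bookkeeping looked like this (an illustrative sketch, not the code used in the evaluation):

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative tracking of the per-episode statistics listed above.
class EpisodeStats {
    private final Deque<Float> recentScores = new ArrayDeque<>();
    private final int rollingWindow;
    private int episodes;
    private float totalReward;
    private float highest = Float.NEGATIVE_INFINITY;

    EpisodeStats(int rollingWindow) {
        this.rollingWindow = rollingWindow; // anywhere from 10 to 100 episodes
    }

    void record(float episodeScore) {
        episodes++;
        totalReward += episodeScore;
        highest = Math.max(highest, episodeScore);
        recentScores.addLast(episodeScore);
        if (recentScores.size() > rollingWindow) {
            recentScores.removeFirst();
        }
        float rolling = 0f;
        for (float s : recentScores) {
            rolling += s;
        }
        System.out.printf("episode %d score %.0f avg %.2f rolling %.2f high %.0f total %.0f%n",
                episodes, episodeScore, totalReward / episodes,
                rolling / recentScores.size(), highest, totalReward);
    }
}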
Alternatives and red herrings:
It is possible to train with 6 steps, but not at the same level of performance:
24 input nodes
32 hidden nodes in first layer
16 hidden nodes in second layer
2 output nodes
One scoring method I tried was to compare relative movement with the previous step. The idea was that even if the model made a good move, a high velocity in the opposite direction wouldn't get noticed when just watching the positions of the pole and cart. So, by comparing the delta-v of the previous move, I thought it might be possible to spot those cases. But either my theory was wrong, or I didn't manage to find good values for the scoring.
Another thing I experimented with was comparing the current position(s) with those from more than one step back. Again, I was not able to find any good way of factoring this information into the scoring, so instead I placed my bets on the network figuring this out for itself.
At one point, when I was thinking about how to improve the learning, I thought that more training examples would be a good thing (it's one of the fundamental rules in supervised learning, after all). Since the observation space is symmetric (or so I thought), I should be able to mirror all observations and scorings and get twice as many training examples. Awesome! It turned out that wasn't the case: I was not able to use the mirrored data.
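For reference, the mirroring I had in mind amounts to something like this (a sketch of the idea only, since it didn't work in practice):

// Mirror a training example: negate every observation value and swap the left/right scores,
// assuming the environment is symmetric around zero.
static float[][] mirror(float[] input, float[] scores) {
    float[] mirroredInput = new float[input.length];
    for (int i = 0; i < input.length; i++) {
        mirroredInput[i] = -input[i];
    }
    float[] mirroredScores = {scores[1], scores[0]};
    return new float[][]{mirroredInput, mirroredScores};
}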
Further improvements:
I still think mirroring the training examples should be possible. Doubling the training data should improve it significantly.
Given the design of the network, it seems a Recurrent Neural Network (RNN) would be a good alternative.