Rickard Edén (neph1)

neph1 / gist:9202f89acb6762922c356e49f8f60e98
Last active August 12, 2017 13:14
CartPole-v1 better implementation
To a large extent, the algorithm is the same as the one described here: https://gym.openai.com/evaluations/eval_REzxUUxrRX6OB9IKCVgSg
This is an improved algorithm and implementation compared to this one: https://gym.openai.com/evaluations/eval_vY5lfwRaSJSk7XzPEQLLUw
I had noticed during testing that while reaching a score of 400-450 was generally no problem, gaining the last 50 points was a much slower process. I also reasoned that since a policy reaching 450 could already be considered fairly good, I didn't want it to change too much on each iteration.
The previous evaluation showed that lowering the learning rate over time generally seemed a good thing, but the implementation was flawed.
I was already tracking the rolling average score over the last 100 episodes to know when the environment would be solved. Now I also tracked the change in this value between episodes, the delta.
Based on whether the delta increased or decreased, I increased or decreased the learning rate respectively (again by multiplying by 1.02 or 0.945), starting with
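A minimal sketch of this delta-based schedule follows; the starting learning rate and the variable names are assumptions on my part, and only the 1.02/0.945 multipliers and the 100-episode rolling average come from the description above:

```python
# Sketch of a delta-based learning-rate schedule, as described above.
# The initial learning rate and variable names are assumptions.
from collections import deque

scores = deque(maxlen=100)   # rewards of the last 100 episodes
learning_rate = 0.001        # assumed starting value (not given in the text)
prev_rolling_avg = 0.0

def update_learning_rate(episode_score):
    global learning_rate, prev_rolling_avg
    scores.append(episode_score)
    rolling_avg = sum(scores) / len(scores)
    delta = rolling_avg - prev_rolling_avg
    prev_rolling_avg = rolling_avg
    # If the rolling average improved, nudge the learning rate up;
    # otherwise pull it back down.
    if delta > 0:
        learning_rate *= 1.02
    else:
        learning_rate *= 0.945
    return learning_rate
```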
To a large extent, the algorithm is the same as the one described here: https://gym.openai.com/evaluations/eval_REzxUUxrRX6OB9IKCVgSg
The one change in this evaluation is that I tried to have a dynamic learning rate. While the idea might be correct, you will see that the implementation was not.
I used the error collected on each iteration and checked whether it was higher or lower than on the previous one.
If it was lower, I increased the learning rate slightly (multiplying by 1.02). If it was higher, I instead decreased it (multiplying by 0.945).
This improved performance and seemed more reliable than running without it. The big flaw in this reasoning is that in normal backpropagation the error would be checked against a held-out test set, not against the training iterations themselves. I noticed that the learning rate decreased over time, which is probably the desired behavior.
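For comparison, a sketch of this earlier, error-keyed rule; the variable names and the starting learning rate are illustrative, not taken from the evaluation:

```python
# Sketch of the flawed variant: the adjustment is keyed on the training
# error of the current iteration rather than on held-out performance.
learning_rate = 0.001        # assumed starting value
prev_error = float("inf")

def adjust_on_error(current_error):
    global learning_rate, prev_error
    if current_error < prev_error:
        learning_rate *= 1.02    # error went down: speed up
    else:
        learning_rate *= 0.945   # error went up: slow down
    prev_error = current_error
    return learning_rate
```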
The next evaluation has a better implementation: https://gym.openai.com/evaluations/eval_REzxUUxrRX6OB9IKCVgSg
neph1 / cartpole_writeup
Last active May 26, 2017 18:39
Write-up for my evaluation of OpenAI's CartPole-v0
Brief:
I trained a logistic (sigmoid-activated) neural network with two hidden layers. The input nodes were the observations from the last 5 steps.
I believe I'm using a policy gradient approach.
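For illustration, a rough sketch of what such a network might look like: CartPole-v0's observation is 4-dimensional, so the last 5 steps give 20 inputs. The hidden-layer sizes and the weight initialisation below are my assumptions; the write-up does not state them.

```python
# Rough sketch of a logistic network with two hidden layers whose input is
# the concatenated observations of the last 5 steps (5 x 4 = 20 values).
# Hidden-layer sizes and initialisation are assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_input, n_hidden1, n_hidden2 = 20, 16, 16   # assumed hidden sizes
W1 = rng.normal(0, 0.1, (n_input, n_hidden1))
W2 = rng.normal(0, 0.1, (n_hidden1, n_hidden2))
W3 = rng.normal(0, 0.1, (n_hidden2, 1))

def policy_forward(last_5_obs):
    """last_5_obs: array of shape (5, 4) -> probability of pushing right."""
    x = np.asarray(last_5_obs).reshape(-1)   # flatten to 20 inputs
    h1 = sigmoid(x @ W1)
    h2 = sigmoid(h1 @ W2)
    return sigmoid(h2 @ W3)[0]               # action probability
```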
Introduction and background:
I'm a novice when it comes to Reinforcement Learning, but it's a subject I've become obsessed with lately. Pretty much all of the 'literature' I've read on the subject is Andrej Karpathy's Pong from Pixels (which is also how I found the AI Gym, incidentally).
I very much believe in the openness of the platform, and even if I'm not schooled in this matter (or schooled much at all), if there is even a grain of usefulness in this material, I would like to contribute it back.
Feedback is appreciated, especially since I see this as a learning experience. I’m also open to improving the documentation if you see anything you think needs clarification.
I should also mention I had some serious issues with crashes in the simulator. This is why the evaluation is so short and probably why it converges so early (it forced me to opt