MountainCar solution
I tested various approaches and found that a properly tuned DQN plus a cross-entropy pool solves
this problem the fastest.
By DQN+CE I mean the common DQN technique, except that the batches sampled for experience replay
are selected proportionally to how good their corresponding episode was compared to
the worst possible one with a -200 total reward.
In the common cross-entropy method we basically select the best episodes and train the network
to correctly predict actions from those steps. This drops the experience of
wrong or missing steps and actions, which might be useful to learn from too.
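As a rough illustration, a minimal sketch of that reward-weighted replay sampling could look like the following; the function name, episode layout and the small smoothing constant are my own assumptions, not taken from the repository:
```
import numpy as np

WORST_REWARD = -200.0  # every failed MountainCar-v0 episode scores -200

def sample_batch(episodes, batch_size):
    """episodes: list of (transitions, total_reward) pairs.

    Episodes are drawn proportionally to how much their total reward
    exceeds the worst possible one, so better episodes dominate replay.
    """
    weights = np.array([total + 1e-3 - WORST_REWARD for _, total in episodes])
    probs = weights / weights.sum()
    batch = []
    for _ in range(batch_size):
        transitions, _ = episodes[np.random.choice(len(episodes), p=probs)]
        batch.append(transitions[np.random.randint(len(transitions))])
    return batch
```
With this weighting a -200 episode keeps only the tiny smoothing weight, so replay still sees every episode occasionally but is dominated by the better ones.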
Another approach I tested was main/follower networks, i.e. you 'read'
Q-values from a follower network which slowly tracks the main network used for learning.
This does not noticeably change convergence;
it may reduce the oscillations, which always happen, likely because of overfitting.
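For reference, such a follower is usually maintained as a soft parameter copy; here is a sketch assuming TF1-style variable scopes named `main` and `follower` (the scope names and the rate are assumptions):
```
import tensorflow as tf

TAU = 0.001  # follower tracking rate, an illustrative value

main_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='main')
follower_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='follower')

# Each training step, move the follower a small fraction toward the main network.
update_follower = tf.group(*[f.assign(TAU * m + (1.0 - TAU) * f)
                             for m, f in zip(main_vars, follower_vars)])
```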
Another approach was A3C and the actor-critic model. This never reached the performance of DQN
and, more importantly, raises serious questions about its applicability to this task,
since there is no clear convergence logic and no proper reward matching: it is always possible
to construct a set of steps which has the same discounted reward but does not lead
to the winning strategy.
The DQN is a simple network consisting of 3 layers of 50, 190 and 3 neurons; the first two have biases.
I use the `tanh` nonlinearity; `relu` never converged, and in my opinion it is a very overrated function.
Tanh has the nice property of producing both positive and negative values, which is useful in this problem.
The last layer uses a linear activation function since the Q-values are large enough.
I use l1+l2 regularization and the RMSProp optimizer.
Tuning the number of neurons heavily affects performance,
so I believe my numbers can be improved further.
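Put together, the described network could look roughly like this in TF1-style code; the layer sizes, activations and biases follow the text above, while the regularization scales and learning rate are placeholders I made up:
```
import tensorflow as tf

state = tf.placeholder(tf.float32, [None, 2])   # MountainCar-v0 observation
action = tf.placeholder(tf.int32, [None])
target_q = tf.placeholder(tf.float32, [None])

reg = tf.contrib.layers.l1_l2_regularizer(scale_l1=1e-4, scale_l2=1e-4)  # assumed scales
h1 = tf.layers.dense(state, 50, activation=tf.nn.tanh, kernel_regularizer=reg)
h2 = tf.layers.dense(h1, 190, activation=tf.nn.tanh, kernel_regularizer=reg)
# Linear output layer without a bias: one Q-value per action.
q_values = tf.layers.dense(h2, 3, activation=None, use_bias=False,
                           kernel_regularizer=reg)

q_taken = tf.reduce_sum(q_values * tf.one_hot(action, 3), axis=1)
loss = tf.reduce_mean(tf.square(target_q - q_taken))
loss += tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
train_op = tf.train.RMSPropOptimizer(learning_rate=0.0025).minimize(loss)
```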
Code for all the mentioned ideas is included in the repository
https://github.com/bioothod/openai-mountain-car-v0;
it is implemented in pure Python + TF. To run the winning solution just type
```
$ python mc0.py
```
It will create `mc0` and `mc0_wrappers` directories in the current dir; the former can be used
with `tensorboard`, the latter is used for the OpenAI Gym monitor output.