Basic Q-Learning algorithm using TensorFlow
I've come across an oddity that I'm having trouble understanding. Running this code as written in TensorFlow works just fine for me, but trying to re-implement it in other frameworks (PyTorch, Keras) left me with a network that seemed unable to learn the game. It looks to me like the randomly initialized weights in the linear layer pass on bogus future-reward estimates when the agent loses the game.
Put another way, when the agent lost the game, the target for that action was not just the 0 reward it received; it was 0 plus the discounted max future reward for the "next step" of a game that had already ended. I was able to get the agent to learn the game with this modification:
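A minimal sketch of such a terminal-state mask, assuming a PyTorch-style Q-target computation; the variable names and values here are illustrative, not taken from the original code:

```python
import torch

gamma = 0.99
# Illustrative batch of transitions; names and values are hypothetical.
reward = torch.tensor([0.0, 1.0])        # reward received for each transition
done = torch.tensor([1.0, 0.0])          # 1.0 if the episode ended on this step
max_next_q = torch.tensor([0.73, 0.41])  # max_a' Q(s', a') from the network

# Naive target: even a losing step bootstraps from the (meaningless) next state.
naive_target = reward + gamma * max_next_q

# Masked target: zero the bootstrapped term on terminal transitions, so a
# lost game contributes a target of exactly its reward (here, 0).
target = reward + gamma * max_next_q * (1.0 - done)
```

With the loss computed against `target` rather than `naive_target`, terminal transitions stop propagating the randomly initialized Q-values of the post-game state.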
Given this explanation, the modification makes perfect sense to me, but I'm left wondering why the example here doesn't have the same issue. Thoughts?
Reference code: