Created
September 11, 2016 00:20
Policy gradient method for solving n-armed bandit problems.
f(x) = 0 doesn't mean f'(x) = 0. As long as the gradient of the loss is not 0, the weights will eventually change.
In the example, gradient(loss) = gradient(-log(weight) * reward) = -reward * 1/weight (since d(ln x)/dx = 1/x and reward is a constant),
so for weight = 1, gradient(loss) = -reward * 1/1 = -reward.
If reward == 1 (positive feedback), the gradient is -1, so gradient descent subtracts learning_rate * gradient, which is equivalent to adding learning_rate = 0.001. The new weight becomes 1.001, giving that arm a slightly higher chance of being selected by argmax(weights). And so on.
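
To check the arithmetic above, here is a minimal sketch in plain Python rather than the TensorFlow code from the gist. The function names `loss`, `grad_loss`, and `sgd_step` are made up for illustration, and it treats a single weight in isolation, which is a simplification.

```python
import math

def loss(weight, reward):
    # Loss for the chosen arm: -log(weight) * reward
    return -math.log(weight) * reward

def grad_loss(weight, reward):
    # d/dw [-log(w) * reward] = -reward / w, with reward treated as a constant
    return -reward / weight

def sgd_step(weight, reward, learning_rate=0.001):
    # One plain gradient descent update: w <- w - learning_rate * gradient
    return weight - learning_rate * grad_loss(weight, reward)

w = 1.0
print(loss(w, 1.0) == 0)   # the loss is zero at weight = 1 ...
print(grad_loss(w, 1.0))   # ... but its gradient is not
print(sgd_step(w, 1.0))    # weight moves up by about learning_rate
print(sgd_step(w, -1.0))   # a negative reward moves it down instead
```

So even though the loss starts at 0 when the weight is 1, the first positive reward already nudges the weight upward, and a negative reward nudges it downward.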