@awjuliani
Created September 11, 2016 00:20
Policy gradient method for solving n-armed bandit problems.
mapa17 commented Sep 1, 2018

Hi,

I have trouble understanding how the optimizer can tune the weights variable.

To my understanding, the optimizer will try to minimize the loss function (toward a target loss of 0.0), but in the example above the weights start at 1.0, so the initial loss value is already 0.0.

loss = -(log(weight) * reward) = -(0.0 * reward) = 0.0

weights = tf.Variable(tf.ones([num_bandits]))
...
responsible_weight = tf.slice(weights, action_holder, [1])
loss = -(tf.log(responsible_weight) * reward_holder)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)

What am I missing or getting wrong?

thx,
Manuel

wzzhu commented Oct 29, 2019

To my understanding, the optimizer will try to minimize the loss function (toward a target loss of 0.0), but in the example above the weights start at 1.0, so the initial loss value is already 0.0.

loss = -(log(weight) * reward) = -(0.0 * reward) = 0.0

What am I missing or getting wrong?

f(x) = 0 doesn't mean f'(x) = 0. As long as the gradient of the loss is not 0, the weights will eventually change.

In the example, gradient(loss) = gradient(-log(weight) * reward) = -reward * 1/weight (since d/dx[ln x] = 1/x and reward is a constant),
so at weight = 1 the gradient is -reward * 1/1 = -reward.

If reward == 1 (positive feedback), the gradient is -1, so gradient descent subtracts learning_rate * gradient, which is equivalent to adding the learning rate of 0.001. The new weight becomes 1.001, giving that arm a slightly higher chance of being selected by argmax(weights). And so on.
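
For anyone who wants to verify this numerically, here is a minimal sketch of the same computation. It uses TF2's eager GradientTape API and a single scalar weight instead of the gist's TF1 graph code and weights vector, so the names and API differ from the gist:

import tensorflow as tf

# Minimal check (TF2 eager API; the gist itself uses the TF1 graph API).
weight = tf.Variable(1.0)   # starts at 1.0, so log(weight) = 0 and the loss starts at 0.0
reward = 1.0                # positive feedback for the chosen arm

with tf.GradientTape() as tape:
    loss = -(tf.math.log(weight) * reward)   # loss value is 0.0 here...

grad = tape.gradient(loss, weight)           # ...but its gradient is -reward / weight = -1.0
print(loss.numpy(), grad.numpy())            # -0.0  -1.0

# One gradient-descent step: weight <- weight - lr * grad = 1.0 - 0.001 * (-1.0) = 1.001
tf.keras.optimizers.SGD(learning_rate=0.001).apply_gradients([(grad, weight)])
print(weight.numpy())                        # 1.001

So even though the loss starts at exactly 0.0, its gradient does not, and the chosen arm's weight still moves in the right direction.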
