#List out our bandits. Currently bandit 4 (index 3) is set to most often provide a positive reward.
bandits = [0.2, 0, -0.2, -5]

The pullBandit method is defined as:

import numpy as np

def pullBandit(bandit):
    #Get a random number from a standard normal distribution.
    result = np.random.randn(1)
    if result > bandit:
        #Return a positive reward.
        return 1
    else:
        #Return a negative reward.
        return -1
If you look carefully, result is a sample from a standard normal distribution, so it can be positive or negative. Since bandits[3] = -5 is a far lower threshold than bandits[1] = 0, result > bandits[3] is almost always true, so bandits[3] gives the best chance of a positive reward.
Try this code and you will see the range of values result takes:
for i in range(100): print(np.random.randn(1))
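To make that concrete, here is a minimal sketch (assuming only numpy; the variable names are illustrative, not from the gist) that empirically estimates each bandit's chance of returning +1:

import numpy as np

bandits = [0.2, 0, -0.2, -5]

# Estimate P(result > bandit) for each threshold by sampling.
samples = np.random.randn(100000)
for i, b in enumerate(bandits):
    print("bandit %d (threshold %4.1f): P(+1) ~= %.2f" % (i, b, np.mean(samples > b)))

You should see roughly 0.42, 0.50, 0.58 and 1.00, which is why bandit 4 is the one the agent should learn to pick.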
Hi,
I have trouble understanding how the optimizer can tune the weights variable.
To my understanding the optimizer will try to minimize the loss function (target loss = 0.0), but in the example above the weights start out at 1.0, causing the initial loss value to already be 0.0:

loss = -(log(weight) * reward) = -(log(1.0) * reward) = -(0.0 * reward) = 0.0
weights = tf.Variable(tf.ones([num_bandits]))
...
responsible_weight = tf.slice(weights,action_holder,[1])
loss = -(tf.log(responsible_weight)*reward_holder)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
What am I missing or getting wrong?
thx,
Manuel
f(x) = 0 doesn't mean f'(x) = 0; as long as the gradient of the loss is not 0, the weights will eventually change.
In the example, gradient(loss) = gradient(-log(weight) * reward) = -reward * (1/weight), since d/dx[ln x] = 1/x and reward is a constant,
so gradient(loss) at weight = 1 is -reward * 1/1 = -reward.
If reward == 1 (positive feedback), gradient = -1, so gradient descent will subtract learning_rate * gradient, which is equivalent to adding learning_rate = 0.001. The new weight becomes 1.001, giving that action a slightly higher chance of being selected by argmax(weights). And so on.
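To see this with numbers, here is a tiny numpy sketch of that single update done by hand (not the actual TF graph):

import numpy as np

learning_rate = 0.001
weight = 1.0   # responsible_weight starts at 1.0
reward = 1.0   # assume a positive reward was received

loss = -np.log(weight) * reward          # -(0.0 * 1.0) = 0.0, loss starts at zero
grad = -reward / weight                  # d/dw[-log(w) * r] = -r/w = -1.0, gradient is nonzero
weight = weight - learning_rate * grad   # 1.0 - 0.001 * (-1.0) = 1.001

print(loss, grad, weight)

So even though the loss starts at 0.0, its gradient does not, and the weight of the rewarded action drifts upward.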
@bahriddin that is due to the first selection choice: remember we initialize all of the weights to one, hence the argmax is 0. And since e = 0.1 is a small number, we are not going to explore much, so the agent will most likely always choose the first bandit and be wrong. If you increase the e value it will do better.
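Here is a small simulation of that effect (a numpy sketch of the tutorial's epsilon-greedy loop, not the original TF code; run_agent is an illustrative name):

import numpy as np

bandits = [0.2, 0, -0.2, -5]

def pullBandit(bandit):
    return 1 if np.random.randn(1) > bandit else -1

def run_agent(e, episodes=2000, lr=0.001):
    weights = np.ones(len(bandits))
    for _ in range(episodes):
        # Epsilon-greedy: explore with probability e, otherwise exploit argmax.
        if np.random.rand(1) < e:
            action = np.random.randint(len(bandits))
        else:
            action = int(np.argmax(weights))
        reward = pullBandit(bandits[action])
        # Same update as the TF graph: w <- w - lr * (-reward / w).
        weights[action] += lr * reward / weights[action]
    return weights

for e in (0.1, 0.5):
    print("e = %.1f -> weights %s" % (e, np.round(run_agent(e), 3)))

With a larger e the agent stumbles onto bandit 4 (index 3) more often, so its weight grows faster and argmax switches to it sooner.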