# awjuliani/SimplePolicy.ipynb

Created Sep 11, 2016
Policy gradient method for solving n-armed bandit problems.

### qiaoruntao commented Mar 8, 2017

Maybe there is a typo: does "proceedure" in the third part of the code mean "procedure"?

### mynameisvinn commented Oct 15, 2017 • edited

The action is currently selected in a deterministic manner:

```
chosen_action = tf.argmax(weights,0)
```

As this is a policy network, shouldn't the action be drawn from a probability distribution, in a stochastic manner? It could look something like this:

```
# first, convert raw weights to softmax probs
softmax_probs = tf.nn.softmax(weights)
# then, draw from probability distribution
possible_actions = tf.convert_to_tensor([0,1,2,3])  # indices for possible actions
samples = tf.multinomial(tf.log([softmax_probs]), 1)  # draw according to weights
chosen_action = possible_actions[tf.cast(samples, tf.int32)]
```

(Not trying to overcomplicate things, just trying to understand the thought process behind this helpful example.)
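For comparison, here is a minimal NumPy-only sketch of the same idea (softmax over the weights, then a draw from the resulting distribution). It is not the gist's code: the helper names `softmax` and `sample_action` and the seed are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)          # shift so the largest entry is 0
    e = np.exp(z)
    return e / e.sum()

def sample_action(weights, rng):
    """Draw one action index with probability softmax(weights)[index]."""
    probs = softmax(np.asarray(weights, dtype=float))
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)   # arbitrary seed, only for reproducibility
weights = np.ones(4)             # the gist initializes all weights to 1
probs = softmax(weights)         # uniform while all weights are equal
action = sample_action(weights, rng)
```

With equal weights every arm is equally likely, so exploration happens without a separate `e` parameter; as one weight grows, its arm is drawn more often but the others are never ruled out.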

### tywadd commented Oct 20, 2017

@mynameisvinn I like your solution, also. The example does something similar in the lines

```
if np.random.rand(1) < e:
    action = np.random.randint(num_bandits)
```

I suppose it's still doing it stochastically, since `rand` and `randint` draw from some distribution. It might be a neat experiment to try each and compare the differences.

### retnuh commented Dec 8, 2017

@mynameisvinn - I believe you could do what you're suggesting, and it would be interesting to see the results and the differences in the learning rates, but Arthur does explicitly say in the post that he's using an "e-greedy" policy:

> To update our network, we will simply try an arm with an e-greedy policy. This means that most of the time our agent will choose the action that corresponds to the largest expected value, but occasionally, with e probability, it will choose randomly.
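A rough sketch of the e-greedy rule described in that quote, written in plain NumPy rather than the gist's TensorFlow session code. The function name `epsilon_greedy` and the example weights are hypothetical; only `e` and the argmax-or-random behavior come from the thread.

```python
import numpy as np

def epsilon_greedy(weights, e, rng):
    """With probability e pick a uniformly random arm, otherwise the argmax arm."""
    if rng.random() < e:
        return int(rng.integers(len(weights)))   # explore
    return int(np.argmax(weights))               # exploit

rng = np.random.default_rng(0)                   # arbitrary seed
example_weights = [1.0, 1.2, 0.9, 1.1]           # made-up values for illustration
greedy_action = epsilon_greedy(example_weights, 0.0, rng)   # never explores
random_action = epsilon_greedy(example_weights, 1.0, rng)   # always explores
```

Unlike softmax sampling, the random branch ignores the weights entirely, so a clearly bad arm is explored just as often as a nearly optimal one.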

### fredthedead commented Jan 11, 2018

"The lower the bandit number, the more likely a positive reward will be returned" vs. `Currently bandit 4 (index#3) is set to most often provide a positive reward.` Shouldn't the first line say the opposite: the higher the bandit number, the more likely a positive reward will be returned?

### jackleekopij commented Mar 18, 2018

A great tutorial! I understand this is an introductory tutorial; however, I found it interesting to look for boundary conditions by playing with the reward probabilities (bandits) used in the `pullBandit` reward function along with the epsilon-greedy parameter. Tweaking these parameters and observing which bandit came out as most promising was a great exercise for understanding the sensitivities of the algorithm. Thanks for the post, awjuliani.

### bahriddin commented Apr 20, 2018

I tried with these settings:

```
bandits = [-0.9, 0, -0.2, -1]
total_episodes = 100000
learning_rate=.01/total_episodes
```

But still, it can't find the global optimum. Are there any suggestions for improving the algorithm? Regards!

### JaeDukSeo commented Jun 28, 2018

One reason this example might be confusing is that TF can only minimize when performing automatic differentiation. That's why the probability is flipped, with -5 giving the best probability.

### JaeDukSeo commented Jun 28, 2018

@bahriddin That is due to the first selection. Remember that we initialize all of the weights to one, hence the argmax is 0. And since e is small (0.1), we are not going to explore that much, so the agent will most likely always choose the first arm and be wrong. If you increase the e value, it should do better.

### dhl8282 commented Jul 10, 2018 • edited

@fredthedead About

```
# List out our bandits. Currently bandit 4 (index#3) is set to most often provide a positive reward.
bandits = [0.2,0,-0.2,-5]
```

the `pullBandit` method is defined as

```
def pullBandit(bandit):
    # Get a random number.
    result = np.random.randn(1)
    if result > bandit:
        # return a positive reward.
        return 1
    else:
        # return a negative reward.
        return -1
```

If you look carefully, `result` is a random positive or negative number. Since bandits[3] = -5 is a more generous threshold than bandits[1] = 0, bandits[3] gives the best chance of a positive reward. Try this code to see the kind of values `result` takes:

```
for i in range(100):
    print(np.random.randn(1))
```
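To put numbers on this: `pullBandit` returns +1 whenever a standard normal draw exceeds the bandit value, so the chance of a positive reward is P(Z > bandit). The sketch below computes that with the standard library; the helper name `positive_reward_probability` is made up for illustration.

```python
import math

def positive_reward_probability(bandit):
    """P(Z > bandit) for a standard normal Z, i.e. the chance pullBandit returns +1."""
    return 0.5 * math.erfc(bandit / math.sqrt(2.0))

bandits = [0.2, 0, -0.2, -5]     # values from the gist
probs = [positive_reward_probability(b) for b in bandits]

for index, (b, p) in enumerate(zip(bandits, probs)):
    print("bandit index %d (value %s): P(reward = +1) = %.4f" % (index, b, p))
```

The probability rises as the bandit value falls, which is why index 3 (value -5) is the best arm even though it has the lowest value.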

### mapa17 commented Sep 1, 2018

Hi, I have trouble understanding how the optimizer can tune the weights variable. To my understanding, the optimizer will try to minimize the loss function (target loss = 0.0), but in the example above the weights start out at 1.0, so the initial loss value is already 0.0:

loss = -(log(weight) * reward) = -(0.0 * reward) = -0.0

```
weights = tf.Variable(tf.ones([num_bandits]))
...
responsible_weight = tf.slice(weights,action_holder,)
loss = -(tf.log(responsible_weight)*reward_holder)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
```

What am I missing or getting wrong? Thanks, Manuel

### wzzhu commented Oct 29, 2019 • edited

> To my understanding the Optimizer will try to minimize (target loss=0.0) the loss function, but in the example above the weights start out at 1.0, causing the initial loss value to be already 0.0. loss = -(log(weight) * reward) = - (0.0 * reward) = - 0.0 What do I miss or get wrong?

f(x) = 0 doesn't mean f'(x) = 0. As long as the gradient of the loss is not 0, the weights will eventually change.

In the example, gradient(loss) = gradient(-log(weight) * reward) = -reward * 1/weight (since d(ln x)/dx = 1/x and reward is a constant), so at weight = 1, gradient(loss) = -reward * 1/1 = -reward.

If reward == 1 (positive feedback), the gradient is -1, so gradient descent subtracts learning_rate * gradient, which is equivalent to adding learning_rate = 0.001. The new weight becomes 1.001, giving that arm a slightly higher chance of being selected by argmax(weights). And so on.
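The arithmetic above can be checked with a few lines of plain Python. This is only a sketch of one scalar update, not the gist's TensorFlow graph; the function names `loss` and `loss_gradient` are assumptions for illustration.

```python
import math

def loss(weight, reward):
    """Loss from the gist for a single chosen arm: -(log(weight) * reward)."""
    return -(math.log(weight) * reward)

def loss_gradient(weight, reward):
    """Analytic derivative of the loss with respect to weight: -reward / weight."""
    return -reward / weight

learning_rate = 0.001            # same value as in the quoted optimizer line
weight, reward = 1.0, 1.0        # initial weight, positive feedback

initial_loss = loss(weight, reward)              # zero, as mapa17 observed
grad = loss_gradient(weight, reward)             # -1.0, so the weight still moves
new_weight = weight - learning_rate * grad       # one gradient descent step

# Central finite difference as an independent check of the analytic gradient.
h = 1e-6
numeric_grad = (loss(weight + h, reward) - loss(weight - h, reward)) / (2 * h)
```

A reward of -1 flips the sign of the gradient, so the same step would lower the weight to 0.999 instead.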