
@awjuliani
Created September 11, 2016 00:20
Policy gradient method for solving n-armed bandit problems.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Simple Reinforcement Learning in Tensorflow Part 1: \n",
"## The Multi-armed bandit\n",
"This tutorial contains a simple example of how to build a policy-gradient based agent that can solve the multi-armed bandit problem. For more information, see this [Medium post](https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-1-fd544fab149).\n",
"\n",
"For more Reinforcement Learning algorithms, including DQN and Model-based learning in Tensorflow, see my Github repo, [DeepRL-Agents](https://github.com/awjuliani/DeepRL-Agents). "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Bandits\n",
"Here we define our bandits. For this example we are using a four-armed bandit. The pullBandit function generates a random number from a normal distribution with a mean of 0. The lower the bandit number, the more likely a positive reward will be returned. We want our agent to learn to always choose the bandit that will give that positive reward."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#List out our bandits. Currently bandit 4 (index#3) is set to most often provide a positive reward.\n",
"bandits = [0.2,0,-0.2,-5]\n",
"num_bandits = len(bandits)\n",
"def pullBandit(bandit):\n",
" #Get a random number.\n",
" result = np.random.randn(1)\n",
" if result > bandit:\n",
" #return a positive reward.\n",
" return 1\n",
" else:\n",
" #return a negative reward.\n",
" return -1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Agent\n",
"The code below established our simple neural agent. It consists of a set of values for each of the bandits. Each value is an estimate of the value of the return from choosing the bandit. We use a policy gradient method to update the agent by moving the value for the selected action toward the recieved reward."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tf.reset_default_graph()\n",
"\n",
"#These two lines established the feed-forward part of the network. This does the actual choosing.\n",
"weights = tf.Variable(tf.ones([num_bandits]))\n",
"chosen_action = tf.argmax(weights,0)\n",
"\n",
"#The next six lines establish the training proceedure. We feed the reward and chosen action into the network\n",
"#to compute the loss, and use it to update the network.\n",
"reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)\n",
"action_holder = tf.placeholder(shape=[1],dtype=tf.int32)\n",
"responsible_weight = tf.slice(weights,action_holder,[1])\n",
"loss = -(tf.log(responsible_weight)*reward_holder)\n",
"optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)\n",
"update = optimizer.minimize(loss)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training the Agent"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will train our agent by taking actions in our environment, and recieving rewards. Using the rewards and actions, we can know how to properly update our network in order to more often choose actions that will yield the highest rewards over time."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running reward for the 4 bandits: [ 1. 0. 0. 0.]\n",
"Running reward for the 4 bandits: [ 0. -2. -1. 38.]\n",
"Running reward for the 4 bandits: [ 0. -4. -2. 83.]\n",
"Running reward for the 4 bandits: [ 0. -6. -1. 128.]\n",
"Running reward for the 4 bandits: [ 0. -8. 1. 172.]\n",
"Running reward for the 4 bandits: [ -1. -9. 2. 219.]\n",
"Running reward for the 4 bandits: [ -1. -10. 4. 264.]\n",
"Running reward for the 4 bandits: [ 0. -11. 4. 312.]\n",
"Running reward for the 4 bandits: [ 2. -10. 4. 357.]\n",
"Running reward for the 4 bandits: [ 2. -9. 4. 406.]\n",
"Running reward for the 4 bandits: [ 0. -11. 4. 448.]\n",
"Running reward for the 4 bandits: [ -1. -10. 3. 495.]\n",
"Running reward for the 4 bandits: [ -3. -10. 2. 540.]\n",
"Running reward for the 4 bandits: [ -3. -10. 3. 585.]\n",
"Running reward for the 4 bandits: [ -3. -8. 3. 629.]\n",
"Running reward for the 4 bandits: [ -2. -7. 1. 673.]\n",
"Running reward for the 4 bandits: [ -4. -7. 2. 720.]\n",
"Running reward for the 4 bandits: [ -4. -7. 3. 769.]\n",
"Running reward for the 4 bandits: [ -6. -8. 3. 814.]\n",
"Running reward for the 4 bandits: [ -7. -7. 3. 858.]\n",
"The agent thinks bandit 4 is the most promising....\n",
"...and it was right!\n"
]
}
],
"source": [
"total_episodes = 1000 #Set total number of episodes to train agent on.\n",
"total_reward = np.zeros(num_bandits) #Set scoreboard for bandits to 0.\n",
"e = 0.1 #Set the chance of taking a random action.\n",
"\n",
"init = tf.initialize_all_variables()\n",
"\n",
"# Launch the tensorflow graph\n",
"with tf.Session() as sess:\n",
" sess.run(init)\n",
" i = 0\n",
" while i < total_episodes:\n",
" \n",
" #Choose either a random action or one from our network.\n",
" if np.random.rand(1) < e:\n",
" action = np.random.randint(num_bandits)\n",
" else:\n",
" action = sess.run(chosen_action)\n",
" \n",
" reward = pullBandit(bandits[action]) #Get our reward from picking one of the bandits.\n",
" \n",
" #Update the network.\n",
" _,resp,ww = sess.run([update,responsible_weight,weights], feed_dict={reward_holder:[reward],action_holder:[action]})\n",
" \n",
" #Update our running tally of scores.\n",
" total_reward[action] += reward\n",
" if i % 50 == 0:\n",
" print \"Running reward for the \" + str(num_bandits) + \" bandits: \" + str(total_reward)\n",
" i+=1\n",
"print \"The agent thinks bandit \" + str(np.argmax(ww)+1) + \" is the most promising....\"\n",
"if np.argmax(ww) == np.argmax(-np.array(bandits)):\n",
" print \"...and it was right!\"\n",
"else:\n",
" print \"...and it was wrong!\""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
@qiaoruntao

Maybe there is a typo: should "proceedure" in the third part of the code be "procedure"?

@mynameisvinn

mynameisvinn commented Oct 15, 2017

the action is currently selected in a deterministic manner:

chosen_action = tf.argmax(weights,0)

As this is a policy network, shouldn't the action be drawn from a probability distribution, in a stochastic manner? It could look something like this:

# first, convert raw weights to softmax probs
softmax_probs = tf.nn.softmax(weights)

# then, draw from probability distribution 
possible_actions = tf.convert_to_tensor([0,1,2,3])  # indices for possible actions
samples = tf.multinomial(tf.log([softmax_probs]), 1)   # draw according to weights
chosen_action = possible_actions[tf.cast(samples[0][0], tf.int32)]

(not trying to overcomplicate things - just trying to understand the thought process behind this helpful example.)
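
For completeness, a minimal sketch of how that sampled action could be wired back into the rest of the notebook (assuming TF 1.x; the loss here is taken on the softmax probability of the chosen arm rather than on the raw weight, which is the usual REINFORCE form, not exactly what the tutorial does):

import tensorflow as tf

tf.reset_default_graph()
weights = tf.Variable(tf.ones([4]))                                # one weight per arm, as in the notebook
softmax_probs = tf.nn.softmax(weights)                             # raw weights -> probability distribution
chosen_action = tf.multinomial(tf.log([softmax_probs]), 1)[0][0]   # stochastic draw instead of argmax

reward_holder = tf.placeholder(shape=[1], dtype=tf.float32)
action_holder = tf.placeholder(shape=[1], dtype=tf.int32)
responsible_prob = tf.slice(softmax_probs, action_holder, [1])
loss = -(tf.log(responsible_prob) * reward_holder)                 # -log pi(a) * reward
update = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)

The training loop would stay the same: sess.run(chosen_action) to sample an arm, pull the bandit, then feed the sampled index and reward through action_holder and reward_holder.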

@tywadd

tywadd commented Oct 20, 2017

@mynameisvinn
I like your solution, also. The example does something similar in the lines

if np.random.rand(1) < e:
    action = np.random.randint(num_bandits)

I suppose it's still doing it stochastically, since the randn and randint draw from some distribution. It might be a neat experiment to try each and compare the differences.

@retnuh

retnuh commented Dec 8, 2017

@mynameisvinn - I believe you could do what you're suggesting, and it would be interesting to see the results & the differences in the learning rates, but Arthur does explicitly say in the post that he's using an "e-greedy" policy:

To update our network, we will simply try an arm with an e-greedy policy. This means that most of the time our agent will choose the action that corresponds to the largest expected value, but occasionally, with e probability, it will choose randomly.

@fredthedead

The lower the bandit number, the more likely a positive reward will be returned

vs.

Currently bandit 4 (index#3) is set to most often provide a positive reward.

Shouldn't the first line be the opposite? the higher the bandit number the more likely a positive reward will be returned?

@jackleekopij

A great tutorial!

I understand this is an introductory tutorial; however, I found it interesting to explore boundary conditions by playing with the reward thresholds (bandits) used in the pullBandit reward function along with the epsilon-greedy parameter. Tweaking these parameters and observing which bandit is judged most promising proved a great exercise for understanding the sensitivities of the algorithm.

Thanks for the post, awjuliani.

@bahriddin

I tried with these settings:

bandits = [-0.9, 0, -0.2, -1]
total_episodes = 100000
learning_rate=.01/total_episodes

But it still can't find the global optimum. Are there any suggestions to improve the algorithm?
Regards!

@JaeDukSeo

One of the reasons this example might be confusing is that TF optimizers can only minimize when performing auto differentiation. That is why the values are flipped, with -5 being the best bandit.

@JaeDukSeo

@bahriddin that is due to the first selection choice: remember we initialize all of the weights to one, hence the argmax is 0. And since e is a small number (0.1), we are not going to explore much, so the agent will most likely keep choosing the first bandit and be wrong. If you increase the e value, it will do better.
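
A self-contained way to test that (TF 1.x, @bahriddin's bandits, the notebook's original learning rate of 0.001, and an epsilon that starts at 1.0 and decays toward 0.05; the schedule values are illustrative, not tuned):

import tensorflow as tf
import numpy as np

bandits = [-0.9, 0, -0.2, -1]
num_bandits = len(bandits)

def pullBandit(bandit):
    # +1 whenever a standard normal draw exceeds the bandit's threshold
    return 1 if np.random.randn(1) > bandit else -1

tf.reset_default_graph()
weights = tf.Variable(tf.ones([num_bandits]))
chosen_action = tf.argmax(weights, 0)
reward_holder = tf.placeholder(shape=[1], dtype=tf.float32)
action_holder = tf.placeholder(shape=[1], dtype=tf.int32)
responsible_weight = tf.slice(weights, action_holder, [1])
loss = -(tf.log(responsible_weight) * reward_holder)
update = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)

e, e_min, decay = 1.0, 0.05, 0.999       # explore heavily at first, then anneal
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for i in range(5000):
        if np.random.rand(1) < e:
            action = np.random.randint(num_bandits)
        else:
            action = sess.run(chosen_action)
        reward = pullBandit(bandits[action])
        _, ww = sess.run([update, weights],
                         feed_dict={reward_holder: [reward], action_holder: [action]})
        e = max(e_min, e * decay)
    print("agent's pick: bandit %d" % (np.argmax(ww) + 1))

With the decay the agent spends its early episodes sampling every arm, which lets the weights for the closely matched arms (-0.9 vs -1) separate before exploitation takes over; with thresholds this close it may still need more episodes or a larger learning rate to settle on the right arm consistently.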

@dhl8282

dhl8282 commented Jul 10, 2018

@fredthedead

About

#List out our bandits. Currently bandit 4 (index#3) is set to most often provide a positive reward.
bandits = [0.2,0,-0.2,-5]

The pullBandit function is defined as:

def pullBandit(bandit):
    #Get a random number.
    result = np.random.randn(1)
    if result > bandit:
        #return a positive reward.
        return 1
    else:
        #return a negative reward.
        return -1

If you look carefully, result gives you a random positive or negative number. Since bandits[3] = -5 is a much more generous threshold than bandits[1] = 0, bandits[3] gives the best chance of a positive reward. Try this code and you will see:
for i in range(100): print np.random.randn(1)
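
Or, more directly, a Monte Carlo estimate of the chance of a +1 reward for each arm (numpy only; the numbers are approximate):

import numpy as np

bandits = [0.2, 0, -0.2, -5]
draws = np.random.randn(100000)          # the same distribution pullBandit samples from
for i, b in enumerate(bandits):
    # reward is +1 whenever the draw exceeds the bandit's value
    print("bandit %d: P(+1) ~= %.2f" % (i + 1, (draws > b).mean()))
# roughly 0.42, 0.50, 0.58, 1.00

So "lower" in the tutorial text refers to the bandit's threshold value, not its position in the list: bandits[3] = -5 is the most generous arm and pays +1 on almost every pull.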

@mapa17

mapa17 commented Sep 1, 2018

Hi,

I have trouble understanding how the optimizer can tune the weights variable.

To my understanding the optimizer will try to minimize the loss function (toward a target loss of 0.0), but in the example above the weights start out at 1.0, so the initial loss value is already 0.0:

loss = -(log(weight) * reward) = - (0.0 * reward) = - 0.0

weights = tf.Variable(tf.ones([num_bandits]))
...
responsible_weight = tf.slice(weights,action_holder,[1])
loss = -(tf.log(responsible_weight)*reward_holder)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)

What am I missing or getting wrong?

thx,
Manuel

@wzzhu

wzzhu commented Oct 29, 2019

To my understanding the optimizer will try to minimize the loss function (toward a target loss of 0.0), but in the example above the weights start out at 1.0, so the initial loss value is already 0.0:

loss = -(log(weight) * reward) = - (0.0 * reward) = - 0.0

What am I missing or getting wrong?

f(x) = 0 doesn't mean f'(x) = 0. As long as the gradient of the loss is not 0, the weights will eventually change.

In the example, gradient(loss) = gradient(-log(weight) * reward) = -reward * 1/weight (since d[ln x]/dx = 1/x and reward is constant), so gradient(loss) at weight = 1 is -reward * 1/1 = -reward.

If reward == 1 (positive feedback), the gradient is -1, so gradient descent will subtract learning_rate * gradient, which is equivalent to adding the learning_rate of 0.001. The new weight becomes 1.001, giving that arm a slightly higher chance of being selected by argmax(weights). And so on.
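
A quick numeric check of that last step (TF 1.x, a single gradient-descent step with the reward fixed at +1):

import tensorflow as tf

w = tf.Variable(tf.ones([1]))
reward = tf.constant([1.0])
loss = -(tf.log(w) * reward)
step = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    sess.run(step)
    print(sess.run(w))                   # [1.001]: gradient at w = 1 is -1, so w moves up by 0.001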
