@dhpollack
Forked from awjuliani/ContextualPolicy.ipynb
Created February 9, 2017 14:06
A Policy-Gradient algorithm that solves Contextual Bandit problems.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Simple Reinforcement Learning in Tensorflow Part 1.5: \n",
"## The Contextual Bandits\n",
"This tutorial contains a simple example of how to build a policy-gradient based agent that can solve the contextual bandit problem. For more information, see this [Medium post](https://medium.com/p/bff01d1aad9c).\n",
"\n",
"For more Reinforcement Learning algorithms, including DQN and Model-based learning in Tensorflow, see my Github repo, [DeepRL-Agents](https://github.com/awjuliani/DeepRL-Agents). "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"import tensorflow.contrib.slim as slim\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Contextual Bandits\n",
"Here we define our contextual bandits. In this example, we are using three four-armed bandit. What this means is that each bandit has four arms that can be pulled. Each bandit has different success probabilities for each arm, and as such requires different actions to obtain the best result. The pullBandit function generates a random number from a normal distribution with a mean of 0. The lower the bandit number, the more likely a positive reward will be returned. We want our agent to learn to always choose the bandit-arm that will most often give a positive reward, depending on the Bandit presented."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"class contextual_bandit():\n",
" def __init__(self):\n",
" self.state = 0\n",
" #List out our bandits. Currently arms 4, 2, and 1 (respectively) are the most optimal.\n",
" self.bandits = np.array([[0.2,0,-0.0,-5],[0.1,-5,1,0.25],[-5,5,5,5]])\n",
" self.num_bandits = self.bandits.shape[0]\n",
" self.num_actions = self.bandits.shape[1]\n",
" \n",
" def getBandit(self):\n",
" self.state = np.random.randint(0,len(self.bandits)) #Returns a random state for each episode.\n",
" return self.state\n",
" \n",
" def pullArm(self,action):\n",
" #Get a random number.\n",
" bandit = self.bandits[self.state,action]\n",
" result = np.random.randn(1)\n",
" if result > bandit:\n",
" #return a positive reward.\n",
" return 1\n",
" else:\n",
" #return a negative reward.\n",
" return -1"
]
},
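{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(Added illustration, not part of the original tutorial.)* Because `pullArm` compares a standard-normal draw against the chosen arm's value, lower values mean a higher chance of a +1 reward. The short check below pulls every arm of every bandit a few thousand times and prints the empirical success rate, which should match that intuition."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Added sanity check: empirically estimate P(+1 reward) for every arm of every bandit.\n",
"env = contextual_bandit()\n",
"n_pulls = 2000\n",
"for s in range(env.num_bandits):\n",
"    env.state = s #Fix the state so we can probe each bandit directly.\n",
"    rates = []\n",
"    for a in range(env.num_actions):\n",
"        rewards = [env.pullArm(a) for _ in range(n_pulls)]\n",
"        rates.append(round(np.mean([r == 1 for r in rewards]),2))\n",
"    print \"Bandit \" + str(s+1) + \" empirical P(+1) per arm: \" + str(rates)"
]
},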
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Policy-Based Agent\n",
"The code below established our simple neural agent. It takes as input the current state, and returns an action. This allows the agent to take actions which are conditioned on the state of the environment, a critical step toward being able to solve full RL problems. The agent uses a single set of weights, within which each value is an estimate of the value of the return from choosing a particular arm given a bandit. We use a policy gradient method to update the agent by moving the value for the selected action toward the recieved reward."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"class agent():\n",
" def __init__(self, lr, s_size,a_size):\n",
" #These lines established the feed-forward part of the network. The agent takes a state and produces an action.\n",
" self.state_in= tf.placeholder(shape=[1],dtype=tf.int32)\n",
" state_in_OH = slim.one_hot_encoding(self.state_in,s_size)\n",
" output = slim.fully_connected(state_in_OH,a_size,\\\n",
" biases_initializer=None,activation_fn=tf.nn.sigmoid,weights_initializer=tf.ones_initializer())\n",
" self.output = tf.reshape(output,[-1])\n",
" self.chosen_action = tf.argmax(self.output,0)\n",
"\n",
" #The next six lines establish the training proceedure. We feed the reward and chosen action into the network\n",
" #to compute the loss, and use it to update the network.\n",
" self.reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)\n",
" self.action_holder = tf.placeholder(shape=[1],dtype=tf.int32)\n",
" self.responsible_weight = tf.slice(self.output,self.action_holder,[1])\n",
" self.loss = -(tf.log(self.responsible_weight)*self.reward_holder)\n",
" optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr)\n",
" self.update = optimizer.minimize(self.loss)"
]
},
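{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(Added note, not part of the original tutorial.)* To see why this update moves the chosen value toward the received reward, note that the network's output for the selected arm is $\\pi = \\sigma(w_{s,a})$, the sigmoid of a single weight, and the loss is $L = -\\log(\\pi)\\,R$. Since $\\frac{d}{dw}\\log\\sigma(w) = 1-\\sigma(w)$, gradient descent gives the update\n",
"$$w_{s,a} \\leftarrow w_{s,a} + \\alpha\\,R\\,(1-\\sigma(w_{s,a})),$$\n",
"so a positive reward pushes the weight (and thus the output) for that state-action pair up, and a negative reward pushes it down."
]
},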
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training the Agent"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will train our agent by getting a state from the environment, take an action, and recieve a reward. Using these three things, we can know how to properly update our network in order to more often choose actions given states that will yield the highest rewards over time."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean reward for the 3 bandits: [ 0. -0.25 0. ]\n",
"Mean reward for the 3 bandits: [ 9. 42. 33.75]\n",
"Mean reward for the 3 bandits: [ 45.5 80. 67.75]\n",
"Mean reward for the 3 bandits: [ 86.25 116.75 101.25]\n",
"Mean reward for the 3 bandits: [ 122.5 153.25 139.5 ]\n",
"Mean reward for the 3 bandits: [ 161.75 186.25 179.25]\n",
"Mean reward for the 3 bandits: [ 201. 224.75 216. ]\n",
"Mean reward for the 3 bandits: [ 240.25 264. 250. ]\n",
"Mean reward for the 3 bandits: [ 280.25 301.75 285.25]\n",
"Mean reward for the 3 bandits: [ 317.75 340.25 322.25]\n",
"Mean reward for the 3 bandits: [ 356.5 377.5 359.25]\n",
"Mean reward for the 3 bandits: [ 396.25 415.25 394.75]\n",
"Mean reward for the 3 bandits: [ 434.75 451.5 430.5 ]\n",
"Mean reward for the 3 bandits: [ 476.75 490. 461.5 ]\n",
"Mean reward for the 3 bandits: [ 513.75 533.75 491.75]\n",
"Mean reward for the 3 bandits: [ 548.25 572. 527.5 ]\n",
"Mean reward for the 3 bandits: [ 587.5 610.75 562. ]\n",
"Mean reward for the 3 bandits: [ 628.75 644.25 600.25]\n",
"Mean reward for the 3 bandits: [ 665.75 684.75 634.75]\n",
"Mean reward for the 3 bandits: [ 705.75 719.75 668.25]\n",
"The agent thinks action 4 for bandit 1 is the most promising....\n",
"...and it was right!\n",
"The agent thinks action 2 for bandit 2 is the most promising....\n",
"...and it was right!\n",
"The agent thinks action 1 for bandit 3 is the most promising....\n",
"...and it was right!\n"
]
}
],
"source": [
"tf.reset_default_graph() #Clear the Tensorflow graph.\n",
"\n",
"cBandit = contextual_bandit() #Load the bandits.\n",
"myAgent = agent(lr=0.001,s_size=cBandit.num_bandits,a_size=cBandit.num_actions) #Load the agent.\n",
"weights = tf.trainable_variables()[0] #The weights we will evaluate to look into the network.\n",
"\n",
"total_episodes = 10000 #Set total number of episodes to train agent on.\n",
"total_reward = np.zeros([cBandit.num_bandits,cBandit.num_actions]) #Set scoreboard for bandits to 0.\n",
"e = 0.1 #Set the chance of taking a random action.\n",
"\n",
"init = tf.initialize_all_variables()\n",
"\n",
"# Launch the tensorflow graph\n",
"with tf.Session() as sess:\n",
" sess.run(init)\n",
" i = 0\n",
" while i < total_episodes:\n",
" s = cBandit.getBandit() #Get a state from the environment.\n",
" \n",
" #Choose either a random action or one from our network.\n",
" if np.random.rand(1) < e:\n",
" action = np.random.randint(cBandit.num_actions)\n",
" else:\n",
" action = sess.run(myAgent.chosen_action,feed_dict={myAgent.state_in:[s]})\n",
" \n",
" reward = cBandit.pullArm(action) #Get our reward for taking an action given a bandit.\n",
" \n",
" #Update the network.\n",
" feed_dict={myAgent.reward_holder:[reward],myAgent.action_holder:[action],myAgent.state_in:[s]}\n",
" _,ww = sess.run([myAgent.update,weights], feed_dict=feed_dict)\n",
" \n",
" #Update our running tally of scores.\n",
" total_reward[s,action] += reward\n",
" if i % 500 == 0:\n",
" print \"Mean reward for each of the \" + str(cBandit.num_bandits) + \" bandits: \" + str(np.mean(total_reward,axis=1))\n",
" i+=1\n",
"for a in range(cBandit.num_bandits):\n",
" print \"The agent thinks action \" + str(np.argmax(ww[a])+1) + \" for bandit \" + str(a+1) + \" is the most promising....\"\n",
" if np.argmax(ww[a]) == np.argmin(cBandit.bandits[a]):\n",
" print \"...and it was right!\"\n",
" else:\n",
" print \"...and it was wrong!\""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
@dhpollack
Author

Here's a Python 3 version of the tutorial.

import tensorflow as tf
import tensorflow.contrib.slim as slim
import numpy as np

class contextual_bandit(object):
    def __init__(self):
        self.state=0
        self.bandits=np.array([[0.2,0.,-0.0,-5.],[0.1,-5.,1.,0.25],[-5.,5.,5.,5.]])
        self.num_bandits=self.bandits.shape[0]
        self.num_actions=self.bandits.shape[1]
        
    def getBandit(self):
        self.state = np.random.randint(len(self.bandits))
        return(self.state)
    
    def pullArm(self, action):
        bandit = self.bandits[self.state,action]
        result = np.random.randn()
        if result > bandit:
            return 1.
        else:
            return -1.

class agent(object):
    def __init__(self, lr, s_size, a_size):
        #These lines establish the feed-forward part of the network. The agent takes a state and produces an action.
        self.state_in=tf.placeholder(shape=[1], dtype=tf.int32)
        state_in_OH=slim.one_hot_encoding(self.state_in, s_size)
        output=slim.fully_connected(state_in_OH, a_size, 
                                    biases_initializer=None, 
                                    activation_fn=tf.nn.sigmoid, 
                                    weights_initializer=tf.ones_initializer())
        self.output=tf.reshape(output, [-1])
        self.chosen_action=tf.argmax(self.output, 0)
        
        #The next six lines establish the training procedure. We feed the reward and chosen action into the network
        #to compute the loss, and use it to update the network.
    
        self.reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)
        self.action_holder = tf.placeholder(shape=[1],dtype=tf.int32)
        self.responsible_weight = tf.slice(self.output,self.action_holder,[1])
        self.loss = -(tf.log(self.responsible_weight)*self.reward_holder)
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr)
        self.update = optimizer.minimize(self.loss)
        

tf.reset_default_graph()

cBandit = contextual_bandit()
myAgent = agent(lr=0.001, s_size=cBandit.num_bandits, a_size=cBandit.num_actions)
weights = tf.trainable_variables()[0]

total_rounds = 10000
#total_reward = tf.Variable(np.zeros((cBandit.num_bandits, cBandit.num_actions)))
#update_reward = tf.scatter_add(total_reward,[action_holder],[reward_holder])
total_reward = np.zeros((cBandit.num_bandits, cBandit.num_actions))
e = 0.1

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    i = 0
    while i < total_rounds:
        s = cBandit.getBandit() # get state
        if np.random.rand() < e:
            action = np.random.randint(cBandit.num_actions) # explore: pick a random arm, not a random bandit
        else:
            action = sess.run(myAgent.chosen_action, feed_dict={myAgent.state_in:[s]})
        
        reward = cBandit.pullArm(action)
        
        #Update the network.
        fd = {myAgent.reward_holder:[reward], myAgent.action_holder:[action], myAgent.state_in:[s]}
        _, ww = sess.run([myAgent.update, weights], feed_dict = fd)
        
        #Update our running tally of scores.
        #sess.run(update_reward, feed_dict = {reward_holder: [reward], action_holder: [action]}) # need to feed variables into scoreboard update
        total_reward[s,action] += reward
        
        if i%(total_rounds//10) == 0:
            #print("Running reward for the " + str(cBandit.num_bandits) + " bandits: " + str(sess.run(total_reward))) # using sess.run to print variable
            print("Running reward for the " + str(cBandit.num_bandits) + " bandits: " + str(total_reward)) # why a mean?
        i += 1

for a in range(cBandit.num_bandits):
    print("The agent thinks action " + str(np.argmax(ww[a])+1) + " for bandit " + str(a+1) + " is the most promising....")
    if np.argmax(ww[a]) == np.argmin(cBandit.bandits[a]):
        print("...and it was right!")
    else:
        print("...and it was wrong!")
