@xkrishnam
Last active September 29, 2022 06:17
TensorFlow 2 implementation of a policy gradient method for solving n-armed bandit problems.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Contextual Bandits\n",
"This tutorial contains a simple example of how to build a policy-gradient based agent that can solve the contextual bandit problem. "
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"import numpy as np\n",
"from tensorflow import keras\n",
"from tensorflow.keras import layers\n",
"import datetime"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### The Contextual Bandits\n",
"Here we define our contextual bandits. In this example, we are using three four-armed bandit. What this means is that each bandit has four arms that can be pulled. Each bandit has different success probabilities for each arm, and as such requires different actions to obtain the best result. The pullBandit function generates a random number from a normal distribution with a mean of 0. The lower the bandit number, the more likely a positive reward will be returned. We want our agent to learn to always choose the bandit-arm that will most often give a positive reward, depending on the Bandit presented."
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [],
"source": [
"class contextual_bandit():\n",
" def __init__(self):\n",
" self.state = 0\n",
" #List out our bandits. Currently arms 4, 2, and 1 (respectively) are the most optimal.\n",
" self.bandits = np.array([[0.2,0,-0.0,-5],[0.1,-5,1,0.25],[-5,5,5,5]])\n",
" self.num_bandits = self.bandits.shape[0]\n",
" self.num_actions = self.bandits.shape[1]\n",
" \n",
" def getBandit(self):\n",
" self.state = np.random.randint(0,len(self.bandits)) #Returns a random state for each episode.\n",
" return self.state\n",
" \n",
" def pullArm(self,action):\n",
" #Get a random number.\n",
" bandit = self.bandits[self.state,action]\n",
" result = np.random.randn(1)\n",
" if result > bandit:\n",
" #return a positive reward.\n",
" return 1\n",
" else:\n",
" #return a negative reward.\n",
" return -1"
]
},
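{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check of the environment (not part of the original tutorial; the 2000-sample count below is an arbitrary choice): for each bandit we estimate the empirical probability that each arm returns a positive reward. The arm with the lowest value should come out highest, i.e. arms 4, 2 and 1 for bandits 1, 2 and 3 respectively."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"env = contextual_bandit()\n",
"for b in range(env.num_bandits):\n",
"    env.state = b  #fix the state so every pull uses bandit b\n",
"    #Empirical probability of a positive reward for each arm of this bandit.\n",
"    probs = [np.mean([env.pullArm(a) == 1 for _ in range(2000)]) for a in range(env.num_actions)]\n",
"    print(\"Bandit\", b + 1, \"positive-reward probabilities:\", np.round(probs, 2))"
]
},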
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### The Policy-Based Agent\n",
"The code below established our simple neural agent. It takes as input the current state, and returns an action. This allows the agent to take actions which are conditioned on the state of the environment, a critical step toward being able to solve full RL problems. The agent uses a single set of weights, within which each value is an estimate of the value of the return from choosing a particular arm given a bandit. We use a policy gradient method to update the agent by moving the value for the selected action toward the recieved reward."
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [],
"source": [
"cBandit = contextual_bandit()\n",
"layer1 = layers.Dense(cBandit.num_actions, input_shape=(cBandit.num_bandits,), activation='sigmoid', use_bias=False, kernel_initializer=tf.ones_initializer())\n",
"optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)"
]
},
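{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the input is a one-hot encoding of the state, the agent's output for bandit $s$ is just the sigmoid of row $s$ of the layer's weight matrix (shape num_bandits x num_actions). The short check below is not part of the original notebook; it only makes that correspondence explicit, which is what the evaluation at the end relies on when it reads the weights with get_weights()."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s_check = 0  #any valid state index\n",
"one_hot_state = tf.one_hot([s_check], cBandit.num_bandits)\n",
"_ = layer1(one_hot_state)  #call once so the layer builds its weights\n",
"W = layer1.get_weights()[0]  #shape: (num_bandits, num_actions)\n",
"print(layer1(one_hot_state).numpy()[0])  #network output for state s_check\n",
"print(tf.sigmoid(W[s_check]).numpy())  #sigmoid of the corresponding weight row"
]
},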
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [],
"source": [
"def loss(opt,act):\n",
" sliced = tf.slice(opt,[act],[1])\n",
" logged = -(tf.math.log(sliced))\n",
" multiplied = tf.multiply(logged, reward)\n",
" return multiplied\n",
"\n",
"total_episodes = 10000#Set total number of episodes to train agent on.\n",
"total_reward = np.zeros([cBandit.num_bandits,cBandit.num_actions]) #Set scoreboard for bandits to 0.\n",
"train_loss = tf.keras.metrics.Mean('train_loss', dtype=tf.float32)\n",
"current_time = datetime.datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n",
"train_log_dir = 'logs/gradient_tape/' + current_time + '/train'\n",
"train_summary_writer = tf.summary.create_file_writer(train_log_dir)\n",
"e = 0.1 #Set the chance of taking a random action.\n",
"i = 0\n"
]
},
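{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loss defined above is a single-step policy-gradient objective: writing $w_a$ for the network output of the chosen arm $a$ (a sigmoid, so not a normalised distribution) and $r$ for the received reward,\n",
"\n",
"$$L = -\\log(w_a)\\, r$$\n",
"\n",
"so gradient descent pushes $w_a$ up when $r = +1$ and down when $r = -1$. Note that loss() reads the global variable reward, which the training loop below sets before calling it."
]
},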
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean reward for each of the 3 bandits: [0. 0.25 0. ]\n",
"Mean reward for each of the 3 bandits: [36. 42. 31.75]\n",
"Mean reward for each of the 3 bandits: [75.75 80.25 70.75]\n",
"Mean reward for each of the 3 bandits: [113.75 122.25 106.75]\n",
"Mean reward for each of the 3 bandits: [149. 162.5 143.25]\n",
"Mean reward for each of the 3 bandits: [186. 205. 180.25]\n",
"Mean reward for each of the 3 bandits: [223.25 242.75 215.75]\n",
"Mean reward for each of the 3 bandits: [264.5 278.5 251.25]\n",
"Mean reward for each of the 3 bandits: [303.5 309.5 292.75]\n",
"Mean reward for each of the 3 bandits: [341. 347.5 325.75]\n",
"Mean reward for each of the 3 bandits: [378.75 382.25 367.25]\n",
"Mean reward for each of the 3 bandits: [420.5 422.75 400.5 ]\n",
"Mean reward for each of the 3 bandits: [461.75 458.25 434.75]\n",
"Mean reward for each of the 3 bandits: [497.75 498. 471. ]\n",
"Mean reward for each of the 3 bandits: [537. 535. 505.25]\n",
"Mean reward for each of the 3 bandits: [573.5 573.25 540.5 ]\n",
"Mean reward for each of the 3 bandits: [613. 611.75 574. ]\n",
"Mean reward for each of the 3 bandits: [653.25 647.5 609.5 ]\n",
"Mean reward for each of the 3 bandits: [692.5 685.75 639. ]\n",
"Mean reward for each of the 3 bandits: [731.25 718.75 675.75]\n",
"The agent thinks action 4 for bandit 1 is the most promising....\n",
"...and it was right!\n",
"The agent thinks action 2 for bandit 2 is the most promising....\n",
"...and it was right!\n",
"The agent thinks action 1 for bandit 3 is the most promising....\n",
"...and it was right!\n"
]
}
],
"source": [
"while i < total_episodes:\n",
" s = cBandit.getBandit() #Get a state from the environment.\n",
" #Choose either a random action or one from our network.\n",
" x = tf.one_hot([s],cBandit.num_bandits)\n",
" \n",
" with tf.GradientTape() as tape:\n",
" output = layer1(x)\n",
" output = tf.reshape(output,[-1])\n",
" action = tf.argmax(output,0)\n",
" action = np.int32(action)\n",
" if np.random.rand(1) < e:\n",
" action = np.random.randint(cBandit.num_actions)\n",
" reward = cBandit.pullArm(action) #Get our reward for taking an action given a bandit.\n",
" loss_value = loss(output,action)\n",
" \n",
" grads = tape.gradient(loss_value, layer1.trainable_variables)\n",
" optimizer.apply_gradients(zip(grads, layer1.trainable_variables))\n",
" train_loss(loss_value)\n",
" with train_summary_writer.as_default():\n",
" tf.summary.scalar('loss', train_loss.result(), step=i)\n",
" #Update our running tally of scores.\n",
" total_reward[s,action] += reward\n",
" if i % 500 == 0:\n",
" print (\"Mean reward for each of the \" + str(cBandit.num_bandits) + \" bandits: \" + str(np.mean(total_reward,axis=1)))\n",
" i+=1\n",
"for a in range(cBandit.num_bandits):\n",
" ww = layer1.get_weights()\n",
" print (\"The agent thinks action \" + str(np.argmax(ww[0][a])+1) + \" for bandit \" + str(a+1) + \" is the most promising....\")\n",
" if np.argmax(ww[0][a]) == np.argmin(cBandit.bandits[a]):\n",
" print (\"...and it was right!\")\n",
" else:\n",
" print (\"...and it was wrong!\")"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <iframe id=\"tensorboard-frame-aef7486fcea7c5f0\" width=\"100%\" height=\"800\" frameborder=\"0\">\n",
" </iframe>\n",
" <script>\n",
" (function() {\n",
" const frame = document.getElementById(\"tensorboard-frame-aef7486fcea7c5f0\");\n",
" const url = new URL(\"/\", window.location);\n",
" url.port = 6006;\n",
" frame.src = url;\n",
" })();\n",
" </script>\n",
" "
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%load_ext tensorboard\n",
"%tensorboard --logdir logs/gradient_tape"
]
},
{
"cell_type": "markdown",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ["#### see image in comment for loss graph"]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
@xkrishnam
Author

[image: loss graph]

@daniel-xion

daniel-xion commented Aug 22, 2022

Thanks for the code. May I know what the context is in this case (since it is called contextual)?
Also, I tried to add several more layers to the neural network (so it works better for a large number of bandits), but was unable to do so correctly.
In the code above, get_weights() for ww returns a 3x4 array. If a hidden layer of another size is added (say one of size 8), get_weights() on the last layer gives an 8x4 array, and one can then no longer use np.argmax(ww[0][a]) to find the best action from this new array.

@xkrishnam
Author

Context here simply means that the algorithm also considers information about the state of the environment (the context) when generating actions to get higher rewards (i.e. it does not just generate random actions and optimize the loss).

Second, you can add more layers, but the output of the last layer should be number_of_bandits * number_of_possible_actions, which means you can put the extra layers before the current first layer (i.e. layer1).

Or you can make the code more generic so it can be used more effectively; when I coded this, my only intention was to convert the existing TF1 solution to TF2.
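
For example, here is one possible (untested) sketch along those lines: build the deeper network as a small Keras model whose final layer still has num_actions outputs, train it with the same GradientTape loop as in the notebook (using model(x) and model.trainable_variables), and read off the greedy arm per bandit with a forward pass on each one-hot state instead of get_weights(). The hidden_units value is an arbitrary choice:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

cBandit = contextual_bandit()  # environment class from the notebook above
hidden_units = 8               # arbitrary hidden size

# Final layer still has num_actions outputs; extra capacity goes in front of it.
model = tf.keras.Sequential([
    layers.Dense(hidden_units, activation='relu', input_shape=(cBandit.num_bandits,)),
    layers.Dense(cBandit.num_actions, activation='sigmoid', use_bias=False),
])

# ...train with the same loop as above, replacing layer1(x) with model(x)
# and layer1.trainable_variables with model.trainable_variables...

# Evaluate by querying the network per state instead of inspecting weights.
for s in range(cBandit.num_bandits):
    x = tf.one_hot([s], cBandit.num_bandits)
    scores = model(x).numpy()[0]  # shape: (num_actions,)
    print("Bandit", s + 1, "-> greedy action", int(np.argmax(scores)) + 1)
```

Evaluating by forward pass keeps working no matter how many hidden layers you add, which avoids the 8x4 weight-matrix problem you ran into.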
