@dhpollack
Forked from awjuliani/SimplePolicy.ipynb
Last active February 9, 2017 12:23
Policy gradient method for solving n-armed bandit problems.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Simple Reinforcement Learning in Tensorflow Part 1: \n",
"## The Multi-armed bandit\n",
"This tutorial contains a simple example of how to build a policy-gradient based agent that can solve the multi-armed bandit problem. For more information, see this [Medium post](https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-1-fd544fab149).\n",
"\n",
"For more Reinforcement Learning algorithms, including DQN and Model-based learning in Tensorflow, see my Github repo, [DeepRL-Agents](https://github.com/awjuliani/DeepRL-Agents). "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Bandits\n",
"Here we define our bandits. For this example we are using a four-armed bandit. The pullBandit function generates a random number from a normal distribution with a mean of 0. The lower the bandit number, the more likely a positive reward will be returned. We want our agent to learn to always choose the bandit that will give that positive reward."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#List out our bandits. Currently bandit 4 (index#3) is set to most often provide a positive reward.\n",
"bandits = [0.2, 0., -0.2, -5]\n",
"num_bandits = len(bandits)\n",
"\n",
"def pullBandit(bandit):\n",
" #Get a random number.\n",
" result = np.random.randn()\n",
" if result > bandit:\n",
" #return a positive reward.\n",
" return 1.\n",
" else:\n",
" #return a negative reward.\n",
" return -1."
]
},
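{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick optional sanity check, we can estimate each bandit's chance of paying off by pulling it many times. Because the reward is positive whenever a standard-normal draw exceeds the bandit's value, the -5 bandit should pay off on essentially every pull. The sample count of 10000 below is an arbitrary choice."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Sanity check (not part of the agent itself): sample each bandit to see how often it pays off.\n",
"#10000 pulls per bandit is an arbitrary choice.\n",
"samples = 10000\n",
"for i, b in enumerate(bandits):\n",
"    wins = sum(pullBandit(b) == 1. for _ in range(samples))\n",
"    print(\"Bandit \" + str(i+1) + \" (value \" + str(b) + \"): positive reward on \" + str(100. * wins / samples) + \"% of pulls\")"
]
},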
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Agent\n",
"The code below established our simple neural agent. It consists of a set of values for each of the bandits. Each value is an estimate of the value of the return from choosing the bandit. We use a policy gradient method to update the agent by moving the value for the selected action toward the recieved reward."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tf.reset_default_graph()\n",
"\n",
"#These two lines established the feed-forward part of the network. This does the actual choosing.\n",
"weights = tf.Variable(tf.ones([num_bandits]))\n",
"chosen_action = tf.argmax(weights,0)\n",
"\n",
"#The next six lines establish the training proceedure. We feed the reward and chosen action into the network\n",
"#to compute the loss, and use it to update the network.\n",
"reward_holder = tf.placeholder(shape=[1], dtype=tf.float32)\n",
"action_holder = tf.placeholder(shape=[1], dtype=tf.int32)\n",
"responsible_weight = tf.slice(weights, action_holder, [1])\n",
"loss = -(tf.log(responsible_weight)*reward_holder)\n",
"optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.001)\n",
"update = optimizer.minimize(loss)"
]
},
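{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why this loss moves the weights in the right direction: for the chosen action $a$ the loss is $L = -\\log(w_a)\\,R$, so $\\partial L / \\partial w_a = -R / w_a$. A gradient-descent step therefore changes the chosen weight by $+\\eta\\,R / w_a$, where $\\eta$ is the learning rate. As long as $w_a$ stays positive (the weights start at 1), a positive reward increases $w_a$ and a negative reward decreases it, so the argmax over the weights comes to favor bandits that have paid off."
]
},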
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training the Agent"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will train our agent by taking actions in our environment, and recieving rewards. Using the rewards and actions, we can know how to properly update our network in order to more often choose actions that will yield the highest rewards over time."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running reward for the 4 bandits: [ 1. 0. 0. 0.]\n",
"Running reward for the 4 bandits: [ 0. -2. -1. 38.]\n",
"Running reward for the 4 bandits: [ 0. -4. -2. 83.]\n",
"Running reward for the 4 bandits: [ 0. -6. -1. 128.]\n",
"Running reward for the 4 bandits: [ 0. -8. 1. 172.]\n",
"Running reward for the 4 bandits: [ -1. -9. 2. 219.]\n",
"Running reward for the 4 bandits: [ -1. -10. 4. 264.]\n",
"Running reward for the 4 bandits: [ 0. -11. 4. 312.]\n",
"Running reward for the 4 bandits: [ 2. -10. 4. 357.]\n",
"Running reward for the 4 bandits: [ 2. -9. 4. 406.]\n",
"Running reward for the 4 bandits: [ 0. -11. 4. 448.]\n",
"Running reward for the 4 bandits: [ -1. -10. 3. 495.]\n",
"Running reward for the 4 bandits: [ -3. -10. 2. 540.]\n",
"Running reward for the 4 bandits: [ -3. -10. 3. 585.]\n",
"Running reward for the 4 bandits: [ -3. -8. 3. 629.]\n",
"Running reward for the 4 bandits: [ -2. -7. 1. 673.]\n",
"Running reward for the 4 bandits: [ -4. -7. 2. 720.]\n",
"Running reward for the 4 bandits: [ -4. -7. 3. 769.]\n",
"Running reward for the 4 bandits: [ -6. -8. 3. 814.]\n",
"Running reward for the 4 bandits: [ -7. -7. 3. 858.]\n",
"The agent thinks bandit 4 is the most promising....\n",
"...and it was right!\n"
]
}
],
"source": [
"total_rounds = 1000\n",
"total_reward = tf.Variable(np.zeros(num_bandits), dtype=tf.float32) # Set scoreboard for bandits to 0... changed from tf.zeros() \n",
"update_reward = tf.scatter_add(total_reward,[action_holder],[reward_holder]) # Update scoreboard, simple assignment did not work\n",
"e = 0.5 #Set the chance of taking a random action.\n",
"\n",
"\n",
"init = tf.global_variables_initializer() # updated due to deprecation message\n",
"\n",
"with tf.Session() as sess:\n",
" sess.run(init)\n",
" i = 0\n",
" while i < total_rounds:\n",
" #Choose either a random action or one from our network.\n",
" if np.random.rand() < e:\n",
" action = np.random.randint(num_bandits)\n",
" else:\n",
" action = sess.run(chosen_action)\n",
" \n",
" reward = pullBandit(bandits[action])\n",
" \n",
" #Update the network.\n",
" _, resp, ww = sess.run([update, responsible_weight, weights], feed_dict = {reward_holder: [reward], action_holder: [action]})\n",
" \n",
" sess.run(update_reward, feed_dict = {reward_holder: [reward], action_holder: [action]})\n",
" \n",
" if i%50 == 0:\n",
" print(\"Running reward for the \" + str(num_bandits) + \" bandits: \" + str(sess.run(total_reward)))\n",
" i += 1\n",
"\n",
"print(\"The agent thinks bandit \" + str(np.argmax(ww)+1) + \" is the most promising....\")\n",
"if np.argmax(ww) == np.argmax(-np.array(bandits)):\n",
" print(\"...and it was right!\")\n",
"else:\n",
" print(\"...and it was wrong!\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Anaconda (Python 3)",
"language": "python",
"name": "anaconda3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
@dhpollack
Author

using Python 3 with TensorFlow r0.12.
