Reinforcement Learning Tutorial 1 (Two-armed bandit problem)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reinforcement Learning in TensorFlow Tutorial 1\n",
"## The two-armed bandit"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import tensorflow as tf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Bandits"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we define our bandits. For this example we are using a two-armed bandit. The pullBandit function generates a random number from a normal distribution with a mean of 0. The lower the bandit number, the more likely a positive reward will be returned. We want our agent to learn to always choose the bandit that will give that positive reward." | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#List out our two bandits. Currently bandit 0 (value -0.5) is the optimal choice.\n",
"bandits = [-0.5,0.5]\n",
"def pullBandit(bandit):\n",
"    #Get a random number.\n",
"    result = np.random.randn(1)\n",
"    if result > bandit:\n",
"        #return a positive reward.\n",
"        return 1\n",
"    else:\n",
"        #return a negative reward.\n",
"        return -1"
]
},
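{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, we can estimate each bandit's chance of paying out by simulation. Since pullBandit compares a standard normal draw against the bandit's value, bandit 0 (value -0.5) should return a positive reward on roughly 69% of pulls, and bandit 1 (value 0.5) on roughly 31%. The cell below is a rough Monte Carlo sketch of that, separate from the training loop."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Rough Monte Carlo estimate of each bandit's payout probability (illustration only).\n",
"samples = 10000\n",
"for i, b in enumerate(bandits):\n",
"    wins = sum(1 for _ in range(samples) if pullBandit(b) == 1)\n",
"    print 'Bandit %d: positive reward on %.1f%% of pulls' % (i, 100.0 * wins / samples)"
]
},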
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The neural network agent\n",
"\n",
"Here we set up all the parameters that will be used for training our network."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# hyperparameters\n",
"learning_rate = 0.1\n",
"gamma = 0.99 # discount factor for reward (not used in this single-step bandit task)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we define our very simple neural network: a single trainable weight whose sigmoid output gives the probability of choosing bandit 1."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tf.reset_default_graph()\n",
"#While there aren't any states in this task, we still use an x input as a placeholder.\n",
"input_x = tf.placeholder(tf.float32, [None,1], name=\"input_x\")\n",
"W = tf.Variable(tf.random_normal([1,1]),name='W') # Our single trainable variable\n",
"score = tf.matmul(input_x,W)\n",
"probability = tf.nn.sigmoid(score) # This is the likelihood of choosing bandit 1 over bandit 0\n",
"\n",
"#Below we compute and set the gradients to use for adjusting the network towards a successful policy.\n",
"input_y = tf.placeholder(tf.float32,[None,1], name=\"input_y\")\n",
"rewardSig = tf.placeholder(tf.float32,name=\"reward_sig\")\n",
"\n",
"#The computation below is the key to processing the gradients properly.\n",
"theGrad = tf.gradients(probability,W,grad_ys=((input_y*rewardSig)/probability) - rewardSig)\n",
"\n",
"adam = tf.train.AdamOptimizer(learning_rate=learning_rate)\n",
"updateGrads = adam.apply_gradients([(theGrad[0],W)])"
]
},
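{
"cell_type": "markdown",
"metadata": {},
"source": [
"The grad_ys expression above is a hand-rolled way of weighting the gradient of the probability by the reward and by which arm was chosen. For reference, the sketch below shows a more conventional loss-based formulation of the same idea: weight the log-likelihood of the chosen arm by the reward and let the optimizer minimize its negative. It is an illustrative alternative, not an exact re-derivation of the expression above, and the training loop below uses updateGrads as written."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#A loss-based sketch of a reward-weighted policy update (illustration only; not used by the training loop).\n",
"#input_y marks arm 0, so the chosen arm's probability is (1 - probability) when input_y is 1.\n",
"chosen_prob = input_y * (1.0 - probability) + (1.0 - input_y) * probability\n",
"loss = -tf.reduce_mean(rewardSig * tf.log(chosen_prob))\n",
"trainStep = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)"
]
},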
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Running the agent and environment"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total reward 4.000000. Action: 1.000000. Prob Before 0.143842. Prob After 0.131960.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.131960. Prob After 0.122341.\n",
"Total reward 0.000000. Action: 0.000000. Prob Before 0.122341. Prob After 0.115324.\n",
"Total reward 6.000000. Action: 0.000000. Prob Before 0.115324. Prob After 0.107501.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.107501. Prob After 0.099853.\n",
"Total reward 0.000000. Action: 0.000000. Prob Before 0.099853. Prob After 0.093116.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.093116. Prob After 0.086765.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.086765. Prob After 0.080318.\n",
"Total reward 0.000000. Action: 0.000000. Prob Before 0.080318. Prob After 0.074615.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.074615. Prob After 0.068990.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.068990. Prob After 0.063545.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.063545. Prob After 0.058583.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.058583. Prob After 0.053903.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.053903. Prob After 0.049375.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.049375. Prob After 0.045196.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.045196. Prob After 0.041266.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.041266. Prob After 0.037747.\n",
"Total reward 8.000000. Action: 0.000000. Prob Before 0.037747. Prob After 0.034351.\n",
"Total reward 0.000000. Action: 0.000000. Prob Before 0.034351. Prob After 0.031632.\n",
"Total reward 10.000000. Action: 1.000000. Prob Before 0.031632. Prob After 0.028917.\n",
"Total reward 0.000000. Action: 0.000000. Prob Before 0.028917. Prob After 0.026569.\n",
"Total reward -2.000000. Action: 0.000000. Prob Before 0.026569. Prob After 0.024750.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.024750. Prob After 0.022972.\n",
"Total reward 6.000000. Action: 0.000000. Prob Before 0.022972. Prob After 0.021186.\n",
"Total reward 6.000000. Action: 0.000000. Prob Before 0.021186. Prob After 0.019408.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.019408. Prob After 0.017758.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.017758. Prob After 0.016229.\n",
"Total reward 6.000000. Action: 0.000000. Prob Before 0.016229. Prob After 0.014770.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.014770. Prob After 0.013436.\n",
"Total reward 6.000000. Action: 0.000000. Prob Before 0.013436. Prob After 0.012180.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.012180. Prob After 0.011061.\n",
"Total reward 0.000000. Action: 0.000000. Prob Before 0.011061. Prob After 0.010130.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.010130. Prob After 0.009267.\n",
"Total reward -4.000000. Action: 0.000000. Prob Before 0.009267. Prob After 0.008642.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.008642. Prob After 0.008035.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.008035. Prob After 0.007482.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.007482. Prob After 0.006977.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.006977. Prob After 0.006498.\n",
"Total reward 6.000000. Action: 0.000000. Prob Before 0.006498. Prob After 0.006010.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.006010. Prob After 0.005570.\n",
"Total reward 8.000000. Action: 0.000000. Prob Before 0.005570. Prob After 0.005113.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.005113. Prob After 0.004696.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.004696. Prob After 0.004325.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.004325. Prob After 0.003994.\n",
"Total reward -4.000000. Action: 0.000000. Prob Before 0.003994. Prob After 0.003756.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.003756. Prob After 0.003518.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.003518. Prob After 0.003284.\n",
"Total reward 0.000000. Action: 0.000000. Prob Before 0.003284. Prob After 0.003084.\n",
"Total reward 0.000000. Action: 0.000000. Prob Before 0.003084. Prob After 0.002913.\n",
"Total reward 0.000000. Action: 0.000000. Prob Before 0.002913. Prob After 0.002766.\n"
]
}
],
"source": [
"xs,drs,ys = [],[],[]\n",
"total_episodes = 500\n",
"running_reward = None\n",
"reward_sum = 0\n",
"episode_number = 1\n",
"\n",
"init = tf.initialize_all_variables()\n",
"\n",
"# Launch the graph\n",
"with tf.Session() as sess:\n",
"    sess.run(init)\n",
"\n",
"    while episode_number <= total_episodes:\n",
"        #Generate a placeholder state\n",
"        x = np.ones([1,1])\n",
"        xs.append(x)\n",
"\n",
"        # forward the policy network and sample an action from the returned probability\n",
"        prob = sess.run(probability,feed_dict={input_x:np.ones([1,1])})\n",
"        action = 1 if np.random.uniform() < prob else 0\n",
"\n",
"        y = 1 if action == 0 else 0 # a \"fake label\"\n",
"        ys.append(y)\n",
"\n",
"        # Take our action in the environment, and get a reward\n",
"        reward = np.float64(pullBandit(bandits[action]))\n",
"        reward_sum += reward\n",
"        drs.append(reward) # record reward\n",
"\n",
"        if episode_number % 10 == 0: # Periodically update the network policy\n",
"            epx = np.vstack(xs)\n",
"            epy = np.vstack(ys)\n",
"            epr = np.vstack(drs)\n",
"            xs,drs,ys = [],[],[] # reset array memory\n",
"\n",
"            # Update the network gradient towards choosing more ideal actions, given what it has observed.\n",
"            probBefore = sess.run(probability, feed_dict={input_x: epx, input_y: epy, rewardSig: epr})\n",
"            sess.run(updateGrads,feed_dict={input_x: epx, input_y: epy, rewardSig: epr})\n",
"            probAfter = sess.run(probability, feed_dict={input_x: epx, input_y: epy, rewardSig: epr})\n",
"\n",
"            # Keep a record of rewards, and give some feedback\n",
"            running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01\n",
"            print 'Total reward %f. Action: %f. Prob Before %f. Prob After %f.' % (reward_sum, action,probBefore[0],probAfter[0])\n",
"            reward_sum = 0\n",
"            prev_x = None\n",
"\n",
"        episode_number += 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"You should see that the agent learns to almost always choose action 0, and the probability of choosing action 1 decreases to near zero. Feel free to play with the two bandit values, and see how the agent changes what it learns."
]
},
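{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, one simple variation is to swap the two bandit values so that arm 1 becomes the better choice; re-running the bandit and training cells should then push the reported probability toward 1 instead of 0."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#One possible experiment: make bandit 1 the better arm, then re-run the training cell above.\n",
"bandits = [0.5,-0.5]"
]
},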
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}