
@ruotianluo
Forked from awjuliani/rl-tutorial-1.ipynb
Created June 16, 2016 04:19
Reinforcement Learning Tutorial 1 (Two-armed bandit problem)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reinforcement Learning in Tensorflow Tutorial 1\n",
"## The two-armed bandit"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import tensorflow as tf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Bandits"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we define our bandits. For this example we are using a two-armed bandit. The pullBandit function generates a random number from a normal distribution with a mean of 0. The lower the bandit number, the more likely a positive reward will be returned. We want our agent to learn to always choose the bandit that will give that positive reward."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#List out our two bandits. Current bandit 0 is the optimal choice.\n",
"bandits = [-0.5,0.5]\n",
"def pullBandit(bandit):\n",
" #Get a random number.\n",
" result = np.random.randn(1)\n",
" if result > bandit:\n",
" #return a positive reward.\n",
" return 1\n",
" else:\n",
" #return a negative reward.\n",
" return -1"
]
},
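{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick optional check (not part of the original tutorial), we can pull each arm many times and estimate how often it pays out. This simply confirms that the arm with the lower value is the better one; the exact numbers will vary from run to run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Optional check: empirically estimate how often each arm returns +1.\n",
"for i, b in enumerate(bandits):\n",
"    rewards = np.array([pullBandit(b) for _ in xrange(10000)])\n",
"    print 'Bandit %d (value %.1f): fraction of +1 rewards = %.3f' % (i, b, np.mean(rewards == 1))"
]
},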
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The neural network agent\n",
"\n",
"Here we set up all the parameters that will be used for training our network."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# hyperparameters\n",
"learning_rate = 0.1\n",
"gamma = 0.99 # discount factor for reward"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we define our very simple neural network."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tf.reset_default_graph()\n",
"#While there aren't any states to this task, we will still use an x input as a placeholder.\n",
"input_x = tf.placeholder(tf.float32, [None,1] , name=\"input_x\")\n",
"W = tf.Variable(tf.random_normal([1,1]),name='W') # Our single variable we are training\n",
"score = tf.matmul(input_x,W)\n",
"probability = tf.nn.sigmoid(score) # This is the liklihood of choosing bandit 1 over bandit 0\n",
"\n",
"#Below we compute and set the gradients to use for adjusting the network towards a succesful policy. \n",
"input_y = tf.placeholder(tf.float32,[None,1], name=\"input_y\")\n",
"rewardSig = tf.placeholder(tf.float32,name=\"reward_sig\")\n",
"\n",
"#The computation below is the key to processing the gradients properly. \n",
"theGrad = tf.gradients(probability,W,grad_ys= ((input_y*rewardSig)/probability) - rewardSig)\n",
"\n",
"adam = tf.train.AdamOptimizer(learning_rate=learning_rate)\n",
"updateGrads = adam.apply_gradients([(theGrad[0],W)])"
]
},
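{
"cell_type": "markdown",
"metadata": {},
"source": [
"A brief note on the gradient above (an informal reading, not part of the original tutorial): with p = sigmoid(x*W) the probability of choosing bandit 1, we have dp/dW = p*(1-p)*x. The `grad_ys` weighting ((input_y*rewardSig)/probability) - rewardSig equals reward*(y - p)/p, so the quantity handed to Adam is reward*(y - p)*(1-p)*x summed over the batch. The resulting update moves the policy in the same direction as a REINFORCE (log-likelihood-ratio) step, though with a different scaling: after the optimizer's descent step, a rewarded action becomes more probable and a punished one less probable. The short, optional cell below sketches a sanity check of that sign behaviour."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Optional sanity check (not in the original tutorial): evaluate the manual gradient\n",
"#for a single fake transition. With fake label y=1 (action 0 was taken) and reward +1,\n",
"#the gradient on W should be positive, so the descent step pushes W (and hence the\n",
"#probability of choosing bandit 1) down.\n",
"with tf.Session() as sess:\n",
"    sess.run(tf.initialize_all_variables())\n",
"    g = sess.run(theGrad, feed_dict={input_x: np.ones([1,1]),\n",
"                                     input_y: np.ones([1,1]),\n",
"                                     rewardSig: 1.0})\n",
"    print 'gradient on W:', g[0]"
]
},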
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Running the agent and environment"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total reward 4.000000. Action: 1.000000. Prob Before 0.143842. Prob After 0.131960.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.131960. Prob After 0.122341.\n",
"Total reward 0.000000. Action: 0.000000. Prob Before 0.122341. Prob After 0.115324.\n",
"Total reward 6.000000. Action: 0.000000. Prob Before 0.115324. Prob After 0.107501.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.107501. Prob After 0.099853.\n",
"Total reward 0.000000. Action: 0.000000. Prob Before 0.099853. Prob After 0.093116.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.093116. Prob After 0.086765.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.086765. Prob After 0.080318.\n",
"Total reward 0.000000. Action: 0.000000. Prob Before 0.080318. Prob After 0.074615.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.074615. Prob After 0.068990.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.068990. Prob After 0.063545.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.063545. Prob After 0.058583.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.058583. Prob After 0.053903.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.053903. Prob After 0.049375.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.049375. Prob After 0.045196.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.045196. Prob After 0.041266.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.041266. Prob After 0.037747.\n",
"Total reward 8.000000. Action: 0.000000. Prob Before 0.037747. Prob After 0.034351.\n",
"Total reward 0.000000. Action: 0.000000. Prob Before 0.034351. Prob After 0.031632.\n",
"Total reward 10.000000. Action: 1.000000. Prob Before 0.031632. Prob After 0.028917.\n",
"Total reward 0.000000. Action: 0.000000. Prob Before 0.028917. Prob After 0.026569.\n",
"Total reward -2.000000. Action: 0.000000. Prob Before 0.026569. Prob After 0.024750.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.024750. Prob After 0.022972.\n",
"Total reward 6.000000. Action: 0.000000. Prob Before 0.022972. Prob After 0.021186.\n",
"Total reward 6.000000. Action: 0.000000. Prob Before 0.021186. Prob After 0.019408.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.019408. Prob After 0.017758.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.017758. Prob After 0.016229.\n",
"Total reward 6.000000. Action: 0.000000. Prob Before 0.016229. Prob After 0.014770.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.014770. Prob After 0.013436.\n",
"Total reward 6.000000. Action: 0.000000. Prob Before 0.013436. Prob After 0.012180.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.012180. Prob After 0.011061.\n",
"Total reward 0.000000. Action: 0.000000. Prob Before 0.011061. Prob After 0.010130.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.010130. Prob After 0.009267.\n",
"Total reward -4.000000. Action: 0.000000. Prob Before 0.009267. Prob After 0.008642.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.008642. Prob After 0.008035.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.008035. Prob After 0.007482.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.007482. Prob After 0.006977.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.006977. Prob After 0.006498.\n",
"Total reward 6.000000. Action: 0.000000. Prob Before 0.006498. Prob After 0.006010.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.006010. Prob After 0.005570.\n",
"Total reward 8.000000. Action: 0.000000. Prob Before 0.005570. Prob After 0.005113.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.005113. Prob After 0.004696.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.004696. Prob After 0.004325.\n",
"Total reward 2.000000. Action: 0.000000. Prob Before 0.004325. Prob After 0.003994.\n",
"Total reward -4.000000. Action: 0.000000. Prob Before 0.003994. Prob After 0.003756.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.003756. Prob After 0.003518.\n",
"Total reward 4.000000. Action: 0.000000. Prob Before 0.003518. Prob After 0.003284.\n",
"Total reward 0.000000. Action: 0.000000. Prob Before 0.003284. Prob After 0.003084.\n",
"Total reward 0.000000. Action: 0.000000. Prob Before 0.003084. Prob After 0.002913.\n",
"Total reward 0.000000. Action: 0.000000. Prob Before 0.002913. Prob After 0.002766.\n"
]
}
],
"source": [
"xs,drs,ys = [],[],[]\n",
"total_episodes = 500\n",
"running_reward = None\n",
"reward_sum = 0\n",
"episode_number = 1\n",
"\n",
"init = tf.initialize_all_variables()\n",
"\n",
"# Launch the graph\n",
"with tf.Session() as sess:\n",
" sess.run(init)\n",
"\n",
" while episode_number <= total_episodes:\n",
" #Generate a placeholder state\n",
" x = np.ones([1,1])\n",
" xs.append(x) \n",
"\n",
" # forward the policy network and sample an action from the returned probability\n",
" prob = sess.run(probability,feed_dict={input_x:np.ones([1,1])})\n",
" action = 1 if np.random.uniform() < prob else 0\n",
"\n",
" y = 1 if action == 0 else 0 # a \"fake label\"\n",
" ys.append(y)\n",
" \n",
" # Take our action in the environment, and get a reward\n",
" reward = np.float64(pullBandit(bandits[action]))\n",
" reward_sum += reward\n",
" drs.append(reward) # record reward \n",
"\n",
" \n",
" if episode_number % 10 == 0: # Periodically update the network policy\n",
" epx = np.vstack(xs)\n",
" epy = np.vstack(ys)\n",
" epr = np.vstack(drs)\n",
" xs,drs,ys = [],[],[] # reset array memory\n",
"\n",
" # Update the network gradient towards choosing more ideal actions, given what it has observed.\n",
" probBefore = sess.run(probability, feed_dict={input_x: epx, input_y: epy, rewardSig: epr})\n",
" sess.run(updateGrads,feed_dict={input_x: epx, input_y: epy, rewardSig: epr})\n",
" probAfter = sess.run(probability, feed_dict={input_x: epx, input_y: epy, rewardSig: epr})\n",
"\n",
" # Keep a record of rewards, and give some feedback\n",
" running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01\n",
" print 'Total reward %f. Action: %f. Prob Before %f. Prob After %f.' % (reward_sum, action,probBefore[0],probAfter[0])\n",
" reward_sum = 0\n",
" prev_x = None\n",
" \n",
" episode_number += 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"You should see that the agent learns to almost always choose action 0, and the probability of choosing action 1 decreases to near zero. Feel free to play with the two bandit values, and see how the agent changes what it learns."
]
},
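{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you do change the bandit values, the small helper below (not part of the original tutorial) gives a reference point: since pullBandit(b) returns +1 with probability P(N(0,1) > b), the expected reward per pull of an arm with value b is 2*(1 - Phi(b)) - 1, where Phi is the standard normal CDF."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Closed-form expected reward per pull for each arm, as a reference when\n",
"#experimenting with different bandit values.\n",
"import math\n",
"\n",
"def expectedReward(bandit):\n",
"    phi = 0.5 * (1.0 + math.erf(bandit / math.sqrt(2.0))) # standard normal CDF\n",
"    return 2.0 * (1.0 - phi) - 1.0\n",
"\n",
"for i, b in enumerate(bandits):\n",
"    print 'Bandit %d (value %.1f): expected reward per pull = %.3f' % (i, b, expectedReward(b))"
]
},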
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}