Training a Neural Network ATARI Pong agent with Policy Gradients from raw pixels
""" Trains an agent with (stochastic) Policy Gradients on Pong. Uses OpenAI Gym. """
import numpy as np
import cPickle as pickle
import gym
# hyperparameters
H = 200 # number of hidden layer neurons
batch_size = 10 # every how many episodes to do a param update?
learning_rate = 1e-4
gamma = 0.99 # discount factor for reward
decay_rate = 0.99 # decay factor for RMSProp leaky sum of grad^2
resume = False # resume from previous checkpoint?
render = False
# model initialization
D = 80 * 80 # input dimensionality: 80x80 grid
if resume:
  model = pickle.load(open('save.p', 'rb'))
else:
  model = {}
  model['W1'] = np.random.randn(H,D) / np.sqrt(D) # "Xavier" initialization
  model['W2'] = np.random.randn(H) / np.sqrt(H)

grad_buffer = { k : np.zeros_like(v) for k,v in model.iteritems() } # update buffers that add up gradients over a batch
rmsprop_cache = { k : np.zeros_like(v) for k,v in model.iteritems() } # rmsprop memory

def sigmoid(x):
  return 1.0 / (1.0 + np.exp(-x)) # sigmoid "squashing" function to interval [0,1]

def prepro(I):
  """ prepro 210x160x3 uint8 frame into 6400 (80x80) 1D float vector """
  I = I[35:195] # crop
  I = I[::2,::2,0] # downsample by factor of 2
  I[I == 144] = 0 # erase background (background type 1)
  I[I == 109] = 0 # erase background (background type 2)
  I[I != 0] = 1 # everything else (paddles, ball) just set to 1
  return I.astype(np.float).ravel()

def discount_rewards(r):
  """ take 1D float array of rewards and compute discounted reward """
  discounted_r = np.zeros_like(r)
  running_add = 0
  for t in reversed(xrange(0, r.size)):
    if r[t] != 0: running_add = 0 # reset the sum, since this was a game boundary (pong specific!)
    running_add = running_add * gamma + r[t]
    discounted_r[t] = running_add
  return discounted_r

def policy_forward(x):
  h = np.dot(model['W1'], x)
  h[h<0] = 0 # ReLU nonlinearity
  logp = np.dot(model['W2'], h)
  p = sigmoid(logp)
  return p, h # return probability of taking action 2, and hidden state
def policy_backward(eph, epdlogp):
  """ backward pass. (eph is array of intermediate hidden states) """
  dW2 = np.dot(eph.T, epdlogp).ravel()
  dh = np.outer(epdlogp, model['W2'])
  dh[eph <= 0] = 0 # backprop relu
  dW1 = np.dot(dh.T, epx)
  return {'W1':dW1, 'W2':dW2}
env = gym.make("Pong-v0")
observation = env.reset()
prev_x = None # used in computing the difference frame
xs,hs,dlogps,drs = [],[],[],[]
running_reward = None
reward_sum = 0
episode_number = 0
while True:
  if render: env.render()

  # preprocess the observation, set input to network to be difference image
  cur_x = prepro(observation)
  x = cur_x - prev_x if prev_x is not None else np.zeros(D)
  prev_x = cur_x

  # forward the policy network and sample an action from the returned probability
  aprob, h = policy_forward(x)
  action = 2 if np.random.uniform() < aprob else 3 # roll the dice!

  # record various intermediates (needed later for backprop)
  xs.append(x) # observation
  hs.append(h) # hidden state
  y = 1 if action == 2 else 0 # a "fake label"
  dlogps.append(y - aprob) # grad that encourages the action that was taken to be taken (see http://cs231n.github.io/neural-networks-2/#losses if confused)

  # step the environment and get new measurements
  observation, reward, done, info = env.step(action)
  reward_sum += reward
  drs.append(reward) # record reward (has to be done after we call step() to get reward for previous action)

  if done: # an episode finished
    episode_number += 1

    # stack together all inputs, hidden states, action gradients, and rewards for this episode
    epx = np.vstack(xs)
    eph = np.vstack(hs)
    epdlogp = np.vstack(dlogps)
    epr = np.vstack(drs)
    xs,hs,dlogps,drs = [],[],[],[] # reset array memory

    # compute the discounted reward backwards through time
    discounted_epr = discount_rewards(epr)
    # standardize the rewards to be unit normal (helps control the gradient estimator variance)
    discounted_epr -= np.mean(discounted_epr)
    discounted_epr /= np.std(discounted_epr)

    epdlogp *= discounted_epr # modulate the gradient with advantage (PG magic happens right here.)
    grad = policy_backward(eph, epdlogp)
    for k in model: grad_buffer[k] += grad[k] # accumulate grad over batch

    # perform rmsprop parameter update every batch_size episodes
    if episode_number % batch_size == 0:
      for k,v in model.iteritems():
        g = grad_buffer[k] # gradient
        rmsprop_cache[k] = decay_rate * rmsprop_cache[k] + (1 - decay_rate) * g**2
        model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)
        grad_buffer[k] = np.zeros_like(v) # reset batch gradient buffer

    # boring book-keeping
    running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
    print 'resetting env. episode reward total was %f. running mean: %f' % (reward_sum, running_reward)
    if episode_number % 100 == 0: pickle.dump(model, open('save.p', 'wb'))
    reward_sum = 0
    observation = env.reset() # reset env
    prev_x = None

  if reward != 0: # Pong has either +1 or -1 reward exactly when game ends.
    print ('ep %d: game finished, reward: %f' % (episode_number, reward)) + ('' if reward == -1 else ' !!!!!!!!')
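
Note: the gist above is Python 2 and targets an early version of OpenAI Gym. A minimal sketch of the substitutions needed to run it under Python 3 (the Gym-related lines depend on which Gym version you have installed, so treat them as assumptions):

  import pickle                      # instead of: import cPickle as pickle
  # model.items()      instead of  model.iteritems()
  # range(...)         instead of  xrange(...)
  # print(...)         instead of  the Python 2 print statement
  # float / np.float32 instead of  np.float in prepro() (np.float is removed in recent NumPy)
  # in newer Gym/Gymnasium versions, env.reset() returns (observation, info),
  # env.step() returns (observation, reward, terminated, truncated, info),
  # and the Atari environment id may be "ALE/Pong-v5" rather than "Pong-v0"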
@liangfu
liangfu commented Jun 1, 2016

Great work, thanks @karpathy for your simple gist code that just works !~

@micoolcho

Awesome! Thanks @karpathy for your generosity again!

@lakehanne
lakehanne commented Jun 1, 2016 edited

Thanks! I am a student in control theory and, much to my chagrin, supervised learning for approximating dynamical systems, particularly in robot control, is royally disappointing. I am just combing through Sutton's book myself (in chapter 3 now). This is priceless!

@jwjohnson314

Really nice work.

@domluna
domluna commented Jun 3, 2016

@karpathy I'm curious why you left out the no-op action. That explains why, in the video, the agent looks like it just drank 100 coffee cups :)

Does it make it harder to learn if no-op is left in?

@etienne87

@karpathy, great post! (big fan of cs231n btw). Just wondering: if you want this to work for many actions, does it make sense to replace the sigmoid with a softmax and then replace line 87 with dlogps.append(softmaxloss_gradient)?

@yuzcccc
yuzcccc commented Jun 7, 2016

Great post! I tried running this script with the given hyperparameters; however, after 10000+ episodes the running mean is still around -16 to -18, and judging from the visualization it is far from converged. Any suggestions?

@etienne87
etienne87 commented Jun 9, 2016 edited

@domluna, it works with no-ops as well if you replace the sigmoid with a softmax (this needs some minor modifications, like giving the W2 matrix more outputs; the gradient is the same but applied at the right output). @yuzcccc Also, setting the learning rate to 1e-3 works better (like 10x better in my trial).
EDIT : test here : https://gist.github.com/etienne87/6803a65653975114e6c6f08bb25e1522
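
For readers wondering what that change looks like, here is a minimal sketch of the softmax variant (hypothetical shapes: A actions, so model['W2'] becomes an A x H matrix instead of a length-H vector):

  import numpy as np

  A = 3  # e.g. NOOP, UP, DOWN

  def softmax(z):
    z = z - np.max(z)                 # subtract max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

  def policy_forward_softmax(model, x):
    h = np.dot(model['W1'], x)
    h[h < 0] = 0                      # ReLU, same as in the gist
    logits = np.dot(model['W2'], h)   # model['W2'] assumed to have shape (A, H)
    p = softmax(logits)
    return p, h                       # p is a distribution over the A actions

  # per step: sample an action from p, then store (one_hot(action) - p) in place of (y - aprob);
  # the advantage modulation and the rest of the backward pass stay the same in spirit.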

@pitybea
pitybea commented Jun 17, 2016

great example!
I wonder what would happen if negative training examples (lost games) were subsampled?

@greydanus

Thanks @karpathy, this is a generous and well thought-out example. As with your rnn gist, it captures the essentials of a very difficult and exciting field.

@Atcold
Atcold commented Jun 21, 2016 edited

Why are you calling the logit np.dot(model['W2'], h) logp? This output is not constrained to ]-∞, 0], is it?
Otherwise we would have p = exp(logp), no?

@dorajam
dorajam commented Jul 1, 2016

It might not make a big difference, but why don't we backpropagate through the sigmoid layer? It seems like the backprop function just assumes the gradients are the errors before the sigmoid activation. Any ideas?

@SelvamArul

Hi All,
I have a small question regarding downsampling done in prepro function.

I = I[::2,::2,0] # downsample by factor of 2

Why is only the R channel of the RGB image considered here? Could someone help me understand the idea behind using only one channel when downsampling?

Thanks

@nickc92
nickc92 commented Jul 11, 2016

@SelvamArul, note that a few lines later he has I[I != 0] = 1, so he's basically turning the image into a black/white binary image; he could have picked any channel, R, G, or B.

@greydanus

I'm using your gist to get started with Policy Gradients and I have a question about the policy_backward function: why aren't you backprop-ing through the sigmoid function? I assume you leave the \sigma*(1-\sigma) out as an approximation, but I want to make sure! Thanks again.

@NHDaly
NHDaly commented Aug 5, 2016

Thanks so much for this and for your awesome article. I'm sure you hear this often, but your articles are what got me excited about machine learning, especially deep learning! Thanks!! :)

A bug in your code:
Line 61 in the policy_backward function references epx, which is defined on line 99. I think this should be passed into the function, just like eph and dlogps, not referenced globally like it is.

c: Thanks again!

@12digimon

Hi, is it possible to speak with you?

@sohero
sohero commented Aug 29, 2016

Hi, I am confused about line 86: y = 1 if action == 2 else 0 # a "fake label". Why make a fake label? There is no y label in the formula, only p(x).

Could you give some hints?

@taey16
taey16 commented Aug 30, 2016

@sohero

Why don't you read the "Policy Gradients" section of the blog post:

Okay, but what do we do if we do not have the correct label in the Reinforcement Learning setting? Here is the Policy Gradients solution (again refer to diagram below). Our policy network calculated probability of going UP as 30% (logprob -1.2) and DOWN as 70% (logprob -0.36). We will now sample an action from this distribution; E.g. suppose we sample DOWN, and we will execute it in the game. At this point notice one interesting fact: We could immediately fill in a gradient of 1.0 for DOWN as we did in supervised learning, and find the gradient vector that would encourage the network to be slightly more likely to do the DOWN action in the future. So we can immediately evaluate this gradient and that’s great, but the problem is that at least for now we do not yet know if going DOWN is good. But the critical point is that that’s okay, because we can simply wait a bit and see! For example in Pong we could wait until the end of the game, then take the reward we get (either +1 if we won or -1 if we lost), and enter that scalar as the gradient for the action we have taken (DOWN in this case). In the example below, going DOWN ended up to us losing the game (-1 reward). So if we fill in -1 for log probability of DOWN and do backprop we will find a gradient that discourages the network to take the DOWN action for that input in the future (and rightly so, since taking that action led to us losing the game).

y = 1 if action == 2 else 0 # a "fake label" # line 86
# So we can immediately evaluate this gradient by introducing fake label
dlogps.append(y - aprob) # grad that encourages the action that was taken to be taken
...
...
epdlogp = np.vstack(dlogps) # line 101
...
# we could wait until the end of the game, then take the reward we get (either +1 if we won or -1 if we lost), 
# and enter that scalar as the gradient for the action we have taken
epdlogp *= discounted_epr # modulate the gradient with advantage (PG magic happens right here.) # line 111
grad = policy_backward(eph, epdlogp)
...
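A tiny numeric illustration of the mechanism described above (the numbers are made up for illustration):

  aprob = 0.3               # policy says P(UP) = 0.3 for this frame
  y = 1                     # suppose we sampled UP (action 2)
  dlogp = y - aprob         # +0.7: by itself this would make UP more likely here
  advantage = -1.0          # ...but the episode was eventually lost
  print(dlogp * advantage)  # -0.7: the modulated gradient now discourages UP for this frame
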
@MichaelBurge

On this line:

dlogps.append(y - aprob)

Aren't y and aprob probabilities? So isn't it incorrect to subtract them to get a difference in log probabilities?

@hqm
hqm commented Sep 25, 2016

When you do the preprocessing where you compute the difference between the frame and the previous frame, doesn't that cause
the location of the paddles to be removed, if they have not moved from one frame to the next? It seems counter-intuitive that the learning wouldn't be affected by the location of the paddles...

@farscape2012

Excellent script.
I am wondering about one thing regarding learning speed; hopefully you can give me some suggestions.
In your script, a random action is sampled given an environment state, and normally it takes a long time to explore. What if the actions were guided by a human instead of the agent exploring on its own? After human teaching for a while, let the machine learn by itself. In that case the learning speed could be improved quite a lot. Do you have any comments on this approach, and if it is feasible, how would one proceed?

@DanielTakeshi
DanielTakeshi commented Oct 21, 2016 edited

I don't know if this was written with a different API, but it's worth noting that the print statement in the penultimate line of the script isn't quite accurate. The +1 or -1 happens each time either player scores: +1 if we score, -1 if the computer scores. That is distinct from an episode, which terminates after someone gets 21 points.

@NHDaly it actually still works; epx can still be a global variable, and its value is picked up implicitly inside the function.

@finardi
finardi commented Nov 2, 2016 edited

Great job!
I have a question.
In CS231n the RMSProp update is defined as:
x += - learning_rate * dx / (np.sqrt(cache) + eps).

But here, in line 120 we have:
model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)

Why is there no negative sign on the right-hand side of this equation?

Note: I tried training with the minus sign and after 1,000 episodes the running_reward was still -21.

@zerolocker

@dorajam and @greydanus
Actually the code does backprop through the sigmoid layer. Please see http://cs231n.github.io/neural-networks-2/#losses and search for the string "as you can double check yourself by taking the derivatives". The formula there matches the line dlogps.append(y - aprob) exactly.

Thus the variable dlogps is the gradient w.r.t. the logit.
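
A quick finite-difference check of that claim (using the log-likelihood y*log(p) + (1-y)*log(1-p), whose gradient with respect to the logit is y - sigmoid(logit)):

  import numpy as np

  def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

  def loglik(logit, y):
    p = sigmoid(logit)
    return y * np.log(p) + (1 - y) * np.log(1 - p)

  logit, y, eps = 0.37, 1, 1e-6
  numeric = (loglik(logit + eps, y) - loglik(logit - eps, y)) / (2 * eps)
  analytic = y - sigmoid(logit)      # the (y - aprob) term from the gist
  print(numeric, analytic)           # the two agree, so the sigmoid derivative is indeed folded in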

@neighborzhang
neighborzhang commented Dec 10, 2016 edited

@karpathy, your blog is awesome, and thanks for sharing the code. I am a newbie in deep learning. I am wondering, does your code utilize the GPU? Thanks

@petercerno

What I find quite fascinating is that, since the neural network makes no assumptions about the pixel spatial structure, the same algorithm would work equally well even if we randomly permuted the pixels on the screen. I bet that the algorithm would converge faster (to a stronger policy) if we used CNNs.

@gokayhuz

@hqm: There are 2 paddles in the game, ours and the computer's.

When we compute the difference between the current frame and the previous frame, we DO lose the computer's paddle IF it is stationary. Our paddle, however, is never stationary (at every step, our action is either UP or DOWN), so we can keep track of our paddle's location (which is the one we care about).
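
A small sketch of why a stationary paddle cancels in the difference frame while a moving one does not (toy 4x4 frames, not real Pong pixels):

  import numpy as np

  prev = np.zeros((4, 4)); prev[:, 0] = 1      # opponent's paddle, not moving
  cur = prev.copy()
  print(cur - prev)                            # all zeros: the stationary paddle disappears

  prev2 = np.zeros((4, 4)); prev2[0:3, 3] = 1  # our paddle...
  cur2 = np.zeros((4, 4)); cur2[1:4, 3] = 1    # ...shifted down by one pixel
  print(cur2 - prev2)                          # +1/-1 edges mark where it moved to and from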
