Training a Neural Network ATARI Pong agent with Policy Gradients from raw pixels
""" Trains an agent with (stochastic) Policy Gradients on Pong. Uses OpenAI Gym. """
import numpy as np
import cPickle as pickle
import gym
# hyperparameters
H = 200 # number of hidden layer neurons
batch_size = 10 # every how many episodes to do a param update?
learning_rate = 1e-4
gamma = 0.99 # discount factor for reward
decay_rate = 0.99 # decay factor for RMSProp leaky sum of grad^2
resume = False # resume from previous checkpoint?
render = False
# model initialization
D = 80 * 80 # input dimensionality: 80x80 grid
if resume:
model = pickle.load(open('save.p', 'rb'))
model = {}
model['W1'] = np.random.randn(H,D) / np.sqrt(D) # "Xavier" initialization
model['W2'] = np.random.randn(H) / np.sqrt(H)
grad_buffer = { k : np.zeros_like(v) for k,v in model.iteritems() } # update buffers that add up gradients over a batch
rmsprop_cache = { k : np.zeros_like(v) for k,v in model.iteritems() } # rmsprop memory
def sigmoid(x):
return 1.0 / (1.0 + np.exp(-x)) # sigmoid "squashing" function to interval [0,1]
def prepro(I):
""" prepro 210x160x3 uint8 frame into 6400 (80x80) 1D float vector """
I = I[35:195] # crop
I = I[::2,::2,0] # downsample by factor of 2
I[I == 144] = 0 # erase background (background type 1)
I[I == 109] = 0 # erase background (background type 2)
I[I != 0] = 1 # everything else (paddles, ball) just set to 1
return I.astype(np.float).ravel()
def discount_rewards(r):
""" take 1D float array of rewards and compute discounted reward """
discounted_r = np.zeros_like(r)
running_add = 0
for t in reversed(xrange(0, r.size)):
if r[t] != 0: running_add = 0 # reset the sum, since this was a game boundary (pong specific!)
running_add = running_add * gamma + r[t]
discounted_r[t] = running_add
return discounted_r
def policy_forward(x):
h =['W1'], x)
h[h<0] = 0 # ReLU nonlinearity
logp =['W2'], h)
p = sigmoid(logp)
return p, h # return probability of taking action 2, and hidden state
def policy_backward(eph, epdlogp):
""" backward pass. (eph is array of intermediate hidden states) """
dW2 =, epdlogp).ravel()
dh = np.outer(epdlogp, model['W2'])
dh[eph <= 0] = 0 # backpro prelu
dW1 =, epx)
return {'W1':dW1, 'W2':dW2}
env = gym.make("Pong-v0")
observation = env.reset()
prev_x = None # used in computing the difference frame
xs,hs,dlogps,drs = [],[],[],[]
running_reward = None
reward_sum = 0
episode_number = 0
while True:
if render: env.render()
# preprocess the observation, set input to network to be difference image
cur_x = prepro(observation)
x = cur_x - prev_x if prev_x is not None else np.zeros(D)
prev_x = cur_x
# forward the policy network and sample an action from the returned probability
aprob, h = policy_forward(x)
action = 2 if np.random.uniform() < aprob else 3 # roll the dice!
# record various intermediates (needed later for backprop)
xs.append(x) # observation
hs.append(h) # hidden state
y = 1 if action == 2 else 0 # a "fake label"
dlogps.append(y - aprob) # grad that encourages the action that was taken to be taken (see if confused)
# step the environment and get new measurements
observation, reward, done, info = env.step(action)
reward_sum += reward
drs.append(reward) # record reward (has to be done after we call step() to get reward for previous action)
if done: # an episode finished
episode_number += 1
# stack together all inputs, hidden states, action gradients, and rewards for this episode
epx = np.vstack(xs)
eph = np.vstack(hs)
epdlogp = np.vstack(dlogps)
epr = np.vstack(drs)
xs,hs,dlogps,drs = [],[],[],[] # reset array memory
# compute the discounted reward backwards through time
discounted_epr = discount_rewards(epr)
# standardize the rewards to be unit normal (helps control the gradient estimator variance)
discounted_epr -= np.mean(discounted_epr)
discounted_epr /= np.std(discounted_epr)
epdlogp *= discounted_epr # modulate the gradient with advantage (PG magic happens right here.)
grad = policy_backward(eph, epdlogp)
for k in model: grad_buffer[k] += grad[k] # accumulate grad over batch
# perform rmsprop parameter update every batch_size episodes
if episode_number % batch_size == 0:
for k,v in model.iteritems():
g = grad_buffer[k] # gradient
rmsprop_cache[k] = decay_rate * rmsprop_cache[k] + (1 - decay_rate) * g**2
model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)
grad_buffer[k] = np.zeros_like(v) # reset batch gradient buffer
# boring book-keeping
running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
print 'resetting env. episode reward total was %f. running mean: %f' % (reward_sum, running_reward)
if episode_number % 100 == 0: pickle.dump(model, open('save.p', 'wb'))
reward_sum = 0
observation = env.reset() # reset env
prev_x = None
if reward != 0: # Pong has either +1 or -1 reward exactly when game ends.
print ('ep %d: game finished, reward: %f' % (episode_number, reward)) + ('' if reward == -1 else ' !!!!!!!!')

liangfu commented Jun 1, 2016

Great work, thanks @karpathy for your simple gist code that just works !~

Awesome! Thanks @karpathy for your generosity again!

lakehanne commented Jun 1, 2016

Thanks! I am a student in control theory and much to my chagrin, supervised learning for approximating dynamical systems, particularly in robot control is royally disappointing. I am just combing through Sutton's book myself (in chapter 3 now). This is priceless!

Really nice work.

domluna commented Jun 3, 2016

@karpathy I'm curious why you left out the no-op action. That would explain why, in the video, the agent looks like it just drank 100 coffee cups :)

Does it make it harder to learn if no-op is left in?

@karpathy, great post! (big fan of cs231n btw). Just wondering: if you want this to work for many actions, does it make sense to replace the sigmoid by a softmax & then replace line 87 with dlogps.append(softmaxloss_gradient)?

yuzcccc commented Jun 7, 2016

Great post! I tried to run this script with the given hyper-parameters; however, after 10000+ episodes, the running mean is still about -16 ~ -18, and it is far from converging judging by the visualization. Any suggestions?

etienne87 commented Jun 9, 2016

@domluna, it works with no-ops as well if you replace the sigmoid by a softmax (this needs some minor modifications, like giving the W2 matrix more outputs; the gradient is the same but at the right output). @yuzcccc Also, setting the learning rate to 1e-3 works better (like 10x better in my trial)
EDIT : test here :

pitybea commented Jun 17, 2016

great example!
I wonder what will happen if negative training examples (lost games) are sub sampled?

Thanks @karpathy, this is a generous and well thought-out example. As with your rnn gist, it captures the essentials of a very difficult and exciting field

Atcold commented Jun 21, 2016

Why are you calling the logit['W2'], h) logp? This output is not constrained to ]-∞, 0], is it?
Otherwise, p = exp(logp), no?

dorajam commented Jul 1, 2016

It might not make a big difference, but why don't we backpropagate through the sigmoid layer? Seems like the backprop function just assumes the gradients to be the errors before the sigmoid activation. Any ideas?

Hi All,
I have a small question regarding downsampling done in prepro function.

I = I[::2,::2,0] # downsample by factor of 2

Why is only the R channel from RGB considered here? Could someone help me understand the idea behind using only one channel when downsampling?


nickc92 commented Jul 11, 2016

@SelvamArul, note that a few lines later he has I[I != 0] = 1, so he's basically turning the image into a black/white binary image; he could have picked any channel, R, G, or B.

NHDaly commented Aug 5, 2016

Thanks so much for this and for your awesome article. I'm sure you hear this often, but your articles are what got me excited about machine learning, especially deep learning! Thanks!! :)

A bug in your code:
Line 61 in the policy_backward function references epx, which is defined on line 99. I think this should be passed into the function, just like eph and epdlogp, not referenced globally like it is.

c: Thanks again!

Hi, is it possible to speak with you?

sohero commented Aug 29, 2016

Hi, I am confused about line 86, y = 1 if action == 2 else 0 # a "fake label". Why make a fake label? There is no y label in the formula, only p(x).

Could you give some hints?

taey16 commented Aug 30, 2016


Why don't you read the blog post, in the section "Policy Gradients":

Okay, but what do we do if we do not have the correct label in the Reinforcement Learning setting? Here is the Policy Gradients solution (again refer to diagram below). Our policy network calculated probability of going UP as 30% (logprob -1.2) and DOWN as 70% (logprob -0.36). We will now sample an action from this distribution; E.g. suppose we sample DOWN, and we will execute it in the game. At this point notice one interesting fact: We could immediately fill in a gradient of 1.0 for DOWN as we did in supervised learning, and find the gradient vector that would encourage the network to be slightly more likely to do the DOWN action in the future. So we can immediately evaluate this gradient and that’s great, but the problem is that at least for now we do not yet know if going DOWN is good. But the critical point is that that’s okay, because we can simply wait a bit and see! For example in Pong we could wait until the end of the game, then take the reward we get (either +1 if we won or -1 if we lost), and enter that scalar as the gradient for the action we have taken (DOWN in this case). In the example below, going DOWN ended up to us losing the game (-1 reward). So if we fill in -1 for log probability of DOWN and do backprop we will find a gradient that discourages the network to take the DOWN action for that input in the future (and rightly so, since taking that action led to us losing the game).

y = 1 if action == 2 else 0 # a "fake label" # line 86
# So we can immediately evaluate this gradient by introducing fake label
dlogps.append(y - aprob) # grad that encourages the action that was taken to be taken
epdlogp = np.vstack(dlogps) # line 101
# we could wait until the end of the game, then take the reward we get (either +1 if we won or -1 if we lost), 
# and enter that scalar as the gradient for the action we have taken
epdlogp *= discounted_epr # modulate the gradient with advantage (PG magic happens right here.) # line 111
grad = policy_backward(eph, epdlogp)

On this line:

dlogps.append(y - aprob)

Aren't y and aprob probabilities? So isn't it incorrect to subtract them and call the result a difference in log probabilities?

hqm commented Sep 25, 2016

When you do the preprocessing where you compute the difference between the frame and the previous frame, doesn't that cause
the location of the paddles to be removed, if they have not moved from one frame to the next? It seems counter-intuitive that the learning wouldn't be affected by the location of the paddles...

Excellent script.
I am wondering about one thing regarding learning speed; hopefully you can give me some suggestions.
In your script, a random action is sampled given an environment state. Normally it takes a long time to explore. What if the actions were guided by human intelligence instead of the agent exploring by itself? After human teaching for a while, let the machine learn by itself. In that case, the learning speed could be improved quite a lot. Do you have any comments about this approach? If it is possible, how would one proceed?

DanielTakeshi commented Oct 21, 2016

I don't know if this was written with a different API, but it's worth noting that the print statement in the penultimate line of the script isn't quite accurate. The +1 or -1 happens each time either player scores, +1 if we score, -1 if the computer scores. That is distinct from an episode which terminates after someone gets 21 points.

@NHDaly it actually still works: epx can still be a global variable, and its value is visible inside the function through Python's global scoping.

finardi commented Nov 2, 2016

Great job!
I have a question.
In CS231 the RMSprop algorithm is defined by:
x += - learning_rate * dx / (np.sqrt(cache) + eps).

But here, in line 120 we have:
model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)

Why is there no negative sign on the right side of this equation?

Note: I tried training with the minus sign, and after 1,000 episodes the running_reward is still -21.

@dorajam and @greydanus
Actually the code does backprop through the sigmoid layer. Please see and search for the string "as you can double check yourself by taking the derivatives". The formula there matches exactly with the line dlogps.append(y - aprob) .

Thus the variable dlogps is the gradient w.r.t. the logit.
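The identity discussed here (that y - aprob is the gradient of the log-probability of the taken action with respect to the logit, i.e. d/dz log σ(z) = 1 - σ(z) and d/dz log(1 - σ(z)) = -σ(z)) can be checked numerically with a central finite difference; this is just a verification sketch, not part of the gist:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7          # an arbitrary logit
eps = 1e-6
for y in (1, 0): # y = 1: UP was taken; y = 0: DOWN was taken
    # log-probability of the taken action under the sigmoid policy
    logp = lambda z: np.log(sigmoid(z)) if y == 1 else np.log(1 - sigmoid(z))
    numeric = (logp(z + eps) - logp(z - eps)) / (2 * eps)  # central difference
    analytic = y - sigmoid(z)                              # the (y - aprob) term from the gist
    assert abs(numeric - analytic) < 1e-6
```

So the backprop through the sigmoid is not skipped; it is folded analytically into the (y - aprob) expression.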

neighborzhang commented Dec 10, 2016

@karpathy, your blog is awesome, and thanks for sharing the code. I am a newbie in deep learning. I am wondering, does your code utilize the GPU? Thanks

What I find quite fascinating is that, since the neural network makes no assumptions about the pixel spatial structure, the same algorithm would work equally well even if we randomly permuted the pixels on the screen. I bet that the algorithm would converge faster (to a stronger policy) if we used CNNs.

@hqm: There are 2 paddles in the game, ours and the computer's.

When we compute the difference between the current frame and the previous frame, we DO lose the computer's paddle IF it is stationary. Our paddle, however, is never stationary (at every step, our action is either UP or DOWN), and we can keep track of our paddle's location (which is the one we care about).

irwenqiang commented Feb 9, 2017

After training the neural network for days, the reward mean is about 2.
The pre-trained model file was published at

JiaqiLiu commented Feb 21, 2017


I think it is wrong to put the following lines at the end.

if reward != 0: # Pong has either +1 or -1 reward exactly when game ends.
    print ('ep %d: game finished, reward: %f' % (episode_number, reward)) + ('' if reward == -1 else ' !!!!!!!!')

Instead, they should be put before 'if done:'.

Am I right, please?

JiaqiLiu commented Feb 22, 2017

Hi all,

I found that it helpful to speed up if you turn the learning_rate to 1e-3. May that help.

Hi, your codes is awesome! I am a fresh in deep learning. And I learn a lot from your codes. Thanks a lot!

Hi, I have a question.
Can I see the match of the AI vs. the agent?
How do I watch it?


schinger commented Mar 2, 2017

I wrote an actor-critic version:

shailesh commented Mar 6, 2017

I'm running into a problem with cPickle. Please help me solve the issue.

rogaha commented Mar 7, 2017

Thanks for sharing @karpathy! Nice work!

same question as @finardi.

4SkyNet commented Mar 20, 2017

Andrej @karpathy >> thx for your great work for the community!
But I think there is some implicit trouble in your example --> when you are computing discounted rewards.
It is fine if gym returns rewards as floats, or if we clip them as usual.
Buuut, if rewards are represented as integers, we can get some wrong behavior.

The easiest way to fix it:

def discount_rewards(r):
  """ take 1D float array of rewards and compute discounted reward """
  discounted_r = np.zeros_like(r, dtype=np.float32)  # or np.float64
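The truncation 4SkyNet describes is easy to demonstrate: np.zeros_like inherits the integer dtype of the reward array, so the discounted values get truncated on assignment. Below is the gist's discounting loop in Python 3 syntax (range instead of xrange), deliberately left with the dtype-inheriting zeros_like to show the bug:

```python
import numpy as np

gamma = 0.99

def discount_rewards(r):
    """ discounting exactly as in the gist: zeros_like inherits r's dtype """
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(range(r.size)):
        if r[t] != 0: running_add = 0  # game boundary (pong specific!)
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add  # truncated if discounted_r is an int array
    return discounted_r

r_int = np.array([0, 0, 1])            # integer rewards
r_float = r_int.astype(np.float64)
print(discount_rewards(r_int))         # -> [0 0 1]  (0.99 and 0.9801 truncated to 0)
print(discount_rewards(r_float))       # -> [0.9801 0.99 1.]
```

With integer rewards, every discounted value except the boundary reward itself collapses to 0, which silently destroys the learning signal; forcing dtype=np.float32 or np.float64 as suggested above avoids this.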

thanks for sharing

jpeg729 commented Apr 22, 2017

I would love to see what would happen if you used a recurrent network and fed it ordinary frames rather than difference frames.
Seems like an obvious approach to me.

jjuel commented Apr 26, 2017

I am getting this error ValueError: operands could not be broadcast together with shapes (200,6400) (200,200) (200,6400)
It says it is happening on line 113.

Anyone seen this error or know why it could be happening?

Greetings fellow ML travellers,

I must commend @karpathy on his blog post and associated code. It’s certainly a feat to condense so many complex ideas so elegantly into such short texts.

I’m currently attempting to replicate the code in order to further my own understanding. However, I’ve run into a few issues and would appreciate any clarifications.

The most prominent issue appears to be (at least to me) confusion over the notion of “episode”. Pong has at least two natural notions of episode: “rounds” (the period within which exactly one player scores a point), and “games” (the period within which a player first scores 21 points). Obviously, a game consists of a variable number of rounds.

Let us say that an episode corresponds to a game. Now as I understand it, the Monte-Carlo policy gradient approach as applied here involves:

  1. sampling the policy network for an action at each decision point (every 2 to 4 frames)
  2. receiving an immediate reward in {-1,0,1} at each decision point upon acting.
  3. at the end of the game, computing the discounted return for each decision point
  4. using the vector of discounted returns (or a function thereof) to scale the policy gradients in gradient ascent at appropriate update-points
  5. looping the previous steps

In the given code, the end of an episode corresponds to the end of a game, yet returns are computed using rounds. Is this an error or am I misunderstanding what is going on?

A related issue concerns the use of “advantages” to modify the gradients in gradient ascent. I cannot find a justification in the given citation for the specific form used here, nor for full normalization. Strictly speaking, these are not advantages but simply discounted returns. I believe a more accessible justification can be found in Section 2.7 and Chapter 13 of Sutton’s book (neglecting the discount factor). So, subtracting the average of the returns as a baseline from every return of a given episode is somewhat justified as a variance reduction technique; however, I’m doubtful that dividing by the standard deviation (as one might do in preprocessing) is justified or necessary, particularly if returns are bounded in [-21,21] (if episodes are games) or [-1,1] (if episodes are rounds).

In short,

  1. If episodes are taken to be games, line 44 if r[t] != 0: running_add = 0 # reset the sum, since this was a game boundary (pong specific!) should be omitted, and the penultimate line should indeed be if done: (as stated by @JiaqiLiu and @DanielTakeshi).
  2. Is it really necessary in this case to fully normalize the “returns” before modifying the gradient?

Again, I would appreciate any clarification of the above. Thanks in advance!

@finardi and @kris-singh:

RMSProp is presented in CS231 in the context of gradient descent, wherein the goal is to move the parameters downward (in the negative direction of the gradient) in order to minimize a loss function.

Here, in the Monte-Carlo Policy Gradient method, we are using gradient ascent; we are trying to move the parameters upward (in the positive direction of the gradient) in order to maximize an objective function.

This is why a plus sign is used here, whereas a minus sign is used in the class notes.

Hey @karpathy,

Thanks for the code. About 10 years ago (when I was 13) I was really into rubik's cubes and I used your website to learn many algorithms.

Today (I'm 23) I'm very interested in deep learning, and I find myself again learning from your (writeup of) algorithms.

Thank you man!


hyonaldo commented Jun 21, 2017

Hi, @normferns:
I appreciate your kind reply, but I need to modify some of your answers about RMSProp & positive/negative sign.
That is not in order to maximize the loss (obj func), but because of the line

dlogps.append(y - aprob)

If the line was "dlogps.append( -1 * (y - aprob) )" or "dlogps.append( aprob - y )"
, the line 120 should be

model[k] -= learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)

Because "x += - learning_rate * dx / (np.sqrt(cache) + eps)." in CS231.

"dx" means gradient of loss function.

dx =  -1 * ( t - y ) * g' * x # see
dx ~ -1 * ( t - y ) # negative sign

but, in karpathy's code,

g = (y - aprob) * g' * x
g ~ (y - aprob) # positive sign

As you said, in the Monte-Carlo Policy Gradient method, we are using gradient ascent to maximize ∑ A * logp(y∣x).
That's right!
However, this code uses ∑ 1/2 (y - aprob)^2 (i.e. SSE for RMSProp) instead, just like supervised learning!

@finardi and @kris-singh
What do you think about this?

Hi, this is great work. I am able to run the whole thing on AWS EC2.
Can you please tell me how to get the output of the game? As it's a remote computer.

Thanks for sharing this Andrej.

xmfbit commented Aug 7, 2017

@hyonaldo It is not an SSE (maybe you mean MSE) loss. Actually, the author used y-aprob because it is the gradient of logSoftmax(x). See the "Computing the Analytic Gradient with Backpropagation" section in for detail.

You're right, I meant MSE. But whether it is SSE or MSE or RMSE does not matter.
The point is that the author used "y - aprob" not to maximize the gain (i.e. ∑ A * logp(y∣x)), but to minimize the loss (e.g. SSE or MSE or RMSE or cross entropy, ...... or whatever).

ibmua commented Aug 17, 2017

Can anyone, please, say how many "episodes" it took them to get to 50%, or some other win rate?

rihardsbar commented Aug 30, 2017

When you discount the rewards with running_add = running_add * gamma + r[t] and they all happen to be positive/negative, the mean will be bigger than the smallest reward (consider the corner case where it is the first reward in the list). When you standardise with discounted_epr -= np.mean(discounted_epr), you invert the signs of the smallest negative/positive rewards. Is that an issue for the derivatives computed later on, i.e. will it produce a derivative that contradicts the value of the reward? Would it be better to simply normalize them, so that signs don't get inverted?

@ibmua the algorithm trains very slowly with the default learning_rate of 1e-4. It takes me around 20,000 (3 nights of training) to get a score of around -10. If you change the learning rate to 1e-3 you should get an average score of 0 (i.e. competitive with AI) in around 6,000 episodes. 3e-3 works too but seems less stable once we get above 0.

Hope that helps :)


Yugnaynehc commented Nov 3, 2017

Hello everyone! I want to give an explanation of "dlogps.append(y - aprob)", which I had been very curious about. Firstly, here we want to produce a probability of taking a certain action. In this case, Karpathy uses only one output to represent the probability of taking the UP action, and denotes it 'aprob' (I think it would be better to call it up_prob), hence the probability of taking DOWN is '1-aprob'. Then, for the original REINFORCE formula, we can think of it as using log p(action) as the objective function (just temporarily leaving rewards aside). Now the problem becomes "how to maximize log p(action)", and we can calculate the gradient w.r.t. the logit:

When the taken action was UP, the probability is aprob, so the gradient w.r.t. the logit is 1 - aprob; and when the taken action was DOWN, the probability is 1-aprob, so the gradient w.r.t. the logit is -aprob = 0 - aprob. Now it is clear why the code introduces the fake label y.

If I was wrong, please correct me. Many thanks!

Hello people, whenever I try to run this code for Pong on my Windows machine this error pops up:
" \", line 24, in
grad_buffer = { k : np.zeros_like(v) for k,v in model.iteritems() } # update buffers that add up gradients over a batch
AttributeError: 'dict' object has no attribute 'iteritems' "
Can someone help me?

pankajb64 commented Dec 10, 2017

@AbhishekAshokDubey thanks for the explanation, that helped me understand the policy_backward method.
@karpathy great post! Thank you so much!

@kartikeyasaxena1012 you're probably using Python 3, in which case you should use model.items() instead.
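For anyone hitting the same error, the full set of Python 2 idioms in the gist and their Python 3 replacements can be sketched as follows (a runnable checklist with stand-in values, not a full port of the script):

```python
# Python 3 substitutions for the Python 2 idioms used in the gist:
import pickle                    # instead of: import cPickle as pickle

model = {'W1': 1.0, 'W2': 2.0}   # stand-in for the real weight dict

# dict.iteritems() -> dict.items()
keys = [k for k, v in model.items()]

# xrange -> range
total = sum(range(5))            # 0+1+2+3+4

# print statement -> print() function
print('episode reward total was %f' % 3.0)
```

The rest of the numerical code (numpy, gym calls) is unchanged between the two Python versions.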

A silly question, looking forward to reply.

I am quite astonished to see that when I run the code, it occupies all the CPUs on my server. Since I thought numpy only uses 1 CPU, which part of the code enables the parallel computing?

I see many comments about "dlogps.append(y - aprob)" and I'm also confused about that line.
Let's suppose aprob is 0.9, and the reward is 1 for UP and -1 for DOWN. If y is 1, then the gradient epdlogp is (1-0.9) * 1 = 0.1; but if y happens to be 0, then the gradient epdlogp is (0-0.9) * -1 = 0.9. The direction is still correct, but not on the same scale.
y comes from np.random.uniform() < aprob; I think no matter what random number we get, it should not affect the backpropagation of W1 and W2. Using aprob directly might be better.
I don't have much experience in RL, welcome any comments if I misunderstand anything :)

@taey16 Thank you for your explanation :)

@YongHuangSJTU I have the confusion same as yours

SoajanII commented Feb 6, 2018


In some papers they use a cost (where negative values are wanted and the cost is minimized); in this one Karpathy used rewards (where positive values are wanted and rewards are maximized). The sign discrepancy probably comes from that.

HadXu commented Feb 13, 2018

could anyone share pytorch verison? thanks

@YongHuangSJTU, @Kelvin-Zhong
It works like this..
It works like this: 'aprob' gives the probability of taking the UP action, so aprob=1 says that the UP action must be taken (which means the DOWN action must not be taken). Likewise, aprob=0 says that the UP action must not be taken (which means the DOWN action must be taken). The line "action = 2 if np.random.uniform() < aprob else 3" randomly samples either an UP action or a DOWN action.

Suppose we get aprob=0.4 and the random draw samples the DOWN action; then the gradient is (0-0.4=-0.4), since this negative 0.4 pushes toward the lower boundary 0 (which is the highest probability of taking the DOWN action).

Suppose we get aprob=0.4 and the random draw samples the UP action; then the gradient is (1-0.4=0.6), since this positive 0.6 pushes toward the upper boundary 1 (which is the highest probability of taking the UP action).
See the image below for understanding.
