Training a Neural Network ATARI Pong agent with Policy Gradients from raw pixels
""" Trains an agent with (stochastic) Policy Gradients on Pong. Uses OpenAI Gym. """
import numpy as np
import cPickle as pickle
import gym
# hyperparameters
H = 200 # number of hidden layer neurons
batch_size = 10 # every how many episodes to do a param update?
learning_rate = 1e-4
gamma = 0.99 # discount factor for reward
decay_rate = 0.99 # decay factor for RMSProp leaky sum of grad^2
resume = False # resume from previous checkpoint?
render = False
# model initialization
D = 80 * 80 # input dimensionality: 80x80 grid
if resume:
  model = pickle.load(open('save.p', 'rb'))
else:
  model = {}
  model['W1'] = np.random.randn(H,D) / np.sqrt(D) # "Xavier" initialization
  model['W2'] = np.random.randn(H) / np.sqrt(H)

grad_buffer = { k : np.zeros_like(v) for k,v in model.iteritems() } # update buffers that add up gradients over a batch
rmsprop_cache = { k : np.zeros_like(v) for k,v in model.iteritems() } # rmsprop memory
def sigmoid(x):
  return 1.0 / (1.0 + np.exp(-x)) # sigmoid "squashing" function to interval [0,1]
def prepro(I):
  """ prepro 210x160x3 uint8 frame into 6400 (80x80) 1D float vector """
  I = I[35:195] # crop
  I = I[::2,::2,0] # downsample by factor of 2
  I[I == 144] = 0 # erase background (background type 1)
  I[I == 109] = 0 # erase background (background type 2)
  I[I != 0] = 1 # everything else (paddles, ball) just set to 1
  return I.astype(np.float).ravel()
def discount_rewards(r):
  """ take 1D float array of rewards and compute discounted reward """
  discounted_r = np.zeros_like(r)
  running_add = 0
  for t in reversed(xrange(0, r.size)):
    if r[t] != 0: running_add = 0 # reset the sum, since this was a game boundary (pong specific!)
    running_add = running_add * gamma + r[t]
    discounted_r[t] = running_add
  return discounted_r
def policy_forward(x):
  h =['W1'], x)
  h[h<0] = 0 # ReLU nonlinearity
  logp =['W2'], h)
  p = sigmoid(logp)
  return p, h # return probability of taking action 2, and hidden state
def policy_backward(eph, epdlogp):
  """ backward pass. (eph is array of intermediate hidden states) """
  dW2 =, epdlogp).ravel()
  dh = np.outer(epdlogp, model['W2'])
  dh[eph <= 0] = 0 # backpro prelu
  dW1 =, epx)
  return {'W1':dW1, 'W2':dW2}
env = gym.make("Pong-v0")
observation = env.reset()
prev_x = None # used in computing the difference frame
xs,hs,dlogps,drs = [],[],[],[]
running_reward = None
reward_sum = 0
episode_number = 0
while True:
  if render: env.render()

  # preprocess the observation, set input to network to be difference image
  cur_x = prepro(observation)
  x = cur_x - prev_x if prev_x is not None else np.zeros(D)
  prev_x = cur_x

  # forward the policy network and sample an action from the returned probability
  aprob, h = policy_forward(x)
  action = 2 if np.random.uniform() < aprob else 3 # roll the dice!

  # record various intermediates (needed later for backprop)
  xs.append(x) # observation
  hs.append(h) # hidden state
  y = 1 if action == 2 else 0 # a "fake label"
  dlogps.append(y - aprob) # grad that encourages the action that was taken to be taken (see if confused)

  # step the environment and get new measurements
  observation, reward, done, info = env.step(action)
  reward_sum += reward

  drs.append(reward) # record reward (has to be done after we call step() to get reward for previous action)

  if done: # an episode finished
    episode_number += 1

    # stack together all inputs, hidden states, action gradients, and rewards for this episode
    epx = np.vstack(xs)
    eph = np.vstack(hs)
    epdlogp = np.vstack(dlogps)
    epr = np.vstack(drs)
    xs,hs,dlogps,drs = [],[],[],[] # reset array memory

    # compute the discounted reward backwards through time
    discounted_epr = discount_rewards(epr)
    # standardize the rewards to be unit normal (helps control the gradient estimator variance)
    discounted_epr -= np.mean(discounted_epr)
    discounted_epr /= np.std(discounted_epr)

    epdlogp *= discounted_epr # modulate the gradient with advantage (PG magic happens right here.)
    grad = policy_backward(eph, epdlogp)
    for k in model: grad_buffer[k] += grad[k] # accumulate grad over batch

    # perform rmsprop parameter update every batch_size episodes
    if episode_number % batch_size == 0:
      for k,v in model.iteritems():
        g = grad_buffer[k] # gradient
        rmsprop_cache[k] = decay_rate * rmsprop_cache[k] + (1 - decay_rate) * g**2
        model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)
        grad_buffer[k] = np.zeros_like(v) # reset batch gradient buffer

    # boring book-keeping
    running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
    print 'resetting env. episode reward total was %f. running mean: %f' % (reward_sum, running_reward)
    if episode_number % 100 == 0: pickle.dump(model, open('save.p', 'wb'))
    reward_sum = 0
    observation = env.reset() # reset env
    prev_x = None

  if reward != 0: # Pong has either +1 or -1 reward exactly when game ends.
    print ('ep %d: game finished, reward: %f' % (episode_number, reward)) + ('' if reward == -1 else ' !!!!!!!!')
liangfu commented Jun 1, 2016

Great work, thanks @karpathy for your simple gist code that just works !~

Awesome! Thanks @karpathy for your generosity again!

lakehanne commented Jun 1, 2016 edited

Thanks! I am a student in control theory and much to my chagrin, supervised learning for approximating dynamical systems, particularly in robot control is royally disappointing. I am just combing through Sutton's book myself (in chapter 3 now). This is priceless!

Really nice work.

domluna commented Jun 3, 2016

@karpathy I'm curious why you left out the no-op action. That explains why, in the video, the agent looks like it just drank 100 coffee cups :)

Does it make it harder to learn if no-op is left in?

@karpathy, great post! (big fan of cs231n btw). just wondering : if you want this to work for many actions, does it make sense to replace sigmoid by softmax & then replace line 87 : dlogs.append(softmaxloss_gradient) ?

yuzcccc commented Jun 7, 2016

Great post! I tried running this script with the given hyperparameters; however, after 10000+ episodes the running mean is still about -16 to -18, and it is far from converged judging from the visualization. Any suggestions?

etienne87 commented Jun 9, 2016 edited

@domluna, it works with no-ops as well if you replace the sigmoid by a softmax (this needs some minor modifications, like giving the W2 matrix more outputs; the gradient is the same, just at the right output). @yuzcccc Also, setting the learning rate to 1e-3 works better (like 10x better in my trial)
EDIT: test here :

pitybea commented Jun 17, 2016

great example!
I wonder what would happen if negative training examples (lost games) were subsampled?

Thanks @karpathy, this is a generous and well thought-out example. As with your rnn gist, it captures the essentials of a very difficult and exciting field

Atcold commented Jun 21, 2016 edited

Why are you calling the logit['W2'], h) logp? This output is not constrained to ]-∞, 0], is it?
Otherwise, p = exp(logp), no?

dorajam commented Jul 1, 2016

It might not make a big difference, but why don't we backpropagate through the sigmoid layer? Seems like the backprop function just assumes the gradients to be the errors before the sigmoid activation. Any ideas?

Hi All,
I have a small question regarding the downsampling done in the prepro function.

I = I[::2,::2,0] # downsample by factor of 2

Why is only the R channel from RGB considered here? Could someone help me understand the idea behind using only one channel for downsampling?


nickc92 commented Jul 11, 2016

@SelvamArul, note that a few lines later he has I[I != 0] = 1, so he's basically turning the image into a black/white binary image; he could have picked any channel, R, G, or B.
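A quick way to convince yourself of this (a toy check I wrote, not from the gist; the synthetic frame and pixel values are made up, with 144 standing in for the background):

```python
import numpy as np

frame = np.full((210, 160, 3), 144, dtype=np.uint8)  # fake uniform background
frame[100:104, 70:72, :] = 236                       # fake white ball, same in all channels

def prepro_channel(I, c):
    """ the gist's prepro, but reading channel c instead of hard-coding 0 """
    I = I[35:195][::2, ::2, c].copy()  # crop, downsample, pick one channel
    I[I == 144] = 0                    # erase background
    I[I == 109] = 0
    I[I != 0] = 1                      # binarize everything else
    return I.astype(np.float64).ravel()

# identical binary output whichever of R, G, B we read
same = all(np.array_equal(prepro_channel(frame, 0), prepro_channel(frame, c))
           for c in (1, 2))
```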

I'm using your gist to get started with Policy Gradients and I have a question about the policy_backward function: why aren't you backprop-ing though the sigmoid function? I assume you leave the \sigma*(1-\sigma) out as an approximation, but I want to make sure! Thanks again

NHDaly commented Aug 5, 2016

Thanks so much for this and for your awesome article. I'm sure you hear this often, but your articles are what got me excited about machine learning, especially deep learning! Thanks!! :)

A bug in your code:
Line 61 in the policy_backward function references epx, which is defined on line 99. I think this should be passed into the function, just like eph and dlogps, not referenced globally like it is.

c: Thanks again!

Hi, is it possible to speak with you?

sohero commented Aug 29, 2016

Hi, I am confused about line 86, y = 1 if action == 2 else 0 # a "fake label". Why make a fake label? There is no y label in the formula, only p(x).

Could you give some hints?

taey16 commented Aug 30, 2016


Why don't you read the blog post, section "Policy Gradients":

Okay, but what do we do if we do not have the correct label in the Reinforcement Learning setting? Here is the Policy Gradients solution (again refer to diagram below). Our policy network calculated probability of going UP as 30% (logprob -1.2) and DOWN as 70% (logprob -0.36). We will now sample an action from this distribution; E.g. suppose we sample DOWN, and we will execute it in the game. At this point notice one interesting fact: We could immediately fill in a gradient of 1.0 for DOWN as we did in supervised learning, and find the gradient vector that would encourage the network to be slightly more likely to do the DOWN action in the future. So we can immediately evaluate this gradient and that’s great, but the problem is that at least for now we do not yet know if going DOWN is good. But the critical point is that that’s okay, because we can simply wait a bit and see! For example in Pong we could wait until the end of the game, then take the reward we get (either +1 if we won or -1 if we lost), and enter that scalar as the gradient for the action we have taken (DOWN in this case). In the example below, going DOWN ended up to us losing the game (-1 reward). So if we fill in -1 for log probability of DOWN and do backprop we will find a gradient that discourages the network to take the DOWN action for that input in the future (and rightly so, since taking that action led to us losing the game).

y = 1 if action == 2 else 0 # a "fake label" # line 86
# So we can immediately evaluate this gradient by introducing fake label
dlogps.append(y - aprob) # grad that encourages the action that was taken to be taken
epdlogp = np.vstack(dlogps) # line 101
# we could wait until the end of the game, then take the reward we get (either +1 if we won or -1 if we lost), 
# and enter that scalar as the gradient for the action we have taken
epdlogp *= discounted_epr # modulate the gradient with advantage (PG magic happens right here.) # line 111
grad = policy_backward(eph, epdlogp)

On this line:

dlogps.append(y - aprob)

Aren't y and aprob probabilities? So isn't it incorrect to subtract them to get a difference in log probabilities?

hqm commented Sep 25, 2016

When you do the preprocessing where you compute the difference between the frame and the previous frame, doesn't that cause
the location of the paddles to be removed, if they have not moved from one frame to the next? It seems counter-intuitive that the learning wouldn't be affected by the location of the paddles...

Very excellent scripts.
I am wondering one thing regarding to learning speed. Hopefully you can give me some suggestions.
In your scripts, random action is sampled given an environment state. Normally it takes a long time to explore. What if the actions are guided by human intelligence instead of exploring by itself. After human teaching for a while, let machine learn by itself. In this case, the learning speed can be improved quite a lot. Do you have any comments about this approach ? If it is possible then how to proceed ?

DanielTakeshi commented Oct 21, 2016 edited

I don't know if this was written with a different API, but it's worth noting that the print statement in the penultimate line of the script isn't quite accurate. The +1 or -1 happens each time either player scores, +1 if we score, -1 if the computer scores. That is distinct from an episode which terminates after someone gets 21 points.

@NHDaly it actually still works: epx can still be a global variable, and its value is picked up inside the function from the enclosing scope.

finardi commented Nov 2, 2016 edited

Great job!
I have a question.
In CS231 the RMSprop algorithm is defined by:
x += - learning_rate * dx / (np.sqrt(cache) + eps).

But here, in line 120 we have:
model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)

Why is there no negative sign on the right side of this equation?

note: I tried training with the minus sign, and after 1,000 episodes the running_reward is still -21.

@dorajam and @greydanus
Actually the code does backprop through the sigmoid layer. Please see and search for the string "as you can double check yourself by taking the derivatives". The formula there matches exactly with the line dlogps.append(y - aprob) .

Thus the variable dlogps is the gradient w.r.t. the logit.
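This is easy to verify numerically (a toy check of my own: for a label y in {0,1}, the derivative of log p(y) with respect to the logit of a sigmoid comes out to y - sigmoid(logit), which is exactly the `y - aprob` term):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logprob(logit, y):
    # log-probability of label y under Bernoulli(sigmoid(logit))
    p = sigmoid(logit)
    return y * np.log(p) + (1 - y) * np.log(1 - p)

logit, y, eps = 0.7, 1, 1e-6
numeric = (logprob(logit + eps, y) - logprob(logit - eps, y)) / (2 * eps)
analytic = y - sigmoid(logit)  # the `y - aprob` term from the gist
```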

neighborzhang commented Dec 10, 2016 edited

@karpathy, your blog is awesome, and thanks for sharing the code. I am a newbie in deep learning. I am wondering, does your code utilize the GPU? Thanks

What I find quite fascinating is that, since the neural network makes no assumptions about the pixel spatial structure, the same algorithm would work equally well even if we randomly permuted the pixels on the screen. I bet that the algorithm would converge faster (to a stronger policy) if we used CNNs.

@hqm: There are 2 paddles in the game, ours and the computer's.

When we compute the difference between the current frame and the previous frame, we DO lose the computer's paddle IF it is stationary. Our paddle, however, is never stationary (at every step our action is either UP or DOWN), and we can keep track of our paddle's location (which is the one we care about)

irwenqiang commented Feb 9, 2017 edited

After training the neural network for days, the reward mean is about 2.
The pre-trained model file was published at

JiaqiLiu commented Feb 21, 2017 edited


I think it is wrong to put the following lines at the end.

if reward != 0: # Pong has either +1 or -1 reward exactly when game ends.
    print ('ep %d: game finished, reward: %f' % (episode_number, reward)) + ('' if reward == -1 else ' !!!!!!!!')

Instead, they should be put before 'if done:'.

Am I right, please?

JiaqiLiu commented Feb 22, 2017 edited

Hi all,

I found that it helps speed things up if you turn the learning_rate up to 1e-3. Hope that helps.

Hi, your code is awesome! I am new to deep learning, and I learned a lot from your code. Thanks a lot!

Hi, I have a question.
Can I watch the match of the AI vs. the agent?
How do I watch it?


schinger commented Mar 2, 2017

I wrote an actor-critic version:

I'm getting a problem with cPickle. Please help me solve the issue.

rogaha commented Mar 7, 2017

Thanks for sharing @karpathy! Nice work!

same question as @finardi.

4SkyNet commented Mar 20, 2017 edited

Andrej @karpathy >> thx for your great work for the community!
But I think there is some implicit trouble in your example --> when you are computing discounted rewards.
It is fine if gym returns rewards as floats, or if we clip them as usual.
But if rewards are represented as integers, we can get some wrong behavior.

The easiest way to fix it:

def discount_rewards(r):
  """ take 1D float array of rewards and compute discounted reward """
  discounted_r = np.zeros_like(r, dtype=np.float32)  # or np.float64
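To see the failure mode concretely (a toy example I made up; `np.zeros_like` inherits the integer dtype, so every fractional discounted value gets truncated toward zero on assignment):

```python
import numpy as np

r = np.array([0, 0, 1])                    # integer rewards, as feared above
bad = np.zeros_like(r)                     # silently inherits dtype int
good = np.zeros_like(r, dtype=np.float64)  # the proposed fix

running = 0.0
for t in reversed(range(r.size)):
    running = running * 0.99 + r[t]
    bad[t] = running   # 0.99 and 0.9801 truncate to 0 in the int array
    good[t] = running
```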

thanks for sharing

jpeg729 commented Apr 22, 2017

I would love to see what would happen if you used a recurrent network and fed it ordinary frames rather than difference frames.
Seems like an obvious approach to me.

jjuel commented Apr 26, 2017

I am getting this error ValueError: operands could not be broadcast together with shapes (200,6400) (200,200) (200,6400)
It says it is happening on line 113.

Anyone seen this error or know why it could be happening?

Greetings fellow ML travellers,

I must commend @karpathy on his blog post and associated code. It’s certainly a feat to condense so many complex ideas so elegantly into such short texts.

I’m currently attempting to replicate the code in order to further my own understanding. However, I’ve run into a few issues and would appreciate any clarifications.

The most prominent issue appears to be (at least to me) confusion over the notion of “episode”. Pong has at least two natural notions of episode: “rounds” (the period within which exactly one player scores a point), and “games” (the period within which a player first scores 21 points). Obviously, a game consists of a variable number of rounds.

Let us say that an episode corresponds to a game. Now as I understand it, the Monte-Carlo policy gradient approach as applied here involves:

  1. sampling the policy network for an action at each decision point (every 2 to 4 frames)
  2. receiving an immediate reward in {-1,0,1} at each decision point upon acting.
  3. at the end of the game, computing the discounted return for each decision point
  4. using the vector of discounted returns (or a function thereof) to scale the policy gradients in gradient ascent at appropriate update-points
  5. looping the previous steps

In the given code, the end of an episode corresponds to the end of a game, yet returns are computed using rounds. Is this an error or am I misunderstanding what is going on?

A related issue concerns the use of “advantages” to modify the gradients in gradient ascent. I cannot find a justification in the given citation for the specific form used here, nor for full normalization. Strictly speaking, these are not advantages but simply discounted returns. I believe a more accessible justification can be found in Section 2.7 and Chapter 13 of Sutton’s book (neglecting the discount factor). So, subtracting the average of the returns as a baseline from every return of a given episode is somewhat justified as a variance reduction technique; however, I’m doubtful that dividing by the standard deviation (as one might do in preprocessing) is justified or necessary, particularly if returns are bounded in [-21,21] (if episodes are games) or [-1,1] (if episodes are rounds).

In short,

  1. If episodes are taken to be games, line 44 if r[t] != 0: running_add = 0 # reset the sum, since this was a game boundary (pong specific!) should be omitted, and the penultimate line should indeed be if done: (as stated by @JiaqiLiu and @DanielTakeshi).
  2. Is it really necessary in this case to fully normalize the “returns” before modifying the gradient?

Again, I would appreciate any clarification of the above. Thanks in advance!

@finardi and @kris-singh:

RMSProp is presented in CS231 in the context of gradient descent, wherein the goal is to move the parameters downward (in the negative direction of the gradient) in order to minimize a loss function.

Here, in the Monte-Carlo Policy Gradient method, we are using gradient ascent; we are trying to move the parameters upward (in the positive direction of the gradient) in order to maximize an objective function.

This is why a plus sign is used here, whereas a minus sign is used in the class notes.
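A toy illustration of the sign difference (my own example, maximizing f(theta) = -(theta - 3)^2 with plain gradient steps; the RMSProp scaling is left out since it does not affect the sign):

```python
# gradient of the objective f(theta) = -(theta - 3)**2
grad_f = lambda theta: -2.0 * (theta - 3.0)

theta_ascent, theta_descent, lr = 0.0, 0.0, 0.1
for _ in range(100):
    theta_ascent += lr * grad_f(theta_ascent)       # '+' : ascent on the objective
    theta_descent -= lr * (-grad_f(theta_descent))  # '-' : descent on the loss -f
# both updates are equivalent and converge to the maximizer theta = 3
```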
