
# karpathy/pg-pong.py

Created May 30, 2016

Training a Neural Network ATARI Pong agent with Policy Gradients from raw pixels
commented Jun 1, 2016
 Great work, thanks @karpathy for your simple gist code that just works!
 Awesome! Thanks @karpathy for your generosity again!
commented Jun 1, 2016 edited
 Thanks! I am a student in control theory and, much to my chagrin, supervised learning for approximating dynamical systems, particularly in robot control, is royally disappointing. I am just combing through Sutton's book myself (in chapter 3 now). This is priceless!
 Really nice work.
commented Jun 3, 2016
 @karpathy I'm curious why you left out the no-op action. It explains why the agent in the video looks like it just drank 100 cups of coffee :) Does leaving no-op in make it harder to learn?
 @karpathy, great post! (big fan of cs231n btw). Just wondering: if you want this to work for many actions, does it make sense to replace the sigmoid by a softmax, and then replace line 87 with dlogps.append(softmaxloss_gradient)?
commented Jun 7, 2016
 Great post! I tried running this script with the given hyper-parameters; however, after 10,000+ episodes the running mean is still around -16 to -18, and judging from the visualization it is far from converging. Any suggestions?
commented Jun 9, 2016 edited
 @domluna, it works with no-ops as well if you replace the sigmoid by a softmax (this needs some minor modifications, like giving the W2 matrix more output rows; the gradient is the same, just applied at the right output). @yuzcccc Also, setting the learning rate to 1e-3 works better (like 10x better in my trial). EDIT: test here: https://gist.github.com/etienne87/6803a65653975114e6c6f08bb25e1522
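To make the softmax modification concrete, here is a rough sketch of what the multi-action head could look like (function names here are made up for illustration; only the overall shapes follow the gist, where W2 would gain one row per action):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)           # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy_forward_softmax(x, W1, W2):
    """Like the gist's policy_forward, but W2 has one row per action."""
    h = np.dot(W1, x)
    h[h < 0] = 0                # ReLU nonlinearity, as in the gist
    logits = np.dot(W2, h)      # shape: (n_actions,)
    return softmax(logits), h

def dlogits_for(action, probs):
    """Gradient on the logits that encourages the sampled action,
    the softmax analogue of the gist's `y - aprob`."""
    onehot = np.zeros_like(probs)
    onehot[action] = 1.0
    return onehot - probs       # later modulated by the advantage
```

As in the binary case, this gradient is later multiplied by the discounted reward, so "encourage" flips to "discourage" for losing actions.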
commented Jun 17, 2016
 Great example! I wonder what would happen if negative training examples (lost games) were subsampled?
 Thanks @karpathy, this is a generous and well thought-out example. As with your rnn gist, it captures the essentials of a very difficult and exciting field.
commented Jun 21, 2016 edited
 Why are you calling the logit np.dot(model['W2'], h) logp? This output is not constrained to (-∞, 0], is it? Otherwise we would have p = exp(logp), no?
commented Jul 1, 2016
 It might not make a big difference, but why don't we backpropagate through the sigmoid layer? The backprop function seems to treat the gradients as the errors before the sigmoid activation. Any ideas?
 Hi all, I have a small question regarding the downsampling done in the prepro function: I = I[::2,::2,0] # downsample by factor of 2. Why is only the R channel of RGB considered here? Could someone help me understand the idea behind using only one channel? Thanks.
commented Jul 11, 2016
 @SelvamArul, note that a few lines later he has I[I != 0] = 1, so he's basically turning the image into a black/white binary image; he could have picked any channel, R, G, or B.
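For reference, the preprocessing in question looks roughly like this (paraphrased from the gist; the binarization at the end is why any single channel suffices):

```python
import numpy as np

def prepro(I):
    """Preprocess a 210x160x3 uint8 Pong frame into a 6400-dim binary vector."""
    I = I[35:195]         # crop to the play area
    I = I[::2, ::2, 0]    # downsample by factor of 2, keep only the R channel
    I[I == 144] = 0       # erase background type 1
    I[I == 109] = 0       # erase background type 2
    I[I != 0] = 1         # everything else (paddles, ball) becomes 1
    return I.astype(np.float64).ravel()
```

Since every non-background object is simply set to 1, the R, G, and B channels all produce the same binary image.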
 I'm using your gist to get started with Policy Gradients and I have a question about the policy_backward function: why aren't you backprop-ing through the sigmoid function? I assume you leave the σ(1−σ) term out as an approximation, but I want to make sure! Thanks again.
commented Aug 5, 2016
 Thanks so much for this and for your awesome article. I'm sure you hear this often, but your articles are what got me excited about machine learning, especially deep learning! Thanks!! :) A bug in your code: line 61 in the policy_backward function references epx, which is defined on line 99. I think epx should be passed into the function, just like eph and epdlogp, not referenced globally like it is. c: Thanks again!
 Hi, is it possible to speak with you?
commented Aug 29, 2016
 Hi, I am confused about line 86: y = 1 if action == 2 else 0 # a "fake label". Why make a fake label? There is no y label in the formula, only p(x). Could you give some hints?
commented Aug 30, 2016
 @sohero Why don't you read the blog post, Section "Policy Gradients"?

 > Okay, but what do we do if we do not have the correct label in the Reinforcement Learning setting? Here is the Policy Gradients solution (again refer to diagram below). Our policy network calculated probability of going UP as 30% (logprob -1.2) and DOWN as 70% (logprob -0.36). We will now sample an action from this distribution; e.g. suppose we sample DOWN, and we will execute it in the game. At this point notice one interesting fact: we could immediately fill in a gradient of 1.0 for DOWN as we did in supervised learning, and find the gradient vector that would encourage the network to be slightly more likely to do the DOWN action in the future. So we can immediately evaluate this gradient and that's great, but the problem is that at least for now we do not yet know if going DOWN is good. But the critical point is that that's okay, because we can simply wait a bit and see! For example in Pong we could wait until the end of the game, then take the reward we get (either +1 if we won or -1 if we lost), and enter that scalar as the gradient for the action we have taken (DOWN in this case). In the example below, going DOWN ended up with us losing the game (-1 reward). So if we fill in -1 for the log probability of DOWN and do backprop we will find a gradient that discourages the network to take the DOWN action for that input in the future (and rightly so, since taking that action led to us losing the game).

 In the code:

 ```python
 y = 1 if action == 2 else 0  # a "fake label"  (line 86)
 # so we can immediately evaluate this gradient by introducing a fake label
 dlogps.append(y - aprob)  # grad that encourages the action that was taken to be taken
 ...
 epdlogp = np.vstack(dlogps)  # (line 101)
 ...
 # we wait until the end of the game, then take the reward we get (either +1
 # if we won or -1 if we lost) and enter that scalar as the gradient for the
 # action we have taken
 epdlogp *= discounted_epr  # modulate the gradient with advantage (PG magic happens right here.)  (line 111)
 grad = policy_backward(eph, epdlogp)
 ```
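The "wait and see" step above works by discounting rewards backward through time before they modulate the gradient. A minimal sketch of that step (gamma is the discount hyper-parameter; the gist uses 0.99, and the reset at non-zero rewards is Pong-specific, since a point ends the rally):

```python
import numpy as np

def discount_rewards(r, gamma=0.99):
    """Turn per-step rewards into discounted returns, resetting the
    running sum at each non-zero reward (a point was scored)."""
    discounted = np.zeros_like(r, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(r))):
        if r[t] != 0:
            running = 0.0          # game boundary: start the sum fresh
        running = running * gamma + r[t]
        discounted[t] = running
    return discounted
```

Every action in a rally thus inherits a (discounted) share of the rally's final ±1 outcome, which is the scalar that gets filled in as the "gradient" for the action taken.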
 On this line: dlogps.append(y - aprob). Aren't y and aprob probabilities? So isn't it incorrect to subtract them and call the result a difference in log probabilities?
commented Sep 25, 2016
 When you do the preprocessing where you compute the difference between the frame and the previous frame, doesn't that cause the location of the paddles to be removed, if they have not moved from one frame to the next? It seems counter-intuitive that the learning wouldn't be affected by the location of the paddles...
 Excellent script! I am wondering one thing regarding learning speed; hopefully you can give me some suggestions. In your script, a random action is sampled given an environment state, and normally it takes a long time to explore. What if the actions were guided by human intelligence instead of the agent exploring by itself? After human teaching for a while, let the machine learn on its own. In that case the learning speed could improve quite a lot. Do you have any comments on this approach? If it is possible, how would one proceed?
commented Oct 21, 2016 edited
 I don't know if this was written against a different API, but it's worth noting that the print statement in the penultimate line of the script isn't quite accurate. The +1 or -1 happens each time either player scores: +1 if we score, -1 if the computer scores. That is distinct from an episode, which terminates only after someone reaches 21 points. @NHDaly it actually still works: epx can remain a global variable and its value is picked up implicitly inside the function.
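For anyone who prefers not to rely on the global, a sketch of policy_backward with epx passed explicitly (shapes assumed to match the gist: epx is (T, input_dim), eph is (T, H), epdlogp is (T, 1), model['W2'] is (H,)):

```python
import numpy as np

def policy_backward(epx, eph, epdlogp, model):
    """Backward pass with epx passed in explicitly rather than read
    from the enclosing scope; the math is unchanged from the gist."""
    dW2 = np.dot(eph.T, epdlogp).ravel()   # gradient for the output weights
    dh = np.outer(epdlogp, model['W2'])    # backprop into hidden activations
    dh[eph <= 0] = 0                       # backprop through the ReLU
    dW1 = np.dot(dh.T, epx)                # gradient for the input weights
    return {'W1': dW1, 'W2': dW2}
```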
commented Nov 2, 2016 edited
 Great job! I have a question. In CS231n the RMSprop update is defined as: x += - learning_rate * dx / (np.sqrt(cache) + eps). But here, on line 120, we have: model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5). Why is there no negative sign on the right-hand side? Note: I tried training with the minus sign and after 1,000 episodes the running_reward was still -21.
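One way to see why the sign is positive: the gist performs gradient *ascent* on expected reward, so the accumulated gradient already points toward higher reward and is added rather than subtracted (in supervised learning the gradient points toward higher loss, hence the minus). A sketch of such an update (rmsprop_ascent is a made-up helper name, not a function from the gist):

```python
import numpy as np

def rmsprop_ascent(params, grads, cache, lr=1e-4, decay=0.99, eps=1e-5):
    """RMSprop step for gradient ascent: grads point toward higher
    objective value, so they are added, matching the gist's + sign."""
    for k in params:
        cache[k] = decay * cache[k] + (1 - decay) * grads[k] ** 2
        params[k] += lr * grads[k] / (np.sqrt(cache[k]) + eps)
    return params, cache
```

Flipping the sign turns the update into reward *descent*, which is consistent with the running_reward staying pinned at -21.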
 @dorajam and @greydanus Actually the code does backprop through the sigmoid layer. Please see http://cs231n.github.io/neural-networks-2/#losses and search for the string "as you can double check yourself by taking the derivatives". The formula there matches exactly with the line dlogps.append(y - aprob) . Thus the variable dlogps is the gradient w.r.t. the logit.
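That claim is easy to verify numerically: for a Bernoulli policy with p = sigmoid(z), the analytic gradient of log p(y|z) with respect to the logit z collapses to y - sigmoid(z), which is exactly what dlogps stores. A quick finite-difference check (helper names are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logprob_grad_wrt_logit(z, y):
    """Analytic d/dz of log p(y|z) for a Bernoulli with p = sigmoid(z)."""
    return y - sigmoid(z)

def numeric_grad(z, y, h=1e-6):
    """Central finite difference of the same log-probability."""
    def logp(zz):
        p = sigmoid(zz)
        return y * np.log(p) + (1 - y) * np.log(1 - p)
    return (logp(z + h) - logp(z - h)) / (2 * h)
```

So the sigmoid is backpropped through analytically, folded into the y - aprob expression, rather than being skipped.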
commented Dec 10, 2016 edited
 @karpathy, your blog is awesome, and thanks for sharing the code. I am a newbie in deep learning. I am wondering: does your code utilize the GPU? Thanks.
 What I find quite fascinating is that, since the neural network makes no assumptions about the spatial structure of the pixels, the same algorithm would work equally well even if we randomly permuted the pixels on the screen. I bet the algorithm would converge faster (to a stronger policy) if we used CNNs.
 @hqm: There are 2 paddles in the game, ours and the computer's. When we compute the difference between the current frame and the previous frame, we DO lose the computer's paddle IF it is stationary. Our paddle, however, is never stationary (at every step our action is either UP or DOWN), so we can always keep track of our paddle's location, which is the one we care about.
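The point can be checked on a toy example: a stationary object cancels out in the frame difference, while a moving one leaves a -1/+1 trace. (The toy 4x4 frames below are made up; only the cur_x - prev_x subtraction corresponds to the gist.)

```python
import numpy as np

# Two toy binary "frames": a stationary paddle in column 0
# and a ball that moves one pixel to the right.
prev = np.zeros((4, 4))
cur = np.zeros((4, 4))
prev[:, 0] = 1          # paddle, identical in both frames
cur[:, 0] = 1
prev[1, 1] = 1          # ball's old position
cur[1, 2] = 1           # ball's new position

diff = cur - prev       # the gist's x = cur_x - prev_x
```

The paddle column comes out all zeros, while the ball shows up as -1 at its old position and +1 at its new one, so motion is encoded but stationary objects vanish.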
commented Feb 9, 2017 edited
 I trained the neural network for days; the reward mean is about 2. The pre-trained model file is published at https://pan.baidu.com/s/1mh8JkiG
commented Feb 21, 2017 edited
 Hi, I think it is wrong to put the following lines at the end:

 ```python
 if reward != 0:  # Pong has either +1 or -1 reward exactly when game ends.
     print ('ep %d: game finished, reward: %f' % (episode_number, reward)) + ('' if reward == -1 else ' !!!!!!!!')
 ```

 Instead, they should be placed before if done:. Am I right?
commented Feb 22, 2017 edited
 Hi all, I found that it helps to speed things up if you set the learning_rate to 1e-3. Hope that helps.
 Hi, your code is awesome! I am new to deep learning and I learned a lot from it. Thanks a lot!
 Hi, I have a question: can I watch the match of the AI vs. the built-in agent? How do I do that? Thanks~