
# karpathy/pg-pong.py Created May 30, 2016

Training a Neural Network ATARI Pong agent with Policy Gradients from raw pixels

### liangfu commented Jun 1, 2016

Great work, thanks @karpathy for your simple gist code that just works!

### micoolcho commented Jun 1, 2016

 Awesome! Thanks @karpathy for your generosity again!

### lakehanne commented Jun 1, 2016 • edited

 Thanks! I am a student in control theory and much to my chagrin, supervised learning for approximating dynamical systems, particularly in robot control is royally disappointing. I am just combing through Sutton's book myself (in chapter 3 now). This is priceless!

### jwjohnson314 commented Jun 2, 2016

 Really nice work.

### domluna commented Jun 3, 2016

@karpathy I'm curious why you left out the no-op action. The video shows why the agent looks like it just drank 100 coffee cups :) Does it make it harder to learn if no-op is left in?

### etienne87 commented Jun 3, 2016

@karpathy, great post! (big fan of cs231n btw). Just wondering: if you want this to work for many actions, does it make sense to replace the sigmoid with a softmax and then replace line 87 with dlogps.append(softmaxloss_gradient)?

### yuzcccc commented Jun 7, 2016

Great post! I tried running this script with the given hyper-parameters; however, after 10000+ episodes, the running mean is still about -16 to -18, and judging by the visualization it is far from converging. Any suggestions?

### etienne87 commented Jun 9, 2016 • edited

@domluna, it works with no-ops as well if you replace the sigmoid with a softmax (this needs some minor modifications, like giving the W2 matrix more output rows; the gradient is the same but applied at the right output). @yuzcccc Also, setting the learning rate to 1e-3 works better (like 10x better in my trial). EDIT: test here: https://gist.github.com/etienne87/6803a65653975114e6c6f08bb25e1522
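To make the softmax suggestion concrete, here is a minimal, hypothetical sketch of a multi-action policy head (layer sizes and variable names are illustrative, not from the gist):

```python
import numpy as np

np.random.seed(0)

# Hypothetical multi-action head: replace the single sigmoid output with a
# softmax over n_actions (e.g. UP, DOWN, NOOP). Sizes follow the gist's scale.
H, D, n_actions = 200, 6400, 3
W1 = np.random.randn(H, D) / np.sqrt(D)
W2 = np.random.randn(n_actions, H) / np.sqrt(H)  # one output row per action

def policy_forward(x):
    h = np.maximum(0, W1 @ x)           # ReLU hidden layer, as in the gist
    logits = W2 @ h
    p = np.exp(logits - logits.max())   # numerically stable softmax
    p /= p.sum()
    return p, h

x = np.random.randn(D)
p, h = policy_forward(x)
action = np.random.choice(n_actions, p=p)

# Softmax cross-entropy gradient w.r.t. the logits: (y_onehot - p),
# the multi-class analogue of the gist's `y - aprob`.
y = np.zeros(n_actions)
y[action] = 1
dlogits = y - p   # append this instead of `dlogps.append(y - aprob)`
```

The multi-class gradient `y - p` plays exactly the role that `y - aprob` plays in the two-action version.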

### pitybea commented Jun 17, 2016

Great example! I wonder what would happen if the negative training examples (lost games) were subsampled?

### greydanus commented Jun 21, 2016

 Thanks @karpathy, this is a generous and well thought-out example. As with your rnn gist, it captures the essentials of a very difficult and exciting field

### Atcold commented Jun 21, 2016 • edited

Why are you calling the logit np.dot(model['W2'], h) logp? This output is not constrained to (-∞, 0], is it? Otherwise, p = exp(logp), no?

### dorajam commented Jul 1, 2016

 It might not make a big difference, but why don't we backpropagate through the sigmoid layer? Seems like the backprop function just assumes the gradients to be the errors before the sigmoid activation. Any ideas?

### SelvamArul commented Jul 7, 2016

Hi all, I have a small question regarding the downsampling done in the prepro function: I = I[::2,::2,0] # downsample by factor of 2. Why is only the R channel of RGB considered here? Could someone help me understand the idea behind keeping only one channel when downsampling? Thanks.

### nickc92 commented Jul 11, 2016

 @SelvamArul, note that a few lines later he has I[I != 0] = 1, so he's basically turning the image into a black/white binary image; he could have picked any channel, R, G, or B.
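For reference, the preprocessing being discussed looks roughly like this (a sketch reconstructed from the gist; the values 144 and 109 are the Pong background colors it erases):

```python
import numpy as np

# Roughly the gist's prepro(): crop, downsample by 2 keeping a single channel,
# erase the background, and binarize everything else.
def prepro(I):
    """210x160x3 uint8 frame -> 6400-dim (80x80) float vector."""
    I = I[35:195]        # crop to the playing field
    I = I[::2, ::2, 0]   # downsample by factor of 2; any one channel works
    I[I == 144] = 0      # erase background (background type 1)
    I[I == 109] = 0      # erase background (background type 2)
    I[I != 0] = 1        # everything else (paddles, ball) set to 1
    return I.astype(np.float32).ravel()

frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
x = prepro(frame)
```

Because the final step binarizes the image, the choice of channel in `I[::2,::2,0]` is immaterial, as nickc92 says.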

### NHDaly commented Aug 5, 2016

Thanks so much for this and for your awesome article. I'm sure you hear this often, but your articles are what got me excited about machine learning, especially deep learning! Thanks!! :) A bug in your code: line 61, in the policy_backward function, references epx, which is defined on line 99. I think it should be passed into the function, just like eph and dlogps, not referenced globally. c: Thanks again!

### sohero commented Aug 29, 2016

Hi, I am confused about line 86: y = 1 if action == 2 else 0 # a "fake label". Why make a fake label? And there is no y label in the formula, only p(x). Could you give some hints?

### taey16 commented Aug 30, 2016

@sohero Why don't you read the blog post, Section "Policy Gradients":

> Okay, but what do we do if we do not have the correct label in the Reinforcement Learning setting? Here is the Policy Gradients solution (again refer to the diagram below). Our policy network calculated the probability of going UP as 30% (logprob -1.2) and DOWN as 70% (logprob -0.36). We will now sample an action from this distribution; e.g. suppose we sample DOWN, and we will execute it in the game. At this point notice one interesting fact: we could immediately fill in a gradient of 1.0 for DOWN as we did in supervised learning, and find the gradient vector that would encourage the network to be slightly more likely to do the DOWN action in the future. So we can immediately evaluate this gradient and that's great, but the problem is that at least for now we do not yet know if going DOWN is good. But the critical point is that that's okay, because we can simply wait a bit and see! For example in Pong we could wait until the end of the game, then take the reward we get (either +1 if we won or -1 if we lost), and enter that scalar as the gradient for the action we have taken (DOWN in this case). In the example below, going DOWN ended up with us losing the game (-1 reward). So if we fill in -1 for the log probability of DOWN and do backprop we will find a gradient that discourages the network to take the DOWN action for that input in the future (and rightly so, since taking that action led to us losing the game).

The corresponding lines in the code:

```python
y = 1 if action == 2 else 0 # a "fake label" (line 86)
# we can immediately evaluate this gradient by introducing the fake label
dlogps.append(y - aprob) # grad that encourages the action that was taken to be taken
...
epdlogp = np.vstack(dlogps) # line 101
...
# we could wait until the end of the game, then take the reward we get
# (either +1 if we won or -1 if we lost), and enter that scalar as the
# gradient for the action we have taken
epdlogp *= discounted_epr # modulate the gradient with advantage (PG magic happens right here.) (line 111)
grad = policy_backward(eph, epdlogp)
```

### MichaelBurge commented Sep 2, 2016

On this line: dlogps.append(y - aprob). Aren't y and aprob probabilities? So isn't it incorrect to subtract them to get a difference in log probabilities?

### hqm commented Sep 25, 2016

 When you do the preprocessing where you compute the difference between the frame and the previous frame, doesn't that cause the location of the paddles to be removed, if they have not moved from one frame to the next? It seems counter-intuitive that the learning wouldn't be affected by the location of the paddles...

### farscape2012 commented Oct 13, 2016

Excellent script. I am wondering about one thing regarding learning speed; hopefully you can give me some suggestions. In your script, a random action is sampled given an environment state. Normally it takes a long time to explore. What if the actions were guided by human intelligence instead of the agent exploring by itself? After a human teaches it for a while, let the machine learn by itself. In that case the learning speed could improve quite a lot. Do you have any comments on this approach? If it is possible, how would one proceed?

### DanielTakeshi commented Oct 21, 2016 • edited

 I don't know if this was written with a different API, but it's worth noting that the print statement in the penultimate line of the script isn't quite accurate. The +1 or -1 happens each time either player scores, +1 if we score, -1 if the computer scores. That is distinct from an episode which terminates after someone gets 21 points. @NHDaly it actually still works, epx can still be a global variable and its value will be passed implicitly into the method.

### finardi commented Nov 2, 2016 • edited

Great job! I have a question. In CS231n the RMSProp update is defined as x += - learning_rate * dx / (np.sqrt(cache) + eps). But here, on line 120, we have model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5). Why is there no negative sign on the right side of this equation? Note: I tried training with the minus sign and after 1,000 episodes the running_reward was still -21.

### zerolocker commented Nov 16, 2016

 @dorajam and @greydanus Actually the code does backprop through the sigmoid layer. Please see http://cs231n.github.io/neural-networks-2/#losses and search for the string "as you can double check yourself by taking the derivatives". The formula there matches exactly with the line dlogps.append(y - aprob) . Thus the variable dlogps is the gradient w.r.t. the logit.

### neighborzhang commented Dec 10, 2016 • edited

@karpathy, your blog is awesome, and thanks for sharing the code. I am a newbie in deep learning. I am wondering: does your code utilize the GPU? Thanks

### petercerno commented Jan 1, 2017

What I find quite fascinating is that, since the neural network makes no assumptions about the pixels' spatial structure, the same algorithm would work equally well even if we randomly permuted the pixels on the screen. I bet the algorithm would converge faster (to a stronger policy) if we used CNNs.

### gokayhuz commented Jan 14, 2017

@hqm: There are 2 paddles in the game, ours and the computer's. When we compute the difference between the current frame and the previous frame, we DO lose the computer's paddle IF it is stationary. Our paddle, however, is never stationary (at every step, our action is either UP or DOWN), and we can keep track of our paddle's location (which is the one we care about).

### irwenqiang commented Feb 9, 2017 • edited

After days of training the neural network, the mean reward is about 2. The pre-trained model file is published at https://pan.baidu.com/s/1mh8JkiG

### JiaqiLiu commented Feb 21, 2017 • edited

Hi, I think it is wrong to put the following lines at the end:

```python
if reward != 0: # Pong has either +1 or -1 reward exactly when game ends.
    print ('ep %d: game finished, reward: %f' % (episode_number, reward)) + ('' if reward == -1 else ' !!!!!!!!')
```

Instead, they should be put before if done:. Am I right, please?

### JiaqiLiu commented Feb 22, 2017 • edited

Hi all, I found it helpful for speeding up training to set the learning_rate to 1e-3. Hope that helps.

### trillionpowers commented Feb 22, 2017

Hi, your code is awesome! I am new to deep learning, and I learned a lot from your code. Thanks a lot!

### trillionpowers commented Feb 22, 2017

Hi, I have a question. Can I watch the match of the AI vs. the agent? How do I do that? Thanks~~

### schinger commented Mar 2, 2017

 I write an actor-critic version: https://github.com/schinger/pong_actor-critic

### shailesh commented Mar 6, 2017

I'm getting a problem with cPickle. Please help me solve the issue.

### rogaha commented Mar 7, 2017

 Thanks for sharing @karpathy! Nice work!

### kris-singh commented Mar 10, 2017

 same question as @finardi.

### 4SkyNet commented Mar 20, 2017 • edited

Andrej @karpathy >> thanks for your great work for the community! But I think there is an implicit problem in your example, when you are computing discounted rewards. It is fine if gym returns rewards as floats, or if we clip them as usual. Buuut, if rewards are represented as integers we can get some wrong behavior. The easiest way to fix it:

```python
def discount_rewards(r):
    """ take 1D float array of rewards and compute discounted reward """
    discounted_r = np.zeros_like(r, dtype=np.float32) # or np.float64
```
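A tiny demonstration of the pitfall (reward values are made up; the loop body follows the gist's discount_rewards):

```python
import numpy as np

gamma = 0.99  # discount factor, as in the gist

def discount_rewards(r, dtype=None):
    """With dtype=None, np.zeros_like inherits r's dtype: if r is an
    integer array, every discounted value below 1 is truncated to 0."""
    discounted_r = np.zeros_like(r, dtype=dtype)
    running_add = 0
    for t in reversed(range(len(r))):
        if r[t] != 0: running_add = 0  # reset at game boundary (Pong-specific)
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r

r_int = np.array([0, 0, 1])                 # integer rewards, as gym may return
bad = discount_rewards(r_int)               # int buffer: 0.99 truncates to 0
good = discount_rewards(r_int, np.float32)  # explicit float buffer
# bad  -> [0, 0, 1]
# good -> [0.9801, 0.99, 1.0]
```

With the integer buffer, every discounted reward except the terminal +1/-1 silently becomes zero, so almost no learning signal reaches earlier timesteps.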

### chandankuiry commented Apr 7, 2017

 thanks for sharing

### jpeg729 commented Apr 22, 2017

 I would love to see what would happen if you used a recurrent network and fed it ordinary frames rather than difference frames. Seems like an obvious approach to me.

### jjuel commented Apr 26, 2017

 I am getting this error ValueError: operands could not be broadcast together with shapes (200,6400) (200,200) (200,6400) It says it is happening on line 113. Anyone seen this error or know why it could be happening?

### normferns commented Apr 27, 2017

Greetings fellow ML travellers. I must commend @karpathy on his blog post and associated code. It's certainly a feat to condense so many complex ideas so elegantly into such short texts. I'm currently attempting to replicate the code in order to further my own understanding. However, I've run into a few issues and would appreciate any clarifications.

The most prominent issue appears to be (at least to me) confusion over the notion of "episode". Pong has at least two natural notions of episode: "rounds" (the period within which exactly one player scores a point), and "games" (the period within which a player first scores 21 points). Obviously, a game consists of a variable number of rounds. Let us say that an episode corresponds to a game. Now as I understand it, the Monte-Carlo policy gradient approach as applied here involves:

- sampling the policy network for an action at each decision point (every 2 to 4 frames)
- receiving an immediate reward in {-1, 0, 1} at each decision point upon acting
- at the end of the game, computing the discounted return for each decision point
- using the vector of discounted returns (or a function thereof) to scale the policy gradients in gradient ascent at appropriate update-points
- looping the previous steps

In the given code, the end of an episode corresponds to the end of a game, yet returns are computed using rounds. Is this an error, or am I misunderstanding what is going on?

A related issue concerns the use of "advantages" to modify the gradients in gradient ascent. I cannot find a justification in the given citation for the specific form used here, nor for full normalization. Strictly speaking, these are not advantages but simply discounted returns. I believe a more accessible justification can be found in Section 2.7 and Chapter 13 of Sutton's book (neglecting the discount factor).

So, subtracting the average of the returns as a baseline from every return of a given episode is somewhat justified as a variance reduction technique; however, I'm doubtful that dividing by the standard deviation (as one might do in preprocessing) is justified or necessary, particularly if returns are bounded in [-21, 21] (if episodes are games) or [-1, 1] (if episodes are rounds).

In short, if episodes are taken to be games, line 44

```python
if r[t] != 0: running_add = 0 # reset the sum, since this was a game boundary (pong specific!)
```

should be omitted, and the penultimate line should indeed be if done: (as stated by @JiaqiLiu and @DanielTakeshi). Is it really necessary in this case to fully normalize the "returns" before modifying the gradient? Again, I would appreciate any clarification of the above. Thanks in advance!

### normferns commented Apr 27, 2017

 RMSProp is presented in CS231 in the context of gradient descent, wherein the goal is to move the parameters downward (in the negative direction of the gradient) in order to minimize a loss function. Here, in the Monte-Carlo Policy Gradient method, we are using gradient ascent; we are trying to move the parameters upward (in the positive direction of the gradient) in order to maximize an objective function. This is why a plus sign is used here, whereas a minus sign is used in the class notes.
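A tiny snippet to make the sign convention concrete (the gradient values here are illustrative):

```python
import numpy as np

# The same RMSProp scaling serves both cases; only the sign of the parameter
# step differs. Here g is the gradient of an objective we want to MAXIMIZE.
learning_rate, decay_rate, eps = 1e-4, 0.99, 1e-5
w = np.zeros(3)
cache = np.zeros(3)
g = np.array([0.5, -0.2, 0.1])

cache = decay_rate * cache + (1 - decay_rate) * g ** 2
step = learning_rate * g / (np.sqrt(cache) + eps)
w_ascent = w + step    # gradient ascent, as on line 120 of the gist
w_descent = w - step   # gradient descent, as in the CS231n notes
```

Under ascent each parameter moves in the direction of `g`; under descent it moves against it, which is exactly the one-character difference being discussed.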

### dylanrandle commented May 26, 2017

 Hey @karpathy, Thanks for the code. About 10 years ago (when I was 13) I was really into rubik's cubes and I used your website to learn many algorithms. Today (I'm 23) I'm very interested in deep learning, and I find myself again learning from your (writeup of) algorithms. Thank you man! Cheers Dylan

### hyonaldo commented Jun 21, 2017 • edited

Hi, @normferns: I appreciate your kind reply, but I need to modify part of your answer about RMSProp and the positive/negative sign. It is not due to maximizing the objective function, but because of the line

```python
dlogps.append(y - aprob)
```

If the line were dlogps.append(-1 * (y - aprob)) or dlogps.append(aprob - y), then line 120 should be

```python
model[k] -= learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)
```

because of "x += - learning_rate * dx / (np.sqrt(cache) + eps)" in CS231n. In http://cs231n.github.io/neural-networks-3/#update, dx means the gradient of a loss function, so

```python
dx = -1 * (t - y) * g' * x  # see https://en.wikipedia.org/wiki/Delta_rule
dx ~ -1 * (t - y)           # negative sign
```

but in karpathy's code,

```python
g = (y - aprob) * g' * x
g ~ (y - aprob)             # positive sign
```

As you said, in the Monte-Carlo Policy Gradient method we are using gradient ascent to maximize ∑ A * logp(y∣x). That's right! However, this code uses ∑ 1/2 (y - aprob)^2 (i.e. SSE for RMSProp) instead, just like supervised learning! @finardi and @kris-singh, what do you think about this?

### Kryptonbond commented Jul 20, 2017

Hi, this is great work. I am able to run the whole thing on AWS EC2. Can you please tell me how to get the visual output of the game, since it's running on a remote computer?

### rrivera1849 commented Jul 28, 2017

 Thanks for sharing this Andrej.

### xmfbit commented Aug 7, 2017 • edited

@hyonaldo It is not an SSE (maybe you mean MSE) loss. Actually, the author used y - aprob because it is the gradient of logSoftmax(x). See the "Computing the Analytic Gradient with Backpropagation" section in http://cs231n.github.io/neural-networks-case-study/ for details.

### hyonaldo commented Aug 12, 2017

@xmfbit You're right, I meant MSE. But whether it is SSE or MSE or RMSE does not matter. The point is that the author's reason for using "y - aprob" is not to maximize the gain (i.e. ∑ A * logp(y∣x)), but to minimize a loss (e.g. SSE or MSE or RMSE or cross-entropy... or whatever).

### ibmua commented Aug 17, 2017 • edited

 Can anyone, please, say how many "episodes" it took them to get to 50%, or some other win rate?

### regmeg commented Aug 30, 2017 • edited

When you discount the rewards (running_add = running_add * gamma + r[t]) and they all happen to be positive/negative, you will produce a mean that is bigger than the smallest reward (consider the corner case where it is the first reward in the list). When you standardize with discounted_epr -= np.mean(discounted_epr), you are going to invert the signs of the smallest negative/positive rewards. Is this an issue in terms of calculating derivatives later on, namely will it produce a derivative that contradicts the value of the reward? Would it be better to simply normalize them, so that signs don't get inverted?
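A small numeric illustration of the sign flip (made-up returns). Note that with a mean baseline this flip is actually the intended behavior: actions that did better than the episode average get encouraged even when every raw return is negative.

```python
import numpy as np

# An all-negative episode: every discounted return is below zero.
discounted_epr = np.array([-0.5, -0.9, -1.0])

# The gist's standardization: subtract the mean, divide by the std.
adv = discounted_epr - np.mean(discounted_epr)
adv /= np.std(discounted_epr)
# The least-bad return (-0.5) now has a POSITIVE advantage, so the
# corresponding action is encouraged rather than discouraged.
```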

### AbhishekAshokDubey commented Sep 16, 2017

 For the back-prop explanation at one place: https://github.com/AbhishekAshokDubey/RL/blob/master/karpathy-ping-pong/README.md :)

### maitchison commented Sep 18, 2017

@ibmua the algorithm trains very slowly with the default learning_rate of 1e-4. It takes me around 20,000 episodes (3 nights of training) to get a score of around -10. If you change the learning rate to 1e-3, you should get an average score of 0 (i.e. competitive with the AI) in around 6,000 episodes. 3e-3 works too but seems less stable once we get above 0. Hope that helps :) -Matthew.

### Yugnaynehc commented Nov 3, 2017 • edited

Hello everyone! I want to give an explanation of dlogps.append(y - aprob), which I had been very curious about. Firstly, here we want to produce a probability of taking a certain action. In this case, Karpathy uses only one output to represent the probability of taking the UP action, denoted aprob (I think up_prob would be a better name); hence the probability of taking DOWN is 1 - aprob. Then, for the original REINFORCE formula, we can think of it as using log p(action) as the objective function (temporarily leaving rewards aside). Now the problem becomes "how to maximize log p(action)", and we can calculate the gradient w.r.t. the logit: when the taken action was UP, the probability is aprob, so the gradient is 1 - aprob; and when the taken action was DOWN, the probability is 1 - aprob, so the gradient is 0 - aprob = -aprob. In both cases this is y - aprob, so now it is clear why the code introduces the fake label y. If I am wrong, please correct me. Many thanks!
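This derivative can be checked numerically: with p(UP) = sigmoid(logit), the gradient of log p(action) with respect to the logit is exactly y - aprob (a small sanity-check script, not from the gist):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logp_action(logit, y):
    """Log-probability of the sampled action: UP if y == 1, DOWN if y == 0."""
    aprob = sigmoid(logit)
    return np.log(aprob) if y == 1 else np.log(1.0 - aprob)

logit, h = 0.3, 1e-6
for y in (1, 0):
    aprob = sigmoid(logit)
    analytic = y - aprob                       # the gist's dlogps term
    numeric = (logp_action(logit + h, y) -
               logp_action(logit - h, y)) / (2 * h)  # central difference
    assert abs(analytic - numeric) < 1e-6
```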

### kartikeyasaxena1012 commented Dec 9, 2017

Hello people, whenever I try to run this code for Pong on my Windows machine, this error pops up:

```
"\pong.py", line 24, in <module>
    grad_buffer = { k : np.zeros_like(v) for k,v in model.iteritems() } # update buffers that add up gradients over a batch
AttributeError: 'dict' object has no attribute 'iteritems'
```

Can someone help me? Thanks.

### pankajb64 commented Dec 10, 2017 • edited

 @AbhishekAshokDubey thanks for the explanation, that helped me understand the policy_backward method. @karpathy great post! Thank you so much! @kartikeyasaxena1012 you're probably using Python 3, in which case you should use model.items() instead.

### HeroKillerEver commented Jan 16, 2018

A silly question, looking forward to a reply. I am quite astonished to see that when I run the code, it occupies all the CPUs on my server. Since I thought numpy uses only 1 CPU, which part of the code enables the parallel computing?

### YongHuangSJTU commented Jan 23, 2018

I see many comments about dlogps.append(y - aprob) and I'm also confused about that line. Let's suppose aprob is 0.9, and the reward is 1 for UP and -1 for DOWN. If y is 1, then the gradient epdlogp is (1 - 0.9) * 1 = 0.1; but if y happens to be 0, then the gradient epdlogp is (0 - 0.9) * -1 = 0.9. The direction is still correct but not on the same scale. y comes from np.random.uniform() < aprob; I think no matter what random number we get, it should not affect the backpropagation of W1 and W2, so using aprob directly might be better. I don't have much experience in RL; welcome any comments if I misunderstood anything :)

### Kelvin-Zhong commented Feb 3, 2018

 @taey16 Thank you for your explanation :)

### Kelvin-Zhong commented Feb 3, 2018

@YongHuangSJTU I have the same confusion as you.

### EmrahErden commented Feb 6, 2018 • edited

@finardi In some papers they use a cost (where negative values are wanted, and the cost is minimized); in this one Karpathy used rewards (where positive values are wanted, and rewards are maximized). The sign discrepancy probably comes from that.

### HadXu commented Feb 13, 2018

Could anyone share a PyTorch version? Thanks

### MariaNivedha commented Feb 16, 2018

@YongHuangSJTU, @Kelvin-Zhong It works like this: aprob gives the probability of taking the UP action, so aprob = 1 says the UP action must be taken (which means the DOWN action must not be taken). Likewise, aprob = 0 says the UP action must not be taken (which means the DOWN action must be taken). The line action = 2 if np.random.uniform() < aprob else 3 randomly samples an UP action or a DOWN action. Suppose we get aprob = 0.4 and on the random draw we sample the DOWN action; then the gradient is 0 - 0.4 = -0.4, since this negative 0.4 pushes toward the lower boundary 0 (the highest probability of taking the DOWN action). Suppose we get aprob = 0.4 and on the random draw we sample the UP action; then the gradient is 1 - 0.4 = 0.6, since this positive 0.6 pushes toward the upper boundary 1 (the highest probability of taking the UP action). See the image below for understanding.

### akathirkathir commented Feb 26, 2018

Hey, I need some clarity on line 61: why do we use epx? Can you write down the backpropagation formulas for W1 and W2?

### xombio commented Mar 16, 2018

Question 1: I'm not a Python person, so I'm trying to write this code in Matlab. I noticed that xs, hs, dlogps, and drs are initialized to [],[],[],[] and reset to [],[],[],[] after each episode, but epx, eph, epdlogp, and epr are neither initialized nor reset; they seem to keep growing forever (lines 99-102). I'm not familiar with the nuances of np.vstack. Am I correct? Question 2: If I had a game with player movement options up, down, right, and left, how would I need to modify this code to make it work (besides the obvious modification to 4 nodes in the output layer)? Thanks.

### rayedbar commented Mar 20, 2018

 Is there a pre-trained network available which can be deployed in a gym environment and played with?

### mashoujiang commented Mar 22, 2018

Hi, @normferns, I have the same confusion as you about the concept of an episode. On line 95 (if done: # an episode finished), I think maybe it is an error, and I noticed that done=True only after 20 rounds are finished, but reward != 0 happens for every round of the game. I find nobody else has raised this issue... AND this code works! I can't understand it.

### BigHopes commented Mar 31, 2018 • edited

While running the code on Python 3.5, I faced some issues and solved them; happy to share (note: the code won't work on Windows because atari_py won't work):

1. cPickle didn't work, so just change it to: import pickle
2. model.iteritems() didn't work; change it to: model.items()
3. Error: No module named 'atari_py' — install the Atari dependencies by running 'pip install gym[atari]'
4. name 'xrange' is not defined — use range instead

Then set render = True, and enjoy watching it learn (the RL agent is the green one) :)
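The first, second, and fourth fixes can be sketched in a few lines (the toy model dict here is just for illustration):

```python
import pickle              # 1. Python 3 replacement for `import cPickle as pickle`
import numpy as np

model = {'W1': np.zeros((4, 2)), 'W2': np.zeros(4)}

# 2. dict.iteritems() was removed in Python 3; .items() does the job
grad_buffer = {k: np.zeros_like(v) for k, v in model.items()}

# 4. xrange() was removed; range() is already lazy in Python 3
total = sum(range(5))
```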

### felixdae commented Apr 4, 2018

 reimplemented it with tensorflow, hope it helps https://github.com/gameofdimension/policy-gradient-pong

### lucdaodainhan commented Apr 29, 2018 • edited

Thanks for the great post. Can anyone help me? I got the entire code running, but it only displays the log output, with no Pong gameplay. I run the code with the IDLE shell, Python 2.7.14, on 32-bit Windows 10. What is my problem? I mean, only the shell is running and no Pong gameplay appears.

### Alro10 commented May 29, 2018

Hi @lucdaodainhan, add this command after gym.make("Pong-v0"): env = gym.wrappers.Monitor(env, '.', force=True)

### crowjdh commented Jun 19, 2018

Thanks for sharing this great work :) You really should have added your blog post URL somewhere, since it illustrates this code in a really comprehensive way. For anyone who may benefit, here is the link: http://karpathy.github.io/2016/05/31/rl/

### javaswinger commented Jul 9, 2018 • edited

I added a revision for Python 3.6.6 that works; it required more Mac setup than was in the original post:

```
git clone https://github.com/openai/gym.git
cd gym
brew install cmake boost boost-python sdl2 swig wget
pip install -e '.[atari]'
```

### nkcr commented Nov 20, 2018

> Hey, I need some clarity on line 61: why do we use epx? Can you write down the backpropagation formulas for W1 and W2?

Here are the formulas involved in the forward and backward functions; they make it clear why epx is used. Picture the network: input x, hidden layer h, output p (I use general notation, but h and p correspond to the values in the code). The forward function computes:

    h = ReLU(W1 · x)
    p = sigmoid(W2 · h)

For the backward pass we use the standard backpropagation formulas, which give the gradients for W1 and W2. For W2 (note we use the transpose of h to match the dimensions):

    dW2 = dlogp · hᵀ

For W1:

    dh = W2ᵀ · dlogp, zeroed wherever h ≤ 0 (the ReLU)
    dW1 = dh · xᵀ

Stacked over a whole episode, the inputs x become epx, which is why policy_backward needs it.
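Reconstructed at toy sizes (H, D, and the episode length T are shrunk here for illustration; the gist uses H=200, D=6400), the backward pass makes the role of epx explicit:

```python
import numpy as np

np.random.seed(1)
H, D, T = 4, 6, 3                 # toy sizes for illustration

W1 = np.random.randn(H, D)
W2 = np.random.randn(H)

epx = np.random.randn(T, D)       # stacked episode inputs, one row per step
eph = np.maximum(0, epx @ W1.T)   # stacked hidden states (ReLU)
epdlogp = np.random.randn(T, 1)   # gradient w.r.t. the logit, per step

def policy_backward(epx, eph, epdlogp):
    dW2 = (eph.T @ epdlogp).ravel()   # dW2 = h^T . dlogp, summed over the episode
    dh = epdlogp @ W2[np.newaxis, :]  # push the gradient back through W2
    dh[eph <= 0] = 0                  # backprop through the ReLU
    dW1 = dh.T @ epx                  # dW1 = dh^T . x -- this is where epx enters
    return {'W1': dW1, 'W2': dW2}

grad = policy_backward(epx, eph, epdlogp)
```

The only change from the gist is that epx is passed in explicitly rather than read from the enclosing scope, addressing the point NHDaly raised above.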

### vbordalo commented Dec 5, 2018

Last line: note the position of the closing parenthesis: print ('ep %d: game finished, reward: %f' % (episode_number, reward) + ('' if reward == -1 else ' !!!!!!!!'))

### bassamalghram commented Dec 30, 2018

Hi, I have a problem with the code; can you help me? I am using Python 3.6 and the code seems to be wrong at these lines:

```python
grad_buffer = { k : np.zeros_like(v) for k,v in model.iteritems() } # update buffers that add up gradients over a batch
rmsprop_cache = { k : np.zeros_like(v) for k,v in model.iteritems() } # rmsprop memory
```

### bassamalghram commented Dec 30, 2018

The line env = pygame.make(pong-n0) is wrong; it says make is not an attribute of pygame.

### omkarv commented Dec 30, 2018

 For anyone trying to understand the code in this gist, I found the following video from Deep RL Bootcamp really helpful, probably more helpful than Andrej's blog: https://www.youtube.com/watch?v=tqrcjHuNdmQ

### DJCordhose commented Jan 1, 2019 • edited

 I made this code into a Colab Notebook running Python 3 that allows you to run this without local installation and keep training it. It also allows to display episodes played. I added some fixes/adjustments mentioned by several people in the comments here. https://colab.research.google.com/github/DJCordhose/ai/blob/master/notebooks/rl/pg-from-scratch.ipynb It performs ok after a few hours of training, but might take 1-2 days before it actually outperforms the computer player

### omkarv commented Jan 9, 2019

I've forked the repo, added some explanatory comments to the code, and fixed a minor bug, which helps me get a bit better performance (although this could be explained by randomness stumbling on a better reward return). Basically, the base repo trims 35px off the top of the input image but only 15px off the bottom, which means the network is still training on frames where the ball has already passed the paddle, which isn't useful. Trimming more off the bottom of the image seemed to boost performance: using a learning rate of 1e-3, it took 12,000 episodes to reach a trailing reward score of -5. With the bugfix in place and a learning rate of 1e-3, the trailing average reward at 10,000 episodes was around 3.

### StudentRa commented Feb 19, 2019

I have a 64-bit Windows platform; can this program work on it?

### gusbakker commented Apr 26, 2019 • edited

@lucdaodainhan or just simply set render = True :)

### roshray commented Aug 14, 2019

 HI, I have problem with the code can you help me I am using python3.6 and it seem the code is wrong at this line grad_buffer = { k : np.zeros_like(v) for k,v in model.iteritems() } # update buffers that add up gradients over a batch rmsprop_cache = { k : np.zeros_like(v) for k,v in model.iteritems() } # rmsprop memory Replace iteritems -->items. grad_buffer = { k : np.zeros_like(v) for k,v in model.items() } # update buffers that add up gradients over a batch rmsprop_cache = { k : np.zeros_like(v) for k,v in model.items() } # rmsprop memory