
# karpathy/pg-pong.py

Created May 30, 2016 22:50
Training a Neural Network ATARI Pong agent with Policy Gradients from raw pixels

### Grsz commented Nov 23, 2020

I'm trying to implement a model with TensorFlow following this gist, but in a more general way that supports cases with more than two actions, so I'm using sparse categorical cross-entropy. I've been spending weeks on this but can't make it work. I've done all sorts of research and tried many approaches (I made 12 versions of it; surprisingly, I found one that outperforms the original: it uses MSE for the loss and feeds the current state directly instead of the difference between the current and previous states, though that is probably just accidental luck). Despite a lot of testing, it performs terribly; actually, it doesn't learn at all.

Here's the code:

```python
import tensorflow as tf
import numpy as np
import gym

class Model(tf.keras.Model):
    def __init__(self, h_units, y_size):
        super().__init__()

        self.whs = [tf.keras.layers.Dense(h_size, 'relu') for h_size in h_units]
        self.wy = tf.keras.layers.Dense(y_size, 'sigmoid')

    def call(self, x):
        for wh in self.whs:
            x = wh(x)
        y = self.wy(x)

        return y

def prepro(state):
    """ prepro 210x160x3 uint8 frame into 6400 (80x80) 1D float vector """
    state = state[35:195]     # crop
    state = state[::2, ::2, 0]  # downsample by factor of 2
    state[state == 144] = 0   # erase background (background type 1)
    state[state == 109] = 0   # erase background (background type 2)
    state[state != 0] = 1     # everything else (paddles, ball) just set to 1
    return state.astype(np.float32).ravel().reshape([1, -1])

class Environment:
    def __init__(self, state_preprocessor):
        self.env = gym.make('Pong-v0')
        self.state_preprocessor = state_preprocessor

        self.prev_s = None

    def init(self):
        cur_s = self.env.reset()
        cur_s = self.state_preprocessor(cur_s)

        s = cur_s - np.zeros_like(cur_s)

        self.prev_s = cur_s

        return s

    def step(self, action):
        cur_s, r, done, info = self.env.step(action + 2)
        cur_s = self.state_preprocessor(cur_s)

        s = cur_s - self.prev_s

        self.prev_s = cur_s

        return s, r, done

# Runs the model (policy) with x (input),
# samples y (action) from y_pred (probability of actions).
# Takes:
#  - x - the input, tensor of the current state
#  - model - policy, returns probability of actions from state
# Returns:
#  - y - the sampled index of the action to take
@tf.function
def sample_action(x, model):
    y_pred = model(x)
    samples = tf.random.uniform(tf.shape(y_pred))

    y = y_pred - samples
    y = tf.reshape(tf.argmax(y, 1), [-1, 1])

    return y

# Reruns the model on the collected states and computes the loss
# between the sampled actions and the predictions, weighted by the
# discounted rewards.
# Takes:
#  - x - collected states, y - sampled actions, r - discounted rewards
#  - model - policy, loss_fn - loss function
# Returns:
#  - gradients of the model's weights based on the loss, and the loss
def get_gradients(x, y, r, model, loss_fn):
    with tf.GradientTape() as tape:
        y_pred = model(x)
        loss = loss_fn(y, y_pred, sample_weight=r)

    return tape.gradient(loss, model.trainable_variables), loss

# Discounts later rewards more than sooner ones:
# the final reward was much more likely caused
# by a recent action than by one at the beginning.
# Takes:
#  - full value rewards of timesteps
#  - discount multiplier
# Returns:
#  - discounted (and normalized) sum of rewards per timestep
def discount_rewards(d):
    def discount(rs):
        d_rs = np.zeros_like(rs)
        sum_rt = 0
        for t in reversed(range(rs.shape[0])):
            if rs[t] != 0: sum_rt = 0  # reset the sum at game boundaries (Pong-specific)
            # add rt to the discounted sum of rewards at t
            sum_rt = sum_rt * d + rs[t]
            d_rs[t] = sum_rt
        d_rs -= np.mean(d_rs)
        d_rs /= np.std(d_rs)
        return d_rs
    return discount

model = Model([200], 2)  # hidden layer sizes; this argument was blank in my paste, 200 matches the original gist
env = Environment(prepro)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE)
optimizer = tf.keras.optimizers.RMSprop(0.0001, 0.99)
discounter = discount_rewards(0.99)

epochs = 10000
batch_size = 10

# Train runs the model with the environment,
# collects states/actions/rewards per episode, and optimizes
# the model's weights at each epoch.
# Takes:
#  - model (policy), which takes the input x (state), and returns y (action)
#  - environment, which performs the action and returns the new state and reward;
#    must have methods:
#     - init() - initialize the state
#     - step(action) - perform the action; return the new state, the reward, and whether the episode is over
def train(model, env, loss_fn, optimizer, discounter, epochs, batch_size):
    for i in range(epochs):
        xs = []
        ys = []
        rs = []
        ep_rs = []

        for e in range(batch_size):
            done = False
            x = env.init()
            ep_r = 0

            while not done:
                xs.append(x)
                y = sample_action(x, model)
                ys.append(y)

                x, r, done = env.step(int(y.numpy()))
                rs.append(r)

                ep_r += r
            ep_rs.append(ep_r)
            print('Epoch:', i, 'Episode:', e, 'Reward:', ep_r)

        xs = tf.concat(xs, 0)
        ys = tf.concat(ys, 0)
        rs = np.vstack(rs)
        rs = discounter(rs)

        grads, loss = get_gradients(xs, ys, rs, model, loss_fn)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        print('Epoch:', i, 'Avg episode reward:', np.array(ep_rs).mean())

train(model, env, loss_fn, optimizer, discounter, epochs, batch_size)
```
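The `discount_rewards` closure above can be sanity-checked in isolation on a toy reward sequence. Here is a numpy-only sketch of the same backward pass, with the mean/std normalization left out so the raw discounted sums stay visible (an exaggerated discount of 0.5 makes the decay obvious):

```python
import numpy as np

def discount(rs, d=0.99):
    # Same backward pass as discount_rewards, without normalization.
    d_rs = np.zeros_like(rs, dtype=np.float64)
    sum_rt = 0.0
    for t in reversed(range(len(rs))):
        if rs[t] != 0: sum_rt = 0.0  # reset at game boundaries (Pong-specific)
        sum_rt = sum_rt * d + rs[t]
        d_rs[t] = sum_rt
    return d_rs

# A 4-step rally ending in a win: only the last step has a nonzero reward,
# and earlier steps receive exponentially less credit for it.
print(discount(np.array([0.0, 0.0, 0.0, 1.0]), d=0.5))
# -> [0.125 0.25  0.5   1.   ]
```

Note how a nonzero reward resets the running sum, so credit never leaks across Pong games within an episode.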

So, in summary: set up a model with two dense layers whose output size is the number of possible actions; subtract a random number from each output, take the index of the highest result, and use it as the action. Run the environment with it to get the next state and the reward. Store all selected actions, rewards, and states. Once the number of episodes reaches the batch size, rerun the model on the collected states to get the predictions, compute sparse categorical cross-entropy between the selected actions and the predictions using the discounted rewards as sample weights, get the gradients, and optimize the weights. Repeat.
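The loss step in that summary, sparse categorical cross-entropy weighted by discounted rewards, reduces to a per-timestep `-log p(chosen action) * advantage`. A numpy sketch makes that concrete (the names here are illustrative, not from the code above):

```python
import numpy as np

def weighted_sparse_ce(actions, probs, advantages):
    """Per-timestep REINFORCE-style loss: -log p(chosen action) * advantage.

    actions:    (T,)   int indices of the sampled actions
    probs:      (T, A) action probabilities from the policy
    advantages: (T,)   discounted rewards used as sample weights
    """
    p_chosen = probs[np.arange(len(actions)), actions]
    return -np.log(p_chosen) * advantages

probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
actions = np.array([0, 1])
advantages = np.array([1.0, -1.0])
losses = weighted_sparse_ce(actions, probs, advantages)
# A negative advantage flips the sign of the loss term, so gradient
# descent pushes probability away from actions that preceded a loss.
```

This is why the sample weights carry the learning signal: cross-entropy alone would reinforce every sampled action equally.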

While making and testing those versions, I also realized that the initial version somehow makes much better random guesses before any training (a mean of 20 over 10 episodes), while the TensorFlow version keeps scoring -21 in all 10 episodes across multiple independent tries.

It all makes sense, but it doesn't work. Why?

### eabase commented May 15, 2021

@Grsz That is a perfect question for StackOverflow...

### haluptzok commented Oct 29, 2021

Python 3 version of pg-pong.py with the minimum changes to make it work:
https://gist.github.com/haluptzok/d2a3eba5d25d238d6c2cbe847bc58b6b
Still a great policy gradient blog post and Python script, but Python 2 is so 2016 : )
Most folks reading this now will fire it up in Python 3, watch it blow up, and miss the fun experience.
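For reference, the changes involved are small. Here is a generic sketch of the usual Python 2 to 3 fixes for a script like pg-pong.py (not haluptzok's exact diff):

```python
# Typical Python 2 -> 3 fixes needed by old scripts like pg-pong.py:

import pickle  # Python 3: 'cPickle' is gone; plain 'pickle' now uses the fast C implementation
# import cPickle as pickle   # the Python 2 original

for episode in range(3):   # Python 3: 'xrange' no longer exists; 'range' is lazy
    print('episode %d' % episode)  # Python 3: print is a function, not a statement
```

With just those substitutions (print calls, `range`, `pickle`) the original script runs under Python 3.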

### WillianFuks commented Apr 5, 2022

For those interested in seeing this implemented on top of TensorFlow 2 running entirely on graph mode here's the repo:

https://github.com/WillianFuks/Pong

The AI trained fairly quickly: within a day it had already reached an average return of ~14 points. But then it plateaus there and doesn't improve much after that. I'm not sure how to improve it further, other than to keep tweaking the hyperparameters.

### SeunghyunSEO commented May 19, 2022 • edited

I modified some lines of pg-pong.py because it is too old (but gold).
In my case the rendering option did not work because of an openai-gym issue.
Please check this code if you want to train an agent to play Pong on Python 3.8 with gym>=0.21.0.

### yanhong-zhao-ef commented Jul 6, 2022

In case someone still wants to share a cool Colab demo: here is my notebook, which ended up achieving performance on par with the human opponent.