@carlos-aguayo
Last active April 7, 2021 03:30
OpenAI Gym CartPole-v0
Cart-pole balancing solved using the Q-learning algorithm.
https://gym.openai.com/envs/CartPole-v0
https://gym.openai.com/evaluations/eval_kWknKOkPQ7izrixdhriurA
To run:
python CartPole-v0.py
import gym
import pandas as pd
import numpy as np
import random
# https://gym.openai.com/envs/CartPole-v0
# Carlos Aguayo - carlos.aguayo@gmail.com
class QLearner(object):
    def __init__(self,
                 num_states=100,
                 num_actions=4,
                 alpha=0.2,
                 gamma=0.9,
                 random_action_rate=0.5,
                 random_action_decay_rate=0.99):
        self.num_states = num_states
        self.num_actions = num_actions
        self.alpha = alpha
        self.gamma = gamma
        self.random_action_rate = random_action_rate
        self.random_action_decay_rate = random_action_decay_rate
        self.state = 0
        self.action = 0
        self.qtable = np.random.uniform(low=-1, high=1, size=(num_states, num_actions))

    def set_initial_state(self, state):
        """
        @summary: Sets the initial state and returns an action
        @param state: The initial state
        @returns: The selected action
        """
        self.state = state
        self.action = self.qtable[state].argsort()[-1]
        return self.action
    def move(self, state_prime, reward):
        """
        @summary: Moves to the given state with given reward and returns action
        @param state_prime: The new state
        @param reward: The reward
        @returns: The selected action
        """
        alpha = self.alpha
        gamma = self.gamma
        state = self.state
        action = self.action
        qtable = self.qtable

        # Epsilon-greedy selection: explore with probability random_action_rate.
        choose_random_action = (1 - self.random_action_rate) <= np.random.uniform(0, 1)

        if choose_random_action:
            action_prime = random.randint(0, self.num_actions - 1)
        else:
            action_prime = self.qtable[state_prime].argsort()[-1]

        self.random_action_rate *= self.random_action_decay_rate

        # Q-table update: blend the old estimate with the observed reward plus
        # the discounted value of the next state-action pair.
        qtable[state, action] = (1 - alpha) * qtable[state, action] + alpha * (reward + gamma * qtable[state_prime, action_prime])

        self.state = state_prime
        self.action = action_prime
        return self.action
def cart_pole_with_qlearning():
    env = gym.make('CartPole-v0')
    experiment_filename = './cartpole-experiment-1'
    env.monitor.start(experiment_filename, force=True)

    goal_average_steps = 195
    max_number_of_steps = 200
    number_of_iterations_to_average = 100
    number_of_features = env.observation_space.shape[0]
    last_time_steps = np.ndarray(0)

    cart_position_bins = pd.cut([-2.4, 2.4], bins=10, retbins=True)[1][1:-1]
    pole_angle_bins = pd.cut([-2, 2], bins=10, retbins=True)[1][1:-1]
    cart_velocity_bins = pd.cut([-1, 1], bins=10, retbins=True)[1][1:-1]
    angle_rate_bins = pd.cut([-3.5, 3.5], bins=10, retbins=True)[1][1:-1]

    def build_state(features):
        return int("".join(map(lambda feature: str(int(feature)), features)))

    def to_bin(value, bins):
        return np.digitize(x=[value], bins=bins)[0]

    learner = QLearner(num_states=10 ** number_of_features,
                       num_actions=env.action_space.n,
                       alpha=0.2,
                       gamma=1,
                       random_action_rate=0.5,
                       random_action_decay_rate=0.99)

    for episode in xrange(50000):
        observation = env.reset()
        cart_position, pole_angle, cart_velocity, angle_rate_of_change = observation
        state = build_state([to_bin(cart_position, cart_position_bins),
                             to_bin(pole_angle, pole_angle_bins),
                             to_bin(cart_velocity, cart_velocity_bins),
                             to_bin(angle_rate_of_change, angle_rate_bins)])
        action = learner.set_initial_state(state)

        for step in xrange(max_number_of_steps - 1):
            observation, reward, done, info = env.step(action)
            cart_position, pole_angle, cart_velocity, angle_rate_of_change = observation
            state_prime = build_state([to_bin(cart_position, cart_position_bins),
                                       to_bin(pole_angle, pole_angle_bins),
                                       to_bin(cart_velocity, cart_velocity_bins),
                                       to_bin(angle_rate_of_change, angle_rate_bins)])

            if done:
                reward = -200

            action = learner.move(state_prime, reward)

            if done:
                last_time_steps = np.append(last_time_steps, [int(step + 1)])
                if len(last_time_steps) > number_of_iterations_to_average:
                    last_time_steps = np.delete(last_time_steps, 0)
                break

        if last_time_steps.mean() > goal_average_steps:
            print "Goal reached!"
            print "Episodes before solve: ", episode + 1
            print u"Best 100-episode performance {} {} {}".format(last_time_steps.max(),
                                                                  unichr(177),  # plus minus sign
                                                                  last_time_steps.std())
            break

    env.monitor.close()


if __name__ == "__main__":
    random.seed(0)
    cart_pole_with_qlearning()
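
For reference, build_state packs the four bin indices (each 0-9 from to_bin) into a single integer by concatenating them as decimal digits, which is why the learner is created with num_states=10 ** number_of_features. A quick sketch of what it produces:

# Each observation component is digitized into one of 10 bins (index 0-9),
# and the indices are concatenated as decimal digits to form the state id.
build_state([3, 5, 7, 2])   # -> 3572
build_state([0, 5, 7, 2])   # -> 572 (the leading zero is dropped, but the id is still unique)
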
@ludobouan commented May 12, 2016

On line 100 you mixed up the order of the variables. CartPole-v0 returns the observation in this order:
[cart_position, cart_velocity, pole_angle, angle_rate_of_change].

The value of pole_angle is bounded by about -0.2 and 0.2, so with your current binning only two of the ten pole_angle intervals can ever be reached.

I tried doubling the number of intervals that pole_angle can reach (from two to four), and it doesn't learn:
https://gist.github.com/anonymous/70afe80acc3810cc6df50747b63b9203

Am I missing something?
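
A minimal sketch of the corrected unpacking, reusing the gist's build_state, to_bin, and bin arrays, and (as an assumption) narrowing pole_angle_bins to the roughly ±0.2 rad range that is actually reachable:

# Bin the pole angle over its reachable range (about ±0.2 rad, approximate),
# so more than two of the ten intervals can actually be hit.
pole_angle_bins = pd.cut([-0.2, 0.2], bins=10, retbins=True)[1][1:-1]

# Unpack in the order CartPole-v0 actually returns the observation.
observation = env.reset()
cart_position, cart_velocity, pole_angle, angle_rate_of_change = observation
state = build_state([to_bin(cart_position, cart_position_bins),
                     to_bin(cart_velocity, cart_velocity_bins),
                     to_bin(pole_angle, pole_angle_bins),
                     to_bin(angle_rate_of_change, angle_rate_bins)])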

@JKCooper2 commented May 12, 2016

I would recommend removing the if done: reward = -200 from line 117. Altering the rewards is normally discouraged because it makes the performance less indicative of the algorithm's ability, and it also uses knowledge unavailable to the agent.

Because this problem is also capped at 200 steps, the final state receives a very strong negative reward even when you run a perfect episode and reach 200 steps. Replacing it with if done and step < 200: would be better, but since that still adds knowledge the agent doesn't have of its environment, I would recommend removing the check altogether.

Another useful change is to decay the random action rate per episode rather than per move. Even with only a 1% decay, after 400 moves (2 complete runs) the random action rate drops from a 50% chance to below a 1% chance. You could move self.random_action_rate *= self.random_action_decay_rate from QLearner.move() into the if done: check inside the episode loop, and just have it as learner.random_action_rate *= learner.random_action_decay_rate.
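
A minimal sketch of the per-episode decay described above: delete the decay line from QLearner.move() and apply it once in the if done: branch of the episode loop.

# In QLearner.move(), remove this line so the rate no longer decays every step:
#     self.random_action_rate *= self.random_action_decay_rate

# In the episode loop of cart_pole_with_qlearning(), decay once per episode:
if done:
    learner.random_action_rate *= learner.random_action_decay_rate
    last_time_steps = np.append(last_time_steps, [int(step + 1)])
    if len(last_time_steps) > number_of_iterations_to_average:
        last_time_steps = np.delete(last_time_steps, 0)
    break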

@carlos-aguayo (Author) commented

Thanks! I really appreciate your feedback! I'll update it based on it and repost.

@Svalorzen commented

@JKCooper2 removing the check makes no sense; if anything, it is a shortcoming of the environment. Since the done variable is not directly visible to the agent and the rewards are always +1, there is no reason for the agent to do anything. As in, the environment itself does not encode any type of goal whatsoever.

Unless you include some type of signal for which states/actions are undesirable, there is no reason for the agent to learn what to do. Right now everybody is doing this by looking at the rendered images and knowing implicitly what the problem is asking them to do.

Imagine you were the agent, though. You receive a bunch of numbers. The environment says: good for you, here's +1! You then do something else. Good for you again. You never have any incentive to do, or avoid doing, anything in particular.

You could say the reward is implicitly encoded by the length of the episode. But that would pretty much mean that whenever you lose, you enter a terminal state which gives you zero reward forever. Since this is not included in the gym, you'd have to code it yourself, and at that point you might as well add the -200 reward check.
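
For what it's worth, the "terminal state with zero reward forever" idea can be encoded in the update itself rather than through a -200 penalty; a sketch, assuming move() is extended with a done flag that the original gist does not have:

def move(self, state_prime, reward, done=False):
    """Like the original move(), but on terminal transitions the discounted
    future term is dropped, since no further reward can follow."""
    choose_random_action = (1 - self.random_action_rate) <= np.random.uniform(0, 1)
    if choose_random_action:
        action_prime = random.randint(0, self.num_actions - 1)
    else:
        action_prime = self.qtable[state_prime].argsort()[-1]
    self.random_action_rate *= self.random_action_decay_rate

    # Zero future value on terminal transitions instead of a hand-tuned penalty.
    future = 0.0 if done else self.gamma * self.qtable[state_prime, action_prime]
    self.qtable[self.state, self.action] = ((1 - self.alpha) * self.qtable[self.state, self.action]
                                            + self.alpha * (reward + future))

    self.state = state_prime
    self.action = action_prime
    return self.action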

@carlos-aguayo (Author) commented

@blabby Good catch, I indeed mixed them up. However, fixing the order shouldn't change much.

@JKCooper2 Great catch on the decay rate, I hadn't noticed it was decaying so quickly. Moving it into set_initial_state does what you suggest.

@JKCooper2 and @Svalorzen I was wondering about the rewards as well. Initially I was confused by the fact that the environment doesn't return a negative reward when the cart fails to balance the pole; that is definitely a state we don't want to be in, and the agent should learn not to get near it.
At the same time, I realize that even without a bad reward, the agent should learn to collect as many positive rewards as possible, which implies keeping the pole balanced for as long as possible.
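
A sketch of that change to the decay, assuming the line simply moves from move() into set_initial_state() so it runs once per episode:

def set_initial_state(self, state):
    self.state = state
    self.action = self.qtable[state].argsort()[-1]
    # Decay the exploration rate once per episode instead of once per step
    # (the matching line in move() is removed).
    self.random_action_rate *= self.random_action_decay_rate
    return self.action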

@jgsonx commented May 29, 2016

Hi, I spent some time this weekend trying your CartPole-v0 code, but for some reason it doesn't converge. Do you use different variable values than the ones above? The only thing I changed to make the code work in Python 3.5 was xrange to range; everything else is intact. Like @blabby said: am I missing something?

@mitbal commented Jan 24, 2017

Hi, I always get this error when I try to run the script:

Error: Tried to reset environment which is not done. While the monitor is active for CartPole-v0, you cannot call reset() unless the episode is over.

This is strange, since reset is only called at the beginning of each episode, which in turn should only start after the previous episode reached the 'done' state from the step function. Any clue as to why this happens?

Thanks in advance!

@alexmcnulty commented

@mitbal I'm getting the same error after ~35 episodes, after changing the monitor setup to:

env = gym.wrappers.Monitor(env, experiment_filename, force=True)

Did you manage to fix the issue? I don't understand why it stops working.

@hoo223 commented Sep 18, 2019

@alexmcnulty I fixed the error by calling env.close() before env.reset().
