Solution to the Cartpole problem with policy gradients published on Medium
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Implementing Policy Gradients on CartPole with PyTorch"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import gym\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from tqdm import tqdm, trange\n",
"import pandas as pd\n",
"import torch\n",
"import torch.nn as nn\n",
"import torch.optim as optim\n",
"import torch.nn.functional as F\n",
"from torch.autograd import Variable\n",
"from torch.distributions import Categorical\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.\u001b[0m\n"
]
}
],
"source": [
"env = gym.make('CartPole-v1')\n",
"env.seed(1); torch.manual_seed(1);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Policy Gradients\n",
"A policy gradient method trains an agent without explicitly estimating the value of every state-action pair in an environment. Instead, it takes small steps and updates the policy directly based on the reward associated with each step. The agent can receive a reward immediately after an action, or it can receive the reward later, such as at the end of the episode. \n",
"We’ll designate the policy function our agent is trying to learn as $\\pi_\\theta(s,a)$, where $\\theta$ is the parameter vector, $s$ is a particular state, and $a$ is an action.\n",
"\n",
"We'll apply a technique called Monte-Carlo Policy Gradient (also known as REINFORCE), which means the agent runs through an entire episode and we then update our policy based on the rewards obtained."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model Construction\n",
"### Create Neural Network Model\n",
"We will use a simple feed-forward neural network with one hidden layer of 128 neurons and a dropout probability of 0.6. We'll use Adam as our optimizer with a learning rate of 0.01. In this setup, dropout significantly improves the performance of our policy. I encourage you to compare results with and without dropout and experiment with other hyperparameter values."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"#Hyperparameters\n",
"learning_rate = 0.01\n",
"gamma = 0.99"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"class Policy(nn.Module):\n",
"    def __init__(self):\n",
"        super(Policy, self).__init__()\n",
"        self.state_space = env.observation_space.shape[0]\n",
"        self.action_space = env.action_space.n\n",
"\n",
"        # Two bias-free linear layers: state -> 128 hidden units -> action scores\n",
"        self.l1 = nn.Linear(self.state_space, 128, bias=False)\n",
"        self.l2 = nn.Linear(128, self.action_space, bias=False)\n",
"\n",
"        self.gamma = gamma\n",
"\n",
"        # Episode policy and reward history\n",
"        self.policy_history = torch.Tensor()\n",
"        self.reward_episode = []\n",
"        # Overall reward and loss history\n",
"        self.reward_history = []\n",
"        self.loss_history = []\n",
"\n",
"    def forward(self, x):\n",
"        # Linear -> Dropout -> ReLU -> Linear -> Softmax over the actions\n",
"        model = torch.nn.Sequential(\n",
"            self.l1,\n",
"            nn.Dropout(p=0.6),\n",
"            nn.ReLU(),\n",
"            self.l2,\n",
"            nn.Softmax(dim=-1)\n",
"        )\n",
"        return model(x)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"policy = Policy()\n",
"optimizer = optim.Adam(policy.parameters(), lr=learning_rate)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Select Action\n",
"The select_action function chooses an action based on our policy probability distribution using the PyTorch distributions package. Our policy returns a probability for each possible action in our action space (move left or move right) as an array of length two such as [0.7, 0.3]. We then choose an action based on these probabilities, record our history, and return our action. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def select_action(state):\n",
"    # Select an action (0 or 1) by running the policy model and sampling from the action probabilities it returns\n",
"    state = torch.from_numpy(state).type(torch.FloatTensor)\n",
"    probs = policy(state)\n",
"    c = Categorical(probs)\n",
"    action = c.sample()\n",
"\n",
"    # Add log probability of our chosen action to our history\n",
"    # (reshape(1) turns the 0-dim log-prob into a 1-element tensor so it can be concatenated)\n",
"    policy.policy_history = torch.cat([policy.policy_history, c.log_prob(action).reshape(1)])\n",
"    return action"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reward $v_t$\n",
"We update our policy by taking a sample of the action value function $Q^{\\pi_\\theta} (s_t,a_t)$ by playing through episodes of the game. $Q^{\\pi_\\theta} (s_t,a_t)$ is defined as the expected return from taking action $a_t$ in state $s_t$ and then following policy $\\pi_\\theta$.\n",
"\n",
"We know that for every step the simulation continues we receive a reward of 1. We can use this to calculate the policy gradient at each time step, where $r$ is the reward for a particular state-action pair. Rather than using the instantaneous reward, $r$, we instead use a long-term reward $ v_{t} $, where $v_t$ is the discounted sum of all future rewards for the length of the episode. In this way, the **longer** the episode runs into the future, the **greater** the reward for a particular state-action pair in the present. $v_{t}$ is then,\n",
"\n",
"$$ v_{t} = \\sum_{k=0}^{N} \\gamma^{k}r_{t+k} $$\n",
"\n",
"where $\\gamma$ is the discount factor (0.99) and $N$ is the number of steps remaining in the episode after step $t$. For example, if an episode lasts 5 steps, the reward for each step will be [4.90, 3.94, 2.97, 1.99, 1].\n",
"Next we scale our reward vector by subtracting the mean from each element and scaling to unit variance by dividing by the standard deviation. This practice is common for machine learning applications and is the same operation as scikit-learn's __[StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)__. Centering the returns also acts as a simple baseline, so actions followed by above-average returns are reinforced and those followed by below-average returns are discouraged.\n",
"\n",
"### Update Policy\n",
"After each episode we apply Monte-Carlo Policy Gradient to improve our policy according to the equation:\n",
"\n",
"$$\\Delta\\theta_t = \\alpha\\nabla_\\theta \\, \\log \\pi_\\theta (s_t,a_t)v_t $$\n",
"\n",
"We will then feed our policy history (the log-probabilities of the chosen actions) multiplied by our rewards to our optimizer and update the weights of our neural network. Because PyTorch optimizers minimize a loss, we negate this product, so that minimizing the loss performs stochastic gradient *ascent* on the expected return. This should increase the likelihood of actions that got our agent a larger reward.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"def update_policy():\n",
"    R = 0\n",
"    rewards = []\n",
"\n",
"    # Discount future rewards back to the present using gamma\n",
"    for r in policy.reward_episode[::-1]:\n",
"        R = r + policy.gamma * R\n",
"        rewards.insert(0, R)\n",
"\n",
"    # Scale rewards to zero mean and unit variance\n",
"    rewards = torch.FloatTensor(rewards)\n",
"    rewards = (rewards - rewards.mean()) / (rewards.std() + np.finfo(np.float32).eps)\n",
"\n",
"    # Calculate loss: negative sum of log-probabilities weighted by the scaled returns\n",
"    loss = torch.sum(torch.mul(policy.policy_history, rewards).mul(-1), -1)\n",
"\n",
"    # Update network weights\n",
"    optimizer.zero_grad()\n",
"    loss.backward()\n",
"    optimizer.step()\n",
"\n",
"    # Save and initialize episode history counters\n",
"    policy.loss_history.append(loss.item())\n",
"    policy.reward_history.append(np.sum(policy.reward_episode))\n",
"    policy.policy_history = torch.Tensor()\n",
"    policy.reward_episode = []"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training\n",
"This is our main policy training loop. For each step in a training episode, we choose an action, take a step through the environment, and record the resulting new state and reward. We call update_policy() at the end of each episode to feed the episode history to our neural network and improve our policy."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"def main(episodes):\n",
"    running_reward = 10\n",
"    for episode in range(episodes):\n",
"        state = env.reset()  # Reset environment and record the starting state\n",
"        done = False\n",
"\n",
"        for time in range(1000):\n",
"            action = select_action(state)\n",
"            # Step through environment using chosen action\n",
"            state, reward, done, _ = env.step(action.item())\n",
"\n",
"            # Save reward\n",
"            policy.reward_episode.append(reward)\n",
"            if done:\n",
"                break\n",
"\n",
"        # Used to determine when the environment is solved.\n",
"        running_reward = (running_reward * 0.99) + (time * 0.01)\n",
"\n",
"        update_policy()\n",
"\n",
"        if episode % 50 == 0:\n",
"            print('Episode {}\\tLast length: {:5d}\\tAverage length: {:.2f}'.format(episode, time, running_reward))\n",
"\n",
"        if running_reward > env.spec.reward_threshold:\n",
"            print(\"Solved! Running reward is now {} and the last episode runs to {} time steps!\".format(running_reward, time))\n",
"            break\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run Model"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Episode 0\tLast length: 8\tAverage length: 9.98\n",
"Episode 50\tLast length: 80\tAverage length: 18.82\n",
"Episode 100\tLast length: 215\tAverage length: 47.54\n",
"Episode 150\tLast length: 433\tAverage length: 145.24\n",
"Episode 200\tLast length: 499\tAverage length: 233.92\n",
"Episode 250\tLast length: 499\tAverage length: 332.90\n",
"Episode 300\tLast length: 499\tAverage length: 383.54\n",
"Episode 350\tLast length: 499\tAverage length: 412.94\n",
"Episode 400\tLast length: 499\tAverage length: 446.52\n",
"Episode 450\tLast length: 227\tAverage length: 462.03\n",
"Episode 500\tLast length: 499\tAverage length: 453.68\n",
"Episode 550\tLast length: 499\tAverage length: 468.94\n",
"Solved! Running reward is now 475.15748930299014 and the last episode runs to 499 time steps!\n"
]
}
],
"source": [
"episodes = 1000\n",
"main(episodes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our policy solves the environment prior to reaching 600 episodes."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plot Results"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x7fbf542f4320>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"window = int(episodes/20)\n",
"\n",
"fig, ((ax1), (ax2)) = plt.subplots(2, 1, sharey=True, figsize=[9,9]);\n",
"rolling_mean = pd.Series(policy.reward_history).rolling(window).mean()\n",
"std = pd.Series(policy.reward_history).rolling(window).std()\n",
"ax1.plot(rolling_mean)\n",
"ax1.fill_between(range(len(policy.reward_history)),rolling_mean-std, rolling_mean+std, color='orange', alpha=0.2)\n",
"ax1.set_title('Episode Length Moving Average ({}-episode window)'.format(window))\n",
"ax1.set_xlabel('Episode'); ax1.set_ylabel('Episode Length')\n",
"\n",
"ax2.plot(policy.reward_history)\n",
"ax2.set_title('Episode Length')\n",
"ax2.set_xlabel('Episode'); ax2.set_ylabel('Episode Length')\n",
"\n",
"fig.tight_layout(pad=2)\n",
"plt.show()\n",
"#fig.savefig('results.png')\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
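
The discounting and scaling steps described in the notebook's "Reward $v_t$" cell can be checked in isolation. The sketch below is a standalone illustration using only the standard library rather than the notebook's own code: the helper names `discounted_returns` and `standardize` and the `eps` value are assumptions made for this example. It reproduces the 5-step example from the text with a discount factor of 0.99.

```python
import statistics


def discounted_returns(rewards, gamma=0.99):
    """Compute v_t = r_t + gamma * v_{t+1} for every step, working back to front."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def standardize(values, eps=1e-8):
    """Shift to zero mean and divide by the sample standard deviation (n - 1).

    eps is a placeholder guard against division by zero, not the notebook's value.
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / (std + eps) for v in values]


# A 5-step episode with a reward of 1 per step, as in the example in the text.
returns = discounted_returns([1.0] * 5)
scaled = standardize(returns)

print([round(v, 2) for v in returns])  # [4.9, 3.94, 2.97, 1.99, 1.0]
print([round(v, 2) for v in scaled])
```

After scaling, the early steps of the episode carry positive weights and the last steps carry negative weights, which is what the loss in `update_policy()` multiplies against the stored log-probabilities.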