{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Understand PPO implementation playing Sonic the Hedgehog 2 and 3 🦔\n",
"\n",
"This Notebook explains the Proximal Policy Optimization implementation.<br>\n",
"The [repository link]() \n",
"\n",
"### Acknowledgements 👏\n",
"This implementation is based on 2 repositories:\n",
"- OpenAI [Baselines PPO2](https://github.com/openai/baselines/blob/24fe3d6576dd8f4cdd5f017805be689d6fa6be8c/baselines/ppo2/ppo2.py)\n",
"- Alexandre Borghi [retro_contest_agent](https://github.com/aborghi)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/PPO%20with%20Sonic%20the%20Hedgehog/assets/PPO.png\" alt=\"PPO\"/>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# This is a notebook from [Deep Reinforcement Learning Course with Tensorflow](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)\n",
"<img src=\"https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/docs/assets/img/DRLC%20Environments.png\" alt=\"Deep Reinforcement Course\"/>\n",
"<br>\n",
"<p> Deep Reinforcement Learning Course is a free series of articles and videos tutorials 🆕 about Deep Reinforcement Learning, where **we'll learn the main algorithms (Q-learning, Deep Q Nets, Dueling Deep Q Nets, Policy Gradients, A2C, Proximal Policy Gradients…), and how to implement them with Tensorflow.**\n",
"<br><br>\n",
" \n",
"📜The articles explain the architectures from the big picture to the mathematical details behind them.\n",
"<br>\n",
"📹 The videos explain how to build the agents with Tensorflow </b></p>\n",
"<br>\n",
"This course will give you a **solid foundation for understanding and implementing the future state of the art algorithms**. And, you'll build a strong professional portfolio by creating **agents that learn to play awesome environments**: Doom© 👹, Space invaders 👾, Outrun, Sonic the Hedgehog©, Michael Jackson’s Moonwalker, agents that will be able to navigate in 3D environments with DeepMindLab (Quake) and able to walk with Mujoco. \n",
"<br><br>\n",
"</p> \n",
"\n",
"## 📚 The complete [Syllabus HERE](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)\n",
"\n",
"\n",
"## Any questions 👨‍💻\n",
"<p> If you have any questions, feel free to ask me: </p>\n",
"<p> 📧: <a href=\"mailto:hello@simoninithomas.com\">hello@simoninithomas.com</a> </p>\n",
"<p> Github: https://github.com/simoninithomas/Deep_reinforcement_learning_Course </p>\n",
"<p> 🌐 : https://simoninithomas.github.io/Deep_reinforcement_learning_Course/ </p>\n",
"<p> Twitter: <a href=\"https://twitter.com/ThomasSimonini\">@ThomasSimonini</a> </p>\n",
"<p> Don't forget to <b> follow me on <a href=\"https://twitter.com/ThomasSimonini\">twitter</a>, <a href=\"https://github.com/simoninithomas/Deep_reinforcement_learning_Course\">github</a> and <a href=\"https://medium.com/@thomassimonini\">Medium</a> to be alerted of the new articles that I publish </b></p>\n",
" \n",
"## How to help 🙌\n",
"3 ways:\n",
"- **Clap our articles and like our videos a lot**:Clapping in Medium means that you really like our articles. And the more claps we have, the more our article is shared Liking our videos help them to be much more visible to the deep learning community.\n",
"- **Share and speak about our articles and videos**: By sharing our articles and videos you help us to spread the word. \n",
"- **Improve our notebooks**: if you found a bug or **a better implementation** you can send a pull request.\n",
"<br>\n",
"\n",
"## Important note 🤔\n",
"<b> You can run it on your computer but it's better to run it on GPU based services</b>, personally I use Microsoft Azure and their Deep Learning Virtual Machine (they offer 170$)\n",
"https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.dsvm-deep-learning\n",
"<br>\n",
"⚠️ I don't have any business relations with them. I just loved their excellent customer service.\n",
"\n",
"If you have some troubles to use Microsoft Azure follow the explainations of this excellent article here (without last the part fast.ai): https://medium.com/@manikantayadunanda/setting-up-deeplearning-machine-and-fast-ai-on-azure-a22eb6bd6429"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How to use it? 📖\n",
"⚠️ First you need to follow Step 1 (Download Sonic the Hedgehog series)\n",
"### Watch our agent playing 👀\n",
"- **Modify the environment**: change `env.make_train_1`with the env you want in `env= DummyVecEnv([env.make_train_1]))` in play.py\n",
"- **See the agent playing**: run `python play.py`\n",
"\n",
"### Continue to train 🏃‍♂️\n",
"⚠️ There is a big risk **of overfitting**\n",
"- Run `python agent.py`"
]
},
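{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, if `play.py` builds its environment along these lines (a sketch; the exact surrounding code in play.py may differ), you would swap `env.make_train_1` for another `make_train_*` function to watch a different level:\n",
"```python\n",
"from baselines.common.vec_env.dummy_vec_env import DummyVecEnv\n",
"\n",
"import sonic_env as env\n",
"\n",
"# Pick the level you want to watch here (make_train_1, make_train_2, ...)\n",
"environment = DummyVecEnv([env.make_train_1])\n",
"```"
]
},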
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# This notebook is a companion of the PPO implementation in [Deep Reinforcement Learning Course's repo]() where each part of the code is explained in the comments"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Download Sonic the Hedgehog 2 and 3💻"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. The first step is to download the games, to make it works on retro **you need to buy them legally on Steam**\n",
"<br><br>\n",
"[Sonic the Hedgehog 2](https://store.steampowered.com/app/71163/Sonic_The_Hedgehog_2/)<br>\n",
"[Sonic 3 & Knuckles](https://store.steampowered.com/app/71162/Sonic_3__Knuckles/)<br>\n",
"\n",
"<br>\n",
"<img src=\"https://steamcdn-a.akamaihd.net/steam/apps/71162/header.jpg?t=1522076247\" alt=\"Sonic the Hedgehog 3\"/>\n",
"<br>\n",
"2. Then follow the **Quickstart part** of this [website](https://contest.openai.com/2018-1/details/)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Build all elements we need for our environement in sonic_env.py 🖼️"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- `PreprocessFrame(gym.ObservationWrapper)` : in this class we will **preprocess our environment**\n",
" - Set frame to gray \n",
" - Resize the frame to 96x96x1\n",
"<br>\n",
"<br>\n",
"- `ActionsDiscretizer(gym.ActionWrapper)` : in this class we **limit the possibles actions in our environment** (make it discrete)\n",
"\n",
"In fact you'll see that for each action in actions:\n",
" Create an array of 12 False (12 = nb of buttons)\n",
" For each button in action: (for instance ['LEFT']) we need to make that left button index = True\n",
" Then the button index = LEFT = True\n",
"\n",
"--> In fact at the end we will have an array where each array is an action and **each elements True of this array are the buttons clicked.**\n",
"For instance LEFT action = [F, F, F, F, F, F, T, F, F, F, F, F]\n",
"<br><br>\n",
"- `RewardScaler(gym.RewardWrapper)`: We **scale the rewards** to reasonable scale (useful in PPO).\n",
"<br><br>\n",
"- `AllowBacktracking(gym.Wrapper)`: **Allow the agent to go backward without being discourage** (avoid our agent to be stuck on a wall during the game).\n",
"<br><br>\n",
"\n",
"- `make_env(env_idx)` : **Build an environement** (and stack 4 frames together using `FrameStack`\n",
"<br><br>\n",
"The idea is that we'll build multiple instances of the environment, different environments each times (different level) **to avoid overfitting and helping our agent to generalize better at playing sonic** \n",
"<br><br>\n",
"To handle these multiple environements we'll use `SubprocVecEnv` that **creates a vector of n environments to run them simultaneously.**"
]
},
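{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a minimal sketch of `ActionsDiscretizer`, following the description above (the button list and the chosen action combinations are illustrative and may differ from the ones in sonic_env.py):\n",
"```python\n",
"import gym\n",
"import numpy as np\n",
"\n",
"class ActionsDiscretizer(gym.ActionWrapper):\n",
"    \"\"\"Replace the 12-button action space with a small set of discrete actions.\"\"\"\n",
"    def __init__(self, env):\n",
"        super(ActionsDiscretizer, self).__init__(env)\n",
"        buttons = [\"B\", \"A\", \"MODE\", \"START\", \"UP\", \"DOWN\", \"LEFT\", \"RIGHT\", \"C\", \"Y\", \"X\", \"Z\"]\n",
"        actions = [[\"LEFT\"], [\"RIGHT\"], [\"LEFT\", \"DOWN\"], [\"RIGHT\", \"DOWN\"], [\"DOWN\"], [\"DOWN\", \"B\"], [\"B\"]]\n",
"        self._actions = []\n",
"        for action in actions:\n",
"            arr = np.array([False] * 12)           # 12 = number of buttons\n",
"            for button in action:\n",
"                arr[buttons.index(button)] = True  # press only the buttons of this combo\n",
"            self._actions.append(arr)\n",
"        self.action_space = gym.spaces.Discrete(len(self._actions))\n",
"\n",
"    def action(self, a):\n",
"        # Map a discrete action index to the 12-button boolean array the emulator expects\n",
"        return self._actions[a].copy()\n",
"```"
]
},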
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Build the PPO architecture in architecture.py 🧠"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- `from baselines.common.distributions import make_pdtype`: This function selects the **probability distribution over actions**\n",
"<br><br>\n",
"\n",
"- First, we create two functions that will help us to avoid to call conv and fc each time.\n",
" - `conv`: function to create a convolutional layer.\n",
" - `fc`: function to create a fully connected layer.\n",
"<br><br>\n",
"- Then, we create `PPOPolicy`, the object that **contains the architecture**<br>\n",
"3 CNN for spatial dependencies<br>\n",
"Temporal dependencies is handle by stacking frames<br>\n",
"(Something funny nobody use LSTM in OpenAI Retro contest)<br>\n",
"1 common FC<br>\n",
"1 FC for policy<br>\n",
"1 FC for value\n",
"<br><br>\n",
"- `self.pdtype = make_pdtype(action_space)`: Based on the action space, will **select what probability distribution typewe will use to distribute action in our stochastic policy** (in our case DiagGaussianPdType aka Diagonal Gaussian, multivariate normal distribution\n",
"<br><br>\n",
"- `self.pdtype.pdfromlatent` : return a returns a probability distribution over actions (self.pd) and our pi logits (self.pi).\n",
"<br><br>\n",
"\n",
"- We create also 3 useful functions in PPOPolicy\n",
" - `def step(state_in, *_args, **_kwargs)`: Function use to take a step returns **action to take, V(s) and neglogprob**\n",
" - `def value(state_in, *_args, **_kwargs)`: Function that calculates **only the V(s)**\n",
" - `def select_action(state_in, *_args, **_kwargs)`: Function that output **only the action to take**"
]
},
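{
"cell_type": "markdown",
"metadata": {},
"source": [
"A simplified sketch of the network that `PPOPolicy` builds, written here with plain `tf.layers` instead of the repo's `conv`/`fc` helpers (layer sizes and shapes are illustrative):\n",
"```python\n",
"import tensorflow as tf\n",
"\n",
"def build_network(states_in, n_actions):\n",
"    \"\"\"3 conv layers + 1 shared FC, then one FC head for the policy logits and one for V(s).\"\"\"\n",
"    scaled = tf.cast(states_in, tf.float32) / 255.0\n",
"    h = tf.layers.conv2d(scaled, filters=32, kernel_size=8, strides=4, activation=tf.nn.relu)\n",
"    h = tf.layers.conv2d(h, filters=64, kernel_size=4, strides=2, activation=tf.nn.relu)\n",
"    h = tf.layers.conv2d(h, filters=64, kernel_size=3, strides=1, activation=tf.nn.relu)\n",
"    h = tf.layers.flatten(h)\n",
"    shared_fc = tf.layers.dense(h, 512, activation=tf.nn.relu)  # 1 common FC layer\n",
"    pi_logits = tf.layers.dense(shared_fc, n_actions)           # 1 FC layer for the policy\n",
"    value = tf.layers.dense(shared_fc, 1)[:, 0]                 # 1 FC layer for the value V(s)\n",
"    return pi_logits, value\n",
"\n",
"# Example: 96x96 frames, 4 stacked, 7 discrete actions (illustrative numbers)\n",
"states_in = tf.placeholder(tf.uint8, [None, 96, 96, 4], name=\"states_in\")\n",
"pi_logits, value = build_network(states_in, n_actions=7)\n",
"```"
]
},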
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: Build the Model in model.py 🏗️"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use Model object to:\n",
"__init__:\n",
"- `policy(sess, ob_space, action_space, nenvs, 1, reuse=False)`: Creates the step_model (used for sampling)\n",
"- `policy(sess, ob_space, action_space, nenvs*nsteps, nsteps, reuse=True)`: Creates the train_model (used for training)\n",
"\n",
"- `def train(states_in, actions, returns, values, lr)`: Make the training part (calculate advantage and feedforward and retropropagation of gradients)\n",
"\n",
"First we create the placeholders:\n",
"- `oldneglopac_`: keep track of the old policy\n",
"- `oldvpred_`: keep track of the old value\n",
"- `cliprange_`: keep track of the cliprange\n",
"\n",
"save/load():\n",
"- `def save(save_path)`:Save the weights.\n",
"- `def load(load_path)`:Load the weights"
]
},
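{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make those placeholders concrete, here is a minimal sketch of the PPO clipped surrogate loss in the style of Baselines PPO2 (names and shapes are illustrative, and `neglogpac_new` would normally come from the train_model's distribution rather than a placeholder):\n",
"```python\n",
"import tensorflow as tf\n",
"\n",
"advantages_   = tf.placeholder(tf.float32, [None], name=\"advantages_\")\n",
"oldneglopac_  = tf.placeholder(tf.float32, [None], name=\"oldneglopac_\")   # -log pi_old(a|s)\n",
"neglogpac_new = tf.placeholder(tf.float32, [None], name=\"neglogpac_new\")  # -log pi_new(a|s)\n",
"cliprange_    = tf.placeholder(tf.float32, [], name=\"cliprange_\")\n",
"\n",
"# Probability ratio r = pi_new(a|s) / pi_old(a|s), computed from negative log probabilities\n",
"ratio = tf.exp(oldneglopac_ - neglogpac_new)\n",
"\n",
"# Clipped surrogate objective: take the pessimistic (maximum) of the two losses\n",
"pg_loss_unclipped = -advantages_ * ratio\n",
"pg_loss_clipped   = -advantages_ * tf.clip_by_value(ratio, 1.0 - cliprange_, 1.0 + cliprange_)\n",
"pg_loss = tf.reduce_mean(tf.maximum(pg_loss_unclipped, pg_loss_clipped))\n",
"```"
]
},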
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5: Build the Runner in model.py 🏃\n",
"Runner will be used to make a mini batch of experiences\n",
"- Each environement send 1 timestep (4 frames stacked) (`self.obs`)\n",
"- This goes through `step_model`\n",
" - Returns actions, values.\n",
"- Append `mb_obs`, `mb_actions`, `mb_values`, `mb_dones`, `mb_neglopacs`.\n",
"- Take actions in environments and watch the consequences\n",
" - return `obs`, `rewards`, `dones`, `infos` (which contains a ton of useful informations such as the number of rings, the player position etc...)\n",
" \n",
"- We need to calculate advantage to do that we use General Advantage Estimation"
]
},
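{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal NumPy sketch of Generalized Advantage Estimation as it is typically done in a PPO2-style runner (variable names mirror the mini-batch lists above; gamma and lambda values are illustrative):\n",
"```python\n",
"import numpy as np\n",
"\n",
"def compute_gae(mb_rewards, mb_values, mb_dones, last_value, gamma=0.99, lam=0.95):\n",
"    \"\"\"mb_dones[t] is 1.0 if the episode ended at step t; last_value is V(s_T) for bootstrapping.\"\"\"\n",
"    mb_rewards = np.asarray(mb_rewards, dtype=np.float32)\n",
"    mb_values = np.asarray(mb_values, dtype=np.float32)\n",
"    mb_dones = np.asarray(mb_dones, dtype=np.float32)\n",
"    mb_advs = np.zeros_like(mb_rewards)\n",
"    lastgaelam = 0.0\n",
"    for t in reversed(range(len(mb_rewards))):\n",
"        nextnonterminal = 1.0 - mb_dones[t]\n",
"        nextvalue = last_value if t == len(mb_rewards) - 1 else mb_values[t + 1]\n",
"        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), with V(s_{t+1}) zeroed at episode ends\n",
"        delta = mb_rewards[t] + gamma * nextvalue * nextnonterminal - mb_values[t]\n",
"        # GAE: A_t = delta_t + gamma * lambda * A_{t+1}\n",
"        mb_advs[t] = lastgaelam = delta + gamma * lam * nextnonterminal * lastgaelam\n",
"    mb_returns = mb_advs + mb_values  # used as the target for the value function\n",
"    return mb_returns, mb_advs\n",
"```"
]
},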
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 6: Build the learn function in model.py\n",
"The learn function can be seen as **the gathering of all the logic of our A2C**\n",
"- Instantiate the model object (that creates step_model and train_model)\n",
"- Instantiate the runner object\n",
"- Train always in two phases:\n",
" - Run to get a batch of experiences\n",
" - Train that batch of experiences\n",
"<br><br>\n",
"\n",
"We use explained_variance which is a **really important parameter**:\n",
"<br><br>\n",
"`ev = 1 - Variance[y - ypredicted] / Variance [y]`\n",
"<br><br>\n",
"In fact this calculates **if value function is a good predictor of the returns or if it's just worse than predicting nothing.**\n",
"ev=0 => might as well have predicted zero\n",
"ev<0 => worse than just predicting zero so you're overfitting (need to tune some hyperparameters)\n",
"ev=1 => perfect prediction\n",
"\n",
"--> The goal is that **ev goes closer and closer to 1.**"
]
},
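{
"cell_type": "markdown",
"metadata": {},
"source": [
"`explained_variance` is essentially the following one-liner (a NumPy sketch of the formula above, in the spirit of the Baselines helper):\n",
"```python\n",
"import numpy as np\n",
"\n",
"def explained_variance(ypred, y):\n",
"    \"\"\"ev = 1 - Var[y - ypred] / Var[y]; 1 is a perfect value function, <= 0 means it is useless.\"\"\"\n",
"    vary = np.var(y)\n",
"    return np.nan if vary == 0 else 1.0 - np.var(y - ypred) / vary\n",
"\n",
"# Typical use: compare the predicted values of a batch against the empirical returns, e.g.\n",
"# explained_variance(mb_values, mb_returns)\n",
"```"
]
},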
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 7: Build the play function in model.py\n",
"- This function will be use to play an environment using the trained model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 8: Build the agent.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- `config.gpu_options.allow_growth = True` : This creates a GPU session\n",
"- `model.learn(...)` : Here we just call the learn function that contains all the elements needed to train our PPO agent\n"
]
}
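,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how the GPU session is typically set up around the `learn` call (the exact `model.learn(...)` arguments live in the repo's model.py and are not reproduced here):\n",
"```python\n",
"import tensorflow as tf\n",
"\n",
"# allow_growth makes TensorFlow allocate GPU memory incrementally as needed,\n",
"# instead of reserving the whole GPU up front.\n",
"config = tf.ConfigProto()\n",
"config.gpu_options.allow_growth = True\n",
"\n",
"with tf.Session(config=config):\n",
"    pass  # the learn(...) call from model.py goes here in agent.py\n",
"```"
]
}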
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:gameplai]",
"language": "python",
"name": "conda-env-gameplai-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}