
@simoninithomas
Last active May 9, 2023 06:15
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Q* Learning with FrozenLake 🕹️⛄\n",
"<br> \n",
"In this Notebook, we'll implement an agent <b>that plays FrozenLake.</b>\n",
"<img src=\"frozenlake.png\" alt=\"Frozen Lake\"/>\n",
"\n",
"The goal of this game is <b>to go from the starting state (S) to the goal state (G)</b> by walking only on frozen tiles (F) and avoid holes (H).However, the ice is slippery, <b>so you won't always move in the direction you intend (stochastic environment)</b>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# This is a notebook from [Deep Reinforcement Learning Course with Tensorflow](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)\n",
"<img src=\"https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/docs/assets/img/DRLC%20Environments.png\" alt=\"Deep Reinforcement Course\"/>\n",
"<br>\n",
"<p> Deep Reinforcement Learning Course is a free series of articles and videos tutorials 🆕 about Deep Reinforcement Learning, where **we'll learn the main algorithms (Q-learning, Deep Q Nets, Dueling Deep Q Nets, Policy Gradients, A2C, Proximal Policy Gradients…), and how to implement them with Tensorflow.**\n",
"<br><br>\n",
" \n",
"📜The articles explain the architectures from the big picture to the mathematical details behind them.\n",
"<br>\n",
"📹 The videos explain how to build the agents with Tensorflow </b></p>\n",
"<br>\n",
"This course will give you a **solid foundation for understanding and implementing the future state of the art algorithms**. And, you'll build a strong professional portfolio by creating **agents that learn to play awesome environments**: Doom© 👹, Space invaders 👾, Outrun, Sonic the Hedgehog©, Michael Jackson’s Moonwalker, agents that will be able to navigate in 3D environments with DeepMindLab (Quake) and able to walk with Mujoco. \n",
"<br><br>\n",
"</p> \n",
"\n",
"## 📚 The complete [Syllabus HERE](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)\n",
"\n",
"\n",
"## Any questions 👨‍💻\n",
"<p> If you have any questions, feel free to ask me: </p>\n",
"<p> 📧: <a href=\"mailto:hello@simoninithomas.com\">hello@simoninithomas.com</a> </p>\n",
"<p> Github: https://github.com/simoninithomas/Deep_reinforcement_learning_Course </p>\n",
"<p> 🌐 : https://simoninithomas.github.io/Deep_reinforcement_learning_Course/ </p>\n",
"<p> Twitter: <a href=\"https://twitter.com/ThomasSimonini\">@ThomasSimonini</a> </p>\n",
"<p> Don't forget to <b> follow me on <a href=\"https://twitter.com/ThomasSimonini\">twitter</a>, <a href=\"https://github.com/simoninithomas/Deep_reinforcement_learning_Course\">github</a> and <a href=\"https://medium.com/@thomassimonini\">Medium</a> to be alerted of the new articles that I publish </b></p>\n",
" \n",
"## How to help 🙌\n",
"3 ways:\n",
"- **Clap our articles and like our videos a lot**:Clapping in Medium means that you really like our articles. And the more claps we have, the more our article is shared Liking our videos help them to be much more visible to the deep learning community.\n",
"- **Share and speak about our articles and videos**: By sharing our articles and videos you help us to spread the word. \n",
"- **Improve our notebooks**: if you found a bug or **a better implementation** you can send a pull request.\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites 🏗️\n",
"Before diving on the notebook **you need to understand**:\n",
"- The foundations of Reinforcement learning (MC, TD, Rewards hypothesis...) [Article](https://medium.freecodecamp.org/an-introduction-to-reinforcement-learning-4339519de419)\n",
"- Q-learning [Article](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe)\n",
"- In the [video version](https://www.youtube.com/watch?v=q2ZOEFAaaI0) we implemented a Q-learning agent that learns to play OpenAI Taxi-v2 🚕 with Numpy."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/q2ZOEFAaaI0?showinfo=0\" frameborder=\"0\" allow=\"autoplay; encrypted-media\" allowfullscreen></iframe>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython.display import HTML\n",
"HTML('<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/q2ZOEFAaaI0?showinfo=0\" frameborder=\"0\" allow=\"autoplay; encrypted-media\" allowfullscreen></iframe>')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 0: Import the dependencies 📚\n",
"We use 3 libraries:\n",
"- `Numpy` for our Qtable\n",
"- `OpenAI Gym` for our FrozenLake Environment\n",
"- `Random` to generate random numbers"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import gym\n",
"import random"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Create the environment 🎮\n",
"- Here we'll create the FrozenLake environment. \n",
"- OpenAI Gym is a library <b> composed of many environments that we can use to train our agents.</b>\n",
"- In our case we choose to use Frozen Lake."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"env = gym.make(\"FrozenLake-v0\")"
]
},
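{
"cell_type": "markdown",
"metadata": {},
"source": [
"- If you want a quick look at the map before training (an optional sketch, assuming the classic text rendering of `FrozenLake-v0`), you can reset the environment and render it: `S` is the start, `F` a frozen tile, `H` a hole, and `G` the goal."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: render the 4x4 FrozenLake map (text output in classic Gym).\n",
"# S = start, F = frozen tile, H = hole, G = goal.\n",
"env.reset()\n",
"env.render()"
]
},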
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Create the Q-table and initialize it 🗄️\n",
"- Now, we'll create our Q-table, to know how much rows (states) and columns (actions) we need, we need to calculate the action_size and the state_size\n",
"- OpenAI Gym provides us a way to do that: `env.action_space.n` and `env.observation_space.n`"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"action_size = env.action_space.n\n",
"state_size = env.observation_space.n"
]
},
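{
"cell_type": "markdown",
"metadata": {},
"source": [
"- A quick sanity check (optional sketch): for the 4x4 FrozenLake map we expect 16 states and 4 actions (left, down, right, up)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: 16 states and 4 actions for the 4x4 map.\n",
"print(\"Action size:\", action_size)\n",
"print(\"State size:\", state_size)"
]
},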
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]]\n"
]
}
],
"source": [
"qtable = np.zeros((state_size, action_size))\n",
"print(qtable)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Create the hyperparameters ⚙️\n",
"- Here, we'll specify the hyperparameters"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"total_episodes = 15000 # Total episodes\n",
"learning_rate = 0.8 # Learning rate\n",
"max_steps = 99 # Max steps per episode\n",
"gamma = 0.95 # Discounting rate\n",
"\n",
"# Exploration parameters\n",
"epsilon = 1.0 # Exploration rate\n",
"max_epsilon = 1.0 # Exploration probability at start\n",
"min_epsilon = 0.01 # Minimum exploration probability \n",
"decay_rate = 0.005 # Exponential decay rate for exploration prob"
]
},
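{
"cell_type": "markdown",
"metadata": {},
"source": [
"- With these values, the exploration rate follows the exponential decay schedule used in the training loop below:\n",
"\n",
"$$\\epsilon = \\epsilon_{min} + (\\epsilon_{max} - \\epsilon_{min}) \\, e^{-\\text{decay\\_rate} \\cdot \\text{episode}}$$\n",
"\n",
"- So the agent starts fully exploratory ($\\epsilon = 1.0$) and becomes mostly greedy ($\\epsilon \\approx 0.01$) as training progresses."
]
},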
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: The Q learning algorithm 🧠\n",
"- Now we implement the Q learning algorithm:\n",
"<img src=\"qtable_algo.png\" alt=\"Q algo\"/>"
]
},
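{
"cell_type": "markdown",
"metadata": {},
"source": [
"- In case the image above does not render, this is the update rule implemented in the cell below (with $\\alpha$ = learning_rate and $\\gamma$ = gamma):\n",
"\n",
"$$Q(s,a) \\leftarrow Q(s,a) + \\alpha \\left[ r + \\gamma \\max_{a'} Q(s',a') - Q(s,a) \\right]$$"
]
},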
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Score over time: 0.4755333333333333\n",
"[[3.09661199e-01 4.20986767e-02 4.09817720e-02 4.33154671e-02]\n",
" [3.04309088e-03 1.77615720e-02 1.75027968e-04 4.48805036e-02]\n",
" [1.17515610e-02 3.49659785e-03 1.25602764e-02 1.45895688e-02]\n",
" [5.30730075e-03 2.00738408e-03 2.10082319e-03 1.03044803e-02]\n",
" [3.74544071e-01 1.14433376e-02 4.25301395e-02 8.92078716e-03]\n",
" [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]\n",
" [2.45730220e-03 5.11951837e-05 2.32423145e-06 4.80236578e-07]\n",
" [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]\n",
" [1.15951273e-01 2.26517591e-02 2.95426375e-03 4.22247574e-01]\n",
" [2.73740942e-03 2.56680897e-01 5.08957170e-02 5.09211745e-02]\n",
" [7.61741394e-03 7.11600600e-01 3.66761331e-03 1.12599083e-02]\n",
" [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]\n",
" [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]\n",
" [3.78622225e-02 2.89343711e-02 4.23222346e-01 5.43340302e-02]\n",
" [1.34016966e-01 1.90320465e-01 1.39202525e-01 8.99555845e-01]\n",
" [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]\n"
]
}
],
"source": [
"# List of rewards\n",
"rewards = []\n",
"\n",
"# 2 For life or until learning is stopped\n",
"for episode in range(total_episodes):\n",
" # Reset the environment\n",
" state = env.reset()\n",
" step = 0\n",
" done = False\n",
" total_rewards = 0\n",
" \n",
" for step in range(max_steps):\n",
" # 3. Choose an action a in the current world state (s)\n",
" ## First we randomize a number\n",
" exp_exp_tradeoff = random.uniform(0, 1)\n",
" \n",
" ## If this number > greater than epsilon --> exploitation (taking the biggest Q value for this state)\n",
" if exp_exp_tradeoff > epsilon:\n",
" action = np.argmax(qtable[state,:])\n",
"\n",
" # Else doing a random choice --> exploration\n",
" else:\n",
" action = env.action_space.sample()\n",
"\n",
" # Take the action (a) and observe the outcome state(s') and reward (r)\n",
" new_state, reward, done, info = env.step(action)\n",
"\n",
" # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]\n",
" # qtable[new_state,:] : all the actions we can take from new state\n",
" qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])\n",
" \n",
" total_rewards += reward\n",
" \n",
" # Our new state is state\n",
" state = new_state\n",
" \n",
" # If done (if we're dead) : finish episode\n",
" if done == True: \n",
" break\n",
" \n",
" # Reduce epsilon (because we need less and less exploration)\n",
" epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) \n",
" rewards.append(total_rewards)\n",
"\n",
"print (\"Score over time: \" + str(sum(rewards)/total_episodes))\n",
"print(qtable)"
]
},
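{
"cell_type": "markdown",
"metadata": {},
"source": [
"- The score printed above averages over all 15 000 episodes, including the early, mostly random ones. As an optional check (a small sketch that only reuses the `rewards` list built above), we can compare the success rate at the beginning and at the end of training:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: compare early vs. late training success rates\n",
"# using the per-episode rewards collected during training.\n",
"print(\"Success rate over the first 1000 episodes:\", np.mean(rewards[:1000]))\n",
"print(\"Success rate over the last 1000 episodes: \", np.mean(rewards[-1000:]))"
]
},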
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5: Use our Q-table to play FrozenLake ! 👾\n",
"- After 10 000 episodes, our Q-table can be used as a \"cheatsheet\" to play FrozenLake\"\n",
"- By running this cell you can see our agent playing FrozenLake."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"env.reset()\n",
"\n",
"for episode in range(5):\n",
" state = env.reset()\n",
" step = 0\n",
" done = False\n",
" print(\"****************************************************\")\n",
" print(\"EPISODE \", episode)\n",
"\n",
" for step in range(max_steps):\n",
" \n",
" # Take the action (index) that have the maximum expected future reward given that state\n",
" action = np.argmax(qtable[state,:])\n",
" \n",
" new_state, reward, done, info = env.step(action)\n",
" \n",
" if done:\n",
" # Here, we decide to only print the last state (to see if our agent is on the goal or fall into an hole)\n",
" env.render()\n",
" \n",
" # We print the number of step it took.\n",
" print(\"Number of steps\", step)\n",
" break\n",
" state = new_state\n",
"env.close()"
]
}
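,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- The rendering above only shows a handful of episodes. As an optional extra, here is a minimal evaluation sketch (the `eval_env` and `eval_episodes` names are just for illustration): run the greedy policy for 100 episodes on a fresh environment and measure how often it reaches the goal."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: evaluate the greedy policy over 100 episodes.\n",
"# We create a fresh environment since the previous one was closed.\n",
"eval_env = gym.make(\"FrozenLake-v0\")\n",
"eval_episodes = 100\n",
"successes = 0\n",
"\n",
"for _ in range(eval_episodes):\n",
"    state = eval_env.reset()\n",
"    for _ in range(max_steps):\n",
"        # Always exploit: pick the action with the highest Q-value.\n",
"        action = np.argmax(qtable[state, :])\n",
"        state, reward, done, info = eval_env.step(action)\n",
"        if done:\n",
"            # In FrozenLake the reward is 1 only when the goal is reached.\n",
"            successes += reward\n",
"            break\n",
"\n",
"print(\"Greedy policy success rate:\", successes / eval_episodes)\n",
"eval_env.close()"
]
}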
],
"metadata": {
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@harshilpatel312

For some runs, the value of the qtable does not change (outputs all zeros after Step 4.). I tried fixing the seed and still get different qtables at the end. Could you tell me why this would be the case?

Btw, awesome work on the reinforcement learning articles!

@anubhavshrimal

> For some runs, the value of the qtable does not change (outputs all zeros after Step 4.). I tried fixing the seed and still get different qtables at the end. Could you tell me why this would be the case?

In step 4:

Remove:

step = 0
done = False

And add:
action = None
after
exp_exp_tradeoff = random.uniform(0, 1)

@rmihir96

---> 30 qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])
31
32

IndexError: arrays used as indices must be of integer (or boolean) type

Any idea why this is happening?

@anuyash49

My code is exactly the same, but I am getting a total of only 143 rewards in 10,000 (ten thousand) episodes. Very low accuracy.

@simoninithomas
Author

Hey there, this code is obsolete; check this instead: https://huggingface.co/learn/deep-rl-course/unit2/introduction
