
@simoninithomas
Last active May 9, 2023 06:15
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Q* Learning with FrozenLake 🕹️⛄\n",
"<br> \n",
"In this Notebook, we'll implement an agent <b>that plays FrozenLake.</b>\n",
"<img src=\"frozenlake.png\" alt=\"Frozen Lake\"/>\n",
"\n",
"The goal of this game is <b>to go from the starting state (S) to the goal state (G)</b> by walking only on frozen tiles (F) and avoid holes (H).However, the ice is slippery, <b>so you won't always move in the direction you intend (stochastic environment)</b>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# This is a notebook from [Deep Reinforcement Learning Course with Tensorflow](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)\n",
"<img src=\"https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/docs/assets/img/DRLC%20Environments.png\" alt=\"Deep Reinforcement Course\"/>\n",
"<br>\n",
"<p> Deep Reinforcement Learning Course is a free series of articles and videos tutorials 🆕 about Deep Reinforcement Learning, where **we'll learn the main algorithms (Q-learning, Deep Q Nets, Dueling Deep Q Nets, Policy Gradients, A2C, Proximal Policy Gradients…), and how to implement them with Tensorflow.**\n",
"<br><br>\n",
" \n",
"📜The articles explain the architectures from the big picture to the mathematical details behind them.\n",
"<br>\n",
"📹 The videos explain how to build the agents with Tensorflow </b></p>\n",
"<br>\n",
"This course will give you a **solid foundation for understanding and implementing the future state of the art algorithms**. And, you'll build a strong professional portfolio by creating **agents that learn to play awesome environments**: Doom© 👹, Space invaders 👾, Outrun, Sonic the Hedgehog©, Michael Jackson’s Moonwalker, agents that will be able to navigate in 3D environments with DeepMindLab (Quake) and able to walk with Mujoco. \n",
"<br><br>\n",
"</p> \n",
"\n",
"## 📚 The complete [Syllabus HERE](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)\n",
"\n",
"\n",
"## Any questions 👨‍💻\n",
"<p> If you have any questions, feel free to ask me: </p>\n",
"<p> 📧: <a href=\"mailto:hello@simoninithomas.com\">hello@simoninithomas.com</a> </p>\n",
"<p> Github: https://github.com/simoninithomas/Deep_reinforcement_learning_Course </p>\n",
"<p> 🌐 : https://simoninithomas.github.io/Deep_reinforcement_learning_Course/ </p>\n",
"<p> Twitter: <a href=\"https://twitter.com/ThomasSimonini\">@ThomasSimonini</a> </p>\n",
"<p> Don't forget to <b> follow me on <a href=\"https://twitter.com/ThomasSimonini\">twitter</a>, <a href=\"https://github.com/simoninithomas/Deep_reinforcement_learning_Course\">github</a> and <a href=\"https://medium.com/@thomassimonini\">Medium</a> to be alerted of the new articles that I publish </b></p>\n",
" \n",
"## How to help 🙌\n",
"3 ways:\n",
"- **Clap our articles and like our videos a lot**:Clapping in Medium means that you really like our articles. And the more claps we have, the more our article is shared Liking our videos help them to be much more visible to the deep learning community.\n",
"- **Share and speak about our articles and videos**: By sharing our articles and videos you help us to spread the word. \n",
"- **Improve our notebooks**: if you found a bug or **a better implementation** you can send a pull request.\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites 🏗️\n",
"Before diving on the notebook **you need to understand**:\n",
"- The foundations of Reinforcement learning (MC, TD, Rewards hypothesis...) [Article](https://medium.freecodecamp.org/an-introduction-to-reinforcement-learning-4339519de419)\n",
"- Q-learning [Article](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe)\n",
"- In the [video version](https://www.youtube.com/watch?v=q2ZOEFAaaI0) we implemented a Q-learning agent that learns to play OpenAI Taxi-v2 🚕 with Numpy."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/q2ZOEFAaaI0?showinfo=0\" frameborder=\"0\" allow=\"autoplay; encrypted-media\" allowfullscreen></iframe>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython.display import HTML\n",
"HTML('<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/q2ZOEFAaaI0?showinfo=0\" frameborder=\"0\" allow=\"autoplay; encrypted-media\" allowfullscreen></iframe>')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 0: Import the dependencies 📚\n",
"We use 3 libraries:\n",
"- `Numpy` for our Qtable\n",
"- `OpenAI Gym` for our FrozenLake Environment\n",
"- `Random` to generate random numbers"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import gym\n",
"import random"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Create the environment 🎮\n",
"- Here we'll create the FrozenLake environment. \n",
"- OpenAI Gym is a library <b> composed of many environments that we can use to train our agents.</b>\n",
"- In our case we choose to use Frozen Lake."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"env = gym.make(\"FrozenLake-v0\")"
]
},
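{
"cell_type": "markdown",
"metadata": {},
"source": [
"- If you want a quick look at the map before training (an optional sketch, assuming the classic text rendering of `FrozenLake-v0`), you can reset the environment and render it: `S` is the start, `F` a frozen tile, `H` a hole, and `G` the goal."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: render the 4x4 FrozenLake map (text output in classic Gym).\n",
"# S = start, F = frozen tile, H = hole, G = goal.\n",
"env.reset()\n",
"env.render()"
]
},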
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Create the Q-table and initialize it 🗄️\n",
"- Now, we'll create our Q-table, to know how much rows (states) and columns (actions) we need, we need to calculate the action_size and the state_size\n",
"- OpenAI Gym provides us a way to do that: `env.action_space.n` and `env.observation_space.n`"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"action_size = env.action_space.n\n",
"state_size = env.observation_space.n"
]
},
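{
"cell_type": "markdown",
"metadata": {},
"source": [
"- A quick sanity check (optional sketch): for the 4x4 FrozenLake map we expect 16 states and 4 actions (left, down, right, up)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: 16 states and 4 actions for the 4x4 map.\n",
"print(\"Action size:\", action_size)\n",
"print(\"State size:\", state_size)"
]
},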
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]]\n"
]
}
],
"source": [
"qtable = np.zeros((state_size, action_size))\n",
"print(qtable)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Create the hyperparameters ⚙️\n",
"- Here, we'll specify the hyperparameters"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"total_episodes = 15000 # Total episodes\n",
"learning_rate = 0.8 # Learning rate\n",
"max_steps = 99 # Max steps per episode\n",
"gamma = 0.95 # Discounting rate\n",
"\n",
"# Exploration parameters\n",
"epsilon = 1.0 # Exploration rate\n",
"max_epsilon = 1.0 # Exploration probability at start\n",
"min_epsilon = 0.01 # Minimum exploration probability \n",
"decay_rate = 0.005 # Exponential decay rate for exploration prob"
]
},
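{
"cell_type": "markdown",
"metadata": {},
"source": [
"- With these values, the exploration rate follows the exponential decay schedule used in the training loop below:\n",
"\n",
"$$\\epsilon = \\epsilon_{min} + (\\epsilon_{max} - \\epsilon_{min}) \\, e^{-\\text{decay\\_rate} \\cdot \\text{episode}}$$\n",
"\n",
"- So the agent starts fully exploratory ($\\epsilon = 1.0$) and becomes mostly greedy ($\\epsilon \\approx 0.01$) as training progresses."
]
},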
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: The Q learning algorithm 🧠\n",
"- Now we implement the Q learning algorithm:\n",
"<img src=\"qtable_algo.png\" alt=\"Q algo\"/>"
]
},
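{
"cell_type": "markdown",
"metadata": {},
"source": [
"- In case the image above does not render, this is the update rule implemented in the cell below (with $\\alpha$ = learning_rate and $\\gamma$ = gamma):\n",
"\n",
"$$Q(s,a) \\leftarrow Q(s,a) + \\alpha \\left[ r + \\gamma \\max_{a'} Q(s',a') - Q(s,a) \\right]$$"
]
},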
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Score over time: 0.4755333333333333\n",
"[[3.09661199e-01 4.20986767e-02 4.09817720e-02 4.33154671e-02]\n",
" [3.04309088e-03 1.77615720e-02 1.75027968e-04 4.48805036e-02]\n",
" [1.17515610e-02 3.49659785e-03 1.25602764e-02 1.45895688e-02]\n",
" [5.30730075e-03 2.00738408e-03 2.10082319e-03 1.03044803e-02]\n",
" [3.74544071e-01 1.14433376e-02 4.25301395e-02 8.92078716e-03]\n",
" [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]\n",
" [2.45730220e-03 5.11951837e-05 2.32423145e-06 4.80236578e-07]\n",
" [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]\n",
" [1.15951273e-01 2.26517591e-02 2.95426375e-03 4.22247574e-01]\n",
" [2.73740942e-03 2.56680897e-01 5.08957170e-02 5.09211745e-02]\n",
" [7.61741394e-03 7.11600600e-01 3.66761331e-03 1.12599083e-02]\n",
" [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]\n",
" [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]\n",
" [3.78622225e-02 2.89343711e-02 4.23222346e-01 5.43340302e-02]\n",
" [1.34016966e-01 1.90320465e-01 1.39202525e-01 8.99555845e-01]\n",
" [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]\n"
]
}
],
"source": [
"# List of rewards\n",
"rewards = []\n",
"\n",
"# 2 For life or until learning is stopped\n",
"for episode in range(total_episodes):\n",
" # Reset the environment\n",
" state = env.reset()\n",
" step = 0\n",
" done = False\n",
" total_rewards = 0\n",
" \n",
" for step in range(max_steps):\n",
" # 3. Choose an action a in the current world state (s)\n",
" ## First we randomize a number\n",
" exp_exp_tradeoff = random.uniform(0, 1)\n",
" \n",
" ## If this number > greater than epsilon --> exploitation (taking the biggest Q value for this state)\n",
" if exp_exp_tradeoff > epsilon:\n",
" action = np.argmax(qtable[state,:])\n",
"\n",
" # Else doing a random choice --> exploration\n",
" else:\n",
" action = env.action_space.sample()\n",
"\n",
" # Take the action (a) and observe the outcome state(s') and reward (r)\n",
" new_state, reward, done, info = env.step(action)\n",
"\n",
" # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]\n",
" # qtable[new_state,:] : all the actions we can take from new state\n",
" qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])\n",
" \n",
" total_rewards += reward\n",
" \n",
" # Our new state is state\n",
" state = new_state\n",
" \n",
" # If done (if we're dead) : finish episode\n",
" if done == True: \n",
" break\n",
" \n",
" # Reduce epsilon (because we need less and less exploration)\n",
" epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) \n",
" rewards.append(total_rewards)\n",
"\n",
"print (\"Score over time: \" + str(sum(rewards)/total_episodes))\n",
"print(qtable)"
]
},
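{
"cell_type": "markdown",
"metadata": {},
"source": [
"- The score printed above averages over all 15 000 episodes, including the early, mostly random ones. As an optional check (a small sketch that only reuses the `rewards` list built above), we can compare the success rate at the beginning and at the end of training:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: compare early vs. late training success rates\n",
"# using the per-episode rewards collected during training.\n",
"print(\"Success rate over the first 1000 episodes:\", np.mean(rewards[:1000]))\n",
"print(\"Success rate over the last 1000 episodes: \", np.mean(rewards[-1000:]))"
]
},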
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5: Use our Q-table to play FrozenLake ! 👾\n",
"- After 10 000 episodes, our Q-table can be used as a \"cheatsheet\" to play FrozenLake\"\n",
"- By running this cell you can see our agent playing FrozenLake."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"env.reset()\n",
"\n",
"for episode in range(5):\n",
" state = env.reset()\n",
" step = 0\n",
" done = False\n",
" print(\"****************************************************\")\n",
" print(\"EPISODE \", episode)\n",
"\n",
" for step in range(max_steps):\n",
" \n",
" # Take the action (index) that have the maximum expected future reward given that state\n",
" action = np.argmax(qtable[state,:])\n",
" \n",
" new_state, reward, done, info = env.step(action)\n",
" \n",
" if done:\n",
" # Here, we decide to only print the last state (to see if our agent is on the goal or fall into an hole)\n",
" env.render()\n",
" \n",
" # We print the number of step it took.\n",
" print(\"Number of steps\", step)\n",
" break\n",
" state = new_state\n",
"env.close()"
]
}
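,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- The rendering above only shows a handful of episodes. As an optional extra, here is a minimal evaluation sketch (the `eval_env` and `eval_episodes` names are just for illustration): run the greedy policy for 100 episodes on a fresh environment and measure how often it reaches the goal."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: evaluate the greedy policy over 100 episodes.\n",
"# We create a fresh environment since the previous one was closed.\n",
"eval_env = gym.make(\"FrozenLake-v0\")\n",
"eval_episodes = 100\n",
"successes = 0\n",
"\n",
"for _ in range(eval_episodes):\n",
"    state = eval_env.reset()\n",
"    for _ in range(max_steps):\n",
"        # Always exploit: pick the action with the highest Q-value.\n",
"        action = np.argmax(qtable[state, :])\n",
"        state, reward, done, info = eval_env.step(action)\n",
"        if done:\n",
"            # In FrozenLake the reward is 1 only when the goal is reached.\n",
"            successes += reward\n",
"            break\n",
"\n",
"print(\"Greedy policy success rate:\", successes / eval_episodes)\n",
"eval_env.close()"
]
}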
],
"metadata": {
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@harshilpatel312

For some runs, the value of the qtable does not change (outputs all zeros after Step 4.). I tried fixing the seed and still get different qtables at the end. Could you tell me why this would be the case?

Btw, awesome work on the reinforcement learning articles!

@anubhavshrimal

> For some runs, the value of the qtable does not change (outputs all zeros after Step 4.). I tried fixing the seed and still get different qtables at the end. Could you tell me why this would be the case?

In step 4:

Remove:

step = 0
done = False

And add:
action = None
after
exp_exp_tradeoff = random.uniform(0, 1)

@rmihir96

---> 30 qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])
31
32

IndexError: arrays used as indices must be of integer (or boolean) type

Any idea why this is happening?

@anuyash49

My code is exactly the same, but I am getting a total of only 143 rewards in 10,000 (ten thousand) episodes. Very low accuracy.

@simoninithomas
Author

Hey there, this code is obsolete; check this instead: https://huggingface.co/learn/deep-rl-course/unit2/introduction
