"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"# Q* Learning with FrozenLake 🕹️⛄\n",
"<br> \n",
"In this Notebook, we'll implement an agent <b>that plays FrozenLake.</b>\n",
"<img src=\"frozenlake.png\" alt=\"Frozen Lake\"/>\n",
"The goal of this game is <b>to go from the starting state (S) to the goal state (G)</b> by walking only on frozen tiles (F) and avoid holes (H).However, the ice is slippery, <b>so you won't always move in the direction you intend (stochastic environment)</b>"
"cell_type": "markdown",
"metadata": {},
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites 🏗️\n",
"Before diving on the notebook **you need to understand**:\n",
"- The foundations of Reinforcement learning (MC, TD, Rewards hypothesis...) [Article](\n",
"- Q-learning [Article](\n",
"- In the [video version]( we implemented a Q-learning agent that learns to play OpenAI Taxi-v2 🚕 with Numpy."
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
"data": {
"text/html": [
"<iframe width=\"560\" height=\"315\" src=\"\" frameborder=\"0\" allow=\"autoplay; encrypted-media\" allowfullscreen></iframe>"
"text/plain": [
"<IPython.core.display.HTML object>"
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
"source": [
"from IPython.display import HTML\n",
"HTML('<iframe width=\"560\" height=\"315\" src=\"\" frameborder=\"0\" allow=\"autoplay; encrypted-media\" allowfullscreen></iframe>')"
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 0: Import the dependencies 📚\n",
"We use 3 libraries:\n",
"- `Numpy` for our Qtable\n",
"- `OpenAI Gym` for our FrozenLake Environment\n",
"- `Random` to generate random numbers"
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import gym\n",
"import random"
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Create the environment 🎮\n",
"- Here we'll create the FrozenLake environment. \n",
"- OpenAI Gym is a library <b> composed of many environments that we can use to train our agents.</b>\n",
"- In our case we choose to use Frozen Lake."
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"env = gym.make(\"FrozenLake-v0\")"
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Create the Q-table and initialize it 🗄️\n",
"- Now, we'll create our Q-table, to know how much rows (states) and columns (actions) we need, we need to calculate the action_size and the state_size\n",
"- OpenAI Gym provides us a way to do that: `env.action_space.n` and `env.observation_space.n`"
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"action_size = env.action_space.n\n",
"state_size = env.observation_space.n"
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
"name": "stdout",
"output_type": "stream",
"text": [
"[[0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]\n",
" [0. 0. 0. 0.]]\n"
"source": [
"qtable = np.zeros((state_size, action_size))\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Create the hyperparameters ⚙️\n",
"- Here, we'll specify the hyperparameters"
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"total_episodes = 15000 # Total episodes\n",
"learning_rate = 0.8 # Learning rate\n",
"max_steps = 99 # Max steps per episode\n",
"gamma = 0.95 # Discounting rate\n",
"# Exploration parameters\n",
"epsilon = 1.0 # Exploration rate\n",
"max_epsilon = 1.0 # Exploration probability at start\n",
"min_epsilon = 0.01 # Minimum exploration probability \n",
"decay_rate = 0.005 # Exponential decay rate for exploration prob"
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: The Q learning algorithm 🧠\n",
"- Now we implement the Q learning algorithm:\n",
"<img src=\"qtable_algo.png\" alt=\"Q algo\"/>"
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
"name": "stdout",
"output_type": "stream",
"text": [
"Score over time: 0.4755333333333333\n",
"[[3.09661199e-01 4.20986767e-02 4.09817720e-02 4.33154671e-02]\n",
" [3.04309088e-03 1.77615720e-02 1.75027968e-04 4.48805036e-02]\n",
" [1.17515610e-02 3.49659785e-03 1.25602764e-02 1.45895688e-02]\n",
" [5.30730075e-03 2.00738408e-03 2.10082319e-03 1.03044803e-02]\n",
" [3.74544071e-01 1.14433376e-02 4.25301395e-02 8.92078716e-03]\n",
" [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]\n",
" [2.45730220e-03 5.11951837e-05 2.32423145e-06 4.80236578e-07]\n",
" [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]\n",
" [1.15951273e-01 2.26517591e-02 2.95426375e-03 4.22247574e-01]\n",
" [2.73740942e-03 2.56680897e-01 5.08957170e-02 5.09211745e-02]\n",
" [7.61741394e-03 7.11600600e-01 3.66761331e-03 1.12599083e-02]\n",
" [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]\n",
" [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]\n",
" [3.78622225e-02 2.89343711e-02 4.23222346e-01 5.43340302e-02]\n",
" [1.34016966e-01 1.90320465e-01 1.39202525e-01 8.99555845e-01]\n",
" [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]\n"
"source": [
"# List of rewards\n",
"rewards = []\n",
"# 2 For life or until learning is stopped\n",
"for episode in range(total_episodes):\n",
" # Reset the environment\n",
" state = env.reset()\n",
" step = 0\n",
" done = False\n",
" total_rewards = 0\n",
" \n",
" for step in range(max_steps):\n",
" # 3. Choose an action a in the current world state (s)\n",
" ## First we randomize a number\n",
" exp_exp_tradeoff = random.uniform(0, 1)\n",
" \n",
" ## If this number > greater than epsilon --> exploitation (taking the biggest Q value for this state)\n",
" if exp_exp_tradeoff > epsilon:\n",
" action = np.argmax(qtable[state,:])\n",
" # Else doing a random choice --> exploration\n",
" else:\n",
" action = env.action_space.sample()\n",
" # Take the action (a) and observe the outcome state(s') and reward (r)\n",
" new_state, reward, done, info = env.step(action)\n",
" # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]\n",
" # qtable[new_state,:] : all the actions we can take from new state\n",
" qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])\n",
" \n",
" total_rewards += reward\n",
" \n",
" # Our new state is state\n",
" state = new_state\n",
" \n",
" # If done (if we're dead) : finish episode\n",
" if done == True: \n",
" break\n",
" \n",
" # Reduce epsilon (because we need less and less exploration)\n",
" epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) \n",
" rewards.append(total_rewards)\n",
"print (\"Score over time: \" + str(sum(rewards)/total_episodes))\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5: Use our Q-table to play FrozenLake ! 👾\n",
"- After 10 000 episodes, our Q-table can be used as a \"cheatsheet\" to play FrozenLake\"\n",
"- By running this cell you can see our agent playing FrozenLake."
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for episode in range(5):\n",
" state = env.reset()\n",
" step = 0\n",
" done = False\n",
" print(\"****************************************************\")\n",
" print(\"EPISODE \", episode)\n",
" for step in range(max_steps):\n",
" \n",
" # Take the action (index) that have the maximum expected future reward given that state\n",
" action = np.argmax(qtable[state,:])\n",
" \n",
" new_state, reward, done, info = env.step(action)\n",
" \n",
" if done:\n",
" # Here, we decide to only print the last state (to see if our agent is on the goal or fall into an hole)\n",
" env.render()\n",
" \n",
" # We print the number of step it took.\n",
" print(\"Number of steps\", step)\n",
" break\n",
" state = new_state\n",
For some runs, the value of the qtable does not change (outputs all zeros after Step 4.). I tried fixing the seed and still get different qtables at the end. Could you tell me why this would be the case?

Btw, awesome work on the reinforcement learning articles!
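
Two things may contribute here, offered as a guess rather than a diagnosis. First, the training loop draws randomness from several places (`random.uniform`, `env.action_space.sample()`, and the slippery transitions inside the environment), so seeding only one of them leaves the others free to vary between runs. Second, `np.argmax` returns the first index when all values are equal, so on an all-zero row the greedy choice is always action 0. If the goal is never reached while epsilon is still high, the agent can get stuck repeating that action. The sketch below shows a tie-breaking alternative in plain Python; `argmax_random_tie` is a hypothetical helper, not something the notebook uses.

```python
import random


def argmax_random_tie(values, rng=random):
    """Index of the largest value, chosen uniformly at random among ties."""
    best = max(values)
    candidates = [i for i, v in enumerate(values) if v == best]
    return rng.choice(candidates)


rng = random.Random(0)       # dedicated, seeded generator for reproducibility
row = [0.0, 0.0, 0.0, 0.0]   # an untrained Q-table row
picks = {argmax_random_tie(row, rng) for _ in range(200)}
print(sorted(picks))         # which actions were selected on the all-zero row
```

In the notebook this would stand in for `np.argmax(qtable[state, :])`, for example as `argmax_random_tie(list(qtable[state, :]))`.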

In step 4:


step = 0
done = False

And add:
action = None
exp_exp_tradeoff = random.uniform(0, 1)

---> 30 qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])

IndexError: arrays used as indices must be of integer (or boolean) type

Any idea why this is happening?
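
A likely cause, assuming a newer Gym (0.26 or later) or Gymnasium install: there `env.reset()` returns an `(observation, info)` tuple and `env.step()` returns five values instead of four, so `state` ends up as a tuple and cannot index the Q-table. The helpers below are a compatibility sketch that accepts either return shape; the names `unpack_reset` and `unpack_step` are hypothetical.

```python
def unpack_reset(result):
    """Return the observation from either style of reset() result."""
    # Newer API: (observation, info). Older API: the observation alone.
    if isinstance(result, tuple):
        return result[0]
    return result


def unpack_step(result):
    """Return (new_state, reward, done, info) from either style of step() result."""
    if len(result) == 5:
        # Newer API: (observation, reward, terminated, truncated, info).
        new_state, reward, terminated, truncated, info = result
        return new_state, reward, terminated or truncated, info
    return result


# Stand-in values shaped like each API's return, so no environment is needed:
print(unpack_reset((0, {"prob": 1})))
print(unpack_reset(0))
print(unpack_step((4, 0.0, False, False, {})))
print(unpack_step((4, 0.0, False, {})))
```

In the training loop this would read `state = unpack_reset(env.reset())` and `new_state, reward, done, info = unpack_step(env.step(action))`.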

my code is exactly same but I am getting total 143 rewards in 10000(ten thousand) episode. very low accuracy


