{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Deep Q learning with Doom 🕹️\n",
"In this notebook we'll implement an agent <b>that plays Doom by using a Deep Q learning architecture.</b> <br>\n",
"Our agent playing Doom:\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/DQN%20Doom/assets/doom.gif\" style=\"max-width: 600px;\" alt=\"Deep Q learning with Doom\"/>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# This is a notebook from Deep Reinforcement Learning Course with Tensorflow\n",
"<img src=\"https://simoninithomas.github.io/Deep_reinforcement_learning_Course/assets/img/preview.jpg\" alt=\"Deep Reinforcement Course\" style=\"width: 500px;\"/>\n",
"\n",
"<p> Deep Reinforcement Learning Course is a free series of blog posts and videos 🆕 about Deep Reinforcement Learning, where we'll learn the main algorithms, and how to implement them with Tensorflow.\n",
"\n",
"📜The articles explain the concept from the big picture to the mathematical details behind it.\n",
"\n",
"📹 The videos explain how to create the agent with Tensorflow </b></p>\n",
"\n",
"## <a href=\"https://simoninithomas.github.io/Deep_reinforcement_learning_Course/\">Syllabus</a><br>\n",
"### 📜 Part 1: Introduction to Reinforcement Learning [ARTICLE](https://medium.freecodecamp.org/an-introduction-to-reinforcement-learning-4339519de419) \n",
"\n",
"### Part 2: Q-learning with FrozenLake \n",
"#### 📜 [ARTICLE](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe) // [FROZENLAKE IMPLEMENTATION](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Q%20learning/Q%20Learning%20with%20FrozenLake.ipynb)\n",
"#### 📹 [Implementing a Q-learning agent that plays Taxi-v2 🚕](https://youtu.be/q2ZOEFAaaI0) \n",
"\n",
"### Part 3: Deep Q-learning with Doom\n",
"#### 📜 [ARTICLE](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8) // [DOOM IMPLEMENTATION](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/DQN%20Doom/Deep%20Q%20learning%20with%20Doom.ipynb)\n",
"#### 📹 [Create a DQN Agent that learns to play Atari Space Invaders 👾 ](https://youtu.be/gCJyVX98KJ4)\n",
"\n",
"### Part 3+: Improvments in Deep Q-Learning\n",
"#### 📜 [ARTICLE (📅 JUNE)] \n",
"#### 📹 [Create an Agent that learns to play Doom Deadly corridor (📅 06/20 )] \n",
"\n",
"### Part 4: Policy Gradients with Doom \n",
"#### 📜 [ARTICLE](https://medium.freecodecamp.org/an-introduction-to-policy-gradients-with-cartpole-and-doom-495b5ef2207f) // [CARTPOLE IMPLEMENTATION](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Policy%20Gradients/Cartpole/Cartpole%20REINFORCE%20Monte%20Carlo%20Policy%20Gradients.ipynb) // [DOOM IMPLEMENTATION](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Policy%20Gradients/Doom/Doom%20REINFORCE%20Monte%20Carlo%20Policy%20gradients.ipynb)\n",
"#### 📹 [Create an Agent that learns to play Doom deathmatch (📅 06/27)] \n",
"\n",
"### Part 5: Advantage Advantage Actor Critic (A2C) \n",
"#### 📜 [ARTICLE (📅 June)] \n",
"#### 📹 [Create an Agent that learns to play Outrun (📅 07/04)] \n",
"\n",
"### Part 6: Asynchronous Advantage Actor Critic (A3C) \n",
"#### 📜 [ARTICLE (📅 July)] \n",
"#### 📹 [Create an Agent that learns to play Michael Jackson's Moonwalker (📅 07/11)] \n",
"\n",
"### Part 7: Proximal Policy Gradients \n",
"#### 📜 [ARTICLE (📅 July)]\n",
"#### 📹 [Create an Agent that learns to play walk with Mujoco (📅 07/18)]\n",
"\n",
"### Part 8: TBA \n",
"\n",
"## Any questions 👨‍💻\n",
"<p> If you have any questions, feel free to ask me: </p>\n",
"<p> 📧: <a href=\"mailto:hello@simoninithomas.com\">hello@simoninithomas.com</a> </p>\n",
"<p> Github: https://github.com/simoninithomas/Deep_reinforcement_learning_Course </p>\n",
"<p> 🌐 : https://simoninithomas.github.io/Deep_reinforcement_learning_Course/ </p>\n",
"<p> Twitter: <a href=\"https://twitter.com/ThomasSimonini\">@ThomasSimonini</a> </p>\n",
"<p> Don't forget to <b> follow me on <a href=\"https://twitter.com/ThomasSimonini\">twitter</a>, <a href=\"https://github.com/simoninithomas/Deep_reinforcement_learning_Course\">github</a> and <a href=\"https://medium.com/@thomassimonini\">Medium</a> to be alerted of the new articles that I publish </b></p>\n",
" \n",
"## How to help 🙌\n",
"3 ways:\n",
"- **Clap our articles and like our videos a lot**:Clapping in Medium means that you really like our articles. And the more claps we have, the more our article is shared Liking our videos help them to be much more visible to the deep learning community.\n",
"- **Share and speak about our articles and videos**: By sharing our articles and videos you help us to spread the word. \n",
"- **Improve our notebooks**: if you found a bug or **a better implementation** you can send a pull request.\n",
"<br>\n",
"\n",
"## Important note 🤔\n",
"<b> You can run it on your computer but it's better to run it on GPU based services</b>, personally I use Microsoft Azure and their Deep Learning Virtual Machine (they offer 170$)\n",
"https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.dsvm-deep-learning\n",
"<br>\n",
"⚠️ I don't have any business relations with them. I just loved their excellent customer service.\n",
"\n",
"If you have some troubles to use Microsoft Azure follow the explainations of this excellent article here (without last the part fast.ai): https://medium.com/@manikantayadunanda/setting-up-deeplearning-machine-and-fast-ai-on-azure-a22eb6bd6429"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Import the libraries 📚"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import tensorflow as tf # Deep Learning library\n",
"import numpy as np # Handle matrices\n",
"from vizdoom import * # Doom Environment\n",
"\n",
"import random # Handling random number generation\n",
"import time # Handling time calculation\n",
"from skimage import transform# Help us to preprocess the frames\n",
"\n",
"from collections import deque# Ordered collection with ends\n",
"import matplotlib.pyplot as plt # Display graphs\n",
"\n",
"import warnings # This ignore all the warning messages that are normally printed during the training because of skiimage\n",
"warnings.filterwarnings('ignore') "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Create our environment 🎮\n",
"- Now that we imported the libraries/dependencies, we will create our environment.\n",
"- Doom environment takes:\n",
" - A `configuration file` that **handle all the options** (size of the frame, possible actions...)\n",
" - A `scenario file`: that **generates the correct scenario** (in our case basic **but you're invited to try other scenarios**).\n",
"- Note: We have 3 possible actions `[[0,0,1], [1,0,0], [0,1,0]]` so we don't need to do one hot encoding (thanks to < a href=\"https://stackoverflow.com/users/2237916/silgon\">silgon</a> for figuring out. \n",
"\n",
"### Our environment\n",
"<img src=\"assets/doom.png\" style=\"max-width:500px;\" alt=\"Doom\"/>\n",
" \n",
"- A monster is spawned **randomly somewhere along the opposite wall**. \n",
"- Player can only go **left/right and shoot**. \n",
"- 1 hit is enough **to kill the monster**. \n",
"- Episode finishes when **monster is killed or on timeout (300)**.\n",
"<br><br>\n",
"REWARDS:\n",
"\n",
"- +101 for killing the monster \n",
"- -5 for missing \n",
"- Episode ends after killing the monster or on timeout.\n",
"- living reward = -1"
]
},
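{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you don't have `basic.cfg` and `basic.wad` next to this notebook, they ship with the ViZDoom package itself (see the comments at the bottom of this gist). A minimal sketch to locate them, assuming ViZDoom was installed with pip (`scenarios_dir` is just a local helper variable):\n",
"\n",
"```python\n",
"import os\n",
"\n",
"import vizdoom\n",
"\n",
"# Locate the scenario files bundled with the ViZDoom package\n",
"scenarios_dir = os.path.join(os.path.dirname(vizdoom.__file__), 'scenarios')\n",
"print(os.path.join(scenarios_dir, 'basic.cfg'))\n",
"print(os.path.join(scenarios_dir, 'basic.wad'))\n",
"```"
]
},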
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"Here we create our environment\n",
"\"\"\"\n",
"def create_environment():\n",
" game = DoomGame()\n",
" \n",
" # Load the correct configuration\n",
" game.load_config(\"basic.cfg\")\n",
" \n",
" # Load the correct scenario (in our case basic scenario)\n",
" game.set_doom_scenario_path(\"basic.wad\")\n",
" \n",
" game.init()\n",
" \n",
" # Here our possible actions\n",
" left = [1, 0, 0]\n",
" right = [0, 1, 0]\n",
" shoot = [0, 0, 1]\n",
" possible_actions = [left, right, shoot]\n",
" \n",
" return game, possible_actions\n",
" \n",
"\"\"\"\n",
"Here we performing random action to test the environment\n",
"\"\"\"\n",
"def test_environment():\n",
" game = DoomGame()\n",
" game.load_config(\"basic.cfg\")\n",
" game.set_doom_scenario_path(\"basic.wad\")\n",
" game.init()\n",
" shoot = [0, 0, 1]\n",
" left = [1, 0, 0]\n",
" right = [0, 1, 0]\n",
" actions = [shoot, left, right]\n",
"\n",
" episodes = 10\n",
" for i in range(episodes):\n",
" game.new_episode()\n",
" while not game.is_episode_finished():\n",
" state = game.get_state()\n",
" img = state.screen_buffer\n",
" misc = state.game_variables\n",
" action = random.choice(actions)\n",
" print(action)\n",
" reward = game.make_action(action)\n",
" print (\"\\treward:\", reward)\n",
" time.sleep(0.02)\n",
" print (\"Result:\", game.get_total_reward())\n",
" time.sleep(2)\n",
" game.close()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"game, possible_actions = create_environment()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Define the preprocessing functions ⚙️\n",
"### preprocess_frame\n",
"Preprocessing is an important step, <b>because we want to reduce the complexity of our states to reduce the computation time needed for training.</b>\n",
"<br><br>\n",
"Our steps:\n",
"- Grayscale each of our frames (because <b> color does not add important information </b>). But this is already done by the config file.\n",
"- Crop the screen (in our case we remove the roof because it contains no information)\n",
"- We normalize pixel values\n",
"- Finally we resize the preprocessed frame"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
" preprocess_frame:\n",
" Take a frame.\n",
" Resize it.\n",
" __________________\n",
" | |\n",
" | |\n",
" | |\n",
" | |\n",
" |_________________|\n",
" \n",
" to\n",
" _____________\n",
" | |\n",
" | |\n",
" | |\n",
" |____________|\n",
" Normalize it.\n",
" \n",
" return preprocessed_frame\n",
" \n",
" \"\"\"\n",
"def preprocess_frame(frame):\n",
" # Greyscale frame already done in our vizdoom config\n",
" # x = np.mean(frame,-1)\n",
" \n",
" # Crop the screen (remove the roof because it contains no information)\n",
" cropped_frame = frame[30:-10,30:-30]\n",
" \n",
" # Normalize Pixel Values\n",
" normalized_frame = cropped_frame/255.0\n",
" \n",
" # Resize\n",
" preprocessed_frame = transform.resize(normalized_frame, [84,84])\n",
" \n",
" return preprocessed_frame"
]
},
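{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check of the preprocessing, assuming the environment from Step 2 is already initialized (the exact raw shape depends on the resolution set in `basic.cfg`):\n",
"\n",
"```python\n",
"# Grab one raw frame and compare shapes before and after preprocessing\n",
"game.new_episode()\n",
"raw_frame = game.get_state().screen_buffer\n",
"print('raw frame shape:', raw_frame.shape)\n",
"print('preprocessed shape:', preprocess_frame(raw_frame).shape)  # expected: (84, 84)\n",
"```"
]
},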
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### stack_frames\n",
"👏 This part was made possible thanks to help of <a href=\"https://github.com/Miffyli\">Anssi</a><br>\n",
"\n",
"As explained in this really <a href=\"https://danieltakeshi.github.io/2016/11/25/frame-skipping-and-preprocessing-for-deep-q-networks-on-atari-2600-games/\"> good article </a> we stack frames.\n",
"\n",
"Stacking frames is really important because it helps us to **give have a sense of motion to our Neural Network.**\n",
"\n",
"- First we preprocess frame\n",
"- Then we append the frame to the deque that automatically **removes the oldest frame**\n",
"- Finally we **build the stacked state**\n",
"\n",
"This is how work stack:\n",
"- For the first frame, we feed 4 frames\n",
"- At each timestep, **we add the new frame to deque and then we stack them to form a new stacked frame**\n",
"- And so on\n",
"<img src=\"https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/DQN/Space%20Invaders/assets/stack_frames.png\" alt=\"stack\">\n",
"- If we're done, **we create a new stack with 4 new frames (because we are in a new episode)**."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"stack_size = 4 # We stack 4 frames\n",
"\n",
"# Initialize deque with zero-images one array for each image\n",
"stacked_frames = deque([np.zeros((84,84), dtype=np.int) for i in range(stack_size)], maxlen=4) \n",
"\n",
"def stack_frames(stacked_frames, state, is_new_episode):\n",
" # Preprocess frame\n",
" frame = preprocess_frame(state)\n",
" \n",
" if is_new_episode:\n",
" # Clear our stacked_frames\n",
" stacked_frames = deque([np.zeros((84,84), dtype=np.int) for i in range(stack_size)], maxlen=4)\n",
" \n",
" # Because we're in a new episode, copy the same frame 4x\n",
" stacked_frames.append(frame)\n",
" stacked_frames.append(frame)\n",
" stacked_frames.append(frame)\n",
" stacked_frames.append(frame)\n",
" \n",
" # Stack the frames\n",
" stacked_state = np.stack(stacked_frames, axis=2)\n",
" \n",
" else:\n",
" # Append frame to deque, automatically removes the oldest frame\n",
" stacked_frames.append(frame)\n",
"\n",
" # Build the stacked state (first dimension specifies different frames)\n",
" stacked_state = np.stack(stacked_frames, axis=2) \n",
" \n",
" return stacked_state, stacked_frames"
]
},
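{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small check of `stack_frames`, assuming the environment is still running; the stacked state should have the shape the network expects as input:\n",
"\n",
"```python\n",
"# Build a stacked state from the current frame (new episode: the frame is copied 4 times)\n",
"game.new_episode()\n",
"frame = game.get_state().screen_buffer\n",
"stacked_state, stacked_frames = stack_frames(stacked_frames, frame, True)\n",
"print(stacked_state.shape)  # expected: (84, 84, 4)\n",
"```"
]
},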
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: Set up our hyperparameters ⚗️\n",
"In this part we'll set up our different hyperparameters. But when you implement a Neural Network by yourself you will **not implement hyperparamaters at once but progressively**.\n",
"\n",
"- First, you begin by defining the neural networks hyperparameters when you implement the model.\n",
"- Then, you'll add the training hyperparameters when you implement the training algorithm."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"### MODEL HYPERPARAMETERS\n",
"state_size = [84,84,4] # Our input is a stack of 4 frames hence 84x84x4 (Width, height, channels) \n",
"action_size = game.get_available_buttons_size() # 3 possible actions: left, right, shoot\n",
"learning_rate = 0.0002 # Alpha (aka learning rate)\n",
"\n",
"### TRAINING HYPERPARAMETERS\n",
"total_episodes = 500 # Total episodes for training\n",
"max_steps = 100 # Max possible steps in an episode\n",
"batch_size = 64 \n",
"\n",
"# Exploration parameters for epsilon greedy strategy\n",
"explore_start = 1.0 # exploration probability at start\n",
"explore_stop = 0.01 # minimum exploration probability \n",
"decay_rate = 0.0001 # exponential decay rate for exploration prob\n",
"\n",
"# Q learning hyperparameters\n",
"gamma = 0.95 # Discounting rate\n",
"\n",
"### MEMORY HYPERPARAMETERS\n",
"pretrain_length = batch_size # Number of experiences stored in the Memory when initialized for the first time\n",
"memory_size = 1000000 # Number of experiences the Memory can keep\n",
"\n",
"### MODIFY THIS TO FALSE IF YOU JUST WANT TO SEE THE TRAINED AGENT\n",
"training = True\n",
"\n",
"## TURN THIS TO TRUE IF YOU WANT TO RENDER THE ENVIRONMENT\n",
"episode_render = False"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5: Create our Deep Q-learning Neural Network model 🧠\n",
"<img src=\"https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/DQN/doom/assets/model.png\" alt=\"Model\" />\n",
"This is our Deep Q-learning model:\n",
"- We take a stack of 4 frames as input\n",
"- It passes through 3 convnets\n",
"- Then it is flatened\n",
"- Finally it passes through 2 FC layers\n",
"- It outputs a Q value for each actions"
]
},
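{
"cell_type": "markdown",
"metadata": {},
"source": [
"The layer sizes quoted in the comments of the next cell can be double-checked with a small sketch (for VALID padding the spatial output size is `(input - kernel) // stride + 1`; `conv_out` is just a throwaway helper):\n",
"\n",
"```python\n",
"# Spatial output size of a convolution with VALID padding\n",
"def conv_out(size, kernel, stride):\n",
"    return (size - kernel) // stride + 1\n",
"\n",
"size = 84\n",
"for kernel, stride in [(8, 4), (4, 2), (4, 2)]:\n",
"    size = conv_out(size, kernel, stride)\n",
"    print(size)        # 20, then 9, then 3\n",
"print(3 * 3 * 128)     # 1152 units after flattening conv3\n",
"```"
]
},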
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"class DQNetwork:\n",
" def __init__(self, state_size, action_size, learning_rate, name='DQNetwork'):\n",
" self.state_size = state_size\n",
" self.action_size = action_size\n",
" self.learning_rate = learning_rate\n",
" \n",
" with tf.variable_scope(name):\n",
" # We create the placeholders\n",
" # *state_size means that we take each elements of state_size in tuple hence is like if we wrote\n",
" # [None, 84, 84, 4]\n",
" self.inputs_ = tf.placeholder(tf.float32, [None, *state_size], name=\"inputs\")\n",
" self.actions_ = tf.placeholder(tf.float32, [None, 3], name=\"actions_\")\n",
" \n",
" # Remember that target_Q is the R(s,a) + ymax Qhat(s', a')\n",
" self.target_Q = tf.placeholder(tf.float32, [None], name=\"target\")\n",
" \n",
" \"\"\"\n",
" First convnet:\n",
" CNN\n",
" BatchNormalization\n",
" ELU\n",
" \"\"\"\n",
" # Input is 84x84x4\n",
" self.conv1 = tf.layers.conv2d(inputs = self.inputs_,\n",
" filters = 32,\n",
" kernel_size = [8,8],\n",
" strides = [4,4],\n",
" padding = \"VALID\",\n",
" kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),\n",
" name = \"conv1\")\n",
" \n",
" self.conv1_batchnorm = tf.layers.batch_normalization(self.conv1,\n",
" training = True,\n",
" epsilon = 1e-5,\n",
" name = 'batch_norm1')\n",
" \n",
" self.conv1_out = tf.nn.elu(self.conv1_batchnorm, name=\"conv1_out\")\n",
" ## --> [20, 20, 32]\n",
" \n",
" \n",
" \"\"\"\n",
" Second convnet:\n",
" CNN\n",
" BatchNormalization\n",
" ELU\n",
" \"\"\"\n",
" self.conv2 = tf.layers.conv2d(inputs = self.conv1_out,\n",
" filters = 64,\n",
" kernel_size = [4,4],\n",
" strides = [2,2],\n",
" padding = \"VALID\",\n",
" kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),\n",
" name = \"conv2\")\n",
" \n",
" self.conv2_batchnorm = tf.layers.batch_normalization(self.conv2,\n",
" training = True,\n",
" epsilon = 1e-5,\n",
" name = 'batch_norm2')\n",
"\n",
" self.conv2_out = tf.nn.elu(self.conv2_batchnorm, name=\"conv2_out\")\n",
" ## --> [9, 9, 64]\n",
" \n",
" \n",
" \"\"\"\n",
" Third convnet:\n",
" CNN\n",
" BatchNormalization\n",
" ELU\n",
" \"\"\"\n",
" self.conv3 = tf.layers.conv2d(inputs = self.conv2_out,\n",
" filters = 128,\n",
" kernel_size = [4,4],\n",
" strides = [2,2],\n",
" padding = \"VALID\",\n",
" kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),\n",
" name = \"conv3\")\n",
" \n",
" self.conv3_batchnorm = tf.layers.batch_normalization(self.conv3,\n",
" training = True,\n",
" epsilon = 1e-5,\n",
" name = 'batch_norm3')\n",
"\n",
" self.conv3_out = tf.nn.elu(self.conv3_batchnorm, name=\"conv3_out\")\n",
" ## --> [3, 3, 128]\n",
" \n",
" \n",
" self.flatten = tf.layers.flatten(self.conv3_out)\n",
" ## --> [1152]\n",
" \n",
" \n",
" self.fc = tf.layers.dense(inputs = self.flatten,\n",
" units = 512,\n",
" activation = tf.nn.elu,\n",
" kernel_initializer=tf.contrib.layers.xavier_initializer(),\n",
" name=\"fc1\")\n",
" \n",
" \n",
" self.output = tf.layers.dense(inputs = self.fc, \n",
" kernel_initializer=tf.contrib.layers.xavier_initializer(),\n",
" units = 3, \n",
" activation=None)\n",
"\n",
" \n",
" # Q is our predicted Q value.\n",
" self.Q = tf.reduce_sum(tf.multiply(self.output, self.actions_), axis=1)\n",
" \n",
" \n",
" # The loss is the difference between our predicted Q_values and the Q_target\n",
" # Sum(Qtarget - Q)^2\n",
" self.loss = tf.reduce_mean(tf.square(self.target_Q - self.Q))\n",
" \n",
" self.optimizer = tf.train.RMSPropOptimizer(self.learning_rate).minimize(self.loss)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Reset the graph\n",
"tf.reset_default_graph()\n",
"\n",
"# Instantiate the DQNetwork\n",
"DQNetwork = DQNetwork(state_size, action_size, learning_rate)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 6: Experience Replay 🔁\n",
"Now that we create our Neural Network, **we need to implement the Experience Replay method.** <br><br>\n",
"Here we'll create the Memory object that creates a deque.A deque (double ended queue) is a data type that **removes the oldest element each time that you add a new element.**\n",
"\n",
"This part was taken from Udacity : <a href=\"https://github.com/udacity/deep-learning/blob/master/reinforcement/Q-learning-cart.ipynb\" Cartpole DQN</a>"
]
},
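{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal illustration of the deque behaviour the Memory buffer relies on:\n",
"\n",
"```python\n",
"from collections import deque\n",
"\n",
"d = deque(maxlen=4)\n",
"for i in range(6):\n",
"    d.append(i)\n",
"print(list(d))  # [2, 3, 4, 5] -> the two oldest elements were dropped automatically\n",
"```"
]
},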
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"class Memory():\n",
" def __init__(self, max_size):\n",
" self.buffer = deque(maxlen = max_size)\n",
" \n",
" def add(self, experience):\n",
" self.buffer.append(experience)\n",
" \n",
" def sample(self, batch_size):\n",
" buffer_size = len(self.buffer)\n",
" index = np.random.choice(np.arange(buffer_size),\n",
" size = batch_size,\n",
" replace = False)\n",
" \n",
" return [self.buffer[i] for i in index]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we'll **deal with the empty memory problem**: we pre-populate our memory by taking random actions and storing the experience (state, action, reward, new_state)."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# Instantiate memory\n",
"memory = Memory(max_size = memory_size)\n",
"\n",
"# Render the environment\n",
"game.new_episode()\n",
"\n",
"for i in range(pretrain_length):\n",
" # If it's the first step\n",
" if i == 0:\n",
" # First we need a state\n",
" state = game.get_state().screen_buffer\n",
" state, stacked_frames = stack_frames(stacked_frames, state, True)\n",
" \n",
" # Random action\n",
" action = random.choice(possible_actions)\n",
" \n",
" # Get the rewards\n",
" reward = game.make_action(action)\n",
" \n",
" # Look if the episode is finished\n",
" done = game.is_episode_finished()\n",
" \n",
" # If we're dead\n",
" if done:\n",
" # We finished the episode\n",
" next_state = np.zeros(state.shape)\n",
" \n",
" # Add experience to memory\n",
" memory.add((state, action, reward, next_state, done))\n",
" \n",
" # Start a new episode\n",
" game.new_episode()\n",
" \n",
" # First we need a state\n",
" state = game.get_state().screen_buffer\n",
" \n",
" # Stack the frames\n",
" state, stacked_frames = stack_frames(stacked_frames, state, True)\n",
" \n",
" else:\n",
" # Get the next state\n",
" next_state = game.get_state().screen_buffer\n",
" next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)\n",
" \n",
" # Add experience to memory\n",
" memory.add((state, action, reward, next_state, done))\n",
" \n",
" # Our state is now the next_state\n",
" state = next_state"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 7: Set up Tensorboard 📊\n",
"For more information about tensorboard, please watch this <a href=\"https://www.youtube.com/embed/eBbEDRsCmv4\">excellent 30min tutorial</a> <br><br>\n",
"To launch tensorboard : `tensorboard --logdir=/tensorboard/dqn/1`"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Setup TensorBoard Writer\n",
"writer = tf.summary.FileWriter(\"/tensorboard/dqn/1\")\n",
"\n",
"## Losses\n",
"tf.summary.scalar(\"Loss\", DQNetwork.loss)\n",
"\n",
"write_op = tf.summary.merge_all()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 8: Train our Agent 🏃‍♂️\n",
"\n",
"Our algorithm:\n",
"<br>\n",
"* Initialize the weights\n",
"* Init the environment\n",
"* Initialize the decay rate (that will use to reduce epsilon) \n",
"<br><br>\n",
"* **For** episode to max_episode **do** \n",
" * Make new episode\n",
" * Set step to 0\n",
" * Observe the first state $s_0$\n",
" <br><br>\n",
" * **While** step < max_steps **do**:\n",
" * Increase decay_rate\n",
" * With $\\epsilon$ select a random action $a_t$, otherwise select $a_t = \\mathrm{argmax}_a Q(s_t,a)$\n",
" * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$\n",
" * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$\n",
" * Sample random mini-batch from $D$: $<s, a, r, s'>$\n",
" * Set $\\hat{Q} = r$ if the episode ends at $+1$, otherwise set $\\hat{Q} = r + \\gamma \\max_{a'}{Q(s', a')}$\n",
" * Make a gradient descent step with loss $(\\hat{Q} - Q(s, a))^2$\n",
" * **endfor**\n",
" <br><br>\n",
"* **endfor**\n",
"\n",
" "
]
},
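{
"cell_type": "markdown",
"metadata": {},
"source": [
"A worked example of the Q-target rule above, with made-up numbers (the real values come from the replay memory; `q_target` is just a hypothetical helper for illustration):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"gamma = 0.95\n",
"\n",
"def q_target(r, done, Qs_next):\n",
"    # Terminal transition: the target is just the reward\n",
"    # Non-terminal transition: bootstrap on the best next Q value\n",
"    return r if done else r + gamma * np.max(Qs_next)\n",
"\n",
"print(q_target(101.0, True, None))                        # 101.0\n",
"print(q_target(-1.0, False, np.array([0.5, 2.0, -0.3])))  # -1.0 + 0.95 * 2.0, approximately 0.9\n",
"```"
]
},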
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"This function will do the part\n",
"With ϵ select a random action atat, otherwise select at=argmaxaQ(st,a)\n",
"\"\"\"\n",
"def predict_action(explore_start, explore_stop, decay_rate, decay_step, state, actions):\n",
" ## EPSILON GREEDY STRATEGY\n",
" # Choose action a from state s using epsilon greedy.\n",
" ## First we randomize a number\n",
" exp_exp_tradeoff = np.random.rand()\n",
"\n",
" # Here we'll use an improved version of our epsilon greedy strategy used in Q-learning notebook\n",
" explore_probability = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * decay_step)\n",
" \n",
" if (explore_probability > exp_exp_tradeoff):\n",
" # Make a random action (exploration)\n",
" action = random.choice(possible_actions)\n",
" \n",
" else:\n",
" # Get action from Q-network (exploitation)\n",
" # Estimate the Qs values state\n",
" Qs = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: state.reshape((1, *state.shape))})\n",
" \n",
" # Take the biggest Q value (= the best action)\n",
" choice = np.argmax(Qs)\n",
" action = possible_actions[int(choice)]\n",
" \n",
" return action, explore_probability"
]
},
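{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feel for the schedule used in `predict_action`, here is (approximately) how the exploration probability decays with `decay_step` for the hyperparameters from Step 4:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"for decay_step in [0, 1000, 10000, 50000]:\n",
"    explore_probability = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * decay_step)\n",
"    print(decay_step, round(explore_probability, 3))\n",
"# 0 -> 1.0, 1000 -> ~0.906, 10000 -> ~0.374, 50000 -> ~0.017\n",
"```"
]
},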
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Episode: 0 Total reward: 10.0 Training loss: 1.1466 Explore P: 0.9925\n",
"Model Saved\n",
"Episode: 1 Total reward: 95.0 Training loss: 15.8998 Explore P: 0.9919\n"
]
},
{
"ename": "KeyboardInterrupt",
"evalue": "",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-13-46ee7aff1ad1>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m 111\u001b[0m feed_dict={DQNetwork.inputs_: states_mb,\n\u001b[0;32m 112\u001b[0m \u001b[0mDQNetwork\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mtarget_Q\u001b[0m\u001b[1;33m:\u001b[0m \u001b[0mtargets_mb\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 113\u001b[1;33m DQNetwork.actions_: actions_mb})\n\u001b[0m\u001b[0;32m 114\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 115\u001b[0m \u001b[1;31m# Write TF Summaries\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m~\\Anaconda3\\envs\\gameplai\\lib\\site-packages\\tensorflow\\python\\client\\session.py\u001b[0m in \u001b[0;36mrun\u001b[1;34m(self, fetches, feed_dict, options, run_metadata)\u001b[0m\n\u001b[0;32m 893\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 894\u001b[0m result = self._run(None, fetches, feed_dict, options_ptr,\n\u001b[1;32m--> 895\u001b[1;33m run_metadata_ptr)\n\u001b[0m\u001b[0;32m 896\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mrun_metadata\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 897\u001b[0m \u001b[0mproto_data\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mtf_session\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mTF_GetBuffer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mrun_metadata_ptr\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m~\\Anaconda3\\envs\\gameplai\\lib\\site-packages\\tensorflow\\python\\client\\session.py\u001b[0m in \u001b[0;36m_run\u001b[1;34m(self, handle, fetches, feed_dict, options, run_metadata)\u001b[0m\n\u001b[0;32m 1126\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mfinal_fetches\u001b[0m \u001b[1;32mor\u001b[0m \u001b[0mfinal_targets\u001b[0m \u001b[1;32mor\u001b[0m \u001b[1;33m(\u001b[0m\u001b[0mhandle\u001b[0m \u001b[1;32mand\u001b[0m \u001b[0mfeed_dict_tensor\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1127\u001b[0m results = self._do_run(handle, final_targets, final_fetches,\n\u001b[1;32m-> 1128\u001b[1;33m feed_dict_tensor, options, run_metadata)\n\u001b[0m\u001b[0;32m 1129\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1130\u001b[0m \u001b[0mresults\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m[\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m~\\Anaconda3\\envs\\gameplai\\lib\\site-packages\\tensorflow\\python\\client\\session.py\u001b[0m in \u001b[0;36m_do_run\u001b[1;34m(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)\u001b[0m\n\u001b[0;32m 1342\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mhandle\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1343\u001b[0m return self._do_call(_run_fn, self._session, feeds, fetches, targets,\n\u001b[1;32m-> 1344\u001b[1;33m options, run_metadata)\n\u001b[0m\u001b[0;32m 1345\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1346\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_do_call\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0m_prun_fn\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_session\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mhandle\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mfeeds\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mfetches\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m~\\Anaconda3\\envs\\gameplai\\lib\\site-packages\\tensorflow\\python\\client\\session.py\u001b[0m in \u001b[0;36m_do_call\u001b[1;34m(self, fn, *args)\u001b[0m\n\u001b[0;32m 1348\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0m_do_call\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mfn\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1349\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1350\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mfn\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1351\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0merrors\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mOpError\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1352\u001b[0m \u001b[0mmessage\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcompat\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mas_text\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0me\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mmessage\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m~\\Anaconda3\\envs\\gameplai\\lib\\site-packages\\tensorflow\\python\\client\\session.py\u001b[0m in \u001b[0;36m_run_fn\u001b[1;34m(session, feed_dict, fetch_list, target_list, options, run_metadata)\u001b[0m\n\u001b[0;32m 1327\u001b[0m return tf_session.TF_Run(session, options,\n\u001b[0;32m 1328\u001b[0m \u001b[0mfeed_dict\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mfetch_list\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mtarget_list\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1329\u001b[1;33m status, run_metadata)\n\u001b[0m\u001b[0;32m 1330\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1331\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0m_prun_fn\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0msession\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mhandle\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mfeed_dict\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mfetch_list\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;31mKeyboardInterrupt\u001b[0m: "
]
}
],
"source": [
"# Saver will help us to save our model\n",
"saver = tf.train.Saver()\n",
"\n",
"if training == True:\n",
" with tf.Session() as sess:\n",
" # Initialize the variables\n",
" sess.run(tf.global_variables_initializer())\n",
" \n",
" # Initialize the decay rate (that will use to reduce epsilon) \n",
" decay_step = 0\n",
"\n",
" # Init the game\n",
" game.init()\n",
"\n",
" for episode in range(total_episodes):\n",
" # Set step to 0\n",
" step = 0\n",
" \n",
" # Initialize the rewards of the episode\n",
" episode_rewards = []\n",
" \n",
" # Make a new episode and observe the first state\n",
" game.new_episode()\n",
" state = game.get_state().screen_buffer\n",
" \n",
" # Remember that stack frame function also call our preprocess function.\n",
" state, stacked_frames = stack_frames(stacked_frames, state, True)\n",
"\n",
" while step < max_steps:\n",
" step += 1\n",
" \n",
" # Increase decay_step\n",
" decay_step +=1\n",
" \n",
" # Predict the action to take and take it\n",
" action, explore_probability = predict_action(explore_start, explore_stop, decay_rate, decay_step, state, possible_actions)\n",
"\n",
" # Do the action\n",
" reward = game.make_action(action)\n",
"\n",
" # Look if the episode is finished\n",
" done = game.is_episode_finished()\n",
" \n",
" # Add the reward to total reward\n",
" episode_rewards.append(reward)\n",
"\n",
" # If the game is finished\n",
" if done:\n",
" # the episode ends so no next state\n",
" next_state = np.zeros((84,84), dtype=np.int)\n",
" next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)\n",
"\n",
" # Set step = max_steps to end the episode\n",
" step = max_steps\n",
"\n",
" # Get the total reward of the episode\n",
" total_reward = np.sum(episode_rewards)\n",
"\n",
" print('Episode: {}'.format(episode),\n",
" 'Total reward: {}'.format(total_reward),\n",
" 'Training loss: {:.4f}'.format(loss),\n",
" 'Explore P: {:.4f}'.format(explore_probability))\n",
"\n",
" memory.add((state, action, reward, next_state, done))\n",
"\n",
" else:\n",
" # Get the next state\n",
" next_state = game.get_state().screen_buffer\n",
" \n",
" # Stack the frame of the next_state\n",
" next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)\n",
" \n",
"\n",
" # Add experience to memory\n",
" memory.add((state, action, reward, next_state, done))\n",
" \n",
" # st+1 is now our current state\n",
" state = next_state\n",
"\n",
"\n",
" ### LEARNING PART \n",
" # Obtain random mini-batch from memory\n",
" batch = memory.sample(batch_size)\n",
" states_mb = np.array([each[0] for each in batch], ndmin=3)\n",
" actions_mb = np.array([each[1] for each in batch])\n",
" rewards_mb = np.array([each[2] for each in batch]) \n",
" next_states_mb = np.array([each[3] for each in batch], ndmin=3)\n",
" dones_mb = np.array([each[4] for each in batch])\n",
"\n",
" target_Qs_batch = []\n",
"\n",
" # Get Q values for next_state \n",
" Qs_next_state = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: next_states_mb})\n",
" \n",
" # Set Q_target = r if the episode ends at s+1, otherwise set Q_target = r + gamma*maxQ(s', a')\n",
" for i in range(0, len(batch)):\n",
" terminal = dones_mb[i]\n",
"\n",
" # If we are in a terminal state, only equals reward\n",
" if terminal:\n",
" target_Qs_batch.append(rewards_mb[i])\n",
" \n",
" else:\n",
" target = rewards_mb[i] + gamma * np.max(Qs_next_state[i])\n",
" target_Qs_batch.append(target)\n",
" \n",
"\n",
" targets_mb = np.array([each for each in target_Qs_batch])\n",
"\n",
" loss, _ = sess.run([DQNetwork.loss, DQNetwork.optimizer],\n",
" feed_dict={DQNetwork.inputs_: states_mb,\n",
" DQNetwork.target_Q: targets_mb,\n",
" DQNetwork.actions_: actions_mb})\n",
"\n",
" # Write TF Summaries\n",
" summary = sess.run(write_op, feed_dict={DQNetwork.inputs_: states_mb,\n",
" DQNetwork.target_Q: targets_mb,\n",
" DQNetwork.actions_: actions_mb})\n",
" writer.add_summary(summary, episode)\n",
" writer.flush()\n",
"\n",
" # Save model every 5 episodes\n",
" if episode % 5 == 0:\n",
" save_path = saver.save(sess, \"./models/model.ckpt\")\n",
" print(\"Model Saved\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 9: Watch our Agent play 👀\n",
"Now that we trained our agent, we can test it"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with tf.Session() as sess:\n",
" \n",
" game, possible_actions = create_environment()\n",
" \n",
" totalScore = 0\n",
" \n",
" \n",
" # Load the model\n",
" saver.restore(sess, \"./models/model.ckpt\")\n",
" game.init()\n",
" for i in range(1):\n",
" \n",
" game.new_episode()\n",
" while not game.is_episode_finished():\n",
" frame = game.get_state().screen_buffer\n",
" state = stack_frames(stacked_frames, frame)\n",
" # Take the biggest Q value (= the best action)\n",
" Qs = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: state.reshape((1, *state.shape))})\n",
" action = np.argmax(Qs)\n",
" action = possible_actions[int(action)]\n",
" game.make_action(action) \n",
" score = game.get_total_reward()\n",
" print(\"Score: \", score)\n",
" totalScore += score\n",
" print(\"TOTAL_SCORE\", totalScore/100.0)\n",
" game.close()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

Humberd commented Mar 9, 2019

Where do I get the basic.cfg file from?


uahic commented May 9, 2019

@Humberd
I found it in the 'scenarios' subfolder of vizdoom. As I installed vizdoom via pip, it is located under
<your_virtual_env_path>/lib/python2.7/site-packages/vizdoom/scenarios/basic.cfg
The basic.wad is there as well.
