{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial for DDPG Reinforcement Learning using Keras"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### MD Muhaimin Rahman\n",
"sezan92[at]gmail[dot]com"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial, I am trying to build ```Deep Deterministic Policy Gradient```aka ```DDPG``` Reinforcement Learning using Keras Backend.\n",
"This notebook is structured as following\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [Importing Libraries](#libraries)\n",
"- [Algorithm](#algorithm)\n",
"- [Model Definition](#model)\n",
"- [Replay Buffer](#buffer)\n",
"- [Noise Class](#noise)\n",
"- [Training](#training)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Target Readers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you already have idea about Q Learning ,Deep Q Learning and Deep Neural Networks , then this tutorial is for you. Otherwise, you should learn them first"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id =\"libraries\"></a>\n",
"### Importing Libraries\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/sezan92/anaconda/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
" from ._conv import register_converters as _register_converters\n",
"Using TensorFlow backend.\n"
]
}
],
"source": [
"from __future__ import print_function,division\n",
"import gym\n",
"import keras\n",
"from keras import layers\n",
"from keras import backend as K\n",
"from collections import deque\n",
"from tqdm import tqdm\n",
"import random\n",
"import numpy as np\n",
"import copy\n",
"SEED =123\n",
"np.random.seed(SEED)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Important constants"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"num_episodes = 100\n",
"steps_per_episode=500\n",
"BATCH_SIZE=256\n",
"TAU=0.001\n",
"GAMMA=0.95\n",
"actor_lr=0.0001\n",
"critic_lr=0.001\n",
"SHOW= False"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from keras.models import Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id =\"algorithm\"></a>\n",
"### Algorithm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The actual algorithm was developed by Timothy lilicap et al. The algorithm is an actor-critic based algorithm.. Which means, it has two networks to train- an actor network, which predicts action based on the current state. The other networ- known as Critic network- evaluates the state and action. This is the case for all actor-critic networks. The critic network is updated using Bellman Equation like DQN. The difference is the training of actor network. In DDPG , we train Actor network by trying to get the maximum value of gradient of $Q(s,a)$ for given action $a$ in a state $s$. In a normal machine learning classification and regression algorithm, our target is to get the value with minimum loss. Then we train the network by gradient descent technique using the gradient of Loss ."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\\begin{equation}\n",
"\\theta \\gets \\theta - \\alpha \\frac{\\partial L}{\\partial \\theta}\n",
"\\end{equation}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, $\\theta$ is the weight parameter of the network, and $L$ is loss"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But in our case, we have to get the maximize the $Q$ value. So we have to set the weight parameters such that we get the maximum $Q$ value. This technique is known as Gradient Ascent, as it does the exact opposite of Gradient Descent"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\\begin{equation}\n",
"\\theta_a \\gets \\theta_a - \\alpha (-\\frac{\\partial Q(s,a) }{\\partial \\theta_a}) \n",
"\\end{equation}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above equation looks like the actual Gradient Descent equation. Only difference is , the minus sign. It makes the equation to minimize the negative value of $Q$ , which in turn maximizes $Q$ value.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So the training is as following"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- 1) Define Actor network $actor$ and Critic Network $critic$\n",
"- 2) Define Target Actor and Critic Networks - $actor_{target}$ and $critic_{target}$ with exact same weights\n",
"- 3) Initialize Replay Buffer \n",
"- 4) Get the initial state , $state$\n",
"- 5) Get the action $a$ from , $a \\gets actor(state)$ + Noise .[Here Noise is given to make the process stochastic and not so deterministic. The paper uses ornstein uhlenbeck noise process , so we will as well]\n",
"- 6) Get Next state $state_{next}$ , Reward $r$ , Terminal from agent for given $state$ and $action$\n",
"- 7) Add the experience , $state$,$action$,$reward$,$state_{next}$,$terminal$ to replay buffer\n",
"- 8) Get sample minibatch from Replay buffer\n",
"- 9) Train Critic Network Using Bellman Equation. Like DQN\n",
"- 10) Train Actor Network using Gradient Ascent with gradients of $Q$ . $\\theta_a \\gets \\theta_a - \\alpha (-\\frac{\\partial Q(s,a) }{\\partial \\theta_a}) $\n",
"- 11) Update weights of $actor_{target}$ and $critic_{target}$ using the equation $ \\theta \\gets \\tau \\theta + (1-\\tau)\\theta_{target}$"
]
},
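{
"cell_type": "markdown",
"metadata": {},
"source": [
"For step 9, the critic target is the usual DQN-style Bellman target, $y = r + \\gamma\\,(1-terminal)\\,critic_{target}(state_{next},\\, actor_{target}(state_{next}))$. For step 11, below is a minimal sketch of the soft target update as a standalone helper, assuming two Keras models with matching architectures. The training loop later in this notebook performs the same update inline, so this helper is only illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def soft_update(source_model, target_model, tau):\n",
"    # Illustrative sketch: theta_target <- tau*theta + (1-tau)*theta_target,\n",
"    # applied to every weight array of the Keras models.\n",
"    new_weights = [tau * w + (1.0 - tau) * tw\n",
"                   for w, tw in zip(source_model.get_weights(), target_model.get_weights())]\n",
"    target_model.set_weights(new_weights)"
]
},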
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id =\"model\"></a>\n",
"### Model Definition"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After some trials and errors, I have selected this network. The Actor Network is 3 layer MLP with 320 hidden nodes in each layer. The critic network is also a 3 layer MLP with 640 hidden nodes in each layer.Notice that the return arguments of function ```create_critic_network```."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def create_actor_network(state_shape,action_shape):\n",
" in1=layers.Input(shape=state_shape,name=\"state\")\n",
" l1 =layers.Dense(320,activation=\"relu\")(in1)\n",
" l2 =layers.Dense(320,activation=\"relu\")(l1)\n",
" l3 =layers.Dense(320,activation=\"relu\")(l2)\n",
" action =layers.Dense(action_shape,activation=\"tanh\")(l3)\n",
" actor= Model(in1,action)\n",
" return actor\n",
"\n",
"def create_critic_network(state_shape,action_shape):\n",
" in1 = layers.Input(shape=state_shape,name=\"state\")\n",
" in2 = layers.Input(shape=action_shape,name=\"action\")\n",
" l1 = layers.concatenate([in1,in2])\n",
" l2 = layers.Dense(640,activation=\"relu\")(l1)\n",
" l3 = layers.Dense(640,activation=\"relu\")(l2)\n",
" l4 = layers.Dense(640,activation=\"relu\")(l3)\n",
" value = layers.Dense(1)(l4)\n",
" critic = Model(inputs=[in1,in2],outputs=value)\n",
" return critic,in1,in2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I am chosing ```MountainCarContinuous-v0``` game. Mainly because my GPU is not that good to work on higher dimensional state space"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33mWARN: gym.spaces.Box autodetected dtype as <type 'numpy.float32'>. Please provide explicit dtype.\u001b[0m\n",
"\u001b[33mWARN: gym.spaces.Box autodetected dtype as <type 'numpy.float32'>. Please provide explicit dtype.\u001b[0m\n"
]
}
],
"source": [
"env = gym.make(\"MountainCarContinuous-v0\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"state_shape= env.observation_space.sample().shape"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"action_shape=env.action_space.sample().shape"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"actor = create_actor_network(state_shape,action_shape[0])\n",
"critic,state_tensor,action_tensor = create_critic_network(state_shape,action_shape)\n",
"target_actor=create_actor_network(state_shape,action_shape[0])\n",
"target_critic,_,_ = create_critic_network(state_shape,action_shape)\n",
"target_actor.set_weights(actor.get_weights())\n",
"target_critic.set_weights(critic.get_weights())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I have chosen ```RMSProp``` optimizer, due to more stability compared to Adam . I found it after trials and errors, no theoritical background on chosing this optimizer"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"actor_optimizer = keras.optimizers.RMSprop(actor_lr)\n",
"\n",
"critic_optimizer = keras.optimizers.RMSprop(critic_lr)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"WARNING:tensorflow:From /home/sezan92/anaconda/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py:1290: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.\n",
"Instructions for updating:\n",
"keep_dims is deprecated, use keepdims instead\n"
]
}
],
"source": [
"critic.compile(loss=\"mse\",optimizer=critic_optimizer)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Actor training"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I think this is the most critical part of ddpg in keras. The object ```critic``` and ```actor``` has a ```__call__``` method inside it, which will give output tensor if you give input a tensor. So to get the tensor object of ```Q``` we will use this functionality."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"CriticValues = critic([state_tensor,actor(state_tensor)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now it is time to get the gradient value of $-\\frac{\\partial Q(s,a)}{\\theta_a}$"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"updates = actor_optimizer.get_updates(\n",
" params=actor.trainable_weights,loss=-K.mean(CriticValues))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we will create a function which will train the actor network."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"actor_train = K.function(inputs=[state_tensor],outputs=[actor(state_tensor),CriticValues],\n",
" updates=updates)\n"
]
},
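{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (illustration only, not part of the training procedure), we could call ```actor_train``` on a dummy batch of random states and inspect the output shapes. Note that doing so also applies one optimizer update to the actor, since ```actor_train``` carries the update ops."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustration: a random batch stands in for real states here.\n",
"dummy_states = np.random.uniform(-1.0, 1.0, size=(BATCH_SIZE, state_shape[0])).astype(\"float32\")\n",
"dummy_actions, dummy_values = actor_train([dummy_states])\n",
"print(dummy_actions.shape, dummy_values.shape)"
]
},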
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id =\"buffer\"></a>\n",
"### Replay Buffer"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"memory = deque(maxlen=10000)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 10000/10000 [00:00<00:00, 21304.88it/s]\n"
]
}
],
"source": [
"state = env.reset()\n",
"state = state.reshape(-1,)\n",
"for _ in tqdm(range(memory.maxlen)):\n",
" action = env.action_space.sample()\n",
" next_state,reward,terminal,_=env.step(action)\n",
" \n",
" state=next_state\n",
" if terminal:\n",
" reward=-100\n",
" state= env.reset()\n",
" state = state.reshape(-1,)\n",
" memory.append((state,action,reward,next_state,terminal))"
]
},
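{
"cell_type": "markdown",
"metadata": {},
"source": [
"Purely for illustration (the training loop below does this for real), we could sample a tiny minibatch from ```memory``` and stack the states into an array to check the shapes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustration: sample a few transitions and inspect the stacked state shape.\n",
"demo_batch = random.sample(memory, 4)\n",
"demo_states = np.array([transition[0].reshape((-1,)) for transition in demo_batch])\n",
"print(demo_states.shape)  # expected (4, 2) for MountainCarContinuous-v0"
]
},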
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([-0.49533114, -0.00344986])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"next_state"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id =\"noise\"></a>\n",
"### Noise class"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The use of Noise is to make the process Stochastic and to help the agent explore different actions. The paper used Orstein Uhlenbeck Noise class"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"class OrnsteinUhlenbeckProcess(object):\n",
" def __init__(self, theta, mu=0, sigma=1, x0=0, dt=1e-2, n_steps_annealing=10, size=1):\n",
" self.theta = theta\n",
" self.sigma = sigma\n",
" self.n_steps_annealing = n_steps_annealing\n",
" self.sigma_step = - self.sigma / float(self.n_steps_annealing)\n",
" self.x0 = x0\n",
" self.mu = mu\n",
" self.dt = dt\n",
" self.size = size\n",
" def restart(self):\n",
" self.x0=copy.copy(self.mu)\n",
" def generate(self, step):\n",
" #sigma = max(0, self.sigma_step * step + self.sigma)\n",
" x = self.x0 + self.theta * (self.mu - self.x0) * self.dt + self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.size)\n",
" self.x0 = x\n",
" return x"
]
},
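{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick look at the process (the parameters below are arbitrary, chosen only for illustration): each call to ```generate``` drifts the previous value toward ```mu``` and adds Gaussian noise scaled by ```sigma```."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustration only: draw a few samples to see the drift toward mu.\n",
"ou_demo = OrnsteinUhlenbeckProcess(theta=0.35, mu=0.8, sigma=0.4)\n",
"ou_demo.restart()\n",
"print([float(ou_demo.generate(step)) for step in range(5)])"
]
},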
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id =\"training\"></a>\n",
"### Training"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Failed! Episode 0 Total Reward -121.829750\n",
"Failed! Episode 1 Total Reward -118.666569\n",
"Failed! Episode 2 Total Reward -162.041720\n",
"Failed! Episode 3 Total Reward -87.619343\n",
"Failed! Episode 4 Total Reward -96.425173\n",
"Failed! Episode 5 Total Reward -143.686660\n",
"Failed! Episode 6 Total Reward -38.429055\n",
"Passed! Episode 7 Total Reward 76.095012\n",
"Passed! Episode 8 Total Reward 78.345259\n",
"Failed! Episode 9 Total Reward -0.019448\n",
"Passed! Episode 10 Total Reward 48.135478\n",
"Passed! Episode 11 Total Reward 54.750747\n",
"Passed! Episode 12 Total Reward 79.410100\n",
"Passed! ward 83.0033361Episode 13 Total Reward 83.003336\n",
"Passed! Episode 14 Total Reward 83.174427\n",
"Failed! Episode 15 Total Reward -48.217783\n",
"Passed! Episode 16 Total Reward 87.284775\n",
"Passed! Episode 17 Total Reward 32.635075\n",
"Passed! Episode 18 Total Reward 84.165725\n",
"Passed! Episode 19 Total Reward 89.819676\n",
"Passed! Episode 20 Total Reward 76.334761\n",
"Passed! Episode 21 Total Reward 58.299554\n",
"Passed! Episode 22 Total Reward 81.035049\n",
"Failed! Episode 23 Total Reward -39.691471\n",
"Failed! Episode 24 Total Reward -161.331777\n",
"Passed! Episode 25 Total Reward 86.132543\n",
"Failed! Episode 26 Total Reward -104.653697\n",
"Passed! Episode 27 Total Reward 85.415800\n",
"Failed! Episode 28 Total Reward -74.066867\n",
"Failed! Episode 29 Total Reward -29.747882\n",
"Passed! Episode 30 Total Reward 59.700877\n",
"Passed! Episode 31 Total Reward 64.003951\n",
"Failed! Episode 32 Total Reward -43.807664\n",
"Passed! Episode 33 Total Reward 81.749323\n",
"Passed! Episode 34 Total Reward 66.484130\n",
"Failed! Episode 35 Total Reward -120.525069\n",
"Failed! Episode 36 Total Reward -24.451582\n",
"Failed! Episode 37 Total Reward -207.239527\n",
"Failed! Episode 38 Total Reward -43.106546\n",
"Passed! Episode 39 Total Reward 34.880121\n",
"Passed! Episode 40 Total Reward 66.188821\n",
"Passed! Episode 41 Total Reward 29.939728\n",
"Failed! Episode 42 Total Reward -86.516192\n",
"Failed! Episode 43 Total Reward -225.143733\n",
"Passed! Episode 44 Total Reward 7.568837\n",
"Passed! Episode 45 Total Reward 84.050707\n",
"Failed! Episode 46 Total Reward -86.705966\n",
"Passed! Episode 47 Total Reward 51.108435\n",
"Failed! Episode 48 Total Reward -88.531423\n",
"Passed! Episode 49 Total Reward 79.210735\n",
"Failed! Episode 50 Total Reward -64.247977\n",
"Failed! Episode 51 Total Reward -28.334167\n",
"Failed! Episode 52 Total Reward -53.942703\n",
"Passed! Episode 53 Total Reward 56.541394\n",
"Failed! Episode 54 Total Reward -30.573394\n",
"Failed! Episode 55 Total Reward -43.900060\n",
"Failed! Episode 56 Total Reward -31.793103\n",
"Passed! Episode 57 Total Reward 49.367744\n",
"Failed! Episode 58 Total Reward -59.277307\n",
"Failed! Episode 59 Total Reward -91.342287\n",
"Passed! Episode 60 Total Reward 72.884442\n",
"Passed! Episode 61 Total Reward 81.967582\n",
"Failed! Episode 62 Total Reward -69.682571\n",
"Failed! Episode 63 Total Reward -21.655243\n",
"Failed! Episode 64 Total Reward -102.427871\n",
"Failed! Episode 65 Total Reward -18.727795\n",
"Failed! Episode 66 Total Reward -106.180673\n",
"Passed! Episode 67 Total Reward 66.997363\n",
"Failed! Episode 68 Total Reward -98.338890\n",
"Failed! Episode 69 Total Reward -17.488416\n",
"Failed! Episode 70 Total Reward -126.062955\n",
"Passed! Episode 71 Total Reward 91.446696\n",
"Failed! Episode 72 Total Reward -153.244886\n",
"Passed! Episode 73 Total Reward 32.589752\n",
"Failed! Episode 74 Total Reward -106.034746\n",
"Failed! Episode 75 Total Reward -46.197340\n",
"Failed! Episode 76 Total Reward -64.983000\n",
"Failed! Episode 77 Total Reward -50.082150\n",
"Failed! Episode 78 Total Reward -44.506584\n",
"Total Reward -86.285619\r"
]
}
],
"source": [
"steps_per_episodes=5000\n",
"ou = OrnsteinUhlenbeckProcess(theta=0.35,mu=0.8,sigma=0.4,n_steps_annealing=10)\n",
"max_total_reward=0\n",
"for episode in range(num_episodes):\n",
" state= env.reset()\n",
" state = state.reshape(-1,)\n",
" total_reward=0\n",
" ou.restart()\n",
" for step in range(steps_per_episodes):\n",
" action= actor.predict(state.reshape(1,-1))+ou.generate(episode)\n",
" next_state,reward,done,_ = env.step(action)\n",
" total_reward=total_reward+reward\n",
" #random minibatch from buffer\n",
" \n",
" batches=random.sample(memory,BATCH_SIZE)\n",
" states= np.array([batch[0].reshape((-1,)) for batch in batches])\n",
" actions= np.array([batch[1] for batch in batches])\n",
" actions=actions.reshape(-1,1)\n",
" rewards=np.array([batch[2] for batch in batches])\n",
" rewards = rewards.reshape((-1,1))\n",
" new_states=np.array([batch[3].reshape((-1,)) for batch in batches])\n",
" terminals=np.array([batch[4] for batch in batches])\n",
" terminals = terminals.reshape((-1,1))\n",
" #training\n",
" \n",
" target_actions = target_actor.predict(new_states)\n",
" target_Qs = target_critic.predict([new_states,target_actions])\n",
" \n",
" new_Qs = rewards+GAMMA*target_Qs*terminals\n",
" critic.fit([states,actions],new_Qs,verbose=False)\n",
" _,critic_values=actor_train(inputs=[states])\n",
" target_critic_weights=[TAU*weight+(1-TAU)*target_weight for weight,target_weight in zip(critic.get_weights(),target_critic.get_weights())]\n",
" target_actor_weights=[TAU*weight+(1-TAU)*target_weight for weight,target_weight in zip(actor.get_weights(),target_actor.get_weights())]\n",
" target_critic.set_weights(target_critic_weights)\n",
" target_actor.set_weights(target_actor_weights)\n",
" print(\"Total Reward %f\"%total_reward,end=\"\\r\")\n",
" if SHOW:\n",
" env.render()\n",
" if done or step==(steps_per_episodes-1):\n",
" \n",
" if total_reward<0:\n",
" print(\"Failed!\",end=\" \")\n",
" reward=-100\n",
" elif total_reward>0:\n",
" print(\"Passed!\",end=\" \")\n",
" reward=100\n",
" memory.append((state,action,reward,next_state,done))\n",
" break\n",
" \n",
" memory.append((state,action,reward,next_state,done))\n",
" state=next_state\n",
" if total_reward>max_total_reward:\n",
" actor.save_weights(\"MC_DDPG_Weights/Actor_Best_weights episode %d_GAMMA_%f_TAU%f_lr_%f.h5\"%(episode,GAMMA,TAU,actor_lr))\n",
" critic.save_weights(\"MC_DDPG_Weights/Critic_Best_weights episode %d_GAMMA_%f_TAU%f_lr_%f.h5\"%(episode,GAMMA,TAU,critic_lr))\n",
" max_total_reward=total_reward\n",
" print(\"Episode %d Total Reward %f\"%(episode,total_reward))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.15"
}
},
"nbformat": 4,
"nbformat_minor": 2
}