sujnesh/c1105cea-af25-4142-96f8-3cc2964d2f87.ipynb

## c1105cea-af25-4142-96f8-3cc2964d2f87.ipynb
{"nbformat":4,"nbformat_minor":0,"metadata":{"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.8-final"},"orig_nbformat":2,"kernelspec":{"name":"python3","display_name":"Python 3.8.8 64-bit (conda)","metadata":{"interpreter":{"hash":"828b2d2e90c34517c0f3db151137e4812c232f0241304c27576ea4a0014563c4"}}},"colab":{"name":"IITM_RL_DP_ASSIGNMENT_v1","provenance":[{"file_id":"1LGjo_UabJCcQl41FwuOahjeJe7Lp2kHk","timestamp":1614721014962},{"file_id":"14fC9FsOpJ6tJHZ5sa9s2EILP9BLIxnlw","timestamp":1614718819970}],"collapsed_sections":[],"toc_visible":true}},"cells":[{"cell_type":"markdown","metadata":{"id":"VXoLHrjbjNJp"},"source":["<div style=\"text-align: center\">\n","  <a href=\"https://www.aicrowd.com/challenges/rl-taxi\"><img alt=\"AIcrowd\" src=\"https://images.aicrowd.com/raw_images/challenges/banner_file/759/d9540ebbd506b68a5ff2.jpg\"></a>\n","</div>"]},{"cell_type":"markdown","metadata":{"id":"_rBlqB-7jSaG"},"source":["# What is the notebook about?\n","\n","## Problem - DP Algorithm\n","This problem deals with a taxi driver with multiple actions in different cities. The tasks you have to do are:\n","- Implement DP Algorithm to find the optimal sequence for the taxi driver\n","- Find optimal policies for sequences of varying lengths\n","- Explain a variation on the policy\n","\n","# How to use this notebook? 📝\n","\n","- This is a shared template and any edits you make here will not be saved. **You\n","should make a copy in your own drive**. Click the \"File\" menu (top-left), then \"Save a Copy in Drive\". You will be working in your copy however you like.\n","\n","<p style=\"text-align: center\"><img src=\"https://gitlab.aicrowd.com/aicrowd/assets/-/raw/master/notebook/aicrowd_notebook_submission_flow.png?inline=false\" alt=\"notebook overview\" style=\"width: 650px;\"/></p>\n","\n","- **Update the config parameters**. You can define the common variables here\n","\n","Variable | Description\n","--- | ---\n","`AICROWD_DATASET_PATH` | Path to the file containing test data. This should be an absolute path.\n","`AICROWD_RESULTS_DIR` | Path to write the output to.\n","`AICROWD_ASSETS_DIR` | In case your notebook needs additional files (like model weights, etc.,), you can add them to a directory and specify the path to the directory here (please specify relative path). The contents of this directory will be sent to AIcrowd for evaluation.\n","`AICROWD_API_KEY` | In order to submit your code to AIcrowd, you need to provide your account's API key. This key is available at https://www.aicrowd.com/participants/me\n","\n","- **Installing packages**. Please use the [Install packages 🗃](#install-packages-) section to install the packages"]},{"cell_type":"markdown","metadata":{"id":"zr2_Itu2jmYu"},"source":["# Setup AIcrowd Utilities 🛠\n","\n","We use this to bundle the files for submission and create a submission on AIcrowd. Do not edit this block."]},{"cell_type":"code","metadata":{"id":"kriIY9ntvLQD"},"source":["!pip install -U git+https://gitlab.aicrowd.com/aicrowd/aicrowd-cli.git@notebook-submission-v2 > /dev/null "],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"ecFpLP6avMok"},"source":["%load_ext aicrowd.magic "],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"9tpalq6zjo3X"},"source":["# AIcrowd Runtime Configuration 🧷\n","\n","Define configuration parameters. Please include any files needed for the notebook to run under `ASSETS_DIR`. We will copy the contents of this directory to your final submission file 🙂"]},{"cell_type":"code","metadata":{"id":"Up8mk6fhvOub"},"source":["import os\n","\n","AICROWD_DATASET_PATH = os.getenv(\"DATASET_PATH\", os.getcwd()+\"/40746340-4151-4921-8496-be10b3f8f5cf_hw2_q1.zip\")\n","AICROWD_RESULTS_DIR = os.getenv(\"OUTPUTS_DIR\", \"results\")\n","API_KEY = \"\" #Get your API key from https://www.aicrowd.com/participants/me"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"O2w1nVPDj9Uj"},"source":["# Download dataset files 📲"]},{"cell_type":"code","metadata":{"id":"ZV9xReqhvVNt"},"source":["!aicrowd login --api-key $API_KEY\n","!aicrowd dataset download -c rl-taxi"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"MTYP6MFKvXCM"},"source":["!unzip -q $AICROWD_DATASET_PATH"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"vijz0kfaKzmw"},"source":["DATASET_DIR = 'hw2_q1/'\n","!mkdir {DATASET_DIR}results/"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"gTeWFlWukTob"},"source":["# Install packages 🗃\n","\n","Please add all pacakage installations in this section"]},{"cell_type":"code","metadata":{"id":"KV5fXPkYkUxz"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"yJbm42p_kWRG"},"source":["# Import packages 💻"]},{"cell_type":"code","metadata":{"id":"geEOnXHeK4oQ"},"source":["import numpy as np\n","import os\n","# ADD ANY IMPORTS YOU WANT HERE"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"oL3_9MAk5cqv"},"source":["import numpy as np\n","\n","class TaxiEnv_HW2:\n","    def __init__(self, states, actions, probabilities, rewards):\n","        self.possible_states = states\n","        self._possible_actions = {st: ac for st, ac in zip(states, actions)}\n","        self._ride_probabilities = {st: pr for st, pr in zip(states, probabilities)}\n","        self._ride_rewards = {st: rw for st, rw in zip(states, rewards)}\n","        self._verify()\n","\n","    def _check_state(self, state):\n","        assert state in self.possible_states, \"State %s is not a valid state\" % state\n","\n","    def _verify(self):\n","        \"\"\" \n","        Verify that data conditions are met:\n","        Number of actions matches shape of next state and actions\n","        Every probability distribution adds up to 1 \n","        \"\"\"\n","        ns = len(self.possible_states)\n","        for state in self.possible_states:\n","            ac = self._possible_actions[state]\n","            na = len(ac)\n","\n","            rp = self._ride_probabilities[state]\n","            assert np.all(rp.shape == (na, ns)), \"Probabilities shape mismatch\"\n","        \n","            rr = self._ride_rewards[state]\n","            assert np.all(rr.shape == (na, ns)), \"Rewards shape mismatch\"\n","\n","            assert np.allclose(rp.sum(axis=1), 1), \"Probabilities don't add up to 1\"\n","\n","    def possible_actions(self, state):\n","        \"\"\" Return all possible actions from a given state \"\"\"\n","        self._check_state(state)\n","        return self._possible_actions[state]\n","\n","    def ride_probabilities(self, state, action):\n","        \"\"\" \n","        Returns all possible ride probabilities from a state for a given action\n","        For every action a list with the returned with values in the same order as self.possible_states\n","        \"\"\"\n","        actions = self.possible_actions(state)\n","        ac_idx = actions.index(action)\n","        return self._ride_probabilities[state][ac_idx]\n","\n","    def ride_rewards(self, state, action):\n","        actions = self.possible_actions(state)\n","        ac_idx = actions.index(action)\n","        return self._ride_rewards[state][ac_idx]"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"APNUuTHK5cqw"},"source":["# Examples of using the environment functions"]},{"cell_type":"code","metadata":{"id":"U8BwZY8Z5cqw"},"source":["def check_taxienv():\n","    # These are the values as used in the pdf, but they may be changed during submission, so do not hardcode anything\n","\n","    states = ['A', 'B', 'C']\n","\n","    actions = [['1','2','3'], ['1','2'], ['1','2','3']]\n","\n","    probs = [np.array([[1/2,  1/4,  1/4],\n","                    [1/16, 3/4,  3/16],\n","                    [1/4,  1/8,  5/8]]),\n","\n","            np.array([[1/2,   0,     1/2],\n","                    [1/16,  7/8,  1/16]]),\n","\n","            np.array([[1/4,  1/4,  1/2],\n","                    [1/8,  3/4,  1/8],\n","                    [3/4,  1/16, 3/16]]),]\n","\n","    rewards = [np.array([[10,  4,  8],\n","                        [ 8,  2,  4],\n","                        [ 4,  6,  4]]),\n","    \n","            np.array([[14,  0, 18],\n","                        [ 8, 16,  8]]),\n","    \n","            np.array([[10,  2,  8],\n","                        [6,   4,  2],\n","                        [4,   0,  8]]),]\n","\n","\n","    env = TaxiEnv_HW2(states, actions, probs, rewards)\n","    print(\"All possible states\", env.possible_states)\n","    print(\"All possible actions from state B\", env.possible_actions('B'))\n","    print(\"Ride probabilities from state A with action 2\", env.ride_probabilities('A', '2'))\n","    print(\"Ride rewards from state C with action 3\", env.ride_rewards('C', '3'))\n","\n","check_taxienv()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"uh6g_1u05cqx"},"source":["# Task 1 - DP Algorithm implementation\n","Implement your DP algorithm that takes the starting state and sequence length\n","and return the expected reward for the policy"]},{"cell_type":"code","metadata":{"id":"IMYcbfVq5cqx"},"source":["def dp_solve(taxienv):\n","    ## Implement the DP algorithm for the taxienv\n","    states = taxienv.possible_states\n","    values = {s: 0 for s in states}\n","    policy = {s: '0' for s in states}\n","    all_values = [] # Append the \"values\" dictionary to this after each update\n","    all_policies = [] # Append the \"policy\" dictionary to this after each update\n","    # Note: The sequence length is always N=10\n","    \n","    # ADD YOUR CODE BELOW - DO NOT EDIT ABOVE THIS LINE\n","\n","    # DO NOT EDIT BELOW THIS LINE\n","    results = {\"Expected Reward\": all_values, \"Polcies\": all_policies}\n","    return results"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"us8AYNVXISv1"},"source":["## Here is an example of what the \"results\" output from value_iter function should look like\n","\n","Ofcourse, it won't be all zeros\n","``` python \n","{'Expected Reward': [{'A': 0, 'B': 0, 'C': 0},\n","  {'A': 0, 'B': 0, 'C': 0},\n","  {'A': 0, 'B': 0, 'C': 0},\n","  {'A': 0, 'B': 0, 'C': 0},\n","  {'A': 0, 'B': 0, 'C': 0},\n","  {'A': 0, 'B': 0, 'C': 0},\n","  {'A': 0, 'B': 0, 'C': 0},\n","  {'A': 0, 'B': 0, 'C': 0},\n","  {'A': 0, 'B': 0, 'C': 0},\n","  {'A': 0, 'B': 0, 'C': 0}],\n"," 'Polcies': [{'A': '0', 'B': '0', 'C': '0'},\n","  {'A': '0', 'B': '0', 'C': '0'},\n","  {'A': '0', 'B': '0', 'C': '0'},\n","  {'A': '0', 'B': '0', 'C': '0'},\n","  {'A': '0', 'B': '0', 'C': '0'},\n","  {'A': '0', 'B': '0', 'C': '0'},\n","  {'A': '0', 'B': '0', 'C': '0'},\n","  {'A': '0', 'B': '0', 'C': '0'},\n","  {'A': '0', 'B': '0', 'C': '0'},\n","  {'A': '0', 'B': '0', 'C': '0'}]}\n","\n","  ```"]},{"cell_type":"code","metadata":{"id":"5Ct5_WU1meeo"},"source":["if not os.path.exists(AICROWD_RESULTS_DIR):\n","  os.mkdir(AICROWD_RESULTS_DIR)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"AblFN9zNIjwV"},"source":["# DO NOT EDIT THIS CELL, DURING EVALUATION THE DATASET DIR WILL CHANGE\n","input_dir = os.path.join(DATASET_DIR, 'inputs')\n","for params_file in os.listdir(input_dir):\n","  kwargs = np.load(os.path.join(input_dir, params_file), allow_pickle=True).item()\n","\n","  env = TaxiEnv_HW2(**kwargs)\n","\n","  results = dp_solve(env)\n","  idx = params_file.split('_')[-1][:-4]\n","  np.save(os.path.join(AICROWD_RESULTS_DIR, 'results_' + idx), results)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"fswnLXrXL2wh"},"source":["## Modify this code to show the results for the policy and expected rewards properly\n","print(results)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Qsne4xm4YTi-"},"source":["# Task 2 - Tabulate the optimal policy & optimal value for each state in each round for N=10\n","\n","Modify this cell and add your answer"]},{"cell_type":"markdown","metadata":{"id":"vKsA1jrCKszh"},"source":["# Question -  Consider a policy that always forces the driver to go to the nearest taxi stand, irrespective of the state. Is it optimal? Justify your answer.\n","\n"]},{"cell_type":"markdown","metadata":{"id":"Yl502NaJHaPE"},"source":["Modify this cell and add your answer"]},{"cell_type":"markdown","metadata":{"id":"NiAS3hQPkiXS"},"source":["# Submit to AIcrowd 🚀\n","\n","**NOTE: PLEASE SAVE THE NOTEBOOK BEFORE SUBMITTING IT (Ctrl + S)**"]},{"cell_type":"code","metadata":{"id":"LfpXzdeTvjJ7"},"source":["!DATASET_PATH=$AICROWD_DATASET_PATH aicrowd notebook submit -c rl-taxi -a assets"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"VfwpjWPr3yMm"},"source":[""],"execution_count":null,"outputs":[]}]}