jteichma/pomdp.ipynb Secret

## pomdp.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "name": "untitled0.ipynb",
      "authorship_tag": "ABX9TyN6rYLZqHntV+mRSg0OsJM+",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/jteichma/05441c0e13d450c341c71d0a307623f1/untitled0.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "We shall interpret this simplest POMDP also from a point of view of finance. Let us first write it down abstractly.\n",
        "\n",
        "There are two states of the environment: tiger-left and tiger-right. There are three actions: open-left, open-right and listen.\n",
        "\n",
        "If action is equal to listen, then the state remains constant. If action equals open we are uniformly resampling after collecting the reward.\n",
        "\n",
        "The observation is a noisy version of the actual state, i.e. the probability to receive the actual state is reduced by an additive quantity, e.g. 0.15.\n",
        "\n",
        "One financial interpretation is the following:\n",
        "\n",
        "1.   Assume the two states to be two short maturity investment vehicles, where one appears to be beneficial, the other one means a total loss. \n",
        "2.   You are uninformed about the true nature of the vehicles, but you can look at the performance ('listen') of course with some noise.\n",
        "3.   After choosing a vehicle you realize the return and are facing the same situation.\n",
        "\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "IjM3KP-8ZCWl"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "pETnXXQDxZQ5",
        "outputId": "e164619a-2267-414f-c5ff-eb0a3cf48e5c"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
            "Collecting pomdp_py\n",
            "  Downloading pomdp-py-1.3.2.tar.gz (105 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m105.3/105.3 kB\u001b[0m \u001b[31m3.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25h  Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "Requirement already satisfied: Cython in /usr/local/lib/python3.10/dist-packages (from pomdp_py) (0.29.34)\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from pomdp_py) (1.22.4)\n",
            "Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from pomdp_py) (1.10.1)\n",
            "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from pomdp_py) (4.65.0)\n",
            "Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (from pomdp_py) (3.7.1)\n",
            "Requirement already satisfied: pygame in /usr/local/lib/python3.10/dist-packages (from pomdp_py) (2.3.0)\n",
            "Requirement already satisfied: opencv-python in /usr/local/lib/python3.10/dist-packages (from pomdp_py) (4.7.0.72)\n",
            "Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->pomdp_py) (1.0.7)\n",
            "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib->pomdp_py) (0.11.0)\n",
            "Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->pomdp_py) (4.39.3)\n",
            "Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->pomdp_py) (1.4.4)\n",
            "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->pomdp_py) (23.1)\n",
            "Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->pomdp_py) (8.4.0)\n",
            "Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->pomdp_py) (3.0.9)\n",
            "Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib->pomdp_py) (2.8.2)\n",
            "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7->matplotlib->pomdp_py) (1.16.0)\n",
            "Building wheels for collected packages: pomdp_py\n",
            "  Building wheel for pomdp_py (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "  Created wheel for pomdp_py: filename=pomdp_py-1.3.2-cp310-cp310-linux_x86_64.whl size=4370158 sha256=02a89319ea859f3470c704b63ada0ee9886e2d2c14fe0c5a874cf018cc6c9b68\n",
            "  Stored in directory: /root/.cache/pip/wheels/91/59/ce/ca1bdd31a8083b61feb71fad55261ebbf6eba0a8a6f1104ac2\n",
            "Successfully built pomdp_py\n",
            "Installing collected packages: pomdp_py\n",
            "Successfully installed pomdp_py-1.3.2\n"
          ]
        }
      ],
      "source": [
        "!pip install pomdp_py"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "\"\"\"The classic Tiger problem.\n",
        "\n",
        "This is a POMDP problem; Namely, it specifies both\n",
        "the POMDP (i.e. state, action, observation space)\n",
        "and the T/O/R for the agent as well as the environment.\n",
        "\n",
        "The description of the tiger problem is as follows: (Quote from\n",
        "`POMDP: Introduction to Partially Observable Markov Decision Processes\n",
        "<https://cran.r-project.org/web/packages/pomdp/vignettes/POMDP.pdf>`_ by\n",
        "Kamalzadeh and Hahsler )\n",
        "\n",
        "A tiger is put with equal probability behind one\n",
        "of two doors, while treasure is put behind the other one.\n",
        "You are standing in front of the two closed doors and\n",
        "need to decide which one to open. If you open the door\n",
        "with the tiger, you will get hurt (negative reward).\n",
        "But if you open the door with treasure, you receive\n",
        "a positive reward. Instead of opening a door right away,\n",
        "you also have the option to wait and listen for tiger noises. But\n",
        "listening is neither free nor entirely accurate. You might hear the\n",
        "tiger behind the left door while it is actually behind the right\n",
        "door and vice versa.\n",
        "\n",
        "States: tiger-left, tiger-right\n",
        "Actions: open-left, open-right, listen\n",
        "Rewards:\n",
        "    +10 for opening treasure door. -100 for opening tiger door.\n",
        "    -1 for listening.\n",
        "Observations: You can hear either \"tiger-left\", or \"tiger-right\".\n",
        "\n",
        "Note that in this example, the TigerProblem is a POMDP that\n",
        "also contains the agent and the environment as its fields. In\n",
        "general this doesn't need to be the case. (Refer to more\n",
        "complicated examples.)\n",
        "\"\"\"\n",
        "\n",
        "import pomdp_py\n",
        "from pomdp_py.utils import TreeDebugger\n",
        "import random\n",
        "import numpy as np\n",
        "import sys\n",
        "\n",
        "class TigerState(pomdp_py.State):\n",
        "    def __init__(self, name):\n",
        "        self.name = name\n",
        "    def __hash__(self):\n",
        "        return hash(self.name)\n",
        "    def __eq__(self, other):\n",
        "        if isinstance(other, TigerState):\n",
        "            return self.name == other.name\n",
        "        return False\n",
        "    def __str__(self):\n",
        "        return self.name\n",
        "    def __repr__(self):\n",
        "        return \"TigerState(%s)\" % self.name\n",
        "\n",
        "    def other(self):\n",
        "        if self.name.endswith(\"left\"):\n",
        "            return TigerState(\"tiger-right\")\n",
        "        else:\n",
        "            return TigerState(\"tiger-left\")\n",
        "\n",
        "\n",
        "class TigerAction(pomdp_py.Action):\n",
        "    def __init__(self, name):\n",
        "        self.name = name\n",
        "    def __hash__(self):\n",
        "        return hash(self.name)\n",
        "    def __eq__(self, other):\n",
        "        if isinstance(other, TigerAction):\n",
        "            return self.name == other.name\n",
        "        return False\n",
        "    def __str__(self):\n",
        "        return self.name\n",
        "    def __repr__(self):\n",
        "        return \"TigerAction(%s)\" % self.name\n",
        "\n",
        "\n",
        "class TigerObservation(pomdp_py.Observation):\n",
        "    def __init__(self, name):\n",
        "        self.name = name\n",
        "    def __hash__(self):\n",
        "        return hash(self.name)\n",
        "    def __eq__(self, other):\n",
        "        if isinstance(other, TigerObservation):\n",
        "            return self.name == other.name\n",
        "        return False\n",
        "    def __str__(self):\n",
        "        return self.name\n",
        "    def __repr__(self):\n",
        "        return \"TigerObservation(%s)\" % self.name\n",
        "\n",
        "\n",
        "# Observation model\n",
        "\n",
        "class ObservationModel(pomdp_py.ObservationModel):\n",
        "    def __init__(self, noise=0.15):\n",
        "        self.noise = noise\n",
        "\n",
        "\n",
        "    def probability(self, observation, next_state, action):\n",
        "        if action.name == \"listen\":\n",
        "            # heard the correct growl\n",
        "            if observation.name == next_state.name:\n",
        "                return 1.0 - self.noise\n",
        "            else:\n",
        "                return self.noise\n",
        "        else:\n",
        "            return 0.5\n",
        "\n",
        "\n",
        "\n",
        "    def sample(self, next_state, action):\n",
        "        if action.name == \"listen\":\n",
        "            thresh = 1.0 - self.noise\n",
        "        else:\n",
        "            thresh = 0.5\n",
        "\n",
        "        if random.uniform(0,1) < thresh:\n",
        "            return TigerObservation(next_state.name)\n",
        "        else:\n",
        "            return TigerObservation(next_state.other().name)\n",
        "\n",
        "\n",
        "\n",
        "    def get_all_observations(self):\n",
        "        \"\"\"Only need to implement this if you're using\n",
        "        a solver that needs to enumerate over the observation space\n",
        "        (e.g. value iteration)\"\"\"\n",
        "        return [TigerObservation(s)\n",
        "                for s in {\"tiger-left\", \"tiger-right\"}]\n",
        "\n",
        "\n",
        "# Transition Model\n",
        "\n",
        "class TransitionModel(pomdp_py.TransitionModel):\n",
        "\n",
        "    def probability(self, next_state, state, action):\n",
        "        \"\"\"According to problem spec, the world resets once\n",
        "        action is open-left/open-right. Otherwise, stays the same\"\"\"\n",
        "        if action.name.startswith(\"open\"):\n",
        "            return 0.5\n",
        "        else:\n",
        "            if next_state.name == state.name:\n",
        "                return 1.0 - 1e-9\n",
        "            else:\n",
        "                return 1e-9\n",
        "\n",
        "\n",
        "\n",
        "    def sample(self, state, action):\n",
        "        if action.name.startswith(\"open\"):\n",
        "            return random.choice(self.get_all_states())\n",
        "        else:\n",
        "            return TigerState(state.name)\n",
        "\n",
        "\n",
        "\n",
        "    def get_all_states(self):\n",
        "        \"\"\"Only need to implement this if you're using\n",
        "        a solver that needs to enumerate over the observation space (e.g. value iteration)\"\"\"\n",
        "        return [TigerState(s) for s in {\"tiger-left\", \"tiger-right\"}]\n",
        "\n",
        "\n",
        "# Reward Model\n",
        "\n",
        "class RewardModel(pomdp_py.RewardModel):\n",
        "    def _reward_func(self, state, action):\n",
        "        if action.name == \"open-left\":\n",
        "            if state.name == \"tiger-right\":\n",
        "                return 10\n",
        "            else:\n",
        "                return -100\n",
        "        elif action.name == \"open-right\":\n",
        "            if state.name == \"tiger-left\":\n",
        "                return 10\n",
        "            else:\n",
        "                return -100\n",
        "        else: # listen\n",
        "            return -1\n",
        "\n",
        "\n",
        "    def sample(self, state, action, next_state):\n",
        "        # deterministic\n",
        "        return self._reward_func(state, action)\n",
        "\n",
        "\n",
        "# Policy Model\n",
        "\n",
        "class PolicyModel(pomdp_py.RolloutPolicy):\n",
        "    \"\"\"A simple policy model with uniform prior over a\n",
        "       small, finite action space\"\"\"\n",
        "    ACTIONS = {TigerAction(s)\n",
        "              for s in {\"open-left\", \"open-right\", \"listen\"}}\n",
        "\n",
        "\n",
        "    def sample(self, state):\n",
        "        return random.sample(self.get_all_actions(), 1)[0]\n",
        "\n",
        "\n",
        "\n",
        "    def rollout(self, state, history=None):\n",
        "        \"\"\"Treating this PolicyModel as a rollout policy\"\"\"\n",
        "        return self.sample(state)\n",
        "\n",
        "\n",
        "\n",
        "    def get_all_actions(self, state=None, history=None):\n",
        "        return PolicyModel.ACTIONS\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "class TigerProblem(pomdp_py.POMDP):\n",
        "    \"\"\"\n",
        "    In fact, creating a TigerProblem class is entirely optional\n",
        "    to simulate and solve POMDPs. But this is just an example\n",
        "    of how such a class can be created.\n",
        "    \"\"\"\n",
        "\n",
        "    def __init__(self, obs_noise, init_true_state, init_belief):\n",
        "        \"\"\"init_belief is a Distribution.\"\"\"\n",
        "        agent = pomdp_py.Agent(init_belief,\n",
        "                               PolicyModel(),\n",
        "                               TransitionModel(),\n",
        "                               ObservationModel(obs_noise),\n",
        "                               RewardModel())\n",
        "        env = pomdp_py.Environment(init_true_state,\n",
        "                                   TransitionModel(),\n",
        "                                   RewardModel())\n",
        "        super().__init__(agent, env, name=\"TigerProblem\")\n",
        "\n",
        "\n",
        "    @staticmethod\n",
        "    def create(state=\"tiger-left\", belief=0.5, obs_noise=0.15):\n",
        "        \"\"\"\n",
        "        Args:\n",
        "            state (str): could be 'tiger-left' or 'tiger-right';\n",
        "                         True state of the environment\n",
        "            belief (float): Initial belief that the target is\n",
        "                            on the left; Between 0-1.\n",
        "            obs_noise (float): Noise for the observation\n",
        "                               model (default 0.15)\n",
        "        \"\"\"\n",
        "        init_true_state = TigerState(state)\n",
        "        init_belief = pomdp_py.Histogram({\n",
        "            TigerState(\"tiger-left\"): belief,\n",
        "            TigerState(\"tiger-right\"): 1.0 - belief\n",
        "        })\n",
        "        tiger_problem = TigerProblem(obs_noise,\n",
        "                                     init_true_state, init_belief)\n",
        "        tiger_problem.agent.set_belief(init_belief, prior=True)\n",
        "        return tiger_problem\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "def test_planner(tiger_problem, planner, nsteps=3,\n",
        "                 debug_tree=False):\n",
        "    \"\"\"\n",
        "    Runs the action-feedback loop of Tiger problem POMDP\n",
        "\n",
        "    Args:\n",
        "        tiger_problem (TigerProblem): a problem instance\n",
        "        planner (Planner): a planner\n",
        "        nsteps (int): Maximum number of steps to run this loop.\n",
        "        debug_tree (bool): True if get into the pdb with a\n",
        "                           TreeDebugger created as 'dd' variable.\n",
        "    \"\"\"\n",
        "    for i in range(nsteps):\n",
        "        action = planner.plan(tiger_problem.agent)\n",
        "        if debug_tree:\n",
        "            from pomdp_py.utils import TreeDebugger\n",
        "            dd = TreeDebugger(tiger_problem.agent.tree)\n",
        "            import pdb; pdb.set_trace()\n",
        "\n",
        "        print(\"==== Step %d ====\" % (i+1))\n",
        "        print(\"True state:\", tiger_problem.env.state)\n",
        "        print(\"Belief:\", tiger_problem.agent.cur_belief)\n",
        "        print(\"Action:\", action)\n",
        "        # There is no state transition for the tiger domain.\n",
        "        # In general, the ennvironment state can be transitioned\n",
        "        # using\n",
        "        #\n",
        "        #   reward = tiger_problem.env.state_transition(action, execute=True)\n",
        "        #\n",
        "        # Or, it is possible that you don't have control\n",
        "        # over the environment change (e.g. robot acting\n",
        "        # in real world); In that case, you could skip\n",
        "        # the state transition and re-estimate the state\n",
        "        # (e.g. through the perception stack on the robot).\n",
        "        reward = tiger_problem.env.reward_model.sample(tiger_problem.env.state, action, None)\n",
        "        print(\"Reward:\", reward)\n",
        "\n",
        "        # Let's create some simulated real observation;\n",
        "        # Update the belief Creating true observation for\n",
        "        # sanity checking solver behavior. In general, this\n",
        "        # observation should be sampled from agent's observation\n",
        "        # model, as\n",
        "        #\n",
        "        #    real_observation = tiger_problem.agent.observation_model.sample(tiger_problem.env.state, action)\n",
        "        #\n",
        "        # or coming from an external source (e.g. robot sensor\n",
        "        # reading). Note that tiger_problem.env.state stores the\n",
        "        # environment state after action execution.\n",
        "        real_observation = TigerObservation(tiger_problem.env.state.name)\n",
        "        print(\">> Observation:\",  real_observation)\n",
        "        tiger_problem.agent.update_history(action, real_observation)\n",
        "\n",
        "        # If the planner is POMCP, planner.update also updates agent belief.\n",
        "        planner.update(tiger_problem.agent, action, real_observation)\n",
        "        if isinstance(planner, pomdp_py.POUCT):\n",
        "            print(\"Num sims:\", planner.last_num_sims)\n",
        "            print(\"Plan time: %.5f\" % planner.last_planning_time)\n",
        "\n",
        "        if isinstance(tiger_problem.agent.cur_belief,\n",
        "                      pomdp_py.Histogram):\n",
        "            new_belief = pomdp_py.update_histogram_belief(\n",
        "                tiger_problem.agent.cur_belief,\n",
        "                action, real_observation,\n",
        "                tiger_problem.agent.observation_model,\n",
        "                tiger_problem.agent.transition_model)\n",
        "            tiger_problem.agent.set_belief(new_belief)\n",
        "\n",
        "        if action.name.startswith(\"open\"):\n",
        "            # Make it clearer to see what actions are taken\n",
        "            # until every time door is opened.\n",
        "            print(\"\\n\")\n",
        "\n",
        "\n",
        "\n",
        "def main():\n",
        "    init_true_state = random.choice([TigerState(\"tiger-left\"),\n",
        "                                     TigerState(\"tiger-right\")])\n",
        "    init_belief = pomdp_py.Histogram({TigerState(\"tiger-left\"): 0.5,\n",
        "                                      TigerState(\"tiger-right\"): 0.5})\n",
        "    tiger_problem = TigerProblem(0.15,  # observation noise\n",
        "                                 init_true_state, init_belief)\n",
        "\n",
        "    print(\"** Testing value iteration **\")\n",
        "    vi = pomdp_py.ValueIteration(horizon=3, discount_factor=0.95)\n",
        "    test_planner(tiger_problem, vi, nsteps=3)\n",
        "\n",
        "    # Reset agent belief\n",
        "    tiger_problem.agent.set_belief(init_belief, prior=True)\n",
        "\n",
        "    print(\"\\n** Testing POUCT **\")\n",
        "    pouct = pomdp_py.POUCT(max_depth=3, discount_factor=0.95,\n",
        "                           num_sims=4096, exploration_const=50,\n",
        "                           rollout_policy=tiger_problem.agent.policy_model,\n",
        "                           show_progress=True)\n",
        "    test_planner(tiger_problem, pouct, nsteps=10)\n",
        "    TreeDebugger(tiger_problem.agent.tree).pp\n",
        "\n",
        "    # Reset agent belief\n",
        "    tiger_problem.agent.set_belief(init_belief, prior=True)\n",
        "    tiger_problem.agent.tree = None\n",
        "\n",
        "    print(\"** Testing POMCP **\")\n",
        "    tiger_problem.agent.set_belief(pomdp_py.Particles.from_histogram(init_belief, num_particles=100), prior=True)\n",
        "    pomcp = pomdp_py.POMCP(max_depth=3, discount_factor=0.95,\n",
        "                           num_sims=1000, exploration_const=50,\n",
        "                           rollout_policy=tiger_problem.agent.policy_model,\n",
        "                           show_progress=True, pbar_update_interval=500)\n",
        "    test_planner(tiger_problem, pomcp, nsteps=10)\n",
        "    TreeDebugger(tiger_problem.agent.tree).pp\n",
        "\n",
        "\n",
        "#if __name__ == '__main__':\n",
        "#main()"
      ],
      "metadata": {
        "id": "d3kTdnkXxdb4"
      },
      "execution_count": 2,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "main()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "9s0V9Iv6yFpk",
        "outputId": "060083d0-333f-4ee5-c0fd-d83b80bdfd06"
      },
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "** Testing value iteration **\n",
            "==== Step 1 ====\n",
            "True state: tiger-right\n",
            "Belief: {TigerState(tiger-left): 0.5, TigerState(tiger-right): 0.5}\n",
            "Action: listen\n",
            "Reward: -1\n",
            ">> Observation: tiger-right\n",
            "==== Step 2 ====\n",
            "True state: tiger-right\n",
            "Belief: {TigerState(tiger-left): 0.15, TigerState(tiger-right): 0.85}\n",
            "Action: listen\n",
            "Reward: -1\n",
            ">> Observation: tiger-right\n",
            "==== Step 3 ====\n",
            "True state: tiger-right\n",
            "Belief: {TigerState(tiger-left): 0.03020134244268276, TigerState(tiger-right): 0.9697986575573173}\n",
            "Action: open-left\n",
            "Reward: 10\n",
            ">> Observation: tiger-right\n",
            "\n",
            "\n",
            "\n",
            "** Testing POUCT **\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "  0%|          | 0/4096 [00:00<?, ?it/s]<ipython-input-2-4e9fc990eaef>:198: DeprecationWarning: Sampling from a set deprecated\n",
            "since Python 3.9 and will be removed in a subsequent version.\n",
            "  return random.sample(self.get_all_actions(), 1)[0]\n",
            "100%|█████████▉| 4095/4096 [00:01<00:00, 2234.47it/s]\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "==== Step 1 ====\n",
            "True state: tiger-right\n",
            "Belief: {TigerState(tiger-left): 0.5, TigerState(tiger-right): 0.5}\n",
            "Action: listen\n",
            "Reward: -1\n",
            ">> Observation: tiger-right\n",
            "Num sims: 4096\n",
            "Plan time: 1.83246\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "100%|█████████▉| 4095/4096 [00:02<00:00, 1927.00it/s]\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "==== Step 2 ====\n",
            "True state: tiger-right\n",
            "Belief: {TigerState(tiger-left): 0.15, TigerState(tiger-right): 0.85}\n",
            "Action: listen\n",
            "Reward: -1\n",
            ">> Observation: tiger-right\n",
            "Num sims: 4096\n",
            "Plan time: 2.12496\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "100%|█████████▉| 4095/4096 [00:01<00:00, 2119.57it/s]\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "==== Step 3 ====\n",
            "True state: tiger-right\n",
            "Belief: {TigerState(tiger-left): 0.03020134244268276, TigerState(tiger-right): 0.9697986575573173}\n",
            "Action: listen\n",
            "Reward: -1\n",
            ">> Observation: tiger-right\n",
            "Num sims: 4096\n",
            "Plan time: 1.93191\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "100%|█████████▉| 4095/4096 [00:01<00:00, 2233.25it/s]\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "==== Step 4 ====\n",
            "True state: tiger-right\n",
            "Belief: {TigerState(tiger-left): 0.005465587248755101, TigerState(tiger-right): 0.994534412751245}\n",
            "Action: open-left\n",
            "Reward: 10\n",
            ">> Observation: tiger-right\n",
            "Num sims: 4096\n",
            "Plan time: 1.83354\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "100%|█████████▉| 4095/4096 [00:01<00:00, 2075.17it/s]\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "==== Step 7 ====\n",
            "True state: tiger-right\n",
            "Belief: {TigerState(tiger-left): 0.03020134244268276, TigerState(tiger-right): 0.9697986575573173}\n",
            "Action: open-left\n",
            "Reward: 10\n",
            ">> Observation: tiger-right\n",
            "Num sims: 4096\n",
            "Plan time: 1.97321\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "100%|██████████| 1000/1000 [00:00<00:00, 2571.18it/s]\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "==== Step 3 ====\n",
            "True state: tiger-right\n",
            "Belief: [(TigerState(tiger-right), 0.963963963963964), (TigerState(tiger-left), 0.036036036036036036)]\n",
            "Action: open-left\n",
            "Reward: 10\n",
            ">> Observation: tiger-right\n",
            "Num sims: 1000\n",
            "Plan time: 0.37717\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "100%|██████████| 1000/1000 [00:00<00:00, 8944.68it/s]\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "==== Step 4 ====\n",
            "True state: tiger-right\n",
            "Belief: [(TigerState(tiger-left), 0.48307692307692307), (TigerState(tiger-right), 0.5169230769230769)]\n",
            "Action: listen\n",
            "Reward: -1\n",
            ">> Observation: tiger-right\n",
            "Num sims: 1000\n",
            "Plan time: 0.10701\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "100%|██████████| 1000/1000 [00:00<00:00, 10977.09it/s]\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "==== Step 5 ====\n",
            "True state: tiger-right\n",
            "Belief: [(TigerState(tiger-right), 0.8574144486692015), (TigerState(tiger-left), 0.14258555133079848)]\n",
            "Action: listen\n",
            "Reward: -1\n",
            ">> Observation: tiger-right\n",
            "Num sims: 1000\n",
            "Plan time: 0.08772\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "100%|██████████| 1000/1000 [00:00<00:00, 15085.15it/s]\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "==== Step 6 ====\n",
            "True state: tiger-right\n",
            "Belief: [(TigerState(tiger-right), 0.9665327978580991), (TigerState(tiger-left), 0.03346720214190094)]\n",
            "Action: open-left\n",
            "Reward: 10\n",
            ">> Observation: tiger-right\n",
            "Num sims: 1000\n",
            "Plan time: 0.06282\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "100%|██████████| 1000/1000 [00:00<00:00, 14211.33it/s]\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "==== Step 7 ====\n",
            "True state: tiger-right\n",
            "Belief: [(TigerState(tiger-left), 0.49272349272349275), (TigerState(tiger-right), 0.5072765072765073)]\n",
            "Action: listen\n",
            "Reward: -1\n",
            ">> Observation: tiger-right\n",
            "Num sims: 1000\n",
            "Plan time: 0.06644\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "100%|██████████| 1000/1000 [00:00<00:00, 14120.38it/s]\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "==== Step 8 ====\n",
            "True state: tiger-right\n",
            "Belief: [(TigerState(tiger-right), 0.8577154308617234), (TigerState(tiger-left), 0.14228456913827656)]\n",
            "Action: listen\n",
            "Reward: -1\n",
            ">> Observation: tiger-right\n",
            "Num sims: 1000\n",
            "Plan time: 0.06736\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "100%|██████████| 1000/1000 [00:00<00:00, 14450.11it/s]\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "==== Step 9 ====\n",
            "True state: tiger-right\n",
            "Belief: [(TigerState(tiger-right), 0.9727520435967303), (TigerState(tiger-left), 0.027247956403269755)]\n",
            "Action: open-left\n",
            "Reward: 10\n",
            ">> Observation: tiger-right\n",
            "Num sims: 1000\n",
            "Plan time: 0.06546\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "100%|██████████| 1000/1000 [00:00<00:00, 14934.85it/s]\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "==== Step 10 ====\n",
            "True state: tiger-right\n",
            "Belief: [(TigerState(tiger-right), 0.5069582504970179), (TigerState(tiger-left), 0.49304174950298213)]\n",
            "Action: listen\n",
            "Reward: -1\n",
            ">> Observation: tiger-right\n",
            "Num sims: 1000\n",
            "Plan time: 0.06340\n",
            "\u001b[92m_VNodePP\u001b[0m(n=890, v=0.235)\u001b[96m(depth=0)\u001b[0m\n",
            "├─── ₀\u001b[92mlisten\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=846, v=0.235)\n",
            "│    ├─── ₀\u001b[91mtiger-left\u001b[0m⟶\u001b[92m_VNodePP\u001b[0m(n=165, v=-3.766)\u001b[96m(depth=1)\u001b[0m\n",
            "│    │    ├─── ₀\u001b[92mlisten\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=159, v=-3.766)\n",
            "│    │    │    ├─── ₀\u001b[91mtiger-left\u001b[0m⟶\u001b[92m_VNodePP\u001b[0m(n=35, v=-1.000)\u001b[96m(depth=2)\u001b[0m\n",
            "│    │    │    │    ├─── ₀\u001b[92mlisten\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=19, v=-1.000)\n",
            "│    │    │    │    └─── ₂\u001b[92mopen-right\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=15, v=-4.667)\n",
            "│    │    │    └─── ₁\u001b[91mtiger-right\u001b[0m⟶\u001b[92m_VNodePP\u001b[0m(n=65, v=-1.000)\u001b[96m(depth=2)\u001b[0m\n",
            "│    │    │         ├─── ₀\u001b[92mlisten\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=34, v=-1.000)\n",
            "│    │    │         ├─── ₁\u001b[92mopen-left\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=30, v=-4.667)\n",
            "│    │    ├─── ₁\u001b[92mopen-left\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=2, v=-100.000)\n",
            "│    │    └─── ₂\u001b[92mopen-right\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=4, v=-68.988)\n",
            "│    │         └─── ₀\u001b[91mtiger-right\u001b[0m⟶\u001b[92m_VNodePP\u001b[0m(n=2, v=0.000)\u001b[96m(depth=2)\u001b[0m\n",
            "│    └─── ₁\u001b[91mtiger-right\u001b[0m⟶\u001b[92m_VNodePP\u001b[0m(n=564, v=5.422)\u001b[96m(depth=1)\u001b[0m\n",
            "│         ├─── ₀\u001b[92mlisten\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=110, v=-0.689)\n",
            "│         │    ├─── ₀\u001b[91mtiger-left\u001b[0m⟶\u001b[92m_VNodePP\u001b[0m(n=12, v=10.000)\u001b[96m(depth=2)\u001b[0m\n",
            "│         │    │    ├─── ₀\u001b[92mlisten\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=4, v=-1.000)\n",
            "│         │    │    ├─── ₁\u001b[92mopen-left\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=6, v=10.000)\n",
            "│         │    │    └─── ₂\u001b[92mopen-right\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=2, v=-45.000)\n",
            "│         │    └─── ₁\u001b[91mtiger-right\u001b[0m⟶\u001b[92m_VNodePP\u001b[0m(n=62, v=4.634)\u001b[96m(depth=2)\u001b[0m\n",
            "│         │         ├─── ₀\u001b[92mlisten\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=20, v=-1.000)\n",
            "│         │         ├─── ₁\u001b[92mopen-left\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=41, v=4.634)\n",
            "│         ├─── ₁\u001b[92mopen-left\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=452, v=5.422)\n",
            "│         │    ├─── ₀\u001b[91mtiger-left\u001b[0m⟶\u001b[92m_VNodePP\u001b[0m(n=140, v=-1.000)\u001b[96m(depth=2)\u001b[0m\n",
            "│         │    │    ├─── ₀\u001b[92mlisten\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=134, v=-1.000)\n",
            "│         │    │    ├─── ₁\u001b[92mopen-left\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=3, v=-63.333)\n",
            "│         │    │    └─── ₂\u001b[92mopen-right\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=3, v=-63.333)\n",
            "│         │    └─── ₁\u001b[91mtiger-right\u001b[0m⟶\u001b[92m_VNodePP\u001b[0m(n=160, v=-1.000)\u001b[96m(depth=2)\u001b[0m\n",
            "│         │         ├─── ₀\u001b[92mlisten\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=152, v=-1.000)\n",
            "│         │         ├─── ₁\u001b[92mopen-left\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=5, v=-56.000)\n",
            "│         │         └─── ₂\u001b[92mopen-right\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=3, v=-63.333)\n",
            "│         └─── ₂\u001b[92mopen-right\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=2, v=-100.000)\n",
            "├─── ₁\u001b[92mopen-left\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=42, v=-16.032)\n",
            "│    └─── ₁\u001b[91mtiger-right\u001b[0m⟶\u001b[92m_VNodePP\u001b[0m(n=2, v=10.000)\u001b[96m(depth=1)\u001b[0m\n",
            "└─── ₂\u001b[92mopen-right\u001b[0m⟶\u001b[91m_QNodePP\u001b[0m(n=2, v=-100.000)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [],
      "metadata": {
        "id": "1cXg0dGRyHTN"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [],
      "metadata": {
        "id": "G43lHR_Rywe_"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [],
      "metadata": {
        "id": "P4eGTg8iy5j-"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}