\documentclass[oneside, article, twocolumn]{memoir}
\usepackage{amsmath, microtype}
\usepackage[margin=0.5in]{geometry}
\usepackage[charter]{mathdesign}
% Line spacing.
\linespread{1.1}
% Vector norm notation.
\newcommand{\vectornorm}[1]{\left|\left|#1\right|\right|}
% arg max operator.
\newcommand{\argmax}{\operatornamewithlimits{arg\,max}}
% Make verbatim smaller.
\setverbatimfont{\normalfont\ttfamily\footnotesize}
\begin{document}
\title{COS 402 Fall 2010: Value Iteration.}
\author{we can make broad strokes}
\maketitle
% Evil paragraph indenting style is required for these short notes.
\nonzeroparskip
\setlength{\parindent}{0pt}
\subsection{Last time.}
Last time, we had our toy environment with states $s_0, s_1, \ldots$, where $s_i$ can transition to $s_{i+1}$ and $R(s_i)$ is the reward.
We defined the utility of a state sequence as $\sum_{t=0}^{\infty} \gamma^t R(s_t)$, where $0 < \gamma < 1$. We let $\pi$ be a policy and $\pi^*$ be the optimal policy.
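As a sanity check on why we require $\gamma < 1$: even if every reward were the maximum reward $R_{\max}$, the utility would still be finite, since
%
\begin{align*}
\sum_{t=0}^{\infty} \gamma^t R_{\max} = \frac{R_{\max}}{1 - \gamma},
\end{align*}
%
which is exactly the bound on utilities used in the convergence proof below.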
We then defined $U^{\pi}(s) = E[\text{utility of the state sequence} | \pi, s_0 = s]$. Let $U^*(s) = U^{\pi^*}(s)$.
But if we can find $U^*$, we can find
%
\begin{align*}
\pi^*(s) = \argmax_a \sum_{s'} \Pr[s' | s, a] U^*(s'),
\end{align*}
%
i.e.\ choose the action that maximizes the expected utility of the next state (a one-step lookahead). This then leads to the Bellman equations:
\begin{align*}
U^*(s) = R(s) + \gamma \max_a \sum_{s'} \Pr[s' | s, a] U^*(s').
\end{align*}
This then led to \emph{value iteration}:
\begin{verbatim}
U[0](s) = 0 for all s
for i in (0 .. inf):
    for all s:
        U[i + 1](s) = R(s) +
            gamma * max[a](sum[s'](Pr[s' | s, a] U[i](s')))
    if max[s] |U[i](s) - U[i + 1](s)| < epsilon:
        return U[i + 1]
\end{verbatim}
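For concreteness, here is a minimal runnable Python sketch of the same loop. The three-state chain, its rewards, the transition table, and the values of \texttt{gamma} and \texttt{epsilon} are all invented purely for illustration:
\begin{verbatim}
# Value iteration on a made-up 3-state chain MDP:
# states 0 -> 1 -> 2, with a reward of 1 at state 2.
gamma, epsilon = 0.9, 1e-6
states = [0, 1, 2]
actions = ["stay", "go"]
R = {0: 0.0, 1: 0.0, 2: 1.0}
# P[s][a] = list of (next state, probability) pairs.
P = {0: {"stay": [(0, 1.0)],
         "go":   [(1, 0.8), (0, 0.2)]},
     1: {"stay": [(1, 1.0)],
         "go":   [(2, 0.8), (1, 0.2)]},
     2: {"stay": [(2, 1.0)],
         "go":   [(2, 1.0)]}}

U = {s: 0.0 for s in states}      # U[0](s) = 0
while True:
    # Bellman update: U[i+1] = B(U[i]).
    U_new = {s: R[s] + gamma * max(
                 sum(p * U[t] for t, p in P[s][a])
                 for a in actions)
             for s in states}
    # Max-norm of the change between iterations.
    delta = max(abs(U_new[s] - U[s]) for s in states)
    U = U_new
    if delta < epsilon:
        break                     # U approximates U*
\end{verbatim}
The stopping test measures the largest change in any component of the utility vector, which is exactly the $L_\infty$ distance used in the convergence proof below.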
\subsection{Proof of convergence for value iteration.}
We can think of each of $U_0, U_1, \ldots, U^*$ as a vector with one component per state. How should we measure the distance from $U_i$ to $U_{i+1}$? It turns out that Euclidean distance (the 2-norm) is a bad choice. Instead we use the max-norm $L_\infty$. That is,
%
\begin{align*}
\vectornorm{U}_\infty = \max_s |U(s)|.
\end{align*}
%
Then what value iteration actually does is apply a \emph{Bellman update} operator $B$ that maps the old vector to a new one, i.e.
\begin{verbatim}
...
U[i + 1] = B(U[i]).
...
\end{verbatim}
We claim $\vectornorm{B(U) - B(U')}_\infty \leq \gamma \vectornorm{U - U'}_\infty$; that is, $B$ is a \emph{contraction}. (The proof is assigned in homework 6.) The rest of the convergence proof goes as follows: we know $U^* = B(U^*)$, i.e.\ $U^*$ is a \emph{fixed point} of $B$. Then
%
\begin{align*}
\vectornorm{U_{i+1} - U^*}_\infty
  = \vectornorm{B(U_i) - B(U^*)}_\infty
  \leq \gamma \vectornorm{U_i - U^*}_\infty.
\end{align*}
You can view $\vectornorm{U_i - U^*}_\infty$ as the \emph{error} in estimate $U_i$, and this inequality tells us the error shrinks by a factor of $\gamma$ with each iteration.
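Applying the contraction $N$ times gives
%
\begin{align*}
\vectornorm{U_N - U^*}_\infty \leq \gamma^N \vectornorm{U_0 - U^*}_\infty.
\end{align*}
%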
Utilities of states are bounded by $\pm R_{\max} / (1 - \gamma)$ where $R_{\max}$ is the maximum reward (see book for explanation). Therefore,
\begin{align*}
\vectornorm{U_0 - U^*}_\infty \leq \frac{2 R_{\max}}{1 - \gamma}.
\end{align*}
Since the error shrinks by a factor of $\gamma$ each iteration, after $N$ iterations it is at most $\gamma^N \cdot 2 R_{\max} / (1 - \gamma)$. We can now solve for $N$ given an arbitrary error tolerance $\epsilon$. The rest is given in the book (17.2); the board derivation was really bad here, but the book makes sense.
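Spelling out that last step of algebra: requiring the bound to be at most $\epsilon$ means
%
\begin{align*}
\gamma^N \cdot \frac{2 R_{\max}}{1 - \gamma} &\leq \epsilon \\
\iff\quad N &\geq \frac{\log\bigl(2 R_{\max} / (\epsilon(1 - \gamma))\bigr)}{\log(1/\gamma)},
\end{align*}
%
so the number of iterations needed grows only logarithmically in $1/\epsilon$.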
\subsection{Policy iteration.}
What if we just found the policy directly?
``Below, $\pi_i$ is \emph{not} the same as the $\pi_i$ above. Different algorithm.''
\begin{verbatim}
pi[0] = any policy
for i in (0 .. inf):
    compute U[pi[i]]
    for all s:
        pi[i + 1](s) = argmax[a](
            sum[s'](Pr[s' | s, a] U[pi[i]](s'))
        )
\end{verbatim}
The algorithm looks like this: it takes $\pi_0$, finds $U^{\pi_0}$, uses it to find $\pi_1$, which is used to find $U^{\pi_1}$, and so on and so forth. The $U^{\pi_i}$-to-$\pi_{i+1}$ step is called \emph{greedification}.
It's possible for this algorithm to converge faster than value iteration due to its greediness [proof by example on board].
This is all fine, but how do you compute $U^{\pi_i}$? Bellman equations!
%
\begin{align*}
U^\pi(s) = R(s) + \gamma \sum_{s'} \Pr[s' | s, \pi(s)] U^\pi(s').
\end{align*}
%
But $U^\pi$ is unknown! What do we do? Well, we get $n$ equations in $n$ unknowns, one per state $s$, and they are linear (no max, since the policy fixes the action). So we can use linear algebra. \emph{Or} we can do a simplified value iteration: plug in an arbitrary initial guess for $U^\pi$, iterate the equation above, and hope it converges. This step is called \emph{policy evaluation}.
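To make the linear-algebra route concrete, here is a small Python/NumPy sketch. The toy transition matrices are made up, the helper names \texttt{evaluate} and \texttt{greedify} are invented here, and the explicit stopping test (halt when the policy stops changing) is added for illustration since the pseudocode above leaves termination implicit. It evaluates a policy exactly by solving $(I - \gamma P_\pi) U = R$ and then greedifies:
\begin{verbatim}
import numpy as np

# Made-up 3-state example; P[a] is the transition
# matrix for action a, R is the reward vector.
gamma, n = 0.9, 3
R = np.array([0.0, 0.0, 1.0])
P = {"stay": np.eye(n),
     "go":   np.array([[0.2, 0.8, 0.0],
                       [0.0, 0.2, 0.8],
                       [0.0, 0.0, 1.0]])}
actions = list(P)

def evaluate(pi):
    # Exact policy evaluation: solve the n linear
    # equations (I - gamma * P_pi) U = R.
    P_pi = np.array([P[pi[s]][s] for s in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)

def greedify(U):
    # Greedification: one-step lookahead per state.
    return [max(actions, key=lambda a: P[a][s] @ U)
            for s in range(n)]

pi = ["stay"] * n            # pi[0] = any policy
while True:
    U = evaluate(pi)         # compute U[pi[i]]
    pi_new = greedify(U)
    if pi_new == pi:         # policy stable: done
        break
    pi = pi_new
\end{verbatim}
Here \texttt{evaluate} is the policy-evaluation step and \texttt{greedify} is the one-step-lookahead argmax from the pseudocode above.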
\subsection{Proof of convergence for policy iteration.}
The \emph{policy improvement theorem} states that utility never decreases from one iteration to the next; i.e., for all $s$
%
\begin{align*}
U^{\pi_{i+1}}(s) \geq U^{\pi_i}(s);
\end{align*}
%
furthermore, the inequality is strict for at least one $s$ unless $\pi_i$ is already optimal. The proof is omitted due to difficulty. This implies the algorithm converges to an optimal policy, but it doesn't give a bound on the number of iterations. In practice it's pretty fast, and there's open research on better bounds.
\subsection{Partially observable Markov decision process (POMDP).}
But robots don't always know what state they're in! These algorithms break when that happens, but we can combine what we do know with them. We're only surveying this at a high level in this class.
Here's what we assume: the state is hidden; we still have a transition model, actions, and rewards; and now there's an observation model $O$! Wait, that's new! Define it as
\begin{align*}
O(s, o) = \Pr[o | s] = \Pr[\text{observe $o$ in state $s$}].
\end{align*}
We can't compute the real state, but we \emph{can} compute the \emph{belief state}, a probability distribution over states, and view it as the visible state of a fully observable MDP (i.e.\ self-delusion). These belief states (full-on probability distributions, not discrete objects) are bigger and harder to work with, so there is still plenty of optimizing to do.
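Concretely, the standard belief-update (filtering) step, which follows from the models above plus Bayes' rule: after taking action $a$ in belief state $b$ and then observing $o$, the new belief is
%
\begin{align*}
b'(s') = \alpha \, O(s', o) \sum_{s} \Pr[s' | s, a] \, b(s),
\end{align*}
%
where $\alpha$ is a normalizing constant; i.e.\ push the belief through the transition model, reweight by the observation model, and renormalize.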
\subsection{On the next \emph{Arrested Development}...}
Machine learning! Categorizing objects! Face detection!
\end{document}
# Local Variables:
# mode: longlines
# End: