\documentclass[oneside, article, twocolumn]{memoir}
\usepackage{amsmath, microtype}
\usepackage[margin=0.5in]{geometry}
\usepackage[charter]{mathdesign}
% Line spacing.
\linespread{1.1}
% Vector norm notation.
\newcommand{\vectornorm}[1]{\left|\left|#1\right|\right|}
% arg max operator.
\newcommand{\argmax}{\operatornamewithlimits{arg\,max}}
% Make verbatim smaller.
\setverbatimfont{\normalfont\ttfamily\footnotesize}
\begin{document}
\title{COS 402 Fall 2010: Value Iteration.}
\author{we can make broad strokes}
\maketitle
% Evil paragraph indenting style is required for these short notes.
\nonzeroparskip
\setlength{\parindent}{0pt}

\subsection{Last time.}
Last time, we had our toy environment with states $s_0, s_1, \ldots$, where $s_i$ can transition to $s_{i+1}$ and $R(s_i)$ is the reward.

We defined the utility of a state sequence as $\sum_{t=0}^{\infty} \gamma^t R(s_t)$ where $0 < \gamma < 1$. We let $\pi$ be a policy and $\pi^*$ be the optimal policy.

We then defined $U^{\pi}(s) = E[\text{utility of state sequence} | \pi, s_0 = s]$ and let $U^*(s) = U^{\pi^*}(s)$.

But if we can find $U^*$, we can find the optimal policy via
%
\begin{align*}
\pi^*(s) = \argmax_a \sum_{s'} \Pr[s' | s, a] U^*(s'),
\end{align*}
%
i.e.\ in each state, choose the action that maximizes the expected utility of the successor state. This then leads to the Bellman equations:
\begin{align*}
U^*(s) = R(s) + \gamma \max_a \sum_{s'} \Pr[s' | s, a] U^*(s').
\end{align*}
This then led to \emph{value iteration}:
\begin{verbatim}
U[0](s) = 0 for all s
for i in (0 .. inf):
    for all s:
        U[i + 1](s) = R(s) +
            gamma * max[a](sum[s'](Pr[s' | s, a] U[i](s')))
    if max[s]|U[i](s) - U[i + 1](s)| < epsilon:
        return U[i + 1]
\end{verbatim}
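For concreteness, here is a minimal executable sketch of value iteration in Python (not from lecture); the three-state toy MDP, the transition table \texttt{P}, and the reward table \texttt{R} below are made-up placeholders rather than the actual assignment environment.
\begin{verbatim}
# Value iteration sketch (illustrative only; toy MDP is made up).
GAMMA, EPSILON = 0.9, 1e-6
STATES, ACTIONS = [0, 1, 2], ["stay", "go"]
R = {0: 0.0, 1: 0.0, 2: 1.0}                       # reward R(s)
P = {s: {a: {} for a in ACTIONS} for s in STATES}  # P[s][a][s2] = Pr[s2 | s, a]
for s in STATES:
    P[s]["stay"][s] = 1.0
    P[s]["go"][min(s + 1, 2)] = 1.0

U = {s: 0.0 for s in STATES}                       # U[0](s) = 0 for all s
while True:
    # Bellman update: U[i + 1](s) = R(s) + gamma * max_a E[U[i](s')].
    U_new = {s: R[s] + GAMMA * max(sum(p * U[s2] for s2, p in P[s][a].items())
                                   for a in ACTIONS)
             for s in STATES}
    if max(abs(U_new[s] - U[s]) for s in STATES) < EPSILON:
        break                                      # max-norm change is small
    U = U_new

print(U_new)
\end{verbatim}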
\subsection{Proof of convergence for value iteration.}
We can think of each $U_0, U_1, \ldots, U^*$ as a vector. Then how \emph{long} is a vector like this, i.e.\ how do we measure the distance from each $U_i$ to $U_{i+1}$? It turns out that Euclidean distance (the 2-norm) is a bad choice. Instead we use the max-norm $L_\infty$. That is,
%
\begin{align*}
\vectornorm{U}_\infty = \max_s |U(s)|.
\end{align*}
%
Then what value iteration actually does is perform a \emph{Bellman update} $B$ that produces a whole new vector on each iteration, i.e.
\begin{verbatim}
...
U[i + 1] = B(U[i])
...
\end{verbatim}
We claim that $\vectornorm{B(U) - B(U')}_\infty \leq \gamma \vectornorm{U - U'}_\infty$. (The proof is assigned in homework 6.) The rest of the convergence proof follows from this. We know $U^* = B(U^*)$; this is called a \emph{fixed point}. Then
%
\begin{align*}
\vectornorm{U_{i+1} - U^*}_\infty = \vectornorm{B(U_i) - B(U^*)}_\infty
\leq \gamma \vectornorm{U_i - U^*}_\infty.
\end{align*}
%
You can view that last term, $\vectornorm{U_i - U^*}_\infty$, as the \emph{error} in the estimate $U_i$, and this inequality tells us that the error shrinks by a factor of $\gamma$ on every iteration.
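As a quick numerical sanity check of that claim (not a proof, and not from lecture), one can apply a Bellman update to two arbitrary value vectors on the same made-up toy MDP used in the sketch above and compare the two max-norms:
\begin{verbatim}
# Sanity check of ||B(U) - B(V)|| <= gamma * ||U - V|| (illustrative only).
import random

GAMMA = 0.9
STATES, ACTIONS = [0, 1, 2], ["stay", "go"]
R = {0: 0.0, 1: 0.0, 2: 1.0}
P = {s: {a: {} for a in ACTIONS} for s in STATES}
for s in STATES:
    P[s]["stay"][s] = 1.0
    P[s]["go"][min(s + 1, 2)] = 1.0

def bellman_update(U):
    """One Bellman update B(U)."""
    return {s: R[s] + GAMMA * max(sum(p * U[s2] for s2, p in P[s][a].items())
                                  for a in ACTIONS)
            for s in STATES}

def max_norm_dist(U, V):
    """Max-norm distance between two value vectors."""
    return max(abs(U[s] - V[s]) for s in STATES)

U = {s: random.uniform(-10, 10) for s in STATES}
V = {s: random.uniform(-10, 10) for s in STATES}
print(max_norm_dist(bellman_update(U), bellman_update(V)),
      "<=", GAMMA * max_norm_dist(U, V))
\end{verbatim}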
Utilities of states are bounded by $\pm R_{\max} / (1 - \gamma)$, where $R_{\max}$ is the maximum reward (see the book for the explanation). Therefore,
\begin{align*}
\vectornorm{U_0 - U^*}_\infty \leq \frac{2 R_{\max}}{1 - \gamma}.
\end{align*}
Since the error shrinks by a factor of $\gamma$ each iteration, the error after $N$ iterations is at most $\gamma^N \cdot 2 R_{\max} / (1 - \gamma)$. We can now solve for the $N$ that achieves any desired error margin $\epsilon$, as shown below. The rest of this is given in the book (17.2). The board derivation was really bad here, but the book makes sense.
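Solving for $N$ explicitly (this step follows the book's argument; it was not done cleanly on the board): we want the error after $N$ iterations to be at most $\epsilon$, so
%
\begin{align*}
\gamma^N \cdot \frac{2 R_{\max}}{1 - \gamma} \leq \epsilon
\quad\Longleftrightarrow\quad
N \geq \frac{\log\left(2 R_{\max} / (\epsilon (1 - \gamma))\right)}{\log(1 / \gamma)},
\end{align*}
%
so it suffices to run $N = \left\lceil \log\left(2 R_{\max} / (\epsilon (1 - \gamma))\right) / \log(1 / \gamma) \right\rceil$ iterations.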
\subsection{Policy iteration.}

What if we just found the policy directly?

``Below, $\pi_i$ is \emph{not} the same as the $\pi_i$ above. Different algorithm.''
\begin{verbatim}
pi[0] = any policy
for i in (0 .. inf):
    compute U[pi[i]]            # policy evaluation
    for all s:
        pi[i + 1](s) = argmax[a](
            sum[s'](Pr[s' | s, a] U[pi[i]](s'))
        )
    if pi[i + 1] == pi[i]:
        return pi[i]
\end{verbatim}
The algorithm looks like this: it takes $\pi_0$, finds $U^{\pi_0}$, uses it to find $\pi_1$, which is used to find $U^{\pi_1}$, and so on and so forth. The $U^{\pi_i}$-to-$\pi_{i+1}$ step is called \emph{greedification}; a small sketch of it follows below.

It's possible for this algorithm to converge faster than value iteration due to its greediness [proof by example on board].
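Here is a minimal sketch of just the greedification step in Python (illustrative only; the names \texttt{greedify}, \texttt{P}, \texttt{U}, \texttt{states}, and \texttt{actions} are made up, and the utility table $U^{\pi_i}$ is assumed to be given already):
\begin{verbatim}
# Greedification sketch: given a utility table U (here, U^pi), return the
# policy that acts greedily with respect to it. P[s][a][s2] = Pr[s2 | s, a].
def greedify(P, U, states, actions):
    return {s: max(actions,
                   key=lambda a: sum(p * U[s2] for s2, p in P[s][a].items()))
            for s in states}
\end{verbatim}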
This is all fine, but how do you compute $U^{\pi_i}$? Bellman equations!
%
\begin{align*}
U^\pi(s) = R(s) + \gamma \sum_{s'} \Pr[s' | s, \pi(s)] U^\pi(s').
\end{align*}
%
But $U^\pi$ is unknown! What do we do? Well, notice there is no $\max$ here, so we get $n$ \emph{linear} equations in $n$ unknowns, one for each $s$, and we can solve them with linear algebra. \emph{Or} we can do a simplified value iteration where we plug in an arbitrary initial guess for $U^\pi$ and iterate this update until it converges. This step is called \emph{policy evaluation}.
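Here is a minimal sketch of policy evaluation by linear algebra in Python with NumPy (illustrative only; the three-state toy MDP and the fixed policy below are made up). It solves $(I - \gamma P_\pi) U^\pi = R$, where row $s$ of $P_\pi$ is $\Pr[\,\cdot\, | s, \pi(s)]$:
\begin{verbatim}
# Policy evaluation sketch: solve (I - gamma * P_pi) u = r directly.
import numpy as np

GAMMA = 0.9
STATES = [0, 1, 2]
R = np.array([0.0, 0.0, 1.0])
# P[a] is the transition matrix under action a; made-up deterministic dynamics.
P = {"stay": np.eye(3),
     "go":   np.array([[0, 1, 0], [0, 0, 1], [0, 0, 1]], dtype=float)}
pi = {0: "go", 1: "go", 2: "stay"}                   # some fixed policy

P_pi = np.array([P[pi[s]][s] for s in STATES])       # row s = Pr[. | s, pi(s)]
U_pi = np.linalg.solve(np.eye(3) - GAMMA * P_pi, R)  # n equations, n unknowns
print(U_pi)
\end{verbatim}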
\subsection{Proof of convergence for policy iteration.}

The \emph{policy improvement theorem} states that utility never decreases from one iteration to the next; i.e.\ for all $s$,
%
\begin{align*}
U^{\pi_{i+1}}(s) \geq U^{\pi_i}(s);
\end{align*}
%
furthermore, unless $\pi_i$ is already optimal, the inequality is \emph{strict} for at least one $s$. (Proof omitted due to difficulty.) This implies the algorithm converges to the optimal policy, but it doesn't give bounds on the number of iterations. In practice it's pretty fast, but there's open research on better bounds.
\subsection{Partially observable Markov decision process (POMDP).}

But robots don't always know what state they're in! These algorithms break when that happens. But we can combine what we do know with these algorithms. We're only surveying these at a high level in this class.

Here's what we assume: state is hidden; transition model; actions; rewards; observation model $O$! Wait, that's new! Define it as
\begin{align*}
O(s, o) = \Pr[o | s] = \Pr[\text{observe $o$ in state $s$}].
\end{align*}
We can't compute the real state, but we \emph{can} compute a belief state and view it as the visible state of a fully observable MDP (i.e.\ self-delusion); the standard update rule is sketched below. These belief states are full-on probability distributions rather than discrete objects, so they are bigger and harder to work with, and we still need to optimize.
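For reference (this formula is standard but was not written on the board): after taking action $a$ in belief state $b$ and then observing $o$, the new belief is
%
\begin{align*}
b'(s') = \alpha \, O(s', o) \sum_{s} \Pr[s' | s, a] \, b(s),
\end{align*}
%
where $\alpha$ is a normalizing constant that makes $b'$ sum to one.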
\subsection{On the next \emph{Arrested Development}...}

Machine learning! Categorizing objects! Face detection!

\end{document}
# Local Variables:
# mode: longlines
# End: