\documentclass[oneside, article, twocolumn]{memoir}
\usepackage{amsmath, microtype}
\usepackage[margin=0.5in]{geometry}
\usepackage[charter]{mathdesign}
% Line spacing.
\linespread{1.1}
% Vector norm notation.
\newcommand{\vectornorm}[1]{\left|\left|#1\right|\right|}
% arg max operator.
\newcommand{\argmax}{\operatornamewithlimits{arg\,max}}
% Make verbatim smaller.
\setverbatimfont{\normalfont\ttfamily\footnotesize}
\begin{document}
\title{COS 402 Fall 2010: Value Iteration.}
\author{we can make broad strokes}
\maketitle
% Evil paragraph indenting style is required for these short notes.
\nonzeroparskip
\setlength{\parindent}{0pt}

\subsection{Last time.}
Last time, we had our toy environment with states $s_0, s_1, \ldots$, where $s_i$ can transition to $s_{i+1}$ and $R(s_i)$ is the reward.

We defined the utility of a state sequence as $\sum_{t=0}^{\infty} \gamma^t R(s_t)$ where $0 < \gamma < 1$. We let $\pi$ be a policy and $\pi^*$ be the optimal policy.

We then defined $U^{\pi}(s) = E[\text{utility of state sequence} | \pi, s_0 = s]$ and let $U^*(s) = U^{\pi^*}(s)$.

But if we can find $U^*$, we can find the optimal policy via
%
\begin{align*}
\pi^*(s) = \argmax_a \sum_{s'} \Pr[s' | s, a] U^*(s'),
\end{align*}
%
i.e.\ in each state, choose the action that maximizes the expected utility of the successor state. This then leads to the Bellman equations:
\begin{align*}
U^*(s) = R(s) + \gamma \max_a \sum_{s'} \Pr[s' | s, a] U^*(s').
\end{align*}
This then led to \emph{value iteration}:
\begin{verbatim}
U[0](s) = 0 for all s
for i in (0 .. inf):
    for all s:
        U[i + 1](s) = R(s) +
            gamma * max[a](sum[s'](Pr[s' | s, a] U[i](s')))
    if max[s]|U[i](s) - U[i + 1](s)| < epsilon:
        return U[i + 1]
\end{verbatim}
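For concreteness, here is a minimal executable sketch of value iteration in Python (not from lecture); the three-state toy MDP, the transition table \texttt{P}, and the reward table \texttt{R} below are made-up placeholders rather than the actual assignment environment.
\begin{verbatim}
# Value iteration sketch (illustrative only; toy MDP is made up).
GAMMA, EPSILON = 0.9, 1e-6
STATES, ACTIONS = [0, 1, 2], ["stay", "go"]
R = {0: 0.0, 1: 0.0, 2: 1.0}                       # reward R(s)
P = {s: {a: {} for a in ACTIONS} for s in STATES}  # P[s][a][s2] = Pr[s2 | s, a]
for s in STATES:
    P[s]["stay"][s] = 1.0
    P[s]["go"][min(s + 1, 2)] = 1.0

U = {s: 0.0 for s in STATES}                       # U[0](s) = 0 for all s
while True:
    # Bellman update: U[i + 1](s) = R(s) + gamma * max_a E[U[i](s')].
    U_new = {s: R[s] + GAMMA * max(sum(p * U[s2] for s2, p in P[s][a].items())
                                   for a in ACTIONS)
             for s in STATES}
    if max(abs(U_new[s] - U[s]) for s in STATES) < EPSILON:
        break                                      # max-norm change is small
    U = U_new

print(U_new)
\end{verbatim}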
\subsection{Proof of convergence for value iteration.}
We can think of each $U_0, U_1, \ldots, U^*$ as a vector. Then how \emph{long} is a vector like this, i.e.\ how do we measure the distance from each $U_i$ to $U_{i+1}$? It turns out that Euclidean distance (the 2-norm) is a bad choice. Instead we use the max-norm $L_\infty$. That is,
%
\begin{align*}
\vectornorm{U}_\infty = \max_s |U(s)|.
\end{align*}
%
Then what value iteration actually does is perform a \emph{Bellman update} $B$ that produces a whole new vector on each iteration, i.e.
\begin{verbatim}
...
U[i + 1] = B(U[i])
...
\end{verbatim}
We claim that $\vectornorm{B(U) - B(U')}_\infty \leq \gamma \vectornorm{U - U'}_\infty$. (The proof is assigned in homework 6.) The rest of the convergence proof follows from this. We know $U^* = B(U^*)$; this is called a \emph{fixed point}. Then
%
\begin{align*}
\vectornorm{U_{i+1} - U^*}_\infty = \vectornorm{B(U_i) - B(U^*)}_\infty
\leq \gamma \vectornorm{U_i - U^*}_\infty.
\end{align*}
%
You can view that last term, $\vectornorm{U_i - U^*}_\infty$, as the \emph{error} in the estimate $U_i$, and this inequality tells us that the error shrinks by a factor of $\gamma$ on every iteration.
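As a quick numerical sanity check of that claim (not a proof, and not from lecture), one can apply a Bellman update to two arbitrary value vectors on the same made-up toy MDP used in the sketch above and compare the two max-norms:
\begin{verbatim}
# Sanity check of ||B(U) - B(V)|| <= gamma * ||U - V|| (illustrative only).
import random

GAMMA = 0.9
STATES, ACTIONS = [0, 1, 2], ["stay", "go"]
R = {0: 0.0, 1: 0.0, 2: 1.0}
P = {s: {a: {} for a in ACTIONS} for s in STATES}
for s in STATES:
    P[s]["stay"][s] = 1.0
    P[s]["go"][min(s + 1, 2)] = 1.0

def bellman_update(U):
    """One Bellman update B(U)."""
    return {s: R[s] + GAMMA * max(sum(p * U[s2] for s2, p in P[s][a].items())
                                  for a in ACTIONS)
            for s in STATES}

def max_norm_dist(U, V):
    """Max-norm distance between two value vectors."""
    return max(abs(U[s] - V[s]) for s in STATES)

U = {s: random.uniform(-10, 10) for s in STATES}
V = {s: random.uniform(-10, 10) for s in STATES}
print(max_norm_dist(bellman_update(U), bellman_update(V)),
      "<=", GAMMA * max_norm_dist(U, V))
\end{verbatim}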
Utilities of states are bounded by $\pm R_{\max} / (1 - \gamma)$, where $R_{\max}$ is the maximum reward (see the book for the explanation). Therefore,
\begin{align*}
\vectornorm{U_0 - U^*}_\infty \leq \frac{2 R_{\max}}{1 - \gamma}.
\end{align*}
Since the error shrinks by a factor of $\gamma$ each iteration, the error after $N$ iterations is at most $\gamma^N \cdot 2 R_{\max} / (1 - \gamma)$. We can now solve for the $N$ that achieves any desired error margin $\epsilon$, as shown below. The rest of this is given in the book (17.2). The board derivation was really bad here, but the book makes sense.
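Solving for $N$ explicitly (this step follows the book's argument; it was not done cleanly on the board): we want the error after $N$ iterations to be at most $\epsilon$, so
%
\begin{align*}
\gamma^N \cdot \frac{2 R_{\max}}{1 - \gamma} \leq \epsilon
\quad\Longleftrightarrow\quad
N \geq \frac{\log\left(2 R_{\max} / (\epsilon (1 - \gamma))\right)}{\log(1 / \gamma)},
\end{align*}
%
so it suffices to run $N = \left\lceil \log\left(2 R_{\max} / (\epsilon (1 - \gamma))\right) / \log(1 / \gamma) \right\rceil$ iterations.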
\subsection{Policy iteration.}

What if we just found the policy directly?

``Below, $\pi_i$ is \emph{not} the same as the $\pi_i$ above. Different algorithm.''
\begin{verbatim}
pi[0] = any policy
for i in (0 .. inf):
    compute U[pi[i]]            # policy evaluation
    for all s:
        pi[i + 1](s) = argmax[a](
            sum[s'](Pr[s' | s, a] U[pi[i]](s'))
        )
    if pi[i + 1] == pi[i]:
        return pi[i]
\end{verbatim}
The algorithm looks like this: it takes $\pi_0$, finds $U^{\pi_0}$, uses it to find $\pi_1$, which is used to find $U^{\pi_1}$, and so on and so forth. The $U^{\pi_i}$-to-$\pi_{i+1}$ step is called \emph{greedification}; a small sketch of it follows below.

It's possible for this algorithm to converge faster than value iteration due to its greediness [proof by example on board].
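Here is a minimal sketch of just the greedification step in Python (illustrative only; the names \texttt{greedify}, \texttt{P}, \texttt{U}, \texttt{states}, and \texttt{actions} are made up, and the utility table $U^{\pi_i}$ is assumed to be given already):
\begin{verbatim}
# Greedification sketch: given a utility table U (here, U^pi), return the
# policy that acts greedily with respect to it. P[s][a][s2] = Pr[s2 | s, a].
def greedify(P, U, states, actions):
    return {s: max(actions,
                   key=lambda a: sum(p * U[s2] for s2, p in P[s][a].items()))
            for s in states}
\end{verbatim}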
This is all fine, but how do you compute $U^{\pi_i}$? Bellman equations!
%
\begin{align*}
U^\pi(s) = R(s) + \gamma \sum_{s'} \Pr[s' | s, \pi(s)] U^\pi(s').
\end{align*}
%
But $U^\pi$ is unknown! What do we do? Well, notice there is no $\max$ here, so we get $n$ \emph{linear} equations in $n$ unknowns, one for each $s$, and we can solve them with linear algebra. \emph{Or} we can do a simplified value iteration where we plug in an arbitrary initial guess for $U^\pi$ and iterate this update until it converges. This step is called \emph{policy evaluation}.
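Here is a minimal sketch of policy evaluation by linear algebra in Python with NumPy (illustrative only; the three-state toy MDP and the fixed policy below are made up). It solves $(I - \gamma P_\pi) U^\pi = R$, where row $s$ of $P_\pi$ is $\Pr[\,\cdot\, | s, \pi(s)]$:
\begin{verbatim}
# Policy evaluation sketch: solve (I - gamma * P_pi) u = r directly.
import numpy as np

GAMMA = 0.9
STATES = [0, 1, 2]
R = np.array([0.0, 0.0, 1.0])
# P[a] is the transition matrix under action a; made-up deterministic dynamics.
P = {"stay": np.eye(3),
     "go":   np.array([[0, 1, 0], [0, 0, 1], [0, 0, 1]], dtype=float)}
pi = {0: "go", 1: "go", 2: "stay"}                   # some fixed policy

P_pi = np.array([P[pi[s]][s] for s in STATES])       # row s = Pr[. | s, pi(s)]
U_pi = np.linalg.solve(np.eye(3) - GAMMA * P_pi, R)  # n equations, n unknowns
print(U_pi)
\end{verbatim}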
\subsection{Proof of convergence for policy iteration.}

The \emph{policy improvement theorem} states that utility never decreases from one iteration to the next; i.e.\ for all $s$,
%
\begin{align*}
U^{\pi_{i+1}}(s) \geq U^{\pi_i}(s);
\end{align*}
%
furthermore, unless $\pi_i$ is already optimal, the inequality is \emph{strict} for at least one $s$. (Proof omitted due to difficulty.) This implies the algorithm converges to the optimal policy, but it doesn't give bounds on the number of iterations. In practice it's pretty fast, but there's open research on better bounds.
\subsection{Partially observable Markov decision process (POMDP).}

But robots don't always know what state they're in! These algorithms break when that happens. But we can combine what we do know with these algorithms. We're only surveying these at a high level in this class.

Here's what we assume: state is hidden; transition model; actions; rewards; observation model $O$! Wait, that's new! Define it as
\begin{align*}
O(s, o) = \Pr[o | s] = \Pr[\text{observe $o$ in state $s$}].
\end{align*}
We can't compute the real state, but we \emph{can} compute a belief state and view it as the visible state of a fully observable MDP (i.e.\ self-delusion); the standard update rule is sketched below. These belief states are full-on probability distributions rather than discrete objects, so they are bigger and harder to work with, and we still need to optimize.
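For reference (this formula is standard but was not written on the board): after taking action $a$ in belief state $b$ and then observing $o$, the new belief is
%
\begin{align*}
b'(s') = \alpha \, O(s', o) \sum_{s} \Pr[s' | s, a] \, b(s),
\end{align*}
%
where $\alpha$ is a normalizing constant that makes $b'$ sum to one.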
\subsection{On the next \emph{Arrested Development}...}

Machine learning! Categorizing objects! Face detection!

\end{document}
# Local Variables:
# mode: longlines
# End: