\documentclass[11pt, oneside]{article} % use "amsart" instead of "article" for AMSLaTeX format
\usepackage{geometry} % See geometry.pdf to learn the layout options. There are lots.
\geometry{letterpaper} % ... or a4paper or a5paper or ...
%\usepackage[parfill]{parskip} % Activate to begin paragraphs with an empty line rather than an indent
\usepackage{graphicx} % Use pdf, png, jpg, or eps with pdflatex; use eps in DVI mode
% TeX will automatically convert eps --> pdf in pdflatex
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{mathabx}
\usepackage{bm}
\usepackage{bbm}
\usepackage{hyperref}
\usepackage{listings}
\usepackage{color}
\usepackage{textcomp}
\usepackage{caption}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{pifont}
\usepackage{siunitx}
\sisetup{output-exponent-marker=\textsc{e}}
\definecolor{dkgreen}{rgb}{0,0.6,0}
\definecolor{gray}{rgb}{0.5,0.5,0.5}
\definecolor{mauve}{rgb}{0.58,0,0.82}
\lstset{frame=tb,
language=Python,
aboveskip=3mm,
belowskip=3mm,
showstringspaces=false,
columns=flexible,
basicstyle={\small\ttfamily},
numbers=none,
numberstyle=\tiny\color{gray},
keywordstyle=\color{blue},
commentstyle=\color{dkgreen},
stringstyle=\color{mauve},
breaklines=true,
breakatwhitespace=true,
tabsize=3
}
\usepackage{subcaption}
\captionsetup{width=0.7\textwidth}
\usepackage{pdfpages}
\usepackage{environ}
%SetFonts
%SetFonts
\newcommand{\ud}{\,\mathrm{d}}
\renewcommand{\thesubsection}{\thesection.\alph{subsection}}
\NewEnviron{myalign}{%
\begin{align}
\scalebox{1.5}{$\BODY$}
\end{align}
}
\NewEnviron{myequation}{%
\begin{equation}
\scalebox{1.5}{$\BODY$\nonumber}
\end{equation}
}
\newcommand{\framesName}[1]{
\ensuremath{
{\uppercase{\mathcal #1}}
}
}
\title{APPM 5720: Final Project\\An Introduction to Differential Dynamic Programming for Optimal Control Problems}
\author{Manuel F. Diaz Ramos, David Iglesias Echavarr\'ia, Christopher Rabotin}
\date{\today} % Activate to display a given date or no date
\begin{document}
\maketitle
\abstract{This paper gives an introduction to dynamic-programming-derived methods for optimal control. Dynamic Programming, Differential Dynamic Programming, and the stagewise Newton's method are carefully studied and compared. Simulations show that Differential Dynamic Programming stands out among these three as the fastest method for solving optimal control problems.}
\section{Introduction}
The optimal control problem can be written, in the Bolza form\cite{longuski2014}, as
\begin{align}
J(\bm x(t_0), \bm x(t_f), t_0, t_f) &= \underset{\bm u(t)}{\text{min}} \quad \int_{t_0}^{t_f} L(\bm x(t), \bm u(t), t) \ud t + \phi(\bm x(t_f), t_f)\\
\dot{\bm x}(t) &= \bm F(\bm x(t), \bm u(t), t)\\
\bm u(t) &\in U \subseteq \mathbb{R}^m\\
\bm x(t) &\in X\subseteq \mathbb{R}^n\\
\bm \psi(\bm x(t_f), t_f) &= \bm 0
\end{align}
In plain words, in the optimal control problem, a piecewise continuous function $\bm u(t): \mathbb{R}^+ \rightarrow \mathbb{R}^m$ is to be found such that the function $J$ is minimized and some constraints are met. The problem has a sequential nature, described by the dynamics of the state $\bm x(t): \mathbb{R}^+ \rightarrow \mathbb{R}^n$.
Ideally, a feedback control form $\bm u(t) = \bm G(\bm x(t), t)$ is desired\cite{kirk1970}.
There are several ways to solve optimal control problems: indirect methods, direct methods, and dynamic-programming-derived methods.
Indirect methods are based on the necessary conditions for optimality\cite{longuski2014}, which can be seen as a generalization of the Karush-Kuhn-Tucker (KKT) conditions\cite{betts2010}. This set of equations constitutes a Two-Point Boundary Value Problem (TPBVP)\cite{fox1957}, and is usually known as the Euler-Lagrange equations and Pontryagin's minimum principle\cite{longuski2014,bryson1975}.
The Hamiltonian function $H$ is defined as
\begin{equation}
H(\bm x, \bm u, t, \bm \lambda) = L(\bm x, \bm u, t) + \bm\lambda^T \bm F(\bm x, \bm u, t)
\end{equation}
The TPBVP Hamiltonian-like\cite{sussman2010} system can be written as\cite{longuski2014}\footnote{Transversality conditions must be added if the final time is not fixed.}
\begin{align}
\dot{\bm x} &= H_{\bm \lambda}\\
\dot{\bm \lambda} &= -H_{\bm x}\\
\bm u(t) &= \underset{\bm u(t) \in U}{\text{argmin}}(H)\\
\Phi(\bm x(t_f), t_f) &= \phi(\bm x(t_f), t_f) + \bm\nu^T \bm \psi(\bm x(t_f), t_f)\\
\bm x(t_0) &= \bm x_0\\
\bm \lambda(t_f) &= \Phi_{\bm x}(t_f)
\end{align}
The co-states $\bm \lambda$ are the continuous equivalent to Lagrange's multipliers\cite{betts2010}. Unfortunately, the necessary conditions give an open-loop solution.
There are several indirect methods that solve the TPBVP. Most techniques are based on shooting methods\cite{betts1998}. Lawden\cite{lawden1963} proposed primer vector theory to solve the trajectory optimization problem. Collocation, in which the solution of the TPBVP is approximated using sets of polynomials, has also been used as an indirect method\cite{dickmanns1975}. However, the main problem with indirect methods is that the co-states are very hard to guess, which often causes these techniques to diverge\cite{betts1998, conway2010}. For this reason, indirect methods are rarely used nowadays.
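To make the shooting idea concrete, consider the scalar toy problem $L = \frac{1}{2}(x^2 + u^2)$, $\dot x = u$, $x(0) = 1$, free $x(1)$, for which Pontryagin's principle gives $u = -\lambda$ and the TPBVP $\dot x = -\lambda$, $\dot\lambda = -x$, $\lambda(1) = 0$. The following sketch illustrates single shooting; the problem, bracket, and tolerances are our own choices, not an implementation from the cited references.

\begin{lstlisting}
# Single shooting for a toy TPBVP (illustrative sketch):
# minimize 0.5*int(x^2 + u^2) dt, xdot = u, x(0) = 1, free x(1).
# Pontryagin: u = -lambda, so xdot = -lambda, lambdadot = -x, lambda(1) = 0.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import brentq

def residual(lam0):
    """Integrate from a guessed lambda(0); return the terminal co-state."""
    sol = solve_ivp(lambda t, y: [-y[1], -y[0]], (0.0, 1.0),
                    [1.0, lam0], rtol=1e-10)
    return sol.y[1, -1]  # lambda(1), which must vanish

lam0 = brentq(residual, 0.0, 2.0)  # root-find on the initial co-state
print(lam0)  # analytic answer: tanh(1) ~ 0.7616
\end{lstlisting}

Even in this linear example, the entire solution hinges on the guess for $\lambda(0)$, which illustrates the sensitivity mentioned above.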
Direct methods are based on the observation that the Euler-Lagrange equations and Pontryagin's minimum principle are a generalization of the KKT conditions. Therefore, the problem can be discretized and solved using Nonlinear Programming (NLP) techniques\cite{betts2010}. Once discretized, any NLP technique can be used to solve the problem. In the trajectory optimization community, direct transcription (or collocation) has become a very popular direct method\cite{conway2010,hargraves1987, enright1992, herman1996}. Collocation divides the time interval into segments. In each segment, the solution is approximated using a polynomial whose coefficients are optimized such that the approximation error at the center of each segment is minimized. The advantage of direct methods is that co-states are not needed and NLP tools can be readily used\cite{betts1998}.
The third family of methods is based on Bellman's principle of optimality\cite{bellman1957}, which gives necessary and sufficient conditions for optimality. Dynamic Programming (DP)\cite{bertsekas2017} is a search algorithm based on this principle. The method has the advantage that it obtains a closed-loop control law. However, it is not usually practical due to the ``curse of dimensionality''\cite{bellman1957}: the computational cost grows exponentially with the number of states and controls.
Differential Dynamic Programming (DDP) is also based on Bellman's principle of optimality, but instead of fully quantizing the state and control space and performing a backwards search, as in DP, the method propagates the state using an initial guess for the controls and corrects backwards using a quadratic approximation of the principle of optimality\cite{jacobson1970}. The method, which was invented in the 1960s\cite{mayne1966} and popularized in the 1980s\cite{murray1984,yakowitz1984}, is experiencing a renaissance in the trajectory optimization community\cite{lantoine2012,ozaki2015,aziz2017}, and is used in one of the most important trajectory optimization tools at NASA's Jet Propulsion Laboratory: Mystic\cite{whiffen2006}. The advantage of DDP over direct methods is that it preserves and leverages the sequential structure of the problem\cite{yakowitz1984}.
The goal of this work is to give an introduction to dynamic optimization methods derived from Bellman's principle of optimality, in particular to Differential Dynamic Programming. Three methods are discussed, implemented, simulated, and compared on a test discretized optimal control problem: DP, DDP, and the stagewise Newton's method.
\section{Dynamic Programming}
Dynamic programming (DP) is a method used to solve problems by dividing them into simpler subproblems and combining the solutions. In contrast to other algorithms, in dynamic programming the solutions to the subproblems are stored so they do not have to be recomputed again. This is due to the fact that the subproblems are not independent from each other, so a recursive structure is used.
DP's cornerstone is the principle of optimality\cite{bellman1957}, which establishes that, if an optimal path that goes from $a$ to $c$ via $b$, $\mathbf{abc}$, is made up of two paths, $\mathbf{ab}$ and $\mathbf{bc}$, then $\mathbf{bc}$ has to be the optimal path from $b$ to $c$. Even though it is an intuitive and easy to grasp concept, DP would not be possible if this principle did not hold.
There are four main steps to solving a problem with DP:
\begin{enumerate}
\item Characterizing the structure of the optimal solution.
\item Defining the value of the optimal solution as a function of previous optimal solutions.
\item Performing bottom-up computations to find the value of the optimal solution.
\item Constructing the final optimal solution based on the computations from 3.
\end{enumerate}
For the rest of this work, and for simplicity, the time-discrete problem with fixed final time $N$ is considered
\begin{align}
\underset{\{\bm u_k\}_{k=0,..., N-1}}{\text{min}} J &= \sum_{k=1}^{N-1} L_k(\bm x_k, \bm u_k, k) + \phi(\bm x_N, N)\nonumber\\
\bm x_{k+1} &= \bm f(\bm x_k, \bm u_k, k)\nonumber\\
\bm x(t_0) &= \bm x_0\label{eq:discr_prob}\\
\bm x &\in X\nonumber\\
\bm u &\in U\nonumber
\end{align}
The DP algorithm can be outlined as follows. Firstly, the real-valued state and control input need to be \textit{quantized}. Even though interpolating is in general necessary, the final optimal control input will take quantized values. A very fine quantization obviously yields more accurate results, at the cost of added computation time.
The DP algorithm works in a bottom-up manner; starting from the final time step, it moves backwards until the initial time step is reached, computing solutions to the subproblems one at a time. The process of breaking the main problem down into subproblems is known as \textit{segmentation}. Each subproblem consists of advancing from one time step to the next with as little cost as possible, and depends on the solution of the rest of the subproblems that have already been computed and stored. This optimal substructure, that is, the possibility of expressing a subproblem as a cost function plus a previously computed solution, is another requirement that an optimization problem has to satisfy in order to be solved by DP.
In order to implement DP, we need to find a \textit{recurrence} in the expression of the cost function between these subproblems; in other words, to express the cost at the current step as a function of the cost-to-go at the following step, which has already been computed in the backward pass. This recursive expression is known as Bellman's equation\cite{bellman1957}, and has the following form:
\begin{equation}
V^*_{k}(\bm x_k, k)= \underset{\bm u_k} {\text{min}} \:[J_{k,k+1}(\bm x_k,\bm u_k, k)+V^*_{k+1}(\bm x_{k+1},k+1)]
\end{equation}
where $ J_{k,k+1}(\bm x_k,\bm u_k, k) \triangleq L_k(\bm x_k,\bm u_k,k) $ represents the cost of transitioning from $k$ to $k+1$, and $V^*_{k+1}(\bm x_{k+1}, k+1)$ is the optimal cost to go from $k+1$ to $N$, which has already been computed and stored, in a process called \textit{memoization}. We note that $ V^*_{k}(\bm x_k,k) $ is a function of $\bm x_k$ and, perhaps, time $k$. It is not until the process is complete and $ V^*_{0}(\bm x_0,0) $ has been computed that the final solution $J^*$ is found: $J^*(\bm x_0) = V^*_{0}({\bm x}_0,0)$ is a function of the initial condition. If the initial condition is fixed, i.e. $\bm x_0=\bar{\bm x}_0$, that value is plugged in to find the solution $ V^*_{0}(\bar{\bm x}_0,0) $. Together with $ V^*_{k}(\bm x_k, k) $, at each time step $k$ and for each grid point we also store the control input that minimizes $V_{k}(\bm x_k,\bm u_k) $, namely $\bm u^*(\bm x_k,k)$.
The general DP computational algorithm is shown as Algorithm~\ref{alg:dp}.
\begin{algorithm}
\caption{Dynamic Programming Algorithm.}\label{alg:dp}
\begin{algorithmic}[1]
\State{Initialize $V^*(\bm x_N)=\phi(\bm x_N) \: \forall \bm x_N \in X$}
\For{$k \in {N-1,\dots,0}$}
\For{$\bm x_k \in X$}
\For{$\bm u_k \in U$}
\State{$\bm x_{next}= \bm f(\bm x_k,\bm u_k, k)$}
\State{$V(\bm x_k,\bm u_k, k)=L_k(\bm x_k, \bm u_k, k)+V^*(\bm x_{next}, k+1)$}
\EndFor
\State{$V^*(\bm x_k, k)=\underset{\bm u_k}{\text{min}}\:[V(\bm x_k,\bm u_k, k)]$}
\State{$\bm u^*(\bm x_k,k)=\underset{\bm u_k}{\text{argmin}}\:[V(\bm x_k,\bm u_k, k)]$}
\EndFor
\EndFor
\end{algorithmic}
\end{algorithm}
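As an illustration, a minimal Python sketch of Algorithm~\ref{alg:dp} for a scalar state and control follows. The grids, dynamics, and costs are placeholders rather than this paper's test problem, and off-grid successors are handled by linear interpolation, which is one possible choice; the algorithm itself does not prescribe one.

\begin{lstlisting}
# Sketch of Algorithm 1 for a scalar state and control (illustrative;
# grids, dynamics f, stage cost L, and terminal cost phi are placeholders).
import numpy as np

X = np.linspace(-2.0, 2.0, 201)   # quantized state grid
U = np.linspace(-1.0, 1.0, 51)    # quantized control grid
N = 20
f = lambda x, u, k: 0.9 * x + u   # placeholder dynamics
L = lambda x, u, k: x**2 + u**2   # placeholder stage cost
phi = lambda x: 10.0 * x**2       # terminal cost

V = phi(X)                        # V*(x_N) on the grid
policy = np.zeros((N, X.size))    # memoized u*(x_k, k)
for k in range(N - 1, -1, -1):    # backward (bottom-up) pass
    Xg, Ug = np.meshgrid(X, U, indexing="ij")
    x_next = np.clip(f(Xg, Ug, k), X[0], X[-1])
    # stage cost plus interpolated cost-to-go, for every (x_k, u_k) pair
    Q = L(Xg, Ug, k) + np.interp(x_next, X, V)
    V = Q.min(axis=1)             # V*(x_k, k)
    policy[k] = U[Q.argmin(axis=1)]
\end{lstlisting}

The stored \texttt{policy} array is the closed-loop control law: given any quantized $\bm x_k$ at time $k$, the optimal control is a table lookup.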
One of the main positive aspects of DP is that we are guaranteed to find a globally optimal solution (up to the quantization resolution), regardless of the problem's convexity or lack thereof. This is because, similarly to a brute-force method, we are performing a direct search among our quantized states and controls.
Another positive aspect of dynamic programming is the fact that the optimal control input is in closed-loop form. In other words, for every state value in the admissible discretized region there is an optimal control defined. This makes DP very desirable for implementation in real-life applications.
The main issue associated with DP is the ``curse of dimensionality''. In particular, DP becomes prohibitively expensive from a computational standpoint if the state or control to be quantized has a high dimension or the grid used is dense. The main implication is that, if the search space is not constrained, or no prior knowledge of the nature of the solution exists, the DP algorithm is arguably a poor choice.
\section{Differential Dynamic Programming}
DDP was invented in the 1960s by Mayne and Jacobson\cite{jacobson1970,mayne1966} and further investigated in the 1980s and 1990s\cite{murray1984,yakowitz1984,murray1981,yakowitz1986,liao1992,liao1991}.
In this analysis, the discrete optimal control problem given by Equation~\eqref{eq:discr_prob} is used, but without any constraint. The problem, however, does not need to be discretized in time\cite{lantoine2012} and can be easily extended to constrained problems\cite{lantoine2012,yakowitz1986}. Since the algorithm details may be hard to grasp, only the unconstrained problem will be analyzed.
\begin{figure}[t]
\centering
\includegraphics[width=0.6\textwidth]{include/ddp.png}
\caption{A graphical description of DDP. Taken from \cite{ozaki2015}.}\label{fig:ddp}
\end{figure}
Since no quantization of the state and control space is needed, DDP overcomes the curse of dimensionality. Moreover, it can converge quadratically\cite{liao1991} under conditions that are discussed later in this paper.
Given an initial guess of the controls $\{\bar{\bm u}_k\}_{k=0,..., N-1}$, the idea of the method is to propagate forward to obtain the set of states $\{\bar{\bm x}_k\}_{k=0,..., N}$, and then correct backwards using a quadratic approximation of the principle of optimality. The algorithm is depicted in Figure~\ref{fig:ddp}.
The backwards sweep analysis starts with the optimal $V^*_k(\bm x_k, k)$ and the principle of optimality
\begin{equation}
V^*_k(\bm x_k, k) = \underset{\bm u_k}{\text{min}} \quad \left[ L_k(\bm x_k, \bm u_k, k) + V_{k+1}^*(\bm x_{k+1}, k+1) \right]
\end{equation}
The quadratic approximation of $V^*_{k+1}(\bm x_{k+1}, k+1)$ and $L_k(\bm x_k, \bm u_k, k)$ around the reference trajectory $\{\bar{\bm u}_k\}_{k=0,..., N-1}$, $\{\bar{\bm x}_k\}_{k=0,..., N}$ can thus be written as
\begin{align}
V^*_{k+1}(\bm x_{k+1}, k+1) &\approx V^*_{k+1}(\bar{\bm x}_{k+1}, k+1) + \left.\frac{\partial V^*_{k+1}}{\partial \bm x}\right|^T_{(\bar{\bm x}_{k+1}, {k+1})} \delta\bm x_{k+1} \nonumber\\
&+ \frac{1}{2} \left.\delta \bm x_{k+1}^T \frac{\partial^2 V^*_{k+1}}{\partial \bm x^2}\right|_{(\bar{\bm x}_{k+1}, {k+1})} \delta \bm x_{k+1}\label{eq:V_k1}\\
L_k(\bm x_k, \bm u_k) &\approx L_k(\bar{\bm x}_k, \bar{\bm u}_k) + \left.\frac{\partial L_k}{\partial \bm x}\right|^T_{(\bar{\bm x}_k, \bar{\bm u}_k, k)} \delta\bm x_k + \left.\frac{\partial L_k}{\partial \bm u}\right|^T_{(\bar{\bm x}_k, \bar{\bm u}_k, k)} \delta\bm u_k\nonumber\\
&+ \frac{1}{2} \delta\bm x_k^T \left.\frac{\partial^2 L_k}{\partial \bm x^2}\right|_{(\bar{\bm x}_k, \bar{\bm u}_k, k)} \delta\bm x_k + \delta\bm x_k^T \left.\frac{\partial^2 L_k}{\partial \bm x \partial \bm u}\right|_{(\bar{\bm x}_k, \bar{\bm u}_k, k)} \delta\bm u_k \nonumber\\
&+ \frac{1}{2} \delta\bm u_k^T \left.\frac{\partial^2 L_k}{\partial \bm u^2}\right|_{(\bar{\bm x}_k, \bar{\bm u}_k, k)} \delta\bm u_k\label{eq:L_k}
\end{align}
The states at time $k$ and $k+1$ are coupled through the dynamics. Hence
\begin{multline}
\delta \bm x_{k+1} \approx \left.\frac{\partial f}{\partial \bm x}\right|_{(\bar{\bm x}_k, \bar{\bm u}_k, k)} \delta\bm x_k + \left.\frac{\partial f}{\partial \bm u}\right|_{(\bar{\bm x}_k, \bar{\bm u}_k, k)} \delta\bm u_k + \frac{1}{2} \delta\bm x_k^T * \left.\frac{\partial^2 f}{\partial \bm x^2}\right|_{(\bar{\bm x}_k, \bar{\bm u}_k, k)} \delta\bm x_k \\
+ \delta\bm x_k^T * \left.\frac{\partial^2 f}{\partial \bm x \partial \bm u}\right|_{(\bar{\bm x}_k, \bar{\bm u}_k, k)} \delta\bm u_k + \frac{1}{2} \delta\bm u_k^T * \left.\frac{\partial^2 f}{\partial \bm u^2}\right|_{(\bar{\bm x}_k, \bar{\bm u}_k, k)} \delta\bm u_k\label{eq:trans}
\end{multline}
The matrices involved are the first- and second-order state transition matrices. The symbol ``$*$'' represents the product between a vector and a third-order tensor, $(A * B)_{jk} = A_i B_{ijk}$, with summation over $i$ implied.
Plugging Equation~\eqref{eq:trans} into Equations~\eqref{eq:V_k1}~and~\eqref{eq:L_k}, dropping higher-order terms, and summing gives
\begin{multline}
V_k(\bm x_{k}, \bm u_k, k) = L_k(\bm x_k, \bm u_k) + V^*_{k+1}(\bm x_{k+1}, k+1) \approx\\
Q_0 + \begin{bmatrix} \bm q_x^T & \bm q_u^T\end{bmatrix} \begin{bmatrix} \delta\bm x_k\\ \delta\bm u_k\end{bmatrix} + \frac{1}{2} \begin{bmatrix} \delta\bm x_k^T & \delta\bm u_k^T\end{bmatrix} \begin{bmatrix} \bm Q_{xx} & \bm Q_{xu}\\ \bm Q_{xu}^T & \bm Q_{uu}\end{bmatrix} \begin{bmatrix} \delta\bm x_k\\ \delta\bm u_k\end{bmatrix} \label{eq:tomin}
\end{multline}
where\footnote{In order to simplify the notation, the evaluations are replaced by the corresponding index.}
\begin{align}
Q_0 &= V^*_{k+1}(\bar{\bm x}_{k+1}, k+1) + L_k(\bar{\bm x}_k, \bar{\bm u}_k)\\
\bm q_x &= \left.\frac{\partial f}{\partial \bm x}\right|^T_{k} \left.\frac{\partial V^*_{k+1}}{\partial \bm x}\right|_{k+1} + \left.\frac{\partial L_k}{\partial \bm x}\right|_{k}\label{eq:q_x}\\
\bm q_u &= \left.\frac{\partial f}{\partial \bm u}\right|^T_{k} \left.\frac{\partial V^*_{k+1}}{\partial \bm x}\right|_{k+1} + \left.\frac{\partial L_k}{\partial \bm u}\right|_{k}\label{eq:q_u}\\
\bm Q_{xx} &= \left.\frac{\partial f}{\partial \bm x}\right|^T_{k} \left.\frac{\partial^2 V^*_{k+1}}{\partial \bm x^2}\right|_{k+1} \left.\frac{\partial f}{\partial \bm x}\right|_{k} + \left.\frac{\partial V^*_{k+1}}{\partial \bm x}\right|^T_{k+1} * \left.\frac{\partial^2 f}{\partial \bm x^2}\right|_{k} + \left.\frac{\partial^2 L_k}{\partial \bm x^2}\right|_{k}\label{eq:Q_xx}\\
\bm Q_{xu} &= \left.\frac{\partial f}{\partial \bm x}\right|^T_{k} \left.\frac{\partial^2 V^*_{k+1}}{\partial \bm x^2}\right|_{k+1} \left.\frac{\partial f}{\partial \bm u}\right|_{k} + \left.\frac{\partial V^*_{k+1}}{\partial \bm x}\right|^T_{k+1} * \left.\frac{\partial^2 f}{\partial \bm x \partial\bm u}\right|_{k} + \left.\frac{\partial^2 L_k}{\partial \bm x\partial\bm u}\right|_{k}\label{eq:Q_xu}\\
\bm Q_{uu} &= \left.\frac{\partial f}{\partial \bm u}\right|^T_{k} \left.\frac{\partial^2 V^*_{k+1}}{\partial \bm x^2}\right|_{k+1} \left.\frac{\partial f}{\partial \bm u}\right|_{k} + \left.\frac{\partial V^*_{k+1}}{\partial \bm x}\right|^T_{k+1} * \left.\frac{\partial^2 f}{\partial \bm u^2}\right|_{k} + \left.\frac{\partial^2 L_k}{\partial \bm u^2}\right|_{k}\label{eq:Q_uu}
\end{align}
Solving for the $\delta \bm u_k$ that minimizes Equation~\eqref{eq:tomin} yields
\begin{equation}
\delta \bm u_k = \bm \alpha_k + \bm \beta_k \delta \bm x_k = -\bm Q_{uu}^{-1} \bm q_u - \bm Q_{uu}^{-1} \bm Q_{xu}^T \delta \bm x_k\label{eq:optimal_u}
\end{equation}
Plugging the optimal control of Equation~\eqref{eq:optimal_u} into Equation~\eqref{eq:tomin}, the approximation of $V^*_k(\bm x_{k}, k)$ can be obtained and, therefore, so can its partials. The resulting partials are the following:
\begin{align}
\left.\frac{\partial V^*_{k}}{\partial \bm x}\right|_{k} &= \bm q_x + \bm\beta_k^T \bm q_u + \bm Q_{xu} \bm \alpha_k + \bm \beta_k^T \bm Q_{uu} \bm \alpha_k\label{eq:v_x}\\
\left.\frac{\partial^2 V^*_{k}}{\partial \bm x^2}\right|_{k} &= \bm Q_{xx} + \bm Q_{xu} \bm \beta_k + \bm \beta_k^T \bm Q_{xu}^T + \bm \beta_k^T \bm Q_{uu} \bm \beta_k\label{eq:v_xx}
\end{align}
The boundary conditions for these partial derivatives are easy to obtain, since $V^*(\bm x_N) = \phi(\bm x_N, N)$. Thus,
\begin{align}
\left.\frac{\partial V^*_{N}}{\partial \bm x}\right|_{N} &= \left.\frac{\partial \phi}{\partial \bm x}\right|_{N}\\
\left.\frac{\partial^2 V^*_{N}}{\partial \bm x^2}\right|_{N} &= \left.\frac{\partial^2 \phi}{\partial \bm x^2}\right|_{N}
\end{align}
An update step and a stop condition must still be chosen. A line search is usually performed for the update\cite{jacobson1970,liao1992}. The stop condition is given by the reduction in the cost due to the change in controls $\delta \bm u_k$. This reduction, denoted ${\Delta J}$, can be shown to be\cite{yakowitz1984}
\begin{equation}
\Delta J = - \frac{1}{2} \sum_{k=0}^{N-1} \bm q_u^T \bm Q_{uu}^{-1} \bm q_u
\end{equation}
The algorithm can be stated as Algorithm~\ref{alg:ddp}.
\begin{algorithm}
\caption{Differential Dynamic Programming Algorithm.}\label{alg:ddp}
\begin{algorithmic}[1]
\State Select an initial guess of the control policy $\{\bar{\bm u}_k\}_{k=0,..., N-1}$ and a threshold $\overline{\Delta J}$.
\State Propagate using the dynamics to get $\{\bar{\bm x}_k\}_{k=0,..., N}$.
\State Compute the cost $J$.
\State Do the backward sweep. Use Equations~\eqref{eq:q_x},~\eqref{eq:q_u},~\eqref{eq:Q_xx},~\eqref{eq:Q_xu},~\eqref{eq:Q_uu},~\eqref{eq:v_x},~\eqref{eq:v_xx} to compute $\bm \alpha_k$ and $\bm\beta_k$ in Equation~\eqref{eq:optimal_u}. Save these matrices for $k=0,..., N-1$. Accumulate the cost reduction $\Delta J_k = \bm q_u^T \bm Q_{uu}^{-1} \bm q_u + \Delta J_{k+1}$ ($\Delta J_{N} = 0$).
\State Perform the line-search correction $\bm u_{k}(\epsilon) = \bar{\bm u}_{k} + \epsilon \bm \alpha_k + \bm \beta_k (\bm x_k - \bar{\bm x}_k)$, propagate the dynamics, and compute the new cost $J(\epsilon)$.
\If {$J(\epsilon) - J < \epsilon \Delta J_0/2$}
\State $\bar{\bm u}_{k} = \bm u_{k}(\epsilon)$
\State $J = J(\epsilon)$
\If {$\Delta J_0 \geq \overline{\Delta J}$}
\State Go to step 2.
\Else
\State Stop.
\EndIf
\EndIf
\end{algorithmic}
\end{algorithm}
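The backward sweep of Algorithm~\ref{alg:ddp} can be sketched in Python as follows, assuming the per-stage partials of $\bm f$ and $L_k$ along the nominal trajectory are supplied as NumPy arrays; the function name and calling convention are ours, and the second-order dynamics tensors carry the output index first, matching the ``$*$'' product defined earlier.

\begin{lstlisting}
# Sketch of the DDP backward sweep (step 4 of Algorithm 2). fx, fu, Lx,
# Lu, Lxx, Lxu, Luu are lists of per-stage partials along the nominal
# trajectory; fxx, fxu, fuu are third-order tensors, output index first.
import numpy as np

def backward_sweep(fx, fu, fxx, fxu, fuu, Lx, Lu, Lxx, Lxu, Luu,
                   phi_x, phi_xx):
    N = len(fx)
    Vx, Vxx = phi_x, phi_xx            # boundary conditions at k = N
    alphas, betas, dJ = [None] * N, [None] * N, 0.0
    for k in range(N - 1, -1, -1):
        qx = fx[k].T @ Vx + Lx[k]
        qu = fu[k].T @ Vx + Lu[k]
        # tensordot implements the "*" contraction (A*B)_jk = A_i B_ijk
        Qxx = fx[k].T @ Vxx @ fx[k] + np.tensordot(Vx, fxx[k], 1) + Lxx[k]
        Qxu = fx[k].T @ Vxx @ fu[k] + np.tensordot(Vx, fxu[k], 1) + Lxu[k]
        Quu = fu[k].T @ Vxx @ fu[k] + np.tensordot(Vx, fuu[k], 1) + Luu[k]
        # linear solves instead of forming the inverse of Q_uu explicitly
        alphas[k] = -np.linalg.solve(Quu, qu)
        betas[k] = -np.linalg.solve(Quu, Qxu.T)
        dJ -= 0.5 * qu @ np.linalg.solve(Quu, qu)  # predicted reduction
        Vx = qx + betas[k].T @ qu + Qxu @ alphas[k] \
             + betas[k].T @ Quu @ alphas[k]
        Vxx = Qxx + Qxu @ betas[k] + betas[k].T @ Qxu.T \
              + betas[k].T @ Quu @ betas[k]
    return alphas, betas, dJ
\end{lstlisting}

Solving the linear systems rather than inverting $\bm Q_{uu}$ is a standard numerical choice; any shifting strategy (see the Convergence Issues section) would be applied to $\bm Q_{uu}$ just before these solves.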
If constraints are utilized, the only change required is in the update coefficients $\bm \alpha_k$ and $\bm \beta_k$\cite{lantoine2012,yakowitz1986,lin1991}.
\section{Stagewise Newton's Method}
The fact that DDP involves a stagewise quadratic minimization and converges quadratically\cite{liao1991} suggests that DDP might be equivalent to Newton's method\cite{boyd2004}. One of the main differences between DDP and Newton's method is that the former leverages the particular sequential structure of the problem, while the latter solves the whole problem by inverting a large matrix. For fixed state dimension $n$ and control dimension $m$ and with $N$ stages, the direct application of Newton's method results in a complexity of $\mathcal{O}(N^3)$, while DDP has a complexity of $\mathcal{O}(N)$\cite{liao1992}. This explains why Newton's method is not directly applied to the trajectory optimization problem, where the number of stages $N$ might be very large; approximations like direct transcription (collocation) or the use of DDP are needed.
Pantoja\cite{pantoja1988} invented a slight modification of DDP that makes the algorithm numerically equivalent to Newton's method while keeping the $\mathcal{O}(N)$ complexity of DDP. The algorithm involves a change in both the forward and the backward sweep. In the backward sweep, the following matrices are slightly changed relative to those of the DDP method\cite{pantoja1988,liao1992}
\begin{align}
\bm q_x &= \left.\frac{\partial f}{\partial \bm x}\right|^T_{k} \left.\frac{\partial V^*_{k+1}}{\partial \bm x}\right|_{k+1} + \left.\frac{\partial L_k}{\partial \bm x}\right|_{k}\label{eq:q_x_stageNew}\\
\bm q_u &= \left.\frac{\partial f}{\partial \bm u}\right|^T_{k} \left.\frac{\partial V^*_{k+1}}{\partial \bm x}\right|_{k+1} + \left.\frac{\partial L_k}{\partial \bm u}\right|_{k}\label{eq:q_u_stageNew}\\
\bm Q_{xx} &= \left.\frac{\partial f}{\partial \bm x}\right|^T_{k} \left.\frac{\partial^2 V^*_{k+1}}{\partial \bm x^2}\right|_{k+1} \left.\frac{\partial f}{\partial \bm x}\right|_{k} + \bm G_{k+1}^T * \left.\frac{\partial^2 f}{\partial \bm x^2}\right|_{k} + \left.\frac{\partial^2 L_k}{\partial \bm x^2}\right|_{k}\label{eq:Q_xx_stageNew}\\
\bm Q_{xu} &= \left.\frac{\partial f}{\partial \bm x}\right|^T_{k} \left.\frac{\partial^2 V^*_{k+1}}{\partial \bm x^2}\right|_{k+1} \left.\frac{\partial f}{\partial \bm u}\right|_{k} + \bm G_{k+1}^T * \left.\frac{\partial^2 f}{\partial \bm x \partial\bm u}\right|_{k} + \left.\frac{\partial^2 L_k}{\partial \bm x\partial\bm u}\right|_{k}\label{eq:Q_xu_stageNew}\\
\bm Q_{uu} &= \left.\frac{\partial f}{\partial \bm u}\right|^T_{k} \left.\frac{\partial^2 V^*_{k+1}}{\partial \bm x^2}\right|_{k+1} \left.\frac{\partial f}{\partial \bm u}\right|_{k} + \bm G_{k+1}^T * \left.\frac{\partial^2 f}{\partial \bm u^2}\right|_{k} + \left.\frac{\partial^2 L_k}{\partial \bm u^2}\right|_{k}\label{eq:Q_uu_stageNew}
\end{align}
where
\begin{equation}
\bm G_k =\left.\frac{\partial f}{\partial \bm x}\right|^T_{k} \bm G_{k+1} + \left.\frac{\partial L_k}{\partial \bm x}\right|_{k}
\end{equation}
and $\bm G_N = \phi_{\bm x}(\bm x_N, N)$.
The forward sweep is characterized by a linear approximation\cite{liao1992}
\begin{align}
\bm u_{k}(\epsilon) &= \bar{\bm u}_{k} + \epsilon \bm \alpha_k + \bm \beta_k (\hat{\bm x}_k - \bar{\bm x}_k)\\
\hat{\bm x}_{k+1} &= \bar{\bm x}_{k+1} + \left.\frac{\partial f}{\partial \bm x}\right|_{k} (\hat{\bm x}_k - \bar{\bm x}_k) + \left.\frac{\partial f}{\partial \bm u}\right|_{k} (\bm u_k(\epsilon) - \bar{\bm u}_k)
\end{align}
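Under the same conventions as the backward-sweep sketch above, this linearized forward sweep could read:

\begin{lstlisting}
# Sketch of Pantoja's linearized forward sweep: the nonlinear dynamics
# are not re-propagated; the state correction itself is linearized about
# the nominal trajectory using the stored fx, fu.
import numpy as np

def newton_forward_sweep(x_bar, u_bar, fx, fu, alphas, betas, eps):
    N = len(u_bar)
    x_hat = [x_bar[0]] + [None] * N    # the initial state is fixed
    u_new = [None] * N
    for k in range(N):
        dx = x_hat[k] - x_bar[k]
        u_new[k] = u_bar[k] + eps * alphas[k] + betas[k] @ dx
        x_hat[k + 1] = (x_bar[k + 1] + fx[k] @ dx
                        + fu[k] @ (u_new[k] - u_bar[k]))
    return x_hat, u_new
\end{lstlisting}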
The linear approximation used by Pantoja's method, or stagewise Newton's method, makes its convergence slightly slower than DDP's, as shown in the numerical results in this paper.
\section{Convergence Issues}
DDP converges quadratically if the stagewise Hessian $\bm Q_{uu}$ is Positive-Definite (PD)\cite{murray1984}. This is guaranteed if the dynamics are linear and the objective is convex\cite{liao1991}. In any other case, the stagewise Hessian may fail to be PD; indeed, even if both functions are convex, the matrix might not be PD\cite{liao1991}.
To mitigate this problem, several solutions have been proposed. The simplest one, known as the constant shift\cite{yakowitz1984}, is to compute the eigenvalues of the matrix and, if any is nonpositive, add a constant diagonal matrix $\xi \bm I$ to $\bm Q_{uu}$, where $\xi$ is a constant that has to be picked. The method is simple and easy to tune, but reduces the convergence rate to linear\cite{liao1991}.
Another method, known as the adaptive shift\cite{liao1992}, uses two steps. The first is the constant shift. In the second step, $\bm Q_{uu}$ is replaced by $\bm Q_{uu} + \xi_a(\delta) \bm I$, where $\xi_a(\delta)$ is a function of a parameter $\delta$ and of $\lambda$, the minimum eigenvalue of $\bm Q_{uu}$:
\begin{equation}
\xi_a(\delta) = \begin{cases}
\delta -\lambda \quad &\text{if } \lambda < \delta\\
0 &\text{if } \lambda \geq \delta
\end{cases}
\end{equation}
The method maintains quadratic convergence, but we have found it very hard to tune. In addition to the parameters themselves, one must decide when to switch from the first step to the second.
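For illustration, both shifting strategies can be sketched as follows; the tuning constants $\xi$ and $\delta$ are problem-specific, as discussed above.

\begin{lstlisting}
# Sketch of the two shifting strategies for a non-PD stagewise Hessian.
import numpy as np

def constant_shift(Quu, xi):
    """Add xi*I whenever any eigenvalue of Quu is nonpositive."""
    if np.linalg.eigvalsh(Quu).min() <= 0.0:
        Quu = Quu + xi * np.eye(Quu.shape[0])
    return Quu

def adaptive_shift(Quu, delta):
    """Shift so that the minimum eigenvalue is at least delta."""
    lam = np.linalg.eigvalsh(Quu).min()
    if lam < delta:
        Quu = Quu + (delta - lam) * np.eye(Quu.shape[0])
    return Quu
\end{lstlisting}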
\section{Numerical Results}
The following test system is simulated
\begin{align}
J &= \sum_{k=1}^{N-1} \left(\sum_{i=1}^{n} \left(x_k^i + \frac{1}{4}\right)^4 + \sum_{j=1}^{m} \left(u_k^j + \frac{1}{2}\right)^4 \right) + \sum_{i=1}^{n} \left(x_N^i + \frac{1}{4}\right)^4\\
\bm x_{k+1} &= \bm f(\bm x_k, \bm u_k, k) = A \bm x_k + B \bm u_k + \bm\gamma \bm x_k^T C \bm u_k\\
\bm x_0 &= \begin{bmatrix}0 & ...& 0\end{bmatrix}^T \in \mathbb{R}^n \qquad \bm u \in \mathbb{R}^m\\
A_{ij} &=\begin{cases}
0.5 \quad &\text{if } i=j\\
0.25 \quad &\text{if } i=j-1\\
-0.25 \quad &\text{if } i=j+1\\
0 &\text{otherwise}
\end{cases}\quad
B_{ij} = \frac{i-j}{n+m}\quad
C_{ij} = \mu \frac{i+j}{n+m}\quad \gamma_i = 1
\end{align}
The advantage of analyzing DDP with this system is that, for $\mu$ small, the dynamics are almost linear and, since the cost is convex, the stagewise Hessian will always be positive-definite. If $\mu$ is increased, the linearity is broken.
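For reference, the system matrices and dynamics can be constructed as follows (a sketch of the problem statement above; indices are 1-based in the equations and 0-based in NumPy):

\begin{lstlisting}
# Construction of the test system (sketch). i indexes states (1..n),
# j indexes controls (1..m); gamma*(x^T C u) is a scalar times a vector.
import numpy as np

def build_system(n, m, mu):
    i = np.arange(1, n + 1)[:, None]   # 1-based row indices
    jA = np.arange(1, n + 1)[None, :]  # 1-based column indices for A
    jB = np.arange(1, m + 1)[None, :]  # 1-based column indices for B, C
    A = 0.5 * (i == jA) + 0.25 * (i == jA - 1) - 0.25 * (i == jA + 1)
    B = (i - jB) / (n + m)
    C = mu * (i + jB) / (n + m)
    gamma = np.ones(n)
    def f(x, u, k):
        return A @ x + B @ u + gamma * (x @ C @ u)
    return A, B, C, f
\end{lstlisting}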
\subsection{Comparison between Dynamic Programming and Differential Dynamic Programming}
DP is practically useless for an unconstrained problem. Even for a small problem ($n=2$, $m=3$, $N=20$, $\mu=1/20$), DP takes more than an hour to converge and is very sensitive to the quantization used. DDP, on the other hand, takes a fraction of a second.
\subsection{Comparison between Stagewise Newton's Method and Differential Dynamic Programming}
In Table~\ref{tab:ddp_new_1}, simulations are shown with 100 states, 50 controls, and $N=199$ time steps with $\mu=1/200$. The Hessian is always PD because the nonlinearity is small. DDP converges faster than the stagewise Newton's method.
If the nonlinearity is increased, the Hessian is not always PD and some kind of shifting has to be used. In Table~\ref{tab:ddp_new_2}, a constant shift with a value of 100 is used. Convergence is slower than in the previous case due to the large value of the constant shift. DDP is still faster than the stagewise Newton's method.
Since the simulations are performed in Python, an interpreted language not known for speed, the absolute elapsed times are not important. The comparison between the algorithms, however, is.
\begin{table}
\centering
\begin{tabular}{c|c|c}
\hline
\hline
\multicolumn{3}{|c|}{$n=100$, $m=50$, $N=199$, $\mu=1/200$}\\
\hline
\hline
Iteration & DDP & Newton \\
\hline
\hline
1 & 1094.02237274 & 1094.02237274\\
\hline
2 & 713.208711911 & 774.081691362\\
\hline
3 & 649.87708551 & 760.166793269\\
\hline
4 & 630.666565295 & 749.878023335\\
\hline
5 & 622.299705392 & 746.217647457\\
\hline
6 & 616.855458297 & 679.394796329\\
\hline
7 & 616.282210333 & 678.130133956\\
\hline
8 & 616.242809973 & 652.467587611\\
\hline
9 & 616.240343843 & 626.701169557\\
\hline
10 & - & 623.107618479\\
\hline
11 & - & 621.764937873\\
\hline
12 & - & 618.930811549\\
\hline
13 & - & 617.179642756\\
\hline
14 & - & 616.443972954\\
\hline
15 & - & 616.320271865\\
\hline
16 & - & 616.246094954\\
\hline
17 & - & 616.240456469\\
\hline
\hline
Time [s] & 150.635533094 & 301.299507141\\
\hline
\hline
\end{tabular}
\caption{Results for $n=100$, $m=50$, $N=199$, $\mu=1/200$.}\label{tab:ddp_new_1}
\end{table}
\begin{table}
\centering
\begin{tabular}{c|c|c}
\hline
\hline
\multicolumn{3}{|c|}{$n=100$, $m=50$, $N=199$, $\mu=1/100$}\\
\hline
\hline
Iteration & DDP & Newton \\
\hline
\hline
1 & 1156.95638225 & 1156.95638225\\
\hline
2 & 761.041028597 & 692.907483186\\
\hline
3 & 666.198089302 & 650.764807857\\
\hline
4 & 639.764041664 & 649.92639916\\
\hline
5 & 630.88594283 & 636.241077824\\
\hline
6 & 626.830251978 & 628.010389669\\
\hline
7 & 621.488428685 & 621.542592299\\
\hline
8 & 619.182278538 & 619.700007977\\
\hline
9 & 618.50467636 & 618.957800226\\
\hline
10 & 618.327440059 & 618.58976345\\
\hline
11 & 618.280977278 & 618.423456138\\
\hline
12 & 618.267711093 & 618.345313727\\
\hline
13 & 618.263172887 & 618.305597624\\
\hline
14 & 618.261181302 & 618.284487824\\
\hline
15 & 618.260085508 & 618.272946695\\
\hline
16 & - & 618.266516824\\
\hline
17 & - & 618.26288007\\
\hline
18 & - & 618.260792474\\
\hline
19 & - & 618.259574333\\
\hline
\hline
Time [s] & 294.781922102 & 344.581017017\\
\hline
\hline
\end{tabular}
\caption{Results for $n=100$, $m=50$, $N=199$, $\mu=1/100$.}\label{tab:ddp_new_2}
\end{table}
\section{Conclusions}
A brief introduction to DP-derived methods has been given. DDP is especially suited to solving optimal control problems because it leverages their sequential structure. DDP can solve optimal control problems with thousands of variables in seconds. Additionally, it can be extended to constrained problems.
When compared to other DP-derived methods, DDP performs better because it does not perform a full search of the space, as DP does, and does not linearize the dynamics in the propagation, as the stagewise Newton's method does.
The main drawback of DDP is the shift required in the stagewise Hessian when it is not positive-definite. Shifting strategies are problem-specific and might be hard to tune.
\bibliographystyle{ieeetr} % or "siam", or "alpha", etc.
\bibliography{references} % Bib database in "refs.bib"
\end{document}