\documentclass[11pt]{report}
\usepackage[a4paper,margin=2cm, bindingoffset=2cm]{geometry}
\usepackage{appendix}
\usepackage{amsmath}
\usepackage{booktabs}
\usepackage{threeparttable}
\usepackage{natbib}
\bibliographystyle{chicago}

%to stop orphan lines
\widowpenalty=10000
\clubpenalty=10000
\raggedbottom

%line spacing
\linespread{1.3}

\begin{document}
\begin{titlepage}
\begin{center}
\large
\textsc{University of Southampton} \\
\ \\
\textsc{Faculty of Law, Arts \& Social Sciences} \\
\ \\
\textsc{School of Social Sciences}\\
\ \\
\ \\
\ \\
\ \\
\ \\
\ \\
\ \\
\Huge
\textbf{International Migration Flow Table Estimation}
\ \\
\ \\
\large by
\ \\
\ \\
Guy J. Abel
\vfill
Thesis for the degree of Doctor of Philosophy \\
\ \\
\ \\
April 2009
\end{center}
\end{titlepage}
\begin{titlepage}
\begin{center}
\ \\
\ \\
\ \\
\ \\
\ \\
\ \\
\ \\
\ \\
\ \\
\ \\
\ \\
\ \\
To Ed and Diana
\end{center}
\end{titlepage}

%Roman Page Numbering
\pagenumbering{roman}
\chapter*{{Abstract}\markboth{Abstract}{Abstract}}
\addcontentsline{toc}{chapter}{Abstract}
A methodology is developed to estimate comparable international migration flows between a set of countries. International migration flow data may be missing, reported by the sending country, reported by the receiving country or reported by both the sending and receiving countries. For the last situation, reported counts rarely match due to differences in definitions and data collection systems. In this thesis, reported counts are harmonized using correction factors estimated from a constrained optimization procedure. Factors are applied to scale data known to be of a reliable standard, creating an incomplete migration flow table of harmonized values. Cells for which no reliable reported flows exist are then estimated from a negative binomial regression model fitted using an Expectation-Maximization (EM) type algorithm. Covariate information for this model is drawn from international migration theory. Finally, measures of precision for all missing cell estimates are derived using the Supplemented EM algorithm. Recent data on international migration between countries in Europe are used to illustrate the methodology. The results represent a complete table of comparable flows that can be used by regional policy makers and social scientists alike to better understand population behaviour and change.


\tableofcontents
\listoffigures
\addcontentsline{toc}{chapter}{List of Figures}
\listoftables
\addcontentsline{toc}{chapter}{List of Tables}

\chapter*{Declaration of Authorship}
\addcontentsline{toc}{chapter}{Declaration of Authorship}
I, Guy Jonathan Abel, declare that the thesis entitled \emph{International Migration Flow Table Estimation} and the work presented in the thesis are both my own, and have been generated by me as the result of my own original research. I confirm that:
\begin{itemize}
\item this work was done wholly or mainly while in candidature for a research degree at this University;
\item where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated;
\item where I have consulted the published work of others, this is always clearly attributed;
\item where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work;
\item I have acknowledged all main sources of help;
\item where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.
\end{itemize}
\vspace{2cm}
Signed: \dotfill
\vspace{2cm}
\newline
\noindent
Date: \dotfill

%Include Acknowledgements in TOC
\chapter*{Acknowledgements}
\addcontentsline{toc}{chapter}{Acknowledgements}
This work was undertaken with the financial support of the Economic and Social Research Council (PTA No-031-2004-00...).
This thesis would not have been possible without the support of many people. I would like to express my sincere gratitude to:
\begin{itemize}
\item My supervisors, James Raymer and Peter Smith, who were abundantly helpful and offered invaluable advice, assistance and support throughout my studies in Southampton.
\item ...
\end{itemize}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Introduction}
%Arabic Page Numbering
\pagenumbering{arabic}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Migration flow data inform policy makers, the media and the academic community about the level and direction of population movements. ...

\section{International Migration Data}

Migration can be measured as either a flow or stock. ...

\section{International Migration Flow Tables}

Data on migration between a set of regions are commonly presented in a square table with off-diagonal entries containing the number of people moving from any given origin to any given destination. ...

\section{Thesis Aims and Scope}

The study of transition patterns, such as migration flows, generally involves three steps \citep{rogers1980imm}. ...

\section{Thesis Structure}

The study is structured in seven chapters. ...

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Statistical Modelling}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Introduction}

This chapter outlines the statistical modelling techniques for migration flow tables to set the stage for future chapters. ...

\section{Regression Models}

In many scientific studies, interest lies in the relationship between two or more observable quantities. ...

\section{Generalized Linear Models}

Linear regression models are part of a range of statistical models, known as generalized linear models \citep{nelder1972glm}.

\subsection{Normal and Log-Normal Distribution}

In a continuous case, the response variable can be assumed to be independently normally distributed with parameters $(\mu_{i}, \sigma^{2})$ for the mean and variance respectively.
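
As a brief sketch in standard notation (the symbol $y_{i}$ for an observed response is an assumption made here for illustration), the normal density takes the form
\begin{equation*}
f(y_{i}; \mu_{i}, \sigma^{2}) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left\{ -\frac{(y_{i}-\mu_{i})^{2}}{2\sigma^{2}} \right\},
\end{equation*}
and a log-normal response is one for which $\log y_{i}$ follows this distribution.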

\subsection{Poisson Distribution}

In a discrete case, a response variable of count data can be assumed to have a Poisson distribution with rate parameter $\mu$.
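
For illustration, and assuming the usual notation in which $y$ denotes an observed count, the Poisson probability function is
\begin{equation*}
P(Y = y) = \frac{e^{-\mu}\mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots,
\end{equation*}
so that $E(Y) = \mathrm{Var}(Y) = \mu$. This equality of mean and variance is the restriction that a negative binomial model relaxes.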

\section{Fitting Generalized Linear Models}

Maximum likelihood estimates are frequently used in migration models as they possess very desirable asymptotic properties such as consistency, asymptotic normality and asymptotic robustness \citep[p457-69]{sen1995gms}.

\subsection{Mean and Variance}

The mean and variance of the random component in a generalized linear model may be obtained in a general form, allowing the maximum likelihood estimates to be found using iteratively reweighted least squares (IRLS).
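
A sketch of such a general form, under the common exponential dispersion family notation (the symbols $\theta_{i}$, $\phi$, $a(\cdot)$, $b(\cdot)$ and $c(\cdot)$ are assumed here rather than taken from the text), is
\begin{equation*}
f(y_{i}; \theta_{i}, \phi) = \exp\left\{ \frac{y_{i}\theta_{i} - b(\theta_{i})}{a(\phi)} + c(y_{i}, \phi) \right\},
\end{equation*}
for which
\begin{equation*}
E(Y_{i}) = b'(\theta_{i}), \qquad \mathrm{Var}(Y_{i}) = b''(\theta_{i})\, a(\phi).
\end{equation*}
As a check, the Poisson case has $\theta_{i} = \log \mu_{i}$, $b(\theta_{i}) = \exp(\theta_{i})$ and $a(\phi) = 1$, which returns a mean and a variance both equal to $\mu_{i}$.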

\subsection{Likelihood Equations}

In order to obtain maximum likelihood parameter estimates for a generalized linear model we must first obtain the likelihood equations.
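
As a sketch in assumed notation, with linear predictor $\eta_{i} = \sum_{j} \beta_{j} x_{ij}$, mean $\mu_{i}$ and $p$ parameters, the likelihood equations of a generalized linear model are commonly written as
\begin{equation*}
\sum_{i=1}^{n} \frac{(y_{i} - \mu_{i})\, x_{ij}}{\mathrm{Var}(Y_{i})} \frac{\partial \mu_{i}}{\partial \eta_{i}} = 0, \qquad j = 1, \ldots, p.
\end{equation*}
For a Poisson model with a log link, $\partial \mu_{i} / \partial \eta_{i} = \mu_{i} = \mathrm{Var}(Y_{i})$, so the equations reduce to $\sum_{i} (y_{i} - \mu_{i})\, x_{ij} = 0$.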

\subsection{Asymptotic Variance-Covariance Matrix of Parameter Estimates}

The asymptotic variance-covariance matrix for parameter estimates is required to provide a useful simplification in the IRLS procedure.
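
The standard result can be sketched as follows, where the design matrix $\mathbf{X}$ and the diagonal weight matrix $\mathbf{W}$ are notation assumed here for illustration:
\begin{equation*}
\widehat{\mathrm{cov}}(\hat{\boldsymbol\beta}) = (\mathbf{X}^{T} \hat{\mathbf{W}} \mathbf{X})^{-1}, \qquad w_{i} = \frac{(\partial \mu_{i} / \partial \eta_{i})^{2}}{\mathrm{Var}(Y_{i})},
\end{equation*}
with $\hat{\mathbf{W}}$ denoting $\mathbf{W}$ evaluated at $\hat{\boldsymbol\beta}$. The same weights reappear in each IRLS step, which is the source of the simplification.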

\subsection{Iteratively Reweighted Least Squares}

For the likelihood equations of a classic linear regression model, the maximum likelihood estimators of $\boldsymbol\beta$ can be found by re-expressing (\ref{eqn:norleq}) for $\boldsymbol\beta$, in matrix notation:
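
As a hedged sketch of the expressions involved, writing $\mathbf{X}$ for the design matrix and $\mathbf{y}$ for the response vector (notation assumed here), the familiar least squares solution for the classic linear model is
\begin{equation*}
\hat{\boldsymbol\beta} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}.
\end{equation*}
For a generalized linear model no closed form exists in general, and IRLS instead repeats a weighted version of this solution,
\begin{equation*}
\boldsymbol\beta^{(t+1)} = (\mathbf{X}^{T}\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{W}^{(t)}\mathbf{z}^{(t)}, \qquad z_{i}^{(t)} = \eta_{i}^{(t)} + (y_{i} - \mu_{i}^{(t)}) \frac{\partial \eta_{i}}{\partial \mu_{i}},
\end{equation*}
where $\mathbf{W}^{(t)}$ is a diagonal weight matrix and $\mathbf{z}^{(t)}$ an adjusted response, both evaluated at the current estimate $\boldsymbol\beta^{(t)}$, until the estimates converge.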

\section{Negative Binomial Regression Models}

The negative binomial distribution has two parameters, which allow the mean and variance to be fitted separately, as opposed to the single-parameter Poisson regression model.
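
As an illustration, under one common parameterisation (assumed here; it matches the \texttt{theta} parameter of the \texttt{glm.nb} function in the MASS package for R, and may differ from the one adopted in the thesis), the probability function is
\begin{equation*}
P(Y = y) = \frac{\Gamma(y + \theta)}{\Gamma(\theta)\, y!} \left( \frac{\theta}{\theta + \mu} \right)^{\theta} \left( \frac{\mu}{\theta + \mu} \right)^{y}, \qquad y = 0, 1, 2, \ldots,
\end{equation*}
with
\begin{equation*}
E(Y) = \mu, \qquad \mathrm{Var}(Y) = \mu + \frac{\mu^{2}}{\theta}.
\end{equation*}
The variance exceeds the mean for any finite $\theta$, and the Poisson distribution is recovered as $\theta \rightarrow \infty$.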

\subsection{Asymptotic Variance-Covariance Matrix}

\citet[p71]{cameron1998rac} showed that for the negative binomial regression model the maximum likelihood estimates are the solution to the first order conditions

\subsection{Fitting Negative Binomial Regression Models}\label{sec:nbreg}

\citet[p560-1]{agresti2002cda} noted that a negative binomial model may be fitted in a similar manner to Poisson regression models when the dispersion parameter is known.

\section{Statistical Modelling of Missing Data}

International migration flow data are often missing.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{A Review of Methodologies for Estimating an International Migration Flow Table of Comparable Data}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Introduction}

At present, the responsibility for the collection of international migration flow data rests with individual national statistics institutes. ...

\section{Problems of Comparability in International Migration Flow Data}

The lack of comparability in international migration data can be traced to the multi-dimensional nature of migration \citep{goldstein1976frr}. ...

\subsection{Data Production Techniques}

Differences in the production of migration flow statistics can be derived from distinctive data collection methods and definitional measurements used by national statistics institutes. ...

\section{International Migration Flow Tables}

Migration data are commonly represented in square tables, with off-diagonal entries containing the number of people moving from any given origin $i$ to any given destination $j$ in a single time period. ...

\section{Constrained Optimization}

Estimates of a complete migration flow table between 28 European nations in 2004 were calculated by \cite{poulain2007mim} (and \cite{poulain2008emm}) as part of the MIMOSA project. ...

\section{Model Component Modelling}

A multiplicative component approach was applied by \cite{raymer2007eim} to estimate international migration flows between ten countries in Northern Europe in 1999. ...


\section{Discussion of Frameworks} \label{sec:predis}

The presented frameworks and their possible extensions are discussed in the following subsections. ...

\subsection{Constrained Optimization}

The framework proposed by \cite{poulain1993csm} was the first effort to estimate an international migration flow table of comparable data. ...

\subsection{Model Component Modelling}

The multiplicative component methodology of \cite{raymer2007eim} decomposes a flow table into a number of model parameters whose values are estimated using statistical models. ...

\section{Summary and Conclusion}

The methodologies presented in this chapter take vastly different approaches to estimating a complete migration table. ...

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Overcoming Inconsistencies in International Migration Flow Tables}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Introduction}

The lack of comparability in international migration data can be traced to the multi-dimensional nature of migration \citep{goldstein1976frr}. ...

\section{International Migration Flow Data for the EU15}

International migration flow data may be obtained from a number of international organizations. ...

\subsection{Ratings of Migration Data for EU15}

\begin{table}[h]
%\begin{center}
\begin{threeparttable}
\caption{\cite{erf2007mim} Ratings of Migration Data for EU15 from 2002 to 2006} \label{tab:erfrat}
\begin{tabular}{ccccccc}
\toprule
Country & \multicolumn{3}{c}{Receiving} & \multicolumn{3}{c}{Sending} \\
\cmidrule(lr){2-4} \cmidrule(lr){5-7}
& Timing & Completeness & Accuracy & Timing & Completeness & Accuracy \\
\midrule
AUT & 3 & 4 & 4 & 3 & 4 & 4 \\
BEL & 3 & 9 & 9 & 3 & 9 & 9 \\
DNK & 2(3) & 4(4) & 4(4) & 3 & 4 & 4 \\
FIN & 2(4) & 4(4) & 4(4) & 4 & 4 & 4 \\
FRA & 3 & 2 & 9 & & & \\
DEU & 2 & 4 & 4 & 2 & 4 & 4 \\
GRC & & & & & & \\
IRL & 2 & 2 & 2 & 2 & 2 & 2 \\
ITA & 2(3) & 3(3) & 3(3) & 4 & 3 & 3 \\
LUX & 2 & 3 & 3 & 2 & 3 & 3 \\
NLD & 3 & 4 & 4 & 4 & 4 & 4 \\
PRT & 4 & 9 & 9 & 3 & 2 & 2 \\
ESP & 2 & 3 & 3 & 2 & 3 & 3 \\
SWE & 4 & 4 & 4 & 4 & 4 & 4 \\
GBR & 4 & 2 & 2 & 4 & 2 & 2 \\
\bottomrule
\end{tabular}
\footnotesize
\begin{tablenotes}
\item[] 0: Worst, 1: Worse, 2: Insufficient, 3: Reasonable, 4: Good, 5: Excellent, 9: Unknown.
\item[] Scores in parentheses are for non-nationals, where national and non-national data are collected differently.
\end{tablenotes}
\end{threeparttable}
\end{table}

In order to obtain a comparison of the European migration flow data, \cite{erf2007mim} provided subjective judgements by three characteristics: definitions of migration, measurement systems and intended coverage. ...

\subsection{Data Dissemination in the EU15}

Plots of the available counts of migrants with unknown origins or destinations, as a proportion of total sending and receiving countries, are shown in Figure \ref{fig:harmunk} for EU15 nations between 2002 and 2006. ...

\section{Methodology for Creating Comparable Data from Reliable Data Sources}

In this section, a general methodology that allows the estimation of incomplete international migration flow tables is described. ...

\subsection{Counts of Unknown Migrant Origins and Destinations}

As previously discussed, international migration flow data are accompanied by a count of migrants with unknown origins or destinations. ...

\subsection{Constrained Optimization}

Differences in counts between nations with better quality data can be considered as fixed, where data production techniques do not change over time. ...

\section{Estimating Comparable Data from Reliable Data Sources}

In order to estimate comparable data from reliable data sources, reported counts are adjusted for unknowns produced in the dissemination of data by national statistics institutes. ...

\subsection{Correction for Unknown Counts}

All unknown counts, displayed in Figure \ref{fig:harmunk}, are distributed to origins and destinations using the equations in (\ref{eqn:unksca}). ...

\subsection{Comparison of Distance Measures} \label{sec:optdis}

Distance functions other than the Chi-Squared distance measure could provide more stable correction factors over time, and hence better reflect the assumption that data collection methods and definitions remain constant. ...
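
For reference, the \texttt{ChiSq} function listed in the appendix can be transcribed into notation. Writing $m^{(1)}_{ij}$ and $m^{(2)}_{ij}$ for the cells of its two input tables \texttt{M1} and \texttt{M2}, and $a_{j}$ and $b_{i}$ for the factors held in the first and second halves of its argument \texttt{x} (all symbols assumed here for illustration), the Chi-Squared distance appears to take the form
\begin{equation*}
d = \sum_{i} \sum_{j} \frac{\left( a_{j}\, m^{(1)}_{ij} - b_{i}\, m^{(2)}_{ij} \right)^{2}}{m^{(1)}_{ij} + m^{(2)}_{ij}},
\end{equation*}
with missing cells omitted from the sum. The code does not state which table holds the receiving country reports and which the sending country reports, nor how $a_{j}$ and $b_{i}$ map onto the correction factors in the text, so this reading is a sketch rather than a definition.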

\subsection{Constrained Optimization Over Time}

For the distance measure associated with the smallest variance, a new set of time-constant correction factors $(\mathbf{r},\mathbf{s})$ is estimated. ...

\section{Summary and Conclusion}

In this chapter, a methodology for the harmonization of data for international migration flow tables was outlined. ...


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Estimating Missing Data in International Migration Flow Tables}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\section{Introduction}

In this chapter, model-based imputations for missing data in flow tables are derived. ...

\section{Models for Migration Flow Tables}

\cite{flowerdew1991prm} outlined two main approaches to the analysis of flow tables that are commonly used for internal mobility data: the gravity model and the spatial interaction model. ...
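
As a rough sketch, these two approaches are often written as log-linear models for the expected flow $\mu_{ij}$ from origin $i$ to destination $j$. The covariates used here, population sizes $P_{i}$ and $P_{j}$ and distance $d_{ij}$, are illustrative assumptions and may differ from those considered in the thesis. A gravity model might take the form
\begin{equation*}
\log \mu_{ij} = \beta_{0} + \beta_{1} \log P_{i} + \beta_{2} \log P_{j} + \beta_{3} \log d_{ij},
\end{equation*}
whereas a spatial interaction model replaces the population terms with origin and destination specific effects,
\begin{equation*}
\log \mu_{ij} = \beta_{0} + \alpha_{i} + \gamma_{j} + \beta_{3} \log d_{ij}.
\end{equation*}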

\section{The Expectation-Maximization (EM) Algorithm} \label{sec:em}

The EM algorithm is an iterative algorithm for maximum likelihood estimation in incomplete data problems. ...
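
In outline, and using assumed notation in which $\mathbf{y}_{obs}$ and $\mathbf{y}_{mis}$ denote the observed and missing data and $L_{c}$ the complete data likelihood, iteration $t$ of the algorithm consists of an E-step, which forms
\begin{equation*}
Q(\boldsymbol\theta \mid \boldsymbol\theta^{(t)}) = E\left[ \log L_{c}(\boldsymbol\theta; \mathbf{y}_{obs}, \mathbf{y}_{mis}) \mid \mathbf{y}_{obs}, \boldsymbol\theta^{(t)} \right],
\end{equation*}
followed by an M-step, which sets
\begin{equation*}
\boldsymbol\theta^{(t+1)} = \arg\max_{\boldsymbol\theta}\, Q(\boldsymbol\theta \mid \boldsymbol\theta^{(t)}).
\end{equation*}
The two steps are repeated until successive values of $\boldsymbol\theta^{(t)}$ differ by less than a chosen tolerance.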

\section{Modelling Incomplete International Migration Flow Tables} \label{sec:covadd}

In this section, negative binomial regression models are fitted to incomplete international migration flow data for the EU15 countries, presented in Figure \ref{fig:harmfn}. ...

\subsection{Additional Information} \label{sec:covdis}

In order to provide more reasonable imputations, the quasi-independent model was expanded upon. ...

\subsection{Main Effects Model}

In order to attain a better model fit and more realistic imputations, the Akaike Information Criterion (AIC) was used to select the most suitable variables for a main effects model. ...

\subsection{Interaction Models}

To improve the fit further, the \texttt{stepAIC} function was run once more with an extended scope of models to consider all two-way interactions, with one exception: the origin-destination interaction. ...

\section{Summary and Discussion}

In this chapter, a complete set of estimates of international migration flow tables is created, using a spatial interaction model fitted using the EM algorithm on the harmonized flows from the previous chapter. ...

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Estimating Measures of Precision for Missing Data in International Migration Flow Tables}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\section{Introduction}

In this chapter, estimates for the measures of precision of missing cells in international migration flow tables are derived. ...

\section{Properties of the EM Algorithm}

The derivation of the SEM algorithm is dependent on both analytical expressions for the rate of convergence of the EM algorithm and manipulations of the asymptotic variance-covariance matrix of parameter estimates. ...

\subsection{Rate of Convergence in the EM Algorithm}

For the EM algorithm described in Section \ref{sec:em}, the mapping $\boldsymbol\theta \rightarrow M( \boldsymbol\theta )$ from the parameter space of $\boldsymbol\theta$ to itself is implied. ...
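
As a sketch of the standard argument, with $\boldsymbol\theta^{*}$ denoting the point to which the algorithm converges and $\boldsymbol\theta$ treated as a row vector (conventions assumed here), a first order Taylor expansion of the mapping about $\boldsymbol\theta^{*}$ gives
\begin{equation*}
\boldsymbol\theta^{(t+1)} - \boldsymbol\theta^{*} \approx (\boldsymbol\theta^{(t)} - \boldsymbol\theta^{*})\, DM, \qquad DM = \left( \frac{\partial M_{j}(\boldsymbol\theta)}{\partial \theta_{i}} \right) \bigg|_{\boldsymbol\theta = \boldsymbol\theta^{*}}.
\end{equation*}
Near convergence the algorithm is therefore approximately linear, with a rate of convergence governed by the matrix $DM$.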

\section{Supplemented EM Algorithm}

The estimation of $\Delta\mathbf{V}$ in (\ref{eqn:semv2}) can be obtained using the SEM algorithm introduced by \cite{meng1991uoa}. ...
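
For orientation, the central identity of the SEM algorithm can be sketched as follows, using assumed notation in which $I_{oc}$ is the expected complete data information matrix evaluated at the maximum likelihood estimate, $DM$ is the Jacobian of the EM mapping at convergence and $\mathbf{I}$ is the identity matrix:
\begin{equation*}
\mathbf{V} = I_{oc}^{-1} + \Delta\mathbf{V}, \qquad \Delta\mathbf{V} = I_{oc}^{-1}\, DM\, (\mathbf{I} - DM)^{-1}.
\end{equation*}
The term $\Delta\mathbf{V}$ is the increase in variance attributable to the missing data, and $DM$ is obtained numerically from additional runs of the EM code.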

\section{Akaike Information Criterion for Incomplete Data}

Finding a suitable dimension for parameters $\boldsymbol\theta $ can be undertaken by comparing several models based on their values of an information criterion, such as the Akaike Information Criterion (AIC) of (\ref{eqn:emaic}). ...
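
For comparison, the AIC in its usual complete data form, with $p$ denoting the number of estimated parameters and $L(\hat{\boldsymbol\theta})$ the maximized likelihood (notation assumed here), is
\begin{equation*}
\mathrm{AIC} = -2 \log L(\hat{\boldsymbol\theta}) + 2p,
\end{equation*}
with smaller values indicating a preferred model. The version for incomplete data used in this chapter may differ from this expression.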

\section{Estimates of Precision for Missing Data in International Migration Flow Tables}

The SEM algorithm can be utilized in the estimation of international migration flow tables. ...

\subsection{Modelling of Complete Data}

As no implementable stepwise model selection routine existed for incomplete data, a fit-all-models function was written to run the SEM algorithm on the complete range of main effects models from the covariate set proposed in Section \ref{sec:covdis}. ...

\section{Summary and Discussion}

The SEM algorithm provides a useful technique when applied to international migration flow tables, where data are often incomplete. ...


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Conclusion}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Summary}

This study applied computationally intensive mathematical and statistical techniques to develop a methodology to estimate international migration flow tables of comparable data. ...

\subsection{Estimation Over Time}

The relative stability in migration definitions and data collection systems provides a basis for harmonizing international migration flow data. ...

\subsection{Accounting for Data Dissemination Problems}

As a prelude to the estimation of correction factors, counts of known migrants with unknown origins or destinations were accounted for by distributing these flows according to the existing distributional patterns. ...

\subsection{Ignoring Poor Quality Data}

Careful consideration was taken in deciding the eligibility of countries for the estimation of correction factors to scale reported data. ...

\subsection{Model Selection}

The EM algorithm was used to impute missing migration flow values. ...

\subsection{Measures of Variation}

The SEM algorithm was used in Chapter 6 to obtain an estimate for the asymptotic variance-covariance matrix for parameter estimates, using only the code for an EM algorithm, computations for the asymptotic complete-data variance-covariance matrix and standard matrix procedures. ...


\section{Context of Study}

\subsection{Modelling International Migration}

There exists a wide range of literature on modelling migration (see for example, \cite{massey1993tim} or \cite{greenwood2003ehm}). ...

\subsection{International Migration Data}

International migration flow data are often incomparable across multiple nations. ...


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{small}
\addcontentsline{toc}{chapter}{Bibliography}
\bibliography{ThesisBib}
\end{small}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\newpage
\appendix
\noappendicestocpagenum
\addappheadtotoc

\chapter{S-Plus/R Code}
\section{Poulain Constrained Minimization}

\linespread{1}
\begin{small}
\begin{verbatim}
poulain <- function(M, nr, base)
{
if(dim(M)[3] != 2)
stop("M must be an array of dimensions n x n x 2")
#tidy up data to exclude non-referee (nr) regions
M[is.na(M)] <- 0
...
\end{verbatim}
\end{small}
\newpage

\section{Distance Functions for Constrained Optimization}

\begin{small}
\begin{verbatim}
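ChiSq <- function(x, M1, M2)
{
  # x stacks two sets of correction factors, one in each half
  n <- length(x)
  # first half of x, filled by row and recycled down the rows
  a <- matrix(x[1:(n/2)], dim(M1)[1], dim(M1)[2], byrow = TRUE)
  # second half of x, filled by column and recycled across the columns
  b <- matrix(x[(1 + n/2):n], dim(M2)[1], dim(M2)[2])
  # chi-squared distance between the two scaled tables, skipping NA cells
  sum((a * M1 - b * M2)^2/(M1 + M2), na.rm = TRUE)
}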
...
\end{verbatim}
\end{small}
\newpage

\section{EM Algorithm for Negative Binomial Regression Model}
\begin{small}
\begin{verbatim}
glm.nb.EM <- function(model, data, tol, max.it, z0)
{
if(all(is.na(pmatch(names(data), "y"))))
stop("data must have a response column named y with some missing data")
data$original <- data$y
#Initial E-step with some unknown parameter set
data$y[is.na(data$original)] <- z0
...
\end{verbatim}
\end{small}
\newpage
\section{Supplemented EM Algorithm}
\begin{small}
\begin{verbatim}
em <- function(beta0, model, data)
{
#E step
fit <- exp(model.matrix(model, data) %*% beta0)
data$y[is.na(data$original)] <- c(fit)[is.na(data$original)]
#M step
...
\end{verbatim}
\end{small}

\end{document}