### 2014-9-12
#### **lazy update**
+ This is a solution to the issue mentioned in **2014-8-28**.
+ It is described in this paper: *Efficient Online and Batch Learning Using Forward Backward Splitting, Section 6. Efficient Implementation in High Dimensions*.
+ This solution is quite simple. Here I describe it in a few lines.
First, let's see what problem we need to solve. Typically, the cost function to be minimized can be summarized as $loss + regularization$. If we consider the cost of only one example $\vec x$, as in the SGD method, the cost function is $J = l(\vec x) + \lambda r(\vec w)$. In SGD we compute the gradient w.r.t. every component of $\vec w$ (denoted by $w_j$) and perform the update $w_j \leftarrow w_j - \eta\frac{\partial J}{\partial w_j}$, where $\frac{\partial J}{\partial w_j} = \frac{\partial l(\vec x)}{\partial w_j} + \lambda\frac{\partial r(\vec w)}{\partial w_j}$.
For loss functions such as squared loss or logistic loss, the first term $\frac{\partial l(\vec x)}{\partial w_j}$ is typically of the form $\text{some function of}(\vec w^T\vec x)\times x_j$ (where $x_j$ is the $j$th component of $\vec x$). Let's denote the second term by $tmp$.
If $x_j$ is zero, the first term is zero, and if there were no second term (i.e. $tmp=0$), the whole gradient would be zero and we would not need to update $w_j$ at all, which saves computation. Unfortunately, there is always a second term, and the situation gets even worse in scenarios like text classification, where there are millions of features: the weight vector has millions of dimensions, and we have to update every component of $\vec w$ even though most components of an example's feature vector $\vec x$ are zero. A naive implementation of this dense update is sketched below.
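To make the cost concrete, here is a minimal sketch (my own illustration, not code from the paper) of one naive SGD step with an $L_2$-style penalty on a sparse example. The names `naive_sgd_step` and `grad_loss_scalar` are hypothetical; `grad_loss_scalar` stands for the "some function of $\vec w^T\vec x$" factor discussed above.

```python
import numpy as np

# A minimal sketch (not from the paper) of the naive SGD step for
# J = l(x) + lambda * r(w), here with r(w) = 0.5 * ||w||^2 so dr/dw_j = w_j.
# The example is given in sparse form as parallel lists of indices/values.
def naive_sgd_step(w, x_indices, x_values, y, eta, lam, grad_loss_scalar):
    # Sparse dot product w^T x: touches only the non-zero features.
    wTx = sum(w[j] * v for j, v in zip(x_indices, x_values))
    g = grad_loss_scalar(wTx, y)  # the "some function of w^T x" factor

    # Regularization part of the gradient: touches EVERY component of w,
    # even though the example has only a handful of non-zero features.
    # This is the O(dimension) cost per example that lazy updates avoid.
    w -= eta * lam * w

    # Loss part of the gradient: non-zero only where x_j != 0.
    for j, v in zip(x_indices, x_values):
        w[j] -= eta * g * v
    return w

# Illustrative usage with a logistic-loss gradient factor:
# grad = lambda wTx, y: 1.0 / (1.0 + np.exp(-wTx)) - y
# w = naive_sgd_step(np.zeros(10**6), [3, 17], [1.0, 2.5], 1, 0.1, 1e-4, grad)
```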
+ So this is our problem, and the solution is simple:
The solution can be summarized as **"lazy update"**.
Let's illustrate it with an example. Suppose training example 1 arrives, and it has only a few hundred non-zero features. We update only the components of $\vec w$ corresponding to the non-zero features. According to the update rule, the components of $\vec w$ corresponding to the zero features should also be updated, i.e. have $\eta\lambda\frac{\partial r(\vec w)}{\partial w_j}$ subtracted from them, but we do not update them. Suppose training example 2 then arrives, also with only a few hundred non-zero features. It needs only a few components of $\vec w$ to compute $\frac{\partial l(\vec x)}{\partial w_j}$, because $\frac{\partial l(\vec x)}{\partial w_j}$ is some function of $\vec w^T\vec x$. Before this computation starts, we update those few components of $\vec w$ needed for the computation, still leaving the others unchanged (here "update" means subtracting $\eta\lambda\frac{\partial r(\vec w)}{\partial w_j}$ from them). The philosophy of this solution is, in a word, "update only when needed for computation".
Now, after training example 2 has been processed, each component of $\vec w$ that was never updated during these **2** iterations should have $\boldsymbol 2\times\eta\lambda\frac{\partial r(\vec w)}{\partial w_j}$ subtracted from it.
After training example 3 has been processed, each component of $\vec w$ that was never updated during these **3** iterations should have $\boldsymbol 3\times\eta\lambda\frac{\partial r(\vec w)}{\partial w_j}$ subtracted from it.
...
After training example N has been processed, each component of $\vec w$ that was never updated during these **N** iterations should have $\boldsymbol N\times\eta\lambda\frac{\partial r(\vec w)}{\partial w_j}$ subtracted from it.
When training example N+1 arrives and it needs $w_j$ to compute $\frac{\partial l(\vec x)}{\partial w_j}$ (because $\frac{\partial l(\vec x)}{\partial w_j}$ is some function of $\vec w^T\vec x$), then, before that computation starts, we update the components of $\vec w$ needed for it: each such $w_j$ has $k\times\eta\lambda\frac{\partial r(\vec w)}{\partial w_j}$ subtracted from it, where $k$ is the number of iterations that have passed since $w_j$ was last updated. A minimal sketch of this scheme follows below.
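Putting the walkthrough together, here is a minimal sketch of the lazy scheme (again my own illustration, with hypothetical names such as `last_touched`, `reg_grad`, and `grad_loss_scalar`). The catch-up subtracts $k\times\eta\lambda\frac{\partial r(\vec w)}{\partial w_j}$ in one shot, exactly as described above; treating $\frac{\partial r(\vec w)}{\partial w_j}$ as constant over the skipped iterations is exact for some regularizers and an approximation for others, and the paper gives exact closed-form catch-ups.

```python
# A minimal sketch (my own illustration, not code from the paper) of the
# "update only when needed" rule. last_touched[j] records the iteration at
# which w[j] last received its regularization update; reg_grad(w_j) stands
# for dr/dw_j and grad_loss_scalar for the "some function of w^T x" factor
# of the loss gradient. All names are hypothetical.
def lazy_sgd_step(w, last_touched, t, x_indices, x_values, y,
                  eta, lam, grad_loss_scalar, reg_grad):
    # 1) Catch up only the coordinates this example actually needs:
    #    apply the k skipped regularization updates in one shot.
    for j in x_indices:
        k = t - last_touched[j]  # iterations since w[j] was last updated
        if k > 0:
            w[j] -= k * eta * lam * reg_grad(w[j])
        last_touched[j] = t

    # 2) Ordinary loss-gradient update on the non-zero features only.
    wTx = sum(w[j] * v for j, v in zip(x_indices, x_values))
    g = grad_loss_scalar(wTx, y)
    for j, v in zip(x_indices, x_values):
        w[j] -= eta * g * v
    return w

# After the last training example, one final pass over all coordinates
# applies each one's outstanding catch-up, as in the "after example N"
# step described above.
```

With this scheme, the work per example is proportional to the number of non-zero features rather than to the dimension of $\vec w$.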
+ So this is the solution, which is not hard to understand or to prove. Note that here I assume $r(\vec w)$ is differentiable; if an $L_1$ penalty is used, $r(\vec w)$ is non-differentiable and the proof is different, but the idea is the same: we lazily update each $w_j$ when it is needed for a computation, and the update is "cumulative", i.e. *the number of updates originally needed* $\times$ *the amount of a single update*.
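For the $L_1$ case mentioned above, my reading of the paper is that the cumulative catch-up becomes a single soft-thresholding step (this is a summary under that reading, not a quoted formula):

$$ w_j \leftarrow \operatorname{sign}(w_j)\,\max\left(0,\; |w_j| - k\,\eta\lambda\right) $$

i.e. the $k$ skipped shrinkage steps still collapse into one operation, because repeatedly shrinking $w_j$ toward zero and clamping at zero is the same as shrinking it once by the accumulated amount.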
> Written with [StackEdit](https://stackedit.io/).