### 2014-9-12
#### **lazy update**
+ This is a solution to the issue mentioned in **2014-8-28**.
+ It is described in this paper: *Efficient Online and Batch Learning Using Forward Backward Splitting*, Section 6, *Efficient Implementation in High Dimensions*.
+ This solution is quite simple. Here I describe it in a few lines.
First, let's see what problem we are solving. Typically, the cost function to be minimized can be summarized as $loss + regularization$. If I consider the cost function of only one example $\vec x$, which is the case in the SGD method, the cost function is $J = l(\vec x) + \lambda r(\vec w)$. In the SGD method, we need to compute the gradient w.r.t. every component of $\vec w$ (denoted by $w_j$) and perform the update $w_j \leftarrow w_j - \eta\frac{\partial J}{\partial w_j}$, where $\frac{\partial J}{\partial w_j} = \frac{\partial l(\vec x)}{\partial w_j} + \lambda \frac{\partial r(\vec w)}{\partial w_j}$.
The first term $\frac{\partial l(\vec x)}{\partial w_j}$ is typically of the form $\text{some\_function\_of}(\vec w^T\vec x) \times x_j$ (the $j$th component of $\vec x$) for loss functions like squared loss or logistic loss. Let's denote the second term by $tmp$.
If $x_j$ is zero, the first term is zero, and if there were no second term (i.e. $tmp=0$), the gradient would be zero and we would not need to update $w_j$ at all, which saves computation. Unfortunately, there is always a second term, and the situation gets even worse in scenarios like text classification where there are millions of features: the weight vector has millions of dimensions, and we have to update every component of $\vec w$ even though most components of an example's feature vector $\vec x$ are zero.
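To see the cost concretely, here is a minimal sketch of the naive update in Python/NumPy (the logistic loss and the $L_2$ penalty $r(\vec w)=\frac{1}{2}\|\vec w\|^2$ are my illustrative choices, not fixed by anything above): even when $\vec x$ has only a few non-zeros, the $\lambda\frac{\partial r(\vec w)}{\partial w_j}$ term forces us to touch all $d$ components of $\vec w$.

```python
import numpy as np

def naive_sgd_step(w, x, y, eta, lam):
    """One naive SGD step for logistic loss with an L2 penalty
    r(w) = 0.5 * ||w||^2 (both chosen only for illustration).
    """
    margin = y * np.dot(w, x)          # y * w^T x
    g = -y / (1.0 + np.exp(margin))    # some_function_of(w^T x) for logistic loss
    w -= eta * (g * x + lam * w)       # O(d) work per example, however sparse x is
    return w
```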
+ So this is our problem, and the solution is simple:
This solution can be summarized as **"lazy update"**.
Let's illustrate it with an example. Suppose training example 1 arrives, with only a few hundred non-zero features. Then we update only the components of $\vec w$ corresponding to the non-zero features. According to the update rule, the components of $\vec w$ corresponding to the zero features should also be updated, i.e. have $\eta\lambda\frac{\partial r(\vec w)}{\partial w_j}$ subtracted from them, but we don't update them. Suppose training example 2 then arrives, which also has only a few hundred non-zero features. Computing $\frac{\partial l(\vec x)}{\partial w_j}$ needs only a few components of $\vec w$, because $\frac{\partial l(\vec x)}{\partial w_j}$ is some function of $\vec w^T\vec x$. So, before this computation starts, we update those few components of $\vec w$ needed for the computation, still leaving the others unchanged (here "update" means subtracting $\eta\lambda\frac{\partial r(\vec w)}{\partial w_j}$ from them). The philosophy of this solution is, in a word, "update when needed for computation".
Now, after the processing of training example 2 is done, the components of $\vec w$ that were never updated in these **2** iterations should have $\boldsymbol 2\times\eta\lambda\frac{\partial r(\vec w)}{\partial w_j}$ subtracted from them.
After the processing of training example 3 is done, the components of $\vec w$ that were never updated in all these **3** iterations should have $\boldsymbol 3\times\eta\lambda\frac{\partial r(\vec w)}{\partial w_j}$ subtracted from them.
.....
After the processing of training example N is done, the components of $\vec w$ that were never updated in all these **N** iterations should have $\boldsymbol N\times\eta\lambda\frac{\partial r(\vec w)}{\partial w_j}$ subtracted from them.
When training example N+1 arrives, if it needs $w_j$ to compute $\frac{\partial l(\vec x)}{\partial w_j}$ (because $\frac{\partial l(\vec x)}{\partial w_j}$ is some function of $\vec w^T\vec x$), then, before this computation starts, we update those components of $\vec w$ needed for the computation: each is updated by subtracting $k\times\eta\lambda\frac{\partial r(\vec w)}{\partial w_j}$ from it, where $k$ is the number of iterations that have passed since the last update of $w_j$.
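Here is a minimal sketch of this scheme in Python/NumPy (logistic loss and the $L_2$ penalty are again my illustrative choices; `last` is a hypothetical bookkeeping array I introduce to record when each component was last brought up to date):

```python
import numpy as np

def lazy_sgd_step(w, last, t, x_idx, x_val, y, eta, lam):
    """One lazily-updated SGD step for logistic loss + L2 penalty.

    w      : dense weight vector
    last   : last[j] = iteration at which w[j] was last brought up to date
    t      : current iteration counter
    x_idx, x_val : indices and values of this example's non-zero features
    """
    # Catch up: apply the k pending regularization updates to the components
    # we are about to read. Following the rule above, this subtracts k times
    # a single-step penalty, which treats dr/dw_j as constant over the
    # skipped iterations (for L2 this is an approximation).
    for j in x_idx:
        k = t - last[j]
        if k > 0:
            w[j] -= k * eta * lam * w[j]
            last[j] = t
    # Loss-gradient update, touching only the non-zero features.
    margin = y * sum(w[j] * v for j, v in zip(x_idx, x_val))
    g = -y / (1.0 + np.exp(margin))
    for j, v in zip(x_idx, x_val):
        w[j] -= eta * g * v
    return w
```

After the last example, one final catch-up pass over all components applies any still-pending penalties. (For $L_2$, an exact variant replaces the $k\times$ subtraction with multiplicative shrinkage $w_j \leftarrow (1-\eta\lambda)^k w_j$, since each eager step scales $w_j$ by $1-\eta\lambda$.)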
+ So this is the solution, which is not hard to understand and prove. Note that here I assume $r(\vec w)$ is differentiable; if the $L_1$ penalty is used, $r(\vec w)$ is non-differentiable and the proof is different. But the idea is the same: we lazily update each $w_j$ when it is needed for computation, and the update is "cumulative", i.e. *the number of updates originally needed* $\times$ *the amount of a single update*.
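For the $L_1$ case, the cumulative catch-up is commonly written with a clip at zero, so the penalty can shrink a component to zero but never flip its sign (a sketch; the helper name is mine):

```python
import numpy as np

def lazy_l1_catchup(w, last, t, j, eta, lam):
    """Bring w[j] up to date under an L1 penalty r(w) = ||w||_1.

    k pending soft-threshold steps collapse into a single one with
    threshold k * eta * lam; the max(0, ...) clip keeps a component
    that reaches zero pinned there, as the step-by-step update would.
    """
    k = t - last[j]
    if k > 0:
        w[j] = np.sign(w[j]) * max(0.0, abs(w[j]) - k * eta * lam)
        last[j] = t
```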
> Written with [StackEdit](https://stackedit.io/).