### 2014-9-12
#### **lazy update**
+ This is a solution to the issue mentioned in **2014-8-28**.
+ It is described in this paper: *Efficient Online and Batch Learning Using Forward Backward Splitting, Section 6. Efficient Implementation in High Dimensions*.
+ This solution is quite simple. Here I describe it in a few lines.
First, let's see what problem we need to solve. Typically, the cost function to be minimized can be summarized as $loss + regularization$. If we consider the cost of only one example $\vec x$, as in the SGD method, the cost function is $J = l(\vec x) + \lambda r(\vec w)$. In SGD we compute the gradient w.r.t. every component of $\vec w$ (denoted by $w_j$) and perform the update $w_j \leftarrow w_j - \eta\frac{\partial J}{\partial w_j}$, where $\frac{\partial J}{\partial w_j} = \frac{\partial l(\vec x)}{\partial w_j} + \lambda\frac{\partial r(\vec w)}{\partial w_j}$.
For loss functions such as squared loss or logistic loss, the first term $\frac{\partial l(\vec x)}{\partial w_j}$ is typically of the form $\text{some function of}(\vec w^T\vec x)\times x_j$ (where $x_j$ is the $j$th component of $\vec x$). Let's denote the second term by $tmp$.
If $x_j$ is zero, the first term is zero, and if there were no second term (i.e. $tmp=0$), the whole gradient would be zero and we would not need to update $w_j$ at all, which saves computation. Unfortunately, there is always a second term, and the situation gets even worse in scenarios like text classification, where there are millions of features: the weight vector has millions of dimensions, and we have to update every component of $\vec w$ even though most components of an example's feature vector $\vec x$ are zero. A naive implementation of this dense update is sketched below.
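To make the cost concrete, here is a minimal sketch (my own illustration, not code from the paper) of one naive SGD step with an $L_2$-style penalty on a sparse example. The names `naive_sgd_step` and `grad_loss_scalar` are hypothetical; `grad_loss_scalar` stands for the "some function of $\vec w^T\vec x$" factor discussed above.

```python
import numpy as np

# A minimal sketch (not from the paper) of the naive SGD step for
# J = l(x) + lambda * r(w), here with r(w) = 0.5 * ||w||^2 so dr/dw_j = w_j.
# The example is given in sparse form as parallel lists of indices/values.
def naive_sgd_step(w, x_indices, x_values, y, eta, lam, grad_loss_scalar):
    # Sparse dot product w^T x: touches only the non-zero features.
    wTx = sum(w[j] * v for j, v in zip(x_indices, x_values))
    g = grad_loss_scalar(wTx, y)  # the "some function of w^T x" factor

    # Regularization part of the gradient: touches EVERY component of w,
    # even though the example has only a handful of non-zero features.
    # This is the O(dimension) cost per example that lazy updates avoid.
    w -= eta * lam * w

    # Loss part of the gradient: non-zero only where x_j != 0.
    for j, v in zip(x_indices, x_values):
        w[j] -= eta * g * v
    return w

# Illustrative usage with a logistic-loss gradient factor:
# grad = lambda wTx, y: 1.0 / (1.0 + np.exp(-wTx)) - y
# w = naive_sgd_step(np.zeros(10**6), [3, 17], [1.0, 2.5], 1, 0.1, 1e-4, grad)
```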
+ So this is our problem, and the solution is simple:
The solution can be summarized as **"lazy update"**.
Let's illustrate it with an example. Suppose training example 1 arrives, and it has only a few hundred non-zero features. We update only the components of $\vec w$ corresponding to the non-zero features. According to the update rule, the components of $\vec w$ corresponding to the zero features should also be updated, i.e. have $\eta\lambda\frac{\partial r(\vec w)}{\partial w_j}$ subtracted from them, but we do not update them. Suppose training example 2 then arrives, also with only a few hundred non-zero features. It needs only a few components of $\vec w$ to compute $\frac{\partial l(\vec x)}{\partial w_j}$, because $\frac{\partial l(\vec x)}{\partial w_j}$ is some function of $\vec w^T\vec x$. Before this computation starts, we update those few components of $\vec w$ needed for the computation, still leaving the others unchanged (here "update" means subtracting $\eta\lambda\frac{\partial r(\vec w)}{\partial w_j}$ from them). The philosophy of this solution is, in a word, "update only when needed for computation".
Now, after training example 2 has been processed, each component of $\vec w$ that was never updated during these **2** iterations should have $\boldsymbol 2\times\eta\lambda\frac{\partial r(\vec w)}{\partial w_j}$ subtracted from it.
After training example 3 has been processed, each component of $\vec w$ that was never updated during these **3** iterations should have $\boldsymbol 3\times\eta\lambda\frac{\partial r(\vec w)}{\partial w_j}$ subtracted from it.
...
After training example N has been processed, each component of $\vec w$ that was never updated during these **N** iterations should have $\boldsymbol N\times\eta\lambda\frac{\partial r(\vec w)}{\partial w_j}$ subtracted from it.
When training example N+1 arrives and it needs $w_j$ to compute $\frac{\partial l(\vec x)}{\partial w_j}$ (because $\frac{\partial l(\vec x)}{\partial w_j}$ is some function of $\vec w^T\vec x$), then, before that computation starts, we update the components of $\vec w$ needed for it: each such $w_j$ has $k\times\eta\lambda\frac{\partial r(\vec w)}{\partial w_j}$ subtracted from it, where $k$ is the number of iterations that have passed since $w_j$ was last updated. A minimal sketch of this scheme follows below.
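Putting the walkthrough together, here is a minimal sketch of the lazy scheme (again my own illustration, with hypothetical names such as `last_touched`, `reg_grad`, and `grad_loss_scalar`). The catch-up subtracts $k\times\eta\lambda\frac{\partial r(\vec w)}{\partial w_j}$ in one shot, exactly as described above; treating $\frac{\partial r(\vec w)}{\partial w_j}$ as constant over the skipped iterations is exact for some regularizers and an approximation for others, and the paper gives exact closed-form catch-ups.

```python
# A minimal sketch (my own illustration, not code from the paper) of the
# "update only when needed" rule. last_touched[j] records the iteration at
# which w[j] last received its regularization update; reg_grad(w_j) stands
# for dr/dw_j and grad_loss_scalar for the "some function of w^T x" factor
# of the loss gradient. All names are hypothetical.
def lazy_sgd_step(w, last_touched, t, x_indices, x_values, y,
                  eta, lam, grad_loss_scalar, reg_grad):
    # 1) Catch up only the coordinates this example actually needs:
    #    apply the k skipped regularization updates in one shot.
    for j in x_indices:
        k = t - last_touched[j]  # iterations since w[j] was last updated
        if k > 0:
            w[j] -= k * eta * lam * reg_grad(w[j])
        last_touched[j] = t

    # 2) Ordinary loss-gradient update on the non-zero features only.
    wTx = sum(w[j] * v for j, v in zip(x_indices, x_values))
    g = grad_loss_scalar(wTx, y)
    for j, v in zip(x_indices, x_values):
        w[j] -= eta * g * v
    return w

# After the last training example, one final pass over all coordinates
# applies each one's outstanding catch-up, as in the "after example N"
# step described above.
```

With this scheme, the work per example is proportional to the number of non-zero features rather than to the dimension of $\vec w$.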
+ So this is the solution, which is not hard to understand or to prove. Note that here I assume $r(\vec w)$ is differentiable; if an $L_1$ penalty is used, $r(\vec w)$ is non-differentiable and the proof is different, but the idea is the same: we lazily update each $w_j$ when it is needed for a computation, and the update is "cumulative", i.e. *the number of updates originally needed* $\times$ *the amount of a single update*.
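For the $L_1$ case mentioned above, my reading of the paper is that the cumulative catch-up becomes a single soft-thresholding step (this is a summary under that reading, not a quoted formula):

$$ w_j \leftarrow \operatorname{sign}(w_j)\,\max\left(0,\; |w_j| - k\,\eta\lambda\right) $$

i.e. the $k$ skipped shrinkage steps still collapse into one operation, because repeatedly shrinking $w_j$ toward zero and clamping at zero is the same as shrinking it once by the accumulated amount.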
> Written with [StackEdit](https://stackedit.io/).