@sazio
Last active July 31, 2020 14:26
In this first part, we'd like to tell you about some practical tricks for making **gradient descent** work well; in particular, we're going to delve into feature scaling. As an introduction, it seems reasonable to sketch an intuition for the concept of *scale*.
## **Macro, Meso, Micro-scale in Science**
As scientists, we are well aware of the effects of using a specific measurement tool in order to characterize some quantity and describe reality. As an ideal example we consider the **length scale**.
<img src="https://raw.githubusercontent.com/MLJCUnito/ProjectX2020/master/HowToTackleAMLCompetition/img/Lecture1/1.0.png" width="500" height="300">
We can identify three different points of view: *microscopic*, *mesoscopic* and *macroscopic*, which are intimately related to the adopted length scale.
We usually deal with the *macroscopic scale* when the observer is positioned far enough from the object to describe its global characteristics. Conversely, we refer to the *microscopic scale* when the observer is so close to the object that she/he can describe its atomistic details or elementary parts (e.g. molecules, atoms, quarks). Last but not least, we talk about the *mesoscopic scale* every time we are in between micro and macro.
These definitions are deliberately vague, since delineating a precise and neat explanation would be highly difficult and complex, and it is beyond our purposes anyway.
On the other hand, this kind of introduction is quite useful: we should take a few minutes to think about the "active" role of the observer, and about the fact that, to be honest, each length scale comes with its own specific theory, i.e. there is no global theory for a multi-scale description of a phenomenon.
## **Scaling in Data Science**
Our beloved observer (i.e. the scientist) has a kind of "privilege": choosing the right measurement tool, which is nothing but choosing the right scale for describing a phenomenon. We can't really say the same for a data scientist.
It's a sort of paradox, but most of the time a data scientist has no control over data retrieval. Because of that, a data scientist is often left alone in front of the data, without even knowing which measurement tool they come from. There is no way to interact with the length scale, for example.
Is there something we can do about it? What we can do is assume that the features are independent and scale them so that they are comparable with one another. This procedure is called **feature scaling**, and soon we'll understand why it is useful for ML algorithms such as gradient descent.
<img src="https://raw.githubusercontent.com/MLJCUnito/ProjectX2020/master/HowToTackleAMLCompetition/img/Lecture1/1.1.png" width="500" height="300">
If you make sure that the features are on similar scales, i.e. they take on similar ranges of values, then gradient descent can converge more quickly.
More concretely, let's say we have a problem with two features, where $x_1$ is the length of a football field, taking values between $90$ and $115$ meters, and $x_2$ is the radius of a ball, taking values between $10.5 \times 10^{-2}$ and $11.5 \times 10^{-2}$ meters. If you plot the contours of the cost function $J(\omega)$, you might get something similar to the *left plot*: because of this very skewed elliptical shape, running gradient descent on this cost function may take a long time, oscillating back and forth before reaching the global minimum.
In this setting, as stated previously, a useful thing to do is to scale the features. Generally, the idea is to get every feature into approximately a $-1$ to $+1$ range. By doing this, we get the *right plot*: gradient descent can follow a much more direct path to the global minimum, rather than a convoluted, oscillating trajectory.
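To make this concrete, here is a minimal sketch of the effect. The data, target, and learning rate are illustrative assumptions loosely based on the football-field example above, not code from this lecture:

```python
import numpy as np

# Hypothetical toy regression: x1 ~ field length (meters), x2 ~ ball radius (meters)
rng = np.random.default_rng(0)
x1 = rng.uniform(90.0, 115.0, 100)
x2 = rng.uniform(0.105, 0.115, 100)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 500.0 * x2 + rng.normal(0.0, 1.0, 100)

def gradient_descent(X, y, lr, n_iter=1000):
    """Plain batch gradient descent on the MSE of a linear model."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# On the raw features the curvature along x1 is roughly a million times
# larger than along x2, so any stable learning rate crawls along x2.
# After standardization, a single learning rate works for both axes:
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
w = gradient_descent(X_scaled, y - y.mean(), lr=0.1)
```

With the scaled features, a thousand iterations at a single moderate learning rate are enough to reach the least-squares solution.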
## **Preprocessing Data**
> In any Machine Learning process, Data Preprocessing is that step in which the data gets transformed, or *encoded*, to bring it to such a state that now the machine can easily parse it. In other words, the features of the data can now be easily interpreted by the algorithm.
We're going to dive into Scikit-Learn for this section and exploit its powerful *preprocessing* package.
We've been talking about scaling our data; now it's time to get our hands on some code. As previously stated, learning algorithms usually benefit from standardization of the data set. If some outliers are present in the set, robust scalers or transformers may be more appropriate. (Take a look at [Compare the effect of different scalers on data with outliers](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py), where you'll see how the different scalers, transformers and normalizers behave in the presence of outliers.)
### Standardization
Many Machine Learning estimators require *standardization* of the dataset: otherwise they might behave badly, because the data are far from a Gaussian distribution (with zero mean and unit variance).
Most of the time we ignore the shape of the distribution and just transform the data by subtracting the mean value of each feature, then scale by dividing each feature by its standard deviation.
---
*Do you have in mind some models that assume that all features are centered around zero and have variance in the same order of magnitude? Can you think about possible issues related to the objective function in these cases?*
<ins>Answer</ins>: many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
---
There's a fast way to do that on a single array, by means of the *scale* function:
https://gist.github.com/23a8b12c46138d08c4d2f70109d1cddf
[[2 3 0]
[2 1 4]
[3 4 1]]
[[-0.70710678 0.26726124 -0.98058068]
[-0.70710678 -1.33630621 1.37281295]
[ 1.41421356 1.06904497 -0.39223227]]
[-2.96059473e-16 1.48029737e-16 -1.11022302e-16]
[1. 1. 1.]
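The linked gist isn't embedded here, but the printed matrices let us reconstruct a sketch of it. The toy matrix is inferred from the output above, so treat the exact code as an assumption:

```python
import numpy as np
from sklearn import preprocessing

# Toy matrix reconstructed from the printed output above
X = np.array([[2., 3., 0.],
              [2., 1., 4.],
              [3., 4., 1.]])

X_scaled = preprocessing.scale(X)
print(X_scaled)

# Scaled data has zero mean and unit variance along each column
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```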
The *preprocessing* module provides a utility class, [*StandardScaler*](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler), that computes the mean and standard deviation on a training set, so as to be able to later reapply the same transformation to the test set.
(You should be well aware of what [*sklearn.pipeline.Pipeline*](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) is; it's crucial when deploying end-to-end strategies.)
https://gist.github.com/803ee9774f3392b9f1d4c78e35e0bc8f
[2.33333333 2.66666667 1.66666667]
[0.47140452 1.24721913 1.69967317]
[[-0.70710678 0.26726124 -0.98058068]
[-0.70710678 -1.33630621 1.37281295]
[ 1.41421356 1.06904497 -0.39223227]]
Now we can use the scaler instance on new data, to transform them in the same way we did previously.
https://gist.github.com/f5f089a56932b1264f6c6e6384bb1c79
array([[-7.07106781, -1.33630621, -0.98058068]])
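The full *StandardScaler* workflow shown in the gists above can be sketched as follows; the train and test matrices are reconstructions inferred from the printed statistics, so treat them as assumptions:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same reconstructed toy matrix as before
X_train = np.array([[2., 3., 0.],
                    [2., 1., 4.],
                    [3., 4., 1.]])

scaler = StandardScaler().fit(X_train)
print(scaler.mean_)             # per-feature means learned on the training set
print(scaler.scale_)            # per-feature standard deviations
print(scaler.transform(X_train))

# New data is transformed with the *training* statistics
X_test = np.array([[-1., 1., 0.]])
print(scaler.transform(X_test))
```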
It is possible to disable centering or scaling by passing *with_mean = False* or *with_std = False*. The first one might be particularly useful if applied to sparse CSR or CSC matrices to avoid breaking the sparsity structure of the data.
**<ins>Scaling Features to a Range</ins>**
Another kind of standardization is scaling features to lie between a given minimum and maximum value, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved with [*MinMaxScaler*](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler) or [*MaxAbsScaler*](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler).
Here you can see how to scale a toy data matrix to the $[0,1]$ range:
https://gist.github.com/b73ef64608d1b0f7f27566732652b6a4
array([[1. , 1. , 1. ],
[0. , 0.66666667, 0. ],
[0.75 , 0. , 0. ]])
As above, the same instance of the transformer can be applied to some new test data: the same scaling and shifting will be applied, for consistency with the transformation learned on the training data.
https://gist.github.com/3b2b01e74b50a7bc081a40ea6aff13bc
array([[-1.25 , -0.66666667, 4. ]])
It's pretty useful to let the scaler reveal some details about the transformation learned on the training data:
https://gist.github.com/38aabd99d0b08253a00eace3e38374f9
[0.25 0.33333333 1. ]
[-0.5 -0.33333333 0. ]
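A sketch consistent with the *MinMaxScaler* outputs above; the toy matrices are reconstructed from those outputs, so treat them as assumptions:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy matrix reconstructed from the outputs above
X_train = np.array([[6., 4., 1.],
                    [2., 3., 0.],
                    [5., 1., 0.]])

min_max_scaler = MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
print(X_train_minmax)

# The learned shift and scale are reused on unseen data,
# which may therefore fall outside [0, 1]
X_test = np.array([[-3., -1., 4.]])
print(min_max_scaler.transform(X_test))

print(min_max_scaler.scale_)  # per-feature relative scaling
print(min_max_scaler.min_)    # per-feature shift
```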
---
Can you retrieve the explicit formula for *MinMaxScaler*?
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
---
*MaxAbsScaler* works in a similar fashion, but the data will lie in the range $[-1,1]$. It is meant for data that is already centered at zero, or for sparse data.
https://gist.github.com/1b0eebacdc2e8cf1708e2f98a20c6549
array([[1. , 1. , 1. ],
[0.33333333, 0.75 , 0. ],
[0.83333333, 0.25 , 0. ]])
https://gist.github.com/1114a00104c96ca802bebd4fcce45780
array([[-0.5 , -0.25, 4. ]])
https://gist.github.com/4d33a8e0935871e3a5998cc981f4d4b9
array([6., 4., 1.])
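The *MaxAbsScaler* outputs above are consistent with the following sketch (the toy matrix is the same reconstruction used for *MinMaxScaler*, so treat it as an assumption):

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Same reconstructed toy matrix as in the MinMaxScaler example
X_train = np.array([[6., 4., 1.],
                    [2., 3., 0.],
                    [5., 1., 0.]])

max_abs_scaler = MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)  # column / max(|column|)
print(X_train_maxabs)

X_test = np.array([[-3., -1., 4.]])
print(max_abs_scaler.transform(X_test))
print(max_abs_scaler.max_abs_)  # the learned per-feature maximum absolute values
```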
**<ins>Scaling Data with Outliers</ins>**
If our data contain many outliers, scaling using the mean and variance of the data is not likely to work well. In this case, we can use [*RobustScaler*](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler).
This scaler removes the [median](https://en.wikipedia.org/wiki/Median) and scales data according to the [IQR](https://en.wikipedia.org/wiki/Interquartile_range) (InterQuartile Range).
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set.
https://gist.github.com/c12fb900fe97693cfeb1a138341dfbbe
array([[ 0. , -2. , 0. ],
[-0.14285714, 0. , 0.4 ],
[ 1.85714286, 0. , -1.6 ]])
Median and interquartile range are then stored to be used on later data using the *transform* method.
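A sketch consistent with the *RobustScaler* output above; the toy matrix is reconstructed from that output (note the outlier it contains), so treat it as an assumption:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Reconstructed toy matrix: note the outlier (15) in the first column
X = np.array([[ 2., -2.,  2.],
              [ 1.,  1.,  3.],
              [15.,  1., -2.]])

robust_scaler = RobustScaler()
X_robust = robust_scaler.fit_transform(X)
print(X_robust)
print(robust_scaler.center_)  # per-feature medians
print(robust_scaler.scale_)   # per-feature interquartile ranges
```

Because the median and IQR of the first column barely move when 15 replaces a "reasonable" value, the outlier inflates its own scaled value without distorting the rest of the column.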
<a id='section1.3.3'></a>
### Non-linear transformations
It's possible to generalize to non-linear transformations. We are going to talk about two types of transformations: *quantile transforms* and *power transforms*. The main take-home message is that we need *monotonic* transformations to preserve the rank of the values along each feature.
Quantile transforms smooth out unusual distributions and are less influenced by outliers than the scaling methods above. They do, however, distort correlations and distances within and across features.
Power transforms, instead, are a family of parametric transformations that aim to map data from any distribution to as close to a Gaussian distribution as possible.
**<ins>Mapping to a Uniform Distribution</ins>**
[*QuantileTransformer*](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html#sklearn.preprocessing.QuantileTransformer) provides a non-parametric transformation to map the data to a uniform distribution with values between 0 and 1:
https://gist.github.com/2eda9617e044766f3a0a58be61f8c72b
/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/_data.py:2357: UserWarning: n_quantiles (1000) is greater than the total number of samples (112). n_quantiles is set to n_samples.
% (self.n_quantiles, n_samples))
array([4.3, 5.1, 5.8, 6.5, 7.9])
This feature corresponds to the sepal length in cm. Once the quantile transform is applied, those landmarks closely approach the previously defined percentiles:
https://gist.github.com/2415a9cc5bc0f8e33259da00a0e03395
array([0. , 0.23873874, 0.50900901, 0.74324324, 1. ])
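The warning above suggests this is the iris dataset with a 3:1 train/test split; a sketch of the whole example follows, where the *random_state* values are our own assumptions:

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Reconstruction of the iris example above; random_state values are assumptions
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
X_train_trans = quantile_transformer.fit_transform(X_train)

# Landmarks of the first feature (sepal length, in cm) before...
print(np.percentile(X_train[:, 0], [0, 25, 50, 75, 100]))
# ...and after the transform, where they approach [0, .25, .5, .75, 1]
print(np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100]))
```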
Some more applications [here](https://machinelearningmastery.com/quantile-transforms-for-machine-learning/)
**<ins>Mapping to a Gaussian Distribution</ins>**
Many machine learning algorithms prefer or perform better when numerical input variables and even output variables in the case of regression have a Gaussian distribution. Power transforms are a family of parametric, monotonic transforms that aim to map data from any distribution to as close to a Gaussian distribution as possible, in order to stabilize variance and minimize skewness.
[*PowerTransformer*](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html#sklearn.preprocessing.PowerTransformer) provides two transformations, the *Yeo-Johnson* transform:
$\begin{split}x_i^{(\lambda)} =
\begin{cases}
[(x_i + 1)^\lambda - 1] / \lambda & \text{if } \lambda \neq 0, x_i \geq 0, \\[8pt]
\ln{(x_i + 1)} & \text{if } \lambda = 0, x_i \geq 0 \\[8pt]
-[(-x_i + 1)^{2 - \lambda} - 1] / (2 - \lambda) & \text{if } \lambda \neq 2, x_i < 0, \\[8pt]
- \ln (- x_i + 1) & \text{if } \lambda = 2, x_i < 0
\end{cases}\end{split}$
and the *Box-Cox* transform:
$\begin{split}x_i^{(\lambda)} =
\begin{cases}
\dfrac{x_i^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0, \\[8pt]
\ln{(x_i)} & \text{if } \lambda = 0,
\end{cases}\end{split}$
Box-Cox can only be applied to strictly positive data. In both methods, the transformation is parametrized by $\lambda$, which is determined through maximum-likelihood estimation. Here is an example of using Box-Cox to map samples drawn from a lognormal distribution to a normal distribution:
https://gist.github.com/9c100ed51de4ea7763c033c6ec9eee70
array([[ 0.49024349, 0.17881995, -0.1563781 ],
[-0.05102892, 0.58863195, -0.57612415],
[ 0.69420009, -0.84857822, 0.10051454]])
(Some more applications [here](https://machinelearningmastery.com/power-transforms-with-scikit-learn/))
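A sketch of the Box-Cox example above; the seed is an assumption, so the printed numbers need not match the output shown:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Box-Cox requires strictly positive inputs, which lognormal samples
# guarantee; the seed here is an assumption
pt = PowerTransformer(method='box-cox', standardize=False)
X_lognormal = np.random.RandomState(616).lognormal(size=(3, 3))
X_gaussian = pt.fit_transform(X_lognormal)
print(X_gaussian)
print(pt.lambdas_)  # one lambda per feature, fitted by maximum likelihood
```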
Below are some examples of the two transforms applied to various probability distributions. *Any comments?*
<img src="https://raw.githubusercontent.com/MLJCUnito/ProjectX2020/master/HowToTackleAMLCompetition/img/Lecture1/1.3.png" width="400" height="800">
### Normalization
As scientists, we feel much more comfortable with Vector Space Models. *Normalization* is the process of scaling individual samples to have unit norm. This process might be useful if we plan to use a dot-product or some kernel to quantify similarities of pairs of samples.
The function [*normalize*](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html#sklearn.preprocessing.normalize) provides a quick and easy way to perform this operation on a single array, using [L1 or L2 norms](https://medium.com/@montjoile/l0-norm-l1-norm-l2-norm-l-infinity-norm-7a7d18a4f40c):
https://gist.github.com/905efc42afcec7e51672d6e460584f11
array([[ 0.40824829, -0.40824829, 0.81649658],
[ 1. , 0. , 0. ],
[ 0. , 0.70710678, -0.70710678]])
The *preprocessing* module provides a utility class, *Normalizer*, that implements the same operation using the *Transformer* API. This class is suitable for use in *sklearn.pipeline.Pipeline*:
https://gist.github.com/992458255cd868bca39c98f70d4ba62a
array([[ 0.40824829, -0.40824829, 0.81649658],
[ 1. , 0. , 0. ],
[ 0. , 0.70710678, -0.70710678]])
https://gist.github.com/36e8cff194ece6355ad3274f7a97807a
array([[-0.70710678, 0.70710678, 0. ]])
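Both forms of normalization shown above can be sketched as follows; the toy matrix is reconstructed from the outputs, so treat it as an assumption:

```python
import numpy as np
from sklearn import preprocessing

# Toy matrix reconstructed from the outputs above
X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

# Function form: each row is divided by its l2 norm
X_normalized = preprocessing.normalize(X, norm='l2')
print(X_normalized)

# Transformer form: fit() is stateless here, but makes the class
# usable inside a Pipeline
normalizer = preprocessing.Normalizer().fit(X)
print(normalizer.transform(np.array([[-1., 1., 0.]])))
```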
### Encoding Categorical Features
In many cases, features are not continuous values but categorical. For example, a person could have features such as ``["from Italy", "from France", "from Germany"]``, ``["play sports", "doesn't play sports"]``, or ``["uses Firefox", "uses Opera", "uses Chrome", "uses Safari", "uses Internet Explorer"]``.
Such features can be efficiently encoded as integers; for instance, ``["from France", "play sports", "uses Chrome"]`` could be encoded as ``[1, 0, 2]``.
To convert categorical features to such integer codes, we can use the [*OrdinalEncoder*](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder). It transforms each categorical feature into one new feature of integers (``0`` to ``n_categories - 1``):
https://gist.github.com/9c76da49358b611e5691e4e741f853ee
[[0. 0. 0.]]
[[1. 1. 1.]]
[[0. 1. 1.]]
Such integer codes should be handled with care: some scikit-learn estimators expect continuous input and would interpret the categories as being ordered, which is usually not desired.
There's another way to convert categorical features to features that can be used with scikit-learn estimators: *one-hot encoding*. It can be obtained with the [*OneHotEncoder*](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder), which transforms each categorical feature with ``n_categories`` possible values into ``n_categories`` binary features, with one of them 1 and all others 0.
Let's continue with the example above:
https://gist.github.com/04b7b1c2fa55c66cbb4a99fe3ab0b63e
[[1. 0. 1. 0. 1. 0.]]
[[0. 1. 0. 1. 0. 1.]]
[[1. 0. 0. 1. 0. 1.]]
The values each feature can take are inferred automatically from the dataset and can be found in the ``categories_`` attribute:
https://gist.github.com/db51725b8b0e0921b0728e5a978fdc10
[array(['from Germany', 'from Italy'], dtype=object),
array(["doesn't play sports", 'play sports'], dtype=object),
array(['uses Firefox', 'uses Safari'], dtype=object)]
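The encoder examples above are consistent with the following sketch; the two training samples are reconstructed from the ``categories_`` output, so treat them as assumptions:

```python
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Training data reconstructed from the categories_ output above
X = [['from Italy', 'play sports', 'uses Safari'],
     ['from Germany', "doesn't play sports", 'uses Firefox']]

ordinal = OrdinalEncoder().fit(X)
# Each category becomes its (alphabetical) index within the feature
print(ordinal.transform([['from Germany', "doesn't play sports", 'uses Firefox']]))

onehot = OneHotEncoder().fit(X)
# Each feature becomes n_categories binary columns
print(onehot.transform([['from Germany', 'play sports', 'uses Safari']]).toarray())
print(onehot.categories_)
```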