CatChenal/min_svd.md

## min_svd.md

      
    Output (if show == True):


Motivation:

Scikit-learn's decomposition module includes many algorithms that can be used for dimensionality reduction. PCA and TruncatedSVD are two of them, and they treat their n_components parameter differently. PCA interprets n_components according to its type:

n_components  int, float or ‘mle’, default=None

Notably:

If 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.

So if you set n_components to .95 and svd_solver to 'full', you get just enough components for their cumulative explained variance to exceed 95%. Great.

(See my wrapper function for PCA, get_min_pca.)
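The float form of n_components described above can be sketched as follows. This is a minimal illustration, not the get_min_pca wrapper: the synthetic array `X_demo` and its per-feature scales are assumptions chosen so that a few features carry most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): 200 samples, 10 features,
# with most of the variance concentrated in the first 3 features.
rng = np.random.default_rng(0)
scales = np.array([10.0, 8.0, 6.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
X_demo = rng.normal(size=(200, 10)) * scales

# A float in (0, 1) with svd_solver='full' asks PCA to pick the number of
# components itself, based on the cumulative explained variance.
pca = PCA(n_components=0.95, svd_solver='full')
X_reduced = pca.fit_transform(X_demo)

n_kept = pca.n_components_                       # number of components PCA selected
total_var = pca.explained_variance_ratio_.sum()  # variance explained by those components
print(n_kept, X_reduced.shape, total_var)
```

After fitting, `n_components_` holds the number of components that PCA chose, so nothing has to be searched by hand.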

On the other hand, the n_components parameter of TruncatedSVD only sets the number of output dimensions:

n_components  int, default=2

Desired dimensionality of output data. Must be strictly less than the number of features. The default value is useful for visualisation. For LSA, a value of 100 is recommended.

We can set this parameter to any integer from 1 up to the number of features minus 1, without knowing whether that number is optimal.
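To see why a fixed n_components gives no guarantee about explained variance, here is a small sketch that fits TruncatedSVD with a few arbitrary values and reports the variance each one captures. The synthetic array `X_demo` and the values 2, 5 and 9 are assumptions made for the illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Synthetic data (an assumption for illustration): 200 samples, 10 features
# whose standard deviations decrease from 5.0 to 0.5.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10)) * np.linspace(5.0, 0.5, 10)

totals = {}
shapes = {}
for k in (2, 5, 9):
    tsvd = TruncatedSVD(n_components=k, random_state=0)
    X_k = tsvd.fit_transform(X_demo)
    shapes[k] = X_k.shape
    # TruncatedSVD always returns exactly k columns; the explained variance
    # is whatever those k components happen to capture.
    totals[k] = tsvd.explained_variance_ratio_.sum()
    print(k, X_k.shape, round(totals[k], 3))
```

The output dimension follows n_components exactly, while the explained variance has to be inspected after the fit.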

The function get_min_svd defined below calls sklearn.decomposition.TruncatedSVD twice. The first call uses the maximum number of components (the number of features minus 1) to obtain the cumulative explained variance. The second call uses the smallest number of components whose cumulative explained variance exceeds the given threshold. The function returns that number of components, the threshold, and the transformed data.

Function definition:

def get_min_svd(data, min_var_explained=0.95,
                show=True,
                points_style='c--',
                line_color='m'):
    """
    Decompose `data` using truncated singular value decomposition.
    Return the smallest number of components whose cumulative explained
    variance exceeds `min_var_explained`, the threshold, and the transformed
    data. The cumulative variance plot is drawn if `show` is True (default).
    """
    from sklearn.decomposition import TruncatedSVD

    # Use the max number of components: number of features - 1
    d = data.shape[1]
    tsvd = TruncatedSVD(n_components=d - 1)
    tsvd.fit(data)

    # Cumulative explained variance
    cum_var = tsvd.explained_variance_ratio_.cumsum()

    # Smallest number of components whose cumulative variance exceeds the
    # threshold; list.index raises ValueError if the threshold is never reached
    comp_min = list(cum_var > min_var_explained).index(True) + 1

    if show:
        # Import here so that matplotlib is only required when plotting
        import matplotlib.pyplot as plt

        fig = plt.figure()
        ax = fig.add_subplot()

        x_vals = list(range(1, d))
        ax.plot(x_vals, cum_var, points_style, markersize=8, linewidth=1)
        # Guide lines marking the threshold and the selected component count
        ax.hlines(min_var_explained, 0, comp_min, colors=line_color)
        ax.vlines(comp_min, cum_var.min(), min_var_explained, colors=line_color)
        ax.plot(comp_min, cum_var.min(), 'k+', markersize=12,
                label=f'Components for {min_var_explained:.0%} of\nvariance explained: {comp_min}')
        ax.set_xlim(left=1 - .2, right=d - .2)
        ax.set(xlabel='svd components', ylabel='cumulative explained variance')
        ax.legend(markerscale=0, handlelength=0)

    # Second fit, this time with only the selected number of components
    return comp_min, min_var_explained, TruncatedSVD(n_components=comp_min).fit_transform(data)

Call example:

# X = sklearn.datasets.load_digits().data / 255
n_comps, thresh, reduced = get_min_svd(X)
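
The component-selection step inside get_min_svd can also be written with NumPy alone. The sketch below is an alternative, not part of the original function: `min_components` is a hypothetical helper name, and the variance ratios are made-up numbers. It raises a more descriptive error when the threshold is never reached, which can happen because the first fit keeps only the number of features minus 1 components.

```python
import numpy as np

def min_components(cum_var, min_var_explained=0.95):
    """Hypothetical helper: smallest number of components whose cumulative
    explained variance exceeds `min_var_explained`."""
    above = np.asarray(cum_var) > min_var_explained
    if not above.any():
        raise ValueError(
            f"cumulative explained variance never exceeds {min_var_explained}"
        )
    # argmax returns the index of the first True; counts are 1-based
    return int(np.argmax(above)) + 1

# Made-up explained variance ratios (an assumption for illustration)
ratios = np.array([0.50, 0.30, 0.12, 0.05, 0.02])
cum = ratios.cumsum()

print(min_components(cum, 0.95))  # 4
print(min_components(cum, 0.90))  # 3
```

With these ratios the cumulative sums are 0.50, 0.80, 0.92, 0.97 and 0.99, so a 95% threshold needs four components and a 90% threshold needs three.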