
@CatChenal
Last active December 17, 2021 20:12
Function that reduces data with SVD (Singular Value Decomposition) using the minimal number of components needed to exceed a given explained variance threshold.

Output (if show == True):

[Figure: min_svd, cumulative explained variance vs. number of svd components]

Motivation:

The scikit-learn decomposition module includes many algorithms that can be used for dimensionality reduction. PCA and TruncatedSVD are two of them, and they differ in how they use their n_components parameter. PCA interprets n_components differently depending on its type:

n_components int, float or ‘mle’, default=None

Notably:

If 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.

So if you set n_components to .95 and svd_solver to 'full', PCA keeps only the smallest number of components whose cumulative explained variance exceeds that threshold. Great.
(See my wrapper function for PCA, get_min_pca.)
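
As a minimal sketch of the PCA behavior quoted above, the snippet below fits PCA with a fractional n_components on the digits dataset. The dataset choice and the variable names (`n_kept`, `total_var`) are assumptions made for illustration; they are not taken from get_min_pca.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Illustrative data: 8x8 digit images flattened to 64 features
X = load_digits().data

# A float in (0, 1) with svd_solver='full' asks PCA to pick the number
# of components needed to exceed that fraction of explained variance
pca = PCA(n_components=0.95, svd_solver='full')
reduced = pca.fit_transform(X)

n_kept = pca.n_components_                       # number chosen by PCA
total_var = pca.explained_variance_ratio_.sum()  # variance actually retained

print(f"{n_kept} of {X.shape[1]} components kept, "
      f"{total_var:.1%} of variance explained")
```

The fitted attribute `n_components_` holds the number PCA selected, which is the value TruncatedSVD does not compute for you.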

On the other hand, the n_components parameter for TruncatedSVD is used only to set the output dimensions:

n_components int, default=2
Desired dimensionality of output data. Must be strictly less than the number of features. The default value is useful for visualisation. For LSA, a value of 100 is recommended.

We can set this parameter to any number between 1 and (the number of features - 1) without knowing whether that number is optimal.
The function get_min_svd defined below calls sklearn.decomposition.TruncatedSVD twice: first with the maximum number of components, to find how many components are needed to exceed the given explained variance threshold, then with that number of components to transform the data. It returns the number of components, the threshold and the transformed data.
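
The selection rule itself (find the smallest number of components whose cumulative explained variance exceeds the threshold) can be sketched in plain Python, independently of scikit-learn. The helper name `min_components` and the example ratios are hypothetical, chosen only to make the rule easy to follow.

```python
from itertools import accumulate

def min_components(variance_ratios, threshold=0.95):
    """Return the smallest k such that the first k ratios sum to more than threshold."""
    for k, total in enumerate(accumulate(variance_ratios), start=1):
        if total > threshold:
            return k
    raise ValueError(f"cumulative ratio never exceeds {threshold}")

# Made-up explained variance ratios, sorted from largest to smallest
ratios = [0.5, 0.25, 0.125, 0.0625, 0.03125]

# Cumulative sums: 0.5, 0.75, 0.875, 0.9375, 0.96875
k = min_components(ratios, threshold=0.9)
print(k)  # 4
```

Note that the comparison is strict, matching the `cum_var > min_var_explained` test in get_min_svd, and that the rule has no answer when the threshold is never exceeded.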

Function definition:

def get_min_svd(data, min_var_explained=0.95,
                show=True,
                points_style='c--',
                line_color='m'):
  """
  Decompose `data` using truncated singular value decomposition.
  Return the minimal number of components whose cumulative explained variance
  exceeds the `min_var_explained` threshold, the threshold, and the transformed
  data. If `show` is True (default), also plot the cumulative explained variance.
  """
  from sklearn.decomposition import TruncatedSVD

  # Use the max number of components: number of features - 1:
  d = data.shape[1]
  tsvd = TruncatedSVD(n_components=d - 1)
  tsvd.fit(data)

  # Cumulative explained variance
  cum_var = tsvd.explained_variance_ratio_.cumsum()

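  # Number of components for % variance explained:
  # index of the first cumulative value above the threshold, plus 1
  above = cum_var > min_var_explained
  if not above.any():
    raise ValueError(f'{d - 1} components never exceed {min_var_explained:.0%} of variance explained')
  comp_min = int(above.argmax()) + 1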
  
  if show:
    # Import locally so that matplotlib is only required when plotting
    import matplotlib.pyplot as plt
    fig = plt.figure()
    
    ax = fig.add_subplot()
    
    x_vals = [i for i in range(1, d)]
    ax.plot(x_vals, cum_var, points_style, markersize=8, linewidth=1)
    ax.hlines(min_var_explained, 0, comp_min, colors=line_color)
    ax.vlines(comp_min, cum_var.min(), min_var_explained, colors=line_color)
    ax.plot(comp_min, cum_var.min(), 'k+', markersize=12,
            label=f'Components for {min_var_explained:.0%} of\nvariance explained: {comp_min}')
    ax.set_xlim(xmin=1-.2,xmax=d-.2)
    ax.set(xlabel='svd components', ylabel='cumulative explained variance')
    ax.legend(markerscale=0, handlelength=0)

  return comp_min, min_var_explained, TruncatedSVD(n_components=comp_min).fit_transform(data)

Call example:

from sklearn.datasets import load_digits

X = load_digits().data / 255
n_comps, thresh, reduced = get_min_svd(X)