{{ message }}

Instantly share code, notes, and snippets.

# danoneata/fisher_vector.py

Last active Mar 11, 2021
Fisher vectors with sklearn
 import numpy as np import pdb from sklearn.datasets import make_classification from sklearn.mixture import GaussianMixture as GMM def fisher_vector(xx, gmm): """Computes the Fisher vector on a set of descriptors. Parameters ---------- xx: array_like, shape (N, D) or (D, ) The set of descriptors gmm: instance of sklearn mixture.GMM object Gauassian mixture model of the descriptors. Returns ------- fv: array_like, shape (K + 2 * D * K, ) Fisher vector (derivatives with respect to the mixing weights, means and variances) of the given descriptors. Reference --------- J. Krapac, J. Verbeek, F. Jurie. Modeling Spatial Layout with Fisher Vectors for Image Categorization. In ICCV, 2011. http://hal.inria.fr/docs/00/61/94/03/PDF/final.r1.pdf """ xx = np.atleast_2d(xx) N = xx.shape # Compute posterior probabilities. Q = gmm.predict_proba(xx) # NxK # Compute the sufficient statistics of descriptors. Q_sum = np.sum(Q, 0)[:, np.newaxis] / N Q_xx = np.dot(Q.T, xx) / N Q_xx_2 = np.dot(Q.T, xx ** 2) / N # Compute derivatives with respect to mixing weights, means and variances. d_pi = Q_sum.squeeze() - gmm.weights_ d_mu = Q_xx - Q_sum * gmm.means_ d_sigma = ( - Q_xx_2 - Q_sum * gmm.means_ ** 2 + Q_sum * gmm.covariances_ + 2 * Q_xx * gmm.means_) # Merge derivatives into a vector. return np.hstack((d_pi, d_mu.flatten(), d_sigma.flatten())) def main(): # Short demo. K = 64 N = 1000 xx, _ = make_classification(n_samples=N) xx_tr, xx_te = xx[: -100], xx[-100: ] gmm = GMM(n_components=K, covariance_type='diag') gmm.fit(xx_tr) fv = fisher_vector(xx_te, gmm) pdb.set_trace() if __name__ == '__main__': main()

### iakash2604 commented May 31, 2017

 thanks for the code. I just have one doubt. based on the derivation of a fisher vector for an image (as given on this page here a fisher vector will have dimension of only 2KD and not K + 2 * D * K. can you please look into it. i may be wrong. thanks again

### sitzikbs commented Jul 11, 2017

 @iakash-1326 - It is not wrong (at least not regarding what you mentioned). In your link, they only use the derivatives with respects to the means and sigmas. In the original paper they also use the derivatives with respects to the weights. In another paper they mentioned that the derivatives with respects to the weights dont contribute much Hence the difference. Here is a link to the original paper What currently bothers me is the sign of the derivative according to sigma. I am not sure its correct (it should by *(-1)) but I am still checking

### danoneata commented May 7, 2018 • edited

 Sorry for the late reply – unfortunately, Gist doesn't trigger a notification when a comment is posted. @iakash2604 wrote: based on the derivation of a fisher vector for an image [...] a fisher vector will have dimension of only 2KD and not K + 2DK As @sitzikbs already explained, my implementation includes the derivatives with respect to the weights, hence the additional K values. My code is based on the paper of (Krapac et al., 2011), equations 15–17. @iakash2604 wrote: What currently bothers me is the sign of the derivative according to sigma. I am not sure its correct (it should by *(-1)) but I am still checking I believe the current implementation uses the correct sign (if you are not convinced I can go into more details). But I want to point out that compared to the paper, the derivatives with respect to the variances miss a scaling by 0.5. From a practical point of view, this scaling doesn't matter, since the Fisher vectors are standardized (zero mean, unit variance across each dimension) in order to approximate the transformation with inverse of the Fisher information matrix, see section 3.5 from (Krapac et al., 2011). This standardization step removes the scaling of each dimension. Also, do not forget to L2 and power normalize the Fisher vectors before feeding them into the classifier; it has been shown it visibly improves performance, see table 1 in (Sanchez et al., 2013). Here is a code example of how the Fisher vectors are prepared for classification by normalizing and building the corresponding kernel matrices.

### vanetoj commented Apr 27, 2020

 what is the Q sum?

### danoneata commented Apr 28, 2020 • edited

 Hi @vanetoj what is the Q sum? The matrix `Q` corresponds to the posterior distribution of the latent variables (the clusters from 1 to K) given the data (denoted by `xx`), that is: ``````Q[n, k] = p(k | xx[n]) `````` To obtain`Q_sum` we average over the first dimension of `Q`, namely the `N` data points. This step leaves us with `K` values, which can be interpreted as a new estimate of the prior probability on the `K` clusters (`pi`) or, similarly, as an M-step update given the data `xx`; see (Bishop, 2006), eqs. (9.26) and (9.27). Does this help?

### vanetoj commented May 5, 2020

 yes, thank you!

### khizerali commented Jul 3, 2020

 hello , what are the equation numbers of dmu,d_pi ,d_sigma , in the paper mentioned? Becasue i could not understand how d_sigma equation is computed from paper.

### danoneata commented Jul 3, 2020

 hello , what are the equation numbers of dmu,d_pi ,d_sigma , in the paper mentioned? Becasue i could not understand how d_sigma equation is computed from paper. Hello @khizerali! The code implements equations 15–17 from (Krapac et al., 2011). Specifically, `d_sigma` is based on equation 17: ``````q_nk . (Σ_k - x_nk ** 2) / 2 `````` Since `x_nk` is defined as `x_n - μ_k`, we have ``````q_nk . (Σ_k - (x_n - μ_k) ** 2) / 2 `````` and by further expanding the terms we get ``````(q_nk . Σ_k - q_nk . x_n ** 2 - q_nk . μ_k ** 2 + 2 * q_nk . x_n . μ_k) / 2 `````` The code is a vectorized version of the above formula that also averages over the set of `N` descriptors (the dimension denoted by `n`). It might be worth keeping in mind the following correspondences between code and math notation: `Q_sum` corresponds to averaging `q_nk` over `n` (that is, `1 / N . Σ_n q_nk`) `Q_xx_2` corresponds to averaging `q_nk . x_n ** 2` over `n` (that is, `1 / N . Σ_n q_nk . x_n ** 2`) `gmm.means_` corresponds to the means `μ_k` `gmm.covars_` coresponds to diagonal of the covariance matrix `Σ_k`. I hope this is somewhat clear 🙂

### khizerali commented Jul 4, 2020

 @danoneata ,thanks for your explanation . Now i understand how this equation is formed. But why inv(gmm.covars) is not used in d_mu equation according to equation 16?

### danoneata commented Jul 4, 2020 • edited

 But why `inv(gmm.covars)` is not used in `d_mu` equation according to equation 16? Yes, that's a valid observation, @khizerali! The reason is similar to what I've previously explained when motivating the missing 0.5 factor in `d_sigma` – since I'm standardizing the Fisher vectors, any linear transformation on a given dimension will be canceled away (and in this case, the inverse covariance represents a constant scaling applied to the `d_mu` component of all Fisher vectors). Practically, avoiding the multiplication with `gmm.covars_ ** -1` doesn't change the results, but saves some computation. In fact, now I notice that for the same reason, I could have also avoided subtracting the GMM weights when computing `d_pi`.

### sobhanhemati commented Nov 24, 2020

 Thanks for the implementation. I had a question about your implementation. I would be grateful if you could reply. I just realized there is no Fisher information matrix in your implementation. However, In the paper "Fisher Kernels on Visual Vocabularies for Image Categorization" authors mentioned: To normalize the dynamic range of the different dimensions of the gradient vectors, we need to compute the diagonal of the Fisher information matrix F. So isn't this degrading performance of FV if you are not using Fisher information matrix F?

### danoneata commented Nov 24, 2020 • edited

 Hello @sobhanhemati You are right, my code doesn't include the normalization with the Fisher information matrix. In practice, we usually approximate this normalization by standardizing the Fisher vectors (scaling to zero mean and unit variance); the implementation will look something along these lines: ```from sklearn.preprocessing import StandardScaler fvs = np.vstack([fisher_vector(get_descs(img), gmm) for img in imgs]) scaler = StandardScaler() fvs = scaler.fit(fvs).transform(fvs)``` Standardizing the Fisher vectors corresponds to using a diagonal approximation of the sample covariance matrix of the Fisher vectors. For more information please check section 3.5 in (Krapac et al., 2011) and for an empirical evaluation of the performance see page 9 (approximate FIM vs. empirical FIM) in (Sanchez et al., 2013); the latter report the following accuracies for image classification (so the higher the values the better): 61.8% for analytical diagonal approximation; 60.6% for empirical diagonal approximation (the approach mentioned above); 59.8% for using the identity matrix as the Fisher information matrix (that is, performing no normalization). Hope this helps! P.S.: If the dimensionality of your data allows, you can also estimate the full sample covariance matrix (which is equivalent to whitening the Fisher vectors).

### sobhanhemati commented Nov 25, 2020

 Thank you for clarification. Do you have any implementation of the analytical diagonal approximation so that I can add that the current implementation? It seems that analytical diagonal approximation works about 1 percent better :)) Thank you in advance

### danoneata commented Nov 25, 2020 • edited

 @sobhanhemati Equations (16–18) from (Sanchez et al., 2013) provide the Fisher vectors that include the analytical approximation; hence, you can modify the computation of `d_pi`, `d_mu`, `d_sigma` in the gist above as follows: ``` # at line 43 s = np.sqrt(gmm.weights_)[:, np.newaxis] d_pi = (Q_sum.squeeze() - gmm.weights_) / s.squeeze() d_mu = (Q_xx - Q_sum * gmm.means_) * np.sqrt(gmm.covariances_) ** -1 / s d_sigma = - ( - Q_xx_2 - Q_sum * gmm.means_ ** 2 + Q_sum * gmm.covariances_ + 2 * Q_xx * gmm.means_) / (s * np.sqrt(2))``` Note that I haven't tested this implementation, so you might want to double check it. And I would suggest to try both methods for estimating the diagonal Fisher information matrix and see which one works better for you — Sanchez et al. mention in their paper: Note that we do not claim that this difference is significant nor that the closed-form approximation is superior to the empirical one in general. Finally, do not forget to L2 and power normalise the Fisher vectors — these transformations yield much more substantial improvements (about 6-7% points each) than the choice of the approximation for the Fisher information matrix (see Table 1 from Sanchez et al.).

### sobhanhemati commented Nov 27, 2020

 Thank you so much for the comprehensive answer. I really appreciate that.