Skip to content

Instantly share code, notes, and snippets.

@fabianp
Last active February 1, 2024 10:02
  • Star 62 You must be signed in to star a gist
  • Fork 10 You must be signed in to fork a gist
Star You must be signed in to star a gist
Save fabianp/2020955 to your computer and use it in GitHub Desktop.
Pairwise ranking using scikit-learn LinearSVC
"""
Implementation of pairwise ranking using scikit-learn LinearSVC
Reference:
"Large Margin Rank Boundaries for Ordinal Regression", R. Herbrich,
T. Graepel, K. Obermayer 1999
"Learning to rank from medical imaging data." Pedregosa, Fabian, et al.,
Machine Learning in Medical Imaging 2012.
Authors: Fabian Pedregosa <fabian@fseoane.net>
Alexandre Gramfort <alexandre.gramfort@inria.fr>
See also https://github.com/fabianp/pysofia for a more efficient implementation
of RankSVM using stochastic gradient descent methdos.
"""
import itertools
import numpy as np
from sklearn import svm, linear_model, cross_validation
def transform_pairwise(X, y):
"""Transforms data into pairs with balanced labels for ranking
Transforms a n-class ranking problem into a two-class classification
problem. Subclasses implementing particular strategies for choosing
pairs should override this method.
In this method, all pairs are choosen, except for those that have the
same target value. The output is an array of balanced classes, i.e.
there are the same number of -1 as +1
Parameters
----------
X : array, shape (n_samples, n_features)
The data
y : array, shape (n_samples,) or (n_samples, 2)
Target labels. If it's a 2D array, the second column represents
the grouping of samples, i.e., samples with different groups will
not be considered.
Returns
-------
X_trans : array, shape (k, n_feaures)
Data as pairs
y_trans : array, shape (k,)
Output class labels, where classes have values {-1, +1}
"""
X_new = []
y_new = []
y = np.asarray(y)
if y.ndim == 1:
y = np.c_[y, np.ones(y.shape[0])]
comb = itertools.combinations(range(X.shape[0]), 2)
for k, (i, j) in enumerate(comb):
if y[i, 0] == y[j, 0] or y[i, 1] != y[j, 1]:
# skip if same target or different group
continue
X_new.append(X[i] - X[j])
y_new.append(np.sign(y[i, 0] - y[j, 0]))
# output balanced classes
if y_new[-1] != (-1) ** k:
y_new[-1] = - y_new[-1]
X_new[-1] = - X_new[-1]
return np.asarray(X_new), np.asarray(y_new).ravel()
class RankSVM(svm.LinearSVC):
"""Performs pairwise ranking with an underlying LinearSVC model
Input should be a n-class ranking problem, this object will convert it
into a two-class classification problem, a setting known as
`pairwise ranking`.
See object :ref:`svm.LinearSVC` for a full description of parameters.
"""
def fit(self, X, y):
"""
Fit a pairwise ranking model.
Parameters
----------
X : array, shape (n_samples, n_features)
y : array, shape (n_samples,) or (n_samples, 2)
Returns
-------
self
"""
X_trans, y_trans = transform_pairwise(X, y)
super(RankSVM, self).fit(X_trans, y_trans)
return self
def decision_function(self, X):
return np.dot(X, self.coef_.ravel())
def predict(self, X):
"""
Predict an ordering on X. For a list of n samples, this method
returns a list from 0 to n-1 with the relative order of the rows of X.
The item is given such that items ranked on top have are
predicted a higher ordering (i.e. 0 means is the last item
and n_samples would be the item ranked on top).
Parameters
----------
X : array, shape (n_samples, n_features)
Returns
-------
ord : array, shape (n_samples,)
Returns a list of integers representing the relative order of
the rows in X.
"""
if hasattr(self, 'coef_'):
return np.argsort(np.dot(X, self.coef_.ravel()))
else:
raise ValueError("Must call fit() prior to predict()")
def score(self, X, y):
"""
Because we transformed into a pairwise problem, chance level is at 0.5
"""
X_trans, y_trans = transform_pairwise(X, y)
return np.mean(super(RankSVM, self).predict(X_trans) == y_trans)
if __name__ == '__main__':
# as showcase, we will create some non-linear data
# and print the performance of ranking vs linear regression
np.random.seed(1)
n_samples, n_features = 300, 5
true_coef = np.random.randn(n_features)
X = np.random.randn(n_samples, n_features)
noise = np.random.randn(n_samples) / np.linalg.norm(true_coef)
y = np.dot(X, true_coef)
y = np.arctan(y) # add non-linearities
y += .1 * noise # add noise
Y = np.c_[y, np.mod(np.arange(n_samples), 5)] # add query fake id
cv = cross_validation.KFold(n_samples, 5)
train, test = iter(cv).next()
# make a simple plot out of it
import pylab as pl
pl.scatter(np.dot(X, true_coef), y)
pl.title('Data to be learned')
pl.xlabel('<X, coef>')
pl.ylabel('y')
pl.show()
# print the performance of ranking
rank_svm = RankSVM().fit(X[train], Y[train])
print 'Performance of ranking ', rank_svm.score(X[test], Y[test])
# and that of linear regression
ridge = linear_model.RidgeCV(fit_intercept=True)
ridge.fit(X[train], y[train])
X_test_trans, y_test_trans = transform_pairwise(X[test], y[test])
score = np.mean(np.sign(np.dot(X_test_trans, ridge.coef_)) == y_test_trans)
print 'Performance of linear regression ', score
@agramfort
Copy link

would be neat to work with SGD and a partial_fit that assembles the pairs online to avoid storing the whole X_trans

also I think SVC(kernel='linear') is more adapted for dense data.

@fabianp
Copy link
Author

fabianp commented Mar 12, 2012

As suggested by @pprett and @mblondel, it would be great to have a way to group samples, i.e. select which pairs should be considered and which should be ignored.

From the API point of view, a third argument to .fit() will break Pipeline and GridSearch, so I suggest the possibility that y might have two columns: one for group membership (pairs with same group will be considered) and the other one is the usual class label. If y is one-dimensional, it degrades to the current behaviour.

@mblondel
Copy link

CC me

@agramfort
Copy link

@fabianp did you have time to do what you said above? @pprett and @mblondel, do you think such an estimator is worth being added to the scikit?

@fabianp
Copy link
Author

fabianp commented Mar 18, 2012

yes, but I haven't tested it on real data yet (could be buggy)

@agramfort
Copy link

@mblondel
Copy link

The API should probably be discussed on the ML. Maybe other people who are familiar with ranking will be able to comment. Also I wonder if we should call the predict method predict or predict_rank.

@agramfort
Copy link

+1 for bringing the discussion to the ML @fabianp can you send the email?

@fabianp
Copy link
Author

fabianp commented Mar 19, 2012

thanks @agramfort, merged

@kigawas
Copy link

kigawas commented Apr 8, 2016

Can you let me use this code in my repository? I will keep it same as here.

@TheEdoardo93
Copy link

TheEdoardo93 commented Oct 17, 2017

Hi, I'm Edoardo, a master degree computer science student based in Milan.
My main task is the recommendation of items to users.
I would like to use this Python script for my following goal: "given a set of items as input, obtain a ranking list of this set of items, according to the ranking model trained with RankSVM model."

Now, I've got users and items latent features obtained from the execution of PMF model on the ratings matrix.
My next step is to use these latent features for training a ranking model (with RankSVM) and after that, I want to predict a ranking list given a set of items as input of the trained model.

It is possible to do that? How?

Thanks, best regards!

@fabianp
Copy link
Author

fabianp commented Oct 18, 2017

The decision function (decision_function) gives you a real number, the ranking should be computed from this number (i.e., those with higher decision function are ranked higher than those with small decision function)

@TheEdoardo93
Copy link

TheEdoardo93 commented Oct 18, 2017

Thanks for the answer.

Which is the input that a rankSVM "variable" expect? I don't understand the input and the output of your source code.
Can you give me an example of using this code? Is there any documentation about this library?
How can I give as input my latent features to rankSVM?

@queirozfcom
Copy link

@fabianp Is this version of the gist updated with respect to @agramfort 's fork?

@senjed
Copy link

senjed commented Nov 5, 2018

Thanks for the answer.

Which is the input that a rankSVM "variable" expect? I don't understand the input and the output of your source code.
Can you give me an example of using this code? Is there any documentation about this library?
How can I give as input my latent features to rankSVM?

I have the same question. what is the value of y? is it the relevance score or the rank? I also do not understand how to interpret the array that the predict function returns. Can this algorithm be used if I only have a single query and for each grade(relevance score) I have only one instance? For example I only have 1 item with relevance score 5. Another question is that can the relevance score be a float number?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment