@beader
Last active January 24, 2017 20:50
Convert a dgCMatrix to libsvm format
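#' Convert a dgCMatrix to libsvm format
#'
#' @param sm A sparse matrix of class "dgCMatrix", one row per observation.
#' @param label A vector of labels, one per row of `sm`; defaults to all zeros.
#' @return A character vector with one element per row, each of the form
#'   "label index:value index:value ...".
#' @examples
#' regMat <- matrix(runif(16), 4, 4)
#' regMat[sample(16, 5)] <- 0
#' sparseMat <- Matrix::Matrix(regMat, sparse = TRUE)
#' conv2libsvm(sparseMat)
conv2libsvm <- function(sm, label = rep(0, dim(sm)[1])) {
  stopifnot(dim(sm)[1] == length(label))
  # Transpose so that each compressed column of tsm holds one row of sm.
  tsm <- Matrix::t(sm)
  i <- tsm@i  # zero-based feature indices of the non-zero entries
  p <- tsm@p  # column pointers: column c owns entries p[c] + 1 through p[c + 1]
  x <- tsm@x  # non-zero values
  vapply(seq_len(dim(tsm)[2]), function(c) {
    # seq_len() keeps idx empty when a row has no non-zero entries
    idx <- p[c] + seq_len(p[c + 1] - p[c])
    paste(label[c], paste(i[idx], x[idx], sep = ":", collapse = " "))
  }, FUN.VALUE = character(1))
}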

madmanminkin commented Jun 8, 2016

You can use this to create a labeled libsvm file by setting label to a data frame column instead of the default rep(0, dim(sm)[1]).
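
A minimal sketch of that usage, assuming a hypothetical data frame `df` with feature columns `f1` and `f2` and a label column `target` (none of these names come from the gist). `conv2libsvm` is repeated so the snippet runs on its own, with `seq_len()` used so that rows without non-zero entries do not break the indexing:

```r
library(Matrix)

# Copy of the gist's function (zero-based indices, as in the original).
conv2libsvm <- function(sm, label = rep(0, dim(sm)[1])) {
  stopifnot(dim(sm)[1] == length(label))
  tsm <- Matrix::t(sm)
  i <- tsm@i
  p <- tsm@p
  x <- tsm@x
  vapply(seq_len(dim(tsm)[2]), function(c) {
    idx <- p[c] + seq_len(p[c + 1] - p[c])
    paste(label[c], paste(i[idx], x[idx], sep = ":", collapse = " "))
  }, FUN.VALUE = character(1))
}

# Hypothetical toy data: two feature columns plus a 0/1 target column.
df <- data.frame(
  f1 = c(0.5, 0, 1.5),
  f2 = c(0, 2, 0),
  target = c(1, 0, 1)
)

# Build the sparse feature matrix from the non-label columns only.
features <- Matrix(as.matrix(df[, c("f1", "f2")]), sparse = TRUE)

# Pass the label column instead of relying on the all-zero default.
lines <- conv2libsvm(features, label = df$target)

# Write one observation per line to a temporary file.
out_file <- tempfile(fileext = ".libsvm")
writeLines(lines, out_file)
```

`writeLines` is only one way to get the character vector onto disk; any writer that emits one element per line should work equally well.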


aolney commented Jan 24, 2017

Thanks for this 👍 I think the index is off by 1 though. Here is a suggested fix:

data(agaricus.train, package='xgboost')
conv2libsvm <- function(sm, label = rep(0, dim(sm)[1])) {
  stopifnot(dim(sm)[1] == length(label))
  tsm <- Matrix::t(sm)
  i <- tsm@i
  p <- tsm@p
  x <- tsm@x
  vapply(seq(dim(tsm)[2]), function(c) {
    idx <- (p[c]+1):p[c+1]
    paste(label[c], paste(i[idx]+1, x[idx], sep = ":", collapse = " ")) #Note +1 here
  }, FUN.VALUE = character(1))
}
conv2libsvm(agaricus.train$data,agaricus.train$label)

This gives a first line of:

1 3:1 10:1 11:1 21:1 30:1 34:1 36:1 40:1 41:1 53:1 58:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 105:1 117:1 124:1

which matches the first line of the file xgboost-master/demo/binary_classification/agaricus.txt.train

Without the +1, the first three lines are:

[1] "1 2:1 9:1 10:1 20:1 29:1 33:1 35:1 39:1 40:1 52:1 57:1 64:1 68:1 76:1 85:1 87:1 91:1 94:1 101:1 104:1 116:1 123:1"
[2] "0 2:1 9:1 19:1 20:1 22:1 33:1 35:1 38:1 40:1 52:1 55:1 64:1 68:1 76:1 85:1 87:1 91:1 94:1 101:1 105:1 115:1 119:1"
[3] "0 0:1 9:1 18:1 20:1 23:1 33:1 35:1 38:1 41:1 52:1 55:1 64:1 68:1 76:1 85:1 87:1 91:1 94:1 101:1 105:1 115:1 121:1"

Note that the third row contains the index 0, which shows the output is zero-based. That is inconsistent with R's one-based indexing and with the one-based indices in the agaricus file.
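
The zero-based indices come from the matrix itself rather than from `paste`: a dgCMatrix stores its row indices in the `i` slot counting from 0. A small sketch, assuming a hypothetical 3 x 2 matrix `m` built with `Matrix::sparseMatrix`, illustrates this:

```r
library(Matrix)

# Hypothetical 3 x 2 matrix with non-zeros at (1,1), (3,1) and (3,2),
# specified with R's usual one-based coordinates.
m <- sparseMatrix(i = c(1, 3, 3), j = c(1, 1, 2), x = c(5, 7, 9), dims = c(3, 2))

zero_based <- m@i        # row indices exactly as stored in the slot
one_based  <- m@i + 1L   # shifted to match R's indexing

print(zero_based)
print(one_based)
```

Here the stored indices are 0, 2, 2 even though the entries were supplied as rows 1, 3, 3, which is why the fix adds 1 before pasting.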
