@zachmayer
Created September 16, 2015 19:50
Sparse != Big Data
# Define the problem: a random graph with 300,000 nodes and 900,000 edges
t1 <- Sys.time()
set.seed(1)
n_nodes <- 300000L
n_edges <- 900000L
nodes <- 1L:n_nodes
edge_node_1 <- sample(nodes, n_edges, replace=TRUE)
edge_node_2 <- sample(nodes, n_edges, replace=TRUE)
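# Sparse adjacency matrix (no x is given, so this is a binary pattern matrix)
library(Matrix)
M <- sparseMatrix(
  i = edge_node_1,
  j = edge_node_2,
  dims = c(n_nodes, n_nodes)  # fix the shape even if the highest node ids are never sampled
)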
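# Row-wise Jaccard similarity
# http://stats.stackexchange.com/a/89947/2817
jaccard <- function(m) {
  # Coerce to numeric so tcrossprod() returns intersection counts;
  # depending on the Matrix version, a pattern matrix may yield TRUE/FALSE instead.
  m <- as(m, "dMatrix")
  A <- tcrossprod(m)                  # A[i, j] = size of the intersection of rows i and j
  im <- which(A > 0, arr.ind = TRUE)  # row pairs with a nonempty intersection
  b <- rowSums(m)                     # number of nonzero entries in each row
  Aim <- A[im]
  J <- sparseMatrix(
    i = im[, 1],
    j = im[, 2],
    x = Aim / (b[im[, 1]] + b[im[, 2]] - Aim),  # intersection / union
    dims = dim(A)
  )
  return(J)
}
J <- jaccard(M)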
#Dims and times
dim(M)
dim(J)
Sys.time() - t1
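
The formula inside `jaccard()` (intersection counts from `tcrossprod()`, union sizes from row sums) can be sanity-checked on a matrix small enough to verify by hand. The sketch below is not part of the original gist: the 3 x 4 matrix `toy` and the names `inter`, `sizes`, `uni`, `jac_sparse`, and `jac_ref` are made up for illustration, and the brute-force loop is only a reference implementation for comparison.

```r
library(Matrix)

# Hypothetical toy data: 3 rows (sets) over 4 columns (items).
# Row 1 = {1, 2}, row 2 = {2, 3}, row 3 = {4}.
toy <- sparseMatrix(
  i = c(1, 1, 2, 2, 3),
  j = c(1, 2, 2, 3, 4),
  x = 1,
  dims = c(3, 4)
)

# Same idea as jaccard(): intersections via tcrossprod, unions via row sums.
inter <- as.matrix(tcrossprod(toy))        # inter[a, b] = size of the intersection
sizes <- rowSums(toy)                      # size of each row's set
uni   <- outer(sizes, sizes, "+") - inter  # |A| + |B| - |A and B|
jac_sparse <- inter / uni

# Brute-force reference on the dense logical matrix.
dense <- as.matrix(toy) > 0
jac_ref <- matrix(0, 3, 3)
for (a in 1:3) {
  for (b in 1:3) {
    jac_ref[a, b] <- sum(dense[a, ] & dense[b, ]) / sum(dense[a, ] | dense[b, ])
  }
}

# Rows 1 and 2 share one item out of three distinct items, so J = 1/3.
print(jac_sparse)
print(isTRUE(all.equal(unname(jac_sparse), jac_ref)))
```

On the toy data the two routes should agree. The dense `outer()` step is only viable because the example is tiny; at the gist's scale the union sizes are computed only for the nonzero pairs returned by `which()`.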