@mikelove
Last active August 25, 2016 15:30
looking to see if t-SNE replicates linear separation of groups
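# simulation settings
n <- 50          # total number of points (two groups of n/2)
m <- 40          # total number of dimensions
m_inform <- 10   # dimensions in which the group means differ
set.seed(1)
niter <- 200     # number of simulated datasets, one per value of mu
intradist <- numeric(niter)
interdist <- numeric(niter)
# per-dimension difference in group means, from 0 to 3
mus <- seq(from=0, to=3, length.out=niter)
library(Rtsne)
cols <- rep(1:2, each=n/2)  # group labels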
for (i in seq_len(niter)) {
  mu <- mus[i]
  cat(i,"")
  # first m_inform columns: group means at -mu/2 and +mu/2;
  # remaining columns: standard normal noise with no group difference
  x <- cbind(rbind(matrix(rnorm(n/2 * m_inform, -mu/2), ncol=m_inform),
                   matrix(rnorm(n/2 * m_inform, mu/2), ncol=m_inform)),
             matrix(rnorm(n * (m - m_inform)), nrow=n))
  # see comment below on raising perplexity to 16 for n=50 and 30 for n=100
  res <- Rtsne(x, perplexity=10)
  #plot(res$Y, col=cols, pch=20, xlab="", ylab="")
  # group centroids in the 2-D embedding
  mid1 <- colMeans(res$Y[cols==1,])
  mid2 <- colMeans(res$Y[cols==2,])
  # intra: mean distance from each point to its own group centroid
  intradist[i] <- mean(c(sqrt(colSums((t(res$Y[cols==1,]) - mid1)^2)),
                         sqrt(colSums((t(res$Y[cols==2,]) - mid2)^2))))
  # inter: distance between the two group centroids
  interdist[i] <- sqrt(sum((mid1 - mid2)^2))
}
# make plot
# x-axis: distance between the true group centers, which differ by mu
# in each of the m_inform informative dimensions
dat <- data.frame(mu=sqrt(m_inform)*rep(mus,2),
                  dist=c(intradist, interdist),
                  type=rep(c("intra","inter"),each=niter))
library(ggplot2)
print(
  ggplot(dat, aes(x=mu,y=dist,col=type)) + geom_point() + geom_smooth() +
    xlab("distance between sub-population centers") +
    ylab("distance recovered by t-SNE") + ggtitle(paste(n,"points"))
)

mikelove commented Aug 25, 2016

I had originally lowered perplexity until Rtsne stopped raising an error, but Michael Schubert then asked what happens with higher values:

https://twitter.com/_ms03/status/768827491536502785

I tried again, using the default value (perplexity=30) for n=100:

https://twitter.com/mikelove/status/768830108761255937

And using perplexity=16 (highest without error) for n=50:

https://twitter.com/mikelove/status/768830491042652160
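
As a rough sketch of where the ceiling of 16 comes from: my understanding is that Rtsne rejects a perplexity when 3 * perplexity exceeds n - 1. That rule is an assumption here (it may differ across Rtsne versions), and `max_perplexity` is a hypothetical helper, not part of the package. It does agree with the value found above for n=50.

```r
# Largest perplexity allowed under the ASSUMED rule 3 * perplexity <= n - 1.
# max_perplexity is a hypothetical helper, not an Rtsne function.
max_perplexity <- function(n) floor((n - 1) / 3)

max_perplexity(50)   # 16, matching the highest value that ran without error
max_perplexity(100)  # 33, so the default of 30 is allowed for n=100
```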

Using these values, the plots look more linear on the left side, up to a breakpoint beyond which the two populations are spread apart by more than their actual distance. This is more or less what the method advertises.
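
For anyone who wants to compare the recovered distances with the actual ones, here is a minimal sketch of the two summaries used in the script, written as a function that should work on either the embedding (`res$Y`) or the original matrix (`x`). `group_distances` is a hypothetical helper name, not something from the gist, and it assumes exactly two groups.

```r
# Hypothetical helper: intra- and inter-group distances for a two-group matrix.
# Y has one row per point; groups gives the group label of each row.
group_distances <- function(Y, groups) {
  stopifnot(nrow(Y) == length(groups))
  labs <- unique(groups)
  stopifnot(length(labs) == 2)
  Y1 <- Y[groups == labs[1], , drop = FALSE]
  Y2 <- Y[groups == labs[2], , drop = FALSE]
  mid1 <- colMeans(Y1)
  mid2 <- colMeans(Y2)
  # distance from each point to its own group centroid
  d1 <- sqrt(rowSums(sweep(Y1, 2, mid1)^2))
  d2 <- sqrt(rowSums(sweep(Y2, 2, mid2)^2))
  c(intra = mean(c(d1, d2)),
    inter = sqrt(sum((mid1 - mid2)^2)))
}

# toy check: two pairs of points on a line, centroids at x=1 and x=11
toy <- rbind(c(0, 0), c(2, 0), c(10, 0), c(12, 0))
toy_res <- group_distances(toy, c(1, 1, 2, 2))
toy_res  # intra = 1, inter = 10
```

Calling it on `x` and on `res$Y` inside the loop would give the actual and recovered values side by side.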
