Last active
August 25, 2016 15:30
looking to see if t-SNE replicates linear separation of groups
n <- 50          # number of points (two groups of n/2)
m <- 40          # total number of dimensions
m_inform <- 10   # dimensions in which the group means differ
set.seed(1)
niter <- 200
intradist <- numeric(niter)
interdist <- numeric(niter)
# per-dimension difference between the group means, one value per iteration
mus <- seq(from=0, to=3, length.out=niter)
library(Rtsne)
cols <- rep(1:2, each=n/2)
for (i in seq_len(niter)) {
  mu <- mus[i]
  cat(i,"")
  # informative columns: group 1 centered at -mu/2, group 2 at +mu/2;
  # the remaining m - m_inform columns are standard normal noise
  x <- cbind(rbind(matrix(rnorm(n/2 * m_inform, -mu/2), ncol=m_inform),
                   matrix(rnorm(n/2 * m_inform, mu/2), ncol=m_inform)),
             matrix(rnorm(n * (m - m_inform)), nrow=n))
  # see comment below on raising perplexity to 16 for n=50 and 30 for n=100
  res <- Rtsne(x, perplexity=10)
  #plot(res$Y, col=cols, pch=20, xlab="", ylab="")
  # group centroids in the t-SNE embedding
  mid1 <- colMeans(res$Y[cols==1,])
  mid2 <- colMeans(res$Y[cols==2,])
  # mean distance of each point to its own group centroid
  intradist[i] <- mean(c(sqrt(colSums((t(res$Y[cols==1,]) - mid1)^2)),
                         sqrt(colSums((t(res$Y[cols==2,]) - mid2)^2))))
  # distance between the two group centroids
  interdist[i] <- sqrt(sum((mid1 - mid2)^2))
}
# make plot
# the centers differ by mu in each of m_inform dimensions,
# so their Euclidean distance is sqrt(m_inform) * mu
dat <- data.frame(mu=sqrt(m_inform)*rep(mus,2),
                  dist=c(intradist, interdist),
                  type=rep(c("intra","inter"),each=niter))
library(ggplot2)
print(
  ggplot(dat, aes(x=mu,y=dist,col=type)) + geom_point() + geom_smooth() +
    xlab("distance between sub-population centers") +
    ylab("distance recovered by t-SNE") + ggtitle(paste(n,"points"))
)
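The two summaries computed inside the loop above (mean distance of points to their own group centroid, and distance between the two centroids) could be factored into a small helper. This is only a sketch: `group_dists` is a hypothetical name that does not appear in the original script, and it assumes exactly two groups, as in the simulation.

```r
# Hypothetical helper, not part of the original script: summarize an
# embedding Y (one row per point) with two group labels by the same two
# quantities the loop computes.
group_dists <- function(Y, groups) {
  g <- unique(groups)
  stopifnot(length(g) == 2)
  Y1 <- Y[groups == g[1], , drop=FALSE]
  Y2 <- Y[groups == g[2], , drop=FALSE]
  mid1 <- colMeans(Y1)
  mid2 <- colMeans(Y2)
  # distance of every point to its own group centroid
  d1 <- sqrt(rowSums(sweep(Y1, 2, mid1)^2))
  d2 <- sqrt(rowSums(sweep(Y2, 2, mid2)^2))
  c(intra = mean(c(d1, d2)), inter = sqrt(sum((mid1 - mid2)^2)))
}

# toy embedding: two groups of two points, centroids at (0,0) and (4,0)
Y <- rbind(c(-1, 0), c(1, 0), c(3, 0), c(5, 0))
out <- group_dists(Y, rep(1:2, each=2))
```

Inside the loop, the same call would be `group_dists(res$Y, cols)`, with the `intra` and `inter` elements stored in `intradist[i]` and `interdist[i]`.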
I had previously lowered the perplexity until I no longer got an error, but then Michael Schubert asked what would happen with higher values:
https://twitter.com/_ms03/status/768827491536502785
I tried again, using the default value (perplexity=30) for n=100:
https://twitter.com/mikelove/status/768830108761255937
And using perplexity=16 (the highest value that runs without error) for n=50:
https://twitter.com/mikelove/status/768830491042652160
Using these values, the plots look more linear on the left side, up to a breakpoint beyond which the two populations are spread apart by more than their actual distance. This is more or less what the method advertises.
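The perplexity=16 limit for n=50 appears to follow from an input check in Rtsne. The sketch below assumes the check rejects any perplexity with `3 * perplexity > nrow(X) - 1`; that is an assumption about the package's internals rather than something stated here, and `max_perplexity` is a hypothetical helper name.

```r
# Assumption: Rtsne stops with an error when nrow(X) - 1 < 3 * perplexity.
# Hypothetical helper: largest integer perplexity that passes that check.
max_perplexity <- function(n) floor((n - 1) / 3)

p50  <- max_perplexity(50)   # n used in the script above
p100 <- max_perplexity(100)  # n used for the perplexity=30 run
```

Under that assumption, n=50 allows at most perplexity 16, which matches the value found by trial and error, and the default of 30 fits within the bound for n=100.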