timchurches/t-SNE-in-R.Rmd

## t-SNE-in-R.Rmd
---
title: "t-SNE example using R"
output: html_notebook
---

This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. Execute each code chunk in sequence by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Cmd+Shift+Enter*.

The material here is based largely on [this post](https://datavizpyr.com/how-to-make-tsne-plot-in-r/) on the excellent (although advertisement-infested) _Data Viz with Python and R_ blog.

You may have to install several of the packages listed below (they are all on CRAN). RStudio may automatically prompt you to do so if you are running a recent version.

```{r setup}
library(tidyverse)
library(palmerpenguins)
library(Rtsne)
library(GGally)
library(plotly)
```

We'll be using a dataset on penguins studied on some Antarctic islands.

```{r penguins-glimpse}
glimpse(penguins)
```

```{r penguins-head}
head(penguins)
```

Notice that we have observations on 344 penguins of three different species collected on three islands (333 once the rows with missing values are dropped, which we do below). For each penguin there are four numeric variables recording bill and flipper dimensions and body mass.

The `year` column isn't relevant so we'll drop it, along with any rows that have missing values, and we'll make an `ID` column which is just the sequential row number, using the handy `row_number()` function in `dplyr`.

```{r drop-year}
penguins <- penguins %>%
  drop_na() %>%
  select(-year) %>%
  mutate(ID = row_number())

glimpse(penguins)
```
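
To see what `drop_na()` removes, it can help to count the missing values in each column first. The chunk below is a minimal sketch that works on a fresh copy of the package data (the name `raw` is purely illustrative), so it does not touch the `penguins` data frame used in the rest of this notebook.

```{r missing-values-sketch}
library(tidyverse)
library(palmerpenguins)

# A fresh copy of the original data (hypothetical name, used only here)
raw <- palmerpenguins::penguins

# Number of missing values in each column
raw %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Number of rows before and after removing incomplete rows
c(before = nrow(raw), after = nrow(drop_na(raw)))
```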

We'll also extract some metadata from the data frame. t-SNE only works with numeric variables/columns, so we need to split off the categorical data and join it back after we have transformed the numeric columns through t-SNE.

```{r make-meta}
penguins_meta <- penguins %>%
  select(ID,species,island,sex)

head(penguins_meta)
```

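t-SNE works on distances between observations, so we should centre and scale the numeric columns first; otherwise variables measured in large units (such as body mass in grams) would dominate. Base R has a handy `scale()` function that does that (there are many other ways of doing it in R as well). We also set a random seed so the t-SNE step that follows is repeatable. Note the `select(where(is.numeric))` step, which is a handy way of selecting (keeping) only the numeric columns. That includes our `ID` column, so we move it into the row names with `column_to_rownames()` to keep it out of the scaling. Note also that the `scale()` function returns a matrix, not a data frame (or tibble), so we need to convert it back to a tibble in the last step below.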

```{r scaled-penguins}
set.seed(142)
penguins_scaled <- penguins %>%
  select(where(is.numeric)) %>%
  column_to_rownames("ID") %>%
  scale() %>%
  as_tibble()

head(penguins_scaled)
```
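
As a quick sanity check on what `scale()` does, the sketch below recomputes one column by hand and confirms that the scaled columns end up with mean zero and standard deviation one. It rebuilds the measurements from the package data rather than reusing `penguins_scaled`, and the names `measurements`, `scaled` and `manual` are illustrative only.

```{r check-scaling-sketch}
library(tidyverse)
library(palmerpenguins)

# The four measurement columns, complete rows only
measurements <- palmerpenguins::penguins %>%
  drop_na() %>%
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)

# scale() is a base R function and returns a numeric matrix
scaled <- scale(measurements)

# By default scale() computes (x - mean(x)) / sd(x) for each column
manual <- (measurements$bill_length_mm - mean(measurements$bill_length_mm)) /
  sd(measurements$bill_length_mm)
all.equal(unname(scaled[, "bill_length_mm"]), manual)

# After scaling, each column has mean (very close to) 0 and sd 1
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```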

Before we use t-SNE to reduce the four numeric dimensions in our penguin data down to two dimensions so we can visualise it, let's look at pairwise bivariate scatterplots of that set of four numeric dimensions. The `ggpairs()` function in the `GGally` package (which is an extension to `ggplot2`) does that nicely.

```{r pair-wise-scatter-plots, message=FALSE}
ggpairs(penguins_scaled)
```

From that we can see that each pair of numeric variables has a distinct relationship, as we might expect.
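
The pairs plot can also be coloured by species, which gives a feel for how well the groups separate before any dimension reduction. This is a sketch only: it rebuilds the scaled data with a `species` column attached (the names `penguins_complete`, `scaled_with_species` and `p` are illustrative), and uses the `columns` and `mapping` arguments of `ggpairs()`.

```{r pairs-by-species-sketch, message=FALSE}
library(tidyverse)
library(palmerpenguins)
library(GGally)

penguins_complete <- palmerpenguins::penguins %>% drop_na()

# Scale the four measurements, then add species back for colouring
scaled_with_species <- penguins_complete %>%
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
  scale() %>%
  as_tibble() %>%
  mutate(species = penguins_complete$species)

# columns = 1:4 restricts the panels to the measurements;
# species is used only for the colour aesthetic
p <- ggpairs(scaled_with_species,
             columns = 1:4,
             mapping = aes(colour = species))
p
```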

OK, now let's reduce those four numeric dimensions down to two, using the `Rtsne()` function from the `Rtsne` package, and store the result in `tSNE_fit`. The default number of dimensions to reduce to is two. We get back a complex object (a list) containing the two resulting dimensions in `Y` and a bunch of other diagnostic information about the iterative t-SNE fitting process, which we will ignore for now.

```{r squash-dims}
tSNE_fit <- penguins_scaled %>%
  Rtsne()

glimpse(tSNE_fit)
```
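
For reference, here is a sketch of the same kind of call with the main tuning arguments written out. The values shown are intended to match the package defaults, but check `?Rtsne` for your installed version. The seed and the names `X`, `fit` and `saved_rng` are assumptions made for this example, not part of the analysis above.

```{r rtsne-explicit-sketch}
library(tidyverse)
library(palmerpenguins)
library(Rtsne)

# Scaled measurement matrix, rebuilt so this chunk stands alone
X <- palmerpenguins::penguins %>%
  drop_na() %>%
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
  scale()

# Remember the current random state so this sketch does not disturb later chunks
saved_rng <- get0(".Random.seed", envir = globalenv())

set.seed(1)  # hypothetical seed, chosen only for this sketch
fit <- Rtsne(X,
             dims = 2,          # number of output dimensions
             perplexity = 30,   # roughly, the effective number of neighbours
             theta = 0.5,       # Barnes-Hut speed/accuracy trade-off
             max_iter = 1000)   # number of optimisation iterations

dim(fit$Y)           # one row per penguin, one column per output dimension
tail(fit$itercosts)  # cost values recorded during the fit

# Put the random state back as we found it
if (!is.null(saved_rng)) assign(".Random.seed", saved_rng, envir = globalenv())
```

Saving and restoring `.Random.seed` means that running this chunk should leave the results of the later t-SNE chunks unchanged.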

So let's extract the `Y` object (which is a matrix) from the results, convert that matrix back to a data frame (tibble), then rename the columns as "tSNE1" and "tSNE2", and add a sequential ID column which we can use to join back the categorical variables we split off earlier.

```{r wrangle-results, message=FALSE}
tSNE_df <- tSNE_fit$Y %>%
  as_tibble() %>%
  rename(tSNE1="V1",
         tSNE2="V2") %>%
  mutate(ID=row_number())

head(tSNE_df)
```

Now join back the categorical variables.

```{r restore-categoricals}
tSNE_df <- tSNE_df %>%
  inner_join(penguins_meta, by="ID")

tSNE_df %>% head()
```
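
The join relies on the rows of the t-SNE output being in the same order as the rows we fed in, so that `row_number()` reproduces the original `ID` values. The toy sketch below shows the same pattern on a small made-up matrix (all values and the names `embedding`, `meta` and `joined` are hypothetical), with two checks that no rows were lost or left unmatched.

```{r check-join-sketch}
library(tidyverse)

# Hypothetical stand-in for tSNE_fit$Y: a 3 x 2 matrix without column names
embedding <- matrix(c(1.5, -0.2, 0.7, 2.1, -1.3, 0.4), ncol = 2)

# Hypothetical stand-in for penguins_meta
meta <- tibble(ID = 1:3,
               species = c("Adelie", "Chinstrap", "Gentoo"))

embedding_df <- embedding %>%
  as.data.frame() %>%                   # unnamed columns become V1 and V2
  rename(tSNE1 = V1, tSNE2 = V2) %>%
  mutate(ID = row_number())             # relies on row order being preserved

joined <- inner_join(embedding_df, meta, by = "ID")

# Every embedded row should find its metadata, and vice versa
stopifnot(nrow(joined) == nrow(embedding_df))
stopifnot(nrow(anti_join(meta, embedding_df, by = "ID")) == 0)

joined
```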

Now we can visualise the t-SNE transformed version of the data, with the four original numeric variables squashed down to just two, mapped to the x- and y-axes.

```{r 2d-viz-1}
tSNE_df %>%
  ggplot(aes(x = tSNE1,
             y = tSNE2,
             color = species,
             shape = sex))+
  geom_point()+
  theme(legend.position="bottom")
```

Same again, but with the shape aesthetic mapped to island instead.

```{r 2d-viz-2}
tSNE_df %>%
  ggplot(aes(x = tSNE1,
             y = tSNE2,
             color = species,
             shape = island))+
  geom_point()
```

So, despite the reduction of dimensions from four down to two, t-SNE seems to have nicely preserved the clustering of penguin characteristics by species.

However, notice the small number of Chinstrap penguins on Dream Island that appear to be clustered with the Adelie penguins. They could be data errors, or there might be something different about them that is hidden because we have lost information by squashing down from four to two dimensions.

So let's look at a reduction to three dimensions. We can do that by specifying `dims=3`. The steps are otherwise the same as above, except we have three resulting dimensions instead of two after using t-SNE on the four original dimensions.

```{r three-d-penguins}
tSNE_fit_3d <- penguins_scaled %>%
  Rtsne(dims=3)

tSNE_df_3d <- tSNE_fit_3d$Y %>%
  as.data.frame() %>%
  rename(tSNE1="V1",
         tSNE2="V2",
         tSNE3="V3") %>%
  mutate(ID=row_number())

tSNE_df_3d <- tSNE_df_3d %>%
  inner_join(penguins_meta, by="ID")

tSNE_df_3d %>% head()
```

Now let's visualise that using `plotly`, one of several ways to look at 3D plots in R, which we discuss in the final interactive session in week 11.

```{r plotly}
plot_ly(tSNE_df_3d, x = ~tSNE1, y = ~tSNE2, z = ~tSNE3) %>%
  add_markers(color = ~species)

```

You can rotate and zoom in on the chart. If you spin the data around in the right way, you will notice that the small group of Chinstrap penguins that seemed to be in the wrong cluster when viewed in two dimensions is, in fact, distinct from (or at least can be separated by a plane from) that cluster when viewed in three dimensions. They are probably even more distinct in four dimensions, if we had the ability to visualise that (but as mere humans, we don't).
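
Although we cannot plot four dimensions, we can still measure distances in them. The sketch below is one possible check, not part of the original analysis: for every penguin it finds the nearest neighbour in the original four scaled dimensions and compares species. The names `penguins_complete`, `scaled`, `d`, `nn_index` and `nn_table` are illustrative.

```{r nearest-neighbour-4d-sketch}
library(tidyverse)
library(palmerpenguins)

penguins_complete <- palmerpenguins::penguins %>% drop_na()

scaled <- penguins_complete %>%
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
  scale()

# Pairwise Euclidean distances in the original four (scaled) dimensions
d <- as.matrix(dist(scaled))
diag(d) <- Inf  # a penguin should not count as its own neighbour

# Row index of each penguin's nearest neighbour
nn_index <- apply(d, 1, which.min)

nn_table <- tibble(
  species    = penguins_complete$species,
  island     = penguins_complete$island,
  nn_species = penguins_complete$species[nn_index]
)

# Cross-tabulate each penguin's species against its nearest neighbour's species
nn_table %>% count(species, nn_species)

# The penguins whose nearest neighbour belongs to a different species
nn_table %>% filter(species != nn_species)
```

If the Chinstrap penguins in question turn up in the second table, that would suggest they really do sit close to Adelie penguins in the original measurements, rather than being an artefact of the t-SNE projection.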