rrrlw/TDA-in-R.md

# Topological data analysis (TDA) in R

## Purpose

Topological data analysis (or TDA) is a growing discipline with great potential.
Unfortunately, learning the software used to perform TDA on real datasets is a nontrivial task that adds an often intimidating barrier to entry.
This Gist provides a straightforward tutorial on using the TDAstats package to conduct topological data analysis (specifically, persistent homology using Vietoris-Rips simplicial complexes) in R.
N.B. this Gist largely assumes the reader is familiar with Vietoris-Rips simplicial complexes, persistent homology, topological barcodes, and persistence diagrams.
Please make sure you have at least a basic understanding of these terms prior to proceeding.
Knowing some R syntax would also be useful.

## Setup

First, install the R programming language (I also suggest installing the RStudio IDE).
Once that is complete, install the TDAstats package: either the quality-checked version on CRAN or the development version on GitHub (for which you need the devtools package), as follows.
N.B. since TDAstats depends on other R packages (e.g. Rcpp and ggplot2), installation may take a while.
```r
# install TDAstats from CRAN (recommended)
install.packages("TDAstats")

# install development version from GitHub (for advanced useRs)
install.packages("devtools")
devtools::install_github("rrrlw/TDAstats", build_vignettes = TRUE)
```
This Gist uses sample datasets provided with the TDAstats package as examples.
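Before moving on, it may be worth confirming that the installation worked. The following is a minimal sketch using only base R functions (`requireNamespace` and `packageVersion`); the variable name `has_tdastats` is just an illustrative choice.

```r
# check whether TDAstats can be loaded, without attaching it
has_tdastats <- requireNamespace("TDAstats", quietly = TRUE)

if (has_tdastats) {
  # report which version was found
  message("TDAstats version ",
          as.character(utils::packageVersion("TDAstats")),
          " is installed")
} else {
  message("TDAstats was not found; try install.packages(\"TDAstats\") again")
}
```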

## Doing TDA in R

First, we load TDAstats and the relevant sample datasets into R's working memory.
```r
# make TDAstats functions available
library("TDAstats")

# make relevant TDAstats datasets available
data("unif2d")
data("circle2d")
```
The unif2d and circle2d datasets each contain the coordinates for 100 points in a 2-dimensional Cartesian space.
They are stored in R as matrices with 2 columns (1 for the x-coordinate and 1 for the y-coordinate) and 100 rows (1 row per point).
The two datasets are topologically quite distinct, as there is no discernible pattern in unif2d while the points in circle2d are clearly placed on a unit circle.
We can visually confirm this by plotting the points in each dataset.
```r
# plot the points in unif2d
plot(unif2d)

# plot the points in circle2d
plot(circle2d)
```
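The plots can be complemented with a quick numerical check. The sketch below defines a hypothetical helper, `summarize_cloud` (not part of TDAstats), and demonstrates it on stand-in data generated to match the description above rather than on the packaged datasets; drawing the uniform points from the unit square is an assumption made for illustration.

```r
# hypothetical helper (not part of TDAstats): basic facts about a point cloud
summarize_cloud <- function(pts) {
  # distance of every point from the origin
  radii <- sqrt(rowSums(pts^2))
  list(n_points = nrow(pts), n_dims = ncol(pts), radius_range = range(radii))
}

# stand-in data (an assumption): 100 points on a unit circle and 100 points
# drawn uniformly from the unit square, loosely mimicking circle2d and unif2d
set.seed(42)
angles <- runif(100, min = 0, max = 2 * pi)
circle_like <- cbind(cos(angles), sin(angles))
unif_like <- cbind(runif(100), runif(100))

# points on a unit circle all sit at distance 1 from the origin
summarize_cloud(circle_like)

# uniformly scattered points show a wide spread of distances
summarize_cloud(unif_like)
```

With TDAstats loaded, the same helper can be applied to the real data, e.g. `summarize_cloud(circle2d)`.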
This Gist will show you how to visualize the topological differences between the two datasets using TDAstats.
First, we calculate the persistent homology of each dataset.
```r
# calculate persistent homology for unif2d
phom.unif <- calculate_homology(unif2d)

# calculate persistent homology for circle2d
phom.circ <- calculate_homology(circle2d)
```
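Before plotting, it can help to look at what was computed. Assuming `calculate_homology` returns one row per topological feature with columns named `dimension`, `birth`, and `death`, the sketch below shows how such a result might be inspected. It uses a small hand-made matrix in that layout (`toy.phom`, with invented values, not real output) so that it runs on its own.

```r
# toy feature matrix in the assumed layout of calculate_homology() output;
# the numbers are made up for illustration
toy.phom <- cbind(
  dimension = c(0, 0, 0, 1, 1),
  birth     = c(0, 0, 0, 0.20, 0.35),
  death     = c(0.10, 0.15, 0.30, 1.70, 0.40)
)

# number of features found in each dimension
feature.counts <- table(toy.phom[, "dimension"])
feature.counts

# first few features (one row each)
head(toy.phom, n = 3)
```

Replacing `toy.phom` with `phom.unif` or `phom.circ` gives the same summary for the real results.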
Next, we visualize each dataset using topological barcodes.
```r
# plot barcode for unif2d
plot_barcode(phom.unif)

# plot barcode for circle2d
plot_barcode(phom.circ)
```
In the plots created by the code block above, note that the horizontal axes are scaled differently, so bar lengths cannot be compared by eye across the two plots.
Reading the values off the axes, the barcode for unif2d contains only short-lived features (all with a length under 0.25), whereas the barcode for circle2d contains a single, very persistent feature of dimension 1 (with a persistence of over 1.5).
In topological barcodes visualizing persistent homology, features of dimension 1 correspond to cycles in a point cloud.
In circle2d, the very long feature corresponds to the circle on which the points lie.
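The bar lengths read off the plots can also be computed directly, since the persistence of a feature is its death value minus its birth value. The sketch below is one way to do this under the assumption that the feature matrix has columns named `dimension`, `birth`, and `death`; `max_persistence` is a hypothetical helper (not part of TDAstats) and `toy.phom` contains invented values.

```r
# hypothetical helper (not part of TDAstats): longest-lived feature per dimension
max_persistence <- function(phom) {
  # persistence is how long a feature survives: death minus birth
  persistence <- phom[, "death"] - phom[, "birth"]
  # group by dimension and keep the largest value in each group
  vapply(split(persistence, phom[, "dimension"]), max, numeric(1))
}

# toy feature matrix (made-up values in the assumed layout)
toy.phom <- cbind(
  dimension = c(0, 0, 0, 1, 1),
  birth     = c(0, 0, 0, 0.20, 0.35),
  death     = c(0.10, 0.15, 0.30, 1.70, 0.40)
)

# named vector: one maximum persistence per dimension
max_persistence(toy.phom)
```

Applied to the real results, `max_persistence(phom.circ)` should report a dimension-1 value consistent with the long bar described above. The same features can also be drawn as a persistence diagram with `plot_persist(phom.circ)`.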
And that's it!
You should now be able to perform basic exploratory topological data analysis on point clouds using R.
To test your skills, try performing the above exercise for the unif3d and sphere3d sample datasets that come with TDAstats.
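One hint for this exercise: the characteristic feature of a sphere is an enclosed void (dimension 2) rather than a cycle, and `calculate_homology` only computes features up to dimension 1 unless its `dim` argument says otherwise. The following sketch shows one possible starting point; it is guarded so that it does nothing if TDAstats is not installed, and the variable names are illustrative.

```r
if (requireNamespace("TDAstats", quietly = TRUE)) {
  library("TDAstats")
  data("unif3d")
  data("sphere3d")

  # dim = 2 asks for features up to dimension 2 (voids)
  phom.unif3d <- calculate_homology(unif3d, dim = 2)
  phom.sphere <- calculate_homology(sphere3d, dim = 2)

  # draw barcodes as before; print() so the plots also appear under source()
  print(plot_barcode(phom.unif3d))
  print(plot_barcode(phom.sphere))
}
```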
There are many resources for learning more about topological data analysis, most of which are just a Google search away.
For more detailed tutorials on using TDAstats for topological data analysis (including conducting statistical inference for TDA), take a look at the vignettes on the CRAN page for TDAstats or with the code below.
```r
# check out the vignettes/tutorials for TDAstats
vignette("intro", package = "TDAstats")
vignette("inference", package = "TDAstats")
vignette("inputformat", package = "TDAstats")
```
If you have any suggestions on how to improve TDAstats, please report them here.
If you would like to contribute to TDAstats, you can fork the GitHub repository (hopefully, for an eventual pull request) here.