Skip to content

Instantly share code, notes, and snippets.

@rewee
Last active August 29, 2017 14:12
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save rewee/4e08e076515056235b64af944061f49a to your computer and use it in GitHub Desktop.
Save rewee/4e08e076515056235b64af944061f49a to your computer and use it in GitHub Desktop.
GSoC 2017 Work Product - Constrained Hierarchical Agglomerative Clustering

gsoc_logo

Student Developer: Shubham Chaturvedi
Mentors: Pierre Neuvial and Nathalie Villa-Vialaneix
Official Project Link: Constrained Hierarchical Agglomerative Clustering

Introduction

Adjacency-constrained hierarchical agglomerative clustering is hierarchical agglomerative clustering in which each observation is associated to a position, and the clustering is constrained so as only adjacent clusters are merged. It is a common method used widely in various application fields including ecology (Quaternary data) and bioinformatics (for instance in Genome Wide Association Studies). In most of these applications, the dataset is massive and it is pretty common to have the number of items to cluster of order of 10^5 to 10^6. Since previous available implementations of the adjacency-constrained hierarchical agglomerative clustering used an algorithm which is intrinsically quadratic in the number of items to cluster, there was a need for a method which is more efficient in time and space complexity. A low-level implementation of the such an algorithm was implemented in the existing beta version of the adjclust package.

In most of the applications which use adjacency-constrained HAC, the similarity between two items is a decreasing function of their physical distance. For example, linkage disequilibrium between SNPs over a 1Mb region of chromosome 22 in sample of Europeans looks as shown below.

ld

The above figure suggests that the LD signal is concentrated close to the diagonal. Thus it is reasonable to assume that similarity between two items is close to zero when the two items are at a distance of more than h, where h is a constant. Under this assumption, it is possible to propose an algorithm whose complexity in number of items to cluster p is quasi-linear in time (O(p(log(p)+h)) and linear in space (O(ph)). It takes as an input a band similarity matrix of size ph, and combines pre-calculation of certain partial sums of similarities with an efficient storage of the successive merges in a binary heap. This algorithm is completely described in the [2]. It is an extension of the algorithm described in [3].

adjclust is a package that provides methods to perform adjacency-constrained HAC. The aim of this GSoC project was to build a CRAN submittable version of adjclust package that performs adjacency-constrained HAC in an efficient manner.

Contributions during GSoC

Phase I

The main objective of phase-I was to create the basic package with minimal but sufficient features.

The beta version of the package which was previously available didn't have user level functions implemented.So only a part of the matrix, located in the neighborhood of the diagonal had to be passed to the previously available adjClustBand_heap function. To avoid this, a function band was written to written to take standard matrix or dist objects as input and generate the required data. Unit tests were written to justify the correctness of the band function.

Also the previously available version of adjClustBandHeap function had a silent assumption that similarity of an item with itself equal to 1 which means all the diagonal elements should be equal to 1. However this may not be the case for every matrix that is passed as input to the function. In this case,incorrect clustering results will be obtained. To avoid this, the clustering function should be modified to be able to accept inputs without any silent assumptions. A lot of thought was given on how this could be achieved. With the approval of mentors, the clustering function was redesigned so as to accept all kind of inputs without any silent assumptions. modify function was written for input validation and modification

The equivalence of the clustering function with maximum bandwidth with the existing rioja package was tested. Results were found to deviate when there (dis)simmilarity matrix had duplicate elements because it is ambiguous to chose minimum distance pair if there are multiple such pair. Also, small tests were written using testthat package in order to increase the code coverage of the package.These tests can be found in tests directory of adjclust. After each user level function was finalized, corresponding roxygen comments were written.Informative examples and missing bits, if any, were added in the documentation.

Phase II

Package was thoroughly checked for all errors, warnings and notes. The code was modified in order to pass R CMD check without any errors, warnings or notes.

Next step was to provide sparse matrix implementation. The implementation of constrained HAC in the adjclust package was modified so that it can handle situations where p is large (i.e. p>10^6), but the similarity matrix is possibly sparse and its signal mostly located near the diagonal. In such situations, the input data is typically a sparse matrix. A function sparseBand was designed to take care of input transformation to band in case of Matrix::dgCMatrix and Matrix::dsCMatrix sparse matrices.

After discussion with mentors, it was decided that package should provide support for both similarity as well as dissimilarity matrices. Thus the clustering function was restructed for the same. User will now have an argument type which can be set to similarity or dissimilarity. At this point of int the project, we found a problem with current implementation in case when cases where h was not equal to maximum bandwidth. Mathematical background for this case was studied carefully. It was decided to change current implementation of modify and modifySparse functions so that spurious results are not produced.

It was found that .tomatleft and .tomatright functions create computational bottleneck because of creation of p * h matrices which are then used to pre-calculate cumulative sums of similarities which are in turn required by the constrained HAC algorithm. Hence findMatL and findRmatR functions were written to bypass .tomatleft and .tomatright functions in case of standard matrices. For sparse matrix implementation, findSparseMatL and findSparseRmatR functions were written which performed the same task as their standard matrix counterpart. Unit tests were written to justify equivalence of standard matrix functions with corresponding sparse matrix functions.

Phase III

Genome-Wide Association Studies

In phase III, we provided user interface for specific application of Genome-Wide Association studies (GWAS). In GWAS, the objects to be clustered are p single Nucleotide Polymorphisms (SNPs). The typical input is either a pXp matrix of linkage disequilibrium (LD, see e.g. the matrix ‘ld.ceph’ in the vignette), or a nXp matrix giving the genotypes of p SNPs for n individuals (see e.g. the matrix ‘ceph.1mb’). Accordingly a function snpClust was written that performs adjacency-constrianed HAC from input in following formats :

  • LD matrix of class Matrix::dgCMatrix
  • genotype matrix of class base::matrix or snpStats::SnpMatrix

The correctness of snpClust was justified with unit tests. snpClust function was documented with an informative example. Also vignettes were provided on how they may be used on the data from the GWAS:see the GWAS vignette.

Hi-C data analysis

Another application specific interface was provided for Hi-C data analysis. In Hi-C data analyis, the objects to be clustered are p genomic regions or loci, and the input data is in the form of a contact map. A contact map is a p x p similarity matrix, where the similarity between any pair of loci quantifies the intensity of the physical interaction between these loci. As most pairs of loci have no interaction, the contact map is usually very sparse. Accordingly a function hicClust was written that performs adjacency-constrianed HAC from input in following formats :

  • p x p Matrix::dsCMatrix
  • object of class HiTC::HTCexp
  • text file with one line per pair of loci for which an interaction has been observed (in the format: locus1locus2signal)

The correctness of hicClust was justified with unit tests. hicClust function was documented with an informative example. Also vignettes were provided on how they may be used on the HiC data:see the HiC vignette.

After everything was finalized, the package was checked for erros, if any. An important thing to note is that adClustBand_heap function was renamed to adjClust to make it easily writeable. Minor changes (like indenatation) were made and package was finalized was work product submission.

Summary

We have completed almost all goals that we had intially planned. We were also able to accomodate time for new essential features(like support for type argument,HTCexpobjects) that were not planned initially. Implementation of single and complete linkage criterion is decided to be released in later versions.

Commits and repository

List of all commits made by during GSoC 2017 can be found here

Link to the repository that I have worked on :repository

References

[1] Clayton, D. (2015). snpStats: SnpMatrix and XSnpMatrix classes and methods. R package version 1.20.0

[2] Dehman A. (2015). Spatial clustering of linkage disequilibrium blocks for genome-wide association studies. Phd Thesis, Université Paris Saclay.

[3] Dehman, A. Ambroise, C. and Neuvial, P. (2015). Performance of a blockwise approach in variable selection using linkage disequilibrium information. BMC Bioinformatics 16:148.

[4] Servant N., Lajoie B.R., Nora E.P., Giorgetti L., Chen C., Heard E., Dekker J., Barillot E. (2012) HiTC : Exploration of High-Throughput 'C' experiments. Bioinformatics

[5] Fotuhi Siahpirani A, Ay F, Roy S. (2016) A multi-task graph-clustering approach for chromosome conformation capture data sets identifies conserved modules of chromosomal interactions. Genome Biology. 17: 114. PMID 27233632 DOI: 10.1186/s13059-016-0962-8

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment