The human immune system produces a vast variety of antibodies to respond to external stimuli. Next-generation sequencing technology allows researchers to obtain the sequences of all antibodies from a single person. Clustering these antibody sequences helps us understand how an antibody is produced. However, a single sample can contain on the order of one million antibody sequences, and clustering at that scale poses a major computational challenge.
The current algorithm for clustering antibody sequences computes a pairwise distance matrix and then performs hierarchical clustering to group the sequences into clusters. This algorithm is implemented in Python in the provided clonify_contest.py script.
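
To make the two steps concrete, here is a minimal sketch of the general approach: build a full pairwise distance matrix, then merge clusters bottom-up. It is not the contents of clonify_contest.py. The length-normalized edit distance, the average-linkage rule, the 0.2 cutoff, the example sequences, and all function names are assumptions chosen for illustration.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two strings (an assumed metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def pairwise_distances(seqs):
    """Full symmetric distance matrix: quadratic in the number of sequences."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        longest = max(len(seqs[i]), len(seqs[j]), 1)  # avoid division by zero
        d = edit_distance(seqs[i], seqs[j]) / longest
        dist[i][j] = dist[j][i] = d
    return dist


def hierarchical_clusters(dist, threshold):
    """Naive average-linkage agglomerative clustering, stopped at `threshold`."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(c1, c2):
        # Average distance over all cross-cluster pairs.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters (a < b by construction).
        (a, b), best = min(
            (((x, y), linkage(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical toy input; real samples are vastly larger.
seqs = ["CARDYW", "CARDFW", "CTRGGGYW", "CTRGGGFW"]
clusters = hierarchical_clusters(pairwise_distances(seqs), threshold=0.2)
print(clusters)
```

The sketch also shows where the cost comes from: n sequences require n(n-1)/2 distance computations, so one million sequences means roughly 5 x 10^11 pairs before any clustering starts.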