@code-bele
Last active August 26, 2024 17:08
GSoC 2024 : Integration of GRN Inference Methods into BEELINE

Integration of GRN Inference Methods into BEELINE

Introduction

BEELINE is a benchmarking framework for unsupervised algorithms that infer Gene Regulatory Networks (GRNs) from single-cell data. Since 2020, it has included 12 algorithms that use single-cell RNA sequencing data to predict GRNs, which describe interactions between transcription factors and their target genes. These cell-state-specific or cell-type-specific GRNs help explain the processes that govern cell differentiation, development and disease progression.

Recent advances in single-cell sequencing technologies allow varied information about cells to be captured. With the integration of these multi-omics data, a plethora of new algorithms has emerged to infer GRNs with greater predictability across samples. The primary goal of this effort was to update BEELINE by integrating a few of these new algorithms, datasets and evaluation techniques into the framework.

Multimodal Gene Regulatory Network Algorithms

Initially, the BEELINE framework was explored, and its existing algorithms were run on the synthetic and GSD datasets to obtain inferred GRNs and understand how the framework works. This was followed by studying the new algorithms, their code structure and their requirements. Test runs were done on the code to try to replicate the papers' findings. In most cases, synthetic and existing datasets were used to test each newly integrated algorithm in BEELINE before trying real-world datasets.

Many new methods are emerging to build GRNs from single-cell multi-omics data. The following are a few that integrate single-cell transcriptomics data with chromatin accessibility data. Moving from MICA to DeepMAPS, the algorithms become better at integrating modalities and inferring other details such as highly variable genes:

  • MICA uses mutual information, Spearman correlation, GENIE3 or L0L2 sparse regression to build the initial gene regulatory network from a co-expression matrix. L0L2 regression combines sparse regression with an L2 penalty to remove weak edges and strengthen significant ones. This network is then updated based on chromatin accessibility information and regulators. The ATAC information helps narrow down targets and adds information when datasets are small. MICA is primarily designed for embryonic development datasets. Initially, these methods were tested on synthetic datasets, and the Docker image was built using the same datasets. This image was then tested with real RNA datasets alone, followed by multi-omics datasets. The ATAC datasets usually require some preprocessing to obtain regulatory gene and target gene combinations with which to update the built GRN. Some of the preprocessing methods used are described in the section on working with real-world datasets.

  • scTIE uses autoencoders to infer gene regulatory networks with a unified framework for multiple modalities and time-specific information. It finds an embedding space for the transcriptomics data and the chromatin accessibility data separately. The two embeddings are then mapped together to improve the network built. scTIE requires datasets across multiple time points to infer state-specific GRNs. It uses optimal transport iteratively to align cells in similar states across the ATAC and RNA embeddings and to obtain transition probabilities. The loss, which fine-tunes the embeddings through backpropagation, incorporates modality integration, reconstruction of the input and pairing of different time states.

    The results of the scTIE algorithm on the HSPC differentiation data were replicated to obtain the embeddings and gradients. How these are used further depends on differential expression analysis and on future updates to the scTIE GitHub repository.

  • DeepMAPS uses heterogeneous graph transformers to integrate different modalities of sequencing data and infer gene regulatory networks. It builds a graph connecting cells and genes (a bipartite GNN), which can be expanded into a multi-relation knowledge graph if more modalities are provided. During training, it breaks the large graph into subgraphs and learns relationships within each before integrating them. The scRNA_scATAC integration was used to infer GRNs from the 10X dataset of lymphoma cells obtained from a lymph node, as given in the GitHub repository. The existing Docker image of DeepMAPS helped manage the system requirements while running the transformer on the GPU machine. The results in the tutorial were replicated to obtain this UMAP of 12 cell clusters in the lymph node.

    [Figure: UMAP of the 12 cell clusters in the lymph node]

    Further, an adjacency matrix with the RAS1 score (regulon activity score) was also obtained and is used as the basis of the gene regulatory network. A regulon is a set of genes regulated by the same transcription factor.
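
The two-stage idea described for MICA (a co-expression network that is then narrowed using accessibility-derived regulator and target pairs) can be sketched roughly as follows. This is a minimal illustration and not MICA's code: it uses only Spearman correlation, models the ATAC update as a plain filter (an assumption), and all gene names, values, the threshold and the function names are hypothetical.

```python
def ranks(values):
    """Return 1-based average ranks, so that tied values share a rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def coexpression_edges(expr, regulators, threshold=0.5):
    """Initial network: regulator -> target edges with |rho| >= threshold."""
    edges = {}
    for regulator in regulators:
        for target in expr:
            if target == regulator:
                continue
            rho = spearman(expr[regulator], expr[target])
            if abs(rho) >= threshold:
                edges[(regulator, target)] = rho
    return edges


def refine_with_atac(edges, atac_pairs):
    """Keep only edges supported by an ATAC-derived (regulator, target) pair."""
    return {edge: rho for edge, rho in edges.items() if edge in atac_pairs}


# Toy expression matrix: gene -> expression across five cells (made-up values).
expr = {
    "TF1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "G1": [2.0, 4.1, 6.2, 7.9, 10.0],  # rises with TF1
    "G2": [5.0, 4.0, 3.5, 2.0, 1.0],   # falls as TF1 rises
    "G3": [3.0, 1.0, 4.0, 1.5, 3.2],   # no clear relationship
}
# Hypothetical regulator-target pairs produced by ATAC preprocessing.
atac_pairs = {("TF1", "G1"), ("TF1", "G3")}

initial = coexpression_edges(expr, regulators=["TF1"])
refined = refine_with_atac(initial, atac_pairs)
print(sorted(initial))
print(sorted(refined))
```

In this toy case the correlation step keeps the edges to G1 and G2, and the accessibility filter then drops G2 because no ATAC-derived pair supports it. G3 has an ATAC pair but is never added, since its co-expression is too weak.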

Pull request made: Murali-group/Beeline#115

Visualising GRNs

While BEELINE has a BLPlot method for visualising the results of GRN evaluation, it did not have a GRN visualiser. Using the igraph package, a simple network visualiser that shows the edge weights of a GRN was built. This method in BLPlot visualises the top 30 interactions and lists their associated weights. Self-interactions are also included in these networks. An example visualising the output of SCNS (an already existing method):

[Figure: GRN visualisation of the SCNS output]
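
The selection step behind such a plot can be sketched as below. This is an illustration only and not the code in the pull request: it assumes a tab-separated ranked-edge table with the columns Gene1, Gene2 and EdgeWeight, ranks edges by absolute weight (an assumption), and leaves the drawing itself to a graph library such as igraph. The function name and the toy edges are hypothetical.

```python
import csv
import io


def top_interactions(rows, k=30):
    """Return the k edges with the largest absolute weight, self-loops included."""
    edges = [(r["Gene1"], r["Gene2"], float(r["EdgeWeight"])) for r in rows]
    edges.sort(key=lambda edge: abs(edge[2]), reverse=True)
    return edges[:k]


# Stand-in for a ranked-edges file; gene names and weights are made up.
ranked_edges_tsv = """Gene1\tGene2\tEdgeWeight
g1\tg2\t0.90
g2\tg2\t0.75
g3\tg1\t-0.60
g2\tg3\t0.10
"""

rows = list(csv.DictReader(io.StringIO(ranked_edges_tsv), delimiter="\t"))
top = top_interactions(rows, k=3)

# List the selected edges with their weights, flagging self-interactions.
for source, target, weight in top:
    label = "self-interaction" if source == target else "interaction"
    print(f"{source} -> {target}\t{weight:+.2f}\t({label})")
```

The resulting list of (source, target, weight) tuples is the kind of input a graph library needs to draw a directed, weighted network.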

Working with Real-world Datasets and Experimental Datasets

Apart from working with simple synthetic datasets to integrate the algorithms into BEELINE, several real-world datasets mentioned in this recent paper were used to test the integrated algorithms. These included paired and unpaired multi-omics datasets, which required detailed and separate preprocessing before they could be used in the models. scTIE cannot be used for most of these datasets, as they do not contain time point information. DeepMAPS reads hierarchical data structure files with multiple modalities, with the option of providing the scRNA and scATAC matrices separately, so it does not require much preprocessing beyond format conversion. It also uses reference genome and JASPAR transcription factor files.

The majority of the preprocessing is required for MICA, which needs the scATAC file to contain information about regulatory gene and target gene relationships. Most publicly available datasets do not have this information in the peaks matrix of the scATAC file. In the MICA paper, the processing is done on the BAM or FASTQ files using Nextflow pipelines. However, these files are not provided with many paired public datasets, which usually provide processed files in RDS, h5ad, CSV or similar formats. Two approaches were explored and used to obtain the processed ATAC file and regulators file:

  • Using a Spearman-correlation-based approach to obtain important regulatory genes for the peaks seen after combining the scATAC and scRNA files, followed by annotating the chromosome locations of the peaks with the relevant gene names using API calls to Ensembl.
  • Using bedtools to obtain the gene closest to each peak and the genes intersecting each peak, relating these two findings based on the locations of the peak and the gene, and finally obtaining gene names using API calls to Ensembl.
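
The closest-gene idea in the second approach can be illustrated with a small self-contained sketch. It does not call bedtools or Ensembl and does not reproduce their behaviour exactly: distance is measured from the peak midpoint to a single gene start coordinate, which is a simplifying assumption, and the coordinates, gene names and function names are hypothetical.

```python
from bisect import bisect_left


def parse_peak(peak):
    """Split a peak name such as 'chr1:1000-1500' into (chromosome, start, end)."""
    chrom, span = peak.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


def closest_gene(peak, genes_by_chrom):
    """Return (gene, distance) for the gene start nearest to the peak midpoint.

    genes_by_chrom maps a chromosome to a list of (start, gene) tuples
    sorted by start. Returns None if the chromosome has no genes.
    """
    chrom, start, end = parse_peak(peak)
    genes = genes_by_chrom.get(chrom, [])
    if not genes:
        return None
    midpoint = (start + end) // 2
    positions = [position for position, _ in genes]
    i = bisect_left(positions, midpoint)
    # Only the genes on either side of the insertion point can be the closest.
    candidates = genes[max(i - 1, 0): i + 1]
    position, name = min(candidates, key=lambda gene: abs(gene[0] - midpoint))
    return name, abs(position - midpoint)


# Made-up annotation: gene start coordinates on one chromosome.
genes_by_chrom = {
    "chr1": sorted([(1_000, "GENE_A"), (5_000, "GENE_B"), (9_000, "GENE_C")]),
}
peaks = ["chr1:4600-4800", "chr1:7400-7600", "chr2:100-200"]

assignments = {peak: closest_gene(peak, genes_by_chrom) for peak in peaks}
for peak, hit in assignments.items():
    print(peak, hit)
```

Peaks on a chromosome without annotated genes are left unassigned (None) rather than forced onto a gene, so they can be dropped or handled separately before building the regulators file.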

Most of the earlier datasets are unpaired, whereas recent 10X datasets and others are paired: gene expression, chromatin accessibility and protein content are captured together from the same cells. This reduces batch effects and makes integration easier, so paired data are preferred for GRN building and give improved accuracy. Further, DeepMAPS and scTIE integrate multiple modalities through deep networks, whereas MICA is expected to work better with paired datasets. Although many datasets were explored, preprocessing was finally done on the Mouse retina dataset, which is unpaired but whose modalities are highly correlated, and on the Human lymph node lymphoma and T cell depleted Bone marrow datasets, which are paired.
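
One quick way to tell whether two modality files come from the same cells is to compare their cell barcodes. The sketch below is a hypothetical check and not part of any of the tools above. It assumes both files use the same barcode naming, and the barcodes shown are made up.

```python
def pairing_summary(rna_barcodes, atac_barcodes):
    """Compare the cell barcodes of two modalities and report their overlap."""
    rna, atac = set(rna_barcodes), set(atac_barcodes)
    shared = rna & atac
    union = rna | atac
    return {
        "shared": sorted(shared),
        "rna_only": sorted(rna - atac),
        "atac_only": sorted(atac - rna),
        # Fraction of all observed cells that appear in both modalities.
        "overlap_fraction": len(shared) / len(union) if union else 0.0,
    }


# Made-up barcodes for illustration.
rna_barcodes = ["AAAC-1", "AAAG-1", "AACT-1"]
atac_barcodes = ["AAAC-1", "AACT-1", "AGGT-1"]

summary = pairing_summary(rna_barcodes, atac_barcodes)
print(summary)
```

An overlap fraction close to 1 suggests a paired dataset, while an overlap near 0 suggests that the modalities were measured in different cells and need an integration step first.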

Future Work

Each algorithm reads inputs in a different format and needs different preprocessing steps depending on the file types. It is therefore essential to obtain or devise common methods for reading and preprocessing publicly available and experimental datasets, which can come in different formats with different internal structures. These steps need to be made uniform, at least per algorithm. This would make the algorithms easier to use across file types, so that more real-world and experimental datasets could be used to benchmark different methods.

Once the scTIE repository is updated with the gene regulatory network method, it can be used in conjunction with the existing methods for finding embeddings and gradients to construct and infer GRNs. Further, more evaluation techniques, such as the eigenvector centrality mentioned in DeepMAPS, can be added to BEELINE’s framework.
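
As a rough idea of what such a centrality-based evaluation could look like, the sketch below computes the centrality measure mentioned above (usually called eigenvector centrality) for the nodes of a small network by power iteration. It is an assumption-laden illustration and not DeepMAPS's or BEELINE's implementation: the GRN is treated as undirected and unweighted, and the edges and the function name are hypothetical.

```python
def eigenvector_centrality(edges, iterations=200, tol=1e-10):
    """Eigenvector centrality of an undirected, unweighted graph.

    Uses the update x <- x + A x (a shifted power iteration), which avoids
    the oscillation that plain power iteration shows on bipartite graphs.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    scores = {node: 1.0 for node in neighbours}
    for _ in range(iterations):
        updated = {
            node: scores[node] + sum(scores[m] for m in neighbours[node])
            for node in neighbours
        }
        # Normalise to unit length so the scores stay bounded.
        norm = sum(value * value for value in updated.values()) ** 0.5
        updated = {node: value / norm for node, value in updated.items()}
        change = max(abs(updated[node] - scores[node]) for node in neighbours)
        scores = updated
        if change < tol:
            break
    return scores


# Toy network: TF1 regulates three genes, TF2 regulates one of them.
edges = [("TF1", "G1"), ("TF1", "G2"), ("TF1", "G3"), ("TF2", "G3")]
centrality = eigenvector_centrality(edges)

for node, score in sorted(centrality.items(), key=lambda item: -item[1]):
    print(f"{node}\t{score:.3f}")
```

Here TF1 receives the highest score because it is connected to the most nodes, including the well-connected G3, which is the kind of ranking that could be compared across inference methods.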

Moreover, the multi-omics-based methods for gene regulatory network inference could be expanded further. Several other methods, like SCENIC, also make comprehensive use of chromatin accessibility data and gene expression information to infer gene regulatory networks. Recently, methods that incorporate spatial transcriptomics data have been shown to improve inference by constructing region-specific, state-specific and cell-type-specific GRNs with greater accuracy. These include CLARIFY, SCING and SCIPro, which come with powerful frameworks and enable new findings based on the obtained GRNs. They are also accompanied by their own methods for evaluating gene regulatory networks.

Lastly, visualising GRNs and running the algorithms can be abstracted for a wider user base. For instance, Cytoscape or similar tools can be used for dynamic visualisation on the web. In the long run, using cloud GPU machines and hosting BEELINE publicly with a user interface could help scientists and researchers build and infer gene regulatory networks with greater ease.

Links and References

  1. https://github.com/Murali-group/Beeline
  2. https://www.nature.com/articles/s41592-019-0690-6
  3. https://www.biorxiv.org/content/10.1101/2024.02.01.578507v1.full
  4. scTIE paper
  5. DeepMAPS paper
  6. MICA paper
  7. SCENIC
  8. SCING
  9. SCIPro
  10. CLARIFY

Acknowledgements

I would like to express my gratitude to Yiqi Su, Maryam Haghani, and Prof. T.M. Murali for providing me with this opportunity and for their invaluable guidance throughout my contributions to BEELINE. I also extend my thanks to IBAB, Bengaluru, for granting access to the GPU machine that enabled the execution of the algorithms mentioned above. Finally, I would like to thank Shweta Ramdas for introducing me to the opportunity that is Google Summer of Code.
