Notice: This post has been published on the International Journal of Epidemiology's Blog. This repository contains the source for the post. The published version may contain some additional minor copyedits. Please refer to the IJE Blog for the authoritative version.
Genome-wide association study gives rise to a new breed of disease network
By Daniel Himmelstein
A puzzling similarity
Researchers have long noted puzzling similarities between Hodgkin lymphoma and multiple sclerosis. Although the first is a cancer and the second is an autoimmune disease, risk for both diseases appears to increase with exposure to the Epstein–Barr virus and a lack of sunlight. In fact, having a family member with multiple sclerosis may place you at increased risk for Hodgkin lymphoma and vice versa. Now, a recent study, on which I am a co-author, has identified genetic similarities.
Our analysis compared two studies designed to pinpoint the genetic variants behind disease susceptibility. Together, these studies, referred to as genome-wide association studies or GWAS, analysed 1,816 Hodgkin lymphoma patients, 9,772 multiple sclerosis patients, and 25,255 healthy individuals. The prevalence of 404,069 genetic variants was compared between patients and healthy individuals for each disease. We found a large number of variants that appeared to affect susceptibility in both diseases. Additionally, a genetic risk model designed for multiple sclerosis also predicted Hodgkin lymphoma.
Building a network
We were excited to find common genetic signals, but the similarity lacked context. For example, is Hodgkin lymphoma more similar to other cancers than to multiple sclerosis? To answer such questions, we needed GWAS results for many diseases. We turned to the GWAS Catalog, whose team of curators reads through GWAS publications and extracts the associated variants into a public database.
For 82 diseases, we identified associated regions of the genome, called loci. Then for each disease pair, we calculated a similarity score based on the number of overlapping loci. The score adjusts for the number of loci per disease, which can vary widely: multiple sclerosis has 55 loci, whereas Hodgkin lymphoma has 7. Of the 3,321 possible disease pairs, 433 had at least one overlapping locus.
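To make the pairwise scoring concrete, here is a minimal Python sketch. It is an illustration only: the disease names and locus labels are made up, and the Jaccard index stands in for our actual similarity score, which this post does not spell out. Like our score, the Jaccard index adjusts for how many loci each disease has.

```python
from itertools import combinations
from math import comb

# Hypothetical locus sets per disease (all labels are invented for illustration).
disease_loci = {
    "disease_A": {"locus_1", "locus_2", "locus_3", "locus_4"},
    "disease_B": {"locus_2", "locus_3"},
    "disease_C": {"locus_5"},
}


def overlap_similarity(loci_a, loci_b):
    """Jaccard index (an assumed stand-in): shared loci / loci in either disease."""
    union = loci_a | loci_b
    return len(loci_a & loci_b) / len(union) if union else 0.0


# Score every unordered pair of diseases.
similarities = {
    (a, b): overlap_similarity(disease_loci[a], disease_loci[b])
    for a, b in combinations(sorted(disease_loci), 2)
}

# The number of unordered pairs among 82 diseases is "82 choose 2".
n_pairs = comb(82, 2)
print(n_pairs)  # 3321
print(similarities[("disease_A", "disease_B")])  # 0.5
```

The pair count explains the figure quoted above: 82 diseases yield 82 × 81 / 2 = 3,321 possible pairs.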
Next, we calculated the network proximity of disease pairs. Proximity is calculated by transmitting similarity scores between related diseases. By leveraging the same insight as PageRank — the founding algorithm behind Google's web search — the approach helps improve the robustness and connectedness of our network.
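The idea of transmitting similarity through the network can be sketched as a random walk with restart, a close relative of PageRank. This is a simplified illustration rather than our exact procedure: the function name, the restart probability of 0.3, and the toy similarity matrix are all assumptions.

```python
import numpy as np


def network_proximity(similarity, restart=0.3, tol=1e-10, max_iter=1000):
    """Random walk with restart, seeded at each disease in turn.

    Column j of the result holds the stationary visiting probabilities
    of a walker that keeps restarting at disease j.
    """
    sim = np.asarray(similarity, dtype=float)
    n = sim.shape[0]
    col_sums = sim.sum(axis=0)
    # Column-normalise so each column is a probability distribution;
    # columns for diseases without any neighbour stay all zero.
    transition = np.divide(sim, col_sums, out=np.zeros_like(sim), where=col_sums > 0)
    seeds = np.eye(n)
    proximity = seeds.copy()
    for _ in range(max_iter):
        updated = (1 - restart) * (transition @ proximity) + restart * seeds
        converged = np.abs(updated - proximity).max() < tol
        proximity = updated
        if converged:
            break
    return proximity


# Toy similarity matrix: A-B and B-C overlap, but A and C share no locus.
toy_similarity = [
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.4],
    [0.0, 0.4, 0.0],
]
toy_proximity = network_proximity(toy_similarity)
print(toy_proximity[0, 2] > 0)  # True: A and C are linked through B
```

In the toy example, diseases A and C have zero direct similarity yet receive a positive proximity because both resemble B. That is the sense in which propagation improves connectedness.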
Below we display our network of disease proximities. See these tables for disease abbreviations and specific proximity scores (plotted with edge thickness). We applied a layout that pushes proximal diseases together and distant diseases apart.
Autoimmune diseases form a distinct cluster. Solid cancers cluster as well but less cohesively. And the three blood cancers span from the solid cancer to autoimmune extremes. Multiple myeloma sits in solid cancer territory; chronic lymphocytic leukemia rests in between; and Hodgkin lymphoma resides with the autoimmune. We interpret these findings as evidence that Hodgkin lymphoma is special amongst cancers in that its genetics align primarily with autoimmune disease.
Why is this important? Complex human diseases, such as Hodgkin lymphoma and multiple sclerosis, are often poorly understood, complicating prevention and treatment efforts. We hypothesize that genetically similar diseases will share more than just genetics. As lead author Dr. Khankhanian explains, "genetic similarity between diseases may have clinical implications. Drugs that treat one disease may be repurposed to treat a genetically similar disease." Likewise, two diseases with similar genetics may also share risk factors, as we see with Hodgkin lymphoma and multiple sclerosis.
Three examples of the new breed
Why do we consider our approach a new breed of disease network? Many early approaches, such as this prominent example, were exposed to two biases. First, the genetic profiles used to describe each disease relied on targeted studies of disease association. Such studies are biased by researchers' existing knowledge. GWAS offers a systematic, comprehensive, and hypothesis-free alternative. However, the early GWAS-based approaches suffered from a second bias. As Dr. Ben Voight — Assistant Professor of Systems Pharmacology and Translational Therapeutics at the University of Pennsylvania — explains, "for many loci we just don't know what the causal variant(s) are, and we certainly don't know the causal gene(s) linked to these variants." Approaches which require converting GWAS variants to genes introduce bias and potentially obscure signals.
Here, we investigate the new breed of approaches that avoid the two biases. The disease networks we mention below use only GWAS data and do not operate in gene space. Our method operates on loci rather than genes. To define loci, we identify a region around each lead variant uncovered by GWAS. The region boundaries are calculated by looking at the patterns of variation across the human genome. Farh et al. 2015 took a similar approach, which also used the GWAS Catalog loci (see Figure 1a and the "Shared genetic loci" section).
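As a rough sketch of what operating on loci means, the Python below builds a region around each lead variant and tests whether two regions overlap. It simplifies one step: the boundaries here come from a fixed flanking distance (an assumed 50,000 base pairs), whereas our method derives them from patterns of variation across the genome. The names and positions are hypothetical.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    chromosome: str
    start: int
    end: int


def locus_around(chromosome, position, flank=50_000):
    """Region around a lead variant (fixed flank is a simplifying assumption)."""
    return Locus(chromosome, max(0, position - flank), position + flank)


def loci_overlap(a, b):
    """Two loci overlap if they sit on the same chromosome and their intervals intersect."""
    return a.chromosome == b.chromosome and a.start <= b.end and b.start <= a.end


# Invented lead-variant positions for two diseases.
locus_x = locus_around("6", 1_000_000)
locus_y = locus_around("6", 1_060_000)
locus_z = locus_around("7", 1_000_000)

print(loci_overlap(locus_x, locus_y))  # True
print(loci_overlap(locus_x, locus_z))  # False
```

Note that two diseases can share a locus under this definition even when their lead variants differ, and no variant ever has to be assigned to a gene.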
Both Farh et al. and our approach faced the same hurdles. We both applied p-value filters to remove low-confidence associations. We both condensed multiple studies on the same disease. Additionally, some GWAS lack statistical power and should thus be discarded. Accordingly, Farh et al. excluded studies with fewer than 6 significant variants, while we excluded studies on fewer than 1,000 individuals. Consequently, the Farh network is considerably smaller with just 39 diseases. However, both networks offer a genome-wide glimpse into the genetic similarities between complex diseases.
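The filtering and condensing steps described above might look something like the following Python sketch. The 1,000-individual minimum comes from our analysis as described here; the p-value cutoff of 5e-8 is the conventional genome-wide significance level and is assumed for illustration, as are the record layout and all example values.

```python
from collections import defaultdict

P_VALUE_THRESHOLD = 5e-8  # assumed cutoff (conventional genome-wide significance)
MIN_SAMPLE_SIZE = 1000    # studies on fewer individuals are excluded

# Invented association records; the field names are hypothetical.
associations = [
    {"disease": "disease_A", "study": "study_1", "samples": 5000, "variant": "variant_1", "p_value": 1e-12},
    {"disease": "disease_A", "study": "study_2", "samples": 8000, "variant": "variant_2", "p_value": 3e-9},
    {"disease": "disease_A", "study": "study_2", "samples": 8000, "variant": "variant_3", "p_value": 1e-5},
    {"disease": "disease_B", "study": "study_3", "samples": 400, "variant": "variant_1", "p_value": 1e-10},
]


def condense(records):
    """Drop underpowered studies and weak associations, then merge studies per disease."""
    variants_by_disease = defaultdict(set)
    for row in records:
        if row["samples"] < MIN_SAMPLE_SIZE:
            continue  # study too small to trust
        if row["p_value"] > P_VALUE_THRESHOLD:
            continue  # association not confident enough
        variants_by_disease[row["disease"]].add(row["variant"])
    return dict(variants_by_disease)


condensed = condense(associations)
print(condensed)
```

Here the two studies of disease A are merged into one variant set, the weak association with variant_3 is dropped, and disease B disappears entirely because its only study is too small.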
Nonetheless, these methods are not without limitations. As Dr. Voight notes, oftentimes there may be "multiple associations at the same locus arising from different variants." Our approaches interpret variant co-localization as shared genetic architecture, which is not always the case. Dr. Voight continues that even if two diseases associate with the same variant, "the risk allele for one disease may be protective for the other." Bulik-Sullivan et al. 2015 sidestepped these concerns by analysing trends in summary statistics across all variants. The drawback is that genome-wide summary statistics are lamentably not always available. Hence, the Bulik-Sullivan analysis focused on only 24 traits with poor disease coverage. The study uncovered several cases where the genetic profiles of two diseases were anti-correlated (see red in Figure 2). These cases are particularly interesting as our method would overlook the opposing genetic nature of the two diseases.
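A toy Python example shows why direction of effect matters. It uses a plain Pearson correlation of per-variant effect scores, which is a deliberate simplification and not the statistical method of Bulik-Sullivan et al.; the scores themselves are invented.

```python
import numpy as np

# Invented per-variant effect scores (z-scores) for two diseases,
# measured at the same four variants.
z_disease_a = np.array([5.2, -4.8, 0.3, 6.1])
z_disease_b = np.array([-4.9, 5.5, -0.1, -5.7])

# A locus-overlap method only asks whether both diseases show a strong
# signal at a variant, regardless of sign.
strong_in_both = (np.abs(z_disease_a) > 4) & (np.abs(z_disease_b) > 4)
n_shared = int(strong_in_both.sum())

# A correlation across variants also captures the direction of effect.
effect_correlation = float(np.corrcoef(z_disease_a, z_disease_b)[0, 1])

print(n_shared)                # 3
print(effect_correlation < 0)  # True: the effects point in opposite directions
```

An overlap-based method would count three shared signals and call these diseases similar, while the negative correlation reveals that alleles raising risk for one tend to lower risk for the other.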
In closing, GWAS has given rise to a new breed of disease similarity network. These networks offer unbiased insights into commonalities between diseases. Here we explored three approaches and their trade-offs. While we initially constructed the disease network to contextualize the similarity between Hodgkin lymphoma and multiple sclerosis, we created a general resource covering 82 diseases. And we've dedicated the code and data for our network, available on GitHub, to the public domain.
This blog post is based on research from:
Khankhanian, Cozen, Himmelstein, Madireddy, Din, van den Berg, Matsushita, Glaser, Moré, Smedby, Baranzini, Mack, Lizee, de Sanjosé, Gourraud, Nieters, Hauser, Cocco, Maynadié, Foretova, Staines, Delahaye-Sourdeix, Li, Bhatia, Melbye, Onel, Jarrett, Mckay, Oksenberg & Hjalgrim (2016) Meta-analysis of genome-wide association studies reveals genetic overlap between Hodgkin lymphoma and multiple sclerosis. Int J Epidemiol. DOI: 10.1093/ije/dyv364
The study was an international collaboration between researchers at 17 institutions. Lead author, Pouya Khankhanian, began the study as a medical student at the University of California, San Francisco. He is now a Resident Physician of Neurology at the University of Pennsylvania where he takes a statistical approach to mining molecular and clinical data in the Center for Neuroengineering and Therapeutics. Corresponding author, Dr. Henrik Hjalgrim, is a researcher of Epidemiology at the Statens Serum Institute in Copenhagen, Denmark.
Daniel Himmelstein — @dhimmel on Twitter — is a PhD Candidate in Biological & Medical Informatics at UCSF in the Baranzini Lab. His research focusses on integrating public data for biomedical discovery. Daniel's recent projects include using networks to repurpose drugs and predict disease-associated genes; analysing the relationship between altitude, oxygen, and lung cancer; and exploring delays in scientific publishing.