Review of TIGA (Target illumination GWAS analytics) preprint v1

Review of TIGA Preprint v1 by Daniel Himmelstein on 2020-12-01

Resources

TIGA: Target illumination GWAS analytics
Jeremy J. Yang, Dhouha Grissa, Christophe G. Lambert, Cristian G. Bologa, Stephen L. Mathias, Anna Waller, David J. Wild, Lars Juhl Jensen, Tudor I. Oprea
bioRxiv (2020-11-12) https://www.biorxiv.org/content/10.1101/2020.11.11.378596v1
DOI: 10.1101/2020.11.11.378596

Review

This study approaches a difficult and important problem. GWAS links variants to diseases. However, users often need links between genes and diseases, and seek a comprehensive, systematic catalog of gene–disease associations based on GWAS evidence. This presents two challenges. First, identifying the causal genes that drive each variant association is difficult. Second, many studies provide partially overlapping evidence of varying quality that must be unified. Let's refer to these two problems as "variant-to-gene mapping" and "association integration".

This study proposes methods to overcome these challenges and applies them to create a ranked list of genes with GWAS evidence for 383 traits/diseases. The study focuses on producing datasets and a webapp that will be useful for prioritizing drug targets, although any method that provides an integrated view of associations between genes and diseases will have many applications. The manuscript notes that the gene–disease associations from this study are used as input to the general-purpose DISEASES database and to PHAROS.

Variant-to-gene mapping

TIGA ... does employ mappings provided by the Catalog between GWAS SNPs and genes, generated by the POSTGAP Ensembl pipeline.

The manuscript does not provide further information on the variant-to-gene mapping. The methods used by these upstream sources and tools should be summarized so readers can properly understand the data. The questions I am left with:

  1. Are the GWAS Catalog variant-to-gene mappings based solely on distance?
  2. The POSTGAP README diagram mentions several methods of fine mapping and experimental data sources including epigenetics, PCHi-C, eQTL, VEP, GERP, and GENCODE. Does the version of POSTGAP used by the GWAS Catalog include all of these sources?
  3. Many applications benefit from gene assignments that are not biased by human knowledge but instead derive solely from systematic high-throughput assays that are indifferent to human expectations of the causal biology. TIGA does not use "author-reported" genes, which often carry human knowledge bias. But do any of the sources used by the GWAS Catalog introduce knowledge bias?
  4. Is there a maximum distance beyond which a gene cannot be mapped to a variant? The GWAS Catalog FAQ mentions "all Ensembl and RefSeq genes mapping within 50kb upstream and downstream of each GWAS Catalog variant", but Supplementary Fig 1 shows genes up to 100 kilobases away.
  5. How many genes are assigned per variant (e.g. the histogram of counts)? Do variants that map to multiple genes end up contributing to rankings more than variants that map to a single gene?
  6. How do the inherited GWAS Catalog methods compare to the Open Targets Genetics Portal methods?

In short, the variant-to-gene mapping is a crux of TIGA, but insufficient detail is provided. Even though these methods are not original to TIGA, the essential characteristics should be summarized to orient readers.

Association integration

TIGA introduces 10 metrics for summarizing all associations for a gene-trait pair. Only 3 of these metrics are used for the combined meanRankScore (a rough sketch of the combination follows the list):

  • N_snpw: N_snp weighted by distance inverse exponential described above.
  • pVal_mLog: median(-Log(pValue)) supporting gene-trait association.
  • RCRAS: Relative Citation Ratio (RCR) Aggregated Score (iCite-RCR-based), described above.
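As a minimal sketch of the combination, assuming meanRankScore is simply the mean of per-metric ranks (the manuscript's exact rank direction, tie handling, and scaling may differ; all column names and values below are my own placeholders):

```python
import pandas as pd

# Hypothetical gene-trait association table with the three component metrics.
# Column names, values, and rank direction are my assumptions, not the authors' code.
assoc = pd.DataFrame({
    "gene": ["HLA-DQA1", "B4GALNT1", "IL7R"],
    "n_snpw": [12.4, 1.0, 3.2],
    "pval_mlog": [234.0, 30.0, 18.5],
    "rcras": [5.1, 0.8, 2.3],
})

# Rank each metric so that larger values get better (smaller) ranks,
# then average the per-metric ranks into a single combined score.
ranks = assoc[["n_snpw", "pval_mlog", "rcras"]].rank(ascending=False)
assoc["mean_rank"] = ranks.mean(axis=1)
print(assoc.sort_values("mean_rank"))
```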

From a user's perspective the following metrics would be most interesting:

  1. The effect size of the gene-disease association
  2. The significance of the gene-disease association
  3. The probability that the gene-disease association exists

TIGA does provide measures of effect size (median OR) and significance (median pVal_mLog). But both aggregate multiple associations using the median. Why not use meta-analysis methods that are designed to combine effect sizes and p-values while weighting by the reliability of each study? Applying meta-analysis methods would be imperfect, because they assume all associations refer to the same SNP. But it would address shortcomings of the median (a sketch of one weighted approach follows the list):

  • Low quality studies should not receive equal weight to high quality studies.
  • A p-value from an underpowered study should not detract from a p-value from a well-powered study.
  • Significance should only grow when multiple significant associations are combined. That is, independent sources of statistical evidence for an association should decrease the probability that the association arose by chance.
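For illustration, here is a minimal sketch of a weighted Stouffer combination, one standard meta-analytic option (the sqrt-of-sample-size weights and toy numbers are my own assumptions, not a recommendation of specific inputs). Unlike the median, a well-powered study dominates, and concordant evidence strengthens rather than dilutes the combined significance:

```python
import numpy as np
from scipy import stats

def combine_pvalues_weighted(pvalues, sample_sizes):
    """Stouffer's method: combine one-sided p-values with
    weights proportional to the square root of sample size."""
    z = stats.norm.isf(np.asarray(pvalues, dtype=float))   # p -> z
    w = np.sqrt(np.asarray(sample_sizes, dtype=float))
    z_combined = (w * z).sum() / np.sqrt((w ** 2).sum())
    return stats.norm.sf(z_combined)                        # combined z -> p

# Toy example: a weak early GWAS plus a well-powered later GWAS.
# The median would sit between the two p-values; the weighted
# combination lets the strong result dominate.
print(combine_pvalues_weighted([0.04, 1e-30], [1_000, 40_000]))
```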

Using the TIGA webapp to find multiple sclerosis (EFO_0003885) hits, the top result is B4GALNT1, with a pVal_mLog of 30 based on a single study (N_study is 1). Meanwhile, on the GWAS Catalog, the top association is the SNP rs3104373 mapped to gene HLA-DQA1, with a p-value of 1e-234 and an odds ratio of 2.9 [CI 2.72–3.09] from a study with 4,888 cases and 10,395 controls.

The B4GALNT1 association in TIGA was uncovered by a study of 14,802 cases and 26,703 controls. That study also found dozens of associations more significant than B4GALNT1, but their median p-values in TIGA are diluted by prior discovery in less powerful GWAS.

I don't want to impose a specific solution on the authors, who have thought about this problem more than I have. But here are some ideas, with a hypothetical sketch after the list. Significance and effect size can be aggregated in a way that weights each association by its reliability. Reliability is a general concept, but could take into account:

  • article-level citation metrics
  • gene ambiguity for a SNP (downweight when a SNP maps to many genes)
  • GWAS sample size
  • Inflated p-value distributions for a study. GCST006287 is a study that discovers too many associations for its small sample size (discussion). Can this be detected and penalized?
  • whether the study validated associations on independent samples
  • year the study was published (only if GWAS reliability has improved with time)
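Purely as a hypothetical sketch of how a few of these factors could be folded into a single per-association weight (every functional form and constant below is a placeholder I made up, not a proposal from the manuscript), which could then feed a weighted combination like the one sketched earlier:

```python
import numpy as np

def association_weight(rcr, n_genes_for_snp, n_cases, n_controls,
                       has_replication):
    """Hypothetical reliability weight for one association.
    All functional forms and constants are illustrative placeholders."""
    effective_n = 4 / (1 / n_cases + 1 / n_controls)  # effective sample size
    weight = np.sqrt(effective_n)      # larger studies count more
    weight *= np.log1p(rcr)            # well-cited papers count more
    weight /= n_genes_for_snp          # split credit across mapped genes
    if has_replication:
        weight *= 1.5                  # bonus for independent replication
    return weight

print(association_weight(rcr=2.3, n_genes_for_snp=3,
                         n_cases=14_802, n_controls=26_703,
                         has_replication=True))
```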

Currently, RCRAS is a distinct metric, but would it make more sense as a means to weight study reliability when aggregating significance and effect across associations?

Evaluation

These variables are complementary, having a maximal pairwise Spearman correlation of 0.34 and evaluating different aspects of the associations.

The meanRankScore had an AUROCC of 0.731 compared to an AUROCC of 0.720 for RCRAS alone. It's therefore not clear that pVal_mLog and N_snpw improve the meanRankScore.
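For context, this is the kind of check I have in mind, sketched under the assumption of a hypothetical benchmark table with one row per gene-trait pair, the component metrics and combined score (larger meaning stronger evidence), and a gold-standard label; the file and column names are placeholders, not TIGA outputs:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical benchmark table; all names below are placeholders.
df = pd.read_csv("tiga_benchmark.csv")
metrics = ["n_snpw", "pval_mlog", "rcras", "mean_rank_score"]

# Pairwise Spearman correlations among the metrics.
print(df[metrics].corr(method="spearman"))

# AUROC of each metric against the gold-standard label, to check whether
# the combined score discriminates better than RCRAS alone.
for metric in metrics:
    print(metric, roc_auc_score(df["is_gold_standard"], df[metric]))
```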

Open Targets Genetics

The OpenTargets platform (Koscielny et al., 2017) uses Catalog data and other sources to identify and validate therapeutic targets by aggregating and scoring disease–gene associations for “practicing biological scientists in the pharmaceutical industry and in academia.” In contrast, TIGA is a GWAS Catalog-only application that takes into account cited articles in a simple, interpretable manner.

TIGA should cite the recent Open Targets Genetics publication:

Open Targets Genetics: systematic identification of trait-associated genes using large-scale genetics and functional genomics
Maya Ghoussaini, Edward Mountjoy, Miguel Carmona, … Ian Dunham
Nucleic Acids Research (2020-10-12) https://doi.org/ghfp4s
DOI: 10.1093/nar/gkaa840 · PMID: 33045747

More description of the differences between Open Targets Genetics and TIGA would be helpful.

Data & Code availability

Code for the webapp is released at https://github.com/unmtransinfo/tiga-gwas-explorer under the permissive BSD 2-Clause License (excellent). The webapp is available at https://unmtid-shinyapps.net/shiny/tiga/ and doesn't require login.

Dataset download is available in the webapp Download tab, with columns documented under the Help tab. A minor nitpick is that the column names don't follow consistent naming conventions, and switch between camelCase, underscore_sep, and ALLCAPS. This might not be worth the effort to fix, but does detract slightly from the user experience.

It would be helpful to release the datasets on the downloads page under an open license made for data (not code). I haven't read the EBI terms of use that apply to GWAS Catalog, but they likely would be compatible with a TIGA license like CC BY 4.0.

open-source pipeline designed for continual updates and improvements

The workflow docs describe the process for rebuilding the resource. Is there a plan to continually update TIGA? How much human work would be required for an update?

Minor

evaluate the evolving empirical impact of a publication, in contrast to the non-empirical impact factor.

This seems like a misuse of the word "empirical". The journal impact factor is not theoretical. It is an empirical measure of the citations received across all articles in a journal over a time period. Do you mean an article-level rather than a journal-level metric?

"sc = study count (in pub) ... Division by sc effects a partial count for papers associated with multiple studies."

What is the difference between a study, a paper, and a publication in this section? This is not clear.
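To make my confusion concrete, here is how I currently read the partial count, sketched under the assumption that a paper's RCR is split evenly across the GWAS Catalog studies it reports before being summed per gene-trait pair (my interpretation, not the authors' formula):

```python
from collections import defaultdict

# Hypothetical records: (gene, trait, pmid, rcr, studies_in_paper).
records = [
    ("HLA-DQA1", "MS", "PMID1", 4.2, 2),
    ("HLA-DQA1", "MS", "PMID2", 1.1, 1),
]

rcras = defaultdict(float)
for gene, trait, pmid, rcr, study_count in records:
    # Divide each paper's RCR by its study count so that papers reporting
    # many GWAS Catalog studies contribute only a partial count per study.
    rcras[(gene, trait)] += rcr / study_count

print(dict(rcras))
```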

Where does citation information for iCite RCR come from? Does this only include citations from articles in PubMed Central? Looking into it, the citations come from NIH Open Citation Collection (NIH-OCC), which includes data sources such as "MedLine, PubMed Central (PMC), and CrossRef". This is about as comprehensive as a publicly available citation resource can get at the moment (excellent). Perhaps mention the citation corpus is NIH-OCC.

The Relative Citation Ratio adjusts for both the field and the age of the publication. These are fantastic characteristics worth mentioning, as they are crucial for TIGA's application.

N_beta: simple count of beta values with 95% confidence intervals supporting gene-trait association.

Shouldn't all associations at this point be significant (i.e. have confidence intervals that rule out no effect)? What is the purpose of counting N_beta?

Vectors of ordinal variables represent each case, and non-dominated solutions are cases, which are not inferior to any other case at any variable. The set of all non-dominated solutions defines a Pareto-boundary.

There's too much new terminology in this sentence. What is a "case", a "solution", etc.? If defining Mu, use terminology relevant to your study. I didn't dwell on this section because the alternative meanRankScore was sufficient.
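For other readers who stumble here, this is what I take "non-dominated" to mean, sketched under the assumption that each row is a gene-trait pair scored on several metrics where larger is better (whether this matches the manuscript's exact Pareto procedure is my guess):

```python
import numpy as np

def non_dominated(scores):
    """Return a boolean mask of rows not dominated by any other row.
    Row j dominates row i if j is >= i on every metric and > on at least one.
    Larger values are assumed to be better."""
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and np.all(scores[j] >= scores[i]) and np.any(scores[j] > scores[i]):
                keep[i] = False
                break
    return keep

# Three gene-trait pairs scored on (N_snpw, pVal_mLog, RCRAS).
scores = [[3, 30, 1.2], [1, 10, 0.5], [2, 40, 0.9]]
print(non_dominated(scores))  # [ True False  True]
```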

The PDF citations link to Paperpile URLs that return the error message "You cannot view or change the references of this document". Example URL: https://paperpile.com/c/ENPCgk/sat4r.

Summary

The study addresses an important problem. A unified dataset of gene-trait associations based on GWAS would be a valuable resource. However, the manuscript lacks details and intermediate results regarding the crucial steps of variant-to-gene mapping and association integration. Furthermore, the methods for association integration appear suboptimal, although more discussion of the reasoning behind the design decisions might sway my opinion.

Integration into DISEASES and PHAROS suggests the TIGA data will make a lasting impact.
