DISEASES 2.0: a weekly updated database of disease–gene associations from text mining and data integration
Dhouha Grissa, Alexander Junge, Tudor I Oprea, Lars Juhl Jensen
bioRxiv (2021-12-09) https://doi.org/gn3fj4
DOI: 10.1101/2021.12.07.471296
The webapp and latest downloads are available at https://diseases.jensenlab.org. Version 1 is described in the 2015 publication.
DISEASES provides an omics-scale catalog of associations between diseases (as Disease Ontology terms) and genes (as STRING identifiers, which are mostly Ensembl protein IDs, along with some other identifier types). This manuscript describes version 2, which includes many enhancements since version 1, including a 17-fold increase in high-confidence associations from text mining (Table 1).
Figure 2 investigates the sources of the improvement in text mining performance:
- full-text mining in addition to abstracts
- updates to the NER tagging dictionary since 2013
- new publications (abstracts) since 2013.
Notably, full-text mining appears to provide the largest benefit. The full-text mining is based on the open access subset of PubMed Central containing 7.3 million articles. This finding underscores the importance of openly licensed publishing in the biomedical domain, where around 50% of articles are still inaccessible to text and data mining (T&DM) approaches due to non-open licenses and paywalling of publicly funded research.
The web interface shows where the disease and gene co-occur in an abstract.
The data is released under a CC BY license, which makes it amenable to reuse. As the authors note, DISEASES is already incorporated into many integrative resources, owing in part to its permissive licensing. The authors removed COSMIC due to an incompatible license. The authors' diligence in keeping DISEASES unencumbered by legal barriers to reuse is commendable and will help ensure this resource continues to make a large contribution to biomedical research.
The introduction mentions other noteworthy projects compiling gene–disease associations, including DisGeNET and Open Targets.
Overall this is an important study on an important resource. However, the code and data availability can be improved as noted below.
Aside from the tagger, the code to generate DISEASES does not appear to be available. This makes it hard to get a detailed look at the methods beyond what is described in the manuscript. It also prevents the community from contributing and creating alternative datasets that reuse parts of the DISEASES infrastructure. It's not clear what the reason is for keeping the code private.
If the resource is regenerated weekly using continuous integration, it would also be valuable to give the community read access to the logs. This would occur automatically if the resource is built using a service like GitHub Actions. It's not clear from the manuscript how or where the weekly builds occur.
Versioned permalinks would be extremely helpful, so users can pin their analyses to old versions of the data without having to archive it themselves.
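Until versioned permalinks exist, one workaround is for users to record a checksum of the exact file they analyzed. The following is a minimal sketch of that idea, not something described in the manuscript; the helper names `sha256_of_file` and `matches_pinned` are hypothetical, and a temporary placeholder file stands in for a real download.

```python
import hashlib
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pinned(path, pinned_digest):
    """Check whether a local file still matches the digest recorded earlier."""
    return sha256_of_file(path) == pinned_digest


# Placeholder file standing in for a downloaded snapshot (not real DISEASES data).
with tempfile.NamedTemporaryFile(delete=False, suffix=".tsv") as tmp:
    tmp.write(b"placeholder contents\n")
    snapshot_path = tmp.name

# Record this digest alongside the analysis to pin the snapshot that was used.
pinned = sha256_of_file(snapshot_path)
```

A recorded digest only detects that a file has changed. It does not make the old version retrievable, which is why versioned permalinks or a versioned archive remain preferable.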
On the downloads page, the Figshare archive link is to a user profile, and not an archive of DISEASES v2. On the user page, there are multiple DISEASES v1 records, but the single DISEASES v2 record contains only the gold standard data.
Ideally, there would be a Figshare record for DISEASES 2.0 (or multiple records for different components, e.g. tagging dictionaries might be their own record). Rather than creating a new record for each DISEASES v2 release, it would be best to create multiple versions within the same record.
The downloads page links to https://download.jensenlab.org/diseases_dictionary.tar.gz, but not to human_dictionary.tar.gz with the gene dictionary. Perhaps it'd be less confusing to just link to https://github.com/larsjuhljensen/tagger, which already contains links to all the predefined dictionaries? I am left wondering what the difference is between diseases_tagger.tar.gz and the software from the GitHub repository.
The association score TSV downloads don't contain column names as the first row, nor are the columns documented on the downloads page. It's best to keep column names with the data.
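To illustrate the burden this places on users, here is a minimal sketch of reading a headerless TSV by supplying column names by hand. The names in `ASSUMED_COLUMNS` are my own guesses for illustration, since the downloads page does not document the columns, and the sample row is made up rather than taken from DISEASES.

```python
import csv
import io

# Hypothetical column names: these labels are assumptions, not documented
# by the downloads page.
ASSUMED_COLUMNS = ["gene_id", "gene_name", "disease_id", "disease_name", "score"]


def read_headerless_tsv(handle, columns):
    """Yield one dict per row, pairing each field with an assumed column name."""
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        if len(row) != len(columns):
            # Fail loudly if the assumed layout does not match the file.
            raise ValueError(f"expected {len(columns)} fields, got {len(row)}")
        yield dict(zip(columns, row))


# Made-up example row, not real DISEASES data.
sample = io.StringIO("ENSP0000001\tGENE1\tDOID:0000\texample disease\t3.5\n")
rows = list(read_headerless_tsv(sample, ASSUMED_COLUMNS))
```

Every user has to repeat this guesswork, and a wrong guess that happens to have the right number of fields goes unnoticed. A header row in the file would remove that risk.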
excluded_documents.txt provides a list of 826 publications from paper mills that should be excluded by T&DM approaches. This is a useful resource beyond this project. Is the source of this file tracked with collaborative version control? It could be beneficial to have this file in a GitHub repo where the community could suggest additional exclusions in the future.
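As a rough sketch of how other T&DM projects might reuse such a list, the snippet below filters a set of documents against an exclusion set. It assumes one identifier per line, which I have not verified against the actual file; the function names and identifiers are hypothetical.

```python
def load_excluded_ids(lines):
    """Collect non-empty, whitespace-stripped identifiers from an exclusion list."""
    return {line.strip() for line in lines if line.strip()}


def filter_documents(documents, excluded_ids):
    """Keep only the documents whose identifier is not on the exclusion list."""
    return [doc for doc in documents if doc["id"] not in excluded_ids]


# Made-up identifiers for illustration; the one-per-line layout is an assumption.
excluded = load_excluded_ids(["PMID:111\n", "\n", "PMID:333\n"])
documents = [{"id": "PMID:111"}, {"id": "PMID:222"}, {"id": "PMID:333"}]
kept = filter_documents(documents, excluded)
```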
Many users might be interested in an integrated score that combines the knowledge, experiments, and text mining channels. We've discussed methods for this in the past. DISEASES v1 included an integrated score (archived here), but it looks like DISEASES v2 removes this score. What is the reason for the removal of the integrated score?
The methods for producing the filtered datasets that "contain only the non-redundant associations" are not clear to me. Is this related to backtracking of evidence to parent diseases?
There's a missing word in the following sentence: "The Experiments channel has CHANGED in many ways between the two versions".
Signed, Daniel Himmelstein
Noting the revised version for re-review: DATABASE-2021-0143.R1_Proof_hi.pdf.
Review on 2022-03-01
The authors addressed or responded to all of my comments. I think the inability to release the source code is far from ideal, but given the authors' explanation this sounds unlikely to change. I think the database is of sufficient value that we should proceed with publication.
I see that there is now a DISEASES v2 collection on Figshare (https://figshare.com/collections/DISEASES_v2/5833394), which is excellent. However, the downloads page on the website still links to the user profile, so perhaps that link should be updated.