Daniel Himmelstein's review of preprint v2
Review of version 2 of the following preprint:
Rigor and Transparency Index, a new metric of quality for assessing biological and medical science methods
Joe Menke, Martijn Roelandse, Burak Ozyurt, Maryann Martone, Anita Bandrowski
bioRxiv (2020-01-18) https://doi.org/dkg6
The study introduces an automated method called SciScore to detect whether an article's methods section mentions any of 15 categories, such as a consent statement or an organism. These metrics are combined to create a single score for each article called the "Rigor and Transparency Index". The authors applied the method to the PubMed Central Open Access subset, which contains over 1 million articles, to identify trends in the level of detail provided by methods sections.
This study addresses a noteworthy topic. It enables quantifying the rigor of methods sections by journal and over time. For example, Figure 3 shows that the details in methods published by Nature likely improved after the journal implemented a checklist. A large-scale automated analysis like this is a sensible way to provide a comprehensive assessment of methods reporting. Automated assessment of methods for detail and rigor can help automate review and encourage best practices. Furthermore, additional factors or patterns in the quality of reporting might be determined from the data.
Unfortunately, the study does not sufficiently report its machine learning methods and classifiers. No code for the study is provided. The most useful datasets generated by the study are also not provided. Therefore, I worry that the study is somewhat of an advertisement for a proprietary SciScore product, rather than a reproducible study whose findings, methods, and data outputs can be extended by other researchers.
We recently developed SciScore, an automated tool using natural language processing (NLP) and machine learning, that can be used by journals and authors to aid in compliance with the above guidelines.
Can authors or editors submit manuscript drafts to SciScore? This seems essential if the goal is to create a tool to "support authors, reviewers and editors in implementing and enforcing such guidelines." Update: I now see https://www.sciscore.com. Should the manuscript mention this website?
Table 2: showing percentages to two decimal places is a bit of false precision. Given the small sample sizes, a rounded percent like "55%" would be the least misleading and easiest visually for the reader.
The percent of authentication is calculated as the percent of papers that contain a contamination or authentication statement is detected where at least one cell line is found.
Is it a problem that studies might mention a cell line without having done any experiments on that cell line? For example, "past studies on [cell line X] found …".
Figure 4A on 2018 JIF versus SciScore: Log-transforming the x-axis would help the plot tremendously.
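To illustrate why a log scale helps here, the sketch below compares how much of the axis width a cluster of low-impact journals occupies on a linear versus a log10 axis. The impact factors are hypothetical values chosen only to mimic a right-skewed distribution; they are not taken from the manuscript.

```python
import math

# Hypothetical impact factors (NOT from the manuscript), right-skewed on purpose.
jifs = [0.8, 1.5, 2.3, 3.1, 4.7, 9.2, 41.8, 70.7]


def relative_positions(values):
    """Map values to [0, 1] positions along an axis spanning min to max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


linear = relative_positions(jifs)
logged = relative_positions([math.log10(v) for v in jifs])

# Share of the axis width covered by the six journals with JIF below 10.
linear_span = linear[5] - linear[0]
log_span = logged[5] - logged[0]

print(f"linear axis: {linear_span:.0%} of the width")
print(f"log axis:    {log_span:.0%} of the width")
```

On a linear axis most journals are squeezed into a narrow strip near the origin, while the log axis spreads them out. In matplotlib the same effect comes from calling `ax.set_xscale("log")` rather than transforming the data by hand.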
Figure 4B: it's unclear which points the labels refer to. Perhaps it'd be best to just create an online interactive version of this plot where users can hover over each point to see the journal label. Tools like plotly and vega-lite can produce hover text / tooltips without too much extra work.
Figure 4B: I think the axes actually show ranks, not quartiles (as stated in the axis titles) or percentiles (as stated in the caption). Quartiles imply there are only 4 possible values, and percentiles imply there are only 100 possible values. It's helpful to plot the quartile lines, but the axis itself does not show quartiles.
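The distinction between ranks and quartiles can be made concrete with a small sketch. The journal averages below are hypothetical, and `ranks` and `quartile_of` are illustrative helper names, not anything from the manuscript.

```python
def ranks(values):
    """Return 1-based ranks (1 = smallest); ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def quartile_of(rank, n):
    """Quartile bin (1 to 4) for a 1-based rank among n items."""
    return min(4, (rank - 1) * 4 // n + 1)


# Hypothetical journal-level average scores (NOT from the manuscript).
scores = [3.2, 4.9, 2.7, 5.5, 4.1, 3.8, 4.4, 2.9]

journal_ranks = ranks(scores)
journal_quartiles = [quartile_of(r, len(scores)) for r in journal_ranks]

print(journal_ranks)      # one distinct value per journal
print(journal_quartiles)  # only the values 1 through 4
```

An axis of ranks has as many distinct positions as there are journals, whereas an axis of quartiles would collapse every journal onto one of four positions.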
To a text mining algorithm like SciScore, a supplementary PDF is effectively invisible. If we were to attempt to score these papers manually instead, it would take roughly 1,500 hours or 187 days of nonstop curation to score the 18,000 Science papers in PMC assuming each paper took 5 minutes.
This is a compelling example of why journals should not require authors to withhold text from the structured article. I imagine it would be okay to have an online-only methods section (i.e. that doesn't make it to the magazine print) but that is still part of the structured XML.
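As a quick check of the arithmetic in the quoted estimate, the sketch below recomputes the curation time from the stated inputs (18,000 papers at 5 minutes each). The 8-hour working day is my assumption, not something the manuscript states.

```python
papers = 18_000
minutes_per_paper = 5

total_hours = papers * minutes_per_paper / 60

# Nonstop curation means 24 hours per day.
continuous_days = total_hours / 24

# Assumption: an 8-hour working day (not stated in the manuscript).
working_days = total_hours / 8

print(total_hours, continuous_days, working_days)
```

Under these inputs, 1,500 hours corresponds to 62.5 days of nonstop curation or 187.5 eight-hour working days, so the quoted "187 days" appears to refer to working days rather than nonstop curation.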
It would be helpful for the manuscript to provide an example sentence for each criterion.
the most likely being that the articles are not included in the PMC-OAI, a subset that is roughly half of the “free to read” set of papers in PubMed Central because of restrictive licenses.
PMC-OAI, which stands for "PubMed Central Open Archives Initiative Protocol for Metadata Harvesting", is a service and not a subset of papers. Instead, the authors analyzed the "PMC Open Access Subset", using PMC-OAI to download the papers. Many of the mentions of "PMC-OAI" should actually refer to the "OA Subset" instead. Currently, the results returned by PMC-OAI are the OA Subset, so the terms are somewhat interchangeable in practice, but "OA Subset" would be more standard.
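To make the service-versus-subset distinction concrete, here is a minimal sketch of how a harvesting request to an OAI-PMH service is built. The `verb`, `metadataPrefix`, `set`, `from`, `until`, and `resumptionToken` arguments come from the OAI-PMH protocol itself. The base URL, the `pmc` metadata prefix, and the `pmc-open` set name are assumptions on my part and should be checked against the current PMC documentation. The function `list_records_url` is a hypothetical helper, and the sketch only builds URLs without making any network request.

```python
from urllib.parse import urlencode

# Assumption: base URL of the PMC OAI-PMH service; verify against PMC docs.
BASE_URL = "https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi"


def list_records_url(from_date=None, until_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords URL (hypothetical helper)."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it is sent with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {
            "verb": "ListRecords",
            "metadataPrefix": "pmc",  # assumption: full-text XML format
            "set": "pmc-open",        # assumption: the OA Subset set name
        }
        if from_date:
            params["from"] = from_date
        if until_date:
            params["until"] = until_date
    return f"{BASE_URL}?{urlencode(params)}"


first_page = list_records_url("2019-01-01", "2019-01-31")
next_page = list_records_url(resumption_token="example-token")
print(first_page)
print(next_page)
```

The service is the endpoint being queried, while the OA Subset is what the `set` argument selects, which is why "OA Subset" is the better name for the corpus.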
the OAI articles were fed through the SciScore™ named-entity recognition classifiers. SciScore currently uses 6 core named-entity recognition classifiers (see Table 3 for a complete list of entity types detected).
More description of the classifiers is required. Named-entity recognition is a challenging problem. What classifier model was used? Table 3 has 15 rows, how are 6 classifiers used to produce 15 types of predictions?
Table 3 reports precision and recall. Precision in particular depends on dataset balance (the prevalence of positives), so these values are hard to interpret without knowing the class mix. I believe the "Training set size" column counts just the positives? How many negative sentences were considered?
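A small sketch shows why the balance of positives and negatives matters for interpreting these metrics. It assumes a classifier with fixed per-class accuracy (sensitivity and specificity); the numbers are hypothetical and say nothing about SciScore's actual performance.

```python
def precision_recall(sensitivity, specificity, prevalence, n=100_000):
    """Expected precision and recall for a classifier with fixed per-class
    accuracy, evaluated on n sentences with the given share of positives."""
    positives = n * prevalence
    negatives = n - positives
    true_pos = sensitivity * positives
    false_neg = positives - true_pos
    false_pos = (1 - specificity) * negatives
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Hypothetical classifier: 90% sensitivity, 95% specificity.
balanced = precision_recall(0.90, 0.95, prevalence=0.50)
rare = precision_recall(0.90, 0.95, prevalence=0.02)

print(f"balanced set: precision={balanced[0]:.2f}, recall={balanced[1]:.2f}")
print(f"2% positives: precision={rare[0]:.2f}, recall={rare[1]:.2f}")
```

With the same classifier, precision falls from roughly 0.95 on a balanced set to roughly 0.27 when only 2% of sentences are positive, while recall stays at 0.90. This is why the number of negative sentences needs to be reported alongside the positives.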
The article-level performance assessment in Table 4 is impressive. This is surprising to me given the challenges of NER, especially with the small training set sizes in Table 3. I wish I could read about the classifier methods in more detail. Or see the code.
I saw journal-level summary statistics in this supplemental table, but I didn't see article-level scores reported. This seems like the most essential dataset related to this study, especially since it's required to pursue many interesting follow-ups (such as whether authors from certain universities or with certain funders report more thoroughly). Article-level scores are also essential for external assessment of the named-entity extraction methods.
Another useful dataset that is not provided is the corpus of manually curated sentences for NER training.
I didn't see any code released for this study. It seems that the study's implementation can be divided into 3 components:
1. retrieving and pre-processing XML articles using PMC-OAI
2. running SciScore on each article to produce article-level scores
3. analyzing article-level scores to identify trends at the journal or year level
Let's assume the authors are unwilling to release the code for 2, because they intend to keep SciScore proprietary and commercialize it (is this the plan?). In that case, code for 1 and 3 should be released and the reason for withholding code related to 2 should be stated explicitly.
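To show that the third component does not depend on SciScore internals, here is a minimal sketch of aggregating article-level scores to the journal and year level. The record layout, the identifiers, and the helper `mean_score_by` are all hypothetical; the authors' actual analysis may differ.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical article-level records: (pmcid, journal, year, score).
articles = [
    ("PMC0000001", "Journal A", 2015, 3.0),
    ("PMC0000002", "Journal A", 2015, 5.0),
    ("PMC0000003", "Journal A", 2018, 6.0),
    ("PMC0000004", "Journal B", 2018, 4.0),
]


def mean_score_by(records, key):
    """Average the score (4th field) within each group defined by key."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record[3])
    return {group: mean(scores) for group, scores in groups.items()}


by_journal_year = mean_score_by(articles, key=lambda r: (r[1], r[2]))
by_year = mean_score_by(articles, key=lambda r: r[2])

print(by_journal_year)
print(by_year)
```

Code of this kind only needs the article-level score table as input, so it could be released together with that table even if the scoring step stays closed.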
Unless all code is released, the study will suffer from a lack of reusability, reproducibility, and transparency. Therefore, the impact of the study should be assessed in light of the code availability. I think the topic and approach are of critical importance, but I must temper my excitement for the research based on the current level of data and code availability.