Review: ISWC 2017 Resources Submission 178
- URI: https://gist.github.com/stain/78d688c2fc527cc2f95c06d78c844526
- Title: One year of the OpenCitations Corpus - Releasing RDF-based scholarly citation data into the Public Domain
- Authors: Silvio Peroni, David Shotton, Fabio Vitali
- Call: ISWC 2017 Resources Track
- Submitted preprint: https://w3id.org/people/essepuntato/papers/oc-iswc2017/2017-05-15.html
- Latest version: https://w3id.org/people/essepuntato/papers/oc-iswc2017.html
- Accepted preprint: https://iswc2017.ai.wu.ac.at/wp-content/uploads/papers/MainProceedings/178.pdf
- Published: https://doi.org/10.1007/978-3-319-68204-4_19
- Resource: https://w3id.org/oc/corpus/ https://doi.org/10.6084/m9.figshare.5147068
- Review by: Stian Soiland-Reyes (#4 of 4)
- Outcome: Accepted at ISWC 2017 Resources Track.
Silvio Peroni, David Shotton, Fabio Vitali (2017): One Year of the OpenCitations Corpus. In: d'Amato C. et al. (eds) The Semantic Web – ISWC 2017. ISWC 2017. Lecture Notes in Computer Science 10588 https://doi.org/10.1007/978-3-319-68204-4_19
This review is licensed under a Creative Commons Attribution 4.0 International License.
Reviewer's confidence: 3: high
Appropriateness: 1: good
Clarity and quality of writing: 2: very good
Related work: 2: very good
Originality: 1: good
Impact of ideas and results: 2: very good
Implementation and soundness: 2: very good
Evaluation: 2: very good
Assessment of the resource
Assessment of the resource: 2: very good
The paper describes the OpenCitations Corpus in detail, both how it has been constructed and how it has seen a large increase in community uptake.
The corpus is distributed under a free (perhaps the freest?) CC0 license, and VOID metadata is available, if minimal. An open SPARQL endpoint and datadumps are available; however, the dumps are slightly inaccessible (see below).
Reusability: 1: very good
The corpus is described at https://w3id.org/oc/corpus/ - this gives textual and semantic VOID dataset description - yay!
However the VOIDs do not provide any datadump links or dcat:Distributions, only links to the SPARQL endpoint. From the endpoint I was able to navigate to http://opencitations.net/download which provides the actual datadumps. As these are provided as DAR archives rather than, say, .nq.gz, they would not be appropriate to show as a void:dataDump, but the download web page could be given as a dcat:accessURL of a dcat:Distribution.
Edit: Authors have tracked this as an issue.
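To make the dcat:accessURL suggestion concrete, here is a minimal sketch that prints the kind of Turtle the dataset description could carry. The distribution identifier (the dataset URI with a "#download" fragment) is a made-up placeholder, not something the authors publish; only the dataset URI and the download page come from this review.

```python
# Sketch only: emits a small DCAT description as Turtle text.
DATASET = "https://w3id.org/oc/corpus/"
DOWNLOAD_PAGE = "http://opencitations.net/download"


def dcat_distribution_turtle(dataset: str, access_url: str) -> str:
    """Return Turtle linking a dataset to a download landing page."""
    # Hypothetical identifier for the distribution; any stable URI would do.
    distribution = dataset + "#download"
    return "\n".join([
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .",
        "",
        f"<{dataset}> dcat:distribution <{distribution}> .",
        f"<{distribution}> a dcat:Distribution ;",
        f"    dcat:accessURL <{access_url}> .",
    ])


if __name__ == "__main__":
    print(dcat_distribution_turtle(DATASET, DOWNLOAD_PAGE))
```

dcat:accessURL is used here rather than dcat:downloadURL because the page is a landing page listing several archives, not a direct link to one file.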
Datadumps have appropriate LICENSE and README files.
I find it quite convoluted to have a Russian doll of a zip file of a zip file of a DAR archive of JSON files - is this really necessary? Multiplied with the split datasets and many downloads this becomes a bit inaccessible.
The paper does not explain the reason for this packaging format, except "to recreate the entire OCC structure". I struggled to install dar on my Ubuntu 16.04 LTS server, as I needed to enable the "Universe" repository. This is a barrier for reuse. Why would a single tar with BagIt metadata/checksums for consistency not be sufficient? Disk space requirements should also be listed on the Download page and in the VOID metadata.
Edit: Authors explain that the outer ZIP is made by FigShare, the inner one is there to simplify the FigShare upload, and DAR is useful for incremental backups. They have now also provided an April 2017 dump as N-Quads at https://doi.org/10.6084/m9.figshare.5147068 and plan to publish such dumps regularly.
As the archives contain multiple JSON-LD files with an external context, they can also be quite resource-intensive and time-consuming to load into a triple store. (However, some JSON-LD parsers have caching.)
I would have preferred the datasets to be available in N-Quads or N-Triples format instead of JSON, as that would make them easier to load. It would probably also be better if each file was about 20-100 MB, so that, say, parallel loading in Virtuoso can be more efficient.
Edit: The new N-Quads dump is 11 GB zipped, although it is not split.
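A consumer can work around the unsplit dump locally. The following is a minimal sketch, not part of the OCC tooling, that cuts a line-based RDF file (N-Quads or N-Triples) into parts of roughly a chosen size; the function name, the part file naming and the 50 MB default are my own choices for illustration.

```python
import os


def split_nquads(path, out_dir, max_bytes=50 * 1024 * 1024):
    """Split a line-based RDF file into parts of at most max_bytes each.

    Lines are never broken, so every part is itself valid N-Quads.
    A single line longer than max_bytes gets a part of its own.
    Returns the list of part file paths in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    parts, out, written = [], None, 0
    with open(path, "rb") as src:
        for line in src:
            # Start a new part when this line would push the current one over the limit.
            if out is None or (written and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                part = os.path.join(out_dir, f"part-{len(parts):05d}.nq")
                parts.append(part)
                out = open(part, "wb")
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

This only works because N-Quads and N-Triples carry one complete statement per line with no prefixes or shared context; the same trick would not work on Turtle or JSON-LD.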
The publicly accessible Linked Data resources (e.g. https://w3id.org/oc/corpus/be/537) are well presented with content negotiation, permanent URIs and a clean HTML+RDFa rendering for humans.
However Apache Jena's riot is not happy:
stain@biggiebuntu:/tmp$ docker run -it stain/jena riot https://w3id.org/oc/corpus/be/537
15:18:00 ERROR riot :: [line: 1, col: 1 ] Expected BNode or IRI: Got: [DIRECTIVE:prefix]
While the opencitations.net server does content negotiation correctly, it does not return a Content-Type in its response, so Jena is unaware of which RDF syntax to parse. This is likely to be a problem with other Linked Data consumers as well. I recommend fixing the server setup.
Edit: Authors have already fixed content negotiation and riot is happy again.
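To illustrate why the missing header matters, here is a rough sketch of how a Linked Data client might pick a parser from the Content-Type response header. This is not Jena's actual logic; the function name and table are my own, and only the media types themselves are standard.

```python
# Registered media types for common RDF syntaxes.
RDF_MEDIA_TYPES = {
    "text/turtle": "Turtle",
    "application/n-triples": "N-Triples",
    "application/n-quads": "N-Quads",
    "application/rdf+xml": "RDF/XML",
    "application/ld+json": "JSON-LD",
}


def rdf_syntax(content_type):
    """Return the RDF syntax named by a Content-Type header, or None."""
    if not content_type:
        # Header missing: the client can only guess, e.g. from the URL.
        return None
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return RDF_MEDIA_TYPES.get(media_type)


if __name__ == "__main__":
    print(rdf_syntax("text/turtle; charset=utf-8"))
    print(rdf_syntax(None))
```

When the lookup yields nothing, a client has to fall back on a default or on the file extension; the riot error above is consistent with Turtle content being read by a parser that does not accept prefix directives.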
For some reason not all the information is shown in the HTML; for instance, http://opencitations.net/corpus/ar/1928.html does not include isHeldBy <https://w3id.org/oc/corpus/ra/1731>, which is present in the RDF at http://opencitations.net/corpus/ar/1928.ttl
Edit: Authors pointed out this is indeed shown, but with label "role of".
The downloadable "triple store" contains the dataset together with the triple store server binaries, and so can in theory be started up separately. This seems to require excessive amounts of disk space, but it is not documented how much. I would think that a Docker image would be more appropriate for distributing the installed triple store, although it would then need to do its own scripted data download and unpacking, so that the consumer does not have to do all the wget/unzip/unzip/dar and JSON-LD loading steps manually.
Resource Design Quality
Resource Design Quality: 1: good
The OCC dataset uses existing ontologies like the SPAR ontologies, FOAF and PROV, augmented with its own OpenCitations Ontology that groups existing entities from other ontologies.
The data model of the dataset can be overwhelming, as the resources are modelled with fairly nested structures. This can make it hard to write queries and use the dataset. It is however helped by the fact that it is using published and well-documented ontologies, as well as having resolvable Linked Data URIs for every resource.
I found myself as https://w3id.org/oc/corpus/ra/293031.ttl (yay) with datacite:hasIdentifier https://w3id.org/oc/corpus/id/181253 which defines my ORCID as literalreification:hasLiteralValue "0000-0001-9842-9718". Why are ORCID identifiers not linked to actual URIs like http://orcid.org/0000-0001-9842-9718, considering those already have (limited) Linked Data support?
Edit: Authors indicate they will add direct ORCID identifiers in the future and tracked this as an issue
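Deriving the URI from the stored literal is mechanical. The sketch below is my own illustration, not the authors' planned implementation: it validates the literal with the ISO 7064 MOD 11-2 check character that ORCID iDs use, then prefixes it with the http://orcid.org/ form used in this review.

```python
import re

ORCID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def orcid_check_character(base_digits):
    """ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def orcid_uri(literal):
    """Turn an ORCID literal such as '0000-0001-9842-9718' into a URI."""
    if not ORCID_PATTERN.fullmatch(literal):
        raise ValueError(f"not an ORCID iD: {literal!r}")
    digits = literal.replace("-", "")
    if orcid_check_character(digits[:15]) != digits[15]:
        raise ValueError(f"bad ORCID checksum: {literal!r}")
    return "http://orcid.org/" + literal
```

Checking the checksum before minting a link guards against the ingestion errors that would otherwise turn into wrong links to external resources.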
The web service provides text-book PROV statements with the provenance of each entity, e.g. showing how a particular author's data was loaded from PubMed. This should be commended. It is however difficult to guess that https://w3id.org/oc/corpus/ra/293031/prov/se/1 has the provenance for https://w3id.org/oc/corpus/ra/293031; perhaps a prov:has_provenance link can be added to the latter? Also, I am unable to query anything with prov:specializationOf in the SPARQL endpoint.
Edit: The authors have added an issue to add PROV-AQ links.
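As an illustration of the suggested link, the sketch below builds the provenance URI from an entity URI and emits one N-Triples statement. It assumes that the /prov/se/1 pattern seen for ra/293031 holds for other entities, which this review only confirms for that one example; the function names are hypothetical, and whether the link should target a single snapshot or a whole provenance graph is left open.

```python
PROV_HAS_PROVENANCE = "http://www.w3.org/ns/prov#has_provenance"


def provenance_snapshot_uri(entity_uri, snapshot=1):
    """Guess the provenance snapshot URI for an OCC entity URI.

    Assumption: the pattern <entity>/prov/se/<n> generalises beyond
    the single example inspected in this review.
    """
    return f"{entity_uri.rstrip('/')}/prov/se/{snapshot}"


def has_provenance_triple(entity_uri):
    """N-Triples statement linking an entity to its provenance record."""
    target = provenance_snapshot_uri(entity_uri)
    return f"<{entity_uri}> <{PROV_HAS_PROVENANCE}> <{target}> ."


if __name__ == "__main__":
    print(has_provenance_triple("https://w3id.org/oc/corpus/ra/293031"))
```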
Overall paper evaluation
Overall paper evaluation: 3: strong accept (was: 2: accept)
Detailed comments to the authors
Hi, I am Stian Soiland-Reyes http://orcid.org/0000-0001-9842-9718 and believe in open reviews.
I would appreciate it if you could contact email@example.com if you agree to me publishing this review.
Edit: Authors agreed to publishing the review. Now at https://gist.github.com/stain/78d688c2fc527cc2f95c06d78c844526
Post-edit: Authors also published their preprint and agreed to me quoting their rebuttal here.
A final remark: Stian, we are happy if you could publish your review openly as you requested.
This review is licensed under a Creative Commons Attribution 4.0 International License http://creativecommons.org/licenses/by/4.0/
This paper presents the OpenCitations Corpus, containing scholarly citation data gathered from Open Access literature databases. This is a very valuable resource that is used by thousands of researchers. While OCC has existed since 2010, this paper shows the evolution of the corpus and how it is maintained and used.
I have highlighted some issues in datadump downloads that I would hope to see addressed - that would lift it from "accept" to "strong accept" for me. I understand this would not be easy to achieve in the time-span of this review process, but hope the authors take this into consideration - perhaps generating flatter N-Triples files would be fairly easy to achieve from the underlying triple store?
Section 4: Instead of spelling out numbers like "six million and half citation" or "five hundred and forty thousands times", use numerals, e.g. "540,000 times".
Figure 3: Instead of dates in US format "September 24, 2016" or "April 26, 2017 dump", use ISO 8601 "2016-09-24" and "2017-04-26". This is not just an international date format, but also makes it easier to understand the time span covered.
Some text in section 4 uses a nicer British format "6 April 2017" - I recommend the whole article use a consistent format. I can understand if you want to use named months because OCC releases are monthly - however these releases may also be referred to in other papers and in code, where the ISO 8601 format can also dual-function as an incremental "version" string.
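The conversion is straightforward to script. A minimal sketch, assuming only the two date styles quoted above and an English locale for month names (the function name is my own):

```python
from datetime import datetime


def to_iso(date_text):
    """Convert 'September 24, 2016' or '6 April 2017' to ISO 8601."""
    for fmt in ("%B %d, %Y", "%d %B %Y"):
        try:
            return datetime.strptime(date_text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next known style
    raise ValueError(f"unrecognised date: {date_text!r}")


if __name__ == "__main__":
    for text in ("September 24, 2016", "April 26, 2017", "6 April 2017"):
        print(text, "->", to_iso(text))
```

Because ISO 8601 dates sort lexicographically in chronological order, the resulting strings can be compared as plain text, which is what lets them double as version identifiers for the monthly dumps.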
Confidential remarks for the program committee
We thank all reviewers (R1, R2, R3, R4 herein) for their comments, suggestions, and typos spotting. Please find below specific answers to reviewers' comments.
The goal of the I4OC is to push publishers in allowing Crossref to release their reference lists in JSON format. The 1-to-40% increase of available citations concerns the data available on Crossref after the launch of the I4OC [R2]. OpenCitations uses some of these data contained in Crossref, but its current coverage concerns the citation networks extracted starting from the articles included in the PubMed Central Open Access subset (https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/) that are accessed iteratively by means of the Europe PubMed Central API [R3]. We will clarify this in the introduction.
Due to the page constraints, it has not been possible for us to describe precisely every aspect of the tools and the whole methodology followed for ingesting new data [R2]. However, additional details can be found in the OpenCitations website and in some cited works, e.g. Peroni et al. 2016.
The assessment about the quality of the OCC has not been provided in the paper since we preferred to focus on the access data for demonstrating that there is an active community interested in the resource - as also suggested by the ISWC Resource track guidelines [R3]. However, we are actively working on the quality aspect, as highlighted by some issues recently opened in GitHub (e.g. https://github.com/essepuntato/opencitations/issues/10).
VOID data can be obtained directly accessing the URL of the OCC (https://w3id.org/oc/corpus/) and of any of its subdatasets (e.g. https://w3id.org/oc/corpus/br/) [R3] - we will clarify this in the paper.
Outward links are not included in the OCC currently [R3], since we have mainly focussed in ingesting data, and we want to reduce the errors introduced in the Corpus so as to avoid possible wrong alignments to external resources due to these errors. However, we plan to extend the scripts soon so as to have interlinking to external datasets via owl:sameAs and/or rdfs:seeAlso depending on the case, such as Wikidata, ScholarlyData, and ORCID - in particular, in the latter case, we would like to link the agents in the OCC to their ORCID URL [R4].
At the moment, the full text search is provided directly within the SPARQL interface [R3], as shown in https://wiki.blazegraph.com/wiki/index.php/FullTextSearch. We plan to add a traditional search field in the website soon.
About the Russian doll structure of the dumps [R4], i.e. the zip file of a zip file of a DAR archive of JSON files: the former zip file is produced automatically by Figshare, the second zip is done by us so as to decrease the volume of data to upload on Figshare, while the use of DAR as mechanism for packaging items is very useful for backups, since it also allows us to implement a daily incremental backup, something that BagIt doesn't allow us to do, as far as we can see. However, we see the issue in terms of accessibility. To this end, in order to guarantee a better accessibility of the data discussed in this paper, we have just also produced and uploaded on Figshare (https://doi.org/10.6084/m9.figshare.5147068) the n-quads zipped version of the full corpus of the April 2017 dump, and we plan to implement in the current workflow a mechanism for creating and publishing such more accessible version for the future dumps of the OCC.
While it would be better to have all the data stored in n-quads, having data in JSON-LD is a crucial requirement for us [R4] in order to make them easily comprehensible also to Web-developer and researcher with no expertise in Semantic Web technologies and formats. We will clarify this point in the camera ready.
We have just added the correct content type in the server setup so as to answer to HTTP requests correctly [R4].
The "isHeldBy" property is actually present in the HTML representation of ar resources with the label "role of" [R4].
About the provenance discussion [R4]: the triplestore does not contain the provenance information (e.g. prov:specializationOf) on purpose, mainly due to space limitations, while they are available with direct HTTP access and in the provenance dumps. Adding a direct link to the whole provenance graph of the resource would be valuable indeed, but it should link to the whole provenance graph of a resource (e.g. https://w3id.org/oc/corpus/ra/293031/prov/) which is not correctly returned by our server so far (while it is able to handle single provenance resources). We will extend the scripts so as to properly serve these information soon.
All the minor fixes in the text, such as spelling mistakes [R2] and the formatting of the dates [R4], will be addressed in the final version of the paper.
All the other issues identified - namely: registering resources in DataHub and LOV [R1], providing and improving DCAT/VOID descriptions about distributions [R4], adding outward links [R3,R4], adding a full text search interface in the website [R3], publishing a n-quads archive of the full Corpus [R4], specify space requirements in the download page [R4], simplify triplestore package via Docker [R4], add has_provenance link to the whole provenance graph of a resource [R4] - will be taken into consideration as future developments of the resource, and they have been already added as issues in the GitHub repository (see issues 11-19 at https://github.com/essepuntato/opencitations/issues). We will acknowledge this in the paper.
A final remark: Stian, we are happy if you could publish your review openly as you requested.
Silvio Peroni, http://orcid.org/0000-0003-0530-4305
GitHub issues raised:
Response to authors
Edit: Thanks to the authors for recognizing, tracking and addressing the concerns raised. You have provided a datadump at https://doi.org/10.6084/m9.figshare.5147068 which I expect you will add as a new citation in the camera-ready copy.
I have changed my Overall Paper Evaluation from "2: accept" to "3: strong accept".
Thanks for agreeing to open peer review, I have published this review at https://gist.github.com/stain/78d688c2fc527cc2f95c06d78c844526