@wragge
Last active November 13, 2021 05:41
Feedback on the revised draft plan for Trove advanced research tools under the HASS Research Data Commons program

Feedback on the updated draft plan for Trove advanced research tools

Submitted by Dr Tim Sherratt (GLAM Workbench and University of Canberra)

Do you think this proposal meets the requirements?

Additional notes on the two versions of the draft plan can be found here and here.

The current investment in the HASS Research Data Commons is part of a long-term program of capability development across the HASS sector. This is reflected in the evaluation criteria for project plans, which emphasise ‘maximisation of the use or re-use of existing research infrastructure’, ‘integrated infrastructure layers with other HASS RDC activities’, collaboration, ‘research leadership’, and ‘a demonstrated commitment to ongoing community development’.

The NLA plan for a Trove Advanced Platform is narrowly focused on the development of the Library’s own systems. The redrafted plan introduces limited consultation, but the overall scope of the project is unchanged. The reuse or integration of external tools is vaguely gestured at, without any commitment to collaboration. Throughout this process, the NLA has refused to acknowledge existing resources, such as the GLAM Workbench. It has been reluctant to engage with the HASS community, and has provided no evidence that it is interested in long-term community development, or research leadership. It simply wants HASS RDC funding to build new Trove interfaces.

The inability of the NLA to address feedback and think beyond the boundaries of its initial plan indicates that it should not be leading this part of the HASS RDC. To meet the objectives of NCRIS and make a positive contribution to capability development within the HASS sector, the scope of this project needs to be radically rethought. The NLA should focus on the enrichment of its data and the improvement of its public APIs. It should work with the ARDC and other research infrastructure providers to develop a community of developers, trainers, and users that expose this data to new research uses. The plan should support and enlarge activity outside of the Library, not merely pump funds into current systems.

For these reasons I believe the plan should be rejected and the ARDC should consider alternative strategies for developing an advanced research platform using Trove data. If this is not done, I fear that the long-term strategic objectives of the HASS RDC will be damaged, and a valuable opportunity to enlarge the scope and impact of HASS research infrastructure will be lost.

How would this proposal support your research?

It’s difficult to see how this proposal would add anything to what I am currently able to do with Trove data.

I can, for example, already create datasets from the Trove API using the Trove Newspaper Harvester. These datasets are generated using openly-licensed tools, running on a variety of cloud services including the Nectar Research Cloud. The datasets are timestamped, and can include the full text, PDFs, and images as well as article metadata. Other notebooks in the GLAM Workbench harvest data from Trove books and journals.
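As a rough illustration of how little is needed to start building a dataset from the public API, the sketch below assembles a newspaper search request. It assumes version 2 of the Trove API as it stood around the time of writing: the endpoint and the `key`, `zone`, `q`, `encoding`, and `n` parameters should be checked against the current API documentation, and `build_search_url` and `YOUR_API_KEY` are hypothetical names used only for this example. It is not the Trove Newspaper Harvester's own code.

```python
from urllib.parse import urlencode

# Assumed endpoint for version 2 of the Trove API; verify against current docs.
API_URL = "https://api.trove.nla.gov.au/v2/result"


def build_search_url(query, api_key, zone="newspaper", page_size=100):
    """Build a search request URL (hypothetical helper, not part of any library).

    The parameter names follow the version 2 API and are assumptions here.
    """
    params = {
        "key": api_key,      # personal API key issued by Trove
        "zone": zone,        # the API groups resources by zone
        "q": query,          # search terms
        "encoding": "json",  # ask for JSON rather than XML
        "n": page_size,      # number of results per request
    }
    return f"{API_URL}?{urlencode(params)}"


# Build (but do not send) a request for newspaper articles.
print(build_search_url("weather forecast", "YOUR_API_KEY"))
```

The URL can be fetched with any HTTP client. A complete harvest would also need to page through the results and save them, which is the work the Harvester already does.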

I can share these datasets with either selected partners or the public using CloudStor. I already share several gigabytes of metadata, text, and images harvested from Trove via CloudStor.

I can visualise these datasets using a variety of notebooks in the GLAM Workbench, as well as other open source tools, such as Voyant. Most importantly, these tools offer the opportunity to ask critical questions of the data itself.

The NLA has suggested that their planned developments would open these sorts of analyses to a wider range of researchers, implying that existing tools require advanced digital skills. This ignores the fact that tools such as QueryPic (for search visualisation) and the Trove Newspaper & Gazette Harvester are already available as simple web apps. Opening these existing tools to use by more researchers could be achieved simply by adding links and documentation to the Trove site. However, the NLA refuses to do this. Similarly, the current plan offers no pathways for researchers seeking to develop their skills.

Are there other capability gaps that we should consider in the longer term?

As argued above, I think the primary role of the NLA in the development of this research platform should be as the data provider. There are numerous ways in which Trove’s data might be improved and enriched in support of new research uses. These improvements could then be pushed through APIs to integrate with a range of tools and resources. Where is the Trove API in this plan? The API is the key pipeline for data-sharing and integration, yet it receives no attention at all.

If the scope of this project were rethought, there would be an opportunity to make some much-needed improvements to the NLA’s public APIs. Here are a few possibilities:

  • Bring the web interface and API back into sync so that researchers can easily transfer queries between the two (Trove’s interface update introduced new categories, while the API still groups resources by the original zones).
  • Provide public API access to additional data about digitised items. For example, you can get lists of newspaper titles and issues from the API, but there’s no comparable method to get titles and issues for digitised periodicals. The data’s there – it’s used to generate lists of issues in the browse interface – but it’s not in the API. There’s also other resource metadata, such as parent/child relationships, which is embedded in web pages but not exposed in the API.
  • Standardise the delivery of OCRd text for different resource types.
  • Finally add the People & Organisations data to the main RESTful API.
  • Fix the limitations of the web archives CDX API (documented here).
  • Add a search API for the web archives.
  • Make use of IIIF standard APIs for the delivery of images and maps. This would enable the use of the growing ecosystem of IIIF compliant tools for integration, analysis, and annotation.
  • And what about a Write API? Integration between components in the HASS RDC would be greatly enhanced if other projects could automatically add structured annotations to existing Trove resources.

These sorts of developments open up possibilities for future enhancement and collaboration. They align more closely with the HASS RDC program’s objectives than the current plan.
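To make the IIIF suggestion above more concrete: the IIIF Image API defines a standard URL pattern, `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`, so any compliant tool can request a crop or a resized copy of an image without knowing anything about the server behind it. The sketch below builds such a URL. The server address and the identifier are hypothetical, since no such Trove endpoint is described here, and `iiif_image_url` is a name invented for this example.

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build a IIIF Image API request URL (hypothetical helper).

    Follows the pattern {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}.
    The identifier is percent-encoded so that any slashes it contains
    are not read as path separators.
    """
    encoded_id = quote(identifier, safe="")
    return (f"{base_url.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


# Hypothetical server and identifier, for illustration only.
# A size of "800," asks for a copy 800 pixels wide with the height scaled to match.
print(iiif_image_url("https://iiif.example.org/image", "nla.obj-123456789", size="800,"))
```

Because the pattern is the same for every compliant server, existing IIIF viewers and annotation tools could be pointed at such images without any Trove-specific code.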
