Globus Download in Hyrax
Globus is a tool for transferring very large datasets. It has many advantages over older systems for transferring files, and researchers increasingly expect data repositories to offer Globus integration. While there is not yet an official Globus integration offering from the Samvera community, several institutions have integrated Globus into their repository systems. Notch8 was recently asked to write such an integration for the Hyrax-based Rutgers Virtual Data Collaboratory. This blog post describes the research and design process behind that integration, links to some sample code, and offers pointers for future development.
As we undertook this work, we were aided greatly by conducting informational interviews with Nabeela Jaffer at the University of Michigan, and David Chandek-Stark at Duke University. UM and Duke have implemented similar strategies for Globus integration, with a few differences. We are grateful to our colleagues at these institutions for sharing their time and expertise, and this is a wonderful example of how working in an open way helps to advance the state of data repositories in general much faster than teams working in isolation.
High Level Architecture
[ Insert Globus Integration for Rutgers-VDC diagram here ]
To enable download via Globus, we are following the same general pattern that both UM and Duke are using:
1. Create a shared volume that is writeable by the Hyrax application process
2. Create a Globus endpoint that reads from that same volume
3. Automate the export of data sets from Hyrax to that shared volume, organized by unique id
4. Generate a predictable link that includes the institution’s Globus ID and the item’s unique id, which will allow a user access to the files via the Globus web client
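The link generation in step 4 can be sketched roughly as follows. This is a minimal illustration, not the code used at Rutgers, UM, or Duke: the endpoint UUID is a placeholder, the method name `globus_download_url` is hypothetical, and it assumes that each work is exported to a top-level directory named after its unique id and that the Globus web client accepts `origin_id` and `origin_path` query parameters on its file manager page.

```ruby
require "uri"

# Placeholder value: substitute the institution's real Globus endpoint UUID.
GLOBUS_ENDPOINT_ID = "00000000-0000-0000-0000-000000000000"
# Assumed base URL of the Globus web client's file manager.
GLOBUS_FILE_MANAGER = "https://app.globus.org/file-manager"

# Build a predictable link that opens the Globus web client at the
# directory holding one work's exported files.
# Assumes works are exported to "/<work id>/" on the shared volume.
def globus_download_url(work_id, endpoint_id: GLOBUS_ENDPOINT_ID)
  query = URI.encode_www_form(origin_id: endpoint_id, origin_path: "/#{work_id}/")
  "#{GLOBUS_FILE_MANAGER}?#{query}"
end

puts globus_download_url("abc123")
```

Because the link depends only on the endpoint id and the work id, it can be computed at render time without storing anything extra on the work.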
- A single work from the Duke Research Data Repository, available for download via the Globus client
- The top level directory of the Duke Research Data Repository, visible via the Globus client, showing all of the datasets
While the UM, Duke, and Rutgers solutions all share the same high-level pattern, there are some key differences. Please note that this document is not a complete analysis of each solution; it is only a report of the analysis done at Notch8 in order to fulfill a specific contract for Rutgers University.
Michigan: On-demand export
The University of Michigan solution copies files on demand for Globus download, offering the user a button that will copy a dataset to Globus download space in a background job, and then email the user when the item is ready for download. Heavily used datasets remain in the Globus download space, and for those items the user is presented with an immediate opportunity to download via Globus, with no waiting. The advantages of this approach include more efficient use of space and thus reduced cost for repository operation. The disadvantages include increased complexity (e.g., the need for an on-demand job to copy the files and notify the user when their files are ready) and the need for active storage management (the Globus download space must periodically be cleaned out).
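The core of an on-demand export of this kind might look something like the sketch below. It is not the University of Michigan code: the method name `stage_for_globus` and its parameters are hypothetical, and the background job wrapper and the email notification are left out. It only shows the "already staged, or copy now" decision, copying into a temporary sibling directory first so that the work's own directory never appears half-populated.

```ruby
require "fileutils"

# Copy a dataset into the Globus download space unless it is already there.
# Returns :already_staged when the user can download immediately, or
# :copied when the files were just copied (the caller would then notify the user).
def stage_for_globus(work_id, source_dir:, globus_root:)
  target = File.join(globus_root, work_id)
  return :already_staged if Dir.exist?(target)

  # Copy into a ".partial" sibling, then rename, so an interrupted copy
  # is never mistaken for a complete export.
  partial = "#{target}.partial"
  FileUtils.rm_rf(partial)
  FileUtils.mkdir_p(partial)
  FileUtils.cp_r(File.join(source_dir, "."), partial) # copies directory contents
  File.rename(partial, target)
  :copied
end
```

A periodic cleanup task could remove directories for rarely used datasets from the download space; the next request for one of them would simply copy it again.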
Duke: Nightly batch exports
The Duke University solution instead chooses to make all of its public data available for download at any given time. Dataset export is tracked via a Rails ApplicationRecord object called Globus::Export, which records whether a work has been exported, whether that export succeeded, and when the last export occurred. A nightly scheduled process scans the repository for newly added works by checking each work against its table of Globus::Export records, kicking off an export for any work that has not yet been exported.
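The nightly scan can be illustrated with a small plain-Ruby sketch. This is not Duke's code: `ExportRecord` is a hypothetical in-memory stand-in for the Globus::Export table so that the example runs without Rails, and the block passed to `run_nightly_export` stands in for whatever actually copies a work's files. The sketch treats any work without a record as "not yet exported"; whether failed exports are retried is a policy choice it does not try to model.

```ruby
# Hypothetical in-memory stand-in for a row in the Globus::Export table.
ExportRecord = Struct.new(:work_id, :succeeded, :exported_at, keyword_init: true)

# Works that have no export record at all are the ones still to be exported.
def works_needing_export(all_work_ids, export_records)
  all_work_ids - export_records.map(&:work_id)
end

# Export every untracked work, recording whether each export succeeded.
# The block performs the export for one work id and raises on failure.
def run_nightly_export(all_work_ids, export_records, now: Time.now)
  works_needing_export(all_work_ids, export_records).each do |work_id|
    succeeded = begin
      yield(work_id)
      true
    rescue StandardError
      false
    end
    export_records << ExportRecord.new(work_id: work_id, succeeded: succeeded, exported_at: now)
  end
  export_records
end
```

In a Rails application the same comparison would more likely be a database query against the export table, run from a scheduled job.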
Rutgers: Using the Hyrax Actor Stack
The Rutgers solution takes a more real-time approach than either of the above solutions. We adopted the Globus::Export ApplicationRecord from the Duke solution, but our version of Globus::Export has two additional fields: expected_file_sets and completed_file_sets. One of the challenges around data import in Hyrax is the fact that file attachment happens via background jobs, and there is no obvious way to know when a work has been totally assembled. However, by the end of the initial run of the Actor Stack, we know the list of FileSet objects that are attached to a work. We record that list of FileSet identifiers on a Globus::Export as its expected_file_sets. Then, we insert into the background job that is attaching files a method that kicks off a Globus export of a particular FileSet and, assuming all goes to plan, records that FileSet id in the corresponding Globus::Export#completed_file_sets. When generating the user-facing view of a work, we check the Globus::Export object for that work, and if the #expected_file_sets match the #completed_file_sets, we display the generated download link.
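The readiness check can be sketched in plain Ruby. This is an illustration only, not the Rutgers code: `GlobusExportSketch` is a hypothetical stand-in for the Globus::Export record, and the method names `mark_completed` and `ready_for_download?` are assumptions. It uses sets so that the order in which background jobs finish does not matter.

```ruby
require "set"

# Hypothetical stand-in for a Globus::Export record with the two extra fields.
class GlobusExportSketch
  attr_reader :expected_file_sets, :completed_file_sets

  # expected_file_sets: FileSet ids known at the end of the Actor Stack run.
  def initialize(expected_file_sets)
    @expected_file_sets = expected_file_sets.to_set
    @completed_file_sets = Set.new
  end

  # Called from the file attachment job once one FileSet has been exported.
  def mark_completed(file_set_id)
    @completed_file_sets << file_set_id
  end

  # The download link is shown only when every expected FileSet is exported.
  def ready_for_download?
    expected_file_sets == completed_file_sets
  end
end

export = GlobusExportSketch.new(%w[fs1 fs2])
export.mark_completed("fs2")
puts export.ready_for_download? # one FileSet is still outstanding
export.mark_completed("fs1")
puts export.ready_for_download? # both FileSets are now exported
```

Note that with a plain equality check, a work with no expected FileSets counts as ready immediately; a real implementation would need to decide whether that is the desired behavior.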
Future work for this integration might include:
* leveraging the browse-everything gem’s file system integration to also allow for Globus upload to a particular directory, where data would then be available for cataloging and deposit into Hyrax
* improved error checking, including more robust checksum validation when the files are copied
* extraction of this functionality into a gem that could be installed and configured into a Hyrax application without the need for much local customization
This has been a rewarding project, and we are so grateful to the team at Rutgers for the opportunity to better understand the needs of research scientists working with large data sets!