@JeffSpies
Last active September 17, 2015 08:03
A storage platform for distributing the stewardship of high-value data

BitTorrent for Science

Summary

The Center for Open Science's (COS; http://cos.io) Open Science Framework (OSF; http://osf.io) freely stores a large amount of publicly accessible content (e.g., materials, code, datasets, and publications). Stewardship of such high-value data should not rest solely with one organization or institution. Instead, an inclusive approach would allow any person, organization, or institution to contribute to this stewardship by storing and hosting some percentage of the data using a non-proprietary version of the BitTorrent protocol. Not only is BitTorrent a common, decentralized platform for peer-to-peer file sharing, but, because each file (specifically, each file segment) is hashed (i.e., a fingerprint or signature is generated from its content), content-addressability comes free.
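As a rough illustration of the per-segment hashing mentioned above, the sketch below splits content into fixed-size pieces and hashes each with SHA-1, the digest used for pieces in the original BitTorrent specification. The piece size and the function names (`piece_hashes`, `verify_piece`) are assumptions made for this example, not part of any existing client.

```python
from __future__ import annotations

import hashlib

# Assumed piece size for illustration; real torrents choose their own.
PIECE_LENGTH = 256 * 1024


def piece_hashes(data: bytes, piece_length: int = PIECE_LENGTH) -> list[str]:
    """Split data into fixed-size pieces and return the SHA-1 hex digest of each."""
    return [
        hashlib.sha1(data[offset:offset + piece_length]).hexdigest()
        for offset in range(0, len(data), piece_length)
    ]


def verify_piece(piece: bytes, expected_hex: str) -> bool:
    """Check a received piece against the hash published for it."""
    return hashlib.sha1(piece).hexdigest() == expected_hex
```

A peer that receives a piece can call `verify_piece` before accepting it, so corrupted or tampered segments are rejected no matter which seeder supplied them.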

Content-addressable storage (i.e., storage where files are referenced by the signature of their content) is important for scientific preservation. Persistent identifiers (e.g., DOIs) are unique identifiers with strong guarantees and fallbacks for always resolving to the same object. In practice, however, services like DOI (i.e., a scholarly bit.ly; http://dx.doi.org/10.1038/nphys1170) only guarantee resolution to the same URL (along with the maintenance of some metadata), regardless of whether that URL itself continues to resolve. The OSF provides such identifiers in the form of OSF IDs (e.g., https://osf.io/ezcuj/), DOIs, and ARKs, but is pursuing efforts to offer strong guarantees about the persistence of the content of the objects as well. For example, COS has a $250,000 endowment for the maintenance of a static OSF should COS cease to exist. A hash makes for an ideal content identifier because it can be recalculated from the content, allowing for tests of data integrity. Together, persistent identifiers and a distributed content-addressable storage solution go a long way toward providing long-term identification and preservation of scientific content.
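To make the idea of a hash as a self-verifying identifier concrete, here is a minimal in-memory sketch of content-addressable storage. The `ContentStore` class and the choice of SHA-256 are assumptions for illustration only; the proposal does not prescribe a storage backend or digest.

```python
from __future__ import annotations

import hashlib


class ContentStore:
    """Toy in-memory content-addressable store (illustrative only)."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        """Store content under the hex digest of its own bytes."""
        address = hashlib.sha256(content).hexdigest()
        self._blobs[address] = content
        return address

    def get(self, address: str) -> bytes:
        """Return content, recomputing the hash to test integrity first."""
        content = self._blobs[address]
        if hashlib.sha256(content).hexdigest() != address:
            raise ValueError("stored content no longer matches its address")
        return content
```

Because the address is derived from the bytes, storing identical content twice yields the same address, and any later corruption is detectable by anyone holding the identifier.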

To complete the long-term, distributed preservation equation, an understanding of the availability of a given file is required. A slight modification to the typical BitTorrent tracker (i.e., a server that assists the communication between peers) would allow for tracking of (a) coverage of a given file across the network of seeders in the community and (b) the reliability of each seeder's availability. A client could optionally use this information to seed content that was not explicitly requested in order to increase coverage of a given subset of files. In this way, users could donate storage to the system. In practice, there would be some users--perhaps organizations and institutions--that have dedicated resources and are determined to be high availability nodes, while others would have lower availability (e.g., a user closes their laptop intermittently throughout the day). Overlap could be continuously optimized to maximize distribution and the probability that a file is retrievable. For example, the goal could be to have at least N high availability nodes (at X% availability over Y months) seeding every file and M lower availability nodes (at A% availability over B months). Other variables to determine availability (e.g., vetted agreements) could be used as part of the equation to guard against attacks on the system that disrupt said equation (e.g., appearing to be a highly available node for a given period and then disappearing outright to maximally disrupt the coverage equation).
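The coverage goal described above can be sketched in a few lines. Everything here is hypothetical: the `Seeder` record, the 0.99 cutoff standing in for "X%", and the simplifying assumption that seeders go offline independently of one another are choices made for the example, not properties of any existing tracker.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Seeder:
    name: str
    availability: float  # fraction of time observed online, 0.0 to 1.0


# Assumed cutoff standing in for the "X%" threshold in the text.
HIGH_AVAILABILITY = 0.99


def retrieval_probability(seeders: list[Seeder]) -> float:
    """Chance that at least one seeder is online, assuming independent uptime."""
    return 1.0 - prod(1.0 - s.availability for s in seeders)


def under_covered(
    coverage: dict[str, list[Seeder]], min_high: int, min_low: int
) -> list[str]:
    """Return file hashes with fewer than min_high high-availability
    or fewer than min_low lower-availability seeders."""
    short = []
    for file_hash, seeders in coverage.items():
        high = sum(1 for s in seeders if s.availability >= HIGH_AVAILABILITY)
        low = len(seeders) - high
        if high < min_high or low < min_low:
            short.append(file_hash)
    return short
```

A tracker could publish the result of `under_covered` so that donating clients know which files to pick up next. The independence assumption is optimistic; correlated outages, or the disappearing-node attack mentioned above, would lower the real probability.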

The OSF would host torrent files (or packages of torrents) as core components of its projects. A custom client--beyond allowing for storage donation--could allow for queries in the form of a file hash, an OSF ID, or DOI that would resolve to the appropriate torrent(s) associated with the identifier. While the tracker and identifier database (i.e., the OSF) would need to be centralized for these use-cases, and while efforts like the endowment and a mirroring network that is in development could provide relatively strong guarantees about the sustainability of the centralized components, a "trackerless" fallback would offer a robust option in a worst-case scenario. It should be noted that the custom aspects of the tracker and client would be optional to access or seed data; common clients could be used.
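One way the client-side lookup might work is sketched below. It is not the OSF API: `classify`, `TorrentIndex`, and the rule that a 40-character hexadecimal string is treated as a file hash are all assumptions made for illustration, and torrents are represented simply by opaque strings.

```python
from __future__ import annotations

import re

DOI_PREFIXES = ("http://dx.doi.org/", "https://doi.org/", "doi:")
OSF_PREFIXES = ("https://osf.io/", "http://osf.io/")


def classify(query: str) -> tuple[str, str]:
    """Normalize a query into a (kind, value) pair: 'doi', 'hash', or 'osf'."""
    q = query.strip()
    for prefix in DOI_PREFIXES:
        if q.lower().startswith(prefix):
            q = q[len(prefix):]
            break
    if q.startswith("10."):
        return ("doi", q.lower())  # DOIs are case-insensitive
    for prefix in OSF_PREFIXES:
        if q.lower().startswith(prefix):
            q = q[len(prefix):]
            break
    q = q.strip("/")
    if re.fullmatch(r"[0-9a-fA-F]{40}", q):
        return ("hash", q.lower())
    return ("osf", q)  # assumption: anything else is an OSF ID


class TorrentIndex:
    """Hypothetical identifier-to-torrent lookup table."""

    def __init__(self) -> None:
        self._by_key: dict[tuple[str, str], set[str]] = {}

    def register(self, identifier: str, torrent: str) -> None:
        self._by_key.setdefault(classify(identifier), set()).add(torrent)

    def resolve(self, query: str) -> set[str]:
        return set(self._by_key.get(classify(query), set()))
```

With this normalization, `https://osf.io/ezcuj/` and `ezcuj` resolve to the same set of torrents, as do a bare DOI and its resolver URL.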

###Requirements:

  • Using a standard, non-proprietary BitTorrent protocol...
  • And releasing code under an open (ideally, Apache Version 2 compatible) license...
  • In a state ideal for community development and contribution...
  • Create or modify a tracker that can also track and share coverage information for a given file as well as collect availability information about seeders over time. The tracker should support non-custom BitTorrent clients.
  • Create or modify a multi-platform, GUI client that can download and seed torrents and that allows users to "donate" a specific amount of storage. The client should be able to run headlessly to support dedicated storage use-cases.
  • Add search capabilities to the client using the OSF API to link identifiers such as OSF IDs or DOIs with a set of torrents.
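As a sketch of how a client might spend a donated storage allowance, the function below greedily picks the least-covered torrents that still fit. The `Candidate` fields and the idea that the tracker reports a per-torrent seeder count are assumptions for the example; a greedy pass is a simple heuristic and will not always fill the allowance optimally.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    info_hash: str
    size_bytes: int
    seeder_count: int  # coverage as reported by the (hypothetical) tracker


def choose_donations(
    candidates: list[Candidate], budget_bytes: int
) -> list[Candidate]:
    """Pick torrents to seed, least-covered first, within the donated budget."""
    chosen = []
    remaining = budget_bytes
    # Fewest seeders first; among ties, prefer smaller torrents.
    for c in sorted(candidates, key=lambda c: (c.seeder_count, c.size_bytes)):
        if c.size_bytes <= remaining:
            chosen.append(c)
            remaining -= c.size_bytes
    return chosen
```

A headless deployment could run the same selection on a schedule, re-querying the tracker as coverage changes.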

###Bonus Features:

  • Advanced features for selecting the type of content the donation is directed towards (e.g., files tagged as "cancer biology" or "computer science").
  • A "trackerless" fallback using a Distributed Hash Table (DHT).
  • Use of the file-segmenting aspect of the BitTorrent protocol to non-redundantly store and seed versions of files (or projects).
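The last bonus feature could look roughly like the following: each distinct piece is stored once, and a version is just an ordered list of piece hashes. `VersionedPieceStore` is a hypothetical name, and this is a sketch of the idea rather than something standard BitTorrent clients do across torrents.

```python
from __future__ import annotations

import hashlib


class VersionedPieceStore:
    """Stores each distinct piece once; a version is a list of piece hashes."""

    def __init__(self, piece_length: int) -> None:
        self.piece_length = piece_length
        self.pieces: dict[str, bytes] = {}
        self.versions: dict[str, list[str]] = {}

    def add_version(self, label: str, data: bytes) -> int:
        """Store a version; return how many new pieces had to be written."""
        new_pieces = 0
        manifest = []
        for offset in range(0, len(data), self.piece_length):
            piece = data[offset:offset + self.piece_length]
            digest = hashlib.sha1(piece).hexdigest()
            if digest not in self.pieces:
                self.pieces[digest] = piece
                new_pieces += 1
            manifest.append(digest)
        self.versions[label] = manifest
        return new_pieces

    def read_version(self, label: str) -> bytes:
        """Reassemble a version from its piece manifest."""
        return b"".join(self.pieces[d] for d in self.versions[label])
```

Note that fixed-size pieces only deduplicate well when an edit leaves earlier byte offsets unchanged; an insertion near the start of a file shifts every later piece and defeats the sharing.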

###Outstanding issues:

  • A number of attacks specifically target distributed networks such as BitTorrent. Understanding and potentially preparing solutions for those would be ideal.
  • Question: Should OSF projects (collections of files) be represented by a single torrent or a package of torrents (i.e., a zip file that could be recognized by the client)? If the former, can the data be stored efficiently to non-redundantly store versions of a given project?