Ways dat can be leveraged to transform science, #1 - the internet of data transforms

dat is an incredibly powerful technology for peer-to-peer sharing of versioned, secure, integrity-guaranteed data.

One thing it excels at is populating a live feed of data points from a single source and allowing any number of peers to subscribe to that feed. The data can only originate from the original source (this is guaranteed by public-key cryptography: every entry is signed with the source's private key), but the peers in the network can still sync the new data with one another. To subscribe to a given source you only need to know an alphanumeric key that uniquely identifies it; this key is generated automatically by dat.
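
To make this concrete, here's a minimal sketch of both ends of a feed using hypercore, the signed append-only log that dat is built on. Peer discovery and replication (e.g. via the dat CLI or hyperswarm) are left out, and the subscriber's key is a placeholder:

  const hypercore = require('hypercore')

  // the source: an append-only feed whose entries are signed with the source's keypair
  const source = hypercore('./my-data', { valueEncoding: 'json' })
  source.ready(() => {
    console.log('subscribers only need this key:', source.key.toString('hex'))
    source.append({ reading: 42, at: Date.now() })   // publish a new data point
  })

  // a subscriber anywhere in the network, given only that key
  const key = Buffer.from('<64-character hex key of the source feed>', 'hex')   // placeholder
  const copy = hypercore('./their-data', key, { valueEncoding: 'json' })
  copy.createReadStream({ live: true })   // stays open, emitting new entries as they sync in
    .on('data', (entry) => console.log('new data point:', entry))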

There are many ways that this simple system can be used to build a new infrastructure for science. This is the first in a series of posts in which I'll explain how.

Here I briefly describe some ways dat can be used to automate aspects of scientific discovery, make the reuse of resources and information more efficient, and help keep our information resources up to date with science (a topic I will expand on significantly in later posts).

The internet of data transforms

The internet of data transforms is a theoretical network of relatively small compute nodes, each specialised in doing one particular thing to data. Each node would subscribe to data from somewhere - perhaps an API, but more likely a dat feed - perform some computation on the data, and output the result to a different dat feed.

It might be crudely represented this way:

  ---dat-feed-A---> [data transform node] ---dat-feed-B--->

This is analogous to a function in programming, in that it takes input and does something with it before providing output. It's also analogous to UNIX pipes or nodeJS streams, in that data passes between entities that each do something with it and can pull new entries from the feed on demand.
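
Continuing the sketch above, a transform node is just a process that holds the key of its input feed and owns the keypair of its output feed. A minimal skeleton might look like this (the transform itself is a trivial placeholder, and replication is again omitted):

  const hypercore = require('hypercore')

  const inputKey = Buffer.from(process.argv[2], 'hex')   // key of feed A, passed on the command line
  const input  = hypercore('./feed-a', inputKey, { valueEncoding: 'json' })
  const output = hypercore('./feed-b', { valueEncoding: 'json' })   // this node owns feed B

  output.ready(() => {
    console.log('downstream peers subscribe to feed B with key:', output.key.toString('hex'))

    input.createReadStream({ live: true })    // pull new entries from feed A as they arrive
      .on('data', (entry) => {
        const result = transform(entry)       // do something with the data
        if (result !== null) output.append(result)
      })
  })

  // the "function" at the heart of the node - a trivial placeholder here
  function transform (entry) {
    return { ...entry, processedAt: Date.now() }
  }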

Any entity that wanted to make use of the resulting data from a given node could subscribe to the outgoing feed - again, this just requires knowing the key. Nodes are not restricted to a single input or output - they could consume any number of feeds and output any number. Examples of things a data transform might do include (the first two are sketched as code after the list):

  • filtering the input based on certain criteria
  • calculating some mathematical property such as a hash or summary statistic
  • changing the format of the data
  • merging two or more sources of data into a single stream
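
The first two of those could be written as plain functions and dropped into the transform() slot of the skeleton above (the field name and size threshold here are invented for illustration):

  const crypto = require('crypto')

  // filtering: keep only entries that match some criteria (a hypothetical minimum size)
  function keepLargeDatasets (entry) {
    return entry.total_bases >= 1e9 ? entry : null   // null means "don't push to the output feed"
  }

  // calculating a property: attach a sha256 digest of the entry to the entry itself
  function addChecksum (entry) {
    const digest = crypto.createHash('sha256').update(JSON.stringify(entry)).digest('hex')
    return { ...entry, sha256: digest }
  }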

Because a data transform node can consume feeds and output feeds, they could be chained together in pipelines or connected in more complex networks. People or automatons could subscribe to any node in the network to take advantage of just the type of data output at that point.

Example pipeline of transforms 1: the self-updating sourmash tree of life

Phylogenetic trees are made by aligning nucleotide sequences, analysing the alignments to compute distance measures between the sequences, and then probabilistically inferring evolutionary relationships that could explain the observed data.

Making a large phylogenetic tree is a big job - people painstakingly collect data and analyse it. Much of the analysis could be automated, but there are many working parts, and there has not really been any easy, consistent way to achieve something like this.

Here's a simplified version of how the internet of data transforms could be used to do it, expressed as a list of transform nodes. You should assume that each node in the list is subscribed to the output feed of the previous node (one of these nodes is sketched as code after the list):

  • SRA metadata - pulls new data from the NCBI Sequence Read Archive (SRA) every day, pushing a JSON object for each new entry in the archive to its output feed
  • SRA sourmasher - for every entry in the SRA metadata feed, download the read dataset(s) and run sourmash on each one. Add the resulting hash to the original SRA metadata entry, and push it to the output feed.
  • SRA novelty detector - for every entry in the SRA sourmash feed, compare the hash to a database of previous hashes, then add it to the database. If it passes some threshold for distance from anything that was already in the database, push it to the output feed along with the distance measure.
  • tree rebuilder - for every entry in the SRA novelty detector feed, rebuild the tree, and push the new version of the tree to the output feed.
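
As a heavily hedged sketch, the novelty detector node might look something like this - the signature comparison is a placeholder where a real implementation would call out to sourmash itself, and the in-memory array stands in for a proper database:

  const hypercore = require('hypercore')

  const upstreamKey  = Buffer.from(process.argv[2], 'hex')   // key of the SRA sourmasher feed
  const sourmashFeed = hypercore('./sra-sourmash', upstreamKey, { valueEncoding: 'json' })
  const noveltyFeed  = hypercore('./sra-novelty', { valueEncoding: 'json' })

  const seen = []           // stand-in for a database of previously seen signatures
  const THRESHOLD = 0.9     // hypothetical distance threshold for "novel"

  sourmashFeed.createReadStream({ live: true }).on('data', (entry) => {
    // distance to the closest thing already in the database (1 = maximally distant)
    const distance = seen.length ? Math.min(...seen.map((sig) => compare(entry.signature, sig))) : 1
    seen.push(entry.signature)
    if (distance >= THRESHOLD) {
      noveltyFeed.append({ ...entry, distance })   // only sufficiently novel datasets flow downstream
    }
  })

  function compare (a, b) { return 1 }   // placeholder for a real signature distance measure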

The sourmash tree of life project could use the dat browser JS client to live-update the tree of life data in the browser, including the version and a link to the source data. Because the feed is a versioned, append-only log, the interface could trivially let readers see how the tree has changed over time by scrolling back through the feed, compare any two entries, and see how any given sequencing dataset changed the tree.
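
Reading history back out of the feed is straightforward, because every entry keeps its position in the log. Assuming each entry of the tree feed is one full tree (as in the pipeline above), and that TREE_FEED_KEY holds the hex key published by the tree rebuilder:

  const hypercore = require('hypercore')

  const TREE_FEED_KEY = Buffer.from(process.env.TREE_FEED_KEY, 'hex')   // key published by the tree rebuilder
  const treeFeed = hypercore('./tree-feed', TREE_FEED_KEY, { valueEncoding: 'json' })

  treeFeed.ready(() => {
    treeFeed.get(0, (err, firstTree) => { /* the very first version of the tree */ })
    treeFeed.get(treeFeed.length - 1, (err, latestTree) => { /* the current version */ })

    // or walk the whole history to see how each new sequencing dataset changed the tree
    treeFeed.createReadStream({ start: 0, end: treeFeed.length })
      .on('data', (version) => { /* diff successive versions here */ })
  })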

With four data transform nodes, we've built a (vastly simplified) pipeline for a constantly updating tree of life.

Example 2: The live-updating scientific article identifying crop breeding target genes

One might ask "why not run all this analysis in one script and skip all this feeds nonsense?"

You could, but the emergent possibilities of the system would be lost.

By making each node independent and subscribable, you are creating a series of resources that serve the needs of the original pipeline, but also allow a limitless number of other uses for the data at each stage in the pipeline.

To illustrate this, here's an example:

A researcher working on C4 photosynthesis might want to subscribe to the SRA sourmash feed and attach a dependent pipeline of data transforms that looks something like this (the first step is sketched as code after the list):

  • filter to only keep RNA-Seq datasets
  • using the sourmash hash, look for datasets that appear to be from close relatives of C4 grass species or their C3 sister-clades
  • perform transcriptome assembly
  • annotate the assembly
  • quantify transcript expression
  • add the dataset into a more refined protein-sequence based phylogeny
  • merge the results into a gene expression phylogeny dataset
  • (re-)run a statistical analysis to rank gene families according to their likely importance to the functioning of C4 photosynthesis
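
The first step of that dependent pipeline is just another transform node, attached to a feed its author doesn't own - all it needs is the sourmash feed's key. The metadata field used for the filter is an assumption about what the upstream node would carry through from the SRA record:

  const hypercore = require('hypercore')

  const SOURMASH_FEED_KEY = Buffer.from(process.env.SOURMASH_FEED_KEY, 'hex')
  const sourmashFeed = hypercore('./sra-sourmash-copy', SOURMASH_FEED_KEY, { valueEncoding: 'json' })
  const rnaSeqFeed   = hypercore('./c4-rnaseq', { valueEncoding: 'json' })

  sourmashFeed.createReadStream({ live: true }).on('data', (entry) => {
    if (entry.library_strategy === 'RNA-Seq') rnaSeqFeed.append(entry)   // keep only RNA-Seq datasets
  })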

The researcher could subscribe to this final feed in their Dat desktop app, getting live updates any time there is new information. The same results could feed directly into a living publication in ScienceFair, so that all readers who have tagged the paper would be notified that there is new information.

Example 3: automated realtime updating of our scientific models

To step back from the detail of individual research programmes, let's think about how this could affect science, or our capability to update and reason with knowledge itself.

Any field of science involves attempting to explain or understand phenomena. This is usually done by constructing several different possible models for how the thing works, and then using evidence combined with probability and statistics to assign confidence to the different competing models.

Or to put it more simply: science is about understanding how stuff works. We do that by using data to decide which possible explanation for how stuff works we believe in most.

The internet of data transforms described above could form the basis of a system in which the statistical results of any scientific analysis continuously update as more information comes in. As more relevant data streams are connected to the network for a given inference, or the analysis at a given node is improved, new or better information and techniques would propagate through the entire system.
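
As a toy illustration of what a continuously updating statistical result could look like in this setting, here is a node that maintains the posterior probability of two competing models and republishes it each time a new observation arrives on an evidence feed. The models, likelihoods and field names are all invented for the example:

  const hypercore = require('hypercore')

  const EVIDENCE_FEED_KEY = Buffer.from(process.env.EVIDENCE_FEED_KEY, 'hex')
  const evidence  = hypercore('./evidence', EVIDENCE_FEED_KEY, { valueEncoding: 'json' })
  const posterior = hypercore('./posterior', { valueEncoding: 'json' })

  // two competing models of a coin-like process, standing in for real scientific models
  const models = [
    { name: 'fair',   p: 0.5, logScore: Math.log(0.5) },   // logScore starts at the log prior
    { name: 'biased', p: 0.7, logScore: Math.log(0.5) }
  ]

  evidence.createReadStream({ live: true }).on('data', (obs) => {
    // Bayes' rule in log space: each model's score accumulates log P(obs | model)
    for (const m of models) m.logScore += Math.log(obs.success ? m.p : 1 - m.p)

    // renormalise and publish the updated confidence in each model
    const logTotal = Math.log(models.reduce((sum, m) => sum + Math.exp(m.logScore), 0))
    posterior.append(models.map((m) => ({ model: m.name, probability: Math.exp(m.logScore - logTotal) })))
  })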
