# LTM

tl;dr: data management software that logs access to files.

## 1. The problem:

Modern data analysis relies on many sophisticated tools that perform a wide range of calculations on data. While software continues to evolve along with methods, data management remains a complicated problem.

First, data analysis involves a lot of trial and error. One method may work well on one dataset, but it may not work as well on another. By its nature, data analysis must be repeated many times to arrive at the best solution (if there is one). This process of trial and error, however, is costly in both time and organization. While solutions exist to mitigate these problems (for example, software that runs other software for you), these solutions are not complete.

Specifically, organization is difficult because there is no obvious and systematic way to keep track of what has been done to data. For example, when assembling sequence data, many assemblers must be run with different options to find the optimal assembler and settings. While it's possible to keep track of what one has done in a script, that approach will not capture any data analysis done outside of the script.

Second, over time, it becomes difficult to remember what analyses have been done on data. This can be addressed by appending descriptors to the filename (e.g. sample1.trimmed.qual.mapped.bam). However, this quickly becomes unwieldy and fails to capture exactly what has happened to the file (including program options).

## 2. The solution:

The project suggested here is a daemon that watches data directories using the inotify API. It records which processes and users read and modify data, storing that information in a hidden JSON log file in each directory. It will also support arbitrary metadata describing the data in the same JSON log file. For example, it can store information about how and when the data was collected. This information need not be present for all data, which provides flexibility in describing it.
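
As a rough illustration, a minimal sketch of such a watcher is below, assuming the third-party `inotify_simple` package. The directory layout, log file name, and record fields are illustrative assumptions, not a finished design. Note that plain inotify events do not identify which process or user touched a file, so capturing that part of the proposal would likely require something extra (e.g. fanotify or the Linux audit subsystem).

```python
#!/usr/bin/env python3
"""Hypothetical sketch of the LTM watcher (not the actual implementation)."""
import json
import os
import time

from inotify_simple import INotify, flags  # assumed third-party dependency

WATCH_DIR = "data"                                    # hypothetical data directory
LOG_FILE = os.path.join(WATCH_DIR, ".ltm_log.json")   # hidden JSON log

inotify = INotify()
# Watch for reads (ACCESS), writes (MODIFY), and new files (CREATE).
inotify.add_watch(WATCH_DIR, flags.ACCESS | flags.MODIFY | flags.CREATE)

while True:
    for event in inotify.read():  # blocks until events arrive
        if event.name == os.path.basename(LOG_FILE):
            continue  # avoid logging writes to the log itself
        record = {
            "file": event.name,
            "events": [str(f) for f in flags.from_mask(event.mask)],
            "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
            # Arbitrary metadata could be merged in here, e.g.
            # {"collected": "2015-01-10", "machine": "HiSeq 2500"}.
        }
        # One JSON record per line, appended to the hidden log.
        with open(LOG_FILE, "a") as log:
            log.write(json.dumps(record) + "\n")
```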

Second, this project will provide a local webserver to access and modify the logs in a user-friendly manner. Because the log format is standard JSON, we can build upon previous web applications to quickly build a web front end, similar to CouchDB's Futon (http://docs.couchdb.org/en/1.6.1/intro/futon.html).
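
A minimal sketch of what that local server could look like, assuming Flask; the route names and on-disk layout are hypothetical and simply mirror the watcher sketch above:

```python
"""Hypothetical sketch of the local log viewer/editor."""
import json

from flask import Flask, jsonify, request

LOG_FILE = "data/.ltm_log.json"   # same hidden log as the watcher sketch
app = Flask(__name__)


def read_log():
    # One JSON record per line (the format used in the watcher sketch).
    with open(LOG_FILE) as fh:
        return [json.loads(line) for line in fh if line.strip()]


@app.route("/log", methods=["GET"])
def show_log():
    return jsonify({"entries": read_log()})


@app.route("/log", methods=["POST"])
def append_metadata():
    # Append a user-supplied metadata record, e.g.
    # curl -X POST -H "Content-Type: application/json" \
    #      -d '{"file": "sample1.bam", "collected": "2015-01-10"}' \
    #      http://localhost:5000/log
    record = request.get_json()
    with open(LOG_FILE, "a") as fh:
        fh.write(json.dumps(record) + "\n")
    return jsonify(record), 201


if __name__ == "__main__":
    app.run(port=5000)  # local only, in the spirit of Futon
```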

@rossibarra

Not sure I see the utility of the webserver. Presumably this is most useful for analyses on clusters etc. with lots of commandline work. A text file I can grep or less through is more useful than a webserver. One thing that might be more useful is a pretty-formatted output possibility -- say in markdown -- for reading legibility.

@arundurvasula (Author)

True, but not everyone has access to a cluster. Less computationally minded labs might only have one server but still deal with a lot of data. It's not ideal, but they still need to get stuff done on it.

I agree that the webserver should be secondary and that grepping and lessing (or whatever else) through a pretty-printed file is important.
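
For instance, a pretty-printed markdown dump could be as simple as the sketch below; the log path and field names just follow the earlier hypothetical sketches:

```python
"""Hypothetical sketch: dump the JSON-lines log as a markdown table for grep/less."""
import json

LOG_FILE = "data/.ltm_log.json"

print("| time | file | events |")
print("|------|------|--------|")
with open(LOG_FILE) as fh:
    for line in fh:
        if not line.strip():
            continue
        rec = json.loads(line)
        print("| {time} | {file} | {events} |".format(
            time=rec.get("time", ""),
            file=rec.get("file", ""),
            events=", ".join(rec.get("events", []))))
```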
