@pschella
Last active August 24, 2017 16:48
Use Cases for the LSST DM Data Butler Working Group

Personas

  • Dave the Developer (e.g. pipeline developer)
  • Susy the Astronomer (e.g. an astronomer from the general user community)
  • Otto the Operator (e.g. person running pipelines on a cluster in operations)

Use cases

  • Susy is going to a conference and wants to pre-cache some data from a remote repository so that she can access it through the butler without network connectivity.
  • Susy and her colleagues want to access the same (or overlapping) data from a remote repository. It would be efficient if this data could be cached in an on-site proxy.
  • Susy (or the task she is running) wants to load a dataset through the butler, but the dataset is too large for her device's memory. What does the butler do?
  • Susy wants to access metadata associated with a dataset. Does the butler need to load the entire dataset?
  • Dave needs, for performance reasons, to do a direct memory map of a (part of a) file. Is this possible with a dataset provided by the butler (i.e. can code do low-level IO)?
  • Susy / Dave needs a large dataset from the butler. Can she / he load it asynchronously (e.g. does the butler support multiple asynchronous, potentially simultaneous accesses)?
  • Otto needs to read/write many new datasets from/to a central repository. Can this be done as a single (ACID) transaction?
  • Susy wants to build a query interactively (e.g. using tab-complete or some other predictive help that knows about the dataflow).
  • Susy wants to get images from LSST and another telescope served in a consistent fashion. This would make joint processing with Euclid and WFIRST natural with the stack, and could also be used by other astronomers with their own smaller datasets (think: deconfusing Spitzer with LSST).
  • Susy / Dave writes a new algorithm that produces a new data product and wants to be able to read/write this new product from/to any repository without a lot of work.
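
The memory-map use case above can be made concrete with a small sketch. It does not use the butler at all: it assumes, hypothetically, that the butler could hand Dave a local file path for a dataset, which is exactly the open question in that use case. The function name `read_slice_mmap` and the file layout (a flat array of little-endian float64 values) are illustrative assumptions, not part of any LSST API.

```python
import mmap
import os
import struct
import tempfile


def read_slice_mmap(path, offset, length):
    """Return `length` bytes starting at byte `offset` of the file at `path`.

    The file is memory-mapped read-only, so only the pages that are
    actually touched need to be read from disk.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; this reserves address space only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return bytes(mm[offset:offset + length])


def demo():
    """Write 1000 float64 values to a temporary file and read back three."""
    values = [float(i) for i in range(1000)]
    # Hypothetical stand-in for a file that a butler might expose.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(struct.pack("<1000d", *values))
        path = tmp.name
    try:
        item_size = struct.calcsize("<d")  # 8 bytes per float64
        raw = read_slice_mmap(path, 100 * item_size, 3 * item_size)
        return struct.unpack("<3d", raw)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Whether the butler should expose file paths or file descriptors at all, rather than only deserialized objects, is the design question this use case raises; the sketch only shows what the low-level access would look like if it did.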

Deprecated

  • Otto wants to distribute multiple pipeline runs over a set of nodes. He (or the scheduling framework) needs to know data locality. Does he ask the butler?
  • Dave / Otto needs to access a part of a data object that is loaded on another cluster node. Can he use RDMA to access that data directly in memory without pulling over a serialized version (as is supported over e.g. InfiniBand)?
@TallJimbo

On data locality: as we imagined it in the SuperTask WG, that's something discovered through the SuperTask control system and imposed by a workflow system that stages the appropriate files on the nodes where they'll be processed. The SuperTask WG did not identify any requirement for running on a system where a data repository is spread across multiple nodes and some nodes are "closer" to some datasets than others.

@pschella
Author

I'm deprecating the multi-node use cases as per the previous comment.