pschella/butler-use-cases.md

## butler-use-cases.md

      
Personas


Dave the Developer (e.g. pipeline developer)
Susy the Astronomer (e.g. general public astronomer user)
Otto the Operator (e.g. person running pipelines on a cluster in operations)

Use cases


Susy is going to a conference and wants to pre-cache some data from a remote repository so that she can access it through the butler while offline.
Susy and her colleagues want to access the same (or overlapping) data from a remote repository. It would be more efficient if this data could be cached in an on-site proxy.
Susy (or the task she is running) wants to load a dataset through the butler, but the dataset is too large to fit in her device's memory. What does the butler do?
Susy wants to access metadata associated with a dataset. Does the butler need to load the entire dataset?
For performance reasons, Dave needs to memory-map a file (or part of one) directly. Is this possible with a dataset provided by the butler (i.e. can code do low-level I/O)?
Susy / Dave needs a large dataset from the butler. Can she / he fetch it asynchronously (e.g. does the butler support multiple asynchronous, potentially simultaneous accesses)?
Otto needs to read/write many new datasets from/to a central repository. Can this be done as a single (ACID) transaction?
Susy wants to build a query interactively (e.g. using tab completion or some other predictive help that knows about the dataflow).
Susy wants to get images from LSST and another telescope served in a consistent fashion. This would make joint processing with Euclid and WFIRST natural with the stack, and could also be used by other astronomers with their own smaller data sets (think: deconfusing Spitzer with LSST).
Susy / Dave writes a new algorithm that produces a new data product and wants to be able to read/write this new product to any repository without a lot of work.
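
To make the memory-mapping use case concrete, here is a minimal sketch of what Dave would want to do once he has a file on local disk. It uses only the Python standard library and is not a butler API: the helper name `mapped_region` is hypothetical, and the sketch assumes the butler could hand back a local file path plus a byte offset and length, which this document does not specify.

```python
import mmap
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def mapped_region(path, offset, length):
    """Yield a read-only, zero-copy view of `length` bytes at `offset`.

    Hypothetical helper, not part of any butler API. mmap requires the
    mapping offset to be a multiple of mmap.ALLOCATIONGRANULARITY, so the
    offset is rounded down and the requested bytes are exposed as a slice
    of the mapping.
    """
    granularity = mmap.ALLOCATIONGRANULARITY
    aligned = (offset // granularity) * granularity
    delta = offset - aligned
    with open(path, "rb") as f, mmap.mmap(
        f.fileno(), delta + length, access=mmap.ACCESS_READ, offset=aligned
    ) as mapping:
        # Both memoryviews are released before the mapping is closed.
        with memoryview(mapping) as whole, whole[delta:delta + length] as view:
            yield view


if __name__ == "__main__":
    # Stand-in for a dataset file; a real one would come from a repository.
    payload = bytes(i % 251 for i in range(200_000))
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    try:
        with mapped_region(tmp.name, 70_000, 16) as view:
            print(bytes(view) == payload[70_000:70_016])
    finally:
        os.unlink(tmp.name)
```

The open question in the use case is whether the butler exposes enough (a path, or a file descriptor and offset) for code like this to work, or whether datasets are only available as deserialized objects.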

Deprecated


Otto wants to distribute multiple pipeline runs over a set of nodes. He (or the scheduling framework) needs to know data locality. Does he ask the butler?
Dave / Otto needs to access a part of a data object that is loaded on another cluster node. Can he use direct RDMA to access that data in memory without pulling over a serialized version (as is supported over e.g. InfiniBand)?