- Dave the Developer (e.g. pipeline developer)
- Susy the Astronomer (e.g. general public astronomer user)
- Otto the Operator (e.g. person running pipelines on a cluster in operations)
- Susy is going to a conference and wants to pre-cache some data from a remote repository so that she can access it through the butler while offline.
- Susy and her colleagues want to access the same (or overlapping) data from a remote repository. It would be efficient if this could be cached in an on-site proxy.
- Susy (or the task she is running) wants to load a dataset through the butler, but the dataset is too large to fit in her device's memory. What does the butler do?
- Susy wants to access metadata associated with a dataset. Does the butler need to load the entire dataset?
- For performance reasons, Dave needs to memory-map (part of) a file directly. Is this possible with a dataset provided by the butler (i.e. can code do low-level I/O)?
- Susy or Dave needs a large dataset from the butler. Can she or he fetch it asynchronously (e.g. does the butler support multiple asynchronous, potentially simultaneous, accesses)?
- Otto needs to read/write many new datasets from/to a central repository. Can this be done as a single (ACID) transaction?
- Susy wants to build a query interactively (e.g. using tab-complete or some other predictive help that knows about the dataflow)
- Susy wants to get images from LSST and another telescope served in a consistent fashion. This would make joint processing with Euclid and WFIRST natural within the stack, and could also be used by other astronomers with their own smaller data sets (think: deconfusing Spitzer with LSST).
- Susy or Dave writes a new algorithm that produces a new data product and wants to read and write this product from and to any repository without a lot of work.
- Otto wants to distribute multiple pipeline runs over a set of nodes, so he (or the scheduling framework) needs to know about data locality. Does he ask the butler?
- Dave or Otto needs to access part of a data object that is loaded on another cluster node. Can he use RDMA (as supported over e.g. InfiniBand) to access that data in memory directly, without pulling over a serialized version?
On data locality: as we imagined it in the SuperTask WG, that's something discovered through the SuperTask control system and imposed by a workflow system that stages the appropriate files on the nodes where they'll be processed. The SuperTask WG did not identify any requirement for running on a system in which a data repository is spread across multiple nodes with some nodes "closer" to some datasets than others.
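To make the memory-mapping use case above concrete, here is a minimal sketch of the kind of low-level I/O Dave has in mind. It uses only the Python standard library and assumes the dataset is backed by an ordinary local file whose path is known. `read_region` is a hypothetical helper written for this illustration, not part of any butler API.

```python
import mmap
import os
import tempfile


def read_region(path, offset, length):
    """Return `length` bytes starting at `offset`, mapping only that region.

    Hypothetical helper: assumes `path` is a regular local file and that
    offset + length does not exceed the file size.
    """
    # mmap offsets must be a multiple of the allocation granularity,
    # so round down and skip the extra leading bytes afterwards.
    granularity = mmap.ALLOCATIONGRANULARITY
    aligned = (offset // granularity) * granularity
    delta = offset - aligned
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), length + delta,
                       access=mmap.ACCESS_READ, offset=aligned) as mm:
            return bytes(mm[delta:delta + length])


# Demo on a throwaway file standing in for a dataset on disk.
payload = bytes(range(256)) * 400  # 102400 bytes
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(payload)

chunk = read_region(demo_path, 70000, 16)
os.remove(demo_path)
```

The sketch only works if the caller can obtain a real file path or file descriptor for the dataset. Whether the butler exposes one, or always mediates access, is exactly the open question in the use case.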