Use Cases for the LSST DM Data Butler Working Group

Definitions

Data Repository

A collection of datasets and the metadata that describes them, managed by the Butler. Datasets in a repository share at least some provenance (e.g. configuration and software versions). Data repositories may be chained together, enabling the outputs of one step of processing to be used as the inputs for the next, and the graph of related data repositories defines a search path for the datasets within them.
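
To make the search-path idea concrete, here is a toy illustration (not Butler code) of how a lookup on a chained output repository might fall back to its input repositories:

    class ToyRepo:
        """Toy stand-in for a Data Repository; not Butler code."""

        def __init__(self, datasets, parents=()):
            self.datasets = datasets      # {(dataset_type, data_id_tuple): object}
            self.parents = list(parents)  # chained input repositories, searched in order

        def get(self, key):
            if key in self.datasets:
                return self.datasets[key]
            for parent in self.parents:   # fall back along the chain
                try:
                    return parent.get(key)
                except KeyError:
                    pass
            raise KeyError(key)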

Data ID

A (possibly opaque) identifier that can be used to get a single dataset from, or put one into, a Butler.
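
In the current Butler a Data ID is an ordinary Python dict; whether the new design keeps that form is an open question. An illustrative example (keys and values are made up):

    # A Data ID in the current Butler: a plain dict whose keys identify a
    # single dataset of a given type.  Keys shown are illustrative.
    data_id = {"visit": 903334, "ccd": 23, "filter": "r"}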

Data ID Expression

A user-provided description of a group of datasets, such as the --id arguments to a CmdLineTask. The metadata in a Data Repository may be necessary to transform a Data ID Expression into a set of Data IDs.
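
For example, using roughly the current CmdLineTask syntax (values are illustrative; '^' separates alternatives and '..' gives a range), an expression and the Data IDs it might expand to given the repository metadata:

    # A Data ID Expression as it might appear on a CmdLineTask command line:
    #     --id visit=903334^903338 ccd=0..2 filter=r
    expression = "visit=903334^903338 ccd=0..2 filter=r"
    # ...and the concrete Data IDs it could expand to (illustrative only):
    expanded = [
        {"visit": v, "ccd": c, "filter": "r"}
        for v in (903334, 903338)   # '^' separates alternatives
        for c in (0, 1, 2)          # '..' is a range
    ]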

Usage Contexts

Batch Production Service

The official LSST high-throughput compute environment used for DRP processing, at least most CPP processing, and large-scale test processing. Implemented at NCSA and CC-IN2P3, and probably extendable to external compute resources. May not have a shared filesystem, but does have access to Data Backbone.

Level 1 Processing

The compute environment for LSST alert production, in which data is delivered directly from camera buffers and must be processed with low latency.

External HPC

A compute environment on traditional High-Performance Computing resources not managed directly by LSST, such as a university compute cluster. Assumed to have a shared filesystem, but does not have access to Data Backbone. Also covers DM HPC resources for developers (e.g. lsst-dev) prior to standing up the Batch Production Service.

Local Science Platform

The compute environment experienced directly by users of the LSST Science Platform notebook environment (i.e. the environment provided by the notebook kernels and shell themselves). Has access to Data Backbone, possibly mediated by Remote Data Access Services.

Science Platform Batch Computing

The compute environment used to process longer-running, asynchronous jobs launched by Science Platform users (either notebook or portal). May be the same as the Batch Production Service. Has access to Data Backbone, probably not mediated by Remote Data Access Services.

Remote Data Access Services

The compute environment seen by users who access LSST data from external systems using data access services (e.g. VO APIs).

Developer Laptops/Workstations

The compute environment of LSST developers (especially during construction) and power users on their own single-node systems. Any data repositories are assumed to be local (access to remote data is covered by Remote Data Access Services).

Use Cases

SuperTask Pre-Flight Metadata Querying

The SuperTask control system passes a Data ID Expression and a Data Repository to the data access system, obtaining a graph-like data structure describing the datasets present in the repository and the relationships between them.

Contexts: Batch Production Service, External HPC, Local Science Platform, Developer Laptops/Workstations, Remote Data Access Services1.
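
There is no existing Butler call for this, so the following is only a sketch of the kind of graph-like structure the control system might receive; all names are placeholders:

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class DatasetNode:
        """One dataset known to the repository (names are placeholders)."""
        dataset_type: str
        data_id: Dict[str, object]

    @dataclass
    class PreflightGraph:
        """Result of the hypothetical pre-flight query: the datasets matching
        the Data ID Expression plus producer/consumer relationships between
        them, which the control system turns into a workflow."""
        datasets: List[DatasetNode] = field(default_factory=list)
        # edges: (producer index, consumer index) pairs into `datasets`
        edges: List[Tuple[int, int]] = field(default_factory=list)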

SuperTask Execution

Given Data IDs for multiple input and output datasets, a SuperTask writes the output datasets from memory to storage and reads the input datasets from storage into memory.

It must be possible to configure the Butler in this use case to elide any actual I/O for certain datasets when the workflow system can determine that the only consumers of an output dataset will be run in the same process. Similarly, it must be possible to store certain datasets in local scratch space when some or all consumers will be able to see the same scratch space (and/or it is determined that later staging from scratch to persistent storage is preferable).

It must be possible to configure the Butler to use unique filenames (across at least all datasets used by a particular production run) for each dataset in at least some of the contexts in which this use case appears.

Contexts: Batch Production Service, External HPC, Developer Laptops/Workstations, Level 1 Processing, Science Platform Batch Computing
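
A rough sketch of the execution pattern using current Gen2-style get/put calls; the SuperTask interface itself is still being designed, so the function and dataset type names below are placeholders:

    def process(raw):
        """Stand-in for the SuperTask's science code (not shown)."""
        return raw

    def run_quantum(butler, input_id, output_id):
        """Sketch of a single SuperTask execution unit: read the inputs named
        by input_id, run the science code, write the outputs named by
        output_id.  Dataset type names ("raw", "calexp") are illustrative."""
        raw = butler.get("raw", dataId=input_id)        # input: storage -> memory
        calexp = process(raw)
        butler.put(calexp, "calexp", dataId=output_id)  # output: memory -> storage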

SuperTask Input Transfer

Given a set of Data IDs representing the input datasets for the work to be done on a particular compute resource (e.g. a node, or group of nodes with access to the same filesystem), a workflow system stages data from a remote repository (possibly in the Data Backbone) to that resource's filesystem.

Contexts: Batch Production Service, Level 1 Processing (?), Science Platform Batch Computing (?), Remote Data Access Services1
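
A rough sketch of how a workflow system might drive such a staging step. It assumes the source Butler can report a file path for each dataset (getUri below is the Gen2-era method; whether an equivalent survives in the new design is part of what this use case asks for), and it ignores composite datasets and local repository metadata:

    import os
    import shutil

    def stage_inputs(src_butler, dataset_type, data_ids, dest_root):
        """Copy the files backing the given datasets into local scratch space
        rooted at dest_root (sketch only)."""
        os.makedirs(dest_root, exist_ok=True)
        for data_id in data_ids:
            src_path = src_butler.getUri(dataset_type, data_id)  # Gen2-style path lookup
            shutil.copyfile(src_path, os.path.join(dest_root, os.path.basename(src_path)))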

SuperTask Output Transfer

Given a local Data Repository containing SuperTask outputs from a particular compute resource and a set of Data IDs, a workflow system transfers the outputs and any repository-level provenance back to a remote repository (possibly in the Data Backbone or Local Science Platform storage).

Contexts: Batch Production Service, Level 1 Processing (?), Science Platform Batch Computing (?)

Analysis Metadata Querying

Given a Data ID Expression, a dataset type, and a Data Repository, a science user or DM developer in an interactive Python environment or one-off script retrieves a set of Data IDs that can be used to load the datasets that match the expression.

Contexts: Local Science Platform, Remote Data Access Services, External HPC, Developer Laptops/Workstations
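
With the current (Gen2) Butler this looks roughly like the following; the repository path, dataset type, and keys are illustrative, and queryMetadata may well change shape under the new design:

    from lsst.daf.persistence import Butler

    butler = Butler("/path/to/repo")
    # Which (visit, ccd) combinations have r-band raw data in this repository?
    visit_ccd_pairs = butler.queryMetadata("raw", ["visit", "ccd"], dataId={"filter": "r"})
    data_ids = [{"visit": v, "ccd": c, "filter": "r"} for v, c in visit_ccd_pairs]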

Analysis Dataset Access

Given a Data ID or a Data ID Expression that resolves to a single Data ID, a dataset type, and a Data Repository, a science user or DM developer in an interactive Python environment or one-off script loads a dataset from the repository into memory. In contexts where the Data Repository is remote, it should be possible to configure the Butler to populate a local Data Repository on-the-fly with any retrieved datasets, allowing the local repository to be used for subsequent reads.

Contexts: Local Science Platform, Remote Data Access Services, External HPC, Developer Laptops/Workstations
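
In current Gen2 terms the basic read is roughly as follows (path, dataset type, and Data ID are illustrative):

    from lsst.daf.persistence import Butler

    butler = Butler("/path/to/repo")                 # illustrative repository path
    data_id = {"visit": 903334, "ccd": 23}           # illustrative Data ID
    calexp = butler.get("calexp", dataId=data_id)    # load the calibrated exposure into memory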

Dataset Type Registration

A SuperTask or interactive user defines a new dataset type, supplying a new name to go with a Data ID form composed of familiar components2 and a description of the in-memory class type, and optionally a path template and on-disk format. This step is required before any instances of that dataset type can be added to the repository.

Contexts: Batch Production Service, External HPC, Local Science Platform, Developer Laptops/Workstations
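
No such registration step exists in the current Butler, so the sketch below only records the information a caller would have to supply; the field names are placeholders, not a proposed API:

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class DatasetTypeDefinition:
        """Hypothetical record of what registering a new dataset type requires."""
        name: str                             # e.g. "deepCoadd_meas"
        data_id_keys: Tuple[str, ...]         # familiar components / Units, e.g. ("tract", "patch", "filter")
        python_type: str                      # in-memory class, e.g. "lsst.afw.table.SourceCatalog"
        path_template: Optional[str] = None   # optional on-disk path template
        storage_format: Optional[str] = None  # optional on-disk format, e.g. "FitsCatalogStorage"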

Raw Data Ingest

An automated part of the camera-DM interface, or a science user or developer with non-LSST data, adds raw data and its associated metadata to a Data Repository, both adding new datasets and registering units of data for them3. The Data Repository may be pre-existing and may already have datasets that duplicate or conflict with the new ones, and the caller should be able to choose whether those datasets are replaced or skipped.

Contexts: Level 1 Processing, External HPC, Developer Laptops/Workstations
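
The conflict-handling requirement can be sketched as follows; the two callables stand in for whatever registration interface the Butler eventually provides, and none of these names are an existing API:

    def ingest_raw(files, already_present, register_dataset, on_conflict="skip"):
        """Sketch of raw ingest with caller-controlled conflict handling."""
        if on_conflict not in ("skip", "replace"):
            raise ValueError("on_conflict must be 'skip' or 'replace'")
        for f in files:
            if already_present(f) and on_conflict == "skip":
                continue             # leave the existing dataset untouched
            register_dataset(f)      # add/replace the dataset and its units of data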

SkyMap Registration

A science user or DM developer adds a new tessellation of the sky into tracts and patches to a Data Repository, allowing future datasets to be defined that utilize these units of data.

Contexts: Batch Production Service, External HPC, Developer Laptops/Workstations, Local Science Platform

Metadata Updating

A SuperTask has improved our estimates of the metadata associated with a unit of data (such as the position of a sensor-level image on the sky), and the SuperTask control system updates the metadata in a Data Repository accordingly (actually, a chained output Data Repository should be created with the updated metadata; the input should not be modified in-place).

Contexts: Batch Production Service, External HPC, Developer Laptops/Workstations

Footnotes

1. I'd call this a stretch goal, not a requirement, but it'd be pretty nice if you could do SuperTask preflight on your laptop or local HPC using DAX services to do the metadata lookups, stage the data identified by that lookup to your system, and then do the SuperTask execution there.

2. In the current butler, these "familiar components" are the keys of the Data ID dict. The SuperTask WG is proposing a new approach in which these components become class objects called Units (for "units of data").

3. In the current butler, "registering new units of data" means adding rows to the registry. As with footnote 2, the SuperTask WG proposal clarifies and more strongly enforces the distinction between adding datasets and adding units of data.
