This document is very much a work in progress.
Databroker projections provide a method to map from beamline-specific Databroker structures and names to common data structures and names. They do this through a mapping document (a [projection](https://github.com/bluesky/event-model/blob/8be9bd49b0ce76b64a9a111c1091364beedb1671/event_model/schemas/run_start.json#L8)) and Python code that takes a BlueskyRun and a Projection as inputs and returns an xarray.Dataset that maps fields and values from the BlueskyRun.
Projections inherently perform two separate forms of mapping: semantic and structural.
Projections map field names from one ontology to another. If RunA were to define an array of "whites" and another run were to define an array of "brights", then each run could have a separate Projection that maps those fields to a projected field called "flats".
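The renaming described above can be sketched in a few lines of plain Python. This is an illustration only: the dictionaries stand in for runs and for the name-mapping part of a Projection, and `apply_name_mapping` is a hypothetical helper, not a Databroker function.

```python
# Illustration only: plain dicts stand in for BlueskyRuns and Projections.
# `apply_name_mapping` is a hypothetical helper, not a Databroker function.

def apply_name_mapping(run_fields, name_mapping):
    """Return a new dict keyed by projected names instead of source names."""
    return {
        projected: run_fields[source]
        for projected, source in name_mapping.items()
    }


# Two runs that store the same kind of data under different names.
run_a = {"whites": [[1, 2], [3, 4]]}
run_b = {"brights": [[5, 6], [7, 8]]}

# One mapping per run, both targeting the common name "flats".
mapping_a = {"flats": "whites"}
mapping_b = {"flats": "brights"}

projected_a = apply_name_mapping(run_a, mapping_a)
projected_b = apply_name_mapping(run_b, mapping_b)

print(sorted(projected_a))  # ['flats']
print(sorted(projected_b))  # ['flats']
```

After projection, code that consumes the data only needs to know the name "flats", regardless of which run it came from.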
Bluesky's Event Model defines various schemas for how to structurally arrange scan data. Even within this structure, choices can be made about WHERE to put information. For example, if one has a separation between data frames, flat frames and dark frames, one could choose to put the flat and dark frames in their own "streams" or to add them to the same stream as the data frames. Additionally, one could choose to name the stream that contains data "primary" or "data" or "zaphod".
Since the goal of Projections and Projectors is to output a single data structure that maps from different data sets, and whose fields and arrays can be fed to analysis tools without further mapping, the output structure of a projected BlueskyRun hides much of the Bluesky structure. For example, the xarray.Dataset that a projector outputs contains:
- "start doc" information in dataset.attrs
- "event" field information in the dataset fields
- "event descriptor" configuration fields in each dataset field
In this simple example, the projected xarray.Dataset will contain two fields: one in the top-level attrs and the other as a "column":
"projection": {
"image_data": {
"type": "linked",
"location": "event",
"stream": "primary",
"field": ":exchange:data"
},
"sample_name": {
"type": "configuration",
"field": ":measurement:sample:name",
"location": "start"
}
}
Notice the "location" fields. "event" instructs the Projector to grab the field from an event stream, while "start" instructs the Projector to grab a field from the run start document.
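The lookup by location can be sketched as follows. This is a minimal sketch under stated assumptions, not the Databroker projector: nested dicts stand in for a BlueskyRun, the returned dict (with "attrs" and "data_vars" keys) stands in for an xarray.Dataset, the run contents are made up, and `project` is a hypothetical function.

```python
# Sketch only: dicts stand in for a BlueskyRun and for xarray.Dataset.
# `project` is a hypothetical function, not the Databroker projector.

def project(run, projection):
    """Resolve each projected field according to its "location"."""
    attrs = {}      # stands in for xarray.Dataset.attrs
    data_vars = {}  # stands in for the Dataset's data variables
    for name, spec in projection.items():
        location = spec["location"]
        if location == "start":
            # Look the field up in the run start document.
            attrs[name] = run["start"][spec["field"]]
        elif location == "event":
            # Look the field up in the named event stream.
            stream = run["streams"][spec["stream"]]
            data_vars[name] = stream[spec["field"]]
        else:
            raise ValueError(f"unsupported location: {location!r}")
    return {"attrs": attrs, "data_vars": data_vars}


projection = {
    "image_data": {
        "type": "linked",
        "location": "event",
        "stream": "primary",
        "field": ":exchange:data",
    },
    "sample_name": {
        "type": "configuration",
        "field": ":measurement:sample:name",
        "location": "start",
    },
}

# Made-up run contents, for illustration only.
run = {
    "start": {":measurement:sample:name": "sample_1"},
    "streams": {"primary": {":exchange:data": [[0, 1], [2, 3]]}},
}

result = project(run, projection)
print(result["attrs"])      # {'sample_name': 'sample_1'}
print(result["data_vars"])  # {'image_data': [[0, 1], [2, 3]]}
```

The sketch ignores the "type" key; it only shows how "location" and "stream" decide where a field is read from and where it lands in the output.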
The current Projection and Projector mechanism is intimately tied to BlueskyRuns. However, mapping from disparate datasets in disparate formats using different ontologies is a fairly common problem. In order to use Projections in those scenarios, one is currently forced to "ingest" a dataset into Event Model documents (or a BlueskyRun). But there is a great deal of data at rest for which this seems like overkill. Additionally, with the new Tiled project, there is a further emphasis on accessing data "at rest" and providing it to clients in simpler structures than BlueskyRun or Event Model documents. However, the desire to provide semantic mapping remains.
To do this, I propose breaking the tie between Projections and Databroker and moving Projections into their own repository. I envision a suite of Projectors, each specialized for a type of source data. For example, there could be Projectors that provide mappings for CSV, HDF5, etc. Perhaps this builds on top of Tiled and its output types. In this scenario, the current Databroker projections and projectors could be considered specializations of Projections and Projectors.
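One possible shape for such a suite is sketched below. Everything here is hypothetical: the `Projector` base class, the `CSVProjector` specialization, and the simplified projection format (projected name mapped to a source field) are assumptions made for illustration, and none of them exist in Databroker or Tiled.

```python
# Hypothetical design sketch; none of these classes exist in Databroker or Tiled.
import csv
import io
from abc import ABC, abstractmethod


class Projector(ABC):
    """Maps source-specific field names to projected names."""

    def __init__(self, projection):
        # Simplified, assumed format: {projected_name: {"field": source_name}}
        self.projection = projection

    @abstractmethod
    def read_field(self, source, field):
        """Return all values of `field` from `source`."""

    def project(self, source):
        return {
            name: self.read_field(source, spec["field"])
            for name, spec in self.projection.items()
        }


class CSVProjector(Projector):
    """Specialization for CSV text with a header row."""

    def read_field(self, source, field):
        reader = csv.DictReader(io.StringIO(source))
        # Values are returned as strings, exactly as read from the CSV.
        return [row[field] for row in reader]


csv_text = "temp_K,counts\n300,12\n310,15\n"
projector = CSVProjector({"temperature": {"field": "temp_K"}})
projected = projector.project(csv_text)
print(projected)  # {'temperature': ['300', '310']}
```

In this arrangement only `read_field` is source-specific, so an HDF5 or BlueskyRun specialization would override that one method and reuse the shared projection logic.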