@SimonKrughoff
Last active August 31, 2017 16:57
List of butler use cases for Science Platform

Use cases for Science Platform

  • Subset repositories based on datasets and data IDs.
    • Given this list of data IDs, I need a coherent, self-consistent, standalone repo of processed visit images (PVIs) and deep coadds.
      • This means the butler will need to infer data IDs for particular datasets from the data IDs of others, e.g. infer the coadd tract and patch from the PVI visit and ccdnum.
  • We need a mechanism for discovering data along multiple axes (good seeing, bad seeing, time, area of sky).
    • For example, if the image characterization pipeline publishes relevant results (e.g. seeing) to the Data Backbone (DBB), later pipelines should be able to query on them.
  • I/O plugin needs to be configurable on a dataset basis.
    • The example is that a user may want intermediate files to go to the local POSIX system, but may want end products to go to a more permanent storage location.
  • Datasets need to be easy to define. Every dataset need not implement persistence for every storage context (why would you write an RDBMS storage engine for ImageF?), but implementing persistence for "reasonable" storage contexts should be straightforward.
    • Examples from SQuaRE would be lsst.verify.Measurement or lsst.verify.Job.
  • Another use case from the QA world: it would be nice to be able to specify datasets based on SQL (or ADQL) queries. If we have fixed schemas, that would allow us to say Butler.get('high_snr_stars') rather than Butler.get('select * from src where psfFlux/psfFlux_err > 100 and isStar == 1').
  • Sometimes we have outputs from QA that are related to measurements of metrics but are not instances of specific LSST classes (e.g. a whisker diagram of the PSF ellipticity as a function of focal plane position). Do we care about being able to Butler.put binary blobs like that?
  • We will want to access the same datasets from multiple runs using different code/configuration.
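
The query-defined dataset idea above can be sketched roughly as follows. This is only an illustration, not a proposal for the butler API: the NAMED_QUERIES table, the get_named helper, and the demo catalog rows are all hypothetical, and an in-memory SQLite database stands in for whatever catalog store would really hold the src table. Only the column names and the high_snr_stars cut come from the use case itself.

```python
import sqlite3

# Hypothetical registry: dataset name -> SQL that defines the dataset.
# It only makes sense if the src schema is fixed, as the use case assumes.
NAMED_QUERIES = {
    "high_snr_stars": (
        "SELECT * FROM src "
        "WHERE psfFlux / psfFlux_err > 100 AND isStar = 1"
    ),
}


def get_named(conn, name):
    """Return all rows of a query-defined dataset (hypothetical helper)."""
    try:
        sql = NAMED_QUERIES[name]
    except KeyError:
        raise LookupError(f"no query-defined dataset named {name!r}") from None
    return conn.execute(sql).fetchall()


def make_demo_catalog():
    """Build a tiny in-memory src table filled with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE src (id INTEGER PRIMARY KEY, psfFlux REAL, "
        "psfFlux_err REAL, isStar INTEGER)"
    )
    conn.executemany(
        "INSERT INTO src VALUES (?, ?, ?, ?)",
        [
            (1, 5000.0, 10.0, 1),  # S/N 500, star: selected
            (2, 300.0, 10.0, 1),   # S/N 30, star: rejected by the S/N cut
            (3, 9000.0, 10.0, 0),  # S/N 900, not a star: rejected
        ],
    )
    return conn


conn = make_demo_catalog()
rows = get_named(conn, "high_snr_stars")
print([row[0] for row in rows])  # prints [1]
```

Keeping the SQL in one named table means QA code refers to a stable dataset name, and the cut can be changed in a single place.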

Implementation concerns

  • Will SQLite databases for the registry scale?
  • It's clear that we will need an I/O module that can interact with an object store (e.g. S3). This means that the design should be flexible with respect to the limitations of the various systems, e.g. lack of atomic consistency on object stores, inode contention on GPFS.
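
One way to picture I/O that is configurable per dataset and tolerant of different backends is a small routing layer, sketched below. Everything here is an assumption made for illustration: the Datastore interface, the PosixDatastore, InMemoryObjectStore and DatasetRouter classes, and the dataset type names are hypothetical, and the in-memory class is only a stand-in for an object store such as S3 (it makes no network calls).

```python
import os
import tempfile
from abc import ABC, abstractmethod


class Datastore(ABC):
    """Hypothetical minimal storage interface: bytes in, bytes out."""

    @abstractmethod
    def put(self, key, data): ...

    @abstractmethod
    def get(self, key): ...


class PosixDatastore(Datastore):
    """Stores each object as a file under a root directory."""

    def __init__(self, root):
        self.root = root

    def put(self, key, data):
        path = os.path.join(self.root, key)
        directory = os.path.dirname(path)
        os.makedirs(directory, exist_ok=True)
        # Write to a temporary file, then rename it into place, so that
        # readers never observe a partially written file.
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp_path, path)

    def get(self, key):
        with open(os.path.join(self.root, key), "rb") as handle:
            return handle.read()


class InMemoryObjectStore(Datastore):
    """Stand-in for an object store; a real one would call a service."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]


class DatasetRouter:
    """Sends each dataset type to the datastore configured for it."""

    def __init__(self, routes, default):
        self.routes = routes    # dataset type -> Datastore
        self.default = default  # used when a type has no explicit route

    def _store_for(self, dataset_type):
        return self.routes.get(dataset_type, self.default)

    def put(self, dataset_type, key, data):
        self._store_for(dataset_type).put(f"{dataset_type}/{key}", data)

    def get(self, dataset_type, key):
        return self._store_for(dataset_type).get(f"{dataset_type}/{key}")


# Illustrative configuration: end products go to the "permanent" store,
# everything else (intermediates) goes to local POSIX scratch space.
with tempfile.TemporaryDirectory() as scratch:
    router = DatasetRouter(
        routes={"deepCoadd": InMemoryObjectStore()},
        default=PosixDatastore(scratch),
    )
    router.put("scratch_image", "visit1.bin", b"intermediate")
    router.put("deepCoadd", "tract0_patch1.bin", b"final")
    print(router.get("scratch_image", "visit1.bin"))
    print(router.get("deepCoadd", "tract0_patch1.bin"))
```

Because callers only see put and get, backend-specific workarounds (the rename trick on POSIX, or whatever an object store needs) stay inside each datastore class.
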
@timj

timj commented Aug 9, 2017

Isn't there a QA use case for having one set of data processed multiple times with different versions of the software and being able to retrieve datasets based on specific runs so that you can compare them?

@SimonKrughoff
Author

Good point. I'll add something.

@r-owen

r-owen commented Aug 31, 2017

Could you please spell out the acronym PVI the first time you use it? Even better, always spell it out (since it's only used a few times).

@r-owen

r-owen commented Aug 31, 2017

What do you mean by "It's clear that having an I/O module that can interact with an object store (e.g. S3)."? I think you may have left out a few words?
