@parejkoj
Created November 30, 2023 23:42

APDB preloading in general

  • Configuration: separate pipeline/config to run on preload, analogous to existing PIPELINES_CONFIG?
  • Run ap.association.LoadDiaCatalogsTask as a standalone task inside the Prompt Processing code directly, make it a PipelineTask, or run it inside a new DatabasePreloadTask(PipelineTask)?
    • Either way, run LoadDiaCatalogsTask in preload, write diaObjects to a Butler dataset ("{fakesType}{coaddName}Diff_diaObject_input"?), and add that as an Input to DiaPipelineTask.
  • Would need APDB access in DiaPipelineTask, LoadDiaCatalogsTask, and PP itself [and a future Solar System task, assumed analogous to LoadDiaCatalogsTask].
  • Currently the Apdb object is a "pseudo-subtask" of DiaPipelineTask, set up via config fields and a "marker" Output connection to signal that the task completed its APDB access (see the sketch after this list).
  • LoadDiaCatalogsTask currently takes apdb as a run argument.
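
A rough sketch of that current pattern, for orientation. The shape follows the description above, but treat the names and details as illustrative rather than a copy of the real ap_association code:

```python
import lsst.pex.config as pexConfig
import lsst.pipe.base as pipeBase
import lsst.pipe.base.connectionTypes as connTypes
from lsst.dax.apdb import ApdbSql


class DiaPipelineConnections(
    pipeBase.PipelineTaskConnections,
    dimensions=("instrument", "visit", "detector"),
):
    # The "marker" Output: its only job is to tell the middleware that
    # this quantum's APDB writes completed; the stored object is the
    # APDB config itself.
    apdbMarker = connTypes.Output(
        name="apdb_marker",
        storageClass="Config",
        doc="Marker dataset signaling that APDB access completed.",
        dimensions=("instrument", "visit", "detector"),
    )


class DiaPipelineConfig(
    pipeBase.PipelineTaskConfig, pipelineConnections=DiaPipelineConnections
):
    # APDB as a "pseudo-subtask": configured like a subtask via a
    # ConfigurableField, but the constructed object is not a Task.
    apdb = pexConfig.ConfigurableField(
        target=ApdbSql,
        doc="Database connection for storing DiaObjects and DiaSources.",
    )


class DiaPipelineTask(pipeBase.PipelineTask):
    ConfigClass = DiaPipelineConfig
    _DefaultName = "diaPipe"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Build the Apdb instance from config fields at construction time.
        self.apdb = self.config.apdb.apply()
```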

Strategy: passing apdb to __init__:

PipelineTasks must be constructable from just (config, log, initInputs); no such restriction for general Tasks.
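
Concretely, a minimal sketch of the two construction contracts (config fields and connections elided; the subtask wiring assumes an apdb ConfigurableField on the parent config):

```python
import lsst.pex.config as pexConfig
import lsst.pipe.base as pipeBase


class LoadDiaCatalogsConfig(pexConfig.Config):
    pass  # real config fields elided


class LoadDiaCatalogsTask(pipeBase.Task):
    """Plain Task: free to accept extra constructor arguments."""

    ConfigClass = LoadDiaCatalogsConfig
    _DefaultName = "loadDiaCatalogs"

    def __init__(self, apdb=None, **kwargs):
        super().__init__(**kwargs)
        self.apdb = apdb


class ParentPipelineTask(pipeBase.PipelineTask):
    """PipelineTask: the executor only ever calls
    __init__(config=..., log=..., initInputs=...), so a live Apdb
    cannot be handed in directly; it must come from config or
    initInputs. (Connections/config boilerplate elided.)
    """

    def __init__(self, *, config=None, log=None, initInputs=None, **kwargs):
        super().__init__(config=config, log=log, initInputs=initInputs, **kwargs)
        apdb = self.config.apdb.apply()
        # makeSubtask forwards extra keyword arguments to the subtask
        # constructor, so a plain-Task subtask can receive the apdb.
        self.makeSubtask("diaCatalogLoader", apdb=apdb)
```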

Two cases: it's not strictly necessary to use the same solution in both, but having different pipelines in batch and production makes testing less reliable and could cause confusion in the broader Science Pipelines group.

"batch" (pipetask/ap_verify/BPS) processing:

  1. Run LoadDiaCatalogsTask as a subtask of DiaPipelineTask, pass diaPipe.apdb as an explicit argument.
  • Can't be done in PP because of scheduling.
  • If DiaPipelineTask is also a PipelineTask for PP, we would need two different interfaces for getting APDB to it.
  2. Run LoadDiaCatalogsTask as a subtask of a DatabasePreloadTask (analogous to the current DiaPipelineTask).
  • This only pushes the problem one step back; we still need to coordinate APDB access between DatabasePreloadTask and DiaPipelineTask.
  • If apdb is given as an initInput, use it instead of self.config.apdb; otherwise fall back to the config (see the sketch after this list). This lets Prompt Processing have only one apdb connection and pass it along as needed, while BPS jobs keep using the existing system. This may not be relevant: if LoadDiaCatalogsTask is both a PipelineTask and a subtask, it shouldn't need its own config.apdb at all.
  3. Run LoadDiaCatalogsTask as a PipelineTask, with the APDB treated as an InitInput using the apdb_marker hack.
  • It's legal but very confusing to have a config object as a Butler dataset.
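
A sketch of the initInput-or-config fallback from option 2; the "apdb" connection name and the task shape are hypothetical:

```python
import lsst.pipe.base as pipeBase


class DatabasePreloadTask(pipeBase.PipelineTask):
    """Sketch only: connections and config boilerplate elided."""

    def __init__(self, *, initInputs=None, **kwargs):
        super().__init__(initInputs=initInputs, **kwargs)
        if initInputs and "apdb" in initInputs:
            # Prompt Processing: the Apdb handed in through a single
            # apdb init-input connection.
            self.apdb = initInputs["apdb"]
        else:
            # Batch: fall back to this task's own config, as today.
            self.apdb = self.config.apdb.apply()
        self.makeSubtask("diaCatalogLoader", apdb=self.apdb)
```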

Prompt Processing:

  1. Run LoadDiaCatalogsTask as a subtask of a DatabasePreloadTask [see above]
  2. Run LoadDiaCatalogsTask as a PipelineTask. APDB is treated as an InitInput using the apdb_marker hack [see above]
  3. Run LoadDiaCatalogsTask as a PipelineTask, injecting the APDB into the InitInputs (e.g., via a custom TaskFactory; see the sketch after this list).
  • Can't be done with mainstream executors.
  • If DiaPipelineTask is also usable as a subtask, then need two different interfaces for getting APDB to it.
  • Might break Middleware's connection logic?
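
For option 3, a custom TaskFactory might look roughly like this. The real lsst.ctrl.mpexec.TaskFactory.makeTask signature has changed across middleware versions, so take the signature, the initInputs keying, and the "apdb" connection name as assumptions, not the actual API:

```python
from lsst.ctrl.mpexec import TaskFactory


class ApdbInjectingTaskFactory(TaskFactory):
    """Schematic: construct each task with a live Apdb in its initInputs."""

    def __init__(self, apdb):
        self.apdb = apdb

    def makeTask(self, taskDef, butler, initInputRefs):
        # Read declared init-inputs from the butler, keyed by connection
        # name (simplified relative to the real default factory).
        connections = taskDef.connections
        initInputs = {}
        for name in connections.initInputs:
            datasetTypeName = getattr(connections, name).name
            for ref in initInputRefs or []:
                if ref.datasetType.name == datasetTypeName:
                    initInputs[name] = butler.get(ref)
        # Inject the Apdb under a well-known (hypothetical) connection
        # name, bypassing the butler entirely for this one object.
        initInputs["apdb"] = self.apdb
        return taskDef.taskClass(
            config=taskDef.config, initInputs=initInputs, name=taskDef.label
        )
```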

Strategy: keeping APDB as a "subtask" of each task that needs it.

  • Not sure whether we need to explicitly retarget between ApdbSql and ApdbCassandra; configs do have some self-identifying capability (e.g., Config._fromPython), but I think ConfigurableField may bypass that.
  • Two cases:
    • "batch" (pipetask/ap_verify/BPS) processing
      • Need to synchronize all instances of the APDB config, e.g., with a shared config.py file (see the sketch after this list).
    • Prompt Processing
      • Need to either assume specific APDB hooks in the pipeline, or search through the pipeline config for all instances of ApdbConfig.
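
For the batch case, the synchronization could be a single shared override file applied to every task label that carries an APDB pseudo-subtask; the file name, labels, and URL below are examples (db_url is a real ApdbSqlConfig field):

```python
# apdb_shared.py -- applied to each relevant label, e.g.
#   pipetask run -C diaPipe:apdb_shared.py -C databasePreload:apdb_shared.py ...
# so that every apdb pseudo-subtask points at the same database.
config.apdb.db_url = "postgresql://rubin@example.org/apdb"
```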

Solar System specific behavior

  • Configuration: bundled as preloadPipeline(s).
  • Load SSObject table from APDB (updated daily; can't init-optimize because query is position-dependent).
    • Do this via LoadSSObjectsTask (a PipelineTask), run in preload? Or make LoadSSObjectsTask a subtask of DatabasePreloadTask?
  • We need to compute the ephemerides to as close to the exposure midpoint as possible, which means both projecting from the database values to something close to the actual exposure time, and then doing a fast linear propagation from there inside the association code itself (once the exact exposure time is known).
    • The current next_visit message from the Summit does NOT predict the time of exposure.
    • How precise does the exposure midpoint time have to be for SSObject propagation?
      • Jake/Mario: to within a second. So we'll have to compute an updated position with a linear propagation or something within the SSObjects task.
      • What do we start with as a guess on the exposure time? Just use the time next_visit was received? There is no "current exposure" concept in prompt processing, so the best we could do is a fudge factor.
      • It would be useful to ask Telescope & Site if next_visit could be modified to include an estimated start time of the exposure (this should be computable from the script queue).

For DiaPipelineTask, it's best for the corrected SS objects to be a Butler dataset.

  • New LoadSSObjectsTask: read from the SSObject table, propagate positions, and write to detector_visit_solar_system_objects (sketched below).
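
A sketch of what that could look like, assuming the preloaded catalog carries per-object sky-plane rates (all column names hypothetical, and the connections/config boilerplate elided):

```python
import numpy as np
import pandas as pd
import lsst.pipe.base as pipeBase


class LoadSSObjectsTask(pipeBase.PipelineTask):
    """Sketch only: read SSObjects, propagate, hand back for writing."""

    def run(self, ssObjects: pd.DataFrame, expMidpointMjd: float) -> pipeBase.Struct:
        # Time from each object's ephemeris epoch to the *estimated*
        # exposure midpoint, in days.
        dt = expMidpointMjd - ssObjects["ephemEpochMjd"]
        propagated = ssObjects.copy()
        # First-order propagation; raRate is assumed to be a sky-plane
        # rate (dRA/dt * cos(dec)) in deg/day, hence the cos(dec)
        # division when applying it to RA.
        propagated["ra"] += (
            ssObjects["raRate"] * dt / np.cos(np.radians(ssObjects["dec"]))
        )
        propagated["dec"] += ssObjects["decRate"] * dt
        # diaPipe can apply a second, exact linear correction once the
        # true exposure midpoint is known.
        return pipeBase.Struct(ssObjects=propagated)
```
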
@isullivan

In general, I strongly agree that the implementation of preloading in batch and Prompt Processing should match as closely as possible. I think that suggests we run LoadDiaCatalogsTask and LoadSSObjectsTask as PipelineTasks, and pass the APDB configuration as an InitInput (option 3 for batch and 2 for PP). Perhaps we refer to that connection as the APDB interface or APDB interface constructor (for example) to avoid confusion from passing a config as a Butler data product.

For the Solar System side, it is still TBD exactly how we will load the daily ephemerides (see DM-41971), but the LoadSSObjectsTask should write a catalog that should be straightforward to read in diaPipe. If we end up needing a separate connection to the APDB to write the associated SS objects, or a connection to a separate dedicated "SSDB", then we should be able to make use of the same config (or "interface") InitInput to diaPipe as above. We should be OK in preloading even if the timing is off by a couple minutes, since we can compute corrections to the positions within diaPipe, so for now adding a fudge factor to the time when the next_visit event was received should be sufficient. We should absolutely push Telescope & Site to add estimated start time to the event, though.
