@parejkoj
Created November 30, 2023 23:42

APDB preloading in general

  • Configuration: separate pipeline/config to run on preload, analogous to existing PIPELINES_CONFIG?
  • Run ap.association.LoadDiaCatalogsTask as a standalone task inside the Prompt Processing code directly, make it a PipelineTask, or run it inside a new DatabasePreloadTask(PipelineTask)?
    • Either way, run LoadDiaCatalogsTask in preload, write diaObjects to a Butler dataset ("{fakesType}{coaddName}Diff_diaObject_input"?), and add that as an Input to DiaPipelineTask.
  • Would need APDB access in DiaPipelineTask, LoadDiaCatalogsTask, and PP itself [and a future Solar System task, assumed analogous to LoadDiaCatalogsTask].
  • Currently the Apdb object is a "pseudo-subtask" of DiaPipelineTask, set up via config fields and a "marker" Output connection to signal that the task completed its APDB access (see the sketch after this list).
  • LoadDiaCatalogsTask currently takes apdb as a run argument.
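
A rough sketch of that current pattern, for orientation. The shape follows the description above, but treat the names and details as illustrative rather than a copy of the real ap_association code:

```python
import lsst.pex.config as pexConfig
import lsst.pipe.base as pipeBase
import lsst.pipe.base.connectionTypes as connTypes
from lsst.dax.apdb import ApdbSql


class DiaPipelineConnections(
    pipeBase.PipelineTaskConnections,
    dimensions=("instrument", "visit", "detector"),
):
    # The "marker" Output: its only job is to tell the middleware that
    # this quantum's APDB writes completed; the stored object is the
    # APDB config itself.
    apdbMarker = connTypes.Output(
        name="apdb_marker",
        storageClass="Config",
        doc="Marker dataset signaling that APDB access completed.",
        dimensions=("instrument", "visit", "detector"),
    )


class DiaPipelineConfig(
    pipeBase.PipelineTaskConfig, pipelineConnections=DiaPipelineConnections
):
    # APDB as a "pseudo-subtask": configured like a subtask via a
    # ConfigurableField, but the constructed object is not a Task.
    apdb = pexConfig.ConfigurableField(
        target=ApdbSql,
        doc="Database connection for storing DiaObjects and DiaSources.",
    )


class DiaPipelineTask(pipeBase.PipelineTask):
    ConfigClass = DiaPipelineConfig
    _DefaultName = "diaPipe"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Build the Apdb instance from config fields at construction time.
        self.apdb = self.config.apdb.apply()
```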

Strategy: passing apdb to __init__:

PipelineTasks must be constructable from just (config, log, initInputs); no such restriction for general Tasks.
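
Concretely, a minimal sketch of the two construction contracts (config fields and connections elided; the subtask wiring assumes an apdb ConfigurableField on the parent config):

```python
import lsst.pex.config as pexConfig
import lsst.pipe.base as pipeBase


class LoadDiaCatalogsConfig(pexConfig.Config):
    pass  # real config fields elided


class LoadDiaCatalogsTask(pipeBase.Task):
    """Plain Task: free to accept extra constructor arguments."""

    ConfigClass = LoadDiaCatalogsConfig
    _DefaultName = "loadDiaCatalogs"

    def __init__(self, apdb=None, **kwargs):
        super().__init__(**kwargs)
        self.apdb = apdb


class ParentPipelineTask(pipeBase.PipelineTask):
    """PipelineTask: the executor only ever calls
    __init__(config=..., log=..., initInputs=...), so a live Apdb
    cannot be handed in directly; it must come from config or
    initInputs. (Connections/config boilerplate elided.)
    """

    def __init__(self, *, config=None, log=None, initInputs=None, **kwargs):
        super().__init__(config=config, log=log, initInputs=initInputs, **kwargs)
        apdb = self.config.apdb.apply()
        # makeSubtask forwards extra keyword arguments to the subtask
        # constructor, so a plain-Task subtask can receive the apdb.
        self.makeSubtask("diaCatalogLoader", apdb=apdb)
```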

Two cases: it's not strictly necessary to use the same solution in both, but having different pipelines in batch and production makes testing less reliable and could cause confusion in the broader Science Pipelines group.

"batch" (pipetask/ap_verify/BPS) processing:

  1. Run LoadDiaCatalogsTask as a subtask of DiaPipelineTask, pass diaPipe.apdb as an explicit argument.
  • Can't be done in PP because of scheduling.
  • If DiaPipelineTask is also a PipelineTask for PP, we would need two different interfaces for getting APDB to it.
  2. Run LoadDiaCatalogsTask as a subtask of a DatabasePreloadTask (analogous to the current DiaPipelineTask).
  • This only pushes the problem one step back; we still need to coordinate APDB access between DatabasePreloadTask and DiaPipelineTask.
  • If apdb is given as an initInput, use it instead of self.config.apdb; otherwise fall back to the config (see the sketch after this list). This lets Prompt Processing have only one apdb connection and pass it along as needed, while BPS jobs keep using the existing system. This may not be relevant: if LoadDiaCatalogsTask is both a PipelineTask and a subtask, it shouldn't need its own config.apdb at all.
  3. Run LoadDiaCatalogsTask as a PipelineTask, with the APDB treated as an InitInput using the apdb_marker hack.
  • It's legal but very confusing to have a config object as a Butler dataset.
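
A sketch of the initInput-or-config fallback from option 2; the "apdb" connection name and the task shape are hypothetical:

```python
import lsst.pipe.base as pipeBase


class DatabasePreloadTask(pipeBase.PipelineTask):
    """Sketch only: connections and config boilerplate elided."""

    def __init__(self, *, initInputs=None, **kwargs):
        super().__init__(initInputs=initInputs, **kwargs)
        if initInputs and "apdb" in initInputs:
            # Prompt Processing: the Apdb handed in through a single
            # apdb init-input connection.
            self.apdb = initInputs["apdb"]
        else:
            # Batch: fall back to this task's own config, as today.
            self.apdb = self.config.apdb.apply()
        self.makeSubtask("diaCatalogLoader", apdb=self.apdb)
```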

Prompt Processing:

  1. Run LoadDiaCatalogsTask as a subtask of a DatabasePreloadTask [see above]
  2. Run LoadDiaCatalogsTask as a PipelineTask. APDB is treated as an InitInput using the apdb_marker hack [see above]
  3. Run LoadDiaCatalogsTask as a PipelineTask, injecting the APDB into the InitInputs (e.g., via a custom TaskFactory; see the sketch after this list).
  • Can't be done with mainstream executors.
  • If DiaPipelineTask is also usable as a subtask, then need two different interfaces for getting APDB to it.
  • Might break Middleware's connection logic?
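
For option 3, a custom TaskFactory might look roughly like this. The real lsst.ctrl.mpexec.TaskFactory.makeTask signature has changed across middleware versions, so take the signature, the initInputs keying, and the "apdb" connection name as assumptions, not the actual API:

```python
from lsst.ctrl.mpexec import TaskFactory


class ApdbInjectingTaskFactory(TaskFactory):
    """Schematic: construct each task with a live Apdb in its initInputs."""

    def __init__(self, apdb):
        self.apdb = apdb

    def makeTask(self, taskDef, butler, initInputRefs):
        # Read declared init-inputs from the butler, keyed by connection
        # name (simplified relative to the real default factory).
        connections = taskDef.connections
        initInputs = {}
        for name in connections.initInputs:
            datasetTypeName = getattr(connections, name).name
            for ref in initInputRefs or []:
                if ref.datasetType.name == datasetTypeName:
                    initInputs[name] = butler.get(ref)
        # Inject the Apdb under a well-known (hypothetical) connection
        # name, bypassing the butler entirely for this one object.
        initInputs["apdb"] = self.apdb
        return taskDef.taskClass(
            config=taskDef.config, initInputs=initInputs, name=taskDef.label
        )
```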

Strategy: keeping APDB as a "subtask" of each task that needs it.

  • Not sure whether we need to explicitly retarget between ApdbSql and ApdbCassandra; configs do have some self-identifying capability (e.g., Config._fromPython), but I think ConfigurableField may bypass that.
  • Two cases:
    • "batch" (pipetask/ap_verify/BPS) processing
      • Need to synchronize all instances of the APDB config, e.g., with a shared config.py file (see the sketch after this list).
    • Prompt Processing
      • Need to either assume specific APDB hooks in the pipeline, or search through the pipeline config for all instances of ApdbConfig.
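
For the batch case, the synchronization could be a single shared override file applied to every task label that carries an APDB pseudo-subtask; the file name, labels, and URL below are examples (db_url is a real ApdbSqlConfig field):

```python
# apdb_shared.py -- applied to each relevant label, e.g.
#   pipetask run -C diaPipe:apdb_shared.py -C databasePreload:apdb_shared.py ...
# so that every apdb pseudo-subtask points at the same database.
config.apdb.db_url = "postgresql://rubin@example.org/apdb"
```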

Solar System specific behavior

  • Configuration: bundled as preloadPipeline(s).
  • Load SSObject table from APDB (updated daily; can't init-optimize because query is position-dependent).
    • Do this via LoadSSObjectsTask (a PipelineTask), run in preload? Or make LoadSSObjectsTask a subtask of DatabasePreloadTask?
  • We need to compute the ephemerides to as close to the exposure midpoint as possible, which means both projecting from the database values to something close to the actual exposure time, and then doing a fast linear propagation from there inside the association code itself (once the exact exposure time is known).
    • The current next_visit message from the Summit does NOT predict the time of exposure.
    • How precise does the exposure midpoint time have to be for SSObject propagation?
      • Jake/Mario: to within a second. So we'll have to compute an updated position with a linear propagation or something within the SSObjects task.
      • What do we start with as a guess on the exposure time? Just use the time next_visit was received? There is no "current exposure" concept in prompt processing, so the best we could do is a fudge factor.
      • It would be useful to ask Telescope & Site if next_visit could be modified to include an estimated start time of the exposure (this should be computable from the script queue).

For DiaPipelineTask, it's best for the corrected SS objects to be a Butler dataset.

  • New LoadSSObjectsTask: read from the SSObject table, propagate positions, and write to detector_visit_solar_system_objects (sketched below).
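
A sketch of what that could look like, assuming the preloaded catalog carries per-object sky-plane rates (all column names hypothetical, and the connections/config boilerplate elided):

```python
import numpy as np
import pandas as pd
import lsst.pipe.base as pipeBase


class LoadSSObjectsTask(pipeBase.PipelineTask):
    """Sketch only: read SSObjects, propagate, hand back for writing."""

    def run(self, ssObjects: pd.DataFrame, expMidpointMjd: float) -> pipeBase.Struct:
        # Time from each object's ephemeris epoch to the *estimated*
        # exposure midpoint, in days.
        dt = expMidpointMjd - ssObjects["ephemEpochMjd"]
        propagated = ssObjects.copy()
        # First-order propagation; raRate is assumed to be a sky-plane
        # rate (dRA/dt * cos(dec)) in deg/day, hence the cos(dec)
        # division when applying it to RA.
        propagated["ra"] += (
            ssObjects["raRate"] * dt / np.cos(np.radians(ssObjects["dec"]))
        )
        propagated["dec"] += ssObjects["decRate"] * dt
        # diaPipe can apply a second, exact linear correction once the
        # true exposure midpoint is known.
        return pipeBase.Struct(ssObjects=propagated)
```
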
@isullivan

In general, I strongly agree that the implementation of preloading in batch and Prompt Processing should match as closely as possible. I think that suggests we run LoadDiaCatalogsTask and LoadSSObjectsTask as PipelineTasks, and pass the APDB configuration as an InitInput (option 3 for batch and 2 for PP). Perhaps we refer to that connection as the APDB interface or APDB interface constructor (for example) to avoid confusion from passing a config as a Butler data product.

For the Solar System side, it is still TBD exactly how we will load the daily ephemerides (see DM-41971), but the LoadSSObjectsTask should write a catalog that should be straightforward to read in diaPipe. If we end up needing a separate connection to the APDB to write the associated SS objects, or a connection to a separate dedicated "SSDB", then we should be able to make use of the same config (or "interface") InitInput to diaPipe as above. We should be OK in preloading even if the timing is off by a couple minutes, since we can compute corrections to the positions within diaPipe, so for now adding a fudge factor to the time when the next_visit event was received should be sufficient. We should absolutely push Telescope & Site to add estimated start time to the event, though.
