@toddkaufmann
Created May 7, 2019 02:20
Last edited by Todd Kaufmann 3 years ago
= Proposed job scheduler change to support smaller chunksize, greater resilience =
== Idea: instead of looping over instances, loop over the set of jobs ==
For example: 1000 commands on 10 instances --> 100 jobs (0..99), where job N holds commands N*10 .. N*10+9.
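The command-to-job mapping above can be sketched as a simple chunking step; the helper name `make_jobs` is an assumption, not part of any existing API:

```python
def make_jobs(commands, chunk_size):
    """Split a flat list of commands into jobs of up to chunk_size commands.

    Job N holds commands N*chunk_size .. N*chunk_size + chunk_size - 1.
    """
    return [commands[i:i + chunk_size]
            for i in range(0, len(commands), chunk_size)]

commands = [f"cmd-{n}" for n in range(1000)]  # placeholder command names
jobs = make_jobs(commands, 10)                # 100 jobs of 10 commands each
```

With a smaller `chunk_size` the same command list yields more, finer-grained jobs, which is the point of the proposal: a failed job loses less work.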
Jobs have three states:
    done
    running   (the job then also has an associated instance)
    waiting   (to run)
At start, set:
    the list of available instances (duplicate each by # of cores, or name them inst#1, inst#2, etc.)
    all jobs waiting:
        job.state = waiting
        job.cmds  = N*10 .. N*10+9
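This initial state could look like the following sketch; the `Job` dataclass, the state strings, and the `instance_slots` helper are assumptions chosen for illustration:

```python
from dataclasses import dataclass, field

WAITING, RUNNING, DONE = "waiting", "running", "done"

@dataclass
class Job:
    cmds: list               # commands N*10 .. N*10+9 for job N
    state: str = WAITING
    instance: str = None     # set only while the job is running
    tries: int = 0           # retry counter

def instance_slots(instances, cores):
    """Duplicate each instance name per core: inst#1, inst#2, ..."""
    return [f"{name}#{c}" for name in instances for c in range(1, cores + 1)]

# All jobs start out waiting; every instance slot starts out available.
jobs = [Job(cmds=list(range(n * 10, n * 10 + 10))) for n in range(100)]
available = instance_slots(["inst-a", "inst-b"], 2)
```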
General pseudocode for the idea:

    loop:
        if an instance is available:
            get the next waiting job and start it:
                job.state = running; job.instance = inst_id
                remove inst_id from the available list
        else:
            for each job that is running:
                if it has finished:
                    if the command was successful:    # if there is a way to tell the outputs are correct, etc.
                        job.state = done
                        job.instance goes back on the list
                    else:                             # i.e., run it again
                        job.state = waiting
                        job.instance goes back on the list
                        job.tries += 1                # or: job.cmd = job.cmd + " try-one-more-time"
                else:
                    # Additional feature, to support restart of jobs:
                    check whether the instance is still responding (count++)
                    if not, remove the instance from the list and restart or terminate it (?)
                    any other jobs running on this instance must then also be killed,
                    i.e. each job with job.instance == inst_id is killed;
                    killed jobs go back to job.state = waiting
        if no jobs finished this pass and no instances are available:
            print "<d> done, <r> running, <w> waiting"
            then wait a while (sleep) and continue the loop
    until no job exists where job.state = waiting or running
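The loop above can be sketched in Python as follows. The `start_job`, `job_finished`, `job_succeeded`, and `instance_responding` hooks are stand-ins for real launch/poll/health-check calls, not an existing API; the stub bodies here only let the sketch run. One small deviation from the pseudocode: a job is started only when an instance is available *and* a job is waiting, so the loop still polls running jobs when instances sit idle with nothing left to start.

```python
import time
from dataclasses import dataclass

WAITING, RUNNING, DONE = "waiting", "running", "done"

@dataclass
class Job:
    cmds: list
    state: str = WAITING
    instance: str = None
    tries: int = 0

# Stand-in hooks; replace with real cluster launch/poll/health-check calls.
def start_job(job, inst):      pass          # launch job.cmds on inst
def job_finished(job):         return True   # poll: has the job exited?
def job_succeeded(job):        return True   # e.g. verify outputs are correct
def instance_responding(inst): return True   # liveness check

def schedule(jobs, available, sleep_secs=0):
    while any(j.state in (WAITING, RUNNING) for j in jobs):
        progressed = False
        waiting = next((j for j in jobs if j.state == WAITING), None)
        if available and waiting:
            # Start the next waiting job on a free instance slot.
            waiting.state, waiting.instance = RUNNING, available.pop(0)
            start_job(waiting, waiting.instance)
            progressed = True
        else:
            for job in [j for j in jobs if j.state == RUNNING]:
                if job_finished(job):
                    available.append(job.instance)   # instance goes back on the list
                    job.instance = None
                    if job_succeeded(job):
                        job.state = DONE
                    else:
                        job.state = WAITING          # i.e., run it again
                        job.tries += 1
                    progressed = True
                elif not instance_responding(job.instance):
                    # Restart support: every job on the dead instance is
                    # killed and goes back to waiting; the instance slot is
                    # simply never returned to the available list.
                    dead = job.instance
                    for j in jobs:
                        if j.state == RUNNING and j.instance == dead:
                            j.state, j.instance = WAITING, None
        if not progressed:
            d = sum(j.state == DONE for j in jobs)
            r = sum(j.state == RUNNING for j in jobs)
            w = sum(j.state == WAITING for j in jobs)
            print(f"{d} done, {r} running, {w} waiting")
            time.sleep(sleep_secs)
```

With the stub hooks every started job finishes successfully, so `schedule` drains the whole job list; swapping the stubs for real calls gives the retry and dead-instance behavior described above.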