@toddkaufmann
Created May 7, 2019 02:20
Last edited by Todd Kaufmann 3 years ago
= Proposed job scheduler change to support smaller chunksize, greater resilience =
== Idea: instead of looping over instances, loop over the set of jobs ==
For example: 1000 commands on 10 instances --> 100 jobs (0..99), where job N holds commands N*10 .. N*10+9.
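The command-to-job mapping above can be sketched as a simple chunking step; the helper name `make_jobs` is an assumption, not part of any existing API:

```python
def make_jobs(commands, chunk_size):
    """Split a flat list of commands into jobs of up to chunk_size commands.

    Job N holds commands N*chunk_size .. N*chunk_size + chunk_size - 1.
    """
    return [commands[i:i + chunk_size]
            for i in range(0, len(commands), chunk_size)]

commands = [f"cmd-{n}" for n in range(1000)]  # placeholder command names
jobs = make_jobs(commands, 10)                # 100 jobs of 10 commands each
```

With a smaller `chunk_size` the same command list yields more, finer-grained jobs, which is the point of the proposal: a failed job loses less work.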
Jobs have three states:
    done
    running   (the job then also has an associated instance)
    waiting   (to run)
At start, set:
    the list of available instances (duplicate each by # of cores, or name them inst#1, inst#2, etc.)
    all jobs waiting:
        job.state = waiting
        job.cmds  = N*10 .. N*10+9
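This initial state could look like the following sketch; the `Job` dataclass, the state strings, and the `instance_slots` helper are assumptions chosen for illustration:

```python
from dataclasses import dataclass, field

WAITING, RUNNING, DONE = "waiting", "running", "done"

@dataclass
class Job:
    cmds: list               # commands N*10 .. N*10+9 for job N
    state: str = WAITING
    instance: str = None     # set only while the job is running
    tries: int = 0           # retry counter

def instance_slots(instances, cores):
    """Duplicate each instance name per core: inst#1, inst#2, ..."""
    return [f"{name}#{c}" for name in instances for c in range(1, cores + 1)]

# All jobs start out waiting; every instance slot starts out available.
jobs = [Job(cmds=list(range(n * 10, n * 10 + 10))) for n in range(100)]
available = instance_slots(["inst-a", "inst-b"], 2)
```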
General pseudocode for the idea:

    loop:
        if an instance is available:
            get the next waiting job and start it:
                job.state = running; job.instance = inst_id
                remove inst_id from the available list
        else:
            for each job that is running:
                if it has finished:
                    if the command was successful:    # if there is a way to tell the outputs are correct, etc.
                        job.state = done
                        job.instance goes back on the list
                    else:                             # i.e., run it again
                        job.state = waiting
                        job.instance goes back on the list
                        job.tries += 1                # or: job.cmd = job.cmd + " try-one-more-time"
                else:
                    # Additional feature, to support restart of jobs:
                    check whether the instance is still responding (count++)
                    if not, remove the instance from the list and restart or terminate it (?)
                    any other jobs running on this instance must then also be killed,
                    i.e. each job with job.instance == inst_id is killed;
                    killed jobs go back to job.state = waiting
        if no jobs finished this pass and no instances are available:
            print "<d> done, <r> running, <w> waiting"
            then wait a while (sleep) and continue the loop
    until no job exists where job.state = waiting or running
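The loop above can be sketched in Python as follows. The `start_job`, `job_finished`, `job_succeeded`, and `instance_responding` hooks are stand-ins for real launch/poll/health-check calls, not an existing API; the stub bodies here only let the sketch run. One small deviation from the pseudocode: a job is started only when an instance is available *and* a job is waiting, so the loop still polls running jobs when instances sit idle with nothing left to start.

```python
import time
from dataclasses import dataclass

WAITING, RUNNING, DONE = "waiting", "running", "done"

@dataclass
class Job:
    cmds: list
    state: str = WAITING
    instance: str = None
    tries: int = 0

# Stand-in hooks; replace with real cluster launch/poll/health-check calls.
def start_job(job, inst):      pass          # launch job.cmds on inst
def job_finished(job):         return True   # poll: has the job exited?
def job_succeeded(job):        return True   # e.g. verify outputs are correct
def instance_responding(inst): return True   # liveness check

def schedule(jobs, available, sleep_secs=0):
    while any(j.state in (WAITING, RUNNING) for j in jobs):
        progressed = False
        waiting = next((j for j in jobs if j.state == WAITING), None)
        if available and waiting:
            # Start the next waiting job on a free instance slot.
            waiting.state, waiting.instance = RUNNING, available.pop(0)
            start_job(waiting, waiting.instance)
            progressed = True
        else:
            for job in [j for j in jobs if j.state == RUNNING]:
                if job_finished(job):
                    available.append(job.instance)   # instance goes back on the list
                    job.instance = None
                    if job_succeeded(job):
                        job.state = DONE
                    else:
                        job.state = WAITING          # i.e., run it again
                        job.tries += 1
                    progressed = True
                elif not instance_responding(job.instance):
                    # Restart support: every job on the dead instance is
                    # killed and goes back to waiting; the instance slot is
                    # simply never returned to the available list.
                    dead = job.instance
                    for j in jobs:
                        if j.state == RUNNING and j.instance == dead:
                            j.state, j.instance = WAITING, None
        if not progressed:
            d = sum(j.state == DONE for j in jobs)
            r = sum(j.state == RUNNING for j in jobs)
            w = sum(j.state == WAITING for j in jobs)
            print(f"{d} done, {r} running, {w} waiting")
            time.sleep(sleep_secs)
```

With the stub hooks every started job finishes successfully, so `schedule` drains the whole job list; swapping the stubs for real calls gives the retry and dead-instance behavior described above.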