There is a problem with the assumption that the OS will handle the load of multiple tasks running concurrently (right now, INGInious immediately launches the tasks it receives, with (nearly) no limit), mainly because of memory usage.
For some reason, processes are never swapped out, or not often enough, which leads to OOM in other containers.
There is an unused option in Docker (MemorySwap) that sets the amount of swap available to a container. But the default value seems to be infinite, so this won't help.
So this seems to be a failure of the OS to manage lots of processes using lots of memory.
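For reference, here is a rough sketch of how the MemorySwap option mentioned above would be set. It only builds the `HostConfig` part of a container-creation request for the Docker Engine API, where `Memory` is the RAM limit in bytes and `MemorySwap` is the combined RAM-plus-swap limit (-1 meaning unlimited swap). The helper name `host_config_with_swap` and the sizes are made up for illustration.

```python
def host_config_with_swap(memory_mb, swap_mb=None):
    """Build a HostConfig dict for a Docker Engine API container-create request.

    memory_mb: RAM limit in MiB.
    swap_mb: swap allowed on top of the RAM limit, in MiB; None means unlimited.
    """
    memory_bytes = memory_mb * 1024 * 1024
    if swap_mb is None:
        memory_swap = -1  # -1 asks Docker for unlimited swap
    else:
        # MemorySwap is the total of RAM + swap, not the swap amount alone.
        memory_swap = memory_bytes + swap_mb * 1024 * 1024
    return {"Memory": memory_bytes, "MemorySwap": memory_swap}


if __name__ == "__main__":
    # 256 MiB of RAM plus 512 MiB of swap.
    print(host_config_with_swap(256, swap_mb=512))
```

Since swap already appears to be unlimited by default, setting this value would only matter if we wanted to cap it.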
Once the scheduler is done, we will probably restrict the number of running tasks to the number of CPUs/cores. We cannot do that right now because of the "starvation" (in the sense described below) that would probably occur.
The freeze command in Docker puts the process hierarchy of a running container inside the Freezer control group. The system then excludes the container's processes from its own scheduler, which means that the container no longer uses any CPU.
That would be useful if the CPU were the problem, but that's not the case. However, when a process is in the Freezer cgroup, there should be a greater chance of it being swapped out (although I can't find any indication of that anywhere). This needs to be tested.
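One possible way to run that test, assuming a Linux host, is to read the `VmSwap` field of `/proc/<pid>/status` for the container's processes before freezing and again some time later under memory pressure. The helper names `parse_vmswap_kb` and `swapped_kb` below are hypothetical, and this is only a sketch of the measurement, not of the whole experiment.

```python
import re


def parse_vmswap_kb(status_text):
    """Return the VmSwap value (in kB) from the text of /proc/<pid>/status.

    Returns None when the field is absent (e.g. for kernel threads).
    """
    match = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def swapped_kb(pid):
    """Read how much of process `pid` is currently in swap (Linux only)."""
    with open("/proc/%d/status" % pid) as status_file:
        return parse_vmswap_kb(status_file.read())
```

If the value grows noticeably faster for frozen containers than for running ones, that would support the hypothesis.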
The scheduler can:
- Start a task
- Kill a task, to restart it later (to be avoided?)
- Verify the status of the task (task is done or not)
- Freeze a task
- Unfreeze a task
So the scheduler needs to be non-preemptive, and should be aware that once a task is launched, it may never stop before the timeout (if we don't consider the option of killing and restarting it).
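The five operations listed above could be captured in a small interface like the following. This is only a sketch: the names `TaskBackend` and `FakeBackend` are invented here, and the fake implementation just tracks state in memory instead of talking to Docker.

```python
from abc import ABC, abstractmethod


class TaskBackend(ABC):
    """The five operations the scheduler may apply to a task."""

    @abstractmethod
    def start(self, task_id): ...

    @abstractmethod
    def kill(self, task_id): ...

    @abstractmethod
    def is_done(self, task_id): ...

    @abstractmethod
    def freeze(self, task_id): ...

    @abstractmethod
    def unfreeze(self, task_id): ...


class FakeBackend(TaskBackend):
    """In-memory stand-in that only tracks state transitions."""

    def __init__(self):
        self.state = {}  # task_id -> "running" | "frozen" | "done" | "killed"

    def start(self, task_id):
        self.state[task_id] = "running"

    def kill(self, task_id):
        self.state[task_id] = "killed"

    def is_done(self, task_id):
        return self.state.get(task_id) == "done"

    def freeze(self, task_id):
        # Only a running task can be frozen; it keeps its memory.
        if self.state.get(task_id) == "running":
            self.state[task_id] = "frozen"

    def unfreeze(self, task_id):
        if self.state.get(task_id) == "frozen":
            self.state[task_id] = "running"
```

Note that there is no "suspend to disk" operation: a frozen task stops using CPU but still holds its memory, which is why the scheduler cannot be fully preemptive.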
- CPU constraints
- Memory constraints
- Must avoid starvation
- Need to minimize the total time to execute the tasks (with trade-offs)
- Short tasks must have a short waiting time, while long tasks may have very long waiting times (this is another type of starvation). This is clearly the biggest point to optimize.
- As said earlier, the scheduler cannot be (completely) preemptive: once a task is started, it cannot be moved out of memory (but it can be frozen)
- New tasks can be added at any time.
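To make these constraints more concrete, here is a minimal sketch of one possible non-preemptive policy: shortest task first, with an aging term so that long tasks are not overtaken forever. It is not the actual INGInious scheduler. The class names, the `aging` parameter, and the idea that each task comes with an estimated duration and memory need are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    est_seconds: float   # hypothetical estimate of the run time
    memory_mb: int       # hypothetical memory requirement
    submitted_at: float  # time at which the task entered the queue


class AdmissionScheduler:
    """Non-preemptive: a started task keeps its CPU slot and memory until done."""

    def __init__(self, cpu_slots, memory_mb, aging=1.0):
        self.free_slots = cpu_slots
        self.free_memory_mb = memory_mb
        self.aging = aging  # how fast waiting time offsets a long estimate
        self.pending = []
        self.running = {}

    def submit(self, task):
        # New tasks can arrive at any time.
        self.pending.append(task)

    def _score(self, task, now):
        # Lower is better: short tasks go first, but waiting lowers the score,
        # so a long task is eventually preferred over newly arrived short ones.
        return task.est_seconds - self.aging * (now - task.submitted_at)

    def start_next(self, now):
        """Start the best pending task that fits; return it, or None."""
        if self.free_slots == 0:
            return None
        fitting = [t for t in self.pending if t.memory_mb <= self.free_memory_mb]
        if not fitting:
            return None
        task = min(fitting, key=lambda t: self._score(t, now))
        self.pending.remove(task)
        self.running[task.name] = task
        self.free_slots -= 1
        self.free_memory_mb -= task.memory_mb
        return task

    def finish(self, name):
        """Release the resources of a task that ended (or hit its timeout)."""
        task = self.running.pop(name)
        self.free_slots += 1
        self.free_memory_mb += task.memory_mb
```

This sketch ignores freezing entirely, and a task with a large memory requirement can still wait a long time if smaller tasks keep filling the free memory, so a real policy would probably need some form of memory reservation as well.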
The best solution available (CRIU, http://criu.org/Main_Page) is not really compatible with Docker (http://kimh.github.io/blog/en/criu/experiment-to-suspend-and-resume-docker-container-with-criu/, http://criu.org/Docker), so we can't use this kind of technique right now.
HTCondor does seem to be able to move tasks from one computer to another, which means its scheduler can be (or is) preemptive. There is a lot of documentation about the schedulers used in operating systems, but most of them are preemptive.