
Turin, Wed 5 Aug - 2015

### Goals:

  1. CentOS6/SCL6-based containers, Parrot+CVMFS aware, capable of running the experiment software.
  2. A Condor cluster using the containers as workers.
  3. A Python script (based on the daemon API) able to:
    1. perform a continuous check of resource availability on the bare-metal host and take decisions by applying specific policies.
    2. always pull up-to-date images.
    3. create the needed containers, starting from a remote configuration and the pulled images.
    4. manage running containers following TTL policies, performing a kind of garbage collection (a minimal sketch of this loop follows the list).
  4. A container pilot as the container entry point: fetch the remote Condor configuration files, start the Condor daemons, wait for a job during a limited period, then exit the container (a pilot sketch also follows the list).
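
As a rough illustration of goal 3, here is a minimal sketch of the manager loop, assuming the docker-py 1.x `Client` API (note that, as pointed out in the issues below, the current version does not use docker-py); the image name, TTL and polling interval are placeholders:

```python
import time

from docker import Client  # docker-py (1.x API); an assumption, see the issues below

IMAGE = 'mconcas/slc6-worker'   # hypothetical image name
TTL = 6 * 3600                  # seconds a worker container is allowed to live
CHECK_EVERY = 300               # polling interval, in seconds

cli = Client(base_url='unix://var/run/docker.sock')


def host_is_free():
    """Placeholder for the resource/policy check (see the policy sketch further below)."""
    return True


while True:
    # 1. keep the image up to date
    cli.pull(IMAGE, tag='latest')

    # 2. spawn a worker container if the policy allows it
    if host_is_free():
        container = cli.create_container(image=IMAGE + ':latest')
        cli.start(container=container['Id'])

    # 3. TTL-based garbage collection of the running workers
    now = time.time()
    for c in cli.containers():  # running containers only
        if c['Image'].startswith(IMAGE) and now - c['Created'] > TTL:
            cli.remove_container(container=c['Id'], force=True)

    time.sleep(CHECK_EVERY)
```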
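
Goal 4 could be sketched roughly as below (Python 2, to match the CentOS 6 environment); the configuration URL and the waiting window are hypothetical, and a real pilot would also check whether a job actually arrived before shutting down:

```python
import subprocess
import time
import urllib

CONFIG_URL = 'http://example.org/condor/worker.conf'  # hypothetical remote configuration
CONFIG_DST = '/etc/condor/condor_config.local'
WAIT_WINDOW = 2 * 3600                                # limited waiting period, in seconds

# 1. fetch the remote Condor configuration
urllib.urlretrieve(CONFIG_URL, CONFIG_DST)

# 2. start the Condor daemons
subprocess.check_call(['condor_master'])

# 3. wait during a limited period (checking for an actual job is omitted here),
#    then shut the daemons down so the container can exit
time.sleep(WAIT_WINDOW)
subprocess.call(['condor_off', '-master'])
```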

### Issues to be discussed & TODO list

  1. [TODO] Run a dummy test inside the container to verify that the environment actually works.

  2. [TODO] Build a working Condor configuration for the entire cluster. The current one does not work correctly (the pilot is involved in the debugging phase as well).

  3. [ISSUE] The current version does not make use of the docker-py module. Instead of building a brand-new Python module (i.e. reinventing the wheel), one could use it (con: the existing code would have to be rebuilt from scratch). The lifecycle sketch after the goals list above assumes docker-py for illustration.

    [TODO] «Daemonize» the script (see the daemonization sketch after this list).

    [TODO+ISSUE] About policies: what should we keep an eye on? (The psutil module is widely recommended for measuring resource utilization.) Personally, I think there is no simple way to decide how many resources to deploy on a user's machine just by gathering some usage data. Even polling the instantaneous (or some short-term average) usage, one cannot foresee whether the machine will be 'free' over the next X minutes/hours. One can, at least, ask some educated questions (see the policy-check sketch after this list):

    • check whether a non-root/non-condor user is logged in (on a tty or pts; quite a simplification, but effective) or is running tasks. Furthermore, time-based choices could be made (e.g. if nobody is logged in or running jobs at 00:00 it is likely the host will remain free until, say, 06:00).
    • there is no magic formula, based on usage, for deciding how many containers can run on a single host; at best one can normalize the requirements to the maximum resources available on the host.
    • it would be nice to differentiate containers (currently I don't know how) according to the kind of job (long/short) they could run.
    • collect long-term statistics to 'tag' a computer.
    • short jobs can start.
    • since we are actually talking about volunteer computing (and not leech computing), it would make sense for the owners to choose, during the installation phase (still editable later, by the way), some criteria or policies, e.g. based on job length. The daemon (manager) could then make decisions based on something explicit, and be consistent.
  4. Nothing to add (for now).

  5. Nothing to add (for now).

  6. 'condor_off' is currently not working properly (see point 2).
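
As a minimal sketch of the policy check discussed in point 3, assuming psutil for the usage numbers and the logged-in users; the CPU threshold and the 'night window' are arbitrary placeholders, not agreed-upon policies:

```python
import datetime

import psutil

CPU_THRESHOLD = 25.0   # percent of CPU usage above which the host counts as busy (arbitrary)
NIGHT = (0, 6)         # between 00:00 and 06:00 the host is likely to stay free


def foreign_users_logged_in():
    """True if someone other than root/condor is logged in on a tty/pts."""
    return any(u.name not in ('root', 'condor') for u in psutil.users())


def host_is_free():
    """Very rough policy: no foreign users, and either night time or low CPU usage."""
    now = datetime.datetime.now()
    night_time = NIGHT[0] <= now.hour < NIGHT[1]
    busy = psutil.cpu_percent(interval=1) > CPU_THRESHOLD
    return not foreign_users_logged_in() and (night_time or not busy)
```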
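
For the «daemonize» TODO, one common option (an assumption, not what the script currently does) is the python-daemon package (PEP 3143); the classic double-fork recipe with the standard library would work as well:

```python
import daemon  # python-daemon package (PEP 3143); an assumption


def manager_main_loop():
    # the pull / create / garbage-collect loop sketched after the goals list
    pass


with daemon.DaemonContext():
    manager_main_loop()
```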
