
Turin, Wed 5 Aug - 2015

### Goals:

  1. CentOS6/SCL6-based containers, Parrot+CVMFS aware, capable of running the experiment software.
  2. A Condor cluster using the containers as workers.
  3. A Python script (based on the daemon API) able to:
    1. perform a continuous check of resource availability on the bare-metal host and take decisions by applying specific policies.
    2. always pull up-to-date images.
    3. create the needed containers, starting from a remote configuration and the pulled images.
    4. manage running containers following TTL policies, performing a kind of garbage collection (a minimal sketch of this loop follows the list).
  4. A container pilot as the container entry point: fetch the remote Condor configuration files, start the Condor daemons, wait for a job during a limited period, then exit the container (a pilot sketch also follows the list).
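
As a rough illustration of goal 3, here is a minimal sketch of the manager loop, assuming the docker-py 1.x `Client` API (note that, as pointed out in the issues below, the current version does not use docker-py); the image name, TTL and polling interval are placeholders:

```python
import time

from docker import Client  # docker-py (1.x API); an assumption, see the issues below

IMAGE = 'mconcas/slc6-worker'   # hypothetical image name
TTL = 6 * 3600                  # seconds a worker container is allowed to live
CHECK_EVERY = 300               # polling interval, in seconds

cli = Client(base_url='unix://var/run/docker.sock')


def host_is_free():
    """Placeholder for the resource/policy check (see the policy sketch further below)."""
    return True


while True:
    # 1. keep the image up to date
    cli.pull(IMAGE, tag='latest')

    # 2. spawn a worker container if the policy allows it
    if host_is_free():
        container = cli.create_container(image=IMAGE + ':latest')
        cli.start(container=container['Id'])

    # 3. TTL-based garbage collection of the running workers
    now = time.time()
    for c in cli.containers():  # running containers only
        if c['Image'].startswith(IMAGE) and now - c['Created'] > TTL:
            cli.remove_container(container=c['Id'], force=True)

    time.sleep(CHECK_EVERY)
```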
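
Goal 4 could be sketched roughly as below (Python 2, to match the CentOS 6 environment); the configuration URL and the waiting window are hypothetical, and a real pilot would also check whether a job actually arrived before shutting down:

```python
import subprocess
import time
import urllib

CONFIG_URL = 'http://example.org/condor/worker.conf'  # hypothetical remote configuration
CONFIG_DST = '/etc/condor/condor_config.local'
WAIT_WINDOW = 2 * 3600                                # limited waiting period, in seconds

# 1. fetch the remote Condor configuration
urllib.urlretrieve(CONFIG_URL, CONFIG_DST)

# 2. start the Condor daemons
subprocess.check_call(['condor_master'])

# 3. wait during a limited period (checking for an actual job is omitted here),
#    then shut the daemons down so the container can exit
time.sleep(WAIT_WINDOW)
subprocess.call(['condor_off', '-master'])
```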

### Issues to be discussed & TODO list

  1. [TODO] Run a dummy test inside the container to verify that the environment actually works.

  2. [TODO] Build a working Condor configuration for the entire cluster. The current one does not work correctly (the pilot is involved in the debugging phase as well).

  3. [ISSUE] The current version does not make use of the docker-py module. Instead of building a brand-new Python module (i.e. reinventing the wheel), one could use it (con: the existing code would have to be rebuilt from scratch). The lifecycle sketch after the goals list above assumes docker-py for illustration.

    [TODO] «Daemonize» the script (see the daemonization sketch after this list).

    [TODO+ISSUE] About policies: what should we keep an eye on? (The psutil module is widely recommended for measuring resource utilization.) Personally, I think there is no simple way to decide how many resources to deploy on a user's machine just by gathering some usage data. Even polling the instantaneous (or some short-term average) usage, one cannot foresee whether the machine will be 'free' over the next X minutes/hours. One can, at least, ask some educated questions (see the policy-check sketch after this list):

    • check whether a non-root/non-condor user is logged in (on a tty or pts; quite a simplification, but effective) or is running tasks. Furthermore, time-based choices could be made (e.g. if nobody is logged in or running jobs at 00:00 it is likely the host will remain free until, say, 06:00).
    • there is no magic formula, based on usage, for deciding how many containers can run on a single host; at best one can normalize the requirements to the maximum resources available on the host.
    • it would be nice to differentiate containers (currently I don't know how) according to the kind of job (long/short) they could run.
    • collect long-term statistics to 'tag' a computer.
    • short jobs can start.
    • since we are actually talking about volunteer computing (and not leech computing), it would make sense for the owners to choose, during the installation phase (still editable later, by the way), some criteria or policies, e.g. based on job length. The daemon (manager) could then make decisions based on something explicit, and be consistent.
  4. Nothing to add (for now).

  5. Nothing to add (for now).

  6. 'condor_off' is currently not working properly (see point 2).
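
As a minimal sketch of the policy check discussed in point 3, assuming psutil for the usage numbers and the logged-in users; the CPU threshold and the 'night window' are arbitrary placeholders, not agreed-upon policies:

```python
import datetime

import psutil

CPU_THRESHOLD = 25.0   # percent of CPU usage above which the host counts as busy (arbitrary)
NIGHT = (0, 6)         # between 00:00 and 06:00 the host is likely to stay free


def foreign_users_logged_in():
    """True if someone other than root/condor is logged in on a tty/pts."""
    return any(u.name not in ('root', 'condor') for u in psutil.users())


def host_is_free():
    """Very rough policy: no foreign users, and either night time or low CPU usage."""
    now = datetime.datetime.now()
    night_time = NIGHT[0] <= now.hour < NIGHT[1]
    busy = psutil.cpu_percent(interval=1) > CPU_THRESHOLD
    return not foreign_users_logged_in() and (night_time or not busy)
```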
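
For the «daemonize» TODO, one common option (an assumption, not what the script currently does) is the python-daemon package (PEP 3143); the classic double-fork recipe with the standard library would work as well:

```python
import daemon  # python-daemon package (PEP 3143); an assumption


def manager_main_loop():
    # the pull / create / garbage-collect loop sketched after the goals list
    pass


with daemon.DaemonContext():
    manager_main_loop()
```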
