@boegel
Last active August 29, 2015 13:56
Python framework for job submission - project idea

A Python framework for using resource managers

This page presents a project idea for a Python framework for job submission, with the intent of triggering collaboration on the topic.

Authors
  • Kenneth Hoste, Stijn De Weirdt (HPC-UGent)

Definitions

  • job script: text file implementing a workload in a scripting language (e.g. bash, tclsh, Python, ...)
      • has certain well-defined parts, e.g. shebang, RM header (e.g. #PBS ...), actual workload implementation, ...
  • job: instance of a job script representing a workload, e.g. a simulation, scientific experiment, ...
  • resource manager (RM): (remote) service to which jobs are submitted
      • does not (have to) include a job scheduler
      • does not assume tracking of job states or job history (example: completed jobs are no longer available via the RM interface)
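
As an illustration of the job script parts listed above, here is a minimal sketch that assembles a script from a shebang, an RM header and a workload. The function name `render_job_script` and its parameters are hypothetical and only serve this example; the `#PBS` directives are Torque/PBS-style header lines.

```python
def render_job_script(workload, rm_header_lines, shebang="#!/bin/bash"):
    """Assemble a job script from its parts: shebang, RM header, workload.

    Hypothetical helper, not part of any existing framework.
    """
    lines = [shebang]
    lines.extend(rm_header_lines)        # e.g. '#PBS ...' directives
    lines.append("")                     # blank line between header and workload
    lines.append(workload.rstrip("\n"))  # the actual workload implementation
    return "\n".join(lines) + "\n"


script = render_job_script(
    workload="echo 'hello from the job'",
    rm_header_lines=["#PBS -N example", "#PBS -l walltime=10:00:00"],
)
print(script)
```

Keeping the parts separate would let the framework swap the RM header for another resource manager without touching the workload.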

Functionality

support for (in order of preference):

  • job submission (qsub script.sh)
  • minimal job attributes
      • walltime (e.g. qsub -l walltime=10:00:00)
      • nodes/cores (e.g. qsub -l nodes=1:ppn=16)

  • dependencies (e.g. qsub -W depend=afterok:<jobid>)
  • job hold/release
  • DAG job submission
      • DAG job: set of jobs with interdependencies
  • 'array' jobs (qsub -t)
  • job querying (state) (e.g. qstat)
  • job removal (e.g. qdel)
  • advanced job attributes
      • queue
      • target partition/reservation
      • node features (e.g. qsub -W ...)
      • mail settings (e.g. qsub -m abe)
      • memory requirements (e.g. qsub -l vmem=10gb)
  • mapping of abstract node features to job attributes (e.g. a Harpertown-based GPGPU node => qsub -q gpu_harpertown, qsub -l nodes=1:harpertown:gpu)
  • support for remote submission (cfr. Galaxy)
      • e.g. via SSH tunnel to cluster login nodes, ...
  • general interface to multiple (remote) systems (all around the world)
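
To make the mapping from abstract job attributes to resource manager options more concrete, here is a minimal sketch for a Torque/PBS-style `qsub`. The function `build_qsub_command` and its parameter names are assumptions made for this example; it only builds the argument list and does not run anything. Dependencies use the Torque-style `-W depend=afterok:<jobid>` form.

```python
def build_qsub_command(script, walltime=None, nodes=None, ppn=None,
                       after_ok=None, queue=None, mail_events=None):
    """Translate abstract job attributes into a Torque/PBS-style qsub call.

    Hypothetical helper: returns the command as a list of arguments.
    """
    cmd = ["qsub"]
    if walltime:
        cmd.extend(["-l", "walltime=%s" % walltime])
    if nodes:
        spec = "nodes=%d" % nodes
        if ppn:
            spec += ":ppn=%d" % ppn
        cmd.extend(["-l", spec])
    if after_ok:
        # only start once all listed jobs have completed successfully
        cmd.extend(["-W", "depend=afterok:%s" % ":".join(after_ok)])
    if queue:
        cmd.extend(["-q", queue])
    if mail_events:
        cmd.extend(["-m", mail_events])
    cmd.append(script)
    return cmd


print(build_qsub_command("script.sh", walltime="10:00:00", nodes=1, ppn=16))
```

A subclass for another resource manager would implement the same translation with its own option names, which is where a shared abstract attribute set pays off.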

Goals

  • API
  • support for various resource managers, e.g. PBS, SLURM, PBSPro, OAR, LoadLeveler, MOAB, ...
  • command line client (e.g. mysub)

Specifications

Development platform
  • source code repository: git
  • collaborative framework: GitHub (github.com)
  • documentation: GitHub wiki pages
Programming language
  • compatible with Python v2.6 and more recent Python v2.x
  • compatibility with Python 3.x is definitely worth considering (and feasible alongside Python v2.6 support)
  • references to guides on maintaining a Python2/3 compatible codebase?
Design
  • object-oriented design
  • 'abstract' class ResourceManager:
    class ResourceManager(object):
      ...
  • subclasses for specific resource managers:
    class Pbs(ResourceManager):
      ...

    class Slurm(ResourceManager):
      ...
  • simple and clean API, e.g.:
    class Job(object):
      """Representation of a job."""
      def __init__(self, *args, **kwargs):
        self.name = None
        self.script = None
        self.dependencies = []
        self.jobid = None
        ...

    class GroupOfJobs(object):
      """Representation of a group of jobs (e.g. a DAG)."""
      def __init__(self, *args, **kwargs):
        self.jobs = []
      ...

    def create_job(job_script, job_specs=None):
      """Create a new Job instance."""
      ...

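    UserHold = 'user'  # placeholder hold type; actual representation still to be decided

    class ResourceManager(object):
      """Abstract class representing a resource manager."""

      def submit_job(self, job, job_attrs=None):
        """Submit a job; job_attrs is an optional list of job attributes."""
        # None rather than [] as default, to avoid a shared mutable default argument
        raise NotImplementedError

      def hold_job(self, job, hold_type=UserHold):
        """Set hold on a job."""
        raise NotImplementedError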

      ...
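
One possible way to pick the right subclass at runtime is a lookup by name, similar in spirit to the `get_driver` approach mentioned in the comments below. This is only a sketch: the registry `_RESOURCE_MANAGERS` and the function `get_resource_manager` are hypothetical names, and the classes are stripped-down stand-ins for the ones outlined above.

```python
class ResourceManager(object):
    """Abstract class representing a resource manager."""

    def submit_job(self, job, job_attrs=None):
        raise NotImplementedError


class Pbs(ResourceManager):
    """Stand-in for a PBS/Torque implementation."""


class Slurm(ResourceManager):
    """Stand-in for a SLURM implementation."""


# hypothetical registry mapping names to ResourceManager subclasses
_RESOURCE_MANAGERS = {"pbs": Pbs, "slurm": Slurm}


def get_resource_manager(name):
    """Return the ResourceManager subclass registered under the given name."""
    try:
        return _RESOURCE_MANAGERS[name.lower()]
    except KeyError:
        raise ValueError("Unknown resource manager: %s" % name)


rm = get_resource_manager("pbs")()
print(type(rm).__name__)
```

Returning the class rather than an instance leaves it to the caller to pass connection details (e.g. a remote host) to the constructor.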
Features
  • includes a suite of unit tests from the very start
  • interfacing with specific resource managing software can be done via mocking
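
A minimal sketch of what such a mocked unit test could look like, assuming a `Pbs` class that shells out to `qsub` and treats its standard output as the job ID. Both the class and that behaviour are assumptions for this example. It uses `unittest.mock` from the Python 3 standard library; on Python 2 the external `mock` package provides the same API, and `subprocess.check_output` requires Python 2.7 or later.

```python
import subprocess
import unittest
from unittest import mock


class Pbs(object):
    """Stand-in PBS backend that submits via the qsub command."""

    def submit_job(self, script):
        # assumption: qsub prints the ID of the new job on stdout
        output = subprocess.check_output(["qsub", script])
        return output.decode("utf-8").strip()


class PbsSubmitTest(unittest.TestCase):
    def test_submit_returns_job_id(self):
        # replace the real qsub call, so no resource manager is needed
        with mock.patch("subprocess.check_output",
                        return_value=b"12345.master\n") as fake_qsub:
            jobid = Pbs().submit_job("script.sh")
        self.assertEqual(jobid, "12345.master")
        fake_qsub.assert_called_once_with(["qsub", "script.sh"])
```

Saved in a module, the test can be run with `python -m unittest <module>`, without access to a cluster.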
Task assignment

(preliminary draft, in no way is this final or approved yet!!!)

  • pick a catchy name: joint effort
  • agree on license: joint effort
      • LGPL because of integration into other tools?
  • make inventory of related/existing (Python) frameworks: joint effort
      • includes looking into available Python APIs for resource managers, e.g. pbs_python, python-torque, ...
  • design: joint effort
  • framework implementation: NeSI (?)
      • includes documenting via docstrings, unit tests, ...
  • implementation for specific resource managers:
      • Torque: HPC-UGent
      • SLURM: NeSI, JSC (?)
      • PBSPro: GMI/azet.org (?)
      • OAR: Uni.lu (?)
      • LoadLeveler: JSC (?)
Organisation

cfr. EasyBuild structure

  • release manager (NeSI?)
      • only person with merging rights
      • his/her institution hosts central git repository with master branch (e.g. https://github.com/nesi/)
  • code reviewers
  • developers
  • testers

Intended use

  • backend for job submission in EasyBuild (--job)
      • minimal requirement: DAG job submission
  • backend for job submission in benchmarking/performance monitoring frameworks (e.g. JuBE)
  • backend for user portal (e.g. in Django)
      • e.g. Galaxy
  • backend for Hanything-On-Demand (https://github.ugent.be/hpcugent/hanythingondemand/)
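
The DAG job submission named above as the minimal requirement essentially means submitting each job only after the jobs it depends on, so that their job IDs are available. A minimal sketch, assuming a simplified `Job` class and a `submit` callback that returns a job ID; the names `submission_order` and `submit_dag` are hypothetical:

```python
class Job(object):
    """Simplified job: a name, the jobs it depends on, and a job ID."""

    def __init__(self, name, dependencies=None):
        self.name = name
        self.dependencies = list(dependencies or [])
        self.jobid = None


def submission_order(jobs):
    """Return jobs ordered so that every job comes after its dependencies."""
    ordered = []
    state = {}  # job name -> "visiting" or "done"

    def visit(job):
        status = state.get(job.name)
        if status == "done":
            return
        if status == "visiting":
            raise ValueError("Dependency cycle involving %s" % job.name)
        state[job.name] = "visiting"
        for dep in job.dependencies:
            visit(dep)
        state[job.name] = "done"
        ordered.append(job)

    for job in jobs:
        visit(job)
    return ordered


def submit_dag(jobs, submit):
    """Submit jobs in dependency order, passing along the dependency job IDs."""
    for job in submission_order(jobs):
        dep_ids = [dep.jobid for dep in job.dependencies]
        job.jobid = submit(job, dep_ids)
    return [job.jobid for job in jobs]


build = Job("build")
test = Job("test", [build])
report = Job("report", [build, test])

calls = []

def fake_submit(job, dep_ids):
    # stand-in for a real ResourceManager.submit_job call
    calls.append((job.name, dep_ids))
    return "id-%s" % job.name

submit_dag([report, test, build], fake_submit)
print(calls)
```

In a real backend, `submit` would pass the dependency IDs on to the resource manager, e.g. as an `afterok` dependency.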

Inspiration

@ehiggs

ehiggs commented Aug 20, 2014

I think a lot of inspiration here can be taken from libcloud, which does something similar but for virtualization requisitioning. Have a poke through the code here and take a look at how get_driver works by returning a class which can then be used to run the job.

Would testing require something to spawn a virtualized cluster? Otherwise, it's not clear how testing could be done.

@boegel

boegel commented Aug 20, 2014

gc3pie claims to closely match this project idea, see https://code.google.com/p/gc3pie/

@ehiggs: they should be using libcloud, not us :)
