GSoC 2017 Geant HPC workload balancer.

Task

Create an HPC layer for the GeantV project. GeantV is an application for particle physics simulation that previously had only a single-node implementation. Our task was to build a system that is aware of multiple hosts and can use the allocated resources to process more events in a more efficient manner.

Code

Merge request: https://gitlab.cern.ch/GeantV/geant/merge_requests/177

Last GSoC-related commit: e069dce6

What is done

We implemented the GeantV HPC layer for efficient processing of complex particle transport simulations. It consists of two main components: the Event Dispatcher (master) and the Event Servers (workers). The Event Dispatcher is the central authority that tracks all workers and the jobs allocated to them. A worker is an ordinary GeantV application that performs the particle transport simulation job.

Basic workflow:

A worker asks the Event Dispatcher for a job and, after completing it, confirms that the job is done. The Event Dispatcher tracks all workers, allocates the requested amount of work upon request, and keeps track of every allocated job. One interesting feature is that we continuously collect statistics about the nodes during a run, which will let us use unsupervised machine learning methods to find bad nodes in the future.
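
To make the request/confirm cycle concrete, here is a minimal sketch of the dispatcher-side bookkeeping behind it. All class, method, and message names are illustrative assumptions, not the actual GeantV HPC classes, and the ZeroMQ transport layer is omitted.

    // Sketch only: names are assumptions for illustration, not the actual GeantV HPC code.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <queue>
    #include <string>

    struct Job { uint64_t id; int events; };              // unit of work handed to a worker

    class EventDispatcher {
      std::queue<Job> pending_;                            // jobs not yet handed out
      std::map<uint64_t, std::string> inFlight_;           // job id -> worker currently holding it
      uint64_t nextId_ = 0;
    public:
      void AddWork(int events) { pending_.push({nextId_++, events}); }

      // "GetJob" request from a worker: hand out the next pending job and remember who has it.
      bool RequestJob(const std::string &worker, Job &out) {
        if (pending_.empty()) return false;
        out = pending_.front();
        pending_.pop();
        inFlight_[out.id] = worker;
        return true;
      }

      // "JobDone" confirmation from a worker: the job is finished for good.
      void ConfirmJob(uint64_t id) { inFlight_.erase(id); }

      // Called when a worker is declared dead or a job times out: put the work back in the queue.
      void Reschedule(const Job &job) {
        if (inFlight_.erase(job.id)) pending_.push(job);
      }
    };

    int main() {
      EventDispatcher dispatcher;
      dispatcher.AddWork(16);                              // 16 events to simulate
      Job job;
      if (dispatcher.RequestJob("worker-1", job))          // worker-1 pulls a job
        std::cout << "worker-1 got job " << job.id << " (" << job.events << " events)\n";
      dispatcher.ConfirmJob(job.id);                       // worker-1 confirms completion
    }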

Implemented features

  • During the work we developed a protocol for reliable asynchronous communication between worker and server in both directions (based on ZeroMQ), which makes it easy to add new message types; a minimal socket sketch follows this list.
  • Message loss and other issues may cause a worker to go out of contact; we handle this properly.
  • A worker asks for a job whenever it has free event slots, so the worker is always loaded. This gives much better load balancing than manual balancing.
  • We have a heartbeat policy that allows us to detect dead workers (sketched after this list).
  • When a worker dies, its work is rescheduled.
  • If work is not confirmed before its deadline (timeout policy), it is canceled and rescheduled.
  • We can remotely kill and restart the worker application.
  • We gather statistics from the workers.
  • At the end of a run, when no jobs are left for some workers, we can distribute duplicated jobs and cancel all duplicates once at least one of them is confirmed (succeeds); a small sketch of this follows the list.
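
The gist does not show the actual socket layout or wire format, so the sketch below only illustrates the kind of ZeroMQ pattern that supports this asynchronous two-way traffic: a ROUTER socket on the dispatcher and a DEALER socket on the worker, both placed in one process here just to keep the example runnable. The "GetJob" message type is a made-up placeholder.

    // Sketch only: the real GeantV socket types and message format are not shown in this
    // gist; ROUTER/DEALER is simply a common ZeroMQ pattern for asynchronous two-way
    // master/worker traffic.
    #include <zmq.h>
    #include <cstdio>

    int main() {
      void *ctx = zmq_ctx_new();

      // Dispatcher side: a ROUTER socket keeps the identity of every connected worker,
      // so it can send replies or unsolicited commands back to a specific worker later.
      void *dispatcher = zmq_socket(ctx, ZMQ_ROUTER);
      zmq_bind(dispatcher, "tcp://*:5555");

      // Worker side: a DEALER socket can send requests and receive commands without
      // the strict send/receive lockstep of REQ/REP.
      void *worker = zmq_socket(ctx, ZMQ_DEALER);
      zmq_connect(worker, "tcp://localhost:5555");

      // Worker -> dispatcher: a job request ("GetJob" is a placeholder message type).
      zmq_send(worker, "GetJob", 6, 0);

      // The ROUTER receives two frames: the worker identity and the payload.
      char identity[256], payload[256];
      int idLen = zmq_recv(dispatcher, identity, sizeof identity, 0);
      int len = zmq_recv(dispatcher, payload, sizeof payload, 0);
      std::printf("dispatcher received '%.*s' from a worker (identity %d bytes long)\n",
                  len, payload, idLen);

      zmq_close(worker);
      zmq_close(dispatcher);
      zmq_ctx_term(ctx);
    }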
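
The heartbeat and timeout policies boil down to two pieces of bookkeeping: when we last heard from each worker, and by when each job must be confirmed. Below is a minimal sketch of that bookkeeping; the names and the concrete timeout value are assumptions, not the values used in the actual code.

    // Sketch only: hypothetical names and timeout values, not the actual GeantV policy.
    // A periodic check reports silent workers and expired jobs so the caller can cancel
    // them and reschedule the work (e.g. via EventDispatcher::Reschedule above).
    #include <chrono>
    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    using Clock = std::chrono::steady_clock;

    class LivenessMonitor {
      std::map<std::string, Clock::time_point> lastBeat_;  // worker -> last heartbeat time
      std::map<uint64_t, Clock::time_point> deadline_;     // job id -> confirmation deadline
      std::chrono::seconds beatTimeout_{30};                // assumed heartbeat timeout
    public:
      void OnHeartbeat(const std::string &worker) { lastBeat_[worker] = Clock::now(); }
      void OnJobSent(uint64_t id, std::chrono::seconds budget) {
        deadline_[id] = Clock::now() + budget;
      }
      void OnJobConfirmed(uint64_t id) { deadline_.erase(id); }

      // Workers that have not sent a heartbeat within the timeout are considered dead.
      std::vector<std::string> DeadWorkers() const {
        std::vector<std::string> dead;
        const auto now = Clock::now();
        for (const auto &entry : lastBeat_)
          if (now - entry.second > beatTimeout_) dead.push_back(entry.first);
        return dead;
      }

      // Jobs whose deadline has passed must be canceled and rescheduled.
      std::vector<uint64_t> ExpiredJobs() const {
        std::vector<uint64_t> expired;
        const auto now = Clock::now();
        for (const auto &entry : deadline_)
          if (now > entry.second) expired.push_back(entry.first);
        return expired;
      }
    };

    int main() {
      LivenessMonitor monitor;
      monitor.OnHeartbeat("worker-1");
      monitor.OnJobSent(0, std::chrono::seconds(60));
      // ... the dispatcher would poll DeadWorkers() and ExpiredJobs() periodically.
    }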
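
For the end-of-run duplication, the essential logic is to remember which workers hold a copy of the same job and, on the first confirmation, cancel every other copy. A possible bookkeeping sketch, again with hypothetical names:

    // Sketch only: hypothetical names, not the actual GeantV code. The same job is handed
    // to several otherwise idle workers; the first confirmation wins and the remaining
    // copies are canceled.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    class DuplicateTracker {
      std::map<uint64_t, std::set<std::string>> copies_;   // job id -> workers running a copy
    public:
      void Duplicate(uint64_t id, const std::string &worker) { copies_[id].insert(worker); }

      // First confirmation of a duplicated job: return the workers whose redundant
      // copies must be canceled, then forget the job.
      std::vector<std::string> Confirm(uint64_t id, const std::string &winner) {
        std::vector<std::string> toCancel;
        auto it = copies_.find(id);
        if (it == copies_.end()) return toCancel;
        for (const auto &worker : it->second)
          if (worker != winner) toCancel.push_back(worker);
        copies_.erase(it);
        return toCancel;
      }
    };

    int main() {
      DuplicateTracker tracker;
      tracker.Duplicate(7, "worker-1");
      tracker.Duplicate(7, "worker-2");
      for (const auto &worker : tracker.Confirm(7, "worker-2"))
        std::cout << "cancel duplicated job 7 on " << worker << "\n";  // prints worker-1
    }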