GSoC 2017 Geant HPC workload balancer.

Task

Create an HPC layer for the GeantV project. GeantV is an application for particle physics simulation that previously had only a single-node implementation. Our task was to build a system that is aware of multiple hosts and can use the allocated resources to process more events in a more efficient manner.

Code

Merge request: https://gitlab.cern.ch/GeantV/geant/merge_requests/177

Last GSoC-related commit: e069dce6

What is done

We implemented the GeantV HPC layer for efficient processing of complex particle transport simulations. It consists of two main components: the Event Dispatcher (master) and the Event Servers (workers). The Event Dispatcher is the central authority that tracks all workers and the jobs allocated to them. A worker is an ordinary GeantV application that performs the particle transport simulation job.

Basic workflow:

A worker asks the Event Dispatcher for a job and, after completing it, confirms that the job is done. The Event Dispatcher tracks all workers, allocates the requested amount of work upon request, and keeps track of every allocated job. One interesting feature is that we continuously collect statistics about the nodes during a run, which will let us use unsupervised machine learning methods to find bad nodes in the future.
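
To make the request/confirm cycle concrete, here is a minimal sketch of the dispatcher-side bookkeeping behind it. All class, method, and message names are illustrative assumptions, not the actual GeantV HPC classes, and the ZeroMQ transport layer is omitted.

    // Sketch only: names are assumptions for illustration, not the actual GeantV HPC code.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <queue>
    #include <string>

    struct Job { uint64_t id; int events; };              // unit of work handed to a worker

    class EventDispatcher {
      std::queue<Job> pending_;                            // jobs not yet handed out
      std::map<uint64_t, std::string> inFlight_;           // job id -> worker currently holding it
      uint64_t nextId_ = 0;
    public:
      void AddWork(int events) { pending_.push({nextId_++, events}); }

      // "GetJob" request from a worker: hand out the next pending job and remember who has it.
      bool RequestJob(const std::string &worker, Job &out) {
        if (pending_.empty()) return false;
        out = pending_.front();
        pending_.pop();
        inFlight_[out.id] = worker;
        return true;
      }

      // "JobDone" confirmation from a worker: the job is finished for good.
      void ConfirmJob(uint64_t id) { inFlight_.erase(id); }

      // Called when a worker is declared dead or a job times out: put the work back in the queue.
      void Reschedule(const Job &job) {
        if (inFlight_.erase(job.id)) pending_.push(job);
      }
    };

    int main() {
      EventDispatcher dispatcher;
      dispatcher.AddWork(16);                              // 16 events to simulate
      Job job;
      if (dispatcher.RequestJob("worker-1", job))          // worker-1 pulls a job
        std::cout << "worker-1 got job " << job.id << " (" << job.events << " events)\n";
      dispatcher.ConfirmJob(job.id);                       // worker-1 confirms completion
    }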

Implemented features

  • During the work we developed a protocol for reliable asynchronous communication between worker and server in both directions (based on ZeroMQ), which makes it easy to add new message types; a minimal socket sketch follows this list.
  • Message loss and other issues may cause a worker to go out of contact; we handle this properly.
  • A worker asks for a job whenever it has free event slots, so the worker is always loaded. This gives much better load balancing than manual balancing.
  • We have a heartbeat policy that allows us to detect dead workers (sketched after this list).
  • When a worker dies, its work is rescheduled.
  • If work is not confirmed before its deadline (timeout policy), it is canceled and rescheduled.
  • We can remotely kill and restart the worker application.
  • We gather statistics from the workers.
  • At the end of a run, when no jobs are left for some workers, we can distribute duplicated jobs and cancel all duplicates once at least one of them is confirmed (succeeds); a small sketch of this follows the list.
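
The gist does not show the actual socket layout or wire format, so the sketch below only illustrates the kind of ZeroMQ pattern that supports this asynchronous two-way traffic: a ROUTER socket on the dispatcher and a DEALER socket on the worker, both placed in one process here just to keep the example runnable. The "GetJob" message type is a made-up placeholder.

    // Sketch only: the real GeantV socket types and message format are not shown in this
    // gist; ROUTER/DEALER is simply a common ZeroMQ pattern for asynchronous two-way
    // master/worker traffic.
    #include <zmq.h>
    #include <cstdio>

    int main() {
      void *ctx = zmq_ctx_new();

      // Dispatcher side: a ROUTER socket keeps the identity of every connected worker,
      // so it can send replies or unsolicited commands back to a specific worker later.
      void *dispatcher = zmq_socket(ctx, ZMQ_ROUTER);
      zmq_bind(dispatcher, "tcp://*:5555");

      // Worker side: a DEALER socket can send requests and receive commands without
      // the strict send/receive lockstep of REQ/REP.
      void *worker = zmq_socket(ctx, ZMQ_DEALER);
      zmq_connect(worker, "tcp://localhost:5555");

      // Worker -> dispatcher: a job request ("GetJob" is a placeholder message type).
      zmq_send(worker, "GetJob", 6, 0);

      // The ROUTER receives two frames: the worker identity and the payload.
      char identity[256], payload[256];
      int idLen = zmq_recv(dispatcher, identity, sizeof identity, 0);
      int len = zmq_recv(dispatcher, payload, sizeof payload, 0);
      std::printf("dispatcher received '%.*s' from a worker (identity %d bytes long)\n",
                  len, payload, idLen);

      zmq_close(worker);
      zmq_close(dispatcher);
      zmq_ctx_term(ctx);
    }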
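
The heartbeat and timeout policies boil down to two pieces of bookkeeping: when we last heard from each worker, and by when each job must be confirmed. Below is a minimal sketch of that bookkeeping; the names and the concrete timeout value are assumptions, not the values used in the actual code.

    // Sketch only: hypothetical names and timeout values, not the actual GeantV policy.
    // A periodic check reports silent workers and expired jobs so the caller can cancel
    // them and reschedule the work (e.g. via EventDispatcher::Reschedule above).
    #include <chrono>
    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    using Clock = std::chrono::steady_clock;

    class LivenessMonitor {
      std::map<std::string, Clock::time_point> lastBeat_;  // worker -> last heartbeat time
      std::map<uint64_t, Clock::time_point> deadline_;     // job id -> confirmation deadline
      std::chrono::seconds beatTimeout_{30};                // assumed heartbeat timeout
    public:
      void OnHeartbeat(const std::string &worker) { lastBeat_[worker] = Clock::now(); }
      void OnJobSent(uint64_t id, std::chrono::seconds budget) {
        deadline_[id] = Clock::now() + budget;
      }
      void OnJobConfirmed(uint64_t id) { deadline_.erase(id); }

      // Workers that have not sent a heartbeat within the timeout are considered dead.
      std::vector<std::string> DeadWorkers() const {
        std::vector<std::string> dead;
        const auto now = Clock::now();
        for (const auto &entry : lastBeat_)
          if (now - entry.second > beatTimeout_) dead.push_back(entry.first);
        return dead;
      }

      // Jobs whose deadline has passed must be canceled and rescheduled.
      std::vector<uint64_t> ExpiredJobs() const {
        std::vector<uint64_t> expired;
        const auto now = Clock::now();
        for (const auto &entry : deadline_)
          if (now > entry.second) expired.push_back(entry.first);
        return expired;
      }
    };

    int main() {
      LivenessMonitor monitor;
      monitor.OnHeartbeat("worker-1");
      monitor.OnJobSent(0, std::chrono::seconds(60));
      // ... the dispatcher would poll DeadWorkers() and ExpiredJobs() periodically.
    }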
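
For the end-of-run duplication, the essential logic is to remember which workers hold a copy of the same job and, on the first confirmation, cancel every other copy. A possible bookkeeping sketch, again with hypothetical names:

    // Sketch only: hypothetical names, not the actual GeantV code. The same job is handed
    // to several otherwise idle workers; the first confirmation wins and the remaining
    // copies are canceled.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    class DuplicateTracker {
      std::map<uint64_t, std::set<std::string>> copies_;   // job id -> workers running a copy
    public:
      void Duplicate(uint64_t id, const std::string &worker) { copies_[id].insert(worker); }

      // First confirmation of a duplicated job: return the workers whose redundant
      // copies must be canceled, then forget the job.
      std::vector<std::string> Confirm(uint64_t id, const std::string &winner) {
        std::vector<std::string> toCancel;
        auto it = copies_.find(id);
        if (it == copies_.end()) return toCancel;
        for (const auto &worker : it->second)
          if (worker != winner) toCancel.push_back(worker);
        copies_.erase(it);
        return toCancel;
      }
    };

    int main() {
      DuplicateTracker tracker;
      tracker.Duplicate(7, "worker-1");
      tracker.Duplicate(7, "worker-2");
      for (const auto &worker : tracker.Confirm(7, "worker-2"))
        std::cout << "cancel duplicated job 7 on " << worker << "\n";  // prints worker-1
    }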