Google Summer of Code (Arturo Garza, 2019)

VectorFlow is an open-source C++ adapter that integrates SIMD-vectorized components into a scalar workflow. The API provides a generic vector flow service that can non-intrusively integrate with arbitrary data processing frameworks. The overhead introduced by the extra data transformations is much smaller than the benefits obtained from SIMD vectorization and from better instruction caching.

The project idea arises from the effort and collaboration of the CERN-HSF organization to improve the performance of simulation in particle physics. Many FLOP-intensive high-energy physics algorithms could profit from the vector pipelines of modern processors, but they don't because they lack vectorizable inner loops. With this in mind, VectorFlow was born.

  • The service allows the user to work only on data of interest. That is, a FILTER step to avoid unnecessary computations.
  • The scalar data of arbitrary type can be accumulated in vectors of a given size (defined by the computer architecture). That is, a GATHER step to prepare the data for SIMD processing.
  • The transformed data can then be fed directly into a vector-aware implementation of the given algorithm. That is, a VECTORIZATION step, which integrates with VecCore services.
  • At the end, the output data can be pushed back into the initial scalar form. That is, a SCATTER step to re-integrate the output into the framework data flow. A minimal sketch of these four steps follows below.
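
A minimal sketch of how these four steps could look around a scalar input container. All names in this snippet (Track, IsCharged, ProcessVector, PushOutput, kVectorSize) are illustrative placeholders, not part of the VectorFlow API:

  // Illustrative sketch only: Track, kVectorSize, IsCharged, ProcessVector and
  // PushOutput are hypothetical placeholders, not part of the VectorFlow API.
  #include <cstddef>
  #include <vector>

  struct Track { double charge = 1.; };
  constexpr std::size_t kVectorSize = 4;              // SIMD width, e.g. 4 doubles on AVX2

  bool IsCharged(const Track *t) { return t->charge != 0.; }
  void ProcessVector(std::vector<Track *> &) {}       // stand-in for the SIMD algorithm
  void PushOutput(Track *) {}                         // stand-in for the framework output

  void RunAdapter(const std::vector<Track *> &scalarInput) {
    std::vector<Track *> lanes;                       // staging area for one SIMD vector
    for (auto *track : scalarInput) {
      if (!IsCharged(track)) continue;                // FILTER: keep only data of interest
      lanes.push_back(track);                         // GATHER: accumulate up to the SIMD width
      if (lanes.size() == kVectorSize) {
        ProcessVector(lanes);                         // VECTORIZE: run the vector-aware algorithm
        for (auto *t : lanes) PushOutput(t);          // SCATTER: push results back to the scalar flow
        lanes.clear();
      }
    }
    // Leftover tracks (less than a full vector) would be processed in scalar mode.
  }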

- DESIGN -

Once the main components of the vector algorithm handler were identified (Filter, Gather, Vectorize, and Scatter data), the VectorFlow interface was designed to provide the user with a friendly API that helps integrate vectorizable components into any data flow working on a given data type (state):

WORK interface - Base class that provides interfaces to scalar and vector work (a simplified sketch follows the list below).

  • Templated implementation: custom Data (used as input) & custom container of Data*.
  • Virtual methods to execute the scalar and/or vector version of a given algorithm.
  • A Work can have clients, defined as pointers to external containers.
    • The Execute method implementations may move the data to other clients after processing (Dispatch).
  • A Work will usually take a container of states as input and either:
    • loop over them and call the scalar Execute, or
    • push the container to the vector Execute, which gathers the states into the vectorized algorithm.
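
A simplified sketch of what such a base class could look like; the names and signatures below are approximations for illustration, not the actual VectorFlow declarations:

  // Approximate illustration of the Work interface described above; the actual
  // VectorFlow declarations may differ in names and signatures.
  #include <cstddef>
  #include <vector>

  template <typename Data, typename Container = std::vector<Data *>>
  class Work {
  public:
    virtual ~Work() = default;
    virtual void ExecuteScalar(Data *state) = 0;        // scalar version of the algorithm
    virtual void ExecuteVector(Container &states) = 0;  // vector version (gather + SIMD + scatter)
    void AddClient(Container *client) { fClients.push_back(client); }

  protected:
    // Typically called at the end of Execute to move a processed state to a client basket
    void Dispatch(Data *state, std::size_t client) { fClients[client]->push_back(state); }
    std::vector<Container *> fClients;                  // pointers to external containers
  };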

FLOW interface - Base class that provides support for the different types of flows.

  • Add each stage to the workflow via AddWork.
  • Set each stage to be executed either in scalar or vector mode via SetVectorMode.
  • Two flows were proposed to support different variants of simulation workflows: the PIPELINE flow & the COMPLEX flow.

PIPELINE flow - (Flow derived class) Sequence of work tasks altering the same state.

  • Add the Work objects in sequence via Flow::AddWork.
  • Add the data via AddData.
  • Mark the work to be done in vector mode via Flow::SetVectorMode(stage, flag).
  • Execute a given stage or the full sequence (see the usage sketch below).
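
As a usage illustration (StageA, StageB and the tracks container are hypothetical placeholders; AddWork, AddData, SetVectorMode and Execute are the interface methods described above):

  // Hypothetical usage sketch of a pipeline, based on the interface described above.
  vectorflow::PipelineFlow<Track> flow;   // assumed class name for the pipeline flow
  StageA stageA;                          // Work-derived tasks (placeholders)
  StageB stageB;
  flow.AddWork(&stageA);                  // stages are executed in the order they are added
  flow.AddWork(&stageB);
  flow.SetVectorMode(1, true);            // run the second stage in vector (SIMD) mode
  for (auto *track : tracks)              // all stages alter the same state
    flow.AddData(track);
  flow.Execute();                         // run the full sequence (or a single stage)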

COMPLEX flow - (Flow derived class) Graph of work tasks with decisional logic on how the processing continues.

  • Holds a list of buffer containers (baskets) for each work stage.
  • Add the Work objects in sequence via Flow::AddWork.
  • Declare the client baskets for each work stage via Work::AddClient.
    • Connections from one task to another are implicitly coded at the end of Work::Execute by calling Work::Dispatch(stage).
  • Add the data via AddData (goes to the first container).
  • Mark the work to be done in vector mode via Flow::SetVectorMode(stage, flag).
  • Processing will call Execute for the full basket of the first stage, then second, and so on.
    • Since states may move back to earlier stages in the sequence, a Flush method needs to be implemented.
  • An example of how the complex flow interface works can be found here; a schematic sketch of the wiring is shown below.
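
Schematically, the wiring could look like this. All names below are hypothetical; in particular, InputBasket is only a placeholder for however the per-stage buffer containers are accessed:

  // Hypothetical wiring of a complex flow, following the steps above.
  vectorflow::ComplexFlow<Track> flow;    // assumed class name for the complex flow
  LayerTask layer0, layer1;               // Work-derived tasks, one basket each (placeholders)
  flow.AddWork(&layer0);
  flow.AddWork(&layer1);
  layer0.AddClient(flow.InputBasket(1));  // layer0 dispatches its output into layer1's basket
  layer1.AddClient(flow.InputBasket(0));  // states may also move back to an earlier stage
  flow.SetVectorMode(0, true);
  flow.SetVectorMode(1, true);
  for (auto *track : tracks) flow.AddData(track); // data goes to the first container
  flow.Execute();                         // processes baskets stage by stage until all buffers are empty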

It is important to mention that different flows can be interconnected. Suppose one work module not only changes the input state, but also produces a new data type, and we then want to run a vectorized algorithm on that data type. In that case, another flow object can be created for the new data type, feeding its data container from the output of the first workflow. Once the first workflow finishes its execution, the user can start the execution of the second flow, and so on.
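
A rough sketch of this chaining, assuming the first workflow produces objects of a hypothetical new type Hit (all names here are placeholders):

  // Hypothetical chaining of two flows, as described above.
  vectorflow::PipelineFlow<Track> trackFlow; // first workflow, operating on tracks
  vectorflow::PipelineFlow<Hit>   hitFlow;   // second workflow, operating on the new data type
  // ... AddWork / SetVectorMode calls for both flows go here ...
  for (auto *track : tracks) trackFlow.AddData(track);
  trackFlow.Execute();                       // produces Hit objects as part of its output
  for (auto *hit : producedHits) hitFlow.AddData(hit); // feed the second flow from the first
  hitFlow.Execute();                         // started only after the first workflow has finished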

- BENEFITS TO THE COMMUNITY -

When could vectorization be a good idea? A natural loop over data of the same type lends itself to vectorization, and repetitive work increases locality, allowing for much better cache coherence. However, without vectorizable inner loops (as in high-energy physics simulations), vectorization becomes difficult to achieve. This is why an interface/service that transforms data into vector format, in order to facilitate SIMD acceleration, is necessary. Performance improvement through vectorization has been proven, but the overhead introduced by the data transformation must remain small compared to the benefits brought by the vectorization.

Also, this framework could be extended to different algorithms that work on scalar data today and could benefit from vector input. That is, in order for an algorithm to take advantage of vectorization, the user should provide the transformation of the scalar data to a vector form based on VecCore types, and then use the VecCore API to provide the vector implementation of the given algorithm (the scalar/vector interface represents a large part of the vectorization effort).

With this, many future opportunities open up, such as performance and concurrency analysis, integration with other frameworks (as the interface is intended to be a smooth transition between scalar and vector data formats), and the extension to different types of research areas and devices.

- MAIN CONTRIBUTIONS & RESULTS -

Among the many tasks carried out, the most important contributions are the following:

Design & implementation of a scalar/vector adapter service.

As already mentioned, the main components necessary to exploit vectorization are: FILTER, GATHER, VECTORIZE & SCATTER. An optimal implementation of each one of them is essential in order to reduce the inherent overhead introduced by the data transformations of the workflow. The use case example studied and later benchmarked was a particle track generator + propagator (a propagator in a magnetic field is a setup very frequently used in particle physics simulations).

  • The generator consists of an event loop that generates particle tracks according to a model called the Cocktail Generator. Part of this contribution includes development in the generator, the event, and the track files. Finally, the complete test can be found here.

  • The propagator consists of keeping a record of the particle track throughout a given geometry, which is integrated from VecGeom. Originally, each particle track is propagated out of a geometry (a sphere, a tube, etc.) once a given magnetic field is applied to it. The test, which propagates a particle track out of a sphere of a given radius, can be found here.

Following this example, the propagator is a very CPU-intensive task that was first developed in a scalar flow (that is, propagating a given particle track through a given geometry ONE at a time). The question here is: can we do better? The answer is YES, it can be improved by taking advantage of the vector registers embedded in the CPU architecture, i.e. SIMD processing!

SIMD stands for Single Instruction Multiple Data and refers to a processor architecture that executes a single instruction over multiple data items at the same time. This means it is possible to propagate multiple particle tracks simultaneously, as the same operation is applied to all the tracks. At this point, the scalar/vector adapter for particle tracks is needed in order to exploit this capability.
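
For instance, with a VecCore vector type a single arithmetic expression updates all lanes, i.e. several tracks, at once. A minimal sketch, with illustrative variable names:

  // One SIMD expression applied to all lanes at once (one track per lane)
  using namespace vectorflow;             // for Double_v
  Double_v momentum_v(10.);               // each lane holds the momentum of one track
  Double_v energyLoss_v(0.5);             // per-lane energy loss
  momentum_v = momentum_v - energyLoss_v; // one instruction updates every lane simultaneously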

  • First, the data needs to be transformed into vector-type format (the vector type content should be minimal and contain only the data fields actually needed by the algorithm, in order to minimize the data copy overheads). Note that the data can previously be filtered, e.g. by propagating only charged particle tracks.
  // Data in vecCore vector types
  using namespace vectorflow; // For vector data types (Double_v)
  using vecCore::Set;
  using vecCore::Get; 
  vecgeom::Vector3D<Double_v> fPos_v;
  vecgeom::Vector3D<Double_v> fDir_v;
  Double_v fCharge_v, fMomentum_v, fNSteps_v;

The length of the vector is completely dependent on the processor architecture. For instance, all the development and benchmarking took place on an AVX2 processor, capable of handling 4 doubles simultaneously. This is because VecCore vector types, such as those aliased in the header vectorFlow/vectorTypes.h (e.g. Double_v), automatically match the host architecture.
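
For instance, the number of lanes actually available can be queried through VecCore (a minimal sketch, assuming the Double_v alias from the header mentioned above):

  // Query the SIMD width matched by the vector types on this build/architecture
  using namespace vectorflow;                                       // for Double_v
  const std::size_t kVecSize = vecCore::VectorSize<Double_v>();     // e.g. 4 doubles on AVX2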

  • Then, the Gather task is in charge of filling the vector lanes with the scalar data. This was implemented by integrating the VecCore SIMD vectorization library.
  // Gather info to fill one SIMD lane
  void Gather(Data *track, const std::size_t lane) {
    Set(fPos_v.x(),  lane, track->PosX());
    Set(fPos_v.y(),  lane, track->PosY());
    Set(fPos_v.z(),  lane, track->PosZ());
    Set(fDir_v.x(),  lane, track->DirX());
    Set(fDir_v.y(),  lane, track->DirY());
    Set(fDir_v.z(),  lane, track->DirZ());
    Set(fCharge_v,   lane, track->Charge());
    Set(fMomentum_v, lane, track->P());
    Set(fNSteps_v,   lane, track->GetNsteps());
  }
  • Finally, once the vectorized method is done, it is necessary to scatter the output back into the scalar flow.
  // Scatter one lane info back to original structure 
  void Scatter(const std::size_t lane, Data *track) {
    track->SetPosition(Get(fPos_v.x(), lane),  Get(fPos_v.y(), lane), Get(fPos_v.z(), lane));
    track->SetDirection(Get(fDir_v.x(), lane), Get(fDir_v.y(), lane), Get(fDir_v.z(), lane));
    track->SetNsteps(Get(fNSteps_v, lane));
  }

The complete adapter implementation can be found here.
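
Putting the pieces together, the vector-aware Execute conceptually loops over the input tracks in chunks of the SIMD width. The sketch below is a schematic reconstruction rather than the actual implementation; kVecSize and DoStepVector are placeholders for the real vectorized step:

  // Schematic driver loop: gather kVecSize tracks, run the SIMD step, scatter back
  const std::size_t kVecSize = vecCore::VectorSize<Double_v>();
  for (std::size_t i = 0; i + kVecSize <= tracks.size(); i += kVecSize) {
    for (std::size_t lane = 0; lane < kVecSize; ++lane)
      Gather(tracks[i + lane], lane);      // fill the SIMD lanes from the scalar tracks
    DoStepVector();                        // placeholder for the vectorized propagation step
    for (std::size_t lane = 0; lane < kVecSize; ++lane)
      Scatter(lane, tracks[i + lane]);     // write the results back to the scalar tracks
  }
  // Any remaining tracks (an incomplete vector) are handled by the scalar implementation.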

This adapter was used to implement vector-aware propagation methods for a vector of tracks, not just a single particle track.

  // PropagateToR methods, for a track and a vector of tracks
  void PropagateToR(double radius, vectorflow::Track &track) const;
  void PropagateToR(double radius, std::vector<vectorflow::Track*> const &tracks) const;

  // PropagateInTube methods, for a track and a vector of tracks
  void PropagateInTube(int layer, vectorflow::Track &track) const;
  void PropagateInTube(int layer, std::vector<vectorflow::Track*> const &tracks) const;

There is no single recipe for vectorizing a method; it depends entirely on the developer and the chosen approach. There can be multiple ways to implement it and also different ways to store the data. In this case, the PropagateToR method propagates the particle track in a sphere of a given radius, and PropagateInTube propagates the track through multiple concentric cylinders. The complete implementation can be found here.

Proof of concept of the VectorFlow workflow implementation & performance analysis.

After all the elements of the scalar/vector adapter were implemented, the next step was to prove the feasibility of the VectorFlow interface, as it is intended to easily integrate scalar and/or vector tasks with any other data processing flow. This adapter serves as the proof of concept for the different VectorFlow workflows: the pipeline flow and the complex flow.

  • The pipeline flow example integrates the vectorized PropagateToR task into the event loop that generates the particle tracks, i.e. generate + propagate. As mentioned in the design section, the pipeline flow shares the same state/basket (data input) among all the added tasks. In this case, the pipeline flow copies pointers to tracks into its own basket for data transformation and processing. Once the computations finish, the output data is scattered back to the original particle structure and the pipeline flow is cleared. The complete implementation can be found here.

  • The complex flow example integrates the vectorized PropagateInTube with 4 layers into the same event loop as in the pipeline flow. Each layer of the tube is a task (each one with its own data input/basket), as each particle track can be propagated to an inner or outer layer within the tube. When this happens, the current task dispatches the content of its basket to the next layer/task, and the complex flow continues its execution as long as there is still data in the buffers. Also, as in the pipeline flow, the vectorized tasks in this flow scatter output data back to the original structures whenever they are ready, thanks to the flexible implementation of the scalar/vector adapter service. The complete implementation can be found here.

The results are promising: the speedup obtained when integrating the vectorized tasks is ~2.0x in the pipeline flow and ~1.8x in the complex flow, despite the overhead introduced in the workflow. These implementations were optimized using the gperftools profiler. An example graph is attached, in which it is possible to recognize the nodes with intensive computations; these are SIMD-vectorized methods such as the DoStep method of the propagator class and the Dot method of the VecGeom Vector3D.

Future work in this area involves continuing to profile and benchmark the VectorFlow interfaces with more elaborate examples, and starting to integrate them with existing simulation software for high-energy physics such as Geant4.

agheata commented Aug 21, 2019

dictate the FUTURE -> improve the performance
in the Design part, you should describe in detail the main framework components (Work, Flow) and their interfaces

agheata commented Aug 22, 2019

  • better data caching -> better instruction caching (data caching is not necessarily as good as in the original scalar flow, since we need to gather from different memory locations).
  • in order to an algorithm to take advantage... -> in order for an algorithm to take advantage of vectorization, the user should provide the transformation of the scalar data to a vector form based on VecCore types, then use the VecCore API to provide the vector implementation of this algorithm

agheata commented Aug 22, 2019

  • We should also mention that different flows can be interconnected (even if we don't have an example showing that...). Suppose one work module not only changes the input state (e.g. tracks), but also produces a new data type, and then we want to run an algorithm vectorized on that data type. Then one can make another flow object for the new data type, feeding its data container from the first workflow. Once the first workflow is finished, the user can start the execution of the second flow, and so on.

agheata commented Aug 22, 2019

  • particle track generator + propagator -> ... propagator in magnetic field, a setup very frequently used in particle physics simulations.
  • the use case -> the use case example
  • "First, the data needs to be transformed into vector type format.". Add after that: "Note that the vector type content should be minimal and contain only the data fields actually needed by the algorithm, to minimize the data copy overheads.

agheata commented Aug 22, 2019

  • "...4 double's data types simultaneously." Add after: "Note that if using the VecCore types such as those aliased in the header vectorFlow/VectorTypes.h (e.g. Double_v), they will automatically match the host architecture.

agheata commented Aug 22, 2019

  • Mention that we have optimized the implementations by profiling with gperftools, and attach one example graph. Mention for the graph which are the leafs with intensive SIMD computation.

agheata commented Aug 22, 2019

The report is in good shape already, but just try to address these observations.
