Instantly share code, notes, and snippets.

@w1th0utnam3 /review.md Secret
Last active Aug 10, 2018

Embed
What would you like to do?
GSoC'18 final review

Final programming phase review

On August 14 the coding phase of Google Summer of Code officially ends. In this post I'll give a brief summary of my work during the summer. This is followed by a more detailed analysis and an outlook section.

Summary

Project goal

The goal of this project was to explore ways to improve the performance of FEniCS using vector-instruction or SIMD capabilities of modern CPUs. The focus was on the assembly stage of the FEM solution process as the solution of linear systems is handled by external libraries. Early, we decided that I should investigate the vectorization of functions that compute element matrices as this is the arithmetically most intensive part in the assembly process. This is not a trivial task as FEniCS (or more specifically its component FFC) generates the corresponding C code from high-level math like UFL code for each problem formulation automatically.

Implementation

The most promising approach turned out to be a "cross-cell" assembly approach. In this approach, the element matrix is computed for multiple cells simultaneously. Vectorization is then performed automatically by the compiler across this cell batch for every arithmetic statement of the cell matrix code. For more details, have a look at the referenced blog posts and the pull request section below.

Results

As shown by benchmarks in a previous blog post this approach lead to a speedup factor between 2x to 6x when looking only at the element matrix computation. Unfortunately, the user does not always profit from this speedup. The global matrix assembly gets more complicated when using the cross-cell approach. In specific cases, global assembly with the cross-cell approach was however still up to 20% faster then non-vectorized code. For more details, have a look at the analysis section below.

Pull requests and commits

Main project

In general, pull requests are better manageable when they cover only few lines of code. Larger projects should usually be broken up into several incremental PRs with a clear scope so that their quality can be judged individually. Due to the rather experimental nature of this GSoC project however, I decided to just use two PRs in order to give a better overview of the changes and additions related to vectorization. Furthermore, it is quite hard to extract single components of the PRs that make sense on their own and can be tested separately.

The first large PR targets FFC-X and adds new functions to its code generation back-end UFLACS to produce cross-cell code. This is code which computes cell matrices for multiple mesh cells at the same time. I tested and benchmarked it intensively which is documented in a previous blog post.

The second large PR implements a prototype version of cross-cell based matrix and vector assembly in DOLFIN-X. This PR depends on the two other PRs listed above. Obviously it requires the cross-cell code generation of FFC-X, otherwise the corresponding assembly makes no sense. Furthermore, a change in compilation parameters with respect to memory alignment in DOLFIN is required. Vectorized code (more specifically AVX instructions) have restrictions on the alignment of memory they operate on. For more details have a look at the implementation and analysis section below.

Both PRs contain a couple of unit tests that I implemented recently. They compare results of the vectorized code with the unmodified, non-vectorized code.

Smaller commits

During the summer I also made a couple of smaller commits and helped with issues that were not directly related to my project.

Implementation details and performance analysis

Cross-cell code generation

The implementation details of the changes to FFC-X to generate cross-cell UFC (Unified Form-assembly Code) is already covered by previous blog posts and I don't want to repeat everything here. Please have a look at the blog posts

Furthermore, all code can be found in the FFC-X pull request mentioned above. Most of the implementation resides in the file ffc/uflacs/ast_transformations.py.

Bilinear forms As previously stated, experiments with the generated code showed speedup of 2x to 6x for bilinear forms and a speedup between 3x and 4x with linear forms (created using the action operator) when computing cell matrices without matrix assembly. Some analysis of the speedup factors shown in the plot below were already presented in the referenced blog posts. Speedup of bilinear forms The posts already showed that the speedup of bilinear forms varies strongly between problems and there are individual problems (like Stokes flow) that show hardly any benefit from vectorization. During the project I did not manage to complete a more in-depth analysis of these cases which is required to potentially improve the performance for them. Therefore, I added this to the outlook/to-do list in the end of this review.

Linear forms In case of the action of bilinear forms, the speedup results are a lot more consistent: Speedup of linear forms

Alignment and DOLFIN compilation flags

Background for PR "Control Eigen alignment for compatibility with AVX user code"

Vectorization (performed by the compiler or manually by the programmer) relies on extended CPU instruction sets that go beyond the basic x86-64 instructions. An example for these instruction sets is AVX2 which is available on all modern Intel CPUs. AVX2 supporting CPUs have vector registers than can store 256 bit. Corresponding instructions can operate on e.g. 4 double values simultaneously. As a source for the respective vectorized load operations, a default double array can be used. There is a slight catch however: the instruction sets require the memory to be "aligned" which boils down to the address being a multiple of the alignment value. For example, in case of AVX2 an alignment of 32 bytes is required. Disregarding this requirements results almost definitely in crashes.

In practice, this requirement gets introduced in the cross-cell approach, when the FFC generated cell matrix code gets JIT (just in time) compiled with with an e.g. -march=avx2 flag. A flag like this is required as GCC does not emit any non-x86-64 instructions by default. Consequently no vectorized code would be produced without this flag.

The FFC generated code uses in/out pointers to receive and store data. The alignment requirements extend to all underlying memory locations. In practice, the memory gets allocated by DOLFIN in the assembly loop. DOLFIN uses Eigen matrices for most of internal linear algebra operations (independent from global matrices that are provided from PETSc). Unfortunately Eigen does not support dynamic specification of alignment at runtime (which would be required for a clean solution when interacting with Python user code). Furthermore, we did not want to introduce any self-written data structures or custom memory management. The only option is to use a single alignment value for all Eigen matrices that gets fixed at DOLFIN compile time.

There are two ways how the alignment can be controlled at compile time:

  1. The EIGEN_MAX_ALIGN_BYTES=... preprocessor directive. This directive ensures that Eigen will use at least an alignment of the specified number of bytes.
  2. Setting a CMAKE_CXX_FLAGS="-march=native" environment variable during DOLFIN compilation. Eigen automatically detects the required alignment through compiler macros like __AVX__. Eigen will overwrite any less strict alignments specified by EIGEN_MAX_ALIGN_BYTES in this case.

The alignment that is actually selected by Eigen is stored in the EIGEN_DEFAULT_ALIGN_BYTES definition.

The PR uses a default value of EIGEN_MAX_ALIGN_BYTES=32 as AVX2 is basically standard nowadays. A stricter default value like 64 bytes (required for AVX-512) does not seem reasonable at the moment. If a user wants to use AVX-512 for cross-cell assembly, DOLFIN has to be compiled with an appropriately changed parameter.

Cross-cell assembly

Apart from outliers, the performance of the vectorized code generated by FFC looked very promising. In the beginning of this review however, I already hinted that the matrix assembly process itself is responsible for swallowing a huge part of the achieved speedup. The assembly component of the cross-cell approach was not covered in previous blog posts. Therefore, I'll continue with a walk-through in this report that highlights problematic areas. Most of the code changes in DOLFIN are located in the Assembler (cpp/dolfin/fem/Assembler.cpp ) and Form (cpp/dolfin/fem/Form.cpp ) classes.

UFC interface During assembly, DOLFIN has to know how many cells the generated cell matrix code expects. Therefore, the FFC changes include an update of the UFC interface definition in ffc/backends/ufc/ufc.h. Every ufc_*_integral struct that stores function pointers to the corresponding FFC generated tabulate_tensor receives an additional cell_batch_size member.

This member can be accessed from the Assembler through newly added functions Form::cell_batch_size() and FormIntegrals::cell_batch_size().

Assembly algorithm The cross-cell assembly loop is implemented in different methods of the Assembler class:

Each of these functions switches between the new cross-cell implementation and the original implementation depending on the cell_batch_size:

const unsigned int cell_batch_size = form.cell_batch_size();
if (cell_batch_size > 1) 
{
  // new cross-cell assembly code
}
else
{
  // original assembly code
}

I'll continue with an exemplary walk-through with the cross-cell branch of the assemble_matrix function.

The high-level steps of the assembly are the following:

  1. Allocate memory Ae for the whole cell matrix and Ae_cell that can store the cell matrix of a single cell
  2. Gather references to the next cell_batch_size cells from the mesh in a vector cell_batch. More specifically we have to store copies of the mesh::Cell objects because they are just proxy objects created when dereferencing the mesh iterator. If the number of mesh cells is not divisible by cell_batch_size, fill the last batch with dummy cells (e.g. copy of the last real cell).
  3. For each of the cells obtain the coordinates of their degrees of freedom (Mesh::cell::get_coordinate_dofs). This is currently not vectorized and the obtained values are stored consecutively in a vector coordinate_dofs_batch.
  4. Call Form::tabulate_tensor_batch(Ae, cell_batch, coordinate_dofs_batch). More details below.
  5. Loop over the non-dummy cells of the batch and:
    1. Obtain the dof-maps of the corresponding cell
    2. Copy all values belonging to the current cell from Ae to Ae_cell. This "de-interleaves" the data. The following steps are identical to the non-vectorized version.
    3. Apply dirichlet boundary conditions to Ae_cell.
    4. Use the dof-maps to add Ae_cell to the global matrix using Form::add_local.

Why is an additional copy from Ae to Ae_cell involved? As stated above, Ae is interleaved with respect to the cells of the batch. This means that it is essential a rank 3 tensor with dimensions (dofmap1.size(), dofmap2.size(), cell_batch_size) stored in C row-major order. The Form::add_local function for matrix insertion uses PETSc's MatSetValues function which expects raw pointers as arguments. Therefore, we cannot use something like Eigen::Map as a strided view on the individual cells. Furthermore, it is not possible to just make a joined, cell-interleaved version of the dof-maps: PETSc expects the values passed to MatSetValues to be a dense 2D array. If this 2D array is of dimensions (n,m), the corresponding dof-maps also have to be of dimensions (n) and (m) respectively. This is fundamentally incompatible with the "3D nature" of the interleaved cell matrix Ae. The only remaining option is to "transpose" the cell_batch_size dimension of Ae to the front, followed by cell-wise matrix insertion for each 2D matrix in the remaining dimensions. This is implemented by copying all values of a single cell from Ae to a second array Ae_cell.

tabulate_tensor batched evaluation The Assembler calls Form::tabulate_tensor_batch which is the component of DOLFIN that actually calls into the compiled code that was generated by FFC. After getting called from DOLFIN as Form::tabulate_tensor_batch(Ae, cell_batch, coordinate_dofs_batch), the method performs the following steps:

  1. For each cell in cell_batch, perform "coefficient restriction". This is the same process for each coefficient of a form:
    1. Call the Function::restrict/Expression::restrict methods with the current cell and its dof coordinates. The result is written to a temporary array for each cell.
    2. Copy the content of the temporary array into a joined, cell-interleaved array _w.
  2. For each cell in cell_batch, copy the dof coordinates from the individual arrays in coordinate_dofs_batch into a joined, cell-interleaved array _coord_dofs.
  3. Call the function pointer of the vectorized tabulate_tensor function generated by FFC with the interleaved arrays with tabulate_tensor(Ae, _w, _coord_dofs)

Why are the additional copies for interleaving required? Currently, it is not possible to perform coefficient restriction in some kind of cross-cell/batched mode. Coefficients can be user defined math Expression objects or finite element based Function objects. In both cases it is not trivial to somehow directly write into strided memory. Both rely on raw scalar pointers to return their values. It might be possible to change this. However, I did not investigate this extensively.

Performance My GSoC working repository contains some benchmarking code that I used for testing. For cross-cell assembly the relevant script is assembly_bench.py. I compared the default non-batched assembly with the cross-cell assembly case optimized for my processor (i7 8700K). I used the following problems for testing:

  1. A Poisson problem with second order Lagrange tetrahedral elements with a varying coefficient defined using linear elements
  2. A hyperelasticity problem with linear Lagrange tetrahedral elements
  3. The action of the bilinear form of the Poisson problem
  4. The action of the bilinear form of the hyperelasticity

The results were quite underwhelming, speedup in percent of the default assembly:

  1. Poisson: 21%
  2. Hyperelasticity: 1%
  3. Action of Poisson: 0%
  4. Action of the Hyperelasticity bilinear form: 0%

Especially for the linear forms the results were quite surprising. We expected a much higher speedup in comparison to the bilinear forms due to the significantly lower assembly overhead. I did not find a satisfying explanation for this behavior yet.

Missing functionality

  • The Assembler class is under heavy development as part of the DOLFIN-X project. Therefore, the current cross-cell assembly can only be considered a prototype and has to be kept in sync with the changes to the default non-batched assembly. As the Assembler does not implement non-cell type integrals (e.g. facet or vertex integrals) this isn't tested with the cross-cell approach either. This includes proper testing of the code generated by FFC.
  • The prototype assembly algorithm currently does not respect subdomains correctly. This means that it assumes that all cells in a batch share the same subdomain (otherwise a runtime exception is thrown). Furthermore, the Form::cell_batch_size() would require a mesh::Cell as parameter to return the correct batch size corresponding to the subdomain. Otherwise it would be required that all subdomains share the same batch size.

Usage instructions

The following sections give an introduction on how to actually enable and try out the cross-cell vectorization features in FFC-X and DOLFIN-X.

Requirements

The GSoC project is based on the FEniCS-X project and not on the stable release version. Respectively, a working installation of FEniCS-X is required to try out the vectorization capabilities. Furthermore, the large project PRs mentioned above are still under review and not merged, yet. Therefore, you have to check out the corresponding branches, namely

Additionally, DOLFIN-X has to be compiled with an alignment or architecture flag as explained in the previous implementation and analysis section. For AVX2 supporting CPUs however, this is already ensured by the default CMake options.

As an alternative to a manual installtion, a pre-configured Docker container image is available on Quay.io. It can be pulled using

docker pull quay.io/w1th0utnam3/simd-work:latest

Enabling vectorization

Cross-cell assembly with DOLFIN A full example for enabling cross-cell assembly when using DOLFIN from Python can be found here.

Calling FFC directly To generate cross-cell vectorized UFC from UFL files the following parameters are relevant:

  • cell_batch_size: FFC generates cell matrix computation functions that compute the cell integral for this number of cells simultaneously. Should match the number of doubles that fit into one vector register of the CPU target architecture. Default value: 1
  • alignas: The memory alignment required by the CPU for vector instructions. For AVX2 this would be 32 bytes, for AVX-512 this would be 64 bytes. Default value: 32
  • enable_cross_element_gcc_ext: Whether to use GCC's vector extensions for vectorization. Should be enabled when using GCC as a compiler for better performance. Default value: 0

An example call to FFC looks like this:

python3 -m ffc -f cell_batch_size 4 -u enable_cross_element_gcc_ext 1 -u alignas 32 Poisson.ufl

To actually get a performance benefit of the compiled code, the following compiler flags (in case of GCC) are relevant

  • -ftree-vectorize: Enable GCC's auto vectorizer
  • -march=...: Specify the target CPU architecture, has to be set to native, avx2, skylake,... etc. to allow GCC to emit any kind of vector instructions
  • -mtune=...: Perform additional target architecture dependent optimizations

Outlook

As noted above, the cross-cell assembly in DOLFIN-X can only be considered a prototype. The implementation of the Assembler will change a lot before the release. Maybe this allows making some additional changes to make cross-cell assembly more efficient. At the moment however, the results are a little bit disappointing and it has to be decided in the development team whether we want to continue following this approach. In any case I would enjoy to working with the team even if we put the vectorization project on hold for the moment.

Apart from the low speedup there are some additional remaining points that were mentioned above:

  • Correct subdomain treatment in cross-cell assembly (DOLFIN)
  • Support for non-cell integrals (FFC & DOLFIN)
  • In-depth analysis of large differences in speedup of FFC generated code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment