On August 14 the coding phase of Google Summer of Code officially ends. In this post I'll give a brief summary of my work during the summer. This is followed by a more detailed analysis and an outlook section.
The goal of this project was to explore ways to improve the performance of FEniCS using the vector-instruction (SIMD) capabilities of modern CPUs. The focus was on the assembly stage of the FEM solution process, as the solution of linear systems is handled by external libraries. Early on, we decided that I should investigate the vectorization of the functions that compute element matrices, as this is the most arithmetically intensive part of the assembly process. This is not a trivial task, as FEniCS (or more specifically its component FFC) automatically generates the corresponding C code from a high-level mathematical problem formulation expressed in UFL.
The most promising approach turned out to be a "cross-cell" assembly approach. In this approach, the element matrix is computed for multiple cells simultaneously. Vectorization is then performed automatically by the compiler across this cell batch for every arithmetic statement of the cell matrix code. For more details, have a look at the referenced blog posts and the pull request section below.
As shown by benchmarks in a previous blog post, this approach led to a speedup factor between 2x and 6x when looking only at the element matrix computation. Unfortunately, the user does not always profit from this speedup, as the global matrix assembly gets more complicated when using the cross-cell approach. In specific cases, however, global assembly with the cross-cell approach was still up to 20% faster than non-vectorized code. For more details, have a look at the analysis section below.
- FFC-X: "Implementing cross-cell code generation in UFLACS"
- DOLFIN-X: "Implementing cross-cell integral assembly"
- DOLFIN-X: "Control Eigen alignment for compatibility with AVX user code"
In general, pull requests are easier to manage when they cover only a few lines of code. Larger projects should usually be broken up into several incremental PRs with a clear scope so that their quality can be judged individually. Due to the rather experimental nature of this GSoC project, however, I decided to use just two PRs in order to give a better overview of the changes and additions related to vectorization. Furthermore, it is quite hard to extract single components of the PRs that make sense on their own and can be tested separately.
The first large PR targets FFC-X and adds new functions to its code generation back-end UFLACS to produce cross-cell code, i.e. code which computes cell matrices for multiple mesh cells at the same time. I tested and benchmarked it intensively, which is documented in a previous blog post.
The second large PR implements a prototype version of cross-cell matrix and vector assembly in DOLFIN-X. This PR depends on the two other PRs listed above. Obviously, it requires the cross-cell code generation of FFC-X, otherwise the corresponding assembly makes no sense. Furthermore, a change in compilation parameters with respect to memory alignment in DOLFIN is required: vectorized code (more specifically, AVX instructions) has restrictions on the alignment of the memory it operates on. For more details have a look at the implementation and analysis section below.
Both PRs contain a couple of unit tests that I implemented recently. They compare results of the vectorized code with the unmodified, non-vectorized code.
During the summer I also made a couple of smaller commits and helped with issues that were not directly related to my project.
- FFC-X: Several code cleanup commits (removal of dead code)
- DOLFIN-X: Other commits
The implementation details of the changes to FFC-X to generate cross-cell UFC (Unified Form-assembly Code) are already covered by previous blog posts and I don't want to repeat everything here. Please have a look at the blog posts. Furthermore, all code can be found in the FFC-X pull request mentioned above. Most of the implementation resides in the file `ffc/uflacs/ast_transformations.py`.
Bilinear forms

As previously stated, experiments with the generated code showed a speedup of 2x to 6x for bilinear forms and a speedup between 3x and 4x for linear forms (created using the `action` operator) when computing cell matrices without matrix assembly. Some analysis of the speedup factors shown in the plot below was already presented in the referenced blog posts.
The posts already showed that the speedup of bilinear forms varies strongly between problems, and there are individual problems (like Stokes flow) that show hardly any benefit from vectorization. During the project I did not manage to complete a more in-depth analysis of these cases, which would be required to potentially improve their performance. Therefore, I added this to the outlook/to-do list at the end of this review.
Linear forms

In the case of the `action` of bilinear forms, the speedup results are a lot more consistent:
Background for PR "Control Eigen alignment for compatibility with AVX user code"
Vectorization (performed by the compiler or manually by the programmer) relies on extended CPU instruction sets that go beyond the basic x86-64 instructions. An example of these instruction sets is AVX2, which is available on all modern Intel CPUs. CPUs supporting AVX2 have vector registers that can store 256 bits, and the corresponding instructions can operate on e.g. four `double` values simultaneously. As a source for the respective vectorized load operations, a plain `double` array can be used. There is a slight catch, however: the instruction sets require the memory to be "aligned", which boils down to the address being a multiple of the alignment value. For example, AVX2 requires an alignment of 32 bytes. Disregarding this requirement almost certainly results in crashes.
In practice, this requirement gets introduced in the cross-cell approach when the FFC-generated cell matrix code gets JIT (just-in-time) compiled with e.g. an `-mavx2` flag. A flag like this is required because GCC does not emit any non-x86-64 instructions by default; consequently, no vectorized code would be produced without it.
The FFC-generated code uses in/out pointers to receive and store data, and the alignment requirements extend to all underlying memory locations. In practice, the memory gets allocated by DOLFIN in the assembly loop. DOLFIN uses Eigen matrices for most of its internal linear algebra operations (independent of the global matrices that are provided by PETSc). Unfortunately, Eigen does not support dynamic specification of alignment at runtime (which would be required for a clean solution when interacting with Python user code). Furthermore, we did not want to introduce any self-written data structures or custom memory management. The only remaining option is to use a single alignment value for all Eigen matrices that is fixed at DOLFIN compile time.
There are two ways the alignment can be controlled at compile time:
- The `EIGEN_MAX_ALIGN_BYTES=...` preprocessor directive. This directive ensures that Eigen will use at least an alignment of the specified number of bytes.
- Setting a `CMAKE_CXX_FLAGS="-march=native"` variable during DOLFIN compilation. Eigen automatically detects the required alignment through compiler macros like `__AVX__`. In this case, Eigen will override any less strict alignment specified by `EIGEN_MAX_ALIGN_BYTES`.

The alignment that is actually selected by Eigen is stored in the `EIGEN_DEFAULT_ALIGN_BYTES` definition.
The PR uses a default value of `EIGEN_MAX_ALIGN_BYTES=32`, as AVX2 is basically standard nowadays. A stricter default value like 64 bytes (required for AVX-512) does not seem reasonable at the moment. If a user wants to use AVX-512 for cross-cell assembly, DOLFIN has to be compiled with an appropriately changed parameter.
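For AVX-512, such a configuration could look like the following sketch. The exact option spelling is an assumption; the alignment macro is simply passed through the compiler flags since it is an ordinary preprocessor definition.

```shell
# Sketch: configure DOLFIN with a 64 byte Eigen alignment for AVX-512.
# Passing the macro via CMAKE_CXX_FLAGS is one possible way; adapt to your build setup.
cmake -DCMAKE_CXX_FLAGS="-DEIGEN_MAX_ALIGN_BYTES=64 -march=skylake-avx512" ..
```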
Apart from outliers, the performance of the vectorized code generated by FFC looked very promising. At the beginning of this review, however, I already hinted that the matrix assembly process itself swallows a large part of the achieved speedup. The assembly component of the cross-cell approach was not covered in previous blog posts, so I'll continue with a walk-through that highlights the problematic areas. Most of the code changes in DOLFIN are located in the `Assembler` (`cpp/dolfin/fem/Assembler.cpp`) and `Form` (`cpp/dolfin/fem/Form.cpp`) classes.
UFC interface
During assembly, DOLFIN has to know how many cells the generated cell matrix code expects. Therefore, the FFC changes include an update of the UFC interface definition in `ffc/backends/ufc/ufc.h`. Every `ufc_*_integral` struct that stores function pointers to the corresponding FFC-generated `tabulate_tensor` function receives an additional `cell_batch_size` member. This member can be accessed from the `Assembler` through the newly added functions `Form::cell_batch_size()` and `FormIntegrals::cell_batch_size()`.
Assembly algorithm
The cross-cell assembly loop is implemented in different methods of the `Assembler` class:
- `static Assembler::assemble_matrix`: assembles a matrix/bilinear form with zeros for Dirichlet boundary conditions
- `static Assembler::assemble`: assembles a vector/linear form
- `static Assembler::apply_bc`: applies Dirichlet boundary values to a vector

Each of these functions switches between the new cross-cell implementation and the original implementation depending on the `cell_batch_size`:
```cpp
const unsigned int cell_batch_size = form.cell_batch_size();
if (cell_batch_size > 1)
{
  // new cross-cell assembly code
}
else
{
  // original assembly code
}
```
I'll continue with an exemplary walk-through of the cross-cell branch of the `assemble_matrix` function.
The high-level steps of the assembly are the following:
- Allocate memory `Ae` for the whole batched cell matrix and `Ae_cell` that can store the cell matrix of a single cell.
- Gather references to the next `cell_batch_size` cells from the mesh in a vector `cell_batch`. More specifically, copies of the `mesh::Cell` objects have to be stored because they are just proxy objects created when dereferencing the mesh iterator. If the number of mesh cells is not divisible by `cell_batch_size`, the last batch is filled with dummy cells (e.g. copies of the last real cell).
- For each of the cells, obtain the coordinates of its degrees of freedom (`Mesh::cell::get_coordinate_dofs`). This is currently not vectorized and the obtained values are stored consecutively in a vector `coordinate_dofs_batch`.
- Call `Form::tabulate_tensor_batch(Ae, cell_batch, coordinate_dofs_batch)`. More details below.
- Loop over the non-dummy cells of the batch and:
  - Obtain the dof-maps of the corresponding cell.
  - Copy all values belonging to the current cell from `Ae` to `Ae_cell`. This "de-interleaves" the data. The following steps are identical to the non-vectorized version.
  - Apply Dirichlet boundary conditions to `Ae_cell`.
  - Use the dof-maps to add `Ae_cell` to the global matrix using `Form::add_local`.
Why is an additional copy from `Ae` to `Ae_cell` involved? As stated above, `Ae` is interleaved with respect to the cells of the batch. This means that it is essentially a rank 3 tensor with dimensions `(dofmap1.size(), dofmap2.size(), cell_batch_size)` stored in C row-major order. The `Form::add_local` function for matrix insertion uses PETSc's `MatSetValues` function, which expects raw pointers as arguments. Therefore, we cannot use something like `Eigen::Map` as a strided view on the individual cells. Furthermore, it is not possible to just create a joined, cell-interleaved version of the dof-maps: PETSc expects the values passed to `MatSetValues` to be a dense 2D array. If this 2D array is of dimensions `(n,m)`, the corresponding dof-maps also have to be of dimensions `(n)` and `(m)` respectively. This is fundamentally incompatible with the "3D nature" of the interleaved cell matrix `Ae`. The only remaining option is to "transpose" the `cell_batch_size` dimension of `Ae` to the front, followed by cell-wise matrix insertion of each 2D matrix in the remaining dimensions. This is implemented by copying all values of a single cell from `Ae` to a second array `Ae_cell`.
tabulate_tensor batched evaluation

The `Assembler` calls `Form::tabulate_tensor_batch`, which is the component of DOLFIN that actually calls into the compiled code generated by FFC. After getting called from DOLFIN as `Form::tabulate_tensor_batch(Ae, cell_batch, coordinate_dofs_batch)`, the method performs the following steps:
- For each cell in `cell_batch`, perform "coefficient restriction". This is the same process for each coefficient of a form:
  - Call the `Function::restrict`/`Expression::restrict` methods with the current cell and its dof coordinates. The result is written to a temporary array for each cell.
  - Copy the content of the temporary array into a joined, cell-interleaved array `_w`.
- For each cell in `cell_batch`, copy the dof coordinates from the individual arrays in `coordinate_dofs_batch` into a joined, cell-interleaved array `_coord_dofs`.
- Call the function pointer of the vectorized `tabulate_tensor` function generated by FFC with the interleaved arrays: `tabulate_tensor(Ae, _w, _coord_dofs)`.
Why are the additional copies for interleaving required? Currently, it is not possible to perform coefficient restriction in some kind of cross-cell/batched mode. Coefficients can be user-defined math `Expression` objects or finite element based `Function` objects. In both cases it is not trivial to write directly into strided memory, as both rely on raw scalar pointers to return their values. It might be possible to change this; however, I did not investigate this extensively.
Performance
My GSoC working repository contains some benchmarking code that I used for testing. For cross-cell assembly the relevant script is `assembly_bench.py`. I compared the default non-batched assembly with the cross-cell assembly optimized for my processor (i7-8700K). I used the following problems for testing:
- A Poisson problem with second order Lagrange tetrahedral elements with a varying coefficient defined using linear elements
- A hyperelasticity problem with linear Lagrange tetrahedral elements
- The action of the bilinear form of the Poisson problem
- The action of the bilinear form of the hyperelasticity problem
The results were quite underwhelming; speedup relative to the default assembly:
- Poisson: 21%
- Hyperelasticity: 1%
- Action of Poisson: 0%
- Action of the Hyperelasticity bilinear form: 0%
Especially for the linear forms, the results were quite surprising. We expected a much higher speedup in comparison to the bilinear forms due to the significantly lower assembly overhead. I have not yet found a satisfying explanation for this behavior.
Missing functionality
- The `Assembler` class is under heavy development as part of the DOLFIN-X project. Therefore, the current cross-cell assembly can only be considered a prototype and has to be kept in sync with the changes to the default non-batched assembly. As the `Assembler` does not implement non-cell type integrals (e.g. facet or vertex integrals), these are not tested with the cross-cell approach either. This includes proper testing of the code generated by FFC.
- The prototype assembly algorithm currently does not treat subdomains correctly: it assumes that all cells in a batch share the same subdomain (otherwise a runtime exception is thrown). Furthermore, `Form::cell_batch_size()` would require a `mesh::Cell` parameter to return the correct batch size corresponding to the subdomain; otherwise, all subdomains would be required to share the same batch size.
The following sections describe how to actually enable and try out the cross-cell vectorization features in FFC-X and DOLFIN-X.
The GSoC project is based on the FEniCS-X project and not on the stable release version. Accordingly, a working installation of FEniCS-X is required to try out the vectorization capabilities. Furthermore, the large project PRs mentioned above are still under review and not yet merged. Therefore, you have to check out the corresponding branches, namely
Additionally, DOLFIN-X has to be compiled with an alignment or architecture flag as explained in the previous implementation and analysis section. For AVX2 supporting CPUs however, this is already ensured by the default CMake options.
As an alternative to a manual installation, a pre-configured Docker container image is available on Quay.io. It can be pulled using
docker pull quay.io/w1th0utnam3/simd-work:latest
Cross-cell assembly with DOLFIN

A full example for enabling cross-cell assembly when using DOLFIN from Python can be found here.
Calling FFC directly

To generate cross-cell vectorized UFC from UFL files, the following parameters are relevant:
- `cell_batch_size`: FFC generates cell matrix computation functions that compute the cell integral for this number of cells simultaneously. It should match the number of `double`s that fit into one vector register of the target CPU architecture. Default value: 1
- `alignas`: The memory alignment required by the CPU for vector instructions. For AVX2 this is 32 bytes, for AVX-512 it is 64 bytes. Default value: 32
- `enable_cross_element_gcc_ext`: Whether to use GCC's vector extensions for vectorization. Should be enabled when using GCC as the compiler for better performance. Default value: 0
An example call to FFC looks like this:
python3 -m ffc -f cell_batch_size 4 -u enable_cross_element_gcc_ext 1 -u alignas 32 Poisson.ufl
To actually get a performance benefit from the compiled code, the following compiler flags (in the case of GCC) are relevant:
- `-ftree-vectorize`: enables GCC's auto-vectorizer
- `-march=...`: specifies the target CPU architecture; has to be set to `native`, `haswell`, `skylake`, etc. to allow GCC to emit any kind of vector instructions
- `-mtune=...`: performs additional target-architecture-dependent optimizations
As noted above, the cross-cell assembly in DOLFIN-X can only be considered a prototype. The implementation of the `Assembler` will change a lot before the release. Maybe this will allow some additional changes to make cross-cell assembly more efficient. At the moment, however, the results are a little disappointing, and the development team has to decide whether we want to continue following this approach. In any case, I would enjoy continuing to work with the team even if we put the vectorization project on hold for the moment.
Apart from the low speedup there are some additional remaining points that were mentioned above:
- Correct subdomain treatment in cross-cell assembly (DOLFIN)
- Support for non-cell integrals (FFC & DOLFIN)
- In-depth analysis of large differences in speedup of FFC generated code