On August 14 the coding phase of Google Summer of Code officially ends. In this post I'll give a brief summary of my work during the summer. This is followed by a more detailed analysis and an outlook section.
The goal of this project was to explore ways to improve the performance of FEniCS using the vector-instruction (SIMD) capabilities of modern CPUs. The focus was on the assembly stage of the FEM solution process, as the solution of linear systems is handled by external libraries. Early on, we decided that I should investigate the vectorization of the functions that compute element matrices, as this is the most arithmetically intensive part of the assembly process. This is not a trivial task, as FEniCS (or more specifically its component FFC) automatically generates the corresponding C code from a high-level mathematical problem formulation expressed in UFL.
The most promising approach turned out to be a "cross-cell" assembly approach. In this approach, the element matrix is computed for multiple cells simultaneously. Vectorization is then performed automatically by the compiler across this cell batch for every arithmetic statement of the cell matrix code. For more details, have a look at the referenced blog posts and the pull request section below.
As shown by benchmarks in a previous blog post, this approach led to a speedup factor between 2x and 6x when looking only at the element matrix computation. Unfortunately, the user does not always profit from this speedup, as the global matrix assembly gets more complicated when using the cross-cell approach. In specific cases, however, global assembly with the cross-cell approach was still up to 20% faster than non-vectorized code. For more details, have a look at the analysis section below.
- FFC-X: "Implementing cross-cell code generation in UFLACS"
- DOLFIN-X: "Implementing cross-cell integral assembly"
- DOLFIN-X: "Control Eigen alignment for compatibility with AVX user code"
In general, pull requests are easier to manage when they cover only a few lines of code. Larger projects should usually be broken up into several incremental PRs with a clear scope so that their quality can be judged individually. Due to the rather experimental nature of this GSoC project, however, I decided to use just two PRs in order to give a better overview of the changes and additions related to vectorization. Furthermore, it is quite hard to extract single components of the PRs that make sense on their own and can be tested separately.
The first large PR targets FFC-X and adds new functions to its code generation back-end UFLACS to produce cross-cell code, i.e. code which computes cell matrices for multiple mesh cells at the same time. I tested and benchmarked it intensively, which is documented in a previous blog post.
The second large PR implements a prototype version of cross-cell matrix and vector assembly in DOLFIN-X. This PR depends on the two other PRs listed above. Obviously, it requires the cross-cell code generation of FFC-X, otherwise the corresponding assembly makes no sense. Furthermore, a change in compilation parameters with respect to memory alignment in DOLFIN is required: vectorized code (more specifically, AVX instructions) has restrictions on the alignment of the memory it operates on. For more details have a look at the implementation and analysis section below.
Both PRs contain a couple of unit tests that I implemented recently. They compare results of the vectorized code with the unmodified, non-vectorized code.
During the summer I also made a couple of smaller commits and helped with issues that were not directly related to my project.
- FFC-X: Several code cleanup commits (removal of dead code)
- DOLFIN-X: Other commits
The implementation details of the changes to FFC-X to generate cross-cell UFC (Unified Form-assembly Code) are already covered by previous blog posts and I don't want to repeat everything here. Please have a look at the blog posts. Furthermore, all code can be found in the FFC-X pull request mentioned above. Most of the implementation resides in the file `ffc/uflacs/ast_transformations.py`.
Bilinear forms

As previously stated, experiments with the generated code showed a speedup of 2x to 6x for bilinear forms and a speedup between 3x and 4x for linear forms (created using the `action` operator) when computing cell matrices without matrix assembly. Some analysis of the speedup factors shown in the plot below was already presented in the referenced blog posts.
The posts already showed that the speedup of bilinear forms varies strongly between problems, and there are individual problems (like Stokes flow) that show hardly any benefit from vectorization. During the project I did not manage to complete a more in-depth analysis of these cases, which would be required to potentially improve their performance. Therefore, I added this to the outlook/to-do list at the end of this review.
Linear forms

In the case of the `action` of bilinear forms, the speedup results are a lot more consistent:
Background for PR "Control Eigen alignment for compatibility with AVX user code"
Vectorization (performed by the compiler or manually by the programmer) relies on extended CPU instruction sets that go beyond the basic x86-64 instructions. An example of these instruction sets is AVX2, which is available on all modern Intel CPUs. CPUs supporting AVX2 have vector registers that can store 256 bits, and the corresponding instructions can operate on e.g. four `double` values simultaneously. As a source for the respective vectorized load operations, a plain `double` array can be used. There is a slight catch, however: the instruction sets require the memory to be "aligned", which boils down to the address being a multiple of the alignment value. For example, AVX2 requires an alignment of 32 bytes. Disregarding this requirement almost certainly results in crashes.
In practice, this requirement gets introduced in the cross-cell approach when the FFC-generated cell matrix code gets JIT (just-in-time) compiled with e.g. an `-mavx2` flag. A flag like this is required because GCC does not emit any non-x86-64 instructions by default; consequently, no vectorized code would be produced without it.
The FFC-generated code uses in/out pointers to receive and store data, and the alignment requirements extend to all underlying memory locations. In practice, the memory gets allocated by DOLFIN in the assembly loop. DOLFIN uses Eigen matrices for most of its internal linear algebra operations (independent of the global matrices that are provided by PETSc). Unfortunately, Eigen does not support dynamic specification of alignment at runtime (which would be required for a clean solution when interacting with Python user code). Furthermore, we did not want to introduce any self-written data structures or custom memory management. The only remaining option is to use a single alignment value for all Eigen matrices that is fixed at DOLFIN compile time.
There are two ways the alignment can be controlled at compile time:
- The `EIGEN_MAX_ALIGN_BYTES=...` preprocessor directive. This directive ensures that Eigen will use at least an alignment of the specified number of bytes.
- Setting a `CMAKE_CXX_FLAGS="-march=native"` variable during DOLFIN compilation. Eigen automatically detects the required alignment through compiler macros like `__AVX__`. In this case, Eigen will override any less strict alignment specified by `EIGEN_MAX_ALIGN_BYTES`.

The alignment that is actually selected by Eigen is stored in the `EIGEN_DEFAULT_ALIGN_BYTES` definition.
The PR uses a default value of `EIGEN_MAX_ALIGN_BYTES=32`, as AVX2 is basically standard nowadays. A stricter default value like 64 bytes (required for AVX-512) does not seem reasonable at the moment. If a user wants to use AVX-512 for cross-cell assembly, DOLFIN has to be compiled with an appropriately changed parameter.
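For AVX-512, such a configuration could look like the following sketch. The exact option spelling is an assumption; the alignment macro is simply passed through the compiler flags since it is an ordinary preprocessor definition.

```shell
# Sketch: configure DOLFIN with a 64 byte Eigen alignment for AVX-512.
# Passing the macro via CMAKE_CXX_FLAGS is one possible way; adapt to your build setup.
cmake -DCMAKE_CXX_FLAGS="-DEIGEN_MAX_ALIGN_BYTES=64 -march=skylake-avx512" ..
```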
Apart from outliers, the performance of the vectorized code generated by FFC looked very promising. At the beginning of this review, however, I already hinted that the matrix assembly process itself swallows a large part of the achieved speedup. The assembly component of the cross-cell approach was not covered in previous blog posts, so I'll continue with a walk-through that highlights the problematic areas. Most of the code changes in DOLFIN are located in the `Assembler` (`cpp/dolfin/fem/Assembler.cpp`) and `Form` (`cpp/dolfin/fem/Form.cpp`) classes.
UFC interface
During assembly, DOLFIN has to know how many cells the generated cell matrix code expects. Therefore, the FFC changes include an update of the UFC interface definition in `ffc/backends/ufc/ufc.h`. Every `ufc_*_integral` struct that stores function pointers to the corresponding FFC-generated `tabulate_tensor` function receives an additional `cell_batch_size` member. This member can be accessed from the `Assembler` through the newly added functions `Form::cell_batch_size()` and `FormIntegrals::cell_batch_size()`.
Assembly algorithm
The cross-cell assembly loop is implemented in different methods of the `Assembler` class:
- `static Assembler::assemble_matrix`: assembles a matrix/bilinear form with zeros for Dirichlet boundary conditions
- `static Assembler::assemble`: assembles a vector/linear form
- `static Assembler::apply_bc`: applies Dirichlet boundary values to a vector

Each of these functions switches between the new cross-cell implementation and the original implementation depending on the `cell_batch_size`:
```cpp
const unsigned int cell_batch_size = form.cell_batch_size();
if (cell_batch_size > 1)
{
  // new cross-cell assembly code
}
else
{
  // original assembly code
}
```
I'll continue with an exemplary walk-through of the cross-cell branch of the `assemble_matrix` function.
The high-level steps of the assembly are the following:
- Allocate memory `Ae` for the whole batched cell matrix and `Ae_cell` that can store the cell matrix of a single cell.
- Gather references to the next `cell_batch_size` cells from the mesh in a vector `cell_batch`. More specifically, copies of the `mesh::Cell` objects have to be stored because they are just proxy objects created when dereferencing the mesh iterator. If the number of mesh cells is not divisible by `cell_batch_size`, the last batch is filled with dummy cells (e.g. copies of the last real cell).
- For each of the cells, obtain the coordinates of its degrees of freedom (`Mesh::cell::get_coordinate_dofs`). This is currently not vectorized and the obtained values are stored consecutively in a vector `coordinate_dofs_batch`.
- Call `Form::tabulate_tensor_batch(Ae, cell_batch, coordinate_dofs_batch)`. More details below.
- Loop over the non-dummy cells of the batch and:
  - Obtain the dof-maps of the corresponding cell.
  - Copy all values belonging to the current cell from `Ae` to `Ae_cell`. This "de-interleaves" the data. The following steps are identical to the non-vectorized version.
  - Apply Dirichlet boundary conditions to `Ae_cell`.
  - Use the dof-maps to add `Ae_cell` to the global matrix using `Form::add_local`.
Why is an additional copy from `Ae` to `Ae_cell` involved? As stated above, `Ae` is interleaved with respect to the cells of the batch. This means that it is essentially a rank 3 tensor with dimensions `(dofmap1.size(), dofmap2.size(), cell_batch_size)` stored in C row-major order. The `Form::add_local` function for matrix insertion uses PETSc's `MatSetValues` function, which expects raw pointers as arguments. Therefore, we cannot use something like `Eigen::Map` as a strided view on the individual cells. Furthermore, it is not possible to just create a joined, cell-interleaved version of the dof-maps: PETSc expects the values passed to `MatSetValues` to be a dense 2D array. If this 2D array is of dimensions `(n,m)`, the corresponding dof-maps also have to be of dimensions `(n)` and `(m)` respectively. This is fundamentally incompatible with the "3D nature" of the interleaved cell matrix `Ae`. The only remaining option is to "transpose" the `cell_batch_size` dimension of `Ae` to the front, followed by cell-wise matrix insertion of each 2D matrix in the remaining dimensions. This is implemented by copying all values of a single cell from `Ae` to a second array `Ae_cell`.
tabulate_tensor batched evaluation

The `Assembler` calls `Form::tabulate_tensor_batch`, which is the component of DOLFIN that actually calls into the compiled code generated by FFC. After getting called from DOLFIN as `Form::tabulate_tensor_batch(Ae, cell_batch, coordinate_dofs_batch)`, the method performs the following steps:
- For each cell in `cell_batch`, perform "coefficient restriction". This is the same process for each coefficient of a form:
  - Call the `Function::restrict`/`Expression::restrict` methods with the current cell and its dof coordinates. The result is written to a temporary array for each cell.
  - Copy the content of the temporary array into a joined, cell-interleaved array `_w`.
- For each cell in `cell_batch`, copy the dof coordinates from the individual arrays in `coordinate_dofs_batch` into a joined, cell-interleaved array `_coord_dofs`.
- Call the function pointer of the vectorized `tabulate_tensor` function generated by FFC with the interleaved arrays: `tabulate_tensor(Ae, _w, _coord_dofs)`.
Why are the additional copies for interleaving required? Currently, it is not possible to perform coefficient restriction in some kind of cross-cell/batched mode. Coefficients can be user-defined math `Expression` objects or finite element based `Function` objects. In both cases it is not trivial to write directly into strided memory, as both rely on raw scalar pointers to return their values. It might be possible to change this; however, I did not investigate this extensively.
Performance
My GSoC working repository contains some benchmarking code that I used for testing. For cross-cell assembly the relevant script is `assembly_bench.py`. I compared the default non-batched assembly with the cross-cell assembly optimized for my processor (i7-8700K). I used the following problems for testing:
- A Poisson problem with second order Lagrange tetrahedral elements with a varying coefficient defined using linear elements
- A hyperelasticity problem with linear Lagrange tetrahedral elements
- The action of the bilinear form of the Poisson problem
- The action of the bilinear form of the hyperelasticity problem
The results were quite underwhelming; speedup relative to the default assembly:
- Poisson: 21%
- Hyperelasticity: 1%
- Action of Poisson: 0%
- Action of the Hyperelasticity bilinear form: 0%
Especially for the linear forms, the results were quite surprising. We expected a much higher speedup in comparison to the bilinear forms due to the significantly lower assembly overhead. I have not yet found a satisfying explanation for this behavior.
Missing functionality
- The `Assembler` class is under heavy development as part of the DOLFIN-X project. Therefore, the current cross-cell assembly can only be considered a prototype and has to be kept in sync with the changes to the default non-batched assembly. As the `Assembler` does not implement non-cell type integrals (e.g. facet or vertex integrals), these are not tested with the cross-cell approach either. This includes proper testing of the code generated by FFC.
- The prototype assembly algorithm currently does not treat subdomains correctly: it assumes that all cells in a batch share the same subdomain (otherwise a runtime exception is thrown). Furthermore, `Form::cell_batch_size()` would require a `mesh::Cell` parameter to return the correct batch size corresponding to the subdomain; otherwise, all subdomains would be required to share the same batch size.
The following sections describe how to actually enable and try out the cross-cell vectorization features in FFC-X and DOLFIN-X.
The GSoC project is based on the FEniCS-X project and not on the stable release version. Accordingly, a working installation of FEniCS-X is required to try out the vectorization capabilities. Furthermore, the large project PRs mentioned above are still under review and not yet merged. Therefore, you have to check out the corresponding branches, namely
Additionally, DOLFIN-X has to be compiled with an alignment or architecture flag as explained in the previous implementation and analysis section. For AVX2 supporting CPUs however, this is already ensured by the default CMake options.
As an alternative to a manual installation, a pre-configured Docker container image is available on Quay.io. It can be pulled using
docker pull quay.io/w1th0utnam3/simd-work:latest
Cross-cell assembly with DOLFIN

A full example for enabling cross-cell assembly when using DOLFIN from Python can be found here.
Calling FFC directly

To generate cross-cell vectorized UFC from UFL files, the following parameters are relevant:
- `cell_batch_size`: FFC generates cell matrix computation functions that compute the cell integral for this number of cells simultaneously. It should match the number of `double`s that fit into one vector register of the target CPU architecture. Default value: 1
- `alignas`: The memory alignment required by the CPU for vector instructions. For AVX2 this is 32 bytes, for AVX-512 it is 64 bytes. Default value: 32
- `enable_cross_element_gcc_ext`: Whether to use GCC's vector extensions for vectorization. Should be enabled when using GCC as the compiler for better performance. Default value: 0
An example call to FFC looks like this:
python3 -m ffc -f cell_batch_size 4 -u enable_cross_element_gcc_ext 1 -u alignas 32 Poisson.ufl
To actually get a performance benefit from the compiled code, the following compiler flags (in the case of GCC) are relevant:
- `-ftree-vectorize`: enables GCC's auto-vectorizer
- `-march=...`: specifies the target CPU architecture; has to be set to `native`, `haswell`, `skylake`, etc. to allow GCC to emit any kind of vector instructions
- `-mtune=...`: performs additional target-architecture-dependent optimizations
As noted above, the cross-cell assembly in DOLFIN-X can only be considered a prototype. The implementation of the `Assembler` will change a lot before the release. Maybe this will allow some additional changes to make cross-cell assembly more efficient. At the moment, however, the results are a little disappointing, and the development team has to decide whether we want to continue following this approach. In any case, I would enjoy continuing to work with the team even if we put the vectorization project on hold for the moment.
Apart from the low speedup there are some additional remaining points that were mentioned above:
- Correct subdomain treatment in cross-cell assembly (DOLFIN)
- Support for non-cell integrals (FFC & DOLFIN)
- In-depth analysis of large differences in speedup of FFC generated code