w1th0utnam3/gsoc18.md

Maximizing performance on modern architectures with data-level parallelism

Abstract

In the solution process of problems discretized using FEM, the assembly of
element level tensors to the global matrix often contributes a significant
amount of computational time relative to the overall process. The goal of this
GSoC 2018 project is to improve the performance of the FEniCS framework in the
assembly phase. To achieve this, the FEniCS Form Compiler (FFC)
should be enhanced such that the generated code can fully utilize data-level
parallelism (SIMD) functionality of modern CPUs.

Technical Details

In the FEniCS framework, DOLFIN
provides an interface for the user that connects and hides several internal
components. Users formulate their problems consisting of variational form,
function spaces, boundary conditions, etc. in the high level Unified Form Language (UFL).
When the user requests a solution, DOLFIN performs all necessary steps leading
to matrix assembly and calling a linear solver from an external library. During
the assembly stage, DOLFIN loops over all mesh entities and computes the local
tensors using calls to corresponding tabulate_dofs/tabulate_tensor functions
(e.g. evaluating the cell integrals). The C++ code of these functions
is generated by FFC beforehand, during the analysis of the UFL code and subsequently
compiled using dijitso. FFC is
implemented in Python and generates the C++ code in several stages. It starts by
translating the UFL form to an intermediate representation (IR) that is optimized in
subsequent stages before being translated to C++. For the tabulate_tensor
function, the IR is given in form of an abstract syntax tree (AST) implemented
using Python classes that model available C++ expressions/statements.
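
The idea of an AST built from Python classes that model C++ constructs can be illustrated with a minimal sketch. The class names below (`Node`, `Symbol`, `ArrayAccess`, `Product`, `AssignAdd`, `ForRange`) are hypothetical and chosen only for illustration; they are not the classes FFC actually uses.

```python
from dataclasses import dataclass
from typing import List


class Node:
    """Base class of the hypothetical AST; every node prints itself as C++."""

    def to_cpp(self) -> str:
        raise NotImplementedError


@dataclass
class Symbol(Node):
    name: str

    def to_cpp(self) -> str:
        return self.name


@dataclass
class ArrayAccess(Node):
    array: Node
    index: Node

    def to_cpp(self) -> str:
        return f"{self.array.to_cpp()}[{self.index.to_cpp()}]"


@dataclass
class Product(Node):
    lhs: Node
    rhs: Node

    def to_cpp(self) -> str:
        return f"{self.lhs.to_cpp()} * {self.rhs.to_cpp()}"


@dataclass
class AssignAdd(Node):
    lhs: Node
    rhs: Node

    def to_cpp(self) -> str:
        return f"{self.lhs.to_cpp()} += {self.rhs.to_cpp()};"


@dataclass
class ForRange(Node):
    index: Symbol
    begin: int
    end: int
    body: List[Node]

    def to_cpp(self) -> str:
        i = self.index.to_cpp()
        lines = [f"for (int {i} = {self.begin}; {i} < {self.end}; ++{i})", "{"]
        for stmt in self.body:
            # Indent every line of the nested statement by one level.
            lines.extend("    " + line for line in stmt.to_cpp().splitlines())
        lines.append("}")
        return "\n".join(lines)


# Build the tree for a simple accumulation loop and print it as C++.
i = Symbol("i")
loop = ForRange(i, 0, 4, [
    AssignAdd(ArrayAccess(Symbol("A"), i),
              Product(ArrayAccess(Symbol("w"), i), Symbol("detJ"))),
])
code = loop.to_cpp()
print(code)
```

Because the loop is a tree rather than a string, an optimization pass could reorder or merge `ForRange` nodes before any C++ text is emitted.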

As previously stated, the goal of this project is to identify and realize optimization
potential in FFC so that the generated C++ code makes use of SIMD instructions.
On current architectures this is important in order to achieve practically optimal
throughput of arithmetic operations. In particular, this may be done either by adding
explicit SIMD intrinsics that are specific to the third-party C++ compiler in use
or, more generally, by generating code that is friendly to automatic vectorization.
As FEniCS is currently not restricted to a specific compiler, the latter approach is
probably more sensible. However, it should also be considered whether there are more
effective approaches than only improving the local cell/facet/etc. integral code.
Common techniques to help auto-vectorization by the C++ compiler are:

architecture dependent padding and data alignment
loop interchange or permutation
loop unrolling
loop fusion
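
As an illustration of the first technique, the following sketch shows how a code generator might pad the innermost dimension of a local tensor to a whole number of SIMD vectors and request a matching alignment. The helper names `padded_length` and `declare_padded_array` are hypothetical, and the defaults (8-byte doubles, 32-byte vectors as in AVX) are assumptions rather than values taken from FFC.

```python
def padded_length(n: int, itemsize: int = 8, vector_bytes: int = 32) -> int:
    """Round n up to a multiple of the number of SIMD lanes."""
    if vector_bytes % itemsize != 0:
        raise ValueError("vector width must be a multiple of the item size")
    lanes = vector_bytes // itemsize
    # Ceiling division using integer arithmetic only.
    return -(-n // lanes) * lanes


def declare_padded_array(name: str, rows: int, cols: int,
                         itemsize: int = 8, vector_bytes: int = 32,
                         ctype: str = "double") -> str:
    """Emit a C++ declaration whose rows start on vector boundaries."""
    padded_cols = padded_length(cols, itemsize, vector_bytes)
    return f"alignas({vector_bytes}) {ctype} {name}[{rows}][{padded_cols}];"


# A 6x6 element tensor of doubles, padded for 32-byte vectors.
declaration = declare_padded_array("A", 6, 6)
print(declaration)
```

With these assumptions each row holds 8 doubles instead of 6, so every row starts at a 32-byte boundary and the inner loop has a trip count that is a multiple of the vector width. The price is some extra memory and arithmetic on the padding entries.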

Nevertheless, the benefit of adding intrinsics should also be investigated.
Overall, the main tasks of the problem can be summarized as:

Identifying specific application problems (i.e. variational forms and element types) that are bottlenecked by the content of the tabulate_tensor functions, to use them as optimization targets
Examining current solutions in similar projects, e.g. TSFC and COFFEE from the Firedrake project, see "Cross-Loop Optimization of Arithmetic Intensity for Finite Element Local Assembly"
Implementing and testing different approaches while monitoring possible performance improvements on the previously selected problems as well as regressions for simpler problems
Possibly investigating other reasons for bottlenecks in the matrix assembly code

During the project it is sensible to keep in contact with contributors to the Firedrake
project, as similar optimizations have already been performed in their codebase.

Schedule of Deliverables

Community Bonding Period: April 23 - May 13 (3 weeks)

During the community bonding period I would like to discuss my plans with more
members of the team, make them more detailed and get to know the current state of the
code better.
I'm on holiday from April 17 to May 9, doing a round trip through Japan to
visit friends from university. Unfortunately the dates overlap with the bonding
period, as I booked the flights last year so that they would not collide with the
exams I had to take earlier this month. My plan is to make up for the missed time after
coming back and to get to know the project and team better before the
official results of the applications are announced (April 23).

Week -5 (April 9 - 15):

Develop more detailed plans for the project in discussion with mentor (Jack Hale)
Exchange with other team members and experts from Firedrake, etc.


Week -4 to -2 (April 16 - May 6): On holiday
Week -1 (May 7 - 13):

Setup of blog for weekly reports
Setup of development environment
Get a better understanding of which parts of code generation are related to which mathematical concepts (maybe write a blog post that creates a "map" of the relevant code pieces)


Phase 1: May 14 - June 10

The focus of the first phase is to make sure that there is a metric to measure performance
improvements/regressions and to examine how to check whether the compiler generates vectorized
code. Furthermore, first ideas to improve code generation should be implemented.
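
A performance metric of this kind could be as simple as the best wall-clock time over several repetitions, compared between a baseline and an optimized variant. The sketch below uses only the Python standard library; the helpers `best_time` and `speedup` and the stand-in workloads are hypothetical placeholders for real assembly runs.

```python
import time
from typing import Callable


def best_time(func: Callable[[], object], repeats: int = 5) -> float:
    """Return the smallest wall-clock time in seconds over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    # The minimum is the run least disturbed by other processes.
    return min(timings)


def speedup(baseline_seconds: float, optimized_seconds: float) -> float:
    """Values above 1 are improvements, values below 1 are regressions."""
    return baseline_seconds / optimized_seconds


# Stand-in workloads; in practice these would assemble the same form twice,
# once with the current and once with the modified generated code.
baseline = best_time(lambda: sum(x * x for x in range(20000)))
optimized = best_time(lambda: sum(x * x for x in range(10000)))
print(f"speedup: {speedup(baseline, optimized):.2f}")
```

Whether a loop was actually vectorized can be checked separately through compiler reports such as GCC's `-fopt-info-vec` or Clang's `-Rpass=loop-vectorize`.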

Week 1 (May 14 - 20):

Identify problems where the assembly stage is bottlenecked by FFC generated code
Check whether there may already be some optimization potential using the current form of code generation (e.g. with permutations of statements) or whether larger changes are definitely necessary (e.g. changing the way coefficients are stored)


Week 2 & 3 (May 21 - 27, May 28 - June 3):

Set up a benchmark or adapt the currently available regression test benchmarks for this project
Depending on the results of the previous week, try out and evaluate small-scale optimizations


Week 4 (June 4 - June 10):

Prepare phase 1 commit


Phase 2: June 15 - July 13

During the second phase, the performance of the generated code should be improved iteratively
by implementing and combining the common optimization techniques or by trying out new ideas.
This phase mainly consists of alternating between the implementation and the evaluation using
the previously defined benchmarks.

Week 5 to 7 (June 11 - 17, June 18 - 24, June 25 - July 1):

Time for larger refactoring/restructuring ideas


Week 8 (July 2 - 8):

Prepare phase 2 commit


Phase 3 and Final Week: July 13 - August 14

The goal of the last phase is to finalize the implementation of the new features. The focus
should be on fixing bugs, writing unit or regression tests, and cleaning up code and documentation.
The tests and documentation should also be committed to the project.

Week 9 & 10 (July 9 - 15, July 16 - 22):

Evaluate the current performance: "last chance" to look in other directions for optimization
Discuss current state with Firedrake developers as well
If progress is satisfactory: maybe also try out optimizations unrelated to SIMD


Week 11 (July 23 - 29):

Implement integration and regression tests for the FFC enhancements


Week 12 (July 30 - August 5):

No new features in the last weeks; instead, clean up the code and add any missing comments and documentation
Final test runs of implementation


Week 13 (August 6 - August 14):

Spare time for bug fixing
Prepare final merge request of features and tests


Development Experience

In order to get to know the project better, I identified FFC issue #173
as closely related to the area of the project. I tracked the
issue down and, in contact with other project developers through Slack, implemented
a possible fix which is currently available in my branch.
We are still discussing whether this patch is the right direction, but I have already
created a corresponding pull request (#95).
I also reported the underlying problem in GCC as GCC issue #85072.

Before applying for this GSoC project, I gained development experience through the following projects:


University: Group for Computer Animation, 2017-Now
Student assistant job - Currently, I'm helping with research on time integration methods for deformable
bodies by implementing and evaluating numerical methods in C++ using Eigen.
Furthermore, I worked on improving the PositionBasedDynamics
library which is maintained by the group.


Bachelor thesis and internship, Winter 2016 - Spring 2017 (6 months)
"Extension of an application for thermal analysis of satellites" - During my Bachelor thesis and
internship at a large European spacecraft manufacturer, I worked full-time on the
implementation of numerical methods for ODEs and iterative solvers and on the evaluation of their
performance. The implementation was mainly done in Python using NumPy,
but I also worked a little with MATLAB.
Furthermore, I built interfaces between Python and C/C++ libraries using ctypes.


University: Institute for Fluid Power Drives and Controls, 2014-2016
Student assistant job - I helped with the development of a C++ application for the
simulation of fluid flow in "axial piston pumps". I worked on the numerical code which
uses the KINSOL nonlinear solver,
the user interface (Qt) and result visualization (embedded
VTK views). A paper
based on this tool was published by the PhD candidate I worked with: Stephan Wegner.


Found and reported the following C++ compiler bugs:

GCC #82613: "Cannot access private definitions in base clause of friend class template"
MSVC: "C1001 when performing uniform initialization with pack expansion over alias template"
MSVC: Bug related to "Error modifying struct member in constexpr function"


Some minor contributions to GitHub repositories MetaStuff: #1, #2 and EnTT: #1


In my spare time I work on some personal GitHub repositories, mainly to
collect utilities I use for my work (e.g. noname_tools)
or simply to try out things learned in courses (e.g. phyani_playground).
In these repositories I like to play around with modern C++ techniques, as I'm a
regular listener of CppCast and like to follow the most
recent developments of the language (towards C++20, etc.).


Other Experience

I'm currently studying Computational Engineering Science (M. Sc.) at RWTH Aachen
University, Germany. The programme connects engineering, computer science and mathematics.
We were introduced to FEM from the mathematical point of view (@MathCCES RWTH).
The lectures were always accompanied by exercises in which we implemented the numerical methods,
e.g. in MATLAB. Furthermore, I attended a lecture on FEM for fluid mechanics which was more
engineering/implementation oriented (@CATS RWTH). In computer science lectures we were
introduced to the basic data structures and algorithms and to software engineering
principles, and later focused on high performance
computing, including implementation techniques for MPI, OpenMP and optimizing for
data-level parallelism (i.e. SIMD).
During a student project at MathCCES I worked
with another student on a simplified wake model for wind turbines
which is usable for large scale optimization. After the implementation phase we
validated it using OpenFOAM.
During my years at university I grew confident working on Linux and writing all my
documents with LaTeX. In several smaller coding projects we learned to manage
our code with multiple developers using git.

Why this project?

Considering my development experience and the orientation of my study program, I have spent
a lot of my time at university on tool development for scientific computing and
engineering. In my opinion, proper tooling for these tasks gets progressively more
important as the involved problems become more complicated and computational power
increases. Researchers should be able to focus on their projects, and their tools are
supposed to "just work" and be easily adaptable if necessary. When I was introduced
to FEniCS during a math exercise, I was quite excited by the simplicity of the high
level language that can be used to describe problems. Furthermore, I really liked
the open source aspect of the package and the ability to extend and customize it.
As open source software is so important in everyday use and scientific computing,
I would like to contribute to a project. However, it always felt like the hurdle to get
to know a project was too high. Therefore, I decided to apply for a project with the
FEniCS team, as it is relatively closely related to my study program and hopefully
easier to become productive in.
This particular project on optimizing the code generation components
appealed to me the most, as I'm very interested in how the "magic happens" in the background,
especially in cases like FEniCS, where the interface is relatively simple but still powerful.
This also applies to C++, which is one reason why I follow the
development of the language and the corresponding rationale with great
interest. Unfortunately, I have not yet had the opportunity to study compiler techniques at
university. Therefore, I'm very keen to learn more about code generation through
this project. I hope that this project can help me decide where I want to continue
after my master's thesis, e.g. whether I want to work as a PhD candidate in this field.
From a more practical point of view, I know from my own experience how resource
intensive the assembly stage in FEM is, and I would like to improve the usability of
the package in this regard.

Workload

In my opinion, a GSoC project should be considered a full-time job, and accordingly I
want to spend around 40 hours per week on the task. Currently, I'm still employed at the
Computer Animation research group (mentioned earlier in Development Experience)
and work there for 9 hours per week. My current plan is to handle these hours either in
a single day or in two half days per week, depending on the respective tasks.

Appendix


Website: https://w1th0utnam3.github.io/
GitHub: w1th0utnam3
Twitter: @w1th0utnam3
Other forms of contact: see Slack profile