This blog post provides an overview of the work done for the Google Summer of Code 2019.
The main goal of this project is to add the KaHIP partitioner to DOLFIN's graph wrappers and mesh partitioning, and to investigate whether the improvements expected from KaHIP for mesh partitioning carry over to DOLFIN's parallel toolchain. This is related to Issue #116.
A second goal of this project is to add support for partitioning using a subset of processes. As currently implemented in DOLFIN-X, the mesh partitioners use all available MPI processes to perform the partitioning. Numerical experiments show that, at scale, this can demand high memory usage (the memory required by the partitioning packages depends on the number of processes). Also, for a fixed-size local mesh/graph per process, the running time increases significantly with the number of processes. This is related to Issue #9.
In the following sections, we will list the contributions and some results demonstrating the improvements achieved during GSoC 2019.
To test the parallel KaHIP partitioner, contributions were made to different open-source repositories.
List of Pull Requests and Commits - KaHIP repository:
- PR #33 - Use python3 print format in Scons - merged on Jun 25
- These simple modifications allow the `compile.sh` script to be used on systems that have only Python 3.
- PR #37 - Add modified kahip lib to deploy/parallel - merged on Jun 25
- The aim of this PR is to simplify linking the parallel interface using cmake.
- Commit ad07d2d - bug fix in interface due to @IgorBaratta - committed on Aug 6
- This commit fixes the internal graph building process of KaHIP; more specifically, it sets the correct number of edges.
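For context, parallel graph partitioners typically receive the local graph in a CSR (compressed sparse row) layout, where the number of edges is implied by the adjacency offsets. The snippet below is a minimal, hypothetical illustration of this consistency requirement, not KaHIP's actual internal code:

```cpp
#include <cassert>
#include <vector>

int main()
{
  // Hypothetical local graph with 3 vertices and 4 directed edges:
  // vertex 0 -> {1, 2}, vertex 1 -> {0}, vertex 2 -> {0}.
  std::vector<int> xadj = {0, 2, 3, 4};   // offsets into adjncy, size n + 1
  std::vector<int> adjncy = {1, 2, 0, 0}; // concatenated adjacency lists

  // The edge count must be derived from the offset array; an edge count
  // that disagrees with xadj is the kind of mismatch such a fix addresses.
  int num_edges = xadj.back();
  assert(num_edges == static_cast<int>(adjncy.size()));
  return 0;
}
```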
List of Pull Requests - DOLFIN-X repository:
- PR #451 - This PR aims to solve Issue #116.
- Create `FindKaHIP.cmake` file to link KaHIP and add a small test for the interface in serial;
- Define KaHIP as an optional package;
- Create an interface for KaHIP following the code available for ParMETIS and PT-SCOTCH;
- Add an option to use the new interface in `Partitioning.h`;
- Add some tests in the C++ interface.
To test the interface and the performance of the new parallel partitioner, we performed some numerical experiments to evaluate its weak scalability. The runtime results were compared with those obtained using PT-SCOTCH, considered here as the benchmark.
These numerical experiments were performed using resources provided by the Cambridge Service for Data-Driven Discovery (CSD3) operated by the University of Cambridge Research Computing Service (www.csd3.cam.ac.uk).
We consider, for comparison purposes, the total mesh building time, which consists of three well-defined steps (a minimal timing sketch is shown after the list):
- Read - Read local mesh data (points and cells) from an XDMF file with HDF5 encoding.
- Partition - Build the distributed dual graph (cell-cell connections) from local mesh data; partition the mesh using one of the supported libraries (KaHIP or PT-SCOTCH), and perform halo exchange of cell partition data.
- Distribute - Distribute mesh from a set of points and cells on each local process with the pre-computed partition data.
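As a rough, hypothetical sketch of how these three phases can be timed, the stub functions below stand in for the DOLFIN-X calls shown later in this post (`read_mesh_data`, `partition_cells` and `build_from_partition`):

```cpp
#include <mpi.h>
#include <cstdio>

// Stubs standing in for the three phases of the mesh build.
void read_local_mesh_data() { /* XDMFFile::read_mesh_data */ }
void partition_cells() { /* Partitioning::partition_cells */ }
void distribute_mesh() { /* Partitioning::build_from_partition */ }

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  MPI_Barrier(MPI_COMM_WORLD); // synchronize before timing
  double t0 = MPI_Wtime();
  read_local_mesh_data(); // (1) Read
  double t1 = MPI_Wtime();
  partition_cells(); // (2) Partition
  double t2 = MPI_Wtime();
  distribute_mesh(); // (3) Distribute
  double t3 = MPI_Wtime();

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0)
    std::printf("read %.3f s, partition %.3f s, distribute %.3f s\n",
                t1 - t0, t2 - t1, t3 - t2);

  MPI_Finalize();
  return 0;
}
```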
The number of cells per processor is fixed at 100,000, while the number of processors is increased from 128 to 1024. The mesh generation was performed in a pre-processing step and is not included in the runtime measurements.
The first tests were performed without the corrections to the KaHIP parallel interface, and the results are shown below. At first, PT-SCOTCH was considerably more scalable than KaHIP (even with KaHIP in its fastest mode), which did not match the results available in the literature. So I investigated the possibility of bugs in the parallel interface.
After some small bug fixes, we repeated the numerical experiments, and the results changed dramatically. Although the partition step still dominates the total time, the KaHIP partitioner became much more scalable.
Up to 1024 processors, the partitioning step using KaHIP behaved almost independently of the number of processors. This is a very promising result; however, tests with higher core counts are still necessary for more definitive conclusions.
List of Pull Requests - DOLFIN-X repository:
- PR #459 - This PR is a proposal for closing Issue #9, and is related to Issue #358 of dolfin. Merged on Jul 20.
This PR entails the following tasks:
- Create `MPI::SubsetComm` to extract a new communicator with only a subset of processes.
- (XDMFFile) Separate reading mesh data from mesh creation.
- Move `Partitioning::partition_cells` to the public interface, and allow partitioning with a different number of parts and processes.
- Provide fine-grained control of mesh partitioning (different communicators for reading mesh data and partitioning, selection of the partitioner); a small example is provided below.
- Add a new C++ test for distributed meshes (generate, write, read).
```cpp
// Create a communicator containing only num_processes ranks
MPI_Comm subset_comm = MPI::SubsetComm(MPI_COMM_WORLD, num_processes);

// Read the mesh data on the subset of processes only
std::tie(cell_type, points, cells, global_cell_indices)
    = infile.read_mesh_data(subset_comm);

// Partition the cells on the subset, producing nparts parts
mesh::PartitionData cell_partition = mesh::Partitioning::partition_cells(
    subset_comm, nparts, cell_type, cells, partitioner);

// Build and distribute the mesh over the full communicator
auto mesh = std::make_shared<mesh::Mesh>(mesh::Partitioning::build_from_partition(
    MPI_COMM_WORLD, cell_type, cells, points, global_cell_indices,
    ghost_mode, cell_partition));
```
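Under the hood, a subset communicator can be created with standard MPI group operations. The sketch below is an illustration, not necessarily how `MPI::SubsetComm` is actually implemented; it assumes the first `num_processes` ranks are kept:

```cpp
#include <mpi.h>
#include <numeric>
#include <vector>

// Sketch: build a communicator containing only ranks [0, num_processes).
MPI_Comm subset_comm(MPI_Comm comm, int num_processes)
{
  MPI_Group world_group;
  MPI_Comm_group(comm, &world_group);

  // Select the first num_processes ranks of comm
  std::vector<int> ranks(num_processes);
  std::iota(ranks.begin(), ranks.end(), 0);

  MPI_Group subset_group;
  MPI_Group_incl(world_group, num_processes, ranks.data(), &subset_group);

  // Collective over comm; ranks outside the subset receive MPI_COMM_NULL
  MPI_Comm new_comm;
  MPI_Comm_create(comm, subset_group, &new_comm);

  MPI_Group_free(&world_group);
  MPI_Group_free(&subset_group);
  return new_comm;
}
```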
Again, the wall time of the whole distributed mesh building process is measured for an increasing number of processors and used as the performance metric. However, instead of using different partitioners, we compare the parallel performance using different numbers of processes for partitioning, and then distribute the corresponding mesh data to each process of the primary communicator. We consider here three different settings:
- Use all available processes of the primary MPI communicator (the only option available in DOLFIN-X before this PR).
- Use half of the available MPI processes.
- Use a fixed number of processes (128 in the current numerical experiment).
This preliminary numerical experiment shows that the total mesh building time strongly depends on the number of processors used in the mesh partitioning step. Using a smaller processor subset leads to a shorter total running time; however, the time of the distribution step (Distribute) is adversely affected.
- PR #436 - Avoid partitioning small meshes by throwing an error if the expected local number of cells is less than 2. This PR fixes Issue #114 - Merged on Jun 13.
- PR #465 - Remove unused comm parameter from read_mesh - Merged on Jul 8.
- A preliminary study of the mesh distribution step (Distribute) was performed. We concluded that the bottleneck is the distribution of cells, while the distribution of points has a relatively small cost. As future work, we intend to propose more efficient solutions for cell distribution that take the partitioned graph topology into account, avoiding redundant communication.
- Numerical experiments were performed with a maximum of 1024 processors. We intend to perform similar tests with higher processor counts. In addition, a study covering the complete process of solving a boundary value problem is planned.
- Some open issues regarding parallel reading and mesh building will be addressed in the coming months, e.g. Issue #429.