HPX, as a framework designed for high-performance computing, ships a variety of benchmarks for measuring the performance of its components, including its parallel algorithms, its runtime system, and more. However, these benchmarks (performance tests) lack a standardized output format and a visualization tool that could help analyze performance trends over time and across different operating environments. Additionally, because these benchmarks use std::chrono timers to measure performance, the results obtained may not be stable enough to rely on.
Hence, the goal of this project was to standardize the benchmarks' output format within HPX, and to add integration with an external benchmarking framework, nanobench, for obtaining stable results. Additionally, a visualization tool was developed that leverages the standardized results to plot them and compare them against pre-defined baselines. These plots are then displayed on the HPX CDash dashboard by integrating all of the above components into the CI pipelines.
First, a CMake configuration option for using Nanobench with HPX was defined. Then, we let the HPX runtime know that Nanobench is being used for benchmarking by calling the hpx_add_config_define CMake function, which sets the C++ macro HPX_HAVE_NANOBENCH.
hpx_option(
HPX_WITH_NANOBENCH
BOOL
"Use Nanobench for performance tests. Nanobench will be fetched using FetchContent (default: OFF)"
OFF
CATEGORY "Build Targets"
ADVANCED
)
if(HPX_WITH_NANOBENCH)
hpx_add_config_define(HPX_HAVE_NANOBENCH)
include(HPX_SetupNanobench)
endif()

Next, in the HPX_SetupNanobench file, Nanobench is fetched using CMake's FetchContent and inserted into the build system by adding a system include interface to the library:
include(FetchContent)
fetchcontent_declare(
nanobench
GIT_REPOSITORY https://github.com/martinus/nanobench.git
GIT_TAG v4.3.11
GIT_SHALLOW TRUE
)
if(NOT nanobench_POPULATED)
fetchcontent_populate(nanobench)
endif()
set(Nanobench_ROOT ${nanobench_SOURCE_DIR})
add_library(nanobench INTERFACE)
target_include_directories(
nanobench SYSTEM INTERFACE $<BUILD_INTERFACE:${Nanobench_ROOT}/src/include/>
$<INSTALL_INTERFACE:${CMAKE_INSTALL_INCLUDEDIR}>
)

HPX already contained a benchmarking interface, perftests_report, which takes the test function as input, executes it multiple times, and measures execution times using std::chrono timers. The perftests_print_times function then displays the execution time of each iteration in JSON format.
Using the HPX_HAVE_NANOBENCH macro defined above, we can define another variant of this interface to utilise nanobench.
First, we configure nanobench's benchmarking interface, as follows:
#if defined(HPX_HAVE_NANOBENCH)
ankerl::nanobench::Bench& bench()
{
static ankerl::nanobench::Bench b;
static ankerl::nanobench::Config cfg;
cfg.mWarmup = nanobench_warmup;
return b.config(cfg);
}
#endif

To maintain consistency with the outputs of the already present perftests_print_times interface, we utilise Nanobench's mustache templates to render the outputs of the benchmarks:
#if defined(HPX_HAVE_NANOBENCH)
char const* nanobench_hpx_template() noexcept
{
return R"DELIM({
"outputs": [
{{#result}} {
"name": "{{name}}",
"executor": "{{context(executor)}}",
"series": [
{{#measurement}}{{elapsed}}{{^-last}},
{{/-last}}{{/measurement}}
]
}{{^-last}},{{/-last}}
{{/result}} ]
}
)DELIM";
}
#endif

Now, the nanobench-specialised interfaces for perftests_report and perftests_print_times can be defined simply as:
#if defined(HPX_HAVE_NANOBENCH)
void perftests_report(std::string const& name, std::string const& exec,
std::size_t const steps, hpx::function<void()>&& test)
{
if (steps == 0)
return;
detail::bench()
.name(name)
.context("executor", exec)
.epochs(steps)
.run(test);
}
void perftests_print_times(char const* templ, std::ostream& strm)
{
detail::bench().render(templ, strm);
}
void perftests_print_times(std::ostream& strm)
{
perftests_print_times(detail::nanobench_hpx_template(), strm);
}
void perftests_print_times()
{
perftests_print_times(detail::nanobench_hpx_template(), std::cout);
}
#endif

Additionally, we would like to obtain these detailed results for analytical purposes. By default, however, the benchmarks should only display the average execution time. So we make use of a detailed_ flag for enabling the JSON output, which is set via the HPX command-line flag --hpx:detailed_bench.
For nanobench, we define the default template as follows:
char const* nanobench_hpx_simple_template() noexcept
{
return R"DELIM(Results:
{{#result}}
name: {{name}},
executor: {{context(executor)}},
average: {{average(elapsed)}}{{^-last}}
{{/-last}}
{{/result}})DELIM";
}

And the modified perftests_print_times function is:
void perftests_print_times()
{
if (detailed_)
perftests_print_times(detail::nanobench_hpx_template(), std::cout);
else
perftests_print_times(
detail::nanobench_hpx_simple_template(), std::cout);
}

Let's take a look at a sample benchmark. This is how the benchmark test for hpx::min looked before:
double run_min_element_benchmark(
int test_count, hpx::partitioned_vector<int> const& v)
{
std::uint64_t time = hpx::chrono::high_resolution_clock::now();
for (int i = 0; i != test_count; ++i)
{
// invoke minmax
/*auto iters = */ hpx::min_element(
hpx::execution::par, v.begin(), v.end());
}
time = hpx::chrono::high_resolution_clock::now() - time;
return (time * 1e-9) / test_count;
}
///////////////////////////////////////////////////////////////
int hpx_main(hpx::program_options::variables_map& vm)
{
// create as many partitions as we have localities
hpx::partitioned_vector<int> v(
size, hpx::container_layout(hpx::find_all_localities()));
// initialize data
hpx::generate(hpx::execution::par, v.begin(), v.end(), random_fill());
// run benchmark
double time_min = run_min_element_benchmark(test_count, v);
std::cout << "min" << test_count << "," << time_min << std::endl;
return 0;
}

Since all of this functionality is now internalised in the centralised benchmarking interfaces defined above, the standardised benchmark is as follows:
int hpx_main(hpx::program_options::variables_map& vm)
{
// create as many partitions as we have localities
hpx::partitioned_vector<int> v(
size, hpx::container_layout(hpx::find_all_localities()));
// initialize data
hpx::generate(hpx::execution::par, v.begin(), v.end(), random_fill());
hpx::util::perftests_init(vm, "minmax_element_performance");
// run benchmark
hpx::util::perftests_report("hpx::min", "par", test_count,
[&] { hpx::min_element(hpx::execution::par, v.begin(), v.end()); });
hpx::util::perftests_print_times();
return 0;
}

Let's take a look at the output, rendered by Nanobench:
$ ./bin/min_test
Results:
name: hpx::min,
executor: par,
average: 2.0075114646571e-05

And the detailed JSON result is:
$ ./bin/min_test --detailed_bench
{
"outputs": [
{
"name": "hpx::min",
"executor": "par",
"series": [
0.0001351228125,
2.4984e-05,
2.50793e-05,
2.47716e-05,
8.21671538461538e-05,
4.06398333333333e-05,
2.24538333333333e-05,
4.26035714285714e-05,
3.41145714285714e-05,
2.19190714285714e-05,
2.25392941176471e-05,
2.54408823529412e-05,
1.79653888888889e-05,
1.7820380952381e-05,
2.06333913043478e-05,
1.75531666666667e-05,
1.80185e-05,
1.79215e-05,
1.7698125e-05,
1.89251851851852e-05,
1.88475714285714e-05,
1.8555e-05,
[..output truncated]
]
}
]
}

(This test is part of the larger minmax_element_performance test.)
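Downstream tooling can consume this JSON directly. As a rough sketch (the sample data and parsing script are hypothetical, but the field names follow the template shown above), extracting and summarising a series in Python might look like:

```python
import json
import statistics

# Sample output in the shape produced by the nanobench mustache template above
raw = """
{
  "outputs": [
    {
      "name": "hpx::min",
      "executor": "par",
      "series": [2.49e-05, 2.50e-05, 2.47e-05]
    }
  ]
}
"""

results = json.loads(raw)
for out in results["outputs"]:
    series = out["series"]
    print(f'{out["name"]} ({out["executor"]}): '
          f'mean={statistics.mean(series):.3e}s over {len(series)} samples')
```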
For visualisation, we can use the JSON output produced by the detailed benchmark configuration. The idea is to compare the benchmark's current distribution against some pre-defined baselines, to gauge whether there is an improvement or a regression in performance.
Initially, since HPX had multiple functions/configurations tested in the same file, a single plot comparing the distributions for each algorithm was used, relying on matplotlib's boxplot function.
The issue with this approach was that if there was a lot of variation between the different sub-tests, it became difficult to visualise the comparison.
Finally, it was decided to have a separate plot for each sub-test and to use the kdeplot function from seaborn to plot the results.
Now the plots are easy to read individually, giving more insight into the performance of the various algorithms and data structures.
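As an illustration of the idea (with synthetic, hypothetical timing data), the per-sub-test density comparison can be sketched as below. The actual script uses seaborn's kdeplot; this sketch hand-rolls an equivalent Gaussian KDE and plots it with matplotlib:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, suitable for CI
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
baseline = rng.normal(2.0e-05, 2.0e-06, 200)  # pre-recorded reference timings (synthetic)
current = rng.normal(2.3e-05, 2.0e-06, 200)   # timings from the current run (synthetic)

def gaussian_kde(samples, grid):
    """Gaussian kernel density estimate with Scott's-rule bandwidth."""
    h = samples.std(ddof=1) * len(samples) ** (-1 / 5)
    diffs = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * diffs**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

grid = np.linspace(1.0e-05, 3.5e-05, 300)
plt.plot(grid, gaussian_kde(baseline, grid), label="baseline")
plt.plot(grid, gaussian_kde(current, grid), label="current")
plt.xlabel("time [s]")
plt.ylabel("density")
plt.legend()
plt.savefig("hpx_min_kde.png")
```

A shift between the two density curves is much easier to spot here than in a combined boxplot of many sub-tests with very different scales.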
Let's take a look at how all of the above components were incorporated into HPX using CI pipelines.
The main goal here was to have a CI (continuous integration) pipeline that would run after any changes are made to a Pull Request, so that it would indicate whether or not the changes made caused any significant performance regressions or improvements. Also, it would be great to have these images displayed on the HPX CDash dashboard, for that corresponding CI run.
But what is CDash?
CDash is a dashboard server that takes all the XML files created by CTest (CMake's testing utility) in Dashboard mode and displays the results of configuring, building, and executing the tests, along with statistics about the number of tests that passed or failed, execution times, and so on.
To display images with the corresponding tests, CDash requires the path to the image to be included in the test output, in this format:
"<CTestMeasurementFile type=\"image/gif\" name=\"ValidImage\">/dir/to/valid_img.gif</CTestMeasurementFile>"
A few things become clear here: the images must be available at the tests' runtime, and the above line should be printed only when the results are being uploaded to CDash in the CI pipeline. The latter problem is easily solved by defining another command-line option, like --print_cdash_img_path, which displays the required output.
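In HPX this logic lives in the C++ test harness; as a rough sketch of the behaviour (the image path and measurement name below are made up for illustration), the flag amounts to:

```python
import sys

def maybe_print_cdash_img_path(argv, img_path):
    # Only emit the CDash measurement tag when explicitly requested,
    # i.e. when the results are being uploaded to CDash from CI.
    if "--print_cdash_img_path" in argv:
        print(f'<CTestMeasurementFile type="image/png" name="PerfPlot">'
              f'{img_path}</CTestMeasurementFile>')

maybe_print_cdash_img_path(sys.argv, "/dir/to/min_test.png")
```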
But CTest does not allow us to pass arguments at runtime, so we decided to define dedicated tests for the performance-measurement case. Hence, we defined a new CMake function for this, which creates new tests with the _perftest suffix:
function(add_hpx_performance_report_test subcategory name)
string(REPLACE "_perftest" "" name ${name})
add_test_and_deps_test(
"performance"
"${subcategory}"
${name}_perftest
EXECUTABLE
${name}
PSEUDO_DEPS_NAME
${name}
${ARGN}
RUN_SERIAL
"--print_cdash_img_path"
)
endfunction()

And to ensure that the images are produced before the tests execute, custom CMake targets are defined alongside the tests above. These custom cdash_results targets invoke the Python plotting script we developed and compare the current execution results with pre-defined baselines:
find_package(Python REQUIRED)
add_custom_target(
${name}_cdash_results
COMMAND
sh -c
"${CMAKE_BINARY_DIR}/bin/${name}_test ${ARGN} --test_count=1000 --hpx:detailed_bench >${CMAKE_BINARY_DIR}/${name}.json"
COMMAND
${Python_EXECUTABLE} ${CMAKE_SOURCE_DIR}/tools/perftests_plot.py
${CMAKE_BINARY_DIR}/${name}.json
${CMAKE_SOURCE_DIR}/tools/perftests_ci/perftest/references/lsu_default/${name}.json
${CMAKE_BINARY_DIR}/${name}
)
add_dependencies(${name}_cdash_results ${name}_test)

Finally, we can use these components to construct the CI pipeline using the ctest.cmake file. This file allows us to define all the steps to be displayed on CDash, along with any required configuration:
set(CTEST_CMAKE_GENERATOR Ninja)
set(CTEST_SITE "lsu(rostam)")
set(CTEST_UPDATE_COMMAND "git")
set(CTEST_UPDATE_VERSION_ONLY "ON")
# Configure project
set(CTEST_CONFIGURE_COMMAND "${CMAKE_COMMAND} ${CTEST_SOURCE_DIRECTORY}")
set(CTEST_CONFIGURE_COMMAND
"${CTEST_CONFIGURE_COMMAND} -G${CTEST_CMAKE_GENERATOR}"
)
set(CTEST_CONFIGURE_COMMAND
"${CTEST_CONFIGURE_COMMAND} -B${CTEST_BINARY_DIRECTORY}"
)
set(CTEST_CONFIGURE_COMMAND
"${CTEST_CONFIGURE_COMMAND} -DHPX_WITH_NANOBENCH=ON"
)
set(CTEST_CONFIGURE_COMMAND
"${CTEST_CONFIGURE_COMMAND} -DHPX_WITH_PARALLEL_TESTS_BIND_NONE=ON"
)
set(CTEST_CONFIGURE_COMMAND
"${CTEST_CONFIGURE_COMMAND} ${CTEST_CONFIGURE_EXTRA_OPTIONS}"
)
ctest_configure()
# Build the custom cdash targets
foreach(benchmark ${benchmarks})
ctest_build(TARGET ${benchmark}_cdash_results FLAGS "-k0 -j ${CTEST_BUILD_PARALLELISM}")
endforeach()
# Tests!!
ctest_test(
INCLUDE "_perftest$"
PARALLEL_LEVEL "${CTEST_TEST_PARALLELISM}"
)

Then, we can simply invoke the entire mechanism to upload all the results to CDash using the following command:
$ ctest \
-VV \
--output-on-failure \
-S ${src_dir}/.jenkins/lsu-perftests/ctest.cmake \
-DCTEST_BUILD_CONFIGURATION_NAME="${configuration_name}" \
-DCTEST_CONFIGURE_EXTRA_OPTIONS="${configure_extra_options}" \
-DCTEST_SOURCE_DIRECTORY="${src_dir}" \
-DCTEST_BINARY_DIRECTORY="${build_dir}"

And here is how it looks on CDash, after the Jenkins CI run on Rostam:
But let's not forget that we also need to indicate whether any performance discrepancy has occurred. The data is already being processed in the visualisation script; hence, we extend it to measure the differences between the baselines and the current execution times using bootstrapping, which can be done with scipy's bootstrap method. This method returns a confidence interval for the chosen statistic, which we use to judge whether the two distributions differ significantly.
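The script itself relies on scipy's bootstrap method; the underlying idea can be sketched with plain NumPy as follows (synthetic timing samples, and a simple percentile bootstrap rather than scipy's default bias-corrected method):

```python
import numpy as np

rng = np.random.default_rng(0)
baseline = rng.normal(2.0e-05, 2.0e-06, 100)  # reference timings (synthetic)
current = rng.normal(2.6e-05, 2.0e-06, 100)   # current-run timings (synthetic)

def bootstrap_mean_diff_ci(a, b, n_resamples=5000, alpha=0.05):
    """Percentile-bootstrap confidence interval for mean(b) - mean(a)."""
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        # Resample each distribution with replacement and record the mean difference
        ra = rng.choice(a, size=len(a), replace=True)
        rb = rng.choice(b, size=len(b), replace=True)
        diffs[i] = rb.mean() - ra.mean()
    return np.quantile(diffs, alpha / 2), np.quantile(diffs, 1 - alpha / 2)

lo, hi = bootstrap_mean_diff_ci(baseline, current)
# If the interval excludes zero, flag a significant regression/improvement.
print(f"95% CI for mean difference: [{lo:.2e}, {hi:.2e}]")
print("significant" if lo > 0 or hi < 0 else "not significant")
```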
To alert the contributors/members involved in a Pull Request, we can use the GitHub API to comment with all the relevant info for tests showing significant deviation from the baselines. This feature is still a work in progress, which is why we see some erratic results here:
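Posting such an alert goes through GitHub's REST endpoint for creating issue comments (pull-request comments are issue comments in the API). A sketch of assembling the request without actually sending it (the PR number, comment body, and token are placeholders):

```python
import json

def build_pr_comment_request(owner, repo, pr_number, body, token):
    # GitHub treats PR comments as issue comments:
    # POST /repos/{owner}/{repo}/issues/{issue_number}/comments
    url = f"https://api.github.com/repos/{owner}/{repo}/issues/{pr_number}/comments"
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    payload = json.dumps({"body": body})
    return url, headers, payload

# Hypothetical PR number and message, for illustration only
url, headers, payload = build_pr_comment_request(
    "STEllAR-GROUP", "hpx", 1234,
    "Significant performance deviation detected; see the attached plots.",
    "<token>")
```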
Here is a flowchart which summarises the entire CI pipeline.
My work is split into two pull requests, both of which are currently under review:
- Integration of Nanobench into the HPX build system, along with standardising benchmarks (sections 1 and 2)
- Integration of standardised tests with the perftests CI pipelines (sections 3 and 4)
I want to thank my mentors Panagiotis Syskakis and Isidoros Tsaousis-Seiras for their mentorship and the valuable feedback given throughout the course of this project. Special thanks to Dr. Hartmut Kaiser for providing his inputs on how the CI pipelines should work.










