GPTL is a library for instrumenting C, C++, and Fortran codes for performance analysis and profiling. To enable it in QuantumESPRESSO, the first step is to compile the PAPI library (optional) and GPTL as static libraries.
# optional: load the Intel compilers and Intel MPI
module load autoload intelmpi
mkdir $HOME/qe_profiling
# configure and compile PAPI
wget http://icl.utk.edu/projects/papi/downloads/papi-5.5.1.tar.gz
tar xvzf papi-5.5.1.tar.gz
cd papi-5.5.1/src
./configure CC=icc --prefix=$HOME/qe_profiling --with-static-lib=yes --with-shared-lib=no
make
make install
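# optional sanity check: the PAPI utilities should now be in $HOME/qe_profiling/bin;
# papi_avail lists the hardware counters available on this node
$HOME/qe_profiling/bin/papi_avail | head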
# configure and compile GPTL
wget https://github.com/jmrosinski/GPTL/archive/v5.5.tar.gz
tar xvzf v5.5.tar.gz
cd GPTL-5.5
# Now configure GPTL for your system: you need to create a
# macros.make file starting from one of the provided templates.
# A configuration for Intel compilers and Intel MPI is downloaded below.
wget https://gist.githubusercontent.com/bonfus/21dec6b966859f5f509b935f8b055a7f/raw/macros.make
make && make install
At this stage you should have the following static libraries in $HOME/qe_profiling/lib/:
libgptl.a libpapi.a libpfm.a
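A quick way to verify this is a simple listing (the path matches the prefix used above):
ls $HOME/qe_profiling/lib/
# expected: libgptl.a  libpapi.a  libpfm.a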
To enable GPTL timers in QuantumESPRESSO, apply the patch as follows:
tar xzf qe-6.1.tar.gz
cd qe-6.1
wget https://gist.githubusercontent.com/bonfus/21dec6b966859f5f509b935f8b055a7f/raw/qe_gptl.patch
patch -p1 < qe_gptl.patch
./configure --enable-openmp CC=icc FC=mpiifort
Before compiling the code, you have to manually add a couple of flags and paths to make.inc:
MANUAL_DFLAGS = -D__GPTL_TIMING -D__GPTL_TIMING_PAPI
IFLAGS = -I$(TOPDIR)/include [...] -I$(HOME)/qe_profiling/include
[...]
LD_LIBS = -L$(HOME)/qe_profiling/lib -lgptl -lpapi
Now compile as usual with
make all
In addition to the standard QE timers, GPTL timers will also be active. This may add a small overhead, which is however negligible in standard runs. There are two ways to customize the details of the GPTL timers:
- using the configuration file gptl.conf with the namelist gptlnl shown here;
- only for PAPI events, with the environment variable QE_PAPI_EVENTS. Its value should be a set of comma-separated events available on the execution nodes; you can check the available events with $HOME/qe_profiling/bin/papi_avail.
Illustrative examples of both options are sketched right below.
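As an illustration only (the namelist key and the event names below are assumptions to be checked against your GPTL and PAPI versions):
# example gptl.conf: the patched code reads the gptlnl namelist from this file;
# print_method is an assumed GPTL option name, see the GPTL documentation
cat > gptl.conf << 'EOF'
&gptlnl
  print_method = 'full_tree'
/
EOF
# example run-time selection of PAPI events (event names are illustrative)
export QE_PAPI_EVENTS="PAPI_DP_OPS,PAPI_L2_DCM"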
After a successful run (a minimal launch sketch is given below), two files will appear in the current working directory:
timings.0
timings.summary
These files contain, respectively, the results obtained on the master MPI process and a summary of the results collected on all the MPI ranks.
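For reference, a run producing these files could be launched as sketched here; the launcher, executable path and input file name are illustrative, while the parallelization flags match the sample that follows:
mpirun -np 32 ./bin/pw.x -npool 8 -ntg 1 -ndiag 1 -inp pw.scf.in > pw.scf.out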
A sample timing file, collected with -npool 8, -ntg 1 and -ndiag 1, is shown below:
timings.0
---------
[...]
If a '%_of' field is present, it is w.r.t. the first timer for thread 0.
If a 'e6_per_sec' field is present, it is in millions of PAPI counts per sec.
A '*' in column 1 below means the timer had multiple parents, though the
values printed are for all calls.
Further down the listing may be more detailed information about multiple
parents. Look for 'Multiple parent info'
Process size=657 MB rss=157 MB
Stats for thread 0:
Called Recurse Wallclock max min %_of_PWSCF
PWSCF 1 - 25.697 25.697 25.697 100.00
init_run 1 - 7.157 7.157 7.157 27.85
init_vloc 1 - 0.867 0.867 0.867 3.37
init_us_1 1 - 1.379 1.379 1.379 5.37
init_at_1 1 - 0.190 0.190 0.190 0.74
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
potinit 1 - 1.513 1.513 1.513 5.89
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
* v_of_rho 6 - 1.218 0.214 0.197 4.74
v_xc 6 - 1.172 0.206 0.189 4.56
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
v_h 6 - 0.045 8.28e-03 7.26e-03 0.18
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
* PAW_pot 6 - 1.035 0.201 0.163 4.03
* interpolate 11 - 0.096 0.012 7.84e-03 0.37
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
ffts 11 - 0.014 1.67e-03 1.09e-03 0.06
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
* newd 6 - 3.372 0.593 0.550 13.12
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
wfcinit 1 - 0.442 0.442 0.442 1.72
* init_us_2 55 - 0.163 4.24e-03 2.79e-03 0.63
wfcinit:atomic 5 - 9.16e-03 1.85e-03 1.80e-03 0.04
atomic_wfc 5 - 8.98e-03 1.81e-03 1.77e-03 0.03
wfcinit:wfcrot 5 - 0.404 0.177 0.056 1.57
wfcrot 5 - 0.404 0.177 0.055 1.57
* h_psi_bgrp 96 - 3.336 0.055 2.57e-03 12.98
h_psi 96 - 3.333 0.055 2.55e-03 12.97
h_psi:pot 96 - 3.312 0.055 2.52e-03 12.89
vloc_psi 96 - 3.108 0.051 1.92e-03 12.10
* fftw 3995 - 2.996 6.75e-03 6.43e-04 11.66
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
h_psi:calbec 96 - 0.092 1.67e-03 3.10e-04 0.36
* calbec 121 - 0.118 3.56e-03 2.67e-04 0.46
add_vuspsi 96 - 0.107 2.21e-03 2.27e-04 0.41
* s_psi_bgrp 96 - 0.108 2.12e-03 2.43e-04 0.42
s_psi 96 - 0.100 2.04e-03 2.08e-04 0.39
* cdiaghg 91 - 0.538 0.117 9.29e-04 2.09
electrons 1 - 14.490 14.490 14.490 56.39
* v_of_rho 6 - 1.218 0.214 0.197 4.74
v_xc 6 - 1.172 0.206 0.189 4.56
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
v_h 6 - 0.045 8.28e-03 7.26e-03 0.18
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
* PAW_pot 6 - 1.035 0.201 0.163 4.03
* interpolate 11 - 0.096 0.012 7.84e-03 0.37
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
ffts 11 - 0.014 1.67e-03 1.09e-03 0.06
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
* newd 6 - 3.372 0.593 0.550 13.12
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
c_bands 5 - 4.909 1.147 0.805 19.10
* init_us_2 55 - 0.163 4.24e-03 2.79e-03 0.63
cegterg 25 - 4.031 0.221 0.135 15.69
* h_psi_bgrp 96 - 3.336 0.055 2.57e-03 12.98
h_psi 96 - 3.333 0.055 2.55e-03 12.97
h_psi:pot 96 - 3.312 0.055 2.52e-03 12.89
vloc_psi 96 - 3.108 0.051 1.92e-03 12.10
* fftw 3995 - 2.996 6.75e-03 6.43e-04 11.66
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
h_psi:calbec 96 - 0.092 1.67e-03 3.10e-04 0.36
* calbec 121 - 0.118 3.56e-03 2.67e-04 0.46
add_vuspsi 96 - 0.107 2.21e-03 2.27e-04 0.41
* s_psi_bgrp 96 - 0.108 2.12e-03 2.43e-04 0.42
s_psi 96 - 0.100 2.04e-03 2.08e-04 0.39
* cdiaghg 91 - 0.538 0.117 9.29e-04 2.09
cegterg:update 66 - 0.108 4.11e-03 4.15e-04 0.42
g_psi 66 - 0.042 1.15e-03 5.30e-05 0.16
cegterg:overlap 66 - 0.118 2.81e-03 9.52e-04 0.46
cegterg:last 25 - 0.027 1.37e-03 9.64e-04 0.10
sum_band 5 - 4.275 0.885 0.843 16.64
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
* interpolate 11 - 0.096 0.012 7.84e-03 0.37
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
ffts 11 - 0.014 1.67e-03 1.09e-03 0.06
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
* init_us_2 55 - 0.163 4.24e-03 2.79e-03 0.63
* fftw 3995 - 2.996 6.75e-03 6.43e-04 11.66
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
* calbec 121 - 0.118 3.56e-03 2.67e-04 0.46
sum_band:becsum 25 - 5.19e-03 1.31e-03 1.22e-04 0.02
addusdens 5 - 3.314 0.669 0.658 12.90
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
mix_rho 5 - 0.115 0.034 0.018 0.45
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
* davcio 15 - 0.154 0.024 9.60e-05 0.60
* davcio 15 - 0.154 0.024 9.60e-05 0.60
Overhead sum = 0.0561 wallclock seconds
Total calls = 9612
[...]
A few notes on the table above:
- All pw.x timers started on the master MPI process are automatically printed. The simulation was performed with only 1 OpenMP thread and, since no timers appear within OpenMP parallel regions in QE, per-thread information is generally not very useful.
- min and max report the minimum and maximum time of the individual calls to the given timer.
- The total memory occupied by the process is reported at the beginning of the log.
As an example, let's check a few interesting timers for the same input file running on a different processor.
timings.0
---------
[...]
Stats for thread 0:
Called Recurse Wallclock max min %_of_PWSCF
PWSCF 1 - 7.672 7.672 7.672 100.00
init_run 1 - 1.152 1.152 1.152 15.01
[...]
wfcinit:wfcrot 5 - 0.111 0.024 0.021 1.45
[...]
Comparing the timer wfcinit:wfcrot reported above with the one from the previous report, the execution time is quite different: in this last case all 5 calls take roughly the same time, while in the first report the min and max columns differ by about a factor of 3.
Let's move to the summary file, i.e.:
timings.summary
---------------
Total ranks in communicator=32
nthreads on rank 0=1
'N' used for mean, std. dev. calcs.: 'ncalls'/'nthreads'
'ncalls': number of times the region was invoked across tasks and threads.
'nranks': number of ranks which invoked the region.
mean, std. dev: computed using per-rank max time across all threads on each rank
wallmax and wallmin: max, min time across tasks and threads.
name ncalls nranks mean_time std_dev wallmax (rank ) wallmin (rank )
PWSCF 32 32 25.714 0.014 25.733 ( 14) 25.680 ( 28)
init_run 32 32 7.149 0.005 7.157 ( 22) 7.143 ( 21)
init_vloc 32 32 0.860 0.006 0.868 ( 10) 0.851 ( 28)
init_us_1 32 32 1.375 0.023 1.405 ( 8) 1.339 ( 19)
init_at_1 32 32 0.187 0.001 0.190 ( 2) 0.187 ( 18)
fft 3008 32 1.159 0.087 1.450 ( 2) 1.085 ( 27)
fft_scatter 119160 32 0.913 0.188 1.489 ( 2) 0.747 ( 27)
potinit 32 32 1.499 0.006 1.513 ( 0) 1.494 ( 14)
v_of_rho 192 32 1.168 0.027 1.218 ( 0) 1.142 ( 30)
v_xc 192 32 1.127 0.025 1.173 ( 3) 1.102 ( 28)
v_h 192 32 0.040 0.003 0.045 ( 0) 0.037 ( 13)
PAW_pot 192 32 1.087 0.027 1.114 ( 24) 1.035 ( 0)
interpolate 352 32 0.085 0.004 0.096 ( 0) 0.080 ( 24)
ffts 352 32 0.015 0.001 0.017 ( 14) 0.013 ( 21)
newd 192 32 3.180 0.075 3.373 ( 1) 3.115 ( 28)
wfcinit 32 32 0.405 0.031 0.442 ( 1) 0.370 ( 27)
init_us_2 1584 32 0.139 0.016 0.163 ( 0) 0.121 ( 26)
wfcinit:atomic 144 32 0.008 0.001 0.009 ( 14) 0.007 ( 24)
atomic_wfc 144 32 0.008 0.001 0.009 ( 0) 0.007 ( 24)
wfcinit:wfcrot 144 32 0.375 0.029 0.412 ( 3) 0.337 ( 27)
wfcrot 144 32 0.375 0.029 0.411 ( 3) 0.337 ( 27)
h_psi_bgrp 2864 32 2.934 0.340 3.354 ( 2) 2.584 ( 27)
h_psi 2864 32 2.930 0.340 3.350 ( 2) 2.581 ( 27)
h_psi:pot 2864 32 2.911 0.338 3.329 ( 2) 2.563 ( 27)
vloc_psi 2864 32 2.729 0.319 3.130 ( 2) 2.394 ( 23)
fftw 115800 32 2.621 0.314 3.069 ( 2) 2.294 ( 23)
h_psi:calbec 2864 32 0.085 0.009 0.094 ( 13) 0.072 ( 27)
calbec 3584 32 0.108 0.011 0.125 ( 13) 0.091 ( 27)
add_vuspsi 2864 32 0.090 0.010 0.107 ( 0) 0.076 ( 27)
s_psi_bgrp 2864 32 0.094 0.011 0.109 ( 10) 0.083 ( 23)
s_psi 2864 32 0.086 0.010 0.100 ( 0) 0.075 ( 23)
cdiaghg 2720 32 0.518 0.047 0.594 ( 8) 0.455 ( 26)
electrons 32 32 14.427 0.026 14.490 ( 0) 14.376 ( 8)
c_bands 160 32 4.918 0.011 4.944 ( 17) 4.907 ( 27)
cegterg 720 32 3.594 0.405 4.051 ( 2) 3.161 ( 31)
cegterg:update 2000 32 0.093 0.011 0.112 ( 10) 0.079 ( 23)
g_psi 2000 32 0.037 0.004 0.042 ( 8) 0.032 ( 30)
cegterg:overlap 2000 32 0.114 0.013 0.135 ( 11) 0.097 ( 30)
cegterg:last 720 32 0.024 0.003 0.027 ( 10) 0.020 ( 28)
sum_band 160 32 4.263 0.006 4.275 ( 0) 4.252 ( 14)
sum_band:becsum 720 32 0.004 0.001 0.010 ( 9) 0.003 ( 16)
addusdens 160 32 3.074 0.092 3.317 ( 2) 3.022 ( 27)
mix_rho 160 32 0.117 0.004 0.122 ( 19) 0.108 ( 1)
davcio 464 32 0.077 0.036 0.190 ( 2) 0.053 ( 15)
In the above report, the total number of calls collected on all processes is shown. As a consequence, all clocks have at least 32 calls; this follows from how the clocks are defined in QE. For example, one would expect the cegterg set of timers to appear only 8 times (only the master of each pool runs the diagonalization), but the timer is started before the if statement that selects the subset of MPI processes actually performing the diagonalization, and it is stopped after the results have been received on all processes with an MPI call.
As already detailed above, it is also possible to collect processor events. Below is an example reporting PAPI_DP_OPS.
Stats for thread 0:
Called Recurse Usr sys usr+sys PAPI_DP_OPS e6_/_sec
PWSCF 1 - 6.690 0.870 7.560 2.0946851199e+10 0.00
init_run 1 - 1.030 0.100 1.130 3.3837314470e+09 0.00
init_vloc 1 - 0.090 0.000 0.090 4.9248796500e+08 0.00
init_us_1 1 - 0.180 0.000 0.180 4.4793199900e+08 0.00
init_at_1 1 - 0.020 0.000 0.020 4.9536002000e+07 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
potinit 1 - 0.260 0.020 0.280 6.8754499000e+08 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
* v_of_rho 6 - 0.410 0.020 0.430 5.1731972700e+08 0.00
v_xc 6 - 0.390 0.020 0.410 4.7782073300e+08 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
v_h 6 - 0.020 0.000 0.020 3.9498802000e+07 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
* PAW_pot 6 - 0.290 0.040 0.330 5.3928186200e+08 0.00
* interpolate 11 - 0.030 0.000 0.030 8.2871096000e+07 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
ffts 11 - 0.000 0.000 0.000 1.4295924000e+07 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
* newd 6 - 1.120 0.310 1.430 3.0941937420e+09 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
wfcinit 1 - 0.110 0.000 0.110 6.4585088800e+08 0.00
* init_us_2 55 - 0.040 0.000 0.040 7.4965638000e+07 0.00
wfcinit:atomic 5 - 0.000 0.000 0.000 5.1291010000e+06 0.00
atomic_wfc 5 - 0.000 0.000 0.000 5.1290010000e+06 0.00
wfcinit:wfcrot 5 - 0.110 0.000 0.110 6.3384544000e+08 0.00
wfcrot 5 - 0.110 0.000 0.110 6.3384534000e+08 0.00
electrons 1 - 5.110 0.720 5.830 1.7556956753e+10 0.00
* v_of_rho 6 - 0.410 0.020 0.430 5.1731972700e+08 0.00
v_xc 6 - 0.390 0.020 0.410 4.7782073300e+08 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
v_h 6 - 0.020 0.000 0.020 3.9498802000e+07 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
* PAW_pot 6 - 0.290 0.040 0.330 5.3928186200e+08 0.00
* interpolate 11 - 0.030 0.000 0.030 8.2871096000e+07 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
ffts 11 - 0.000 0.000 0.000 1.4295924000e+07 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
* newd 6 - 1.120 0.310 1.430 3.0941937420e+09 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
c_bands 5 - 1.800 0.060 1.860 9.7972284840e+09 0.00
* init_us_2 55 - 0.040 0.000 0.040 7.4965638000e+07 0.00
cegterg 25 - 1.530 0.060 1.590 9.2359865780e+09 0.00
h_psi_bgrp 91 - 1.130 0.030 1.160 5.2714834540e+09 0.00
s_psi_bgrp 91 - 0.030 0.000 0.030 9.3230422400e+08 0.00
cegterg:update 66 - 0.070 0.000 0.070 7.3185780000e+08 0.00
g_psi 66 - 0.020 0.000 0.020 1.3892604000e+07 0.00
cegterg:overlap 66 - 0.070 0.000 0.070 1.1791737520e+09 0.00
cdiaghg 86 - 0.120 0.010 0.130 3.0174008200e+08 0.00
cegterg:last 25 - 0.000 0.000 0.000 4.3691150600e+08 0.00
sum_band 5 - 1.660 0.350 2.010 4.1881319920e+09 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
* interpolate 11 - 0.030 0.000 0.030 8.2871096000e+07 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
ffts 11 - 0.000 0.000 0.000 1.4295924000e+07 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
* init_us_2 55 - 0.040 0.000 0.040 7.4965638000e+07 0.00
fftw 625 - 0.190 0.000 0.190 6.6385000000e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
calbec 25 - 0.020 0.000 0.020 3.6973454000e+08 0.00
sum_band:becsum 25 - 0.000 0.000 0.000 2.2321000000e+06 0.00
addusdens 5 - 1.270 0.340 1.610 2.9381309950e+09 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
mix_rho 5 - 0.050 0.010 0.060 6.6551539000e+07 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
* davcio 15 - 0.000 0.000 0.000 120 0.00
* davcio 15 - 0.000 0.000 0.000 120 0.00
Overhead sum = 0.00244 wallclock seconds
Total calls = 2112
Total GPTL memory usage = 31.96 KB
Components:
Hashmem = 16.672 KB
Regionmem = 13.376 KB (papimem portion = 5.472 KB)
Parent/child arrays = 0.888 KB
Callstackmem = 1.024 KB
Thread mapping:
GPTLthreadid_omp[0] = 0
Compile libpfm
wget https://vorboss.dl.sourceforge.net/project/perfmon2/libpfm4/libpfm-4.8.0.tar.gz
tar xvzf libpfm-4.8.0.tar.gz
cd libpfm-4.8.0/
# adapt config.mk to your needs (and compile static libraries) or get it from this gist;
# a minimal sketch of the relevant settings is given after this command list
make
make install
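A minimal sketch of the config.mk settings for a static-only build installed under $HOME/qe_profiling; the variable names below are assumptions and may differ between libpfm versions, so verify them in your copy of config.mk:
PREFIX=$(HOME)/qe_profiling    # install prefix (assumed variable name)
CONFIG_PFMLIB_SHARED=n         # disable the shared library build (assumed variable name)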