GPTL is a library for instrumenting C, C++, and Fortran codes for performance analysis and profiling. To enable it in QuantumESPRESSO, the first step is to compile the PAPI library (optional) and GPTL as static libraries.
# optional: load the Intel compilers and Intel MPI
module load autoload intelmpi
mkdir $HOME/qe_profiling
# configure and compile PAPI
wget http://icl.utk.edu/projects/papi/downloads/papi-5.5.1.tar.gz
tar xvzf papi-5.5.1.tar.gz
cd papi-5.5.1/src
./configure CC=icc --prefix=$HOME/qe_profiling --with-static-lib=yes --with-shared-lib=no
make
make install
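# optional sanity check: the PAPI utilities should now be in $HOME/qe_profiling/bin;
# papi_avail lists the hardware counters available on this node
$HOME/qe_profiling/bin/papi_avail | head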
# configure and compile GPTL
wget https://github.com/jmrosinski/GPTL/archive/v5.5.tar.gz
tar xvzf v5.5.tar.gz
cd GPTL-5.5
# Now configure GPTL for your system: you need to create a
# macros.make file starting from one of the provided templates.
# A configuration for Intel compilers and Intel MPI is downloaded below.
wget https://gist.githubusercontent.com/bonfus/21dec6b966859f5f509b935f8b055a7f/raw/macros.make
make && make install
At this stage you should have the following static libraries in $HOME/qe_profiling/lib/:
libgptl.a libpapi.a libpfm.a
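A quick way to verify this is a simple listing (the path matches the prefix used above):
ls $HOME/qe_profiling/lib/
# expected: libgptl.a  libpapi.a  libpfm.a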
To enable GPTL timers in QuantumESPRESSO, apply the patch as follows:
tar xzf qe-6.1.tar.gz
cd qe-6.1
wget https://gist.githubusercontent.com/bonfus/21dec6b966859f5f509b935f8b055a7f/raw/qe_gptl.patch
patch -p1 < qe_gptl.patch
./configure --enable-openmp CC=icc FC=mpiifort
Before compiling the code, you have to manually add a couple of flags and paths to make.inc:
MANUAL_DFLAGS = -D__GPTL_TIMING -D__GPTL_TIMING_PAPI
IFLAGS = -I$(TOPDIR)/include [...] -I$(HOME)/qe_profiling/include
[...]
LD_LIBS = -L$(HOME)/qe_profiling/lib -lgptl -lpapi
Now compile as usual with
make all
In addition to the standard QE timers, GPTL timers will also be active. This may add a small overhead, which is however negligible in standard runs. There are two ways to customize the details of the GPTL timers:
- using the configuration file gptl.conf with the namelist gptlnl shown here;
- only for PAPI events, with the environment variable QE_PAPI_EVENTS. Its value should be a set of comma-separated events available on the execution nodes; you can check the available events with $HOME/qe_profiling/bin/papi_avail.
Illustrative examples of both options are sketched right below.
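As an illustration only (the namelist key and the event names below are assumptions to be checked against your GPTL and PAPI versions):
# example gptl.conf: the patched code reads the gptlnl namelist from this file;
# print_method is an assumed GPTL option name, see the GPTL documentation
cat > gptl.conf << 'EOF'
&gptlnl
  print_method = 'full_tree'
/
EOF
# example run-time selection of PAPI events (event names are illustrative)
export QE_PAPI_EVENTS="PAPI_DP_OPS,PAPI_L2_DCM"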
After a successful run (a minimal launch sketch is given below), two files will appear in the current working directory:
timings.0
timings.summary
These files contain, respectively, the results obtained on the master MPI process and a summary of the results collected on all the MPI ranks.
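For reference, a run producing these files could be launched as sketched here; the launcher, executable path and input file name are illustrative, while the parallelization flags match the sample that follows:
mpirun -np 32 ./bin/pw.x -npool 8 -ntg 1 -ndiag 1 -inp pw.scf.in > pw.scf.out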
A sample timing file, collected with -npool 8, -ntg 1 and -ndiag 1, is shown below:
timings.0
---------
[...]
If a '%_of' field is present, it is w.r.t. the first timer for thread 0.
If a 'e6_per_sec' field is present, it is in millions of PAPI counts per sec.
A '*' in column 1 below means the timer had multiple parents, though the
values printed are for all calls.
Further down the listing may be more detailed information about multiple
parents. Look for 'Multiple parent info'
Process size=657 MB rss=157 MB
Stats for thread 0:
Called Recurse Wallclock max min %_of_PWSCF
PWSCF 1 - 25.697 25.697 25.697 100.00
init_run 1 - 7.157 7.157 7.157 27.85
init_vloc 1 - 0.867 0.867 0.867 3.37
init_us_1 1 - 1.379 1.379 1.379 5.37
init_at_1 1 - 0.190 0.190 0.190 0.74
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
potinit 1 - 1.513 1.513 1.513 5.89
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
* v_of_rho 6 - 1.218 0.214 0.197 4.74
v_xc 6 - 1.172 0.206 0.189 4.56
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
v_h 6 - 0.045 8.28e-03 7.26e-03 0.18
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
* PAW_pot 6 - 1.035 0.201 0.163 4.03
* interpolate 11 - 0.096 0.012 7.84e-03 0.37
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
ffts 11 - 0.014 1.67e-03 1.09e-03 0.06
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
* newd 6 - 3.372 0.593 0.550 13.12
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
wfcinit 1 - 0.442 0.442 0.442 1.72
* init_us_2 55 - 0.163 4.24e-03 2.79e-03 0.63
wfcinit:atomic 5 - 9.16e-03 1.85e-03 1.80e-03 0.04
atomic_wfc 5 - 8.98e-03 1.81e-03 1.77e-03 0.03
wfcinit:wfcrot 5 - 0.404 0.177 0.056 1.57
wfcrot 5 - 0.404 0.177 0.055 1.57
* h_psi_bgrp 96 - 3.336 0.055 2.57e-03 12.98
h_psi 96 - 3.333 0.055 2.55e-03 12.97
h_psi:pot 96 - 3.312 0.055 2.52e-03 12.89
vloc_psi 96 - 3.108 0.051 1.92e-03 12.10
* fftw 3995 - 2.996 6.75e-03 6.43e-04 11.66
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
h_psi:calbec 96 - 0.092 1.67e-03 3.10e-04 0.36
* calbec 121 - 0.118 3.56e-03 2.67e-04 0.46
add_vuspsi 96 - 0.107 2.21e-03 2.27e-04 0.41
* s_psi_bgrp 96 - 0.108 2.12e-03 2.43e-04 0.42
s_psi 96 - 0.100 2.04e-03 2.08e-04 0.39
* cdiaghg 91 - 0.538 0.117 9.29e-04 2.09
electrons 1 - 14.490 14.490 14.490 56.39
* v_of_rho 6 - 1.218 0.214 0.197 4.74
v_xc 6 - 1.172 0.206 0.189 4.56
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
v_h 6 - 0.045 8.28e-03 7.26e-03 0.18
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
* PAW_pot 6 - 1.035 0.201 0.163 4.03
* interpolate 11 - 0.096 0.012 7.84e-03 0.37
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
ffts 11 - 0.014 1.67e-03 1.09e-03 0.06
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
* newd 6 - 3.372 0.593 0.550 13.12
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
c_bands 5 - 4.909 1.147 0.805 19.10
* init_us_2 55 - 0.163 4.24e-03 2.79e-03 0.63
cegterg 25 - 4.031 0.221 0.135 15.69
* h_psi_bgrp 96 - 3.336 0.055 2.57e-03 12.98
h_psi 96 - 3.333 0.055 2.55e-03 12.97
h_psi:pot 96 - 3.312 0.055 2.52e-03 12.89
vloc_psi 96 - 3.108 0.051 1.92e-03 12.10
* fftw 3995 - 2.996 6.75e-03 6.43e-04 11.66
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
h_psi:calbec 96 - 0.092 1.67e-03 3.10e-04 0.36
* calbec 121 - 0.118 3.56e-03 2.67e-04 0.46
add_vuspsi 96 - 0.107 2.21e-03 2.27e-04 0.41
* s_psi_bgrp 96 - 0.108 2.12e-03 2.43e-04 0.42
s_psi 96 - 0.100 2.04e-03 2.08e-04 0.39
* cdiaghg 91 - 0.538 0.117 9.29e-04 2.09
cegterg:update 66 - 0.108 4.11e-03 4.15e-04 0.42
g_psi 66 - 0.042 1.15e-03 5.30e-05 0.16
cegterg:overlap 66 - 0.118 2.81e-03 9.52e-04 0.46
cegterg:last 25 - 0.027 1.37e-03 9.64e-04 0.10
sum_band 5 - 4.275 0.885 0.843 16.64
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
* interpolate 11 - 0.096 0.012 7.84e-03 0.37
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
ffts 11 - 0.014 1.67e-03 1.09e-03 0.06
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
* init_us_2 55 - 0.163 4.24e-03 2.79e-03 0.63
* fftw 3995 - 2.996 6.75e-03 6.43e-04 11.66
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
* calbec 121 - 0.118 3.56e-03 2.67e-04 0.46
sum_band:becsum 25 - 5.19e-03 1.31e-03 1.22e-04 0.02
addusdens 5 - 3.314 0.669 0.658 12.90
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
mix_rho 5 - 0.115 0.034 0.018 0.45
* fft 94 - 1.108 0.512 5.30e-03 4.31
* fft_scatter 4100 - 0.993 0.029 9.20e-05 3.86
* davcio 15 - 0.154 0.024 9.60e-05 0.60
* davcio 15 - 0.154 0.024 9.60e-05 0.60
Overhead sum = 0.0561 wallclock seconds
Total calls = 9612
[...]
A few notes on the table above:
- All pw.x timers started on the master MPI process are automatically printed. The simulation was performed with only 1 OpenMP thread and, since no timers appear within OpenMP parallel regions in QE, per-thread information is generally not very useful.
- min and max report the minimum and maximum time of the individual calls to the given timer.
- The total memory occupied by the process is reported at the beginning of the log.
As an example, let's check a few interesting timers for the same input file running on a different processor.
timings.0
---------
[...]
Stats for thread 0:
Called Recurse Wallclock max min %_of_PWSCF
PWSCF 1 - 7.672 7.672 7.672 100.00
init_run 1 - 1.152 1.152 1.152 15.01
[...]
wfcinit:wfcrot 5 - 0.111 0.024 0.021 1.45
[...]
Comparing the timer wfcinit:wfcrot reported above with the one from the previous report, the execution time is quite different: in this last case all 5 calls take roughly the same time, while in the first report the min and max columns differ by about a factor of 3.
Let's move to the summary file, i.e.:
timings.summary
---------------
Total ranks in communicator=32
nthreads on rank 0=1
'N' used for mean, std. dev. calcs.: 'ncalls'/'nthreads'
'ncalls': number of times the region was invoked across tasks and threads.
'nranks': number of ranks which invoked the region.
mean, std. dev: computed using per-rank max time across all threads on each rank
wallmax and wallmin: max, min time across tasks and threads.
name ncalls nranks mean_time std_dev wallmax (rank ) wallmin (rank )
PWSCF 32 32 25.714 0.014 25.733 ( 14) 25.680 ( 28)
init_run 32 32 7.149 0.005 7.157 ( 22) 7.143 ( 21)
init_vloc 32 32 0.860 0.006 0.868 ( 10) 0.851 ( 28)
init_us_1 32 32 1.375 0.023 1.405 ( 8) 1.339 ( 19)
init_at_1 32 32 0.187 0.001 0.190 ( 2) 0.187 ( 18)
fft 3008 32 1.159 0.087 1.450 ( 2) 1.085 ( 27)
fft_scatter 119160 32 0.913 0.188 1.489 ( 2) 0.747 ( 27)
potinit 32 32 1.499 0.006 1.513 ( 0) 1.494 ( 14)
v_of_rho 192 32 1.168 0.027 1.218 ( 0) 1.142 ( 30)
v_xc 192 32 1.127 0.025 1.173 ( 3) 1.102 ( 28)
v_h 192 32 0.040 0.003 0.045 ( 0) 0.037 ( 13)
PAW_pot 192 32 1.087 0.027 1.114 ( 24) 1.035 ( 0)
interpolate 352 32 0.085 0.004 0.096 ( 0) 0.080 ( 24)
ffts 352 32 0.015 0.001 0.017 ( 14) 0.013 ( 21)
newd 192 32 3.180 0.075 3.373 ( 1) 3.115 ( 28)
wfcinit 32 32 0.405 0.031 0.442 ( 1) 0.370 ( 27)
init_us_2 1584 32 0.139 0.016 0.163 ( 0) 0.121 ( 26)
wfcinit:atomic 144 32 0.008 0.001 0.009 ( 14) 0.007 ( 24)
atomic_wfc 144 32 0.008 0.001 0.009 ( 0) 0.007 ( 24)
wfcinit:wfcrot 144 32 0.375 0.029 0.412 ( 3) 0.337 ( 27)
wfcrot 144 32 0.375 0.029 0.411 ( 3) 0.337 ( 27)
h_psi_bgrp 2864 32 2.934 0.340 3.354 ( 2) 2.584 ( 27)
h_psi 2864 32 2.930 0.340 3.350 ( 2) 2.581 ( 27)
h_psi:pot 2864 32 2.911 0.338 3.329 ( 2) 2.563 ( 27)
vloc_psi 2864 32 2.729 0.319 3.130 ( 2) 2.394 ( 23)
fftw 115800 32 2.621 0.314 3.069 ( 2) 2.294 ( 23)
h_psi:calbec 2864 32 0.085 0.009 0.094 ( 13) 0.072 ( 27)
calbec 3584 32 0.108 0.011 0.125 ( 13) 0.091 ( 27)
add_vuspsi 2864 32 0.090 0.010 0.107 ( 0) 0.076 ( 27)
s_psi_bgrp 2864 32 0.094 0.011 0.109 ( 10) 0.083 ( 23)
s_psi 2864 32 0.086 0.010 0.100 ( 0) 0.075 ( 23)
cdiaghg 2720 32 0.518 0.047 0.594 ( 8) 0.455 ( 26)
electrons 32 32 14.427 0.026 14.490 ( 0) 14.376 ( 8)
c_bands 160 32 4.918 0.011 4.944 ( 17) 4.907 ( 27)
cegterg 720 32 3.594 0.405 4.051 ( 2) 3.161 ( 31)
cegterg:update 2000 32 0.093 0.011 0.112 ( 10) 0.079 ( 23)
g_psi 2000 32 0.037 0.004 0.042 ( 8) 0.032 ( 30)
cegterg:overlap 2000 32 0.114 0.013 0.135 ( 11) 0.097 ( 30)
cegterg:last 720 32 0.024 0.003 0.027 ( 10) 0.020 ( 28)
sum_band 160 32 4.263 0.006 4.275 ( 0) 4.252 ( 14)
sum_band:becsum 720 32 0.004 0.001 0.010 ( 9) 0.003 ( 16)
addusdens 160 32 3.074 0.092 3.317 ( 2) 3.022 ( 27)
mix_rho 160 32 0.117 0.004 0.122 ( 19) 0.108 ( 1)
davcio 464 32 0.077 0.036 0.190 ( 2) 0.053 ( 15)
In the above report, the total number of calls collected on all processes is shown. As a consequence, all clocks have at least 32 calls; this follows from how the clocks are defined in QE. For example, one would expect the cegterg set of timers to appear only 8 times (only the master of each pool runs the diagonalization), but the timer is started before the if statement that selects the subset of MPI processes actually performing the diagonalization, and it is stopped after the results have been received on all processes with an MPI call.
As already detailed above, it is also possible to collect processor events. Below is an example reporting PAPI_DP_OPS.
Stats for thread 0:
Called Recurse Usr sys usr+sys PAPI_DP_OPS e6_/_sec
PWSCF 1 - 6.690 0.870 7.560 2.0946851199e+10 0.00
init_run 1 - 1.030 0.100 1.130 3.3837314470e+09 0.00
init_vloc 1 - 0.090 0.000 0.090 4.9248796500e+08 0.00
init_us_1 1 - 0.180 0.000 0.180 4.4793199900e+08 0.00
init_at_1 1 - 0.020 0.000 0.020 4.9536002000e+07 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
potinit 1 - 0.260 0.020 0.280 6.8754499000e+08 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
* v_of_rho 6 - 0.410 0.020 0.430 5.1731972700e+08 0.00
v_xc 6 - 0.390 0.020 0.410 4.7782073300e+08 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
v_h 6 - 0.020 0.000 0.020 3.9498802000e+07 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
* PAW_pot 6 - 0.290 0.040 0.330 5.3928186200e+08 0.00
* interpolate 11 - 0.030 0.000 0.030 8.2871096000e+07 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
ffts 11 - 0.000 0.000 0.000 1.4295924000e+07 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
* newd 6 - 1.120 0.310 1.430 3.0941937420e+09 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
wfcinit 1 - 0.110 0.000 0.110 6.4585088800e+08 0.00
* init_us_2 55 - 0.040 0.000 0.040 7.4965638000e+07 0.00
wfcinit:atomic 5 - 0.000 0.000 0.000 5.1291010000e+06 0.00
atomic_wfc 5 - 0.000 0.000 0.000 5.1290010000e+06 0.00
wfcinit:wfcrot 5 - 0.110 0.000 0.110 6.3384544000e+08 0.00
wfcrot 5 - 0.110 0.000 0.110 6.3384534000e+08 0.00
electrons 1 - 5.110 0.720 5.830 1.7556956753e+10 0.00
* v_of_rho 6 - 0.410 0.020 0.430 5.1731972700e+08 0.00
v_xc 6 - 0.390 0.020 0.410 4.7782073300e+08 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
v_h 6 - 0.020 0.000 0.020 3.9498802000e+07 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
* PAW_pot 6 - 0.290 0.040 0.330 5.3928186200e+08 0.00
* interpolate 11 - 0.030 0.000 0.030 8.2871096000e+07 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
ffts 11 - 0.000 0.000 0.000 1.4295924000e+07 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
* newd 6 - 1.120 0.310 1.430 3.0941937420e+09 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
c_bands 5 - 1.800 0.060 1.860 9.7972284840e+09 0.00
* init_us_2 55 - 0.040 0.000 0.040 7.4965638000e+07 0.00
cegterg 25 - 1.530 0.060 1.590 9.2359865780e+09 0.00
h_psi_bgrp 91 - 1.130 0.030 1.160 5.2714834540e+09 0.00
s_psi_bgrp 91 - 0.030 0.000 0.030 9.3230422400e+08 0.00
cegterg:update 66 - 0.070 0.000 0.070 7.3185780000e+08 0.00
g_psi 66 - 0.020 0.000 0.020 1.3892604000e+07 0.00
cegterg:overlap 66 - 0.070 0.000 0.070 1.1791737520e+09 0.00
cdiaghg 86 - 0.120 0.010 0.130 3.0174008200e+08 0.00
cegterg:last 25 - 0.000 0.000 0.000 4.3691150600e+08 0.00
sum_band 5 - 1.660 0.350 2.010 4.1881319920e+09 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
* interpolate 11 - 0.030 0.000 0.030 8.2871096000e+07 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
ffts 11 - 0.000 0.000 0.000 1.4295924000e+07 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
* init_us_2 55 - 0.040 0.000 0.040 7.4965638000e+07 0.00
fftw 625 - 0.190 0.000 0.190 6.6385000000e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
calbec 25 - 0.020 0.000 0.020 3.6973454000e+08 0.00
sum_band:becsum 25 - 0.000 0.000 0.000 2.2321000000e+06 0.00
addusdens 5 - 1.270 0.340 1.610 2.9381309950e+09 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
mix_rho 5 - 0.050 0.010 0.060 6.6551539000e+07 0.00
* fft 86 - 0.230 0.010 0.240 5.3148298800e+08 0.00
* fft_scatter 665 - 0.120 0.010 0.130 5320 0.00
* davcio 15 - 0.000 0.000 0.000 120 0.00
* davcio 15 - 0.000 0.000 0.000 120 0.00
Overhead sum = 0.00244 wallclock seconds
Total calls = 2112
Total GPTL memory usage = 31.96 KB
Components:
Hashmem = 16.672 KB
Regionmem = 13.376 KB (papimem portion = 5.472 KB)
Parent/child arrays = 0.888 KB
Callstackmem = 1.024 KB
Thread mapping:
GPTLthreadid_omp[0] = 0
Compile libpfm
wget https://vorboss.dl.sourceforge.net/project/perfmon2/libpfm4/libpfm-4.8.0.tar.gz
tar xvzf libpfm-4.8.0.tar.gz
cd libpfm-4.8.0/
# adapt config.mk to your needs (and compile static libraries) or get it from this gist;
# a minimal sketch of the relevant settings is given after this command list
make
make install
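A minimal sketch of the config.mk settings for a static-only build installed under $HOME/qe_profiling; the variable names below are assumptions and may differ between libpfm versions, so verify them in your copy of config.mk:
PREFIX=$(HOME)/qe_profiling    # install prefix (assumed variable name)
CONFIG_PFMLIB_SHARED=n         # disable the shared library build (assumed variable name)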