@lcw
Last active May 23, 2020 01:28

GCM Tracer timings in CLIMA

We are experimenting with the number of tracers in ClimateMachine, starting from the ClimateMachine branch lcw/diff_nstate (commit f25ed2a9c93674ae27c3e0e5280c1aadab64807e). The following change is made to reduce the overall runtime of the simulation and to wrap the run in CUDAdrv.@profile.

diff --git a/experiments/AtmosGCM/heldsuarez.jl b/experiments/AtmosGCM/heldsuarez.jl
index e2d980ee1..a1537958b 100755
--- a/experiments/AtmosGCM/heldsuarez.jl
+++ b/experiments/AtmosGCM/heldsuarez.jl
@@ -1,6 +1,7 @@
 #!/usr/bin/env julia --project
 using ClimateMachine
 using ArgParse
+using CUDAdrv

 s = ArgParseSettings()
 @add_arg_table! s begin
@@ -208,7 +209,7 @@ function main()
     poly_order = 5                           # discontinuous Galerkin polynomial order
     n_horz = 5                               # horizontal element number
     n_vert = 5                               # vertical element number
-    n_days = 120                             # experiment day number
+    n_days = 0.1                             # experiment day number
     timestart = FT(0)                        # start time (s)
     timeend = FT(n_days * day(param_set))    # end time (s)

@@ -243,12 +244,14 @@ function main()
     end

     # Run the model
-    result = ClimateMachine.invoke!(
-        solver_config;
-        diagnostics_config = dgn_config,
-        user_callbacks = (cbfilter,),
-        check_euclidean_distance = true,
-    )
+    CUDAdrv.@profile begin
+        result = ClimateMachine.invoke!(
+            solver_config;
+            diagnostics_config = dgn_config,
+            user_callbacks = (cbfilter,),
+            check_euclidean_distance = true,
+        )
+    end
 end

 main()

For the table below, we vary the number of tracers and measure the wall-clock time per time step. We are running Julia v1.3.1 on an AWS instance with a V100 GPU and use the following loop (shown here with 2 ranks) to launch the runs that report the wall-clock time per time step.

for m = 2, n = 0:5
    number_of_ranks = m
    number_of_tracers = 2^n
    @info "Starting...." number_of_ranks number_of_tracers
    run(`mpirun -np $(m) julia --project experiments/AtmosGCM/heldsuarez.jl --monitor-timestep-duration 10steps --number-of-tracers $(2^n)`)
end

We get the following performance characteristics:

Ranks   Number of Tracers   Wall-Time Per Step (s)
                             before changes        after changes
------------------------------------------------------------------------
1        1                  5.8616807999999996e-03 5.7307617999999994e-03
1        2                  6.8119575000000002e-03 6.0563041999999994e-03
1        4                  8.2639034999999993e-03 6.8886373999999997e-03
1        8                  1.2300703499999999e-02 8.7006017000000012e-03
1       16                  1.3299454180000000e-01 1.4463497600000000e-02
1       32                  5.1850621520000006e-01 2.6536463500000003e-02
2        1                                         1.7634898700000002e-02
2        2                                         1.7925797700000003e-02
2        4                                         1.7990231900000000e-02
2        8                                         1.8047714700000002e-02
2       16                                         1.8218115100000001e-02
2       32                                         2.2956788700000001e-02

Each entry is a single, arbitrarily chosen time-per-time-step report from the corresponding run, not an average. The timings did bounce around a little from report to report.
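
To make the table easier to read, here is a minimal sketch that computes the before/after speedup and the cost relative to the single-tracer run for the 1-rank rows. The numbers are copied from the table; the variable names (`before`, `after`, `speedup`, `relative_cost`) are only illustrative and are not part of ClimateMachine.

```julia
# Wall-clock time per step (s) for the 1-rank rows of the table above.
tracers = [1, 2, 4, 8, 16, 32]
before  = [5.8616807999999996e-03, 6.8119575000000002e-03, 8.2639034999999993e-03,
           1.2300703499999999e-02, 1.3299454180000000e-01, 5.1850621520000006e-01]
after   = [5.7307617999999994e-03, 6.0563041999999994e-03, 6.8886373999999997e-03,
           8.7006017000000012e-03, 1.4463497600000000e-02, 2.6536463500000003e-02]

# Speedup of "after changes" relative to "before changes".
speedup = before ./ after

# Cost of each run relative to the single-tracer run, after the changes.
relative_cost = after ./ after[1]

for (n, s, r) in zip(tracers, speedup, relative_cost)
    println("tracers = ", lpad(n, 2),
            "  speedup = ", round(s; digits = 2),
            "  cost vs 1 tracer = ", round(r; digits = 2))
end
```

With these single-sample numbers the speedup is small up to 8 tracers and grows to roughly 9x at 16 tracers and roughly 19.5x at 32 tracers. Since each entry is one sample, the ratios should be read as rough indications only.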

Profiling

Below is a profile from a single GPU run with 2 tracers.

❯ /usr/local/cuda-9.0/bin/nvprof julia --project experiments/AtmosGCM/heldsuarez.jl --monitor-timestep-duration 10steps --number-of-tracers 2
==8410== NVPROF is profiling process 8410, command: julia --project experiments/AtmosGCM/heldsuarez.jl --monitor-timestep-duration 10steps --number-of-tracers 2
┌ Info: Model composition
│     param_set = EarthParameterSet()
│     orientation = SphericalOrientation()
│     ref_state = HydrostaticState{DecayingTemperatureProfile{Float32},Float32}(DecayingTemperatureProfile{Float32}(290.0f0, 220.0f0, 8484.2705f0), 0.0f0)
│     turbulence = SmagorinskyLilly{Float32}(0.21f0)
│     hyperdiffusion = StandardHyperDiffusion{Float32}(14400.0f0)
│     moisture = DryModel()
│     precipitation = NoPrecipitation()
│     radiation = NoRadiation()
│     source = (Gravity(), Coriolis(), held_suarez_forcing!, RayleighSponge{Float32}(30000.0f0, 12000.0f0, 0.0011111111f0, Float32[0.0, 0.0, 0.0], 2.0f0))
│     tracers = NTracers{2,Float32}(Float32[1.0, 2.0])
│     boundarycondition = AtmosBC{Impenetrable{FreeSlip},Insulating,Impermeable,ImpermeableTracer}(Impenetrable{FreeSlip}(FreeSlip()), Insulating(), Impermeable(), ImpermeableTracer())
│     init_state_conservative = init_heldsuarez!
└     data_config = HeldSuarezDataConfig{Float32}(255.0f0)
┌ Info: Establishing Atmos GCM configuration for HeldSuarez
│     precision        = Float32
│     polynomial order = 5
│     #horiz elems     = 5#vert elems      = 5
│     domain height    = 3.00e+04 m
│     MPI ranks        = 1
│     min(Δ_horz)      = 167863.59 m
└     min(Δ_vert)      = 703.00 m
[ Info: Initializing HeldSuarez
┌ Info: Starting HeldSuarez
│     dt              = 9.81818e+01
│     timeend         =  8640.00
│     number of steps = 88
└     norm(Q)         = 7.6343697997824000e+13
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    5.5339992551999995e+00
│    minimum (s) =    5.5339992551999995e+00
│    median  (s) =    5.5339992551999995e+00
└    std     (s) =                       NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    1.0385914899999999e-02
│    minimum (s) =    1.0385914899999999e-02
│    median  (s) =    1.0385914899999999e-02
└    std     (s) =                       NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    1.0211630600000000e-02
│    minimum (s) =    1.0211630600000000e-02
│    median  (s) =    1.0211630600000000e-02
└    std     (s) =                       NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    1.0108128700000000e-02
│    minimum (s) =    1.0108128700000000e-02
│    median  (s) =    1.0108128700000000e-02
└    std     (s) =                       NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    9.9687229000000009e-03
│    minimum (s) =    9.9687229000000009e-03
│    median  (s) =    9.9687229000000009e-03
└    std     (s) =                       NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    1.0053819799999999e-02
│    minimum (s) =    1.0053819799999999e-02
│    median  (s) =    1.0053819799999999e-02
└    std     (s) =                       NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    1.0057780700000001e-02
│    minimum (s) =    1.0057780700000001e-02
│    median  (s) =    1.0057780700000001e-02
└    std     (s) =                       NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    1.0062114200000000e-02
│    minimum (s) =    1.0062114200000000e-02
│    median  (s) =    1.0062114200000000e-02
└    std     (s) =                       NaN
┌ Info: Finished
│     norm(Q)            = 7.6379534131200000e+13
│     norm(Q) / norm(Q₀) = 1.0004694461822510e+00
└     norm(Q) - norm(Q₀) = 3.5836133376000000e+10
┌ Info: Euclidean distance
│     norm(Q - Qe)            = 4.3016309964800000e+11
└     norm(Q - Qe) / norm(Qe) = 5.6345746852457523e-03
==8410== Profiling application: julia --project experiments/AtmosGCM/heldsuarez.jl --monitor-timestep-duration 10steps --number-of-tracers 2
==8410== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   24.17%  213.23ms       178  1.1979ms  1.1633ms  1.3229ms  ptxcall_gpu_band_forward_kernel__25
                   20.83%  183.82ms       178  1.0327ms  1.0061ms  1.0921ms  ptxcall_gpu_band_back_kernel__26
                   10.80%  95.254ms       267  356.76us  342.05us  370.69us  ptxcall_gpu_interface_gradients__17
                    8.69%  76.704ms       267  287.28us  278.46us  312.86us  ptxcall_gpu_interface_tendency__23
                    8.57%  75.619ms       267  283.22us  258.62us  299.68us  ptxcall_gpu_volume_gradients__16
                    5.95%  52.507ms       267  196.65us  187.55us  214.30us  ptxcall_gpu_volume_tendency__22
                    5.80%  51.167ms         1  51.167ms  51.167ms  51.167ms  ptxcall_gpu_band_lu_kernel__14
                    3.50%  30.844ms        20  1.5422ms  1.6320us  6.6805ms  [CUDA memcpy DtoH]
                    2.86%  25.259ms       267  94.603us  93.247us  97.087us  ptxcall_gpu_interface_gradients_of_laplacians__21
                    2.07%  18.307ms       267  68.564us  67.263us  69.888us  ptxcall_gpu_interface_divergence_of_gradients__19
                    1.09%  9.5809ms       268  35.749us  33.280us  37.792us  ptxcall_gpu_kernel_nodal_update_auxiliary_state__6
                    0.81%  7.1712ms       267  26.858us  25.376us  29.056us  ptxcall_gpu_volume_divergence_of_gradients__18
                    0.75%  6.6457ms       267  24.890us  23.552us  26.688us  ptxcall_gpu_volume_gradients_of_laplacians__20
                    0.52%  4.5928ms       178  25.802us  24.480us  28.320us  ptxcall_anonymous25_27
                    0.50%  4.4443ms       150  29.628us  28.671us  38.368us  ptxcall_gpu_volume_tendency__10
                    0.47%  4.1114ms       150  27.409us  27.232us  27.648us  ptxcall_anonymous25_12
                    0.40%  3.5553ms        89  39.946us  39.072us  41.344us  ptxcall_gpu_stage_update__28
                    0.37%  3.2247ms       182  17.718us  15.296us  41.568us  ptxcall_copy_kernel__5
                    0.35%  3.0527ms        59  51.741us  1.4080us  960.35us  [CUDA memcpy HtoD]
                    0.31%  2.7744ms       150  18.496us  17.983us  19.936us  ptxcall_gpu_interface_tendency__11
                    0.30%  2.6561ms        89  29.843us  29.280us  30.592us  ptxcall_gpu_solution_update__29
                    0.28%  2.5039ms        89  28.133us  27.456us  29.440us  ptxcall_gpu_stage_update__24
                    0.19%  1.6869ms        89  18.954us  17.888us  20.639us  ptxcall_gpu_kernel_apply_filter__30
                    0.12%  1.0773ms       150  7.1820us  5.9200us  8.0000us  ptxcall_gpu_kernel_set_banded_matrix__13
                    0.09%  799.32us       150  5.3280us  4.8640us  6.5280us  ptxcall_gpu_kernel_set_banded_data__9
                    0.06%  517.66us         2  258.83us  254.94us  262.72us  ptxcall_mapreducedim_kernel_parallel_2
                    0.04%  323.52us         2  161.76us  153.89us  169.63us  ptxcall_gpu_kernel_min_neighbor_distance__1
                    0.03%  229.44us       158  1.4520us  1.3750us  2.1120us  [CUDA memset]
                    0.02%  209.09us         1  209.09us  209.09us  209.09us  ptxcall_mapreducedim_kernel_parallel_8
                    0.02%  159.74us         1  159.74us  159.74us  159.74us  ptxcall_gpu_kernel_init_state_auxiliary__4
                    0.01%  127.39us         3  42.464us  38.464us  47.520us  ptxcall_reduce_kernel_15
                    0.01%  79.359us         1  79.359us  79.359us  79.359us  ptxcall_gpu_kernel_min_neighbor_distance__3
                    0.01%  45.183us         1  45.183us  45.183us  45.183us  ptxcall_reduce_kernel_31
                    0.00%  18.912us         1  18.912us  18.912us  18.912us  ptxcall_gpu_kernel_local_courant__7
      API calls:   76.03%  1.62968s        31  52.570ms  311.95us  347.94ms  cuModuleLoadDataEx
                    9.46%  202.81ms         1  202.81ms  202.81ms  202.81ms  cuDevicePrimaryCtxRetain
                    5.50%  117.88ms     73834  1.5960us  1.3390us  554.10us  cuEventQuery
                    3.25%  69.764ms      4239  16.457us  11.149us  2.7691ms  cuLaunchKernel
                    1.80%  38.620ms        20  1.9310ms  41.680us  7.4451ms  cuMemcpyDtoH
                    0.81%  17.359ms      6758  2.5680us  1.6340us  18.214us  cuStreamWaitEvent
                    0.61%  13.100ms      3910  3.3500us  1.4250us  548.71us  cuStreamQuery
                    0.50%  10.745ms      6494  1.6540us     765ns  80.689us  cuEventCreate
                    0.39%  8.3102ms      6494  1.2790us     769ns  15.404us  cuEventRecord
                    0.34%  7.2781ms     10976     663ns     511ns  15.944us  cuCtxGetCurrent
                    0.28%  6.1039ms        53  115.17us  6.1060us  514.12us  cuMemAlloc
                    0.25%  5.4598ms      6494     840ns     618ns  12.691us  cuEventDestroy
                    0.19%  4.1454ms        59  70.260us  7.8060us  1.0051ms  cuMemcpyHtoD
                    0.19%  4.0027ms        31  129.12us  44.343us  461.31us  cuModuleUnload
                    0.16%  3.3973ms        22  154.42us  19.973us  2.8421ms  cuMemHostAlloc
                    0.13%  2.8140ms       158  17.809us  15.331us  60.788us  cuMemsetD32
                    0.03%  716.18us         1  716.18us  716.18us  716.18us  cuMemFree
                    0.02%  489.72us        17  28.807us  5.9450us  257.48us  cuStreamCreate
                    0.01%  250.53us        17  14.736us  6.8140us  49.941us  cuCtxSynchronize
                    0.01%  147.18us        17  8.6570us  4.0860us  26.257us  cuStreamDestroy
                    0.00%  80.993us        31  2.6120us  1.9100us  3.3070us  cuCtxPushCurrent
                    0.00%  66.736us        74     901ns     530ns  1.6470us  cuDeviceGetAttribute
                    0.00%  54.457us        31  1.7560us  1.3620us  7.0930us  cuModuleGetFunction
                    0.00%  51.857us        31  1.6720us  1.1190us  2.6930us  cuModuleGetGlobal
                    0.00%  48.341us        22  2.1970us  1.1900us  7.4510us  cuMemHostGetDevicePointer
                    0.00%  35.389us        34  1.0400us     679ns  5.3470us  cuCtxGetDevice
                    0.00%  25.389us        31     819ns     718ns  1.3540us  cuCtxPopCurrent
                    0.00%  8.0520us         3  2.6840us  1.5450us  3.4050us  cuFuncGetAttribute
                    0.00%  6.9880us         1  6.9880us  6.9880us  6.9880us  cuInit
                    0.00%  6.6200us         5  1.3240us     549ns  2.5390us  cuDriverGetVersion
                    0.00%  5.2570us         4  1.3140us     533ns  2.0740us  cuDeviceGetCount
                    0.00%  4.9140us         1  4.9140us  4.9140us  4.9140us  cuCtxSetCurrent
                    0.00%  2.8130us         2  1.4060us  1.3840us  1.4290us  cuDeviceGet
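
Rather than picking one report by hand, the per-step reports in a log like the one above could be summarized with a short script. The following is only a sketch: `log_text` holds a few lines copied from the log, and dropping the first report rests on the assumption that it includes one-time startup costs (for example compilation), which the log itself does not state.

```julia
using Statistics

# A few "median" lines copied from the log above; in practice `log_text`
# would hold the captured output of a whole run.
log_text = """
│    median  (s) =    5.5339992551999995e+00
│    median  (s) =    1.0385914899999999e-02
│    median  (s) =    1.0211630600000000e-02
│    median  (s) =    1.0108128700000000e-02
│    median  (s) =    9.9687229000000009e-03
"""

# Extract every reported per-step time.
times = [parse(Float64, m.captures[1])
         for m in eachmatch(r"median\s+\(s\)\s+=\s+([0-9.eE+-]+)", log_text)]

# Drop the first report, which is far larger than the rest (assumption:
# it includes one-time startup costs and is not representative).
steady = times[2:end]

println("reports    = ", length(times))
println("median (s) = ", median(steady))
println("spread (s) = ", maximum(steady) - minimum(steady))
```

Note that these timings were taken under nvprof, so they are not directly comparable to the unprofiled numbers in the table.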

And below is a profile from a single GPU run with 32 tracers.

❯ /usr/local/cuda-9.0/bin/nvprof julia --project experiments/AtmosGCM/heldsuarez.jl --monitor-timestep-duration 10steps --number-of-tracers 32
==8630== NVPROF is profiling process 8630, command: julia --project experiments/AtmosGCM/heldsuarez.jl --monitor-timestep-duration 10steps --number-of-tracers 32
┌ Info: Model composition
│     param_set = EarthParameterSet()
│     orientation = SphericalOrientation()
│     ref_state = HydrostaticState{DecayingTemperatureProfile{Float32},Float32}(DecayingTemperatureProfile{Float32}(290.0f0, 220.0f0, 8484.2705f0), 0.0f0)
│     turbulence = SmagorinskyLilly{Float32}(0.21f0)
│     hyperdiffusion = StandardHyperDiffusion{Float32}(14400.0f0)
│     moisture = DryModel()
│     precipitation = NoPrecipitation()
│     radiation = NoRadiation()
│     source = (Gravity(), Coriolis(), held_suarez_forcing!, RayleighSponge{Float32}(30000.0f0, 12000.0f0, 0.0011111111f0, Float32[0.0, 0.0, 0.0], 2.0f0))
│     tracers = NTracers{32,Float32}(Float32[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0])
│     boundarycondition = AtmosBC{Impenetrable{FreeSlip},Insulating,Impermeable,ImpermeableTracer}(Impenetrable{FreeSlip}(FreeSlip()), Insulating(), Impermeable(), ImpermeableTracer())
│     init_state_conservative = init_heldsuarez!
└     data_config = HeldSuarezDataConfig{Float32}(255.0f0)
┌ Info: Establishing Atmos GCM configuration for HeldSuarez
│     precision        = Float32
│     polynomial order = 5
│     #horiz elems     = 5#vert elems      = 5
│     domain height    = 3.00e+04 m
│     MPI ranks        = 1
│     min(Δ_horz)      = 167863.59 m
└     min(Δ_vert)      = 703.00 m
[ Info: Initializing HeldSuarez
┌ Info: Starting HeldSuarez
│     dt              = 9.81818e+01
│     timeend         =  8640.00
│     number of steps = 88
└     norm(Q)         = 7.6345023397888000e+13
┌ Info: Update
│     simtime =    98.18 /  8640.00
│     runtime = 00:01:33
└     norm(Q) = 7.6345367330816000e+13
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    9.5916184646999998e+00
│    minimum (s) =    9.5916184646999998e+00
│    median  (s) =    9.5916184646999998e+00
└    std     (s) =                       NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    3.0826927699999999e-02
│    minimum (s) =    3.0826927699999999e-02
│    median  (s) =    3.0826927699999999e-02
└    std     (s) =                       NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    2.9374440600000003e-02
│    minimum (s) =    2.9374440600000003e-02
│    median  (s) =    2.9374440600000003e-02
└    std     (s) =                       NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    2.9341548200000001e-02
│    minimum (s) =    2.9341548200000001e-02
│    median  (s) =    2.9341548200000001e-02
└    std     (s) =                       NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    3.0509419700000002e-02
│    minimum (s) =    3.0509419700000002e-02
│    median  (s) =    3.0509419700000002e-02
└    std     (s) =                       NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    2.9288231099999999e-02
│    minimum (s) =    2.9288231099999999e-02
│    median  (s) =    2.9288231099999999e-02
└    std     (s) =                       NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    2.9377836400000003e-02
│    minimum (s) =    2.9377836400000003e-02
│    median  (s) =    2.9377836400000003e-02
└    std     (s) =                       NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    2.9326005700000001e-02
│    minimum (s) =    2.9326005700000001e-02
│    median  (s) =    2.9326005700000001e-02
└    std     (s) =                       NaN
┌ Info: Finished
│     norm(Q)            = 7.6380867919872000e+13
│     norm(Q) / norm(Q₀) = 1.0004695653915405e+00
└     norm(Q) - norm(Q₀) = 3.5844521984000000e+10
┌ Info: Euclidean distance
│     norm(Q - Qe)            = 4.2996259225600000e+11
└     norm(Q - Qe) / norm(Qe) = 5.6318440474569798e-03
==8630== Profiling application: julia --project experiments/AtmosGCM/heldsuarez.jl --monitor-timestep-duration 10steps --number-of-tracers 32
==8630== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   25.95%  695.69ms       267  2.6056ms  2.5800ms  2.7157ms  ptxcall_gpu_interface_tendency__23
                   19.54%  523.92ms       267  1.9623ms  1.9260ms  2.1054ms  ptxcall_gpu_volume_gradients__16
                   14.65%  392.67ms       267  1.4707ms  1.4535ms  1.5205ms  ptxcall_gpu_interface_gradients__17
                   13.43%  360.04ms       267  1.3485ms  1.3212ms  1.3967ms  ptxcall_gpu_volume_tendency__22
                    7.83%  209.89ms       178  1.1792ms  1.1660ms  1.2611ms  ptxcall_gpu_band_forward_kernel__25
                    6.71%  179.89ms       178  1.0106ms  998.62us  1.0796ms  ptxcall_gpu_band_back_kernel__26
                    2.14%  57.354ms        21  2.7311ms  1.5360us  18.825ms  [CUDA memcpy DtoH]
                    1.91%  51.307ms         1  51.307ms  51.307ms  51.307ms  ptxcall_gpu_band_lu_kernel__14
                    0.94%  25.221ms       267  94.459us  93.279us  97.023us  ptxcall_gpu_interface_gradients_of_laplacians__21
                    0.86%  22.977ms       268  85.734us  84.256us  87.328us  ptxcall_gpu_kernel_nodal_update_auxiliary_state__6
                    0.78%  20.971ms       178  117.81us  116.26us  135.17us  ptxcall_anonymous25_27
                    0.75%  20.140ms       150  134.27us  134.02us  134.75us  ptxcall_anonymous25_12
                    0.70%  18.751ms        89  210.69us  209.63us  211.84us  ptxcall_gpu_stage_update__28
                    0.68%  18.288ms       267  68.492us  67.135us  70.591us  ptxcall_gpu_interface_divergence_of_gradients__19
                    0.51%  13.778ms        89  154.80us  154.14us  155.71us  ptxcall_gpu_stage_update__24
                    0.51%  13.585ms        89  152.64us  151.87us  153.28us  ptxcall_gpu_solution_update__29
                    0.51%  13.567ms       182  74.546us  73.056us  107.10us  ptxcall_copy_kernel__5
                    0.37%  9.9749ms        89  112.08us  109.98us  123.10us  ptxcall_gpu_kernel_apply_filter__30
                    0.27%  7.1531ms       267  26.790us  25.376us  28.672us  ptxcall_gpu_volume_divergence_of_gradients__18
                    0.25%  6.7055ms        59  113.65us  1.4080us  2.3613ms  [CUDA memcpy HtoD]
                    0.24%  6.5529ms       267  24.542us  23.552us  26.848us  ptxcall_gpu_volume_gradients_of_laplacians__20
                    0.18%  4.8596ms       150  32.397us  31.231us  40.160us  ptxcall_gpu_volume_tendency__10
                    0.11%  2.8805ms       150  19.203us  18.399us  21.504us  ptxcall_gpu_interface_tendency__11
                    0.04%  1.1954ms       150  7.9690us  6.5600us  9.2160us  ptxcall_gpu_kernel_set_banded_matrix__13
                    0.03%  933.69us       150  6.2240us  5.8240us  6.9760us  ptxcall_gpu_kernel_set_banded_data__9
                    0.03%  856.03us         4  214.01us  200.54us  227.97us  ptxcall_reduce_kernel_15
                    0.02%  524.13us         2  262.06us  254.46us  269.66us  ptxcall_mapreducedim_kernel_parallel_2
                    0.01%  323.58us         2  161.79us  155.20us  168.38us  ptxcall_gpu_kernel_min_neighbor_distance__1
                    0.01%  245.28us         1  245.28us  245.28us  245.28us  ptxcall_gpu_kernel_init_state_auxiliary__4
                    0.01%  231.97us       159  1.4580us  1.3760us  2.8160us  [CUDA memset]
                    0.01%  212.03us         1  212.03us  212.03us  212.03us  ptxcall_reduce_kernel_31
                    0.01%  210.49us         1  210.49us  210.49us  210.49us  ptxcall_mapreducedim_kernel_parallel_8
                    0.00%  83.040us         1  83.040us  83.040us  83.040us  ptxcall_gpu_kernel_min_neighbor_distance__3
                    0.00%  18.624us         1  18.624us  18.624us  18.624us  ptxcall_gpu_kernel_local_courant__7
      API calls:   74.34%  3.20947s        31  103.53ms  307.36us  822.88ms  cuModuleLoadDataEx
                   15.11%  652.31ms    415374  1.5700us  1.1730us  573.33us  cuEventQuery
                    4.53%  195.37ms         1  195.37ms  195.37ms  195.37ms  cuDevicePrimaryCtxRetain
                    1.65%  71.253ms      4240  16.805us  10.758us  3.8295ms  cuLaunchKernel
                    1.64%  70.886ms        21  3.3755ms  43.558us  19.948ms  cuMemcpyDtoH
                    0.60%  25.794ms        17  1.5173ms  7.2710us  3.7073ms  cuCtxSynchronize
                    0.41%  17.799ms      6758  2.6330us  1.5980us  544.43us  cuStreamWaitEvent
                    0.34%  14.529ms      4964  2.9260us  1.4640us  26.409us  cuStreamQuery
                    0.26%  11.093ms      6494  1.7080us     747ns  125.90us  cuEventCreate
                    0.19%  8.0869ms      6494  1.2450us     781ns  15.386us  cuEventRecord
                    0.18%  7.8925ms        59  133.77us  8.4800us  2.4269ms  cuMemcpyHtoD
                    0.17%  7.1780ms     10978     653ns     509ns  16.014us  cuCtxGetCurrent
                    0.16%  7.0254ms      6494  1.0810us     623ns  16.530us  cuEventDestroy
                    0.14%  6.2573ms        53  118.06us  6.3470us  521.02us  cuMemAlloc
                    0.10%  4.1826ms        31  134.92us  45.632us  491.68us  cuModuleUnload
                    0.08%  3.3178ms        22  150.81us  19.702us  2.7680ms  cuMemHostAlloc
                    0.07%  2.8385ms       159  17.852us  14.124us  62.253us  cuMemsetD32
                    0.02%  743.41us         2  371.71us  41.525us  701.89us  cuMemFree
                    0.01%  480.70us        17  28.276us  6.2170us  260.83us  cuStreamCreate
                    0.00%  149.65us        17  8.8020us  4.0660us  31.065us  cuStreamDestroy
                    0.00%  76.700us        31  2.4740us  1.8700us  3.3520us  cuCtxPushCurrent
                    0.00%  76.556us        74  1.0340us     532ns  4.6820us  cuDeviceGetAttribute
                    0.00%  51.206us        31  1.6510us  1.0450us  2.3150us  cuModuleGetGlobal
                    0.00%  48.767us        31  1.5730us  1.3080us  2.0740us  cuModuleGetFunction
                    0.00%  44.572us        22  2.0260us  1.2210us  6.8370us  cuMemHostGetDevicePointer
                    0.00%  28.014us        34     823ns     662ns  1.1920us  cuCtxGetDevice
                    0.00%  24.982us        31     805ns     715ns  1.3550us  cuCtxPopCurrent
                    0.00%  8.1770us         1  8.1770us  8.1770us  8.1770us  cuInit
                    0.00%  7.6880us         3  2.5620us  1.3500us  3.2670us  cuFuncGetAttribute
                    0.00%  7.4370us         5  1.4870us     566ns  3.1340us  cuDriverGetVersion
                    0.00%  5.5980us         4  1.3990us     592ns  1.9070us  cuDeviceGetCount
                    0.00%  4.6710us         1  4.6710us  4.6710us  4.6710us  cuCtxSetCurrent
                    0.00%  2.6270us         2  1.3130us  1.2830us  1.3440us  cuDeviceGet
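
To see which kernels account for the extra time, the average time per call can be compared between the two profiles. This is a small sketch using the "Avg" column of the two nvprof tables above for the six most expensive kernels (the `ptxcall_` prefix is dropped); the layout of `avg_us` is just an illustrative choice.

```julia
# Average time per call (microseconds), copied from the two nvprof tables above.
# Each entry is kernel => (avg with 2 tracers, avg with 32 tracers).
avg_us = [
    "gpu_interface_tendency__23"  => (287.28, 2605.6),
    "gpu_volume_gradients__16"    => (283.22, 1962.3),
    "gpu_interface_gradients__17" => (356.76, 1470.7),
    "gpu_volume_tendency__22"     => (196.65, 1348.5),
    "gpu_band_forward_kernel__25" => (1197.9, 1179.2),
    "gpu_band_back_kernel__26"    => (1032.7, 1010.6),
]

# Growth factor of each kernel when going from 2 to 32 tracers.
growth = [name => t32 / t2 for (name, (t2, t32)) in avg_us]

# Print the kernels from largest to smallest growth factor.
for (name, g) in sort(growth; by = last, rev = true)
    println(rpad(name, 30), round(g; digits = 2), "x")
end
```

In these two profiles the tendency and gradient kernels grow by roughly 4x to 9x per call, while the two band kernels take essentially the same time per call with 2 and 32 tracers.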

Here is the output of the detailed kernel metrics run for 32 tracers:

❯ /usr/local/cuda-9.0/bin/nvprof --profile-from-start off --metrics all julia --project experiments/AtmosGCM/heldsuarez.jl --monitor-timestep-duration 10steps --number-of-tracers 32
==8711== NVPROF is profiling process 8711, command: julia --project experiments/AtmosGCM/heldsuarez.jl --monitor-timestep-duration 10steps --number-of-tracers 32
┌ Warning: You are using CUDNN 7.6.2 for CUDA 10.1.0 with CUDA toolkit 9.0.176; these might be incompatible.
└ @ CuArrays ~/.julia/packages/CuArrays/A6GUx/src/CuArrays.jl:128
[1590103948.401436] [ip-172-31-35-196:8711 :0]         parser.c:1310 UCX  WARN  unused env variable: UCX_MEMTYPE_CACHE (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
┌ Info: Model composition
│     param_set = EarthParameterSet()
│     orientation = SphericalOrientation()
│     ref_state = HydrostaticState{DecayingTemperatureProfile{Float32},Float32}(DecayingTemperatureProfile{Float32}(290.0f0, 220.0f0, 8484.2705f0), 0.0f0)
│     turbulence = SmagorinskyLilly{Float32}(0.21f0)
│     hyperdiffusion = StandardHyperDiffusion{Float32}(14400.0f0)
│     moisture = DryModel()
│     precipitation = NoPrecipitation()
│     radiation = NoRadiation()
│     source = (Gravity(), Coriolis(), held_suarez_forcing!, RayleighSponge{Float32}(30000.0f0, 12000.0f0, 0.0011111111f0, Float32[0.0, 0.0, 0.0], 2.0f0))
│     tracers = NTracers{32,Float32}(Float32[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0])
│     boundarycondition = AtmosBC{Impenetrable{FreeSlip},Insulating,Impermeable,ImpermeableTracer}(Impenetrable{FreeSlip}(FreeSlip()), Insulating(), Impermeable(), ImpermeableTracer())
│     init_state_conservative = init_heldsuarez!
└     data_config = HeldSuarezDataConfig{Float32}(255.0f0)
┌ Info: Establishing Atmos GCM configuration for HeldSuarez
│     precision        = Float32
│     polynomial order = 5
│     #horiz elems     = 5#vert elems      = 5
│     domain height    = 3.00e+04 m
│     MPI ranks        = 1
│     min(Δ_horz)      = 167863.59 m
└     min(Δ_vert)      = 703.00 m
[ Info: Initializing HeldSuarez
┌ Warning: Calling CUDAdrv.@profile only informs an external profiler to start.
│ The user is responsible for launching Julia under a CUDA profiler like `nvprof`.
│
│ For improved usability, launch Julia under the Nsight Systems profiler:
│ $ nsys launch -t cuda,cublas,cudnn,nvtx julia
└ @ CUDAdrv.Profile ~/.julia/packages/CUDAdrv/b1mvw/src/profile.jl:42
==8711== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Starting HeldSuarez
│     dt              = 9.81818e+01
│     timeend         =  8640.00
│     number of steps = 88
└     norm(Q)         = 7.6344847237120000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│     simtime =    98.18 /  8640.00
│     runtime = 00:01:48
└     norm(Q) = 7.6345182781440000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│     simtime =   490.91 /  8640.00
│     runtime = 00:02:49
└     norm(Q) = 7.6346701119488000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│     simtime =   883.64 /  8640.00
│     runtime = 00:03:56
└     norm(Q) = 7.6348211068928000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    2.5658203417399999e+01
│    minimum (s) =    2.5658203417399999e+01
│    median  (s) =    2.5658203417399999e+01
└    std     (s) =                       NaN
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│     simtime =  1276.36 /  8640.00
│     runtime = 00:05:08
└     norm(Q) = 7.6349737795584000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│     simtime =  1669.09 /  8640.00
│     runtime = 00:06:24
└     norm(Q) = 7.6351298076672000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│     simtime =  1963.64 /  8640.00
│     runtime = 00:07:25
└     norm(Q) = 7.6352464093184000e+13
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    1.9160640540700001e+01
│    minimum (s) =    1.9160640540700001e+01
│    median  (s) =    1.9160640540700001e+01
└    std     (s) =                       NaN
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│     simtime =  2258.18 /  8640.00
│     runtime = 00:08:28
└     norm(Q) = 7.6353655275520000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│     simtime =  2552.73 /  8640.00
│     runtime = 00:09:34
└     norm(Q) = 7.6354838069248000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│     simtime =  2847.27 /  8640.00
│     runtime = 00:10:43
└     norm(Q) = 7.6356029251584000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    2.2144820997299998e+01
│    minimum (s) =    2.2144820997299998e+01
│    median  (s) =    2.2144820997299998e+01
└    std     (s) =                       NaN
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│     simtime =  3141.82 /  8640.00
│     runtime = 00:11:55
└     norm(Q) = 7.6357220433920000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│     simtime =  3436.36 /  8640.00
│     runtime = 00:13:11
└     norm(Q) = 7.6358436782080000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│     simtime =  3730.91 /  8640.00
│     runtime = 00:14:29
└     norm(Q) = 7.6359644741632000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    2.5581604319499998e+01
│    minimum (s) =    2.5581604319499998e+01
│    median  (s) =    2.5581604319499998e+01
└    std     (s) =                       NaN
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│     simtime =  4025.46 /  8640.00
│     runtime = 00:15:50
└     norm(Q) = 7.6360852701184000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│     simtime =  4320.00 /  8640.00
│     runtime = 00:17:14
└     norm(Q) = 7.6362077437952000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│     simtime =  4614.55 /  8640.00
│     runtime = 00:18:41
└     norm(Q) = 7.6363293786112000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│     simtime =  4909.09 /  8640.00
│     runtime = 00:20:10
└     norm(Q) = 7.6364510134272000e+13
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│    maximum (s) =    2.8795298310000000e+01
│    minimum (s) =    2.8795298310000000e+01
│    median  (s) =    2.8795298310000000e+01
└    std     (s) =                       NaN
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (22 of 22)...
        l2_subp0_write_tex_hit_sectors
        l2_subp1_write_tex_hit_sectors
        7 internal events
^C
signal (2): Interrupt
in expression starting at /home/lucas/research/code/ClimateMachine.jl/experiments/AtmosGCM/heldsuarez.jl:257
unknown function (ip: 0x7f14dfa5d12c)
unknown function (ip: 0x7ffd8f48e13f)
unknown function (ip: 0x7f14df785a0a)
unknown function (ip: 0x88)
unknown function (ip: 0xffffffffffffffff)
Allocations: 579745084 (Pool: 579653498; Big: 91586); GC: 437
==8711== Profiling application: julia --project experiments/AtmosGCM/heldsuarez.jl --monitor-timestep-duration 10steps --number-of-tracers 32
==8711== Profiling result:
==8711== Metric result:
Invocations                               Metric Name                                    Metric Description         Min         Max         Avg
Device "Tesla V100-SXM2-16GB (0)"
    Kernel: ptxcall_anonymous25_27
        103                             inst_per_warp                                 Instructions per warp  652.978401  652.978401  652.978401
        103                         branch_efficiency                                     Branch Efficiency     100.00%     100.00%     100.00%
        103                 warp_execution_efficiency                             Warp Execution Efficiency     100.00%     100.00%     100.00%
        103         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency      95.80%      95.80%      95.80%
        103                      inst_replay_overhead                           Instruction Replay Overhead    0.000371    0.000963    0.000583
        103      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    0.000000    0.000000    0.000000
        103     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    0.000000    0.000000    0.000000
        103       local_load_transactions_per_request            Local Memory Load Transactions Per Request    0.000000    0.000000    0.000000
        103      local_store_transactions_per_request           Local Memory Store Transactions Per Request    0.000000    0.000000    0.000000
        103              gld_transactions_per_request                  Global Load Transactions Per Request    3.999989    3.999989    3.999989
        103              gst_transactions_per_request                 Global Store Transactions Per Request    3.999989    3.999989    3.999989
        103                 shared_store_transactions                             Shared Store Transactions           0           0           0
        103                  shared_load_transactions                              Shared Load Transactions           0           0           0
        103                   local_load_transactions                               Local Load Transactions           0           0           0
        103                  local_store_transactions                              Local Store Transactions           0           0           0
        103                          gld_transactions                              Global Load Transactions     1498500     1498500     1498500
        103                          gst_transactions                             Global Store Transactions      749250      749250      749250
        103                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
        103                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
        103                      l2_read_transactions                                  L2 Read Transactions     1498596     1499604     1499040
        103                     l2_write_transactions                                 L2 Write Transactions      749287      775859      758273
        103                    dram_read_transactions                       Device Memory Read Transactions     1498506     1498922     1498591
        103                   dram_write_transactions                      Device Memory Write Transactions      736920      767566      753743
        103                           global_hit_rate                     Global Hit Rate in unified l1/tex      33.09%      33.31%      33.22%
        103                            local_hit_rate                                        Local Hit Rate       0.00%       0.00%       0.00%
        103                  gld_requested_throughput                      Requested Global Load Throughput  329.94GB/s  383.44GB/s  360.17GB/s
        103                  gst_requested_throughput                     Requested Global Store Throughput  164.97GB/s  191.72GB/s  180.09GB/s
        103                            gld_throughput                                Global Load Throughput  329.94GB/s  383.44GB/s  360.17GB/s
        103                            gst_throughput                               Global Store Throughput  164.97GB/s  191.72GB/s  180.09GB/s
        103                     local_memory_overhead                                 Local Memory Overhead      33.11%      33.32%      33.22%
        103                        tex_cache_hit_rate                                Unified Cache Hit Rate       0.00%       0.00%       0.00%
        103                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)       0.00%       0.00%       0.00%
        103                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)     100.00%     100.00%     100.00%
        103                      dram_read_throughput                         Device Memory Read Throughput  329.97GB/s  383.48GB/s  360.19GB/s
        103                     dram_write_throughput                        Device Memory Write Throughput  162.28GB/s  195.81GB/s  181.17GB/s
        103                      tex_cache_throughput                        Unified cache to SM throughput  494.91GB/s  575.16GB/s  540.26GB/s
        103                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  329.94GB/s  383.44GB/s  360.17GB/s
        103                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  164.97GB/s  191.72GB/s  180.09GB/s
        103                        l2_read_throughput                                 L2 Throughput (Reads)  330.16GB/s  383.71GB/s  360.30GB/s
        103                       l2_write_throughput                                L2 Throughput (Writes)  164.98GB/s  197.51GB/s  182.25GB/s
        103                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        103                   sysmem_write_throughput                        System Memory Write Throughput  1.1273MB/s  1.3101MB/s  1.2306MB/s
        103                     local_load_throughput                          Local Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        103                    local_store_throughput                         Local Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        103                    shared_load_throughput                         Shared Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        103                   shared_store_throughput                        Shared Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        103                            gld_efficiency                         Global Memory Load Efficiency     100.00%     100.00%     100.00%
        103                            gst_efficiency                        Global Memory Store Efficiency     100.00%     100.00%     100.00%
        103                    tex_cache_transactions                      Unified cache to SM transactions      561946      561946      561946
        103                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
        103                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
        103                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
        103                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
        103                             flop_count_sp           Floating Point Operations(Single Precision)     5994000     5994000     5994000
        103                         flop_count_sp_add       Floating Point Operations(Single Precision Add)     5994000     5994000     5994000
        103                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)           0           0           0
        103                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)           0           0           0
        103                     flop_count_sp_special   Floating Point Operations(Single Precision Special)    11988000    11988000    11988000
        103                             inst_executed                                 Instructions Executed    42332892   122315914    81936135
        103                               inst_issued                                   Instructions Issued    42348613    42373678    42358014
        103                          dram_utilization                             Device Memory Utilization     Mid (6)    High (7)     Mid (6)
        103                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
        103                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       2.03%       2.68%       2.30%
        103                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)      16.07%      16.62%      16.35%
        103                   stall_memory_dependency                    Issue Stall Reasons (Data Request)      19.29%      25.51%      23.00%
        103                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.00%       0.00%
        103                                stall_sync                 Issue Stall Reasons (Synchronization)       0.00%       0.00%       0.00%
        103                               stall_other                           Issue Stall Reasons (Other)       7.77%       8.75%       8.18%
        103          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)       0.18%       0.62%       0.37%
        103                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)      20.92%      23.41%      21.85%
        103                         shared_efficiency                              Shared Memory Efficiency       0.00%       0.00%       0.00%
        103                                inst_fp_32                               FP Instructions(Single)    17982000    17982000    17982000
        103                                inst_fp_64                               FP Instructions(Double)           0           0           0
        103                              inst_integer                                  Integer Instructions   965417386   965417386   965417386
        103                          inst_bit_convert                              Bit-Convert Instructions    23976000    23976000    23976000
        103                              inst_control                             Control-Flow Instructions    53946240    53946240    53946240
        103                        inst_compute_ld_st                               Load/Store Instructions    53946000    53946000    53946000
        103                                 inst_misc                                     Misc Instructions   167832480   167832480   167832480
        103           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
        103                               issue_slots                                           Issue Slots    42348613    42373678    42358014
        103                                 cf_issued                      Issued Control-Flow Instructions     3558961     3558961     3558961
        103                               cf_executed                    Executed Control-Flow Instructions     3558961     3558961     3558961
        103                               ldst_issued                        Issued Load/Store Instructions      936579      936579      936579
        103                             ldst_executed                      Executed Load/Store Instructions      936579      936579      936579
        103                       atomic_transactions                                   Atomic Transactions           0           0           0
        103           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
        103                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
        103                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
        103                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)     1498500     1498500     1498500
        103                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)       0.63%       0.65%       0.64%
        103                        stall_not_selected                    Issue Stall Reasons (Not Selected)      25.93%      29.23%      27.32%
        103                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)      749250      749250      749250
        103             nvlink_total_data_transmitted                         NVLink Total Data Transmitted        1152        1152        1152
        103                nvlink_total_data_received                            NVLink Total Data Received         864         864         864
        103              nvlink_user_data_transmitted                          NVLink User Data Transmitted           0           0           0
        103                 nvlink_user_data_received                             NVLink User Data Received           0           0           0
        103          nvlink_overhead_data_transmitted                      NVLink Overhead Data Transmitted       1.00%       1.00%       1.00%
        103             nvlink_overhead_data_received                         NVLink Overhead Data Received       1.00%       1.00%       1.00%
        103      nvlink_total_nratom_data_transmitted                  NVLink Total Nratom Data Transmitted           0           0           0
        103       nvlink_user_nratom_data_transmitted                   NVLink User Nratom Data Transmitted           0           0           0
        103       nvlink_total_ratom_data_transmitted                   NVLink Total Ratom Data Transmitted           0           0           0
        103        nvlink_user_ratom_data_transmitted                    NVLink User Ratom Data Transmitted           0           0           0
        103       nvlink_total_write_data_transmitted                   NVLink Total Write Data Transmitted           0           0           0
        103        nvlink_user_write_data_transmitted                    NVLink User Write Data Transmitted           0           0           0
        103                nvlink_transmit_throughput                            NVLink Transmit Throughput  8.1166MB/s  9.4328MB/s  8.8604MB/s
        103                 nvlink_receive_throughput                             NVLink Receive Throughput  6.0875MB/s  7.0746MB/s  6.6453MB/s
        103       nvlink_total_response_data_received                   NVLink Total Response Data Received         288         288         288
        103        nvlink_user_response_data_received                    NVLink User Response Data Received           0           0           0
        103                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
        103                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
        103                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
        103                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
        103                                inst_fp_16                                 HP Instructions(Half)           0           0           0
        103                                       ipc                                          Executed IPC    0.541810    3.099902    1.818507
        103                                issued_ipc                                            Issued IPC    3.034042    3.101053    3.069795
        103                    issue_slot_utilization                                Issue Slot Utilization      75.85%      77.53%      76.74%
        103                             sm_efficiency                               Multiprocessor Activity      92.25%      99.71%      96.86%
        103                        achieved_occupancy                                    Achieved Occupancy    0.686338    0.694919    0.691002
        103                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle   13.383984   14.656893   13.914015
        103                        shared_utilization                             Shared Memory Utilization    Idle (0)    Idle (0)    Idle (0)
        103                            l2_utilization                                  L2 Cache Utilization     Low (1)     Low (2)     Low (1)
        103                           tex_utilization                             Unified Cache Utilization     Low (1)     Low (1)     Low (1)
        103                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (1)     Low (1)     Low (1)
        103                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (2)     Low (2)     Low (2)
        103                        tex_fu_utilization                     Texture Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        103                    special_fu_utilization                     Special Function Unit Utilization     Low (2)     Low (2)     Low (2)
        103             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        103           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Mid (6)     Mid (6)     Mid (6)
        103           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        103                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
        103                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)       0.02%       0.33%       0.28%
        103                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
        103                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
        103                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
        103       nvlink_data_transmission_efficiency                   NVLink Data Transmission Efficiency       0.00%       0.00%       0.00%
        103            nvlink_data_receive_efficiency                        NVLink Data Receive Efficiency       0.00%       0.00%       0.00%
        103                            stall_sleeping                        Issue Stall Reasons (Sleeping)       0.00%       0.00%       0.00%
    Kernel: ptxcall_gpu_interface_gradients__17
        155                             inst_per_warp                                 Instructions per warp  1.1750e+04  1.1750e+04  1.1750e+04
        155                         branch_efficiency                                     Branch Efficiency     100.00%     100.00%     100.00%
        155                 warp_execution_efficiency                             Warp Execution Efficiency      56.25%      56.25%      56.25%
        155         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency      55.21%      55.21%      55.21%
        155                      inst_replay_overhead                           Instruction Replay Overhead    0.001535    0.002052    0.001742
        155      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    0.000000    0.000000    0.000000
        155     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    0.000000    0.000000    0.000000
        155       local_load_transactions_per_request            Local Memory Load Transactions Per Request    2.500000    2.500000    2.500000
        155      local_store_transactions_per_request           Local Memory Store Transactions Per Request    2.500000    2.500000    2.500000
        155              gld_transactions_per_request                  Global Load Transactions Per Request    9.201834    9.202861    9.202299
        155              gst_transactions_per_request                 Global Store Transactions Per Request    9.250000    9.250000    9.250000
        155                 shared_store_transactions                             Shared Store Transactions           0           0           0
        155                  shared_load_transactions                              Shared Load Transactions           0           0           0
        155                   local_load_transactions                               Local Load Transactions     4260000     4260000     4260000
        155                  local_store_transactions                              Local Store Transactions     4020000     4020000     4020000
        155                          gld_transactions                              Global Load Transactions    11442480    11443758    11443058
        155                          gst_transactions                             Global Store Transactions     6549000     6549000     6549000
        155                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
        155                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
        155                      l2_read_transactions                                  L2 Read Transactions    13529123    13647031    13592782
        155                     l2_write_transactions                                 L2 Write Transactions    11393200    11426356    11409493
        155                    dram_read_transactions                       Device Memory Read Transactions    15091535    15293009    15176859
        155                   dram_write_transactions                      Device Memory Write Transactions    10452264    10476862    10465751
        155                           global_hit_rate                     Global Hit Rate in unified l1/tex      38.55%      38.68%      38.63%
        155                            local_hit_rate                                        Local Hit Rate      18.94%      20.44%      19.72%
        155                  gld_requested_throughput                      Requested Global Load Throughput  54.246GB/s  56.252GB/s  55.384GB/s
        155                  gst_requested_throughput                     Requested Global Store Throughput  30.754GB/s  31.891GB/s  31.399GB/s
        155                            gld_throughput                                Global Load Throughput  220.91GB/s  229.09GB/s  225.55GB/s
        155                            gst_throughput                               Global Store Throughput  126.43GB/s  131.11GB/s  129.08GB/s
        155                     local_memory_overhead                                 Local Memory Overhead      54.18%      54.45%      54.30%
        155                        tex_cache_hit_rate                                Unified Cache Hit Rate       7.80%       8.28%       8.05%
        155                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)      11.49%      12.90%      12.31%
        155                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)      61.69%      61.81%      61.76%
        155                      dram_read_throughput                         Device Memory Read Throughput  291.66GB/s  304.74GB/s  299.14GB/s
        155                     dram_write_throughput                        Device Memory Write Throughput  201.96GB/s  209.73GB/s  206.29GB/s
        155                      tex_cache_throughput                        Unified cache to SM throughput  303.94GB/s  315.25GB/s  310.39GB/s
        155                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  261.98GB/s  272.62GB/s  267.96GB/s
        155                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  204.04GB/s  211.59GB/s  208.32GB/s
        155                        l2_read_throughput                                 L2 Throughput (Reads)  262.35GB/s  272.73GB/s  267.92GB/s
        155                       l2_write_throughput                                L2 Throughput (Writes)  220.30GB/s  228.66GB/s  224.89GB/s
        155                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                   sysmem_write_throughput                        System Memory Write Throughput  101.22KB/s  104.96KB/s  103.34KB/s
        155                     local_load_throughput                          Local Memory Load Throughput  82.241GB/s  85.283GB/s  83.967GB/s
        155                    local_store_throughput                         Local Memory Store Throughput  77.608GB/s  80.479GB/s  79.236GB/s
        155                    shared_load_throughput                         Shared Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                   shared_store_throughput                        Shared Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                            gld_efficiency                         Global Memory Load Efficiency      24.55%      24.56%      24.56%
        155                            gst_efficiency                        Global Memory Store Efficiency      24.32%      24.32%      24.32%
        155                    tex_cache_transactions                      Unified cache to SM transactions     3931356     3940408     3936815
        155                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
        155                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
        155                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
        155                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
        155                             flop_count_sp           Floating Point Operations(Single Precision)    87263985    87264000    87263999
        155                         flop_count_sp_add       Floating Point Operations(Single Precision Add)    14255999    14256000    14255999
        155                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)    21923993    21924000    21923999
        155                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)    29160000    29160000    29160000
        155                     flop_count_sp_special   Floating Point Operations(Single Precision Special)      647999      648000      647999
        155                             inst_executed                                 Instructions Executed    11538000    17625690    14954583
        155                               inst_issued                                   Instructions Issued    11554816    11561673    11557941
        155                          dram_utilization                             Device Memory Utilization     Mid (6)    High (7)     Mid (6)
        155                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
        155                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       0.21%       1.14%       0.51%
        155                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)       1.43%       1.68%       1.54%
        155                   stall_memory_dependency                    Issue Stall Reasons (Data Request)      91.78%      93.85%      92.87%
        155                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.00%       0.00%
        155                                stall_sync                 Issue Stall Reasons (Synchronization)       0.73%       0.93%       0.85%
        155                               stall_other                           Issue Stall Reasons (Other)       0.02%       0.02%       0.02%
        155          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)       0.09%       0.29%       0.13%
        155                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)       0.03%       0.04%       0.03%
        155                         shared_efficiency                              Shared Memory Efficiency       0.00%       0.00%       0.00%
        155                                inst_fp_32                               FP Instructions(Single)    66419996    66420000    66419999
        155                                inst_fp_64                               FP Instructions(Double)           0           0           0
        155                              inst_integer                                  Integer Instructions    68391000    68391012    68391000
        155                          inst_bit_convert                              Bit-Convert Instructions           0           0           0
        155                              inst_control                             Control-Flow Instructions     2646000     2646008     2646000
        155                        inst_compute_ld_st                               Load/Store Instructions    61128000    61128000    61128000
        155                                 inst_misc                                     Misc Instructions     7884000     7884005     7884000
        155           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
        155                               issue_slots                                           Issue Slots    11554816    11561673    11557941
        155                                 cf_issued                      Issued Control-Flow Instructions      184500      184516      184500
        155                               cf_executed                    Executed Control-Flow Instructions      184500      184516      184500
        155                               ldst_issued                        Issued Load/Store Instructions     3432000     3432001     3432000
        155                             ldst_executed                      Executed Load/Store Instructions     3432000     3432001     3432000
        155                       atomic_transactions                                   Atomic Transactions           0           0           0
        155           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
        155                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
        155                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
        155                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)    13534372    13660443    13594768
        155                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)       3.27%       4.54%       3.97%
        155                        stall_not_selected                    Issue Stall Reasons (Not Selected)       0.06%       0.08%       0.07%
        155                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)    10569000    10569000    10569000
        155             nvlink_total_data_transmitted                         NVLink Total Data Transmitted        1152        1920        1156
        155                nvlink_total_data_received                            NVLink Total Data Received         864        1440         867
        155              nvlink_user_data_transmitted                          NVLink User Data Transmitted           0           0           0
        155                 nvlink_user_data_received                             NVLink User Data Received           0           0           0
        155          nvlink_overhead_data_transmitted                      NVLink Overhead Data Transmitted       1.00%       1.00%       1.00%
        155             nvlink_overhead_data_received                         NVLink Overhead Data Received       1.00%       1.00%       1.00%
        155      nvlink_total_nratom_data_transmitted                  NVLink Total Nratom Data Transmitted           0           0           0
        155       nvlink_user_nratom_data_transmitted                   NVLink User Nratom Data Transmitted           0           0           0
        155       nvlink_total_ratom_data_transmitted                   NVLink Total Ratom Data Transmitted           0           0           0
        155        nvlink_user_ratom_data_transmitted                    NVLink User Ratom Data Transmitted           0           0           0
        155       nvlink_total_write_data_transmitted                   NVLink Total Write Data Transmitted           0           0           0
        155        nvlink_user_write_data_transmitted                    NVLink User Write Data Transmitted           0           0           0
        155                nvlink_transmit_throughput                            NVLink Transmit Throughput  728.76KB/s  1.1962MB/s  747.25KB/s
        155                 nvlink_receive_throughput                             NVLink Receive Throughput  546.57KB/s  918.67KB/s  560.43KB/s
        155       nvlink_total_response_data_received                   NVLink Total Response Data Received         288         288         288
        155        nvlink_user_response_data_received                    NVLink User Response Data Received           0           0           0
        155                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
        155                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
        155                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
        155                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
        155                                inst_fp_16                                 HP Instructions(Half)           0           0           0
        155                                       ipc                                          Executed IPC    0.071277    0.120621    0.091372
        155                                issued_ipc                                            Issued IPC    0.071367    0.083051    0.076398
        155                    issue_slot_utilization                                Issue Slot Utilization       1.78%       2.08%       1.91%
        155                             sm_efficiency                               Multiprocessor Activity      85.71%      90.37%      87.35%
        155                        achieved_occupancy                                    Achieved Occupancy    0.115959    0.123401    0.120441
        155                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle    0.076379    0.089024    0.081801
        155                        shared_utilization                             Shared Memory Utilization    Idle (0)    Idle (0)    Idle (0)
        155                            l2_utilization                                  L2 Cache Utilization     Low (1)     Low (1)     Low (1)
        155                           tex_utilization                             Unified Cache Utilization     Low (1)     Low (1)     Low (1)
        155                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155                        tex_fu_utilization                     Texture Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155                    special_fu_utilization                     Special Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
        155                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)       0.36%       0.42%       0.39%
        155                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
        155                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
        155                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
        155       nvlink_data_transmission_efficiency                   NVLink Data Transmission Efficiency       0.00%       0.00%       0.00%
        155            nvlink_data_receive_efficiency                        NVLink Data Receive Efficiency       0.00%       0.00%       0.00%
        155                            stall_sleeping                        Issue Stall Reasons (Sleeping)       0.00%       0.00%       0.00%
    Kernel: ptxcall_gpu_interface_divergence_of_gradients__19
        155                             inst_per_warp                                 Instructions per warp  1.6060e+03  1.6060e+03  1.6060e+03
        155                         branch_efficiency                                     Branch Efficiency     100.00%     100.00%     100.00%
        155                 warp_execution_efficiency                             Warp Execution Efficiency      56.25%      56.25%      56.25%
        155         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency      55.27%      55.27%      55.27%
        155                      inst_replay_overhead                           Instruction Replay Overhead    0.012460    0.014881    0.013454
        155      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    0.000000    0.000000    0.000000
        155     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    0.000000    0.000000    0.000000
        155       local_load_transactions_per_request            Local Memory Load Transactions Per Request    0.000000    0.000000    0.000000
        155      local_store_transactions_per_request           Local Memory Store Transactions Per Request    0.000000    0.000000    0.000000
        155              gld_transactions_per_request                  Global Load Transactions Per Request    9.198303    9.204780    9.201584
        155              gst_transactions_per_request                 Global Store Transactions Per Request    9.250000    9.250000    9.250000
        155                 shared_store_transactions                             Shared Store Transactions           0           0           0
        155                  shared_load_transactions                              Shared Load Transactions           0           0           0
        155                   local_load_transactions                               Local Load Transactions           0           0           0
        155                  local_store_transactions                              Local Store Transactions           0           0           0
        155                          gld_transactions                              Global Load Transactions     1945441     1946811     1946134
        155                          gst_transactions                             Global Store Transactions      222000      222000      222000
        155                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
        155                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
        155                      l2_read_transactions                                  L2 Read Transactions     1553181     1560584     1558571
        155                     l2_write_transactions                                 L2 Write Transactions      260048      286923      272299
        155                    dram_read_transactions                       Device Memory Read Transactions     1218056     1248356     1231080
        155                   dram_write_transactions                      Device Memory Write Transactions      273989      299321      287240
        155                           global_hit_rate                     Global Hit Rate in unified l1/tex      28.13%      28.38%      28.24%
        155                            local_hit_rate                                        Local Hit Rate       0.00%       0.00%       0.00%
        155                  gld_requested_throughput                      Requested Global Load Throughput  202.88GB/s  207.89GB/s  205.79GB/s
        155                  gst_requested_throughput                     Requested Global Store Throughput  21.916GB/s  22.457GB/s  22.231GB/s
        155                            gld_throughput                                Global Load Throughput  789.78GB/s  809.53GB/s  801.21GB/s
        155                            gst_throughput                               Global Store Throughput  90.101GB/s  92.325GB/s  91.395GB/s
        155                     local_memory_overhead                                 Local Memory Overhead      12.28%      12.87%      12.58%
        155                        tex_cache_hit_rate                                Unified Cache Hit Rate      18.16%      18.48%      18.27%
        155                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)      36.53%      38.13%      37.43%
        155                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)      80.40%      81.52%      80.90%
        155                      dram_read_throughput                         Device Memory Read Throughput  495.39GB/s  514.32GB/s  506.82GB/s
        155                     dram_write_throughput                        Device Memory Write Throughput  112.84GB/s  124.10GB/s  118.25GB/s
        155                      tex_cache_throughput                        Unified cache to SM throughput  631.52GB/s  647.28GB/s  640.82GB/s
        155                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  632.16GB/s  647.90GB/s  641.35GB/s
        155                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  90.101GB/s  92.325GB/s  91.395GB/s
        155                        l2_read_throughput                                 L2 Throughput (Reads)  632.61GB/s  648.60GB/s  641.65GB/s
        155                       l2_write_throughput                                L2 Throughput (Writes)  106.26GB/s  118.80GB/s  112.10GB/s
        155                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                   sysmem_write_throughput                        System Memory Write Throughput  2.0780MB/s  2.1293MB/s  2.1078MB/s
        155                     local_load_throughput                          Local Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                    local_store_throughput                         Local Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                    shared_load_throughput                         Shared Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                   shared_store_throughput                        Shared Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                            gld_efficiency                         Global Memory Load Efficiency      25.68%      25.69%      25.69%
        155                            gst_efficiency                        Global Memory Store Efficiency      24.32%      24.32%      24.32%
        155                    tex_cache_transactions                      Unified cache to SM transactions      388327      389841      389136
        155                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
        155                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
        155                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
        155                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
        155                             flop_count_sp           Floating Point Operations(Single Precision)     4752000     4752000     4752000
        155                         flop_count_sp_add       Floating Point Operations(Single Precision Add)     1296000     1296000     1296000
        155                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)     1296000     1296000     1296000
        155                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)      864000      864000      864000
        155                     flop_count_sp_special   Floating Point Operations(Single Precision Special)           0           0           0
        155                             inst_executed                                 Instructions Executed     1608000     2409000     1974909
        155                               inst_issued                                   Instructions Issued     1627903     1630802     1629543
        155                          dram_utilization                             Device Memory Utilization    High (8)    High (8)    High (8)
        155                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
        155                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       0.11%       3.32%       0.93%
        155                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)       1.51%       1.83%       1.64%
        155                   stall_memory_dependency                    Issue Stall Reasons (Data Request)      90.64%      94.33%      92.76%
        155                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.00%       0.00%
        155                                stall_sync                 Issue Stall Reasons (Synchronization)       1.20%       1.51%       1.34%
        155                               stall_other                           Issue Stall Reasons (Other)       0.20%       0.28%       0.24%
        155          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)       0.82%       1.59%       1.11%
        155                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)       0.22%       0.35%       0.28%
        155                         shared_efficiency                              Shared Memory Efficiency       0.00%       0.00%       0.00%
        155                                inst_fp_32                               FP Instructions(Single)     3456000     3456000     3456000
        155                                inst_fp_64                               FP Instructions(Double)           0           0           0
        155                              inst_integer                                  Integer Instructions    19521000    19521000    19521000
        155                          inst_bit_convert                              Bit-Convert Instructions           0           0           0
        155                              inst_control                             Control-Flow Instructions      162000      162000      162000
        155                        inst_compute_ld_st                               Load/Store Instructions     4239000     4239000     4239000
        155                                 inst_misc                                     Misc Instructions     1323000     1323000     1323000
        155           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
        155                               issue_slots                                           Issue Slots     1627903     1630802     1629543
        155                                 cf_issued                      Issued Control-Flow Instructions       16500       16500       16500
        155                               cf_executed                    Executed Control-Flow Instructions       16500       16500       16500
        155                               ldst_issued                        Issued Load/Store Instructions      247500      247500      247500
        155                             ldst_executed                      Executed Load/Store Instructions      247500      247500      247500
        155                       atomic_transactions                                   Atomic Transactions           0           0           0
        155           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
        155                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
        155                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
        155                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)     1553162     1560156     1557847
        155                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)       0.70%       1.73%       1.13%
        155                        stall_not_selected                    Issue Stall Reasons (Not Selected)       0.50%       0.65%       0.56%
        155                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)      222000      222000      222000
        155             nvlink_total_data_transmitted                         NVLink Total Data Transmitted        1152        1152        1152
        155                nvlink_total_data_received                            NVLink Total Data Received         864         864         864
        155              nvlink_user_data_transmitted                          NVLink User Data Transmitted           0           0           0
        155                 nvlink_user_data_received                             NVLink User Data Received           0           0           0
        155          nvlink_overhead_data_transmitted                      NVLink Overhead Data Transmitted       1.00%       1.00%       1.00%
        155             nvlink_overhead_data_received                         NVLink Overhead Data Received       1.00%       1.00%       1.00%
        155      nvlink_total_nratom_data_transmitted                  NVLink Total Nratom Data Transmitted           0           0           0
        155       nvlink_user_nratom_data_transmitted                   NVLink User Nratom Data Transmitted           0           0           0
        155       nvlink_total_ratom_data_transmitted                   NVLink Total Ratom Data Transmitted           0           0           0
        155        nvlink_user_ratom_data_transmitted                    NVLink User Ratom Data Transmitted           0           0           0
        155       nvlink_total_write_data_transmitted                   NVLink Total Write Data Transmitted           0           0           0
        155        nvlink_user_write_data_transmitted                    NVLink User Write Data Transmitted           0           0           0
        155                nvlink_transmit_throughput                            NVLink Transmit Throughput  14.962MB/s  15.331MB/s  15.177MB/s
        155                 nvlink_receive_throughput                             NVLink Receive Throughput  11.221MB/s  11.498MB/s  11.382MB/s
        155       nvlink_total_response_data_received                   NVLink Total Response Data Received         288         288         288
        155        nvlink_user_response_data_received                    NVLink User Response Data Received           0           0           0
        155                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
        155                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
        155                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
        155                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
        155                                inst_fp_16                                 HP Instructions(Half)           0           0           0
        155                                       ipc                                          Executed IPC    0.189691    0.291671    0.237549
        155                                issued_ipc                                            Issued IPC    0.192304    0.232761    0.209994
        155                    issue_slot_utilization                                Issue Slot Utilization       4.81%       5.82%       5.25%
        155                             sm_efficiency                               Multiprocessor Activity      78.14%      96.19%      92.22%
        155                        achieved_occupancy                                    Achieved Occupancy    0.286913    0.288411    0.287798
        155                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle    0.286789    0.346164    0.313476
        155                        shared_utilization                             Shared Memory Utilization    Idle (0)    Idle (0)    Idle (0)
        155                            l2_utilization                                  L2 Cache Utilization     Low (2)     Low (2)     Low (2)
        155                           tex_utilization                             Unified Cache Utilization     Low (1)     Low (1)     Low (1)
        155                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155                        tex_fu_utilization                     Texture Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155                    special_fu_utilization                     Special Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
        155                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)       0.36%       0.49%       0.43%
        155                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
        155                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
        155                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
        155       nvlink_data_transmission_efficiency                   NVLink Data Transmission Efficiency       0.00%       0.00%       0.00%
        155            nvlink_data_receive_efficiency                        NVLink Data Receive Efficiency       0.00%       0.00%       0.00%
        155                            stall_sleeping                        Issue Stall Reasons (Sleeping)       0.00%       0.00%       0.00%
    Kernel: ptxcall_gpu_stage_update__24
         52                             inst_per_warp                                 Instructions per warp  334.989836  334.989836  334.989836
         52                         branch_efficiency                                     Branch Efficiency     100.00%     100.00%     100.00%
         52                 warp_execution_efficiency                             Warp Execution Efficiency     100.00%     100.00%     100.00%
         52         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency      95.32%      95.32%      95.32%
         52                      inst_replay_overhead                           Instruction Replay Overhead    0.001949    0.003757    0.002618
         52      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    0.000000    0.000000    0.000000
         52     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    0.000000    0.000000    0.000000
         52       local_load_transactions_per_request            Local Memory Load Transactions Per Request    0.000000    0.000000    0.000000
         52      local_store_transactions_per_request           Local Memory Store Transactions Per Request    0.000000    0.000000    0.000000
         52              gld_transactions_per_request                  Global Load Transactions Per Request    3.999989    3.999989    3.999989
         52              gst_transactions_per_request                 Global Store Transactions Per Request    3.999989    3.999989    3.999989
         52                 shared_store_transactions                             Shared Store Transactions           0           0           0
         52                  shared_load_transactions                              Shared Load Transactions           0           0           0
         52                   local_load_transactions                               Local Load Transactions           0           0           0
         52                  local_store_transactions                              Local Store Transactions           0           0           0
         52                          gld_transactions                              Global Load Transactions     2247750     2247750     2247750
         52                          gst_transactions                             Global Store Transactions     2247750     2247750     2247750
         52                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
         52                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
         52                      l2_read_transactions                                  L2 Read Transactions     1560492     1568084     1563760
         52                     l2_write_transactions                                 L2 Write Transactions     2247776     2271656     2262412
         52                    dram_read_transactions                       Device Memory Read Transactions     1498511     1498923     1498587
         52                   dram_write_transactions                      Device Memory Write Transactions     2230304     2254581     2242227
         52                           global_hit_rate                     Global Hit Rate in unified l1/tex      15.16%      15.28%      15.23%
         52                            local_hit_rate                                        Local Hit Rate       0.00%       0.00%       0.00%
         52                  gld_requested_throughput                      Requested Global Load Throughput  426.48GB/s  429.97GB/s  428.68GB/s
         52                  gst_requested_throughput                     Requested Global Store Throughput  426.48GB/s  429.97GB/s  428.68GB/s
         52                            gld_throughput                                Global Load Throughput  426.48GB/s  429.97GB/s  428.68GB/s
         52                            gst_throughput                               Global Store Throughput  426.48GB/s  429.97GB/s  428.68GB/s
         52                     local_memory_overhead                                 Local Memory Overhead       0.00%       0.13%       0.02%
         52                        tex_cache_hit_rate                                Unified Cache Hit Rate      15.13%      15.29%      15.22%
         52                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)       3.93%       4.39%       4.14%
         52                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)       0.00%       0.00%       0.00%
         52                      dram_read_throughput                         Device Memory Read Throughput  284.34GB/s  286.67GB/s  285.80GB/s
         52                     dram_write_throughput                        Device Memory Write Throughput  424.70GB/s  430.73GB/s  427.63GB/s
         52                      tex_cache_throughput                        Unified cache to SM throughput  568.64GB/s  573.31GB/s  571.58GB/s
         52                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  296.36GB/s  299.16GB/s  298.15GB/s
         52                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  426.48GB/s  429.97GB/s  428.68GB/s
         52                        l2_read_throughput                                 L2 Throughput (Reads)  296.43GB/s  299.39GB/s  298.23GB/s
         52                       l2_write_throughput                                L2 Throughput (Writes)  427.50GB/s  434.40GB/s  431.48GB/s
         52                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         52                   sysmem_write_throughput                        System Memory Write Throughput  994.75KB/s  0.9794MB/s  999.89KB/s
         52                     local_load_throughput                          Local Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         52                    local_store_throughput                         Local Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         52                    shared_load_throughput                         Shared Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         52                   shared_store_throughput                        Shared Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         52                            gld_efficiency                         Global Memory Load Efficiency     100.00%     100.00%     100.00%
         52                            gst_efficiency                        Global Memory Store Efficiency     100.00%     100.00%     100.00%
         52                    tex_cache_transactions                      Unified cache to SM transactions      749259      749259      749259
         52                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
         52                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
         52                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
         52                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
         52                             flop_count_sp           Floating Point Operations(Single Precision)   131868000   131868000   131868000
         52                         flop_count_sp_add       Floating Point Operations(Single Precision Add)    23976000    23976000    23976000
         52                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)    47952000    47952000    47952000
         52                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)    11988000    11988000    11988000
         52                     flop_count_sp_special   Floating Point Operations(Single Precision Special)     5994000     5994000     5994000
         52                             inst_executed                                 Instructions Executed    11238850    62750296    35013363
         52                               inst_issued                                   Instructions Issued    11260851    11278424    11267820
         52                          dram_utilization                             Device Memory Utilization    High (9)    High (9)    High (9)
         52                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
         52                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       1.17%       1.48%       1.29%
         52                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)       3.56%       4.28%       3.80%
         52                   stall_memory_dependency                    Issue Stall Reasons (Data Request)      89.26%      91.28%      90.52%
         52                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.00%       0.00%
         52                                stall_sync                 Issue Stall Reasons (Synchronization)       0.00%       0.00%       0.00%
         52                               stall_other                           Issue Stall Reasons (Other)       0.11%       0.14%       0.12%
         52          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)       0.21%       0.63%       0.37%
         52                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)       0.85%       1.15%       0.95%
         52                         shared_efficiency                              Shared Memory Efficiency       0.00%       0.00%       0.00%
         52                                inst_fp_32                               FP Instructions(Single)    95904000    95904000    95904000
         52                                inst_fp_64                               FP Instructions(Double)           0           0           0
         52                              inst_integer                                  Integer Instructions   167833440   167833440   167833440
         52                          inst_bit_convert                              Bit-Convert Instructions           0           0           0
         52                              inst_control                             Control-Flow Instructions    17982240    17982240    17982240
         52                        inst_compute_ld_st                               Load/Store Instructions    35964000    35964000    35964000
         52                                 inst_misc                                     Misc Instructions    17982480    17982480    17982480
         52           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
         52                               issue_slots                                           Issue Slots    11260851    11278424    11267820
         52                                 cf_issued                      Issued Control-Flow Instructions     1123892     1123892     1123892
         52                               cf_executed                    Executed Control-Flow Instructions     1123892     1123892     1123892
         52                               ldst_issued                        Issued Load/Store Instructions     1685831     1685831     1685831
         52                             ldst_executed                      Executed Load/Store Instructions     1685831     1685831     1685831
         52                       atomic_transactions                                   Atomic Transactions           0           0           0
         52           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
         52                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
         52                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
         52                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)     1560476     1567400     1563315
         52                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)       1.66%       2.16%       1.86%
         52                        stall_not_selected                    Issue Stall Reasons (Not Selected)       0.99%       1.27%       1.09%
         52                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)     2247750     2247750     2247750
         52             nvlink_total_data_transmitted                         NVLink Total Data Transmitted        1152        1152        1152
         52                nvlink_total_data_received                            NVLink Total Data Received         864         864         864
         52              nvlink_user_data_transmitted                          NVLink User Data Transmitted           0           0           0
         52                 nvlink_user_data_received                             NVLink User Data Received           0           0           0
         52          nvlink_overhead_data_transmitted                      NVLink Overhead Data Transmitted       1.00%       1.00%       1.00%
         52             nvlink_overhead_data_received                         NVLink Overhead Data Received       1.00%       1.00%       1.00%
         52      nvlink_total_nratom_data_transmitted                  NVLink Total Nratom Data Transmitted           0           0           0
         52       nvlink_user_nratom_data_transmitted                   NVLink User Nratom Data Transmitted           0           0           0
         52       nvlink_total_ratom_data_transmitted                   NVLink Total Ratom Data Transmitted           0           0           0
         52        nvlink_user_ratom_data_transmitted                    NVLink User Ratom Data Transmitted           0           0           0
         52       nvlink_total_write_data_transmitted                   NVLink Total Write Data Transmitted           0           0           0
         52        nvlink_user_write_data_transmitted                    NVLink User Write Data Transmitted           0           0           0
         52                nvlink_transmit_throughput                            NVLink Transmit Throughput  6.9944MB/s  7.0517MB/s  7.0305MB/s
         52                 nvlink_receive_throughput                             NVLink Receive Throughput  5.2458MB/s  5.2888MB/s  5.2729MB/s
         52       nvlink_total_response_data_received                   NVLink Total Response Data Received         288         288         288
         52        nvlink_user_response_data_received                    NVLink User Response Data Received           0           0           0
         52                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
         52                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
         52                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
         52                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
         52                                inst_fp_16                                 HP Instructions(Half)           0           0           0
         52                                       ipc                                          Executed IPC    0.431233    0.709484    0.549376
         52                                issued_ipc                                            Issued IPC    0.604061    0.711792    0.639410
         52                    issue_slot_utilization                                Issue Slot Utilization      15.10%      17.79%      15.99%
         52                             sm_efficiency                               Multiprocessor Activity      95.77%      99.60%      96.82%
         52                        achieved_occupancy                                    Achieved Occupancy    0.909721    0.917467    0.914479
         52                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle    1.194290    1.440419    1.282381
         52                        shared_utilization                             Shared Memory Utilization    Idle (0)    Idle (0)    Idle (0)
         52                            l2_utilization                                  L2 Cache Utilization     Low (1)     Low (2)     Low (1)
         52                           tex_utilization                             Unified Cache Utilization     Low (1)     Low (1)     Low (1)
         52                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (1)     Low (2)     Low (1)
         52                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
         52                        tex_fu_utilization                     Texture Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
         52                    special_fu_utilization                     Special Function Unit Utilization     Low (1)     Low (1)     Low (1)
         52             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
         52           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Low (2)     Low (2)     Low (2)
         52           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
         52                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
         52                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)       0.74%       6.26%       4.85%
         52                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
         52                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
         52                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
         52       nvlink_data_transmission_efficiency                   NVLink Data Transmission Efficiency       0.00%       0.00%       0.00%
         52            nvlink_data_receive_efficiency                        NVLink Data Receive Efficiency       0.00%       0.00%       0.00%
         52                            stall_sleeping                        Issue Stall Reasons (Sleeping)       0.00%       0.00%       0.00%
    Kernel: ptxcall_gpu_stage_update__28
         52                             inst_per_warp                                 Instructions per warp  534.982362  534.982362  534.982362
         52                         branch_efficiency                                     Branch Efficiency     100.00%     100.00%     100.00%
         52                 warp_execution_efficiency                             Warp Execution Efficiency     100.00%     100.00%     100.00%
         52         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency      95.23%      95.23%      95.23%
         52                      inst_replay_overhead                           Instruction Replay Overhead    0.002053    0.002988    0.002480
         52      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    0.000000    0.000000    0.000000
         52     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    0.000000    0.000000    0.000000
         52       local_load_transactions_per_request            Local Memory Load Transactions Per Request    0.000000    0.000000    0.000000
         52      local_store_transactions_per_request           Local Memory Store Transactions Per Request    0.000000    0.000000    0.000000
         52              gld_transactions_per_request                  Global Load Transactions Per Request    3.999989    3.999989    3.999989
         52              gst_transactions_per_request                 Global Store Transactions Per Request    3.999989    3.999989    3.999989
         52                 shared_store_transactions                             Shared Store Transactions           0           0           0
         52                  shared_load_transactions                              Shared Load Transactions           0           0           0
         52                   local_load_transactions                               Local Load Transactions           0           0           0
         52                  local_store_transactions                              Local Store Transactions           0           0           0
         52                          gld_transactions                              Global Load Transactions     3746250     3746250     3746250
         52                          gst_transactions                             Global Store Transactions     2247750     2247750     2247750
         52                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
         52                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
         52                      l2_read_transactions                                  L2 Read Transactions     3052890     3058896     3055542
         52                     l2_write_transactions                                 L2 Write Transactions     2247775     2271552     2256582
         52                    dram_read_transactions                       Device Memory Read Transactions     2997078     2997459     2997199
         52                   dram_write_transactions                      Device Memory Write Transactions     2235146     2257506     2247535
         52                           global_hit_rate                     Global Hit Rate in unified l1/tex      11.48%      11.58%      11.53%
         52                            local_hit_rate                                        Local Hit Rate       0.00%       0.00%       0.00%
         52                  gld_requested_throughput                      Requested Global Load Throughput  525.20GB/s  528.92GB/s  527.47GB/s
         52                  gst_requested_throughput                     Requested Global Store Throughput  315.12GB/s  317.35GB/s  316.48GB/s
         52                            gld_throughput                                Global Load Throughput  525.20GB/s  528.92GB/s  527.47GB/s
         52                            gst_throughput                               Global Store Throughput  315.12GB/s  317.35GB/s  316.48GB/s
         52                     local_memory_overhead                                 Local Memory Overhead       0.00%       0.08%       0.01%
         52                        tex_cache_hit_rate                                Unified Cache Hit Rate      11.46%      11.58%      11.53%
         52                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)       1.80%       2.01%       1.89%
         52                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)       0.00%       0.00%       0.00%
         52                      dram_read_throughput                         Device Memory Read Throughput  420.18GB/s  423.16GB/s  422.00GB/s
         52                     dram_write_throughput                        Device Memory Write Throughput  313.41GB/s  318.54GB/s  316.45GB/s
         52                      tex_cache_throughput                        Unified cache to SM throughput  630.25GB/s  634.71GB/s  632.96GB/s
         52                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  428.16GB/s  431.26GB/s  430.13GB/s
         52                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  315.12GB/s  317.35GB/s  316.48GB/s
         52                        l2_read_throughput                                 L2 Throughput (Reads)  428.08GB/s  431.29GB/s  430.22GB/s
         52                       l2_write_throughput                                L2 Throughput (Writes)  315.46GB/s  320.62GB/s  317.72GB/s
         52                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         52                   sysmem_write_throughput                        System Memory Write Throughput  735.02KB/s  740.23KB/s  738.18KB/s
         52                     local_load_throughput                          Local Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         52                    local_store_throughput                         Local Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         52                    shared_load_throughput                         Shared Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         52                   shared_store_throughput                        Shared Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         52                            gld_efficiency                         Global Memory Load Efficiency     100.00%     100.00%     100.00%
         52                            gst_efficiency                        Global Memory Store Efficiency     100.00%     100.00%     100.00%
         52                    tex_cache_transactions                      Unified cache to SM transactions     1123885     1123885     1123885
         52                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
         52                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
         52                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
         52                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
         52                             flop_count_sp           Floating Point Operations(Single Precision)   263736000   263736000   263736000
         52                         flop_count_sp_add       Floating Point Operations(Single Precision Add)    35964000    35964000    35964000
         52                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)   101898000   101898000   101898000
         52                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)    23976000    23976000    23976000
         52                     flop_count_sp_special   Floating Point Operations(Single Precision Special)    11988000    11988000    11988000
         52                             inst_executed                                 Instructions Executed    17607492   100212896    68441586
         52                               inst_issued                                   Instructions Issued    17643869    17663490    17652174
         52                          dram_utilization                             Device Memory Utilization    High (9)    High (9)    High (9)
         52                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
         52                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       1.36%       1.70%       1.48%
         52                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)       3.81%       4.54%       4.08%
         52                   stall_memory_dependency                    Issue Stall Reasons (Data Request)      89.03%      90.91%      90.21%
         52                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.00%       0.00%
         52                                stall_sync                 Issue Stall Reasons (Synchronization)       0.00%       0.00%       0.00%
         52                               stall_other                           Issue Stall Reasons (Other)       0.10%       0.13%       0.12%
         52          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)       0.18%       0.47%       0.28%
         52                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)       1.34%       1.66%       1.47%
         52                         shared_efficiency                              Shared Memory Efficiency       0.00%       0.00%       0.00%
         52                                inst_fp_32                               FP Instructions(Single)   185814000   185814000   185814000
         52                                inst_fp_64                               FP Instructions(Double)           0           0           0
         52                              inst_integer                                  Integer Instructions   239761440   239761440   239761440
         52                          inst_bit_convert                              Bit-Convert Instructions           0           0           0
         52                              inst_control                             Control-Flow Instructions    29970240    29970240    29970240
         52                        inst_compute_ld_st                               Load/Store Instructions    47952000    47952000    47952000
         52                                 inst_misc                                     Misc Instructions    23976480    23976480    23976480
         52           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
         52                               issue_slots                                           Issue Slots    17643869    17663490    17652174
         52                                 cf_issued                      Issued Control-Flow Instructions     1685831     1685831     1685831
         52                               cf_executed                    Executed Control-Flow Instructions     1685831     1685831     1685831
         52                               ldst_issued                        Issued Load/Store Instructions     2247770     2247770     2247770
         52                             ldst_executed                      Executed Load/Store Instructions     2247770     2247770     2247770
         52                       atomic_transactions                                   Atomic Transactions           0           0           0
         52           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
         52                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
         52                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
         52                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)     3052248     3059492     3054944
         52                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)       0.56%       0.74%       0.63%
         52                        stall_not_selected                    Issue Stall Reasons (Not Selected)       1.59%       1.98%       1.74%
         52                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)     2247750     2247750     2247750
         52             nvlink_total_data_transmitted                         NVLink Total Data Transmitted        1152        1152        1152
         52                nvlink_total_data_received                            NVLink Total Data Received         864         864         864
         52              nvlink_user_data_transmitted                          NVLink User Data Transmitted           0           0           0
         52                 nvlink_user_data_received                             NVLink User Data Received           0           0           0
         52          nvlink_overhead_data_transmitted                      NVLink Overhead Data Transmitted       1.00%       1.00%       1.00%
         52             nvlink_overhead_data_received                         NVLink Overhead Data Received       1.00%       1.00%       1.00%
         52      nvlink_total_nratom_data_transmitted                  NVLink Total Nratom Data Transmitted           0           0           0
         52       nvlink_user_nratom_data_transmitted                   NVLink User Nratom Data Transmitted           0           0           0
         52       nvlink_total_ratom_data_transmitted                   NVLink Total Ratom Data Transmitted           0           0           0
         52        nvlink_user_ratom_data_transmitted                    NVLink User Ratom Data Transmitted           0           0           0
         52       nvlink_total_write_data_transmitted                   NVLink Total Write Data Transmitted           0           0           0
         52        nvlink_user_write_data_transmitted                    NVLink User Write Data Transmitted           0           0           0
         52                nvlink_transmit_throughput                            NVLink Transmit Throughput  5.1681MB/s  5.2047MB/s  5.1904MB/s
         52                 nvlink_receive_throughput                             NVLink Receive Throughput  3.8761MB/s  3.9036MB/s  3.8928MB/s
         52       nvlink_total_response_data_received                   NVLink Total Response Data Received         288         288         288
         52        nvlink_user_response_data_received                    NVLink User Response Data Received           0           0           0
         52                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
         52                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
         52                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
         52                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
         52                                inst_fp_16                                 HP Instructions(Half)           0           0           0
         52                                       ipc                                          Executed IPC    0.462797    0.814774    0.637099
         52                                issued_ipc                                            Issued IPC    0.691919    0.815913    0.738492
         52                    issue_slot_utilization                                Issue Slot Utilization      17.30%      20.40%      18.46%
         52                             sm_efficiency                               Multiprocessor Activity      96.88%      99.61%      97.62%
         52                        achieved_occupancy                                    Achieved Occupancy    0.910604    0.920766    0.915729
         52                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle    1.646381    1.977418    1.769254
         52                        shared_utilization                             Shared Memory Utilization    Idle (0)    Idle (0)    Idle (0)
         52                            l2_utilization                                  L2 Cache Utilization     Low (1)     Low (2)     Low (1)
         52                           tex_utilization                             Unified Cache Utilization     Low (1)     Low (1)     Low (1)
         52                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (1)     Low (2)     Low (1)
         52                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
         52                        tex_fu_utilization                     Texture Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
         52                    special_fu_utilization                     Special Function Unit Utilization     Low (1)     Low (1)     Low (1)
         52             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
         52           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Low (2)     Low (2)     Low (2)
         52           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
         52                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
         52                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)       0.90%       9.25%       6.97%
         52                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
         52                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
         52                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
         52       nvlink_data_transmission_efficiency                   NVLink Data Transmission Efficiency       0.00%       0.00%       0.00%
         52            nvlink_data_receive_efficiency                        NVLink Data Receive Efficiency       0.00%       0.00%       0.00%
         52                            stall_sleeping                        Issue Stall Reasons (Sleeping)       0.00%       0.00%       0.00%
    Kernel: ptxcall_gpu_volume_gradients_of_laplacians__20
        155                             inst_per_warp                                 Instructions per warp  3.7010e+03  3.7010e+03  3.7010e+03
        155                         branch_efficiency                                     Branch Efficiency     100.00%     100.00%     100.00%
        155                 warp_execution_efficiency                             Warp Execution Efficiency      96.43%      96.43%      96.43%
        155         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency      91.83%      91.83%      91.83%
        155                      inst_replay_overhead                           Instruction Replay Overhead    0.006148    0.014152    0.009441
        155      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    1.567256    1.604154    1.582187
        155     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    1.122444    1.124571    1.123434
        155       local_load_transactions_per_request            Local Memory Load Transactions Per Request    0.000000    0.000000    0.000000
        155      local_store_transactions_per_request           Local Memory Store Transactions Per Request    0.000000    0.000000    0.000000
        155              gld_transactions_per_request                  Global Load Transactions Per Request    3.449172    3.488410    3.466619
        155              gst_transactions_per_request                 Global Store Transactions Per Request    3.857143    3.857143    3.857143
        155                 shared_store_transactions                             Shared Store Transactions       35357       35424       35388
        155                  shared_load_transactions                              Shared Load Transactions      427861      437934      431937
        155                   local_load_transactions                               Local Load Transactions           0           0           0
        155                  local_store_transactions                              Local Store Transactions           0           0           0
        155                          gld_transactions                              Global Load Transactions      235406      238084      236596
        155                          gst_transactions                             Global Store Transactions      243000      243000      243000
        155                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
        155                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
        155                      l2_read_transactions                                  L2 Read Transactions      223475      225859      224461
        155                     l2_write_transactions                                 L2 Write Transactions      243021      266923      252933
        155                    dram_read_transactions                       Device Memory Read Transactions      225020      225444      225190
        155                   dram_write_transactions                      Device Memory Write Transactions      238936      264018      251926
        155                           global_hit_rate                     Global Hit Rate in unified l1/tex      25.50%      25.71%      25.61%
        155                            local_hit_rate                                        Local Hit Rate       0.00%       0.00%       0.00%
        155                  gld_requested_throughput                      Requested Global Load Throughput  255.04GB/s  282.98GB/s  272.55GB/s
        155                  gst_requested_throughput                     Requested Global Store Throughput  251.16GB/s  278.68GB/s  268.41GB/s
        155                            gld_throughput                                Global Load Throughput  245.31GB/s  272.05GB/s  261.34GB/s
        155                            gst_throughput                               Global Store Throughput  251.16GB/s  278.68GB/s  268.41GB/s
        155                     local_memory_overhead                                 Local Memory Overhead      23.22%      23.80%      23.46%
        155                        tex_cache_hit_rate                                Unified Cache Hit Rate       5.94%       5.95%       5.95%
        155                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)      13.65%      13.69%      13.67%
        155                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)      38.81%      38.97%      38.90%
        155                      dram_read_throughput                         Device Memory Read Throughput  232.98GB/s  258.51GB/s  248.74GB/s
        155                     dram_write_throughput                        Device Memory Write Throughput  259.15GB/s  300.07GB/s  278.27GB/s
        155                      tex_cache_throughput                        Unified cache to SM throughput  1487.0GB/s  1649.7GB/s  1589.1GB/s
        155                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  230.79GB/s  256.04GB/s  246.62GB/s
        155                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  251.16GB/s  278.68GB/s  268.41GB/s
        155                        l2_read_throughput                                 L2 Throughput (Reads)  233.41GB/s  258.12GB/s  247.93GB/s
        155                       l2_write_throughput                                L2 Throughput (Writes)  251.19GB/s  304.23GB/s  279.38GB/s
        155                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                   sysmem_write_throughput                        System Memory Write Throughput  5.2919MB/s  5.8717MB/s  5.6554MB/s
        155                     local_load_throughput                          Local Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                    local_store_throughput                         Local Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                    shared_load_throughput                         Shared Memory Load Throughput  1776.6GB/s  1983.2GB/s  1908.4GB/s
        155                   shared_store_throughput                        Shared Memory Store Throughput  146.29GB/s  162.44GB/s  156.36GB/s
        155                            gld_efficiency                         Global Memory Load Efficiency     103.64%     104.82%     104.29%
        155                            gst_efficiency                        Global Memory Store Efficiency     100.00%     100.00%     100.00%
        155                    tex_cache_transactions                      Unified cache to SM transactions      359613      360008      359664
        155                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
        155                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
        155                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
        155                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
        155                             flop_count_sp           Floating Point Operations(Single Precision)    57672000    57672000    57672000
        155                         flop_count_sp_add       Floating Point Operations(Single Precision Add)     2106000     2106000     2106000
        155                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)    24462000    24462000    24462000
        155                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)     6642000     6642000     6642000
        155                     flop_count_sp_special   Floating Point Operations(Single Precision Special)     2430000     2430000     2430000
        155                             inst_executed                                 Instructions Executed     4168500    19430250    12931698
        155                               inst_issued                                   Instructions Issued     4194502     4227494     4206364
        155                          dram_utilization                             Device Memory Utilization     Mid (6)    High (7)     Mid (6)
        155                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
        155                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       5.89%      24.55%      14.49%
        155                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)      10.80%      16.27%      13.45%
        155                   stall_memory_dependency                    Issue Stall Reasons (Data Request)      29.17%      47.96%      38.68%
        155                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.02%       0.00%
        155                                stall_sync                 Issue Stall Reasons (Synchronization)       3.39%       6.66%       5.20%
        155                               stall_other                           Issue Stall Reasons (Other)       1.41%       2.38%       1.88%
        155          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)       1.47%       9.14%       4.06%
        155                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)       6.68%      13.11%       9.97%
        155                         shared_efficiency                              Shared Memory Efficiency      21.67%      22.14%      21.95%
        155                                inst_fp_32                               FP Instructions(Single)    37746000    37746000    37746000
        155                                inst_fp_64                               FP Instructions(Double)           0           0           0
        155                              inst_integer                                  Integer Instructions    46332000    46332000    46332000
        155                          inst_bit_convert                              Bit-Convert Instructions      648000      648000      648000
        155                              inst_control                             Control-Flow Instructions    10530000    10530000    10530000
        155                        inst_compute_ld_st                               Load/Store Instructions    13446000    13446000    13446000
        155                                 inst_misc                                     Misc Instructions    12474000    12474000    12474000
        155           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
        155                               issue_slots                                           Issue Slots     4194502     4227494     4206364
        155                                 cf_issued                      Issued Control-Flow Instructions      435750      435750      435750
        155                               cf_executed                    Executed Control-Flow Instructions      435750      435750      435750
        155                               ldst_issued                        Issued Load/Store Instructions      546000      546000      546000
        155                             ldst_executed                      Executed Load/Store Instructions      546000      546000      546000
        155                       atomic_transactions                                   Atomic Transactions           0           0           0
        155           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
        155                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
        155                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
        155                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)      223240      223311      223274
        155                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)       0.53%       5.60%       2.44%
        155                        stall_not_selected                    Issue Stall Reasons (Not Selected)       7.63%      12.79%       9.81%
        155                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)      243000      243000      243000
        155             nvlink_total_data_transmitted                         NVLink Total Data Transmitted        1152        1152        1152
        155                nvlink_total_data_received                            NVLink Total Data Received         864         864         864
        155              nvlink_user_data_transmitted                          NVLink User Data Transmitted           0           0           0
        155                 nvlink_user_data_received                             NVLink User Data Received           0           0           0
        155          nvlink_overhead_data_transmitted                      NVLink Overhead Data Transmitted       1.00%       1.00%       1.00%
        155             nvlink_overhead_data_received                         NVLink Overhead Data Received       1.00%       1.00%       1.00%
        155      nvlink_total_nratom_data_transmitted                  NVLink Total Nratom Data Transmitted           0           0           0
        155       nvlink_user_nratom_data_transmitted                   NVLink User Nratom Data Transmitted           0           0           0
        155       nvlink_total_ratom_data_transmitted                   NVLink Total Ratom Data Transmitted           0           0           0
        155        nvlink_user_ratom_data_transmitted                    NVLink User Ratom Data Transmitted           0           0           0
        155       nvlink_total_write_data_transmitted                   NVLink Total Write Data Transmitted           0           0           0
        155        nvlink_user_write_data_transmitted                    NVLink User Write Data Transmitted           0           0           0
        155                nvlink_transmit_throughput                            NVLink Transmit Throughput  38.102MB/s  42.276MB/s  40.719MB/s
        155                 nvlink_receive_throughput                             NVLink Receive Throughput  28.576MB/s  31.707MB/s  30.539MB/s
        155       nvlink_total_response_data_received                   NVLink Total Response Data Received         288         288         288
        155        nvlink_user_response_data_received                    NVLink User Response Data Received           0           0           0
        155                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
        155                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
        155                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
        155                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
        155                                inst_fp_16                                 HP Instructions(Half)           0           0           0
        155                                       ipc                                          Executed IPC    0.502848    1.759222    1.104721
        155                                issued_ipc                                            Issued IPC    1.313268    1.781347    1.529956
        155                    issue_slot_utilization                                Issue Slot Utilization      32.83%      44.53%      38.25%
        155                             sm_efficiency                               Multiprocessor Activity      59.15%      88.42%      80.87%
        155                        achieved_occupancy                                    Achieved Occupancy    0.500672    0.524115    0.514806
        155                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle    3.789821    5.122976    4.493437
        155                        shared_utilization                             Shared Memory Utilization     Low (1)     Low (1)     Low (1)
        155                            l2_utilization                                  L2 Cache Utilization     Low (1)     Low (2)     Low (1)
        155                           tex_utilization                             Unified Cache Utilization     Low (1)     Low (2)     Low (1)
        155                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (2)     Low (3)     Low (2)
        155                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155                        tex_fu_utilization                     Texture Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155                    special_fu_utilization                     Special Function Unit Utilization     Low (2)     Low (2)     Low (2)
        155             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Low (3)     Mid (5)     Low (3)
        155           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
        155                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)       1.01%      15.26%      12.03%
        155                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
        155                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
        155                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
        155       nvlink_data_transmission_efficiency                   NVLink Data Transmission Efficiency       0.00%       0.00%       0.00%
        155            nvlink_data_receive_efficiency                        NVLink Data Receive Efficiency       0.00%       0.00%       0.00%
        155                            stall_sleeping                        Issue Stall Reasons (Sleeping)       0.00%       0.00%       0.00%
    Kernel: ptxcall_reduce_kernel_15
         17                             inst_per_warp                                 Instructions per warp  1.3257e+05  1.3257e+05  1.3257e+05
         17                         branch_efficiency                                     Branch Efficiency      99.99%      99.99%      99.99%
         17                 warp_execution_efficiency                             Warp Execution Efficiency      99.99%      99.99%      99.99%
         17         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency      94.74%      94.74%      94.74%
         17                      inst_replay_overhead                           Instruction Replay Overhead    0.000214    0.000295    0.000243
         17      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    1.000000    1.000500    1.000029
         17     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    1.000000    1.002500    1.001213
         17       local_load_transactions_per_request            Local Memory Load Transactions Per Request    0.000000    0.000000    0.000000
         17      local_store_transactions_per_request           Local Memory Store Transactions Per Request    0.000000    0.000000    0.000000
         17              gld_transactions_per_request                  Global Load Transactions Per Request    3.876269    3.891769    3.885582
         17              gst_transactions_per_request                 Global Store Transactions Per Request    1.000000    1.000000    1.000000
         17                 shared_store_transactions                             Shared Store Transactions        1600        1604        1601
         17                  shared_load_transactions                              Shared Load Transactions        2000        2001        2000
         17                   local_load_transactions                               Local Load Transactions           0           0           0
         17                  local_store_transactions                              Local Store Transactions           0           0           0
         17                          gld_transactions                              Global Load Transactions     1452151     1457958     1455639
         17                          gst_transactions                             Global Store Transactions          80          80          80
         17                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
         17                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
         17                      l2_read_transactions                                  L2 Read Transactions     1384705     1385594     1384921
         17                     l2_write_transactions                                 L2 Write Transactions         110       23916       15115
         17                    dram_read_transactions                       Device Memory Read Transactions      769507      769673      769553
         17                   dram_write_transactions                      Device Memory Write Transactions       85800      109458       96146
         17                           global_hit_rate                     Global Hit Rate in unified l1/tex       7.81%       7.84%       7.82%
         17                            local_hit_rate                                        Local Hit Rate       0.00%       0.00%       0.00%
         17                  gld_requested_throughput                      Requested Global Load Throughput  196.55GB/s  221.44GB/s  209.56GB/s
         17                  gst_requested_throughput                     Requested Global Store Throughput  1.3431MB/s  1.5132MB/s  1.4320MB/s
         17                            gld_throughput                                Global Load Throughput  191.09GB/s  215.45GB/s  203.57GB/s
         17                            gst_throughput                               Global Store Throughput  10.745MB/s  12.106MB/s  11.456MB/s
         17                     local_memory_overhead                                 Local Memory Overhead       2.90%       3.38%       3.08%
         17                        tex_cache_hit_rate                                Unified Cache Hit Rate       7.60%       7.60%       7.60%
         17                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)      44.73%      44.76%      44.74%
         17                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)      96.25%      96.25%      96.25%
         17                      dram_read_throughput                         Device Memory Read Throughput  100.93GB/s  113.73GB/s  107.62GB/s
         17                     dram_write_throughput                        Device Memory Write Throughput  11.774GB/s  16.085GB/s  13.446GB/s
         17                      tex_cache_throughput                        Unified cache to SM throughput  213.09GB/s  240.46GB/s  227.21GB/s
         17                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  181.61GB/s  204.61GB/s  193.63GB/s
         17                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  10.745MB/s  12.106MB/s  11.456MB/s
         17                        l2_read_throughput                                 L2 Throughput (Reads)  181.62GB/s  204.75GB/s  193.68GB/s
         17                       l2_write_throughput                                L2 Throughput (Writes)  15.358MB/s  3.4761GB/s  2.1139GB/s
         17                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         17                   sysmem_write_throughput                        System Memory Write Throughput  687.67KB/s  774.75KB/s  733.20KB/s
         17                     local_load_throughput                          Local Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         17                    local_store_throughput                         Local Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         17                    shared_load_throughput                         Shared Memory Load Throughput  1.0493GB/s  1.1822GB/s  1.1188GB/s
         17                   shared_store_throughput                        Shared Memory Store Throughput  861.20MB/s  968.56MB/s  917.61MB/s
         17                            gld_efficiency                         Global Memory Load Efficiency     102.78%     103.19%     102.94%
         17                            gst_efficiency                        Global Memory Store Efficiency      12.50%      12.50%      12.50%
         17                    tex_cache_transactions                      Unified cache to SM transactions      405173      407770      406169
         17                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
         17                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
         17                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
         17                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
         17                             flop_count_sp           Floating Point Operations(Single Precision)    18002400    18002400    18002400
         17                         flop_count_sp_add       Floating Point Operations(Single Precision Add)       20400       20400       20400
         17                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)     5994000     5994000     5994000
         17                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)     5994000     5994000     5994000
         17                     flop_count_sp_special   Floating Point Operations(Single Precision Special)    11988000    11988000    11988000
         17                             inst_executed                                 Instructions Executed    24302018    84844052    67037571
         17                               inst_issued                                   Instructions Issued    24307218    24309194    24307996
         17                          dram_utilization                             Device Memory Utilization     Low (2)     Low (2)     Low (2)
         17                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
         17                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       3.60%       4.50%       4.19%
         17                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)      39.16%      41.32%      40.17%
         17                   stall_memory_dependency                    Issue Stall Reasons (Data Request)      49.14%      51.81%      50.44%
         17                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.00%       0.00%
         17                                stall_sync                 Issue Stall Reasons (Synchronization)       0.55%       0.75%       0.64%
         17                               stall_other                           Issue Stall Reasons (Other)       1.53%       1.65%       1.60%
         17          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)       0.28%       0.35%       0.31%
         17                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)       1.15%       1.27%       1.21%
         17                         shared_efficiency                              Shared Memory Efficiency      70.89%      70.97%      70.93%
         17                                inst_fp_32                               FP Instructions(Single)    23996400    23996400    23996400
         17                                inst_fp_64                               FP Instructions(Double)           0           0           0
         17                              inst_integer                                  Integer Instructions   530431626   530431626   530431626
         17                          inst_bit_convert                              Bit-Convert Instructions    23976000    23976000    23976000
         17                              inst_control                             Control-Flow Instructions    42162800    42162800    42162800
         17                        inst_compute_ld_st                               Load/Store Instructions    12069840    12069840    12069840
         17                                 inst_misc                                     Misc Instructions    72624320    72624320    72624320
         17           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
         17                               issue_slots                                           Issue Slots    24307218    24309194    24307996
         17                                 cf_issued                      Issued Control-Flow Instructions     1883451     1883451     1883451
         17                               cf_executed                    Executed Control-Flow Instructions     1883451     1883451     1883451
         17                               ldst_issued                        Issued Load/Store Instructions      585139      585139      585139
         17                             ldst_executed                      Executed Load/Store Instructions      585139      585139      585139
         17                       atomic_transactions                                   Atomic Transactions           0           0           0
         17           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
         17                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
         17                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
         17                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)     1384603     1384636     1384619
         17                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)       0.00%       0.00%       0.00%
         17                        stall_not_selected                    Issue Stall Reasons (Not Selected)       1.34%       1.55%       1.44%
         17                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)          80          80          80
         17             nvlink_total_data_transmitted                         NVLink Total Data Transmitted        1152        1152        1152
         17                nvlink_total_data_received                            NVLink Total Data Received         864         864         864
         17              nvlink_user_data_transmitted                          NVLink User Data Transmitted           0           0           0
         17                 nvlink_user_data_received                             NVLink User Data Received           0           0           0
         17          nvlink_overhead_data_transmitted                      NVLink Overhead Data Transmitted       1.00%       1.00%       1.00%
         17             nvlink_overhead_data_received                         NVLink Overhead Data Received       1.00%       1.00%       1.00%
         17      nvlink_total_nratom_data_transmitted                  NVLink Total Nratom Data Transmitted           0           0           0
         17       nvlink_user_nratom_data_transmitted                   NVLink User Nratom Data Transmitted           0           0           0
         17       nvlink_total_ratom_data_transmitted                   NVLink Total Ratom Data Transmitted           0           0           0
         17        nvlink_user_ratom_data_transmitted                    NVLink User Ratom Data Transmitted           0           0           0
         17       nvlink_total_write_data_transmitted                   NVLink Total Write Data Transmitted           0           0           0
         17        nvlink_user_write_data_transmitted                    NVLink User Write Data Transmitted           0           0           0
         17                nvlink_transmit_throughput                            NVLink Transmit Throughput  4.8352MB/s  5.4475MB/s  5.1553MB/s
         17                 nvlink_receive_throughput                             NVLink Receive Throughput  3.6264MB/s  4.0856MB/s  3.8665MB/s
         17       nvlink_total_response_data_received                   NVLink Total Response Data Received         288         288         288
         17        nvlink_user_response_data_received                    NVLink User Response Data Received           0           0           0
         17                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
         17                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
         17                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
         17                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
         17                                inst_fp_16                                 HP Instructions(Half)           0           0           0
         17                                       ipc                                          Executed IPC    0.542897    1.053843    0.904493
         17                                issued_ipc                                            Issued IPC    1.009397    1.054096    1.030456
         17                    issue_slot_utilization                                Issue Slot Utilization      25.23%      26.35%      25.76%
         17                             sm_efficiency                               Multiprocessor Activity      79.60%      96.70%      94.29%
         17                        achieved_occupancy                                    Achieved Occupancy    0.124956    0.124967    0.124962
         17                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle    1.104713    1.162003    1.132825
         17                        shared_utilization                             Shared Memory Utilization     Low (1)     Low (1)     Low (1)
         17                            l2_utilization                                  L2 Cache Utilization     Low (1)     Low (1)     Low (1)
         17                           tex_utilization                             Unified Cache Utilization     Low (1)     Low (1)     Low (1)
         17                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (1)     Low (1)     Low (1)
         17                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
         17                        tex_fu_utilization                     Texture Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
         17                    special_fu_utilization                     Special Function Unit Utilization     Low (1)     Low (1)     Low (1)
         17             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
         17           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Low (3)     Low (3)     Low (3)
         17           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
         17                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
         17                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)       0.07%       0.59%       0.43%
         17                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
         17                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
         17                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
         17       nvlink_data_transmission_efficiency                   NVLink Data Transmission Efficiency       0.00%       0.00%       0.00%
         17            nvlink_data_receive_efficiency                        NVLink Data Receive Efficiency       0.00%       0.00%       0.00%
         17                            stall_sleeping                        Issue Stall Reasons (Sleeping)       0.00%       0.00%       0.00%
    Kernel: ptxcall_gpu_kernel_nodal_update_auxiliary_state__6
        155                             inst_per_warp                                 Instructions per warp  1.4570e+03  1.4570e+03  1.4570e+03
        155                         branch_efficiency                                     Branch Efficiency     100.00%     100.00%     100.00%
        155                 warp_execution_efficiency                             Warp Execution Efficiency      96.43%      96.43%      96.43%
        155         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency      92.34%      92.34%      92.34%
        155                      inst_replay_overhead                           Instruction Replay Overhead    0.001757    0.002737    0.001949
        155      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    0.000000    0.000000    0.000000
        155     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    0.000000    0.000000    0.000000
        155       local_load_transactions_per_request            Local Memory Load Transactions Per Request    0.000000    0.000000    0.000000
        155      local_store_transactions_per_request           Local Memory Store Transactions Per Request    0.000000    0.000000    0.000000
        155              gld_transactions_per_request                  Global Load Transactions Per Request    3.484399    3.503443    3.492991
        155              gst_transactions_per_request                 Global Store Transactions Per Request    3.857143    3.857143    3.857143
        155                 shared_store_transactions                             Shared Store Transactions           0           0           0
        155                  shared_load_transactions                              Shared Load Transactions           0           0           0
        155                   local_load_transactions                               Local Load Transactions           0           0           0
        155                  local_store_transactions                              Local Store Transactions           0           0           0
        155                          gld_transactions                              Global Load Transactions      951241      956440      953586
        155                          gst_transactions                             Global Store Transactions      972000      972000      972000
        155                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
        155                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
        155                      l2_read_transactions                                  L2 Read Transactions     1038567     1040403     1039379
        155                     l2_write_transactions                                 L2 Write Transactions      972025      995930      982210
        155                    dram_read_transactions                       Device Memory Read Transactions     1037623     1039333     1038860
        155                   dram_write_transactions                      Device Memory Write Transactions      969012      993763      982481
        155                           global_hit_rate                     Global Hit Rate in unified l1/tex      49.38%      49.52%      49.45%
        155                            local_hit_rate                                        Local Hit Rate       0.00%       0.00%       0.00%
        155                  gld_requested_throughput                      Requested Global Load Throughput  348.81GB/s  358.76GB/s  354.87GB/s
        155                  gst_requested_throughput                     Requested Global Store Throughput  326.69GB/s  336.01GB/s  332.37GB/s
        155                            gld_throughput                                Global Load Throughput  320.56GB/s  330.06GB/s  326.07GB/s
        155                            gst_throughput                               Global Store Throughput  326.69GB/s  336.01GB/s  332.37GB/s
        155                     local_memory_overhead                                 Local Memory Overhead      51.48%      51.70%      51.58%
        155                        tex_cache_hit_rate                                Unified Cache Hit Rate       0.17%       0.17%       0.17%
        155                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)      13.09%      13.17%      13.11%
        155                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)      94.67%      96.05%      95.38%
        155                      dram_read_throughput                         Device Memory Read Throughput  349.10GB/s  359.25GB/s  355.23GB/s
        155                     dram_write_throughput                        Device Memory Write Throughput  328.44GB/s  343.32GB/s  335.95GB/s
        155                      tex_cache_throughput                        Unified cache to SM throughput  391.18GB/s  402.34GB/s  397.97GB/s
        155                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  349.00GB/s  358.96GB/s  355.06GB/s
        155                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  326.69GB/s  336.01GB/s  332.37GB/s
        155                        l2_read_throughput                                 L2 Throughput (Reads)  349.68GB/s  359.57GB/s  355.41GB/s
        155                       l2_write_throughput                                L2 Throughput (Writes)  328.01GB/s  344.20GB/s  335.86GB/s
        155                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                   sysmem_write_throughput                        System Memory Write Throughput  1.7209MB/s  1.7699MB/s  1.7507MB/s
        155                     local_load_throughput                          Local Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                    local_store_throughput                         Local Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                    shared_load_throughput                         Shared Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                   shared_store_throughput                        Shared Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                            gld_efficiency                         Global Memory Load Efficiency     108.51%     109.10%     108.83%
        155                            gst_efficiency                        Global Memory Store Efficiency     100.00%     100.00%     100.00%
        155                    tex_cache_transactions                      Unified cache to SM transactions      290966      290966      290966
        155                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
        155                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
        155                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
        155                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
        155                             flop_count_sp           Floating Point Operations(Single Precision)    20088000    20088000    20088000
        155                         flop_count_sp_add       Floating Point Operations(Single Precision Add)     4131000     4131000     4131000
        155                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)     6966000     6966000     6966000
        155                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)     2025000     2025000     2025000
        155                     flop_count_sp_special   Floating Point Operations(Single Precision Special)     1134000     1134000     1134000
        155                             inst_executed                                 Instructions Executed     2488500     7649250     5218703
        155                               inst_issued                                   Instructions Issued     2492872     2495312     2493368
        155                          dram_utilization                             Device Memory Utilization    High (9)    High (9)    High (9)
        155                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
        155                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       2.24%      11.92%       6.24%
        155                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)       8.04%      10.73%       9.27%
        155                   stall_memory_dependency                    Issue Stall Reasons (Data Request)      41.85%      50.95%      46.67%
        155                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.00%       0.00%
        155                                stall_sync                 Issue Stall Reasons (Synchronization)       0.00%       0.00%       0.00%
        155                               stall_other                           Issue Stall Reasons (Other)       0.29%       0.39%       0.34%
        155          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)       0.94%       1.49%       1.17%
        155                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)       0.48%       0.65%       0.55%
        155                         shared_efficiency                              Shared Memory Efficiency       0.00%       0.00%       0.00%
        155                                inst_fp_32                               FP Instructions(Single)    17334000    17334000    17334000
        155                                inst_fp_64                               FP Instructions(Double)           0           0           0
        155                              inst_integer                                  Integer Instructions    32400000    32400000    32400000
        155                          inst_bit_convert                              Bit-Convert Instructions      324000      324000      324000
        155                              inst_control                             Control-Flow Instructions     4050000     4050000     4050000
        155                        inst_compute_ld_st                               Load/Store Instructions    16200000    16200000    16200000
        155                                 inst_misc                                     Misc Instructions     3402000     3402000     3402000
        155           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
        155                               issue_slots                                           Issue Slots     2492872     2495312     2493368
        155                                 cf_issued                      Issued Control-Flow Instructions      173250      173250      173250
        155                               cf_executed                    Executed Control-Flow Instructions      173250      173250      173250
        155                               ldst_issued                        Issued Load/Store Instructions      556500      556500      556500
        155                             ldst_executed                      Executed Load/Store Instructions      556500      556500      556500
        155                       atomic_transactions                                   Atomic Transactions           0           0           0
        155           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
        155                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
        155                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
        155                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)     1038375     1038375     1038375
        155                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)      31.65%      38.22%      34.79%
        155                        stall_not_selected                    Issue Stall Reasons (Not Selected)       0.81%       1.15%       0.98%
        155                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)      972000      972000      972000
        155             nvlink_total_data_transmitted                         NVLink Total Data Transmitted        1152        1152        1152
        155                nvlink_total_data_received                            NVLink Total Data Received         864         864         864
        155              nvlink_user_data_transmitted                          NVLink User Data Transmitted           0           0           0
        155                 nvlink_user_data_received                             NVLink User Data Received           0           0           0
        155          nvlink_overhead_data_transmitted                      NVLink Overhead Data Transmitted       1.00%       1.00%       1.00%
        155             nvlink_overhead_data_received                         NVLink Overhead Data Received       1.00%       1.00%       1.00%
        155      nvlink_total_nratom_data_transmitted                  NVLink Total Nratom Data Transmitted           0           0           0
        155       nvlink_user_nratom_data_transmitted                   NVLink User Nratom Data Transmitted           0           0           0
        155       nvlink_total_ratom_data_transmitted                   NVLink Total Ratom Data Transmitted           0           0           0
        155        nvlink_user_ratom_data_transmitted                    NVLink User Ratom Data Transmitted           0           0           0
        155       nvlink_total_write_data_transmitted                   NVLink Total Write Data Transmitted           0           0           0
        155        nvlink_user_write_data_transmitted                    NVLink User Write Data Transmitted           0           0           0
        155                nvlink_transmit_throughput                            NVLink Transmit Throughput  12.390MB/s  12.744MB/s  12.605MB/s
        155                 nvlink_receive_throughput                             NVLink Receive Throughput  9.2926MB/s  9.5576MB/s  9.4539MB/s
        155       nvlink_total_response_data_received                   NVLink Total Response Data Received         288        1056         298
        155        nvlink_user_response_data_received                    NVLink User Response Data Received           0           0           0
        155                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
        155                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
        155                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
        155                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
        155                                inst_fp_16                                 HP Instructions(Half)           0           0           0
        155                                       ipc                                          Executed IPC    0.249160    0.543066    0.373083
        155                                issued_ipc                                            Issued IPC    0.248012    0.303364    0.272104
        155                    issue_slot_utilization                                Issue Slot Utilization       6.20%       7.58%       6.80%
        155                             sm_efficiency                               Multiprocessor Activity      81.64%      95.34%      90.83%
        155                        achieved_occupancy                                    Achieved Occupancy    0.105718    0.106256    0.106001
        155                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle    0.302730    0.373152    0.333547
        155                        shared_utilization                             Shared Memory Utilization    Idle (0)    Idle (0)    Idle (0)
        155                            l2_utilization                                  L2 Cache Utilization     Low (1)     Low (2)     Low (1)
        155                           tex_utilization                             Unified Cache Utilization     Low (1)     Low (1)     Low (1)
        155                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155                        tex_fu_utilization                     Texture Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155                    special_fu_utilization                     Special Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
        155                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)       0.91%       1.71%       1.48%
        155                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
        155                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
        155                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
        155       nvlink_data_transmission_efficiency                   NVLink Data Transmission Efficiency       0.00%       0.00%       0.00%
        155            nvlink_data_receive_efficiency                        NVLink Data Receive Efficiency       0.00%       0.00%       0.00%
        155                            stall_sleeping                        Issue Stall Reasons (Sleeping)       0.00%       0.00%       0.00%
    Kernel: ptxcall_gpu_volume_gradients__16
        155                             inst_per_warp                                 Instructions per warp  1.7590e+04  1.7590e+04  1.7590e+04
        155                         branch_efficiency                                     Branch Efficiency     100.00%     100.00%     100.00%
        155                 warp_execution_efficiency                             Warp Execution Efficiency      56.25%      56.25%      56.25%
        155         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency      55.76%      55.76%      55.76%
        155                      inst_replay_overhead                           Instruction Replay Overhead    0.000718    0.000851    0.000790
        155      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    1.117239    1.125548    1.121521
        155     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    1.011719    1.016370    1.013733
        155       local_load_transactions_per_request            Local Memory Load Transactions Per Request    2.500000    2.500000    2.500000
        155      local_store_transactions_per_request           Local Memory Store Transactions Per Request    2.500000    2.500000    2.500000
        155              gld_transactions_per_request                  Global Load Transactions Per Request    2.651715    2.655303    2.653849
        155              gst_transactions_per_request                 Global Store Transactions Per Request    2.750000    2.750000    2.750000
        155                 shared_store_transactions                             Shared Store Transactions      374842      376565      375588
        155                  shared_load_transactions                              Shared Load Transactions     3831012     3859505     3845696
        155                   local_load_transactions                               Local Load Transactions     9060000     9060000     9060000
        155                  local_store_transactions                              Local Store Transactions    12765000    12765000    12765000
        155                          gld_transactions                              Global Load Transactions     5039584     5046404     5043640
        155                          gst_transactions                             Global Store Transactions     2920500     2920500     2920500
        155                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
        155                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
        155                      l2_read_transactions                                  L2 Read Transactions    10820602    10878616    10848083
        155                     l2_write_transactions                                 L2 Write Transactions    17984664    18025907    18003286
        155                    dram_read_transactions                       Device Memory Read Transactions    14345828    14434798    14387443
        155                   dram_write_transactions                      Device Memory Write Transactions    12843700    12874897    12858857
        155                           global_hit_rate                     Global Hit Rate in unified l1/tex      50.20%      50.65%      50.42%
        155                            local_hit_rate                                        Local Hit Rate      22.62%      22.88%      22.74%
        155                  gld_requested_throughput                      Requested Global Load Throughput  63.124GB/s  64.419GB/s  63.834GB/s
        155                  gst_requested_throughput                     Requested Global Store Throughput  35.274GB/s  35.997GB/s  35.671GB/s
        155                            gld_throughput                                Global Load Throughput  74.467GB/s  75.992GB/s  75.292GB/s
        155                            gst_throughput                               Global Store Throughput  43.112GB/s  43.997GB/s  43.597GB/s
        155                     local_memory_overhead                                 Local Memory Overhead      85.02%      85.18%      85.11%
        155                        tex_cache_hit_rate                                Unified Cache Hit Rate      11.46%      11.58%      11.51%
        155                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)       7.04%       7.47%       7.26%
        155                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)      35.23%      35.89%      35.55%
        155                      dram_read_throughput                         Device Memory Read Throughput  212.30GB/s  217.15GB/s  214.78GB/s
        155                     dram_write_throughput                        Device Memory Write Throughput  189.64GB/s  193.92GB/s  191.96GB/s
        155                      tex_cache_throughput                        Unified cache to SM throughput  534.11GB/s  544.96GB/s  540.07GB/s
        155                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  160.04GB/s  163.32GB/s  161.77GB/s
        155                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  231.55GB/s  236.30GB/s  234.15GB/s
        155                        l2_read_throughput                                 L2 Throughput (Reads)  159.98GB/s  163.77GB/s  161.94GB/s
        155                       l2_write_throughput                                L2 Throughput (Writes)  265.97GB/s  271.07GB/s  268.75GB/s
        155                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                   sysmem_write_throughput                        System Memory Write Throughput  77.395KB/s  78.982KB/s  78.265KB/s
        155                     local_load_throughput                          Local Memory Load Throughput  133.74GB/s  136.49GB/s  135.25GB/s
        155                    local_store_throughput                         Local Memory Store Throughput  188.44GB/s  192.30GB/s  190.56GB/s
        155                    shared_load_throughput                         Shared Memory Load Throughput  227.25GB/s  231.62GB/s  229.63GB/s
        155                   shared_store_throughput                        Shared Memory Store Throughput  22.202GB/s  22.625GB/s  22.427GB/s
        155                            gld_efficiency                         Global Memory Load Efficiency      84.74%      84.85%      84.78%
        155                            gst_efficiency                        Global Memory Store Efficiency      81.82%      81.82%      81.82%
        155                    tex_cache_transactions                      Unified cache to SM transactions     9042362     9046345     9044503
        155                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
        155                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
        155                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
        155                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
        155                             flop_count_sp           Floating Point Operations(Single Precision)   254826000   254826000   254826000
        155                         flop_count_sp_add       Floating Point Operations(Single Precision Add)     1134000     1134000     1134000
        155                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)   123444000   123444000   123444000
        155                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)     6804000     6804000     6804000
        155                     flop_count_sp_special   Floating Point Operations(Single Precision Special)      513000      513000      513000
        155                             inst_executed                                 Instructions Executed    22279500    26385000    24504416
        155                               inst_issued                                   Instructions Issued    22295393    22298431    22297140
        155                          dram_utilization                             Device Memory Utilization     Mid (5)     Mid (5)     Mid (5)
        155                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
        155                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       0.24%       2.17%       0.99%
        155                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)       2.10%       2.55%       2.28%
        155                   stall_memory_dependency                    Issue Stall Reasons (Data Request)      63.37%      66.48%      64.92%
        155                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.00%       0.00%
        155                                stall_sync                 Issue Stall Reasons (Synchronization)       0.69%       0.95%       0.81%
        155                               stall_other                           Issue Stall Reasons (Other)       0.06%       0.11%       0.09%
        155          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)       0.04%       0.08%       0.06%
        155                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)       1.68%       2.14%       1.90%
        155                         shared_efficiency                              Shared Memory Efficiency      19.14%      19.28%      19.21%
        155                                inst_fp_32                               FP Instructions(Single)   132219000   132219000   132219000
        155                                inst_fp_64                               FP Instructions(Double)           0           0           0
        155                              inst_integer                                  Integer Instructions    21222000    21222000    21222000
        155                          inst_bit_convert                              Bit-Convert Instructions       54000       54000       54000
        155                              inst_control                             Control-Flow Instructions     1998000     1998000     1998000
        155                        inst_compute_ld_st                               Load/Store Instructions   233469000   233469000   233469000
        155                                 inst_misc                                     Misc Instructions    11124000    11124000    11124000
        155           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
        155                               issue_slots                                           Issue Slots    22295393    22298431    22297140
        155                                 cf_issued                      Issued Control-Flow Instructions      135000      135000      135000
        155                               cf_executed                    Executed Control-Flow Instructions      135000      135000      135000
        155                               ldst_issued                        Issued Load/Store Instructions    13014000    13014000    13014000
        155                             ldst_executed                      Executed Load/Store Instructions    13014000    13014000    13014000
        155                       atomic_transactions                                   Atomic Transactions           0           0           0
        155           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
        155                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
        155                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
        155                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)    10814596    10850276    10836560
        155                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)      27.44%      30.06%      28.76%
        155                        stall_not_selected                    Issue Stall Reasons (Not Selected)       0.19%       0.24%       0.21%
        155                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)    15685500    15685500    15685500
        155             nvlink_total_data_transmitted                         NVLink Total Data Transmitted        1152        1152        1152
        155                nvlink_total_data_received                            NVLink Total Data Received         864         864         864
        155              nvlink_user_data_transmitted                          NVLink User Data Transmitted           0           0           0
        155                 nvlink_user_data_received                             NVLink User Data Received           0           0           0
        155          nvlink_overhead_data_transmitted                      NVLink Overhead Data Transmitted       1.00%       1.00%       1.00%
        155             nvlink_overhead_data_received                         NVLink Overhead Data Received       1.00%       1.00%       1.00%
        155      nvlink_total_nratom_data_transmitted                  NVLink Total Nratom Data Transmitted           0           0           0
        155       nvlink_user_nratom_data_transmitted                   NVLink User Nratom Data Transmitted           0           0           0
        155       nvlink_total_ratom_data_transmitted                   NVLink Total Ratom Data Transmitted           0           0           0
        155        nvlink_user_ratom_data_transmitted                    NVLink User Ratom Data Transmitted           0           0           0
        155       nvlink_total_write_data_transmitted                   NVLink Total Write Data Transmitted           0           0           0
        155        nvlink_user_write_data_transmitted                    NVLink User Write Data Transmitted           0           0           0
        155                nvlink_transmit_throughput                            NVLink Transmit Throughput  557.24KB/s  568.68KB/s  563.51KB/s
        155                 nvlink_receive_throughput                             NVLink Receive Throughput  417.93KB/s  426.51KB/s  422.63KB/s
        155       nvlink_total_response_data_received                   NVLink Total Response Data Received         288         288         288
        155        nvlink_user_response_data_received                    NVLink User Response Data Received           0           0           0
        155                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
        155                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
        155                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
        155                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
        155                                inst_fp_16                                 HP Instructions(Half)           0           0           0
        155                                       ipc                                          Executed IPC    0.096423    0.132104    0.111972
        155                                issued_ipc                                            Issued IPC    0.096503    0.116254    0.105112
        155                    issue_slot_utilization                                Issue Slot Utilization       2.41%       2.91%       2.63%
        155                             sm_efficiency                               Multiprocessor Activity      91.33%      93.08%      92.30%
        155                        achieved_occupancy                                    Achieved Occupancy    0.121941    0.123391    0.122706
        155                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle    0.112371    0.134675    0.121493
        155                        shared_utilization                             Shared Memory Utilization     Low (1)     Low (1)     Low (1)
        155                            l2_utilization                                  L2 Cache Utilization     Low (1)     Low (1)     Low (1)
        155                           tex_utilization                             Unified Cache Utilization     Low (1)     Low (1)     Low (1)
        155                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155                        tex_fu_utilization                     Texture Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155                    special_fu_utilization                     Special Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
        155                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)       0.78%       0.95%       0.86%
        155                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
        155                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
        155                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
        155       nvlink_data_transmission_efficiency                   NVLink Data Transmission Efficiency       0.00%       0.00%       0.00%
        155            nvlink_data_receive_efficiency                        NVLink Data Receive Efficiency       0.00%       0.00%       0.00%
        155                            stall_sleeping                        Issue Stall Reasons (Sleeping)       0.00%       0.00%       0.00%
    Kernel: ptxcall_gpu_band_forward_kernel__25
        104                             inst_per_warp                                 Instructions per warp  4.0182e+04  4.0182e+04  4.0182e+04
        104                         branch_efficiency                                     Branch Efficiency     100.00%     100.00%     100.00%
        104                 warp_execution_efficiency                             Warp Execution Efficiency      56.25%      56.25%      56.25%
        104         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency      55.63%      55.63%      55.63%
        104                      inst_replay_overhead                           Instruction Replay Overhead    0.000549    0.000717    0.000631
        104      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    0.000000    0.000000    0.000000
        104     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    0.000000    0.000000    0.000000
        104       local_load_transactions_per_request            Local Memory Load Transactions Per Request    2.500000    2.500000    2.500000
        104      local_store_transactions_per_request           Local Memory Store Transactions Per Request    2.500000    2.500000    2.500000
        103              gld_transactions_per_request                  Global Load Transactions Per Request    2.643685    2.680684    2.660489
        103              gst_transactions_per_request                 Global Store Transactions Per Request    2.750000    2.750000    2.750000
        103                 shared_store_transactions                             Shared Store Transactions           0           0           0
        103                  shared_load_transactions                              Shared Load Transactions           0           0           0
        103                   local_load_transactions                               Local Load Transactions     2546250     2546250     2546250
        103                  local_store_transactions                              Local Store Transactions     2530500     2530500     2530500
        103                          gld_transactions                              Global Load Transactions     3568975     3618923     3591660
        103                          gst_transactions                             Global Store Transactions      123750      123750      123750
        103                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
        103                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
        103                      l2_read_transactions                                  L2 Read Transactions     5239255     5266701     5251905
        103                     l2_write_transactions                                 L2 Write Transactions     3156310     3181032     3166212
        103                    dram_read_transactions                       Device Memory Read Transactions     6067090     6081890     6074513
        103                   dram_write_transactions                      Device Memory Write Transactions     2615284     2641375     2628610
        103                           global_hit_rate                     Global Hit Rate in unified l1/tex      19.25%      19.51%      19.39%
        103                            local_hit_rate                                        Local Hit Rate       8.64%       8.86%       8.75%
        103                  gld_requested_throughput                      Requested Global Load Throughput  72.877GB/s  78.486GB/s  76.349GB/s
        103                  gst_requested_throughput                     Requested Global Store Throughput  2.4292GB/s  2.6162GB/s  2.5450GB/s
        103                            gld_throughput                                Global Load Throughput  86.042GB/s  93.069GB/s  90.278GB/s
        103                            gst_throughput                               Global Store Throughput  2.9691GB/s  3.1976GB/s  3.1105GB/s
        103                     local_memory_overhead                                 Local Memory Overhead      61.79%      62.29%      62.07%
        103                        tex_cache_hit_rate                                Unified Cache Hit Rate      11.36%      11.46%      11.40%
        103                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)      13.79%      14.09%      13.95%
        103                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)       5.61%       5.78%       5.68%
        103                      dram_read_throughput                         Device Memory Read Throughput  145.73GB/s  157.04GB/s  152.69GB/s
        103                     dram_write_throughput                        Device Memory Write Throughput  62.918GB/s  68.072GB/s  66.071GB/s
        103                      tex_cache_throughput                        Unified cache to SM throughput  243.16GB/s  261.89GB/s  254.72GB/s
        103                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  125.81GB/s  135.50GB/s  131.78GB/s
        103                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  63.682GB/s  68.584GB/s  66.716GB/s
        103                        l2_read_throughput                                 L2 Throughput (Reads)  125.85GB/s  135.80GB/s  132.01GB/s
        103                       l2_write_throughput                                L2 Throughput (Writes)  75.741GB/s  82.082GB/s  79.584GB/s
        103                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        103                   sysmem_write_throughput                        System Memory Write Throughput  125.79KB/s  135.47KB/s  131.78KB/s
        103                     local_load_throughput                          Local Memory Load Throughput  61.091GB/s  65.793GB/s  64.001GB/s
        103                    local_store_throughput                         Local Memory Store Throughput  60.713GB/s  65.386GB/s  63.605GB/s
        103                    shared_load_throughput                         Shared Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        103                   shared_store_throughput                        Shared Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        103                            gld_efficiency                         Global Memory Load Efficiency      83.93%      85.11%      84.57%
        103                            gst_efficiency                        Global Memory Store Efficiency      81.82%      81.82%      81.82%
        103                    tex_cache_transactions                      Unified cache to SM transactions     2529908     2537878     2533502
        103                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
        103                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
        103                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
        103                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
        103                             flop_count_sp           Floating Point Operations(Single Precision)    46980000    46980000    46980000
        103                         flop_count_sp_add       Floating Point Operations(Single Precision Add)           0           0           0
        103                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)    23490000    23490000    23490000
        103                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)           0           0           0
        103                     flop_count_sp_special   Floating Point Operations(Single Precision Special)        5400        5400        5400
        103                             inst_executed                                 Instructions Executed     9425100    12054600    10701556
        103                               inst_issued                                   Instructions Issued     9430276     9431857     9431057
        103                          dram_utilization                             Device Memory Utilization     Low (3)     Low (3)     Low (3)
        103                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
        103                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       0.09%       0.29%       0.17%
        103                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)       3.18%       3.49%       3.30%
        103                   stall_memory_dependency                    Issue Stall Reasons (Data Request)      96.05%      96.59%      96.38%
        103                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.00%       0.00%
        103                                stall_sync                 Issue Stall Reasons (Synchronization)       0.00%       0.00%       0.00%
        103                               stall_other                           Issue Stall Reasons (Other)       0.03%       0.04%       0.04%
        103          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)       0.05%       0.10%       0.07%
        103                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)       0.01%       0.01%       0.01%
        103                         shared_efficiency                              Shared Memory Efficiency       0.00%       0.00%       0.00%
        103                                inst_fp_32                               FP Instructions(Single)    23495400    23495400    23495400
        103                                inst_fp_64                               FP Instructions(Double)           0           0           0
        103                              inst_integer                                  Integer Instructions    79401600    79401600    79401600
        103                          inst_bit_convert                              Bit-Convert Instructions       10800       10800       10800
        103                              inst_control                             Control-Flow Instructions      199800      199800      199800
        103                        inst_compute_ld_st                               Load/Store Instructions    61630200    61630200    61630200
        103                                 inst_misc                                     Misc Instructions     4179600     4179600     4179600
        103           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
        103                               issue_slots                                           Issue Slots     9430276     9431857     9431057
        103                                 cf_issued                      Issued Control-Flow Instructions       63000       63000       63000
        103                               cf_executed                    Executed Control-Flow Instructions       63000       63000       63000
        103                               ldst_issued                        Issued Load/Store Instructions     3426000     3426000     3426000
        103                             ldst_executed                      Executed Load/Store Instructions     3426000     3426000     3426000
        103                       atomic_transactions                                   Atomic Transactions           0           0           0
        103           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
        103                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
        103                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
        103                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)     5237522     5245988     5242609
        103                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)       0.01%       0.03%       0.02%
        103                        stall_not_selected                    Issue Stall Reasons (Not Selected)       0.01%       0.02%       0.02%
        103                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)     2654250     2654250     2654250
        103             nvlink_total_data_transmitted                         NVLink Total Data Transmitted        1152        1152        1152
        103                nvlink_total_data_received                            NVLink Total Data Received         864         864         864
        103              nvlink_user_data_transmitted                          NVLink User Data Transmitted           0           0           0
        103                 nvlink_user_data_received                             NVLink User Data Received           0           0           0
        103          nvlink_overhead_data_transmitted                      NVLink Overhead Data Transmitted       1.00%       1.00%       1.00%
        103             nvlink_overhead_data_received                         NVLink Overhead Data Received       1.00%       1.00%       1.00%
        103      nvlink_total_nratom_data_transmitted                  NVLink Total Nratom Data Transmitted           0           0           0
        103       nvlink_user_nratom_data_transmitted                   NVLink User Nratom Data Transmitted           0           0           0
        103       nvlink_total_ratom_data_transmitted                   NVLink Total Ratom Data Transmitted           0           0           0
        103        nvlink_user_ratom_data_transmitted                    NVLink User Ratom Data Transmitted           0           0           0
        103       nvlink_total_write_data_transmitted                   NVLink Total Write Data Transmitted           0           0           0
        103        nvlink_user_write_data_transmitted                    NVLink User Write Data Transmitted           0           0           0
        103                nvlink_transmit_throughput                            NVLink Transmit Throughput  905.68KB/s  975.39KB/s  948.83KB/s
        103                 nvlink_receive_throughput                             NVLink Receive Throughput  679.26KB/s  731.55KB/s  711.62KB/s
        103       nvlink_total_response_data_received                   NVLink Total Response Data Received         288         480         291
        103        nvlink_user_response_data_received                    NVLink User Response Data Received           0           0           0
        103                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
        103                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
        103                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
        103                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
        103                                inst_fp_16                                 HP Instructions(Half)           0           0           0
        103                                       ipc                                          Executed IPC    0.068260    0.090479    0.078439
        103                                issued_ipc                                            Issued IPC    0.068197    0.074746    0.070693
        103                    issue_slot_utilization                                Issue Slot Utilization       1.70%       1.87%       1.77%
        103                             sm_efficiency                               Multiprocessor Activity      96.12%      97.51%      96.99%
        103                        achieved_occupancy                                    Achieved Occupancy    0.058575    0.058678    0.058625
        103                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle    0.068824    0.075366    0.071368
        103                        shared_utilization                             Shared Memory Utilization    Idle (0)    Idle (0)    Idle (0)
        103                            l2_utilization                                  L2 Cache Utilization     Low (1)     Low (1)     Low (1)
        103                           tex_utilization                             Unified Cache Utilization     Low (1)     Low (1)     Low (1)
        103                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (1)     Low (1)     Low (1)
        103                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
        103                        tex_fu_utilization                     Texture Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        103                    special_fu_utilization                     Special Function Unit Utilization     Low (1)     Low (1)     Low (1)
        103             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        103           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Low (1)     Low (1)     Low (1)
        103           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        103                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
        103                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)       0.25%       0.28%       0.27%
        103                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
        103                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
        103                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
        103       nvlink_data_transmission_efficiency                   NVLink Data Transmission Efficiency       0.00%       0.00%       0.00%
        103            nvlink_data_receive_efficiency                        NVLink Data Receive Efficiency       0.00%       0.00%       0.00%
        103                            stall_sleeping                        Issue Stall Reasons (Sleeping)       0.00%       0.00%       0.00%
    Kernel: ptxcall_gpu_volume_divergence_of_gradients__18
        155                             inst_per_warp                                 Instructions per warp  1.0280e+03  1.0280e+03  1.0280e+03
        155                         branch_efficiency                                     Branch Efficiency     100.00%     100.00%     100.00%
        155                 warp_execution_efficiency                             Warp Execution Efficiency      96.43%      96.43%      96.43%
        155         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency      93.85%      93.85%      93.85%
        155                      inst_replay_overhead                           Instruction Replay Overhead    0.005245    0.011328    0.008139
        155      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    1.560938    1.577383    1.568644
        155     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    1.054945    1.054945    1.054945
        155       local_load_transactions_per_request            Local Memory Load Transactions Per Request    0.000000    0.000000    0.000000
        155      local_store_transactions_per_request           Local Memory Store Transactions Per Request    0.000000    0.000000    0.000000
        155              gld_transactions_per_request                  Global Load Transactions Per Request    3.604080    3.634386    3.617958
        155              gst_transactions_per_request                 Global Store Transactions Per Request    3.857143    3.857143    3.857143
        155                 shared_store_transactions                             Shared Store Transactions       72000       72000       72000
        155                  shared_load_transactions                              Shared Load Transactions      983391      993751      988245
        155                   local_load_transactions                               Local Load Transactions           0           0           0
        155                  local_store_transactions                              Local Store Transactions           0           0           0
        155                          gld_transactions                              Global Load Transactions      359507      362530      360891
        155                          gst_transactions                             Global Store Transactions       81000       81000       81000
        155                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
        155                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
        155                      l2_read_transactions                                  L2 Read Transactions      365246      367318      366195
        155                     l2_write_transactions                                 L2 Write Transactions       81025      104883       91831
        155                    dram_read_transactions                       Device Memory Read Transactions      366015      366441      366174
        155                   dram_write_transactions                      Device Memory Write Transactions      103259      138183      120506
        155                           global_hit_rate                     Global Hit Rate in unified l1/tex      12.13%      12.47%      12.30%
        155                            local_hit_rate                                        Local Hit Rate       0.00%       0.00%       0.00%
        155                  gld_requested_throughput                      Requested Global Load Throughput  406.27GB/s  436.85GB/s  423.26GB/s
        155                  gst_requested_throughput                     Requested Global Store Throughput  85.530GB/s  91.968GB/s  89.106GB/s
        155                            gld_throughput                                Global Load Throughput  381.04GB/s  409.63GB/s  397.01GB/s
        155                            gst_throughput                               Global Store Throughput  85.530GB/s  91.968GB/s  89.106GB/s
        155                     local_memory_overhead                                 Local Memory Overhead      12.82%      13.45%      13.13%
        155                        tex_cache_hit_rate                                Unified Cache Hit Rate       5.13%       5.15%       5.14%
        155                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)      12.88%      12.91%      12.89%
        155                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)      38.76%      38.97%      38.87%
        155                      dram_read_throughput                         Device Memory Read Throughput  386.49GB/s  416.00GB/s  402.82GB/s
        155                     dram_write_throughput                        Device Memory Write Throughput  115.57GB/s  155.01GB/s  132.57GB/s
        155                      tex_cache_throughput                        Unified cache to SM throughput  3122.7GB/s  3357.6GB/s  3253.4GB/s
        155                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  385.52GB/s  414.51GB/s  401.62GB/s
        155                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  85.530GB/s  91.968GB/s  89.106GB/s
        155                        l2_read_throughput                                 L2 Throughput (Reads)  385.72GB/s  415.71GB/s  402.84GB/s
        155                       l2_write_throughput                                L2 Throughput (Writes)  85.727GB/s  117.88GB/s  101.02GB/s
        155                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                   sysmem_write_throughput                        System Memory Write Throughput  5.4063MB/s  5.8133MB/s  5.6324MB/s
        155                     local_load_throughput                          Local Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                    local_store_throughput                         Local Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                    shared_load_throughput                         Shared Memory Load Throughput  4169.7GB/s  4488.3GB/s  4348.6GB/s
        155                   shared_store_throughput                        Shared Memory Store Throughput  304.11GB/s  327.00GB/s  316.82GB/s
        155                            gld_efficiency                         Global Memory Load Efficiency     106.13%     107.02%     106.61%
        155                            gst_efficiency                        Global Memory Store Efficiency     100.00%     100.00%     100.00%
        155                    tex_cache_transactions                      Unified cache to SM transactions      739256      739493      739358
        155                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
        155                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
        155                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
        155                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
        155                             flop_count_sp           Floating Point Operations(Single Precision)    53784000    53784000    53784000
        155                         flop_count_sp_add       Floating Point Operations(Single Precision Add)      648000      648000      648000
        155                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)    25920000    25920000    25920000
        155                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)     1296000     1296000     1296000
        155                     flop_count_sp_special   Floating Point Operations(Single Precision Special)      324000      324000      324000
        155                             inst_executed                                 Instructions Executed     3155250     5397000     4268893
        155                               inst_issued                                   Instructions Issued     3172077     3190992     3180698
        155                          dram_utilization                             Device Memory Utilization    High (7)    High (7)    High (7)
        155                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
        155                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       0.38%      13.92%       6.23%
        155                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)       5.52%       8.64%       6.90%
        155                   stall_memory_dependency                    Issue Stall Reasons (Data Request)      38.73%      52.36%      44.45%
        155                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.00%       0.00%
        155                                stall_sync                 Issue Stall Reasons (Synchronization)       4.36%       9.57%       6.96%
        155                               stall_other                           Issue Stall Reasons (Other)       1.40%       2.31%       1.80%
        155          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)       1.12%       5.52%       2.81%
        155                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)      10.28%      17.87%      13.62%
        155                         shared_efficiency                              Shared Memory Efficiency      24.67%      24.92%      24.80%
        155                                inst_fp_32                               FP Instructions(Single)    28188000    28188000    28188000
        155                                inst_fp_64                               FP Instructions(Double)           0           0           0
        155                              inst_integer                                  Integer Instructions    32400000    32400000    32400000
        155                          inst_bit_convert                              Bit-Convert Instructions      648000      648000      648000
        155                              inst_control                             Control-Flow Instructions      972000      972000      972000
        155                        inst_compute_ld_st                               Load/Store Instructions    25272000    25272000    25272000
        155                                 inst_misc                                     Misc Instructions     7938000     7938000     7938000
        155           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
        155                               issue_slots                                           Issue Slots     3172077     3190992     3180698
        155                                 cf_issued                      Issued Control-Flow Instructions       52500       52500       52500
        155                               cf_executed                    Executed Control-Flow Instructions       52500       52500       52500
        155                               ldst_issued                        Issued Load/Store Instructions      840000      840000      840000
        155                             ldst_executed                      Executed Load/Store Instructions      840000      840000      840000
        155                       atomic_transactions                                   Atomic Transactions           0           0           0
        155           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
        155                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
        155                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
        155                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)      365026      365149      365081
        155                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)       7.51%      12.79%       9.90%
        155                        stall_not_selected                    Issue Stall Reasons (Not Selected)       5.59%       9.50%       7.34%
        155                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)       81000       81000       81000
        155             nvlink_total_data_transmitted                         NVLink Total Data Transmitted        1152        1152        1152
        155                nvlink_total_data_received                            NVLink Total Data Received         864         864         864
        155              nvlink_user_data_transmitted                          NVLink User Data Transmitted           0           0           0
        155                 nvlink_user_data_received                             NVLink User Data Received           0           0           0
        155          nvlink_overhead_data_transmitted                      NVLink Overhead Data Transmitted       1.00%       1.00%       1.00%
        155             nvlink_overhead_data_received                         NVLink Overhead Data Received       1.00%       1.00%       1.00%
        155      nvlink_total_nratom_data_transmitted                  NVLink Total Nratom Data Transmitted           0           0           0
        155       nvlink_user_nratom_data_transmitted                   NVLink User Nratom Data Transmitted           0           0           0
        155       nvlink_total_ratom_data_transmitted                   NVLink Total Ratom Data Transmitted           0           0           0
        155        nvlink_user_ratom_data_transmitted                    NVLink User Ratom Data Transmitted           0           0           0
        155       nvlink_total_write_data_transmitted                   NVLink Total Write Data Transmitted           0           0           0
        155        nvlink_user_write_data_transmitted                    NVLink User Write Data Transmitted           0           0           0
        155                nvlink_transmit_throughput                            NVLink Transmit Throughput  38.925MB/s  41.856MB/s  40.553MB/s
        155                 nvlink_receive_throughput                             NVLink Receive Throughput  29.194MB/s  31.392MB/s  30.415MB/s
        155       nvlink_total_response_data_received                   NVLink Total Response Data Received         288         288         288
        155        nvlink_user_response_data_received                    NVLink User Response Data Received           0           0           0
        155                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
        155                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
        155                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
        155                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
        155                                inst_fp_16                                 HP Instructions(Half)           0           0           0
        155                                       ipc                                          Executed IPC    0.760504    1.347159    1.005393
        155                                issued_ipc                                            Issued IPC    0.966492    1.353029    1.163890
        155                    issue_slot_utilization                                Issue Slot Utilization      24.16%      33.83%      29.10%
        155                             sm_efficiency                               Multiprocessor Activity      54.43%      92.58%      82.26%
        155                        achieved_occupancy                                    Achieved Occupancy    0.563138    0.580315    0.570436
        155                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle    3.157054    4.310169    3.713334
        155                        shared_utilization                             Shared Memory Utilization     Low (2)     Low (2)     Low (2)
        155                            l2_utilization                                  L2 Cache Utilization     Low (1)     Low (1)     Low (1)
        155                           tex_utilization                             Unified Cache Utilization     Low (2)     Low (3)     Low (2)
        155                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (3)     Mid (4)     Low (3)
        155                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155                        tex_fu_utilization                     Texture Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155                    special_fu_utilization                     Special Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Low (2)     Low (3)     Low (2)
        155           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
        155                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)       5.60%      14.13%      11.42%
        155                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
        155                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
        155                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
        155       nvlink_data_transmission_efficiency                   NVLink Data Transmission Efficiency       0.00%       0.00%       0.00%
        155            nvlink_data_receive_efficiency                        NVLink Data Receive Efficiency       0.00%       0.00%       0.00%
        155                            stall_sleeping                        Issue Stall Reasons (Sleeping)       0.00%       0.00%       0.00%
    Kernel: ptxcall_gpu_kernel_apply_filter__30
         51                             inst_per_warp                                 Instructions per warp  2.1140e+03  2.1140e+03  2.1140e+03
         51                         branch_efficiency                                     Branch Efficiency     100.00%     100.00%     100.00%
         51                 warp_execution_efficiency                             Warp Execution Efficiency      96.43%      96.43%      96.43%
         51         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency      95.18%      95.18%      95.18%
         51                      inst_replay_overhead                           Instruction Replay Overhead    0.000416    0.000625    0.000465
         51      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    1.567679    1.568684    1.568007
         51     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    1.019736    1.027548    1.022827
         51       local_load_transactions_per_request            Local Memory Load Transactions Per Request    0.000000    0.000000    0.000000
         51      local_store_transactions_per_request           Local Memory Store Transactions Per Request    0.000000    0.000000    0.000000
         51              gld_transactions_per_request                  Global Load Transactions Per Request    3.561599    3.582441    3.573183
         51              gst_transactions_per_request                 Global Store Transactions Per Request    3.857143    3.857143    3.857143
         51                 shared_store_transactions                             Shared Store Transactions      599605      604198      601422
         51                  shared_load_transactions                              Shared Load Transactions     4715970     4718995     4716957
         51                   local_load_transactions                               Local Load Transactions           0           0           0
         51                  local_store_transactions                              Local Store Transactions           0           0           0
         51                          gld_transactions                              Global Load Transactions      710539      714697      712849
         51                          gst_transactions                             Global Store Transactions      749250      749250      749250
         51                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
         51                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
         51                      l2_read_transactions                                  L2 Read Transactions      749938      755052      752039
         51                     l2_write_transactions                                 L2 Write Transactions      749288      773071      759131
         51                    dram_read_transactions                       Device Memory Read Transactions      749294      750171      749671
         51                   dram_write_transactions                      Device Memory Write Transactions      746599      770097      757856
         51                           global_hit_rate                     Global Hit Rate in unified l1/tex      52.81%      52.95%      52.88%
         51                            local_hit_rate                                        Local Hit Rate       0.00%       0.00%       0.00%
         51                  gld_requested_throughput                      Requested Global Load Throughput  199.34GB/s  218.93GB/s  209.75GB/s
         51                  gst_requested_throughput                     Requested Global Store Throughput  194.09GB/s  213.17GB/s  204.23GB/s
         51                            gld_throughput                                Global Load Throughput  185.04GB/s  202.92GB/s  194.31GB/s
         51                            gst_throughput                               Global Store Throughput  194.09GB/s  213.17GB/s  204.23GB/s
         51                     local_memory_overhead                                 Local Memory Overhead      53.91%      54.18%      54.04%
         51                        tex_cache_hit_rate                                Unified Cache Hit Rate       1.60%       1.60%       1.60%
         51                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)      13.01%      13.01%      13.01%
         51                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)      99.82%      99.95%      99.90%
         51                      dram_read_throughput                         Device Memory Read Throughput  194.32GB/s  213.43GB/s  204.35GB/s
         51                     dram_write_throughput                        Device Memory Write Throughput  195.40GB/s  217.48GB/s  206.58GB/s
         51                      tex_cache_throughput                        Unified cache to SM throughput  3338.9GB/s  3667.1GB/s  3513.4GB/s
         51                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  194.20GB/s  213.29GB/s  204.34GB/s
         51                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  194.09GB/s  213.17GB/s  204.23GB/s
         51                        l2_read_throughput                                 L2 Throughput (Reads)  194.27GB/s  214.82GB/s  204.99GB/s
         51                       l2_write_throughput                                L2 Throughput (Writes)  196.16GB/s  218.07GB/s  206.93GB/s
         51                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         51                   sysmem_write_throughput                        System Memory Write Throughput  1.3263MB/s  1.4567MB/s  1.3956MB/s
         51                     local_load_throughput                          Local Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         51                    local_store_throughput                         Local Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         51                    shared_load_throughput                         Shared Memory Load Throughput  4886.7GB/s  5368.3GB/s  5143.1GB/s
         51                   shared_store_throughput                        Shared Memory Store Throughput  625.30GB/s  684.23GB/s  655.75GB/s
         51                            gld_efficiency                         Global Memory Load Efficiency     107.67%     108.30%     107.95%
         51                            gst_efficiency                        Global Memory Store Efficiency     100.00%     100.00%     100.00%
         51                    tex_cache_transactions                      Unified cache to SM transactions     3222250     3222289     3222262
         51                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
         51                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
         51                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
         51                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
         51                             flop_count_sp           Floating Point Operations(Single Precision)   215784000   215784000   215784000
         51                         flop_count_sp_add       Floating Point Operations(Single Precision Add)           0           0           0
         51                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)   107892000   107892000   107892000
         51                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)           0           0           0
         51                     flop_count_sp_special   Floating Point Operations(Single Precision Special)      324000      324000      324000
         51                             inst_executed                                 Instructions Executed     8856750    11098500     9911691
         51                               inst_issued                                   Instructions Issued     8860430     8862337     8860759
         51                          dram_utilization                             Device Memory Utilization     Mid (5)     Mid (6)     Mid (5)
         51                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
         51                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       0.92%      13.90%       7.18%
         51                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)      27.36%      35.20%      31.23%
         51                   stall_memory_dependency                    Issue Stall Reasons (Data Request)      18.00%      30.02%      24.04%
         51                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.00%       0.00%
         51                                stall_sync                 Issue Stall Reasons (Synchronization)       6.75%       8.53%       7.61%
         51                               stall_other                           Issue Stall Reasons (Other)       0.41%       0.52%       0.46%
         51          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)       0.57%       0.85%       0.68%
         51                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)      15.83%      21.10%      18.28%
         51                         shared_efficiency                              Shared Memory Efficiency      42.78%      42.82%      42.80%
         51                                inst_fp_32                               FP Instructions(Single)   108216000   108216000   108216000
         51                                inst_fp_64                               FP Instructions(Double)           0           0           0
         51                              inst_integer                                  Integer Instructions    30618000    30618000    30618000
         51                          inst_bit_convert                              Bit-Convert Instructions      648000      648000      648000
         51                              inst_control                             Control-Flow Instructions      972000      972000      972000
         51                        inst_compute_ld_st                               Load/Store Instructions   123120000   123120000   123120000
         51                                 inst_misc                                     Misc Instructions     7776000     7776000     7776000
         51           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
         51                               issue_slots                                           Issue Slots     8860430     8862337     8860759
         51                                 cf_issued                      Issued Control-Flow Instructions       52500       52500       52500
         51                               cf_executed                    Executed Control-Flow Instructions       52500       52500       52500
         51                               ldst_issued                        Issued Load/Store Instructions     4032000     4032000     4032000
         51                             ldst_executed                      Executed Load/Store Instructions     4032000     4032000     4032000
         51                       atomic_transactions                                   Atomic Transactions           0           0           0
         51           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
         51                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
         51                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
         51                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)      749650      749672      749656
         51                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)       5.43%       7.84%       6.56%
         51                        stall_not_selected                    Issue Stall Reasons (Not Selected)       3.48%       4.49%       3.95%
         51                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)      749250      749250      749250
         51             nvlink_total_data_transmitted                         NVLink Total Data Transmitted        1152        1152        1152
         51                nvlink_total_data_received                            NVLink Total Data Received         864         864         864
         51              nvlink_user_data_transmitted                          NVLink User Data Transmitted           0           0           0
         51                 nvlink_user_data_received                             NVLink User Data Received           0           0           0
         51          nvlink_overhead_data_transmitted                      NVLink Overhead Data Transmitted       1.00%       1.00%       1.00%
         51             nvlink_overhead_data_received                         NVLink Overhead Data Received       1.00%       1.00%       1.00%
         51      nvlink_total_nratom_data_transmitted                  NVLink Total Nratom Data Transmitted           0           0           0
         51       nvlink_user_nratom_data_transmitted                   NVLink User Nratom Data Transmitted           0           0           0
         51       nvlink_total_ratom_data_transmitted                   NVLink Total Ratom Data Transmitted           0           0           0
         51        nvlink_user_ratom_data_transmitted                    NVLink User Ratom Data Transmitted           0           0           0
         51       nvlink_total_write_data_transmitted                   NVLink Total Write Data Transmitted           0           0           0
         51        nvlink_user_write_data_transmitted                    NVLink User Write Data Transmitted           0           0           0
         51                nvlink_transmit_throughput                            NVLink Transmit Throughput  9.5495MB/s  10.488MB/s  10.049MB/s
         51                 nvlink_receive_throughput                             NVLink Receive Throughput  7.1621MB/s  7.8663MB/s  7.5364MB/s
         51       nvlink_total_response_data_received                   NVLink Total Response Data Received         288         288         288
         51        nvlink_user_response_data_received                    NVLink User Response Data Received           0           0           0
         51                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
         51                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
         51                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
         51                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
         51                                inst_fp_16                                 HP Instructions(Half)           0           0           0
         51                                       ipc                                          Executed IPC    0.652266    0.807078    0.758255
         51                                issued_ipc                                            Issued IPC    0.652537    0.820166    0.728063
         51                    issue_slot_utilization                                Issue Slot Utilization      16.31%      20.50%      18.20%
         51                             sm_efficiency                               Multiprocessor Activity      83.94%      93.95%      92.44%
         51                        achieved_occupancy                                    Achieved Occupancy    0.109066    0.109089    0.109078
         51                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle    0.856441    1.101611    0.983864
         51                        shared_utilization                             Shared Memory Utilization     Mid (4)     Mid (4)     Mid (4)
         51                            l2_utilization                                  L2 Cache Utilization     Low (1)     Low (1)     Low (1)
         51                           tex_utilization                             Unified Cache Utilization     Low (3)     Low (3)     Low (3)
         51                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (3)     Mid (4)     Low (3)
         51                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
         51                        tex_fu_utilization                     Texture Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
         51                    special_fu_utilization                     Special Function Unit Utilization     Low (1)     Low (1)     Low (1)
         51             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
         51           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Low (2)     Low (2)     Low (2)
         51           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
         51                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
         51                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)      10.91%      14.25%      12.63%
         51                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
         51                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
         51                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
         51       nvlink_data_transmission_efficiency                   NVLink Data Transmission Efficiency       0.00%       0.00%       0.00%
         51            nvlink_data_receive_efficiency                        NVLink Data Receive Efficiency       0.00%       0.00%       0.00%
         51                            stall_sleeping                        Issue Stall Reasons (Sleeping)       0.00%       0.00%       0.00%
    Kernel: ptxcall_gpu_volume_tendency__22
        155                             inst_per_warp                                 Instructions per warp  5.7836e+04  7.4062e+04  5.7956e+04
        155                         branch_efficiency                                     Branch Efficiency      98.81%      99.21%      98.82%
        155                 warp_execution_efficiency                             Warp Execution Efficiency      54.96%      55.26%      54.97%
        155         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency      53.13%      53.21%      53.14%
        155                      inst_replay_overhead                           Instruction Replay Overhead    0.000932    0.001200    0.001000
        155      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    1.134168    1.143866    1.138681
        155     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    1.510191    1.511871    1.510877
        155       local_load_transactions_per_request            Local Memory Load Transactions Per Request    2.500000    2.500000    2.500000
        155      local_store_transactions_per_request           Local Memory Store Transactions Per Request    2.500000    2.500000    2.500000
        155              gld_transactions_per_request                  Global Load Transactions Per Request    2.582710    2.609019    2.595801
        155              gst_transactions_per_request                 Global Store Transactions Per Request    2.750000    2.750000    2.750000
        155                 shared_store_transactions                             Shared Store Transactions     1497354     1499020     1498034
        155                  shared_load_transactions                              Shared Load Transactions     5032302     5075335     5052329
        155                   local_load_transactions                               Local Load Transactions     7376250     7376250     7376250
        155                  local_store_transactions                              Local Store Transactions     6918750     6918750     6918750
        155                          gld_transactions                              Global Load Transactions     4838707     4887997     4863232
        155                          gst_transactions                             Global Store Transactions      915750      915750      915750
        155                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
        155                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
        155                      l2_read_transactions                                  L2 Read Transactions    11788429    11841602    11813449
        155                     l2_write_transactions                                 L2 Write Transactions     8918267     8977715     8946567
        155                    dram_read_transactions                       Device Memory Read Transactions    14696023    14801572    14745806
        155                   dram_write_transactions                      Device Memory Write Transactions     7411133     7475465     7440813
        155                           global_hit_rate                     Global Hit Rate in unified l1/tex      11.89%      12.93%      12.47%
        155                            local_hit_rate                                        Local Hit Rate       8.48%       8.74%       8.59%
        155                  gld_requested_throughput                      Requested Global Load Throughput  87.610GB/s  90.272GB/s  89.013GB/s
        155                  gst_requested_throughput                     Requested Global Store Throughput  15.572GB/s  16.045GB/s  15.821GB/s
        155                            gld_throughput                                Global Load Throughput  100.89GB/s  104.27GB/s  102.69GB/s
        155                            gst_throughput                               Global Store Throughput  19.033GB/s  19.611GB/s  19.337GB/s
        155                     local_memory_overhead                                 Local Memory Overhead      74.00%      74.51%      74.27%
        155                        tex_cache_hit_rate                                Unified Cache Hit Rate       3.43%       3.65%       3.55%
        155                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)       6.26%       6.65%       6.42%
        155                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)      29.26%      30.78%      30.05%
        155                      dram_read_throughput                         Device Memory Read Throughput  306.98GB/s  316.10GB/s  311.38GB/s
        155                     dram_write_throughput                        Device Memory Write Throughput  154.83GB/s  159.20GB/s  157.12GB/s
        155                      tex_cache_throughput                        Unified cache to SM throughput  791.84GB/s  816.30GB/s  804.78GB/s
        155                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  245.40GB/s  253.09GB/s  249.29GB/s
        155                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  162.83GB/s  167.78GB/s  165.44GB/s
        155                        l2_read_throughput                                 L2 Throughput (Reads)  245.35GB/s  253.48GB/s  249.46GB/s
        155                       l2_write_throughput                                L2 Throughput (Writes)  185.78GB/s  191.32GB/s  188.92GB/s
        155                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                   sysmem_write_throughput                        System Memory Write Throughput  108.96KB/s  112.28KB/s  110.71KB/s
        155                     local_load_throughput                          Local Memory Load Throughput  153.30GB/s  157.96GB/s  155.76GB/s
        155                    local_store_throughput                         Local Memory Store Throughput  143.80GB/s  148.16GB/s  146.10GB/s
        155                    shared_load_throughput                         Shared Memory Load Throughput  420.42GB/s  431.87GB/s  426.75GB/s
        155                   shared_store_throughput                        Shared Memory Store Throughput  124.53GB/s  128.33GB/s  126.53GB/s
        155                            gld_efficiency                         Global Memory Load Efficiency      86.24%      87.12%      86.68%
        155                            gst_efficiency                        Global Memory Store Efficiency      81.82%      81.82%      81.82%
        155                    tex_cache_transactions                      Unified cache to SM transactions     9524828     9532180     9527950
        155                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
        155                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
        155                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
        155                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
        155                             flop_count_sp           Floating Point Operations(Single Precision)   504360511   512469380   512412526
        155                         flop_count_sp_add       Floating Point Operations(Single Precision Add)    12143911    12793178    12788340
        155                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)   194618926   198185460   198161153
        155                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)   102978748   103305282   103301878
        155                     flop_count_sp_special   Floating Point Operations(Single Precision Special)     2934217     3583484     3578646
        155                             inst_executed                                 Instructions Executed    39677249   111092290    58067329
        155                               inst_issued                                   Instructions Issued    39715215    41013783    39726511
        155                          dram_utilization                             Device Memory Utilization     Mid (6)     Mid (6)     Mid (6)
        155                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
        155                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       2.58%       5.73%       4.08%
        155                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)       5.19%       6.37%       5.65%
        155                   stall_memory_dependency                    Issue Stall Reasons (Data Request)      36.00%      40.97%      38.55%
        155                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.00%       0.00%
        155                                stall_sync                 Issue Stall Reasons (Synchronization)       1.57%       2.22%       1.89%
        155                               stall_other                           Issue Stall Reasons (Other)       2.93%       4.30%       3.70%
        155          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)       0.09%       0.16%       0.11%
        155                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)       1.91%       2.46%       2.10%
        155                         shared_efficiency                              Shared Memory Efficiency      30.97%      31.18%      31.08%
        155                                inst_fp_32                               FP Instructions(Single)   327768619   329720208   329703700
        155                                inst_fp_64                               FP Instructions(Double)           0           0           0
        155                              inst_integer                                  Integer Instructions    97647980   106235247    97704030
        155                          inst_bit_convert                              Bit-Convert Instructions     2278992     2278992     2278992
        155                              inst_control                             Control-Flow Instructions    21379944    27218279    21420852
        155                        inst_compute_ld_st                               Load/Store Instructions   217161000   217161000   217161000
        155                                 inst_misc                                     Misc Instructions    28322678    32862479    28353912
        155           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
        155                               issue_slots                                           Issue Slots    39715215    41013783    39726511
        155                                 cf_issued                      Issued Control-Flow Instructions     1427210     1994516     1431489
        155                               cf_executed                    Executed Control-Flow Instructions     1427210     1994516     1431489
        155                               ldst_issued                        Issued Load/Store Instructions    12353541    12398587    12353910
        155                             ldst_executed                      Executed Load/Store Instructions    12353541    12398587    12353910
        155                       atomic_transactions                                   Atomic Transactions           0           0           0
        155           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
        155                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
        155                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
        155                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)    11785232    11828904    11805734
        155                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)      39.91%      46.25%      43.35%
        155                        stall_not_selected                    Issue Stall Reasons (Not Selected)       0.51%       0.68%       0.58%
        155                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)     7834500     7834500     7834500
        155             nvlink_total_data_transmitted                         NVLink Total Data Transmitted        1152        1152        1152
        155                nvlink_total_data_received                            NVLink Total Data Received         864         864         864
        155              nvlink_user_data_transmitted                          NVLink User Data Transmitted           0           0           0
        155                 nvlink_user_data_received                             NVLink User Data Received           0           0           0
        155          nvlink_overhead_data_transmitted                      NVLink Overhead Data Transmitted       1.00%       1.00%       1.00%
        155             nvlink_overhead_data_received                         NVLink Overhead Data Received       1.00%       1.00%       1.00%
        155      nvlink_total_nratom_data_transmitted                  NVLink Total Nratom Data Transmitted           0           0           0
        155       nvlink_user_nratom_data_transmitted                   NVLink User Nratom Data Transmitted           0           0           0
        155       nvlink_total_ratom_data_transmitted                   NVLink Total Ratom Data Transmitted           0           0           0
        155        nvlink_user_ratom_data_transmitted                    NVLink User Ratom Data Transmitted           0           0           0
        155       nvlink_total_write_data_transmitted                   NVLink Total Write Data Transmitted           0           0           0
        155        nvlink_user_write_data_transmitted                    NVLink User Write Data Transmitted           0           0           0
        155                nvlink_transmit_throughput                            NVLink Transmit Throughput  784.55KB/s  808.39KB/s  797.11KB/s
        155                 nvlink_receive_throughput                             NVLink Receive Throughput  588.41KB/s  606.29KB/s  597.83KB/s
        155       nvlink_total_response_data_received                   NVLink Total Response Data Received         288         288         288
        155        nvlink_user_response_data_received                    NVLink User Response Data Received           0           0           0
        155                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
        155                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
        155                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
        155                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
        155                                inst_fp_16                                 HP Instructions(Half)           0           0           0
        155                                       ipc                                          Executed IPC    0.244187    0.570360    0.424228
        155                                issued_ipc                                            Issued IPC    0.244445    0.298409    0.264295
        155                    issue_slot_utilization                                Issue Slot Utilization       6.11%       7.46%       6.61%
        155                             sm_efficiency                               Multiprocessor Activity      84.41%      93.51%      90.65%
        155                        achieved_occupancy                                    Achieved Occupancy    0.118197    0.122005    0.119778
        155                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle    0.285144    0.345793    0.308377
        155                        shared_utilization                             Shared Memory Utilization     Low (1)     Low (1)     Low (1)
        155                            l2_utilization                                  L2 Cache Utilization     Low (1)     Low (1)     Low (1)
        155                           tex_utilization                             Unified Cache Utilization     Low (1)     Low (1)     Low (1)
        155                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155                        tex_fu_utilization                     Texture Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155                    special_fu_utilization                     Special Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
        155                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)       1.79%       2.73%       2.44%
        155                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
        155                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
        155                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
        155       nvlink_data_transmission_efficiency                   NVLink Data Transmission Efficiency       0.00%       0.00%       0.00%
        155            nvlink_data_receive_efficiency                        NVLink Data Receive Efficiency       0.00%       0.00%       0.00%
        155                            stall_sleeping                        Issue Stall Reasons (Sleeping)       0.00%       0.00%       0.00%
    Kernel: ptxcall_gpu_interface_tendency__23
        155                             inst_per_warp                                 Instructions per warp  5.6220e+04  8.7998e+04  5.6467e+04
        155                         branch_efficiency                                     Branch Efficiency      98.21%      99.09%      98.23%
        155                 warp_execution_efficiency                             Warp Execution Efficiency      54.26%      55.01%      54.28%
        155         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency      52.58%      52.92%      52.61%
        155                      inst_replay_overhead                           Instruction Replay Overhead    0.001664    0.002169    0.001733
        155      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    0.000000    0.000000    0.000000
        155     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    0.000000    0.000000    0.000000
        155       local_load_transactions_per_request            Local Memory Load Transactions Per Request    2.500000    2.500000    2.500000
        155      local_store_transactions_per_request           Local Memory Store Transactions Per Request    2.500000    2.500000    2.500000
        155              gld_transactions_per_request                  Global Load Transactions Per Request    7.221265    7.227326    7.224633
        155              gst_transactions_per_request                 Global Store Transactions Per Request    7.083333    7.083333    7.083333
        155                 shared_store_transactions                             Shared Store Transactions           0           0           0
        155                  shared_load_transactions                              Shared Load Transactions           0           0           0
        155                   local_load_transactions                               Local Load Transactions     5875500     5875500     5875500
        155                  local_store_transactions                              Local Store Transactions     5248500     5248500     5248500
        155                          gld_transactions                              Global Load Transactions    26687629    26710027    26700077
        155                          gst_transactions                             Global Store Transactions     2358750     2358750     2358750
        155                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
        155                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
        155                      l2_read_transactions                                  L2 Read Transactions    30655390    30762288    30714960
        155                     l2_write_transactions                                 L2 Write Transactions     8658848     8695666     8675063
        155                    dram_read_transactions                       Device Memory Read Transactions    31885169    32452956    32190176
        155                   dram_write_transactions                      Device Memory Write Transactions     7548880     7576679     7563710
        155                           global_hit_rate                     Global Hit Rate in unified l1/tex      12.71%      12.93%      12.80%
        155                            local_hit_rate                                        Local Hit Rate       7.53%       7.92%       7.69%
        155                  gld_requested_throughput                      Requested Global Load Throughput  92.171GB/s  95.723GB/s  94.333GB/s
        155                  gst_requested_throughput                     Requested Global Store Throughput  8.2856GB/s  8.6049GB/s  8.4799GB/s
        155                            gld_throughput                                Global Load Throughput  295.29GB/s  306.70GB/s  302.19GB/s
        155                            gst_throughput                               Global Store Throughput  26.084GB/s  27.089GB/s  26.696GB/s
        155                     local_memory_overhead                                 Local Memory Overhead      33.76%      34.00%      33.87%
        155                        tex_cache_hit_rate                                Unified Cache Hit Rate       5.42%       5.64%       5.51%
        155                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)      15.83%      17.56%      16.73%
        155                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)      30.82%      30.94%      30.88%
        155                      dram_read_throughput                         Device Memory Read Throughput  354.28GB/s  371.81GB/s  364.32GB/s
        155                     dram_write_throughput                        Device Memory Write Throughput  83.581GB/s  87.016GB/s  85.605GB/s
        155                      tex_cache_throughput                        Unified cache to SM throughput  362.99GB/s  377.21GB/s  371.66GB/s
        155                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  339.45GB/s  352.62GB/s  347.53GB/s
        155                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  84.125GB/s  87.367GB/s  86.098GB/s
        155                        l2_read_throughput                                 L2 Throughput (Reads)  339.44GB/s  352.80GB/s  347.63GB/s
        155                       l2_write_throughput                                L2 Throughput (Writes)  95.832GB/s  99.730GB/s  98.183GB/s
        155                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                   sysmem_write_throughput                        System Memory Write Throughput  57.978KB/s  60.212KB/s  59.337KB/s
        155                     local_load_throughput                          Local Memory Load Throughput  64.974GB/s  67.478GB/s  66.498GB/s
        155                    local_store_throughput                         Local Memory Store Throughput  58.041GB/s  60.277GB/s  59.402GB/s
        155                    shared_load_throughput                         Shared Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                   shared_store_throughput                        Shared Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                            gld_efficiency                         Global Memory Load Efficiency      31.20%      31.23%      31.22%
        155                            gst_efficiency                        Global Memory Store Efficiency      31.76%      31.76%      31.76%
        155                    tex_cache_transactions                      Unified cache to SM transactions     8199960     8220353     8209511
        155                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
        155                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
        155                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
        155                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
        155                             flop_count_sp           Floating Point Operations(Single Precision)   334506426   350619944   350502251
        155                         flop_count_sp_add       Floating Point Operations(Single Precision Add)    18899718    20177192    20166988
        155                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)    95104236   102216784   102166972
        155                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)   125398236   126009184   126001318
        155                     flop_count_sp_special   Floating Point Operations(Single Precision Special)     2332518     3609992     3599788
        155                             inst_executed                                 Instructions Executed    35415834    84420353    57552148
        155                               inst_issued                                   Instructions Issued    35475522    37982705    35496032
        155                          dram_utilization                             Device Memory Utilization     Mid (6)     Mid (6)     Mid (6)
        155                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
        155                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       0.91%       2.32%       1.49%
        155                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)       2.78%       3.64%       2.97%
        155                   stall_memory_dependency                    Issue Stall Reasons (Data Request)      91.03%      93.20%      92.13%
        155                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.00%       0.00%
        155                                stall_sync                 Issue Stall Reasons (Synchronization)       0.50%       0.63%       0.55%
        155                               stall_other                           Issue Stall Reasons (Other)       0.61%       0.72%       0.67%
        155          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)       0.06%       0.13%       0.08%
        155                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)       0.11%       0.14%       0.12%
        155                         shared_efficiency                              Shared Memory Efficiency       0.00%       0.00%       0.00%
        155                                inst_fp_32                               FP Instructions(Single)   254370708   258277152   258240176
        155                                inst_fp_64                               FP Instructions(Double)           0           0           0
        155                              inst_integer                                  Integer Instructions   166851808   183594882   166961789
        155                          inst_bit_convert                              Bit-Convert Instructions           0           0           0
        155                              inst_control                             Control-Flow Instructions    20581640    32001810    20665128
        155                        inst_compute_ld_st                               Load/Store Instructions   150697800   150697800   150697800
        155                                 inst_misc                                     Misc Instructions    28660040    37164210    28724715
        155           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
        155                               issue_slots                                           Issue Slots    35475522    37982705    35496032
        155                                 cf_issued                      Issued Control-Flow Instructions     1437616     2546926     1446439
        155                               cf_executed                    Executed Control-Flow Instructions     1437616     2546926     1446439
        155                               ldst_issued                        Issued Load/Store Instructions     8639377     8727067     8640125
        155                             ldst_executed                      Executed Load/Store Instructions     8639377     8727067     8640125
        155                       atomic_transactions                                   Atomic Transactions           0           0           0
        155           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
        155                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
        155                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
        155                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)    30654539    30742851    30705896
        155                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)       1.38%       2.09%       1.80%
        155                        stall_not_selected                    Issue Stall Reasons (Not Selected)       0.16%       0.20%       0.17%
        155                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)     7607250     7607250     7607250
        155             nvlink_total_data_transmitted                         NVLink Total Data Transmitted        1152        1152        1152
        155                nvlink_total_data_received                            NVLink Total Data Received         864         864         864
        155              nvlink_user_data_transmitted                          NVLink User Data Transmitted           0           0           0
        155                 nvlink_user_data_received                             NVLink User Data Received           0           0           0
        155          nvlink_overhead_data_transmitted                      NVLink Overhead Data Transmitted       1.00%       1.00%       1.00%
        155             nvlink_overhead_data_received                         NVLink Overhead Data Received       1.00%       1.00%       1.00%
        155      nvlink_total_nratom_data_transmitted                  NVLink Total Nratom Data Transmitted           0           0           0
        155       nvlink_user_nratom_data_transmitted                   NVLink User Nratom Data Transmitted           0           0           0
        155       nvlink_total_ratom_data_transmitted                   NVLink Total Ratom Data Transmitted           0           0           0
        155        nvlink_user_ratom_data_transmitted                    NVLink User Ratom Data Transmitted           0           0           0
        155       nvlink_total_write_data_transmitted                   NVLink Total Write Data Transmitted           0           0           0
        155        nvlink_user_write_data_transmitted                    NVLink User Write Data Transmitted           0           0           0
        155                nvlink_transmit_throughput                            NVLink Transmit Throughput  417.44KB/s  433.53KB/s  427.23KB/s
        155                 nvlink_receive_throughput                             NVLink Receive Throughput  313.08KB/s  325.15KB/s  320.43KB/s
        155       nvlink_total_response_data_received                   NVLink Total Response Data Received         288         288         288
        155        nvlink_user_response_data_received                    NVLink User Response Data Received           0           0           0
        155                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
        155                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
        155                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
        155                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
        155                                inst_fp_16                                 HP Instructions(Half)           0           0           0
        155                                       ipc                                          Executed IPC    0.124833    0.340911    0.205249
        155                                issued_ipc                                            Issued IPC    0.124244    0.152726    0.133009
        155                    issue_slot_utilization                                Issue Slot Utilization       3.11%       3.82%       3.33%
        155                             sm_efficiency                               Multiprocessor Activity      84.53%      89.31%      87.18%
        155                        achieved_occupancy                                    Achieved Occupancy    0.118253    0.121083    0.119840
        155                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle    0.136061    0.168959    0.145864
        155                        shared_utilization                             Shared Memory Utilization    Idle (0)    Idle (0)    Idle (0)
        155                            l2_utilization                                  L2 Cache Utilization     Low (1)     Low (1)     Low (1)
        155                           tex_utilization                             Unified Cache Utilization     Low (1)     Low (1)     Low (1)
        155                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155                        tex_fu_utilization                     Texture Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155                    special_fu_utilization                     Special Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
        155                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)       0.70%       0.97%       0.88%
        155                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
        155                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
        155                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
        155       nvlink_data_transmission_efficiency                   NVLink Data Transmission Efficiency       0.00%       0.00%       0.00%
        155            nvlink_data_receive_efficiency                        NVLink Data Receive Efficiency       0.00%       0.00%       0.00%
        155                            stall_sleeping                        Issue Stall Reasons (Sleeping)       0.00%       0.00%       0.00%
    Kernel: ptxcall_gpu_band_back_kernel__26
        103                             inst_per_warp                                 Instructions per warp  7.4554e+04  7.4554e+04  7.4554e+04
        103                         branch_efficiency                                     Branch Efficiency     100.00%     100.00%     100.00%
        103                 warp_execution_efficiency                             Warp Execution Efficiency      56.25%      56.25%      56.25%
        103         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency      54.82%      54.82%      54.82%
        103                      inst_replay_overhead                           Instruction Replay Overhead    0.000404    0.000487    0.000439
        103      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    0.000000    0.000000    0.000000
        103     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    0.000000    0.000000    0.000000
        103       local_load_transactions_per_request            Local Memory Load Transactions Per Request    2.500000    2.500000    2.500000
        103      local_store_transactions_per_request           Local Memory Store Transactions Per Request    2.500000    2.500000    2.500000
        103              gld_transactions_per_request                  Global Load Transactions Per Request    2.621333    2.647758    2.632854
        103              gst_transactions_per_request                 Global Store Transactions Per Request    2.750000    2.750000    2.750000
        103                 shared_store_transactions                             Shared Store Transactions           0           0           0
        103                  shared_load_transactions                              Shared Load Transactions           0           0           0
        103                   local_load_transactions                               Local Load Transactions     1905000     1905000     1905000
        103                  local_store_transactions                              Local Store Transactions     1891500     1891500     1891500
        103                          gld_transactions                              Global Load Transactions     3656759     3693623     3672831
        103                          gst_transactions                             Global Store Transactions      123750      123750      123750
        103                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
        103                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
        103                      l2_read_transactions                                  L2 Read Transactions     4993036     5069547     5037472
        103                     l2_write_transactions                                 L2 Write Transactions     2397028     2421927     2407366
        103                    dram_read_transactions                       Device Memory Read Transactions     5675340     5879775     5770164
        103                   dram_write_transactions                      Device Memory Write Transactions     2028850     2052293     2041506
        103                           global_hit_rate                     Global Hit Rate in unified l1/tex      14.08%      15.81%      14.84%
        103                            local_hit_rate                                        Local Hit Rate       6.28%       7.07%       6.60%
        103                  gld_requested_throughput                      Requested Global Load Throughput  87.116GB/s  93.356GB/s  90.869GB/s
        103                  gst_requested_throughput                     Requested Global Store Throughput  2.8102GB/s  3.0115GB/s  2.9313GB/s
        103                            gld_throughput                                Global Load Throughput  101.61GB/s  109.55GB/s  106.33GB/s
        103                            gst_throughput                               Global Store Throughput  3.4347GB/s  3.6807GB/s  3.5826GB/s
        103                     local_memory_overhead                                 Local Memory Overhead      53.52%      54.42%      53.94%
        103                        tex_cache_hit_rate                                Unified Cache Hit Rate       8.95%      10.06%       9.45%
        103                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)      14.26%      15.36%      14.80%
        103                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)       4.16%       4.42%       4.26%
        103                      dram_read_throughput                         Device Memory Read Throughput  158.35GB/s  174.36GB/s  167.05GB/s
        103                     dram_write_throughput                        Device Memory Write Throughput  56.460GB/s  60.961GB/s  59.103GB/s
        103                      tex_cache_throughput                        Unified cache to SM throughput  256.81GB/s  275.67GB/s  268.15GB/s
        103                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  138.33GB/s  149.82GB/s  145.00GB/s
        103                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  55.933GB/s  59.940GB/s  58.343GB/s
        103                        l2_read_throughput                                 L2 Throughput (Reads)  139.12GB/s  150.39GB/s  145.84GB/s
        103                       l2_write_throughput                                L2 Throughput (Writes)  66.634GB/s  71.915GB/s  69.695GB/s
        103                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        103                   sysmem_write_throughput                        System Memory Write Throughput  145.52KB/s  155.94KB/s  151.78KB/s
        103                     local_load_throughput                          Local Memory Load Throughput  52.873GB/s  56.661GB/s  55.151GB/s
        103                    local_store_throughput                         Local Memory Store Throughput  52.499GB/s  56.259GB/s  54.760GB/s
        103                    shared_load_throughput                         Shared Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        103                   shared_store_throughput                        Shared Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        103                            gld_efficiency                         Global Memory Load Efficiency      84.98%      85.83%      85.46%
        103                            gst_efficiency                        Global Memory Store Efficiency      81.82%      81.82%      81.82%
        103                    tex_cache_transactions                      Unified cache to SM transactions     2310600     2320466     2315582
        103                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
        103                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
        103                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
        103                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
        103                             flop_count_sp           Floating Point Operations(Single Precision)    59130000    59130000    59130000
        103                         flop_count_sp_add       Floating Point Operations(Single Precision Add)      810000      810000      810000
        103                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)    29160000    29160000    29160000
        103                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)           0           0           0
        103                     flop_count_sp_special   Floating Point Operations(Single Precision Special)      815400      815400      815400
        103                             inst_executed                                 Instructions Executed    12221700    22366200    17540175
        103                               inst_issued                                   Instructions Issued    12226667    12227719    12227076
        103                          dram_utilization                             Device Memory Utilization     Low (3)     Low (3)     Low (3)
        103                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
        103                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       1.11%       1.90%       1.42%
        103                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)       5.22%       5.80%       5.44%
        103                   stall_memory_dependency                    Issue Stall Reasons (Data Request)      92.12%      93.38%      92.88%
        103                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.00%       0.00%
        103                                stall_sync                 Issue Stall Reasons (Synchronization)       0.00%       0.00%       0.00%
        103                               stall_other                           Issue Stall Reasons (Other)       0.09%       0.10%       0.09%
        103          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)       0.10%       0.13%       0.11%
        103                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)       0.01%       0.02%       0.01%
        103                         shared_efficiency                              Shared Memory Efficiency       0.00%       0.00%       0.00%
        103                                inst_fp_32                               FP Instructions(Single)    31595400    31595400    31595400
        103                                inst_fp_64                               FP Instructions(Double)           0           0           0
        103                              inst_integer                                  Integer Instructions   119988000   119988000   119988000
        103                          inst_bit_convert                              Bit-Convert Instructions       10800       10800       10800
        103                              inst_control                             Control-Flow Instructions     3439800     3439800     3439800
        103                        inst_compute_ld_st                               Load/Store Instructions    53254800    53254800    53254800
        103                                 inst_misc                                     Misc Instructions     9347400     9347400     9347400
        103           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
        103                               issue_slots                                           Issue Slots    12226667    12227719    12227076
        103                                 cf_issued                      Issued Control-Flow Instructions      279000      279000      279000
        103                               cf_executed                    Executed Control-Flow Instructions      279000      279000      279000
        103                               ldst_issued                        Issued Load/Store Instructions     3005700     3005700     3005700
        103                             ldst_executed                      Executed Load/Store Instructions     3005700     3005700     3005700
        103                       atomic_transactions                                   Atomic Transactions           0           0           0
        103           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
        103                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
        103                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
        103                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)     4960919     5047296     5008558
        103                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)       0.00%       0.01%       0.01%
        103                        stall_not_selected                    Issue Stall Reasons (Not Selected)       0.02%       0.04%       0.03%
        103                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)     2015250     2015250     2015250
        103             nvlink_total_data_transmitted                         NVLink Total Data Transmitted        1152        1152        1152
        103                nvlink_total_data_received                            NVLink Total Data Received         864         864         864
        103              nvlink_user_data_transmitted                          NVLink User Data Transmitted           0           0           0
        103                 nvlink_user_data_received                             NVLink User Data Received           0           0           0
        103          nvlink_overhead_data_transmitted                      NVLink Overhead Data Transmitted       1.00%       1.00%       1.00%
        103             nvlink_overhead_data_received                         NVLink Overhead Data Received       1.00%       1.00%       1.00%
        103      nvlink_total_nratom_data_transmitted                  NVLink Total Nratom Data Transmitted           0           0           0
        103       nvlink_user_nratom_data_transmitted                   NVLink User Nratom Data Transmitted           0           0           0
        103       nvlink_total_ratom_data_transmitted                   NVLink Total Ratom Data Transmitted           0           0           0
        103        nvlink_user_ratom_data_transmitted                    NVLink User Ratom Data Transmitted           0           0           0
        103       nvlink_total_write_data_transmitted                   NVLink Total Write Data Transmitted           0           0           0
        103        nvlink_user_write_data_transmitted                    NVLink User Write Data Transmitted           0           0           0
        103                nvlink_transmit_throughput                            NVLink Transmit Throughput  1.0232MB/s  1.0964MB/s  1.0672MB/s
        103                 nvlink_receive_throughput                             NVLink Receive Throughput  785.79KB/s  842.07KB/s  819.64KB/s
        103       nvlink_total_response_data_received                   NVLink Total Response Data Received         288         288         288
        103        nvlink_user_response_data_received                    NVLink User Response Data Received           0           0           0
        103                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
        103                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
        103                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
        103                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
        103                                inst_fp_16                                 HP Instructions(Half)           0           0           0
        103                                       ipc                                          Executed IPC    0.104562    0.175170    0.136577
        103                                issued_ipc                                            Issued IPC    0.104607    0.114902    0.108720
        103                    issue_slot_utilization                                Issue Slot Utilization       2.62%       2.87%       2.72%
        103                             sm_efficiency                               Multiprocessor Activity      93.17%      96.42%      94.72%
        103                        achieved_occupancy                                    Achieved Occupancy    0.058455    0.058861    0.058625
        103                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle    0.105543    0.116186    0.109803
        103                        shared_utilization                             Shared Memory Utilization    Idle (0)    Idle (0)    Idle (0)
        103                            l2_utilization                                  L2 Cache Utilization     Low (1)     Low (1)     Low (1)
        103                           tex_utilization                             Unified Cache Utilization     Low (1)     Low (1)     Low (1)
        103                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (1)     Low (1)     Low (1)
        103                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
        103                        tex_fu_utilization                     Texture Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        103                    special_fu_utilization                     Special Function Unit Utilization     Low (1)     Low (1)     Low (1)
        103             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        103           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Low (1)     Low (1)     Low (1)
        103           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        103                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
        103                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)       0.32%       0.41%       0.38%
        103                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
        103                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
        103                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
        103       nvlink_data_transmission_efficiency                   NVLink Data Transmission Efficiency       0.00%       0.00%       0.00%
        103            nvlink_data_receive_efficiency                        NVLink Data Receive Efficiency       0.00%       0.00%       0.00%
        103                            stall_sleeping                        Issue Stall Reasons (Sleeping)       0.00%       0.00%       0.00%
    Kernel: ptxcall_gpu_interface_gradients_of_laplacians__21
        155                             inst_per_warp                                 Instructions per warp  3.6900e+03  3.6900e+03  3.6900e+03
        155                         branch_efficiency                                     Branch Efficiency     100.00%     100.00%     100.00%
        155                 warp_execution_efficiency                             Warp Execution Efficiency      56.25%      56.25%      56.25%
        155         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency      53.93%      53.93%      53.93%
        155                      inst_replay_overhead                           Instruction Replay Overhead    0.014868    0.020550    0.016656
        155      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    0.000000    0.000000    0.000000
        155     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    0.000000    0.000000    0.000000
        155       local_load_transactions_per_request            Local Memory Load Transactions Per Request    0.000000    0.000000    0.000000
        155      local_store_transactions_per_request           Local Memory Store Transactions Per Request    0.000000    0.000000    0.000000
        155              gld_transactions_per_request                  Global Load Transactions Per Request    9.191900    9.197652    9.194650
        155              gst_transactions_per_request                 Global Store Transactions Per Request    9.250000    9.250000    9.250000
        155                 shared_store_transactions                             Shared Store Transactions           0           0           0
        155                  shared_load_transactions                              Shared Load Transactions           0           0           0
        155                   local_load_transactions                               Local Load Transactions           0           0           0
        155                  local_store_transactions                              Local Store Transactions           0           0           0
        155                          gld_transactions                              Global Load Transactions     1558027     1559002     1558493
        155                          gst_transactions                             Global Store Transactions      666000      666000      666000
        155                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
        155                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
        155                      l2_read_transactions                                  L2 Read Transactions     1121736     1129357     1125262
        155                     l2_write_transactions                                 L2 Write Transactions      757344      784877      769611
        155                    dram_read_transactions                       Device Memory Read Transactions     1197193     1211628     1205184
        155                   dram_write_transactions                      Device Memory Write Transactions      696473      721863      709764
        155                           global_hit_rate                     Global Hit Rate in unified l1/tex      48.79%      49.16%      48.96%
        155                            local_hit_rate                                        Local Hit Rate       0.00%       0.00%       0.00%
        155                  gld_requested_throughput                      Requested Global Load Throughput  120.42GB/s  123.54GB/s  122.12GB/s
        155                  gst_requested_throughput                     Requested Global Store Throughput  48.122GB/s  49.371GB/s  48.803GB/s
        155                            gld_throughput                                Global Load Throughput  462.91GB/s  475.12GB/s  469.50GB/s
        155                            gst_throughput                               Global Store Throughput  197.84GB/s  202.97GB/s  200.64GB/s
        155                     local_memory_overhead                                 Local Memory Overhead      36.36%      36.72%      36.57%
        155                        tex_cache_hit_rate                                Unified Cache Hit Rate      19.69%      20.01%      19.84%
        155                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)      16.97%      18.43%      17.66%
        155                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)      85.44%      86.43%      85.94%
        155                      dram_read_throughput                         Device Memory Read Throughput  357.12GB/s  367.86GB/s  363.07GB/s
        155                     dram_write_throughput                        Device Memory Write Throughput  208.12GB/s  219.00GB/s  213.82GB/s
        155                      tex_cache_throughput                        Unified cache to SM throughput  373.23GB/s  383.92GB/s  379.00GB/s
        155                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  333.64GB/s  342.61GB/s  338.70GB/s
        155                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  197.84GB/s  202.97GB/s  200.64GB/s
        155                        l2_read_throughput                                 L2 Throughput (Reads)  333.72GB/s  343.62GB/s  338.99GB/s
        155                       l2_write_throughput                                L2 Throughput (Writes)  226.39GB/s  237.74GB/s  231.85GB/s
        155                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                   sysmem_write_throughput                        System Memory Write Throughput  1.5209MB/s  1.5604MB/s  1.5424MB/s
        155                     local_load_throughput                          Local Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                    local_store_throughput                         Local Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                    shared_load_throughput                         Shared Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                   shared_store_throughput                        Shared Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        155                            gld_efficiency                         Global Memory Load Efficiency      26.00%      26.02%      26.01%
        155                            gst_efficiency                        Global Memory Store Efficiency      24.32%      24.32%      24.32%
        155                    tex_cache_transactions                      Unified cache to SM transactions      313827      315061      314515
        155                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
        155                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
        155                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
        155                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
        155                             flop_count_sp           Floating Point Operations(Single Precision)     7884000     7884000     7884000
        155                         flop_count_sp_add       Floating Point Operations(Single Precision Add)      540000      540000      540000
        155                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)     2052000     2052000     2052000
        155                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)     3240000     3240000     3240000
        155                     flop_count_sp_special   Floating Point Operations(Single Precision Special)      108000      108000      108000
        155                             inst_executed                                 Instructions Executed     1806000     5535000     3706587
        155                               inst_issued                                   Instructions Issued     1832832     1840144     1836104
        155                          dram_utilization                             Device Memory Utilization    High (7)    High (8)    High (7)
        155                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
        155                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       0.50%       2.11%       1.26%
        155                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)       1.49%       1.79%       1.60%
        155                   stall_memory_dependency                    Issue Stall Reasons (Data Request)      89.63%      92.75%      91.27%
        155                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.00%       0.00%
        155                                stall_sync                 Issue Stall Reasons (Synchronization)       2.43%       2.98%       2.61%
        155                               stall_other                           Issue Stall Reasons (Other)       0.16%       0.21%       0.18%
        155          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)       0.72%       1.65%       1.09%
        155                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)       0.15%       0.26%       0.20%
        155                         shared_efficiency                              Shared Memory Efficiency       0.00%       0.00%       0.00%
        155                                inst_fp_32                               FP Instructions(Single)     6048000     6048000     6048000
        155                                inst_fp_64                               FP Instructions(Double)           0           0           0
        155                              inst_integer                                  Integer Instructions    17523000    17523000    17523000
        155                          inst_bit_convert                              Bit-Convert Instructions           0           0           0
        155                              inst_control                             Control-Flow Instructions     1350000     1350000     1350000
        155                        inst_compute_ld_st                               Load/Store Instructions     4347000     4347000     4347000
        155                                 inst_misc                                     Misc Instructions     1917000     1917000     1917000
        155           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
        155                               issue_slots                                           Issue Slots     1832832     1840144     1836104
        155                                 cf_issued                      Issued Control-Flow Instructions       94500       94500       94500
        155                               cf_executed                    Executed Control-Flow Instructions       94500       94500       94500
        155                               ldst_issued                        Issued Load/Store Instructions      283500      283500      283500
        155                             ldst_executed                      Executed Load/Store Instructions      283500      283500      283500
        155                       atomic_transactions                                   Atomic Transactions           0           0           0
        155           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
        155                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
        155                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
        155                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)     1120563     1127677     1124298
        155                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)       0.94%       2.84%       1.35%
        155                        stall_not_selected                    Issue Stall Reasons (Not Selected)       0.37%       0.51%       0.43%
        155                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)      666000      666000      666000
        155             nvlink_total_data_transmitted                         NVLink Total Data Transmitted        1152        1152        1152
        155                nvlink_total_data_received                            NVLink Total Data Received         864         864         864
        155              nvlink_user_data_transmitted                          NVLink User Data Transmitted           0           0           0
        155                 nvlink_user_data_received                             NVLink User Data Received           0           0           0
        155          nvlink_overhead_data_transmitted                      NVLink Overhead Data Transmitted       1.00%       1.00%       1.00%
        155             nvlink_overhead_data_received                         NVLink Overhead Data Received       1.00%       1.00%       1.00%
        155      nvlink_total_nratom_data_transmitted                  NVLink Total Nratom Data Transmitted           0           0           0
        155       nvlink_user_nratom_data_transmitted                   NVLink User Nratom Data Transmitted           0           0           0
        155       nvlink_total_ratom_data_transmitted                   NVLink Total Ratom Data Transmitted           0           0           0
        155        nvlink_user_ratom_data_transmitted                    NVLink User Ratom Data Transmitted           0           0           0
        155       nvlink_total_write_data_transmitted                   NVLink Total Write Data Transmitted           0           0           0
        155        nvlink_user_write_data_transmitted                    NVLink User Write Data Transmitted           0           0           0
        155                nvlink_transmit_throughput                            NVLink Transmit Throughput  10.951MB/s  11.235MB/s  11.105MB/s
        155                 nvlink_receive_throughput                             NVLink Receive Throughput  8.2129MB/s  8.4260MB/s  8.3291MB/s
        155       nvlink_total_response_data_received                   NVLink Total Response Data Received         288         288         288
        155        nvlink_user_response_data_received                    NVLink User Response Data Received           0           0           0
        155                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
        155                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
        155                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
        155                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
        155                                inst_fp_16                                 HP Instructions(Half)           0           0           0
        155                                       ipc                                          Executed IPC    0.155944    0.440132    0.280529
        155                                issued_ipc                                            Issued IPC    0.158479    0.189305    0.171594
        155                    issue_slot_utilization                                Issue Slot Utilization       3.96%       4.73%       4.29%
        155                             sm_efficiency                               Multiprocessor Activity      83.05%      94.29%      91.73%
        155                        achieved_occupancy                                    Achieved Occupancy    0.282278    0.286532    0.283861
        155                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle    0.231217    0.276409    0.249759
        155                        shared_utilization                             Shared Memory Utilization    Idle (0)    Idle (0)    Idle (0)
        155                            l2_utilization                                  L2 Cache Utilization     Low (1)     Low (2)     Low (1)
        155                           tex_utilization                             Unified Cache Utilization     Low (1)     Low (1)     Low (1)
        155                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155                        tex_fu_utilization                     Texture Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155                    special_fu_utilization                     Special Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Low (1)     Low (1)     Low (1)
        155           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        155                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
        155                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)       0.38%       0.59%       0.52%
        155                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
        155                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
        155                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
        155       nvlink_data_transmission_efficiency                   NVLink Data Transmission Efficiency       0.00%       0.00%       0.00%
        155            nvlink_data_receive_efficiency                        NVLink Data Receive Efficiency       0.00%       0.00%       0.00%
        155                            stall_sleeping                        Issue Stall Reasons (Sleeping)       0.00%       0.00%       0.00%
    Kernel: ptxcall_gpu_solution_update__29
         51                             inst_per_warp                                 Instructions per warp  146.996861  146.996861  146.996861
         51                         branch_efficiency                                     Branch Efficiency     100.00%     100.00%     100.00%
         51                 warp_execution_efficiency                             Warp Execution Efficiency     100.00%     100.00%     100.00%
         51         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency      96.00%      96.00%      96.00%
         51                      inst_replay_overhead                           Instruction Replay Overhead    0.001421    0.002164    0.001742
         51      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    0.000000    0.000000    0.000000
         51     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    0.000000    0.000000    0.000000
         51       local_load_transactions_per_request            Local Memory Load Transactions Per Request    0.000000    0.000000    0.000000
         51      local_store_transactions_per_request           Local Memory Store Transactions Per Request    0.000000    0.000000    0.000000
         51              gld_transactions_per_request                  Global Load Transactions Per Request    3.999989    3.999989    3.999989
         51              gst_transactions_per_request                 Global Store Transactions Per Request    3.999989    3.999989    3.999989
         51                 shared_store_transactions                             Shared Store Transactions           0           0           0
         51                  shared_load_transactions                              Shared Load Transactions           0           0           0
         51                   local_load_transactions                               Local Load Transactions           0           0           0
         51                  local_store_transactions                              Local Store Transactions           0           0           0
         51                          gld_transactions                              Global Load Transactions     2997000     2997000     2997000
         51                          gst_transactions                             Global Store Transactions     2247750     2247750     2247750
         51                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
         51                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
         51                      l2_read_transactions                                  L2 Read Transactions     2997096     2997712     2997235
         51                     l2_write_transactions                                 L2 Write Transactions     2247782     2271570     2258118
         51                    dram_read_transactions                       Device Memory Read Transactions     2997005     2997075     2997030
         51                   dram_write_transactions                      Device Memory Write Transactions      781631      804089      792464
         51                           global_hit_rate                     Global Hit Rate in unified l1/tex      42.78%      42.81%      42.80%
         51                            local_hit_rate                                        Local Hit Rate       0.00%       0.00%       0.00%
         51                  gld_requested_throughput                      Requested Global Load Throughput  569.92GB/s  576.75GB/s  574.71GB/s
         51                  gst_requested_throughput                     Requested Global Store Throughput  427.44GB/s  432.56GB/s  431.03GB/s
         51                            gld_throughput                                Global Load Throughput  569.92GB/s  576.75GB/s  574.71GB/s
         51                            gst_throughput                               Global Store Throughput  427.44GB/s  432.56GB/s  431.03GB/s
         51                     local_memory_overhead                                 Local Memory Overhead      42.76%      42.82%      42.80%
         51                        tex_cache_hit_rate                                Unified Cache Hit Rate       0.00%       0.00%       0.00%
         51                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)       0.00%       0.00%       0.00%
         51                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)     100.00%     100.00%     100.00%
         51                      dram_read_throughput                         Device Memory Read Throughput  569.92GB/s  576.76GB/s  574.72GB/s
         51                     dram_write_throughput                        Device Memory Write Throughput  148.75GB/s  154.60GB/s  151.96GB/s
         51                      tex_cache_throughput                        Unified cache to SM throughput  712.41GB/s  720.95GB/s  718.40GB/s
         51                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  569.92GB/s  576.75GB/s  574.71GB/s
         51                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  427.44GB/s  432.56GB/s  431.03GB/s
         51                        l2_read_throughput                                 L2 Throughput (Reads)  569.94GB/s  576.82GB/s  574.76GB/s
         51                       l2_write_throughput                                L2 Throughput (Writes)  429.41GB/s  436.70GB/s  433.02GB/s
         51                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         51                   sysmem_write_throughput                        System Memory Write Throughput  997.01KB/s  0.9853MB/s  0.9818MB/s
         51                     local_load_throughput                          Local Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         51                    local_store_throughput                         Local Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         51                    shared_load_throughput                         Shared Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         51                   shared_store_throughput                        Shared Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         51                            gld_efficiency                         Global Memory Load Efficiency     100.00%     100.00%     100.00%
         51                            gst_efficiency                        Global Memory Store Efficiency     100.00%     100.00%     100.00%
         51                    tex_cache_transactions                      Unified cache to SM transactions      936572      936572      936572
         51                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
         51                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
         51                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
         51                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
         51                             flop_count_sp           Floating Point Operations(Single Precision)    53946000    53946000    53946000
         51                         flop_count_sp_add       Floating Point Operations(Single Precision Add)           0           0           0
         51                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)    17982000    17982000    17982000
         51                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)    17982000    17982000    17982000
         51                     flop_count_sp_special   Floating Point Operations(Single Precision Special)           0           0           0
         51                             inst_executed                                 Instructions Executed     7492590    27535452    18889511
         51                               inst_issued                                   Instructions Issued     7502946     7508803     7505542
         51                          dram_utilization                             Device Memory Utilization    High (9)    High (9)    High (9)
         51                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
         51                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       0.27%       0.43%       0.33%
         51                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)       2.28%       2.72%       2.48%
         51                   stall_memory_dependency                    Issue Stall Reasons (Data Request)      94.67%      95.69%      95.28%
         51                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.00%       0.00%
         51                                stall_sync                 Issue Stall Reasons (Synchronization)       0.00%       0.00%       0.00%
         51                               stall_other                           Issue Stall Reasons (Other)       0.10%       0.13%       0.11%
         51          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)       0.09%       0.28%       0.15%
         51                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)       0.23%       0.32%       0.27%
         51                         shared_efficiency                              Shared Memory Efficiency       0.00%       0.00%       0.00%
         51                                inst_fp_32                               FP Instructions(Single)    35964000    35964000    35964000
         51                                inst_fp_64                               FP Instructions(Double)           0           0           0
         51                              inst_integer                                  Integer Instructions   113887440   113887440   113887440
         51                          inst_bit_convert                              Bit-Convert Instructions           0           0           0
         51                              inst_control                             Control-Flow Instructions     5994240     5994240     5994240
         51                        inst_compute_ld_st                               Load/Store Instructions    41958000    41958000    41958000
         51                                 inst_misc                                     Misc Instructions    29970480    29970480    29970480
         51           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
         51                               issue_slots                                           Issue Slots     7502946     7508803     7505542
         51                                 cf_issued                      Issued Control-Flow Instructions      561953      561953      561953
         51                               cf_executed                    Executed Control-Flow Instructions      561953      561953      561953
         51                               ldst_issued                        Issued Load/Store Instructions     1685831     1685831     1685831
         51                             ldst_executed                      Executed Load/Store Instructions     1685831     1685831     1685831
         51                       atomic_transactions                                   Atomic Transactions           0           0           0
         51           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
         51                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
         51                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
         51                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)     2997000     2997000     2997000
         51                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)       0.86%       1.04%       0.94%
         51                        stall_not_selected                    Issue Stall Reasons (Not Selected)       0.40%       0.51%       0.45%
         51                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)     2247750     2247750     2247750
         51             nvlink_total_data_transmitted                         NVLink Total Data Transmitted        1152        1152        1152
         51                nvlink_total_data_received                            NVLink Total Data Received         864         864         864
         51              nvlink_user_data_transmitted                          NVLink User Data Transmitted           0           0           0
         51                 nvlink_user_data_received                             NVLink User Data Received           0           0           0
         51          nvlink_overhead_data_transmitted                      NVLink Overhead Data Transmitted       1.00%       1.00%       1.00%
         51             nvlink_overhead_data_received                         NVLink Overhead Data Received       1.00%       1.00%       1.00%
         51      nvlink_total_nratom_data_transmitted                  NVLink Total Nratom Data Transmitted           0           0           0
         51       nvlink_user_nratom_data_transmitted                   NVLink User Nratom Data Transmitted           0           0           0
         51       nvlink_total_ratom_data_transmitted                   NVLink Total Ratom Data Transmitted           0           0           0
         51        nvlink_user_ratom_data_transmitted                    NVLink User Ratom Data Transmitted           0           0           0
         51       nvlink_total_write_data_transmitted                   NVLink Total Write Data Transmitted           0           0           0
         51        nvlink_user_write_data_transmitted                    NVLink User Write Data Transmitted           0           0           0
         51                nvlink_transmit_throughput                            NVLink Transmit Throughput  7.0102MB/s  7.0942MB/s  7.0691MB/s
         51                 nvlink_receive_throughput                             NVLink Receive Throughput  5.2577MB/s  5.3207MB/s  5.3018MB/s
         51       nvlink_total_response_data_received                   NVLink Total Response Data Received         288         288         288
         51        nvlink_user_response_data_received                    NVLink User Response Data Received           0           0           0
         51                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
         51                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
         51                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
         51                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
         51                                inst_fp_16                                 HP Instructions(Half)           0           0           0
         51                                       ipc                                          Executed IPC    0.346575    0.474004    0.408312
         51                                issued_ipc                                            Issued IPC    0.402231    0.475105    0.434735
         51                    issue_slot_utilization                                Issue Slot Utilization      10.06%      11.88%      10.87%
         51                             sm_efficiency                               Multiprocessor Activity      96.02%      99.38%      97.18%
         51                        achieved_occupancy                                    Achieved Occupancy    0.892468    0.896008    0.894539
         51                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle    0.644478    0.779110    0.703747
         51                        shared_utilization                             Shared Memory Utilization    Idle (0)    Idle (0)    Idle (0)
         51                            l2_utilization                                  L2 Cache Utilization     Low (1)     Low (3)     Low (1)
         51                           tex_utilization                             Unified Cache Utilization     Low (1)     Low (1)     Low (1)
         51                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (1)     Low (2)     Low (1)
         51                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
         51                        tex_fu_utilization                     Texture Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
         51                    special_fu_utilization                     Special Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
         51             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
         51           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Low (1)     Low (1)     Low (1)
         51           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
         51                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
         51                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)       0.51%       2.57%       2.04%
         51                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
         51                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
         51                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
         51       nvlink_data_transmission_efficiency                   NVLink Data Transmission Efficiency       0.00%       0.00%       0.00%
         51            nvlink_data_receive_efficiency                        NVLink Data Receive Efficiency       0.00%       0.00%       0.00%
         51                            stall_sleeping                        Issue Stall Reasons (Sleeping)       0.00%       0.00%       0.00%
    Kernel: ptxcall_copy_kernel__5
        104                             inst_per_warp                                 Instructions per warp  562.981353  562.981353  562.981353
        104                         branch_efficiency                                     Branch Efficiency     100.00%     100.00%     100.00%
        104                 warp_execution_efficiency                             Warp Execution Efficiency     100.00%     100.00%     100.00%
        104         warp_nonpred_execution_efficiency              Warp Non-Predicated Execution Efficiency      95.13%      95.13%      95.13%
        104                      inst_replay_overhead                           Instruction Replay Overhead    0.000559    0.001271    0.000821
        104      shared_load_transactions_per_request           Shared Memory Load Transactions Per Request    0.000000    0.000000    0.000000
        104     shared_store_transactions_per_request          Shared Memory Store Transactions Per Request    0.000000    0.000000    0.000000
        104       local_load_transactions_per_request            Local Memory Load Transactions Per Request    0.000000    0.000000    0.000000
        104      local_store_transactions_per_request           Local Memory Store Transactions Per Request    0.000000    0.000000    0.000000
        104              gld_transactions_per_request                  Global Load Transactions Per Request    3.999989    3.999989    3.999989
        104              gst_transactions_per_request                 Global Store Transactions Per Request    3.999989    3.999989    3.999989
        104                 shared_store_transactions                             Shared Store Transactions           0           0           0
        104                  shared_load_transactions                              Shared Load Transactions           0           0           0
        104                   local_load_transactions                               Local Load Transactions           0           0           0
        104                  local_store_transactions                              Local Store Transactions           0           0           0
        104                          gld_transactions                              Global Load Transactions      749250      749250      749250
        104                          gst_transactions                             Global Store Transactions      749250      749250      749250
        104                  sysmem_read_transactions                       System Memory Read Transactions           0           0           0
        104                 sysmem_write_transactions                      System Memory Write Transactions           5           5           5
        104                      l2_read_transactions                                  L2 Read Transactions      749394      750078      749678
        104                     l2_write_transactions                                 L2 Write Transactions      749270      773201      758698
        104                    dram_read_transactions                       Device Memory Read Transactions      749262      749401      749322
        104                   dram_write_transactions                      Device Memory Write Transactions      739825      765772      754082
        104                           global_hit_rate                     Global Hit Rate in unified l1/tex       0.00%       0.00%       0.00%
        104                            local_hit_rate                                        Local Hit Rate       0.00%       0.00%       0.00%
        104                  gld_requested_throughput                      Requested Global Load Throughput  269.99GB/s  311.11GB/s  295.07GB/s
        104                  gst_requested_throughput                     Requested Global Store Throughput  269.99GB/s  311.11GB/s  295.07GB/s
        104                            gld_throughput                                Global Load Throughput  269.99GB/s  311.11GB/s  295.07GB/s
        104                            gst_throughput                               Global Store Throughput  269.99GB/s  311.11GB/s  295.07GB/s
        104                     local_memory_overhead                                 Local Memory Overhead       0.00%       0.00%       0.00%
        104                        tex_cache_hit_rate                                Unified Cache Hit Rate       0.00%       0.00%       0.00%
        104                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)       0.00%       0.00%       0.00%
        104                     l2_tex_write_hit_rate                          L2 Hit Rate (Texture Writes)       0.00%       0.00%       0.00%
        104                      dram_read_throughput                         Device Memory Read Throughput  269.99GB/s  311.11GB/s  295.09GB/s
        104                     dram_write_throughput                        Device Memory Write Throughput  268.76GB/s  316.40GB/s  296.97GB/s
        104                      tex_cache_throughput                        Unified cache to SM throughput  539.99GB/s  622.23GB/s  590.14GB/s
        104                    l2_tex_read_throughput                         L2 Throughput (Texture Reads)  269.99GB/s  311.11GB/s  295.07GB/s
        104                   l2_tex_write_throughput                        L2 Throughput (Texture Writes)  269.99GB/s  311.11GB/s  295.07GB/s
        104                        l2_read_throughput                                 L2 Throughput (Reads)  270.28GB/s  311.31GB/s  295.23GB/s
        104                       l2_write_throughput                                L2 Throughput (Writes)  270.00GB/s  320.89GB/s  298.79GB/s
        104                    sysmem_read_throughput                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        104                   sysmem_write_throughput                        System Memory Write Throughput  1.8450MB/s  2.1259MB/s  2.0163MB/s
        104                     local_load_throughput                          Local Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        104                    local_store_throughput                         Local Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        104                    shared_load_throughput                         Shared Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        104                   shared_store_throughput                        Shared Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
        104                            gld_efficiency                         Global Memory Load Efficiency     100.00%     100.00%     100.00%
        104                            gst_efficiency                        Global Memory Store Efficiency     100.00%     100.00%     100.00%
        104                    tex_cache_transactions                      Unified cache to SM transactions      374633      374633      374633
        104                             flop_count_dp           Floating Point Operations(Double Precision)           0           0           0
        104                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
        104                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
        104                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           0           0           0
        104                             flop_count_sp           Floating Point Operations(Single Precision)           0           0           0
        104                         flop_count_sp_add       Floating Point Operations(Single Precision Add)           0           0           0
        104                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)           0           0           0
        104                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)           0           0           0
        104                     flop_count_sp_special   Floating Point Operations(Single Precision Special)    11988000    11988000    11988000
        104                             inst_executed                                 Instructions Executed    25474645   105457667    69311493
        104                               inst_issued                                   Instructions Issued    25489110    25507044    25495751
        104                          dram_utilization                             Device Memory Utilization    High (7)    High (8)    High (7)
        104                        sysmem_utilization                             System Memory Utilization     Low (1)     Low (1)     Low (1)
        104                          stall_inst_fetch              Issue Stall Reasons (Instructions Fetch)       2.48%       3.29%       2.75%
        104                     stall_exec_dependency            Issue Stall Reasons (Execution Dependency)      19.61%      20.54%      20.06%
        104                   stall_memory_dependency                    Issue Stall Reasons (Data Request)      26.96%      34.25%      31.29%
        104                             stall_texture                         Issue Stall Reasons (Texture)       0.00%       0.00%       0.00%
        104                                stall_sync                 Issue Stall Reasons (Synchronization)       0.00%       0.00%       0.00%
        104                               stall_other                           Issue Stall Reasons (Other)       5.19%       6.43%       5.71%
        104          stall_constant_memory_dependency              Issue Stall Reasons (Immediate constant)       0.53%       1.39%       0.82%
        104                           stall_pipe_busy                       Issue Stall Reasons (Pipe Busy)      15.62%      18.00%      16.58%
        104                         shared_efficiency                              Shared Memory Efficiency       0.00%       0.00%       0.00%
        104                                inst_fp_32                               FP Instructions(Single)    11988000    11988000    11988000
        104                                inst_fp_64                               FP Instructions(Double)           0           0           0
        104                              inst_integer                                  Integer Instructions   617762746   617762746   617762746
        104                          inst_bit_convert                              Bit-Convert Instructions    23976000    23976000    23976000
        104                              inst_control                             Control-Flow Instructions    53946240    53946240    53946240
        104                        inst_compute_ld_st                               Load/Store Instructions    11988000    11988000    11988000
        104                                 inst_misc                                     Misc Instructions    23976480    23976480    23976480
        104           inst_inter_thread_communication                             Inter-Thread Instructions           0           0           0
        104                               issue_slots                                           Issue Slots    25489110    25507044    25495751
        104                                 cf_issued                      Issued Control-Flow Instructions     2435083     2435083     2435083
        104                               cf_executed                    Executed Control-Flow Instructions     2435083     2435083     2435083
        104                               ldst_issued                        Issued Load/Store Instructions      749266      749266      749266
        104                             ldst_executed                      Executed Load/Store Instructions      749266      749266      749266
        104                       atomic_transactions                                   Atomic Transactions           0           0           0
        104           atomic_transactions_per_request                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
        104                      l2_atomic_throughput                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
        104                    l2_atomic_transactions                     L2 Transactions (Atomic requests)           0           0           0
        104                  l2_tex_read_transactions                       L2 Transactions (Texture Reads)      749250      749250      749250
        104                     stall_memory_throttle                 Issue Stall Reasons (Memory Throttle)       0.83%       0.85%       0.84%
        104                        stall_not_selected                    Issue Stall Reasons (Not Selected)      20.34%      24.04%      21.95%
        104                 l2_tex_write_transactions                      L2 Transactions (Texture Writes)      749250      749250      749250
        104             nvlink_total_data_transmitted                         NVLink Total Data Transmitted        1152        1152        1152
        104                nvlink_total_data_received                            NVLink Total Data Received         864         864         864
        104              nvlink_user_data_transmitted                          NVLink User Data Transmitted           0           0           0
        104                 nvlink_user_data_received                             NVLink User Data Received           0           0           0
        104          nvlink_overhead_data_transmitted                      NVLink Overhead Data Transmitted       1.00%       1.00%       1.00%
        104             nvlink_overhead_data_received                         NVLink Overhead Data Received       1.00%       1.00%       1.00%
        104      nvlink_total_nratom_data_transmitted                  NVLink Total Nratom Data Transmitted           0           0           0
        104       nvlink_user_nratom_data_transmitted                   NVLink User Nratom Data Transmitted           0           0           0
        104       nvlink_total_ratom_data_transmitted                   NVLink Total Ratom Data Transmitted           0           0           0
        104        nvlink_user_ratom_data_transmitted                    NVLink User Ratom Data Transmitted           0           0           0
        104       nvlink_total_write_data_transmitted                   NVLink Total Write Data Transmitted           0           0           0
        104        nvlink_user_write_data_transmitted                    NVLink User Write Data Transmitted           0           0           0
        104                nvlink_transmit_throughput                            NVLink Transmit Throughput  13.284MB/s  15.307MB/s  14.518MB/s
        104                 nvlink_receive_throughput                             NVLink Receive Throughput  9.9628MB/s  11.480MB/s  10.888MB/s
        104       nvlink_total_response_data_received                   NVLink Total Response Data Received         288         288         288
        104        nvlink_user_response_data_received                    NVLink User Response Data Received           0           0           0
        104                             flop_count_hp             Floating Point Operations(Half Precision)           0           0           0
        104                         flop_count_hp_add         Floating Point Operations(Half Precision Add)           0           0           0
        104                         flop_count_hp_mul          Floating Point Operation(Half Precision Mul)           0           0           0
        104                         flop_count_hp_fma         Floating Point Operations(Half Precision FMA)           0           0           0
        104                                inst_fp_16                                 HP Instructions(Half)           0           0           0
        104                                       ipc                                          Executed IPC    0.473342    3.118486    1.914070
        104                                issued_ipc                                            Issued IPC    3.028094    3.120229    3.072590
        104                    issue_slot_utilization                                Issue Slot Utilization      75.70%      78.01%      76.81%
        104                             sm_efficiency                               Multiprocessor Activity      82.84%      99.61%      95.29%
        104                        achieved_occupancy                                    Achieved Occupancy    0.912039    0.922041    0.916920
        104                  eligible_warps_per_cycle                       Eligible Warps Per Active Cycle   13.953308   15.857738   14.756881
        104                        shared_utilization                             Shared Memory Utilization    Idle (0)    Idle (0)    Idle (0)
        104                            l2_utilization                                  L2 Cache Utilization     Low (1)     Low (2)     Low (1)
        104                           tex_utilization                             Unified Cache Utilization     Low (1)     Low (1)     Low (1)
        104                       ldst_fu_utilization                  Load/Store Function Unit Utilization     Low (1)     Low (1)     Low (1)
        104                         cf_fu_utilization                Control-Flow Function Unit Utilization     Low (2)     Low (2)     Low (2)
        104                        tex_fu_utilization                     Texture Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        104                    special_fu_utilization                     Special Function Unit Utilization     Low (3)     Low (3)     Low (3)
        104             half_precision_fu_utilization              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        104           single_precision_fu_utilization            Single-Precision Function Unit Utilization     Mid (6)    High (7)     Mid (6)
        104           double_precision_fu_utilization            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
        104                        flop_hp_efficiency                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
        104                        flop_sp_efficiency                          FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
        104                        flop_dp_efficiency                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
        104                   sysmem_read_utilization                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
        104                  sysmem_write_utilization                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
        104       nvlink_data_transmission_efficiency                   NVLink Data Transmission Efficiency       0.00%       0.00%       0.00%
        104            nvlink_data_receive_efficiency                        NVLink Data Receive Efficiency       0.00%       0.00%       0.00%
        104                            stall_sleeping                        Issue Stall Reasons (Sleeping)       0.00%       0.00%       0.00%
======== Error: Application returned non-zero code 130

lucas@ip-172-31-35-196 ~/research/code/ClimateMachine.jl lcw/diff_nstate* 23m 51s
❯

master

The following kernels have local loads and stores:

- `ptxcall_gpu_interface_gradients__17`
- `ptxcall_gpu_interface_tendency__23`
- `ptxcall_gpu_volume_tendency__22`
- `ptxcall_gpu_volume_gradients__16`
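A list like this could be pulled out of an nvprof metrics report with a small script. The sketch below is only an illustration and is not how the list above was produced. It assumes the report contains the `local_load_transactions` and `local_store_transactions` metrics, that each kernel's rows follow a `Kernel:` header, and that rows use the same column layout as the listings in this gist (invocations, metric name, description, min, max, avg). The function name, the kernel names `kernel_a` and `kernel_b`, and the counts in `SAMPLE` are made-up placeholders, not measurements.

```python
# Metrics that indicate local-memory traffic in an nvprof metrics report.
LOCAL_METRICS = {"local_load_transactions", "local_store_transactions"}


def kernels_with_local_traffic(report):
    """Map kernel name -> {metric: max count} for kernels with nonzero local traffic."""
    found = {}
    kernel = None
    for line in report.splitlines():
        stripped = line.strip()
        # Assumed layout: a "Kernel: <name>" header precedes each kernel's rows.
        if stripped.startswith("Kernel:"):
            kernel = stripped[len("Kernel:"):].strip()
            continue
        fields = stripped.split()
        # Metric rows: invocations, metric name, description..., min, max, avg.
        if kernel is None or len(fields) < 5 or fields[1] not in LOCAL_METRICS:
            continue
        try:
            max_count = int(fields[-2])
        except ValueError:
            continue  # skip rows whose max column is not a plain integer
        if max_count > 0:
            found.setdefault(kernel, {})[fields[1]] = max_count
    return found


# Placeholder input (hypothetical kernel names and counts, for illustration only).
SAMPLE = """\
    Kernel: kernel_a
        104                   local_load_transactions                   Local Load Transactions        2048        2048        2048
        104                  local_store_transactions                  Local Store Transactions        1024        1024        1024
    Kernel: kernel_b
        104                   local_load_transactions                   Local Load Transactions           0           0           0
        104                  local_store_transactions                  Local Store Transactions           0           0           0
"""

for name, metrics in kernels_with_local_traffic(SAMPLE).items():
    print(name, metrics)
```

To use it on a real run, read the saved profiler output into a string and pass it to `kernels_with_local_traffic`; only kernels with a nonzero maximum for either metric are reported.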
