We are experimenting with the number of tracers in ClimateMachine. We are starting
with ClimateMachine branch lcw/diff_nstate (aka f25ed2a9c93674ae27c3e0e5280c1aadab64807e).
The following change is made to reduce the overall runtime of the simulation
and to wrap the run in CUDAdrv.@profile:
diff --git a/experiments/AtmosGCM/heldsuarez.jl b/experiments/AtmosGCM/heldsuarez.jl
index e2d980ee1..a1537958b 100755
--- a/experiments/AtmosGCM/heldsuarez.jl
+++ b/experiments/AtmosGCM/heldsuarez.jl
@@ -1,6 +1,7 @@
 #!/usr/bin/env julia --project
 using ClimateMachine
 using ArgParse
+using CUDAdrv
 s = ArgParseSettings()
 @add_arg_table! s begin
@@ -208,7 +209,7 @@ function main()
     poly_order = 5 # discontinuous Galerkin polynomial order
     n_horz = 5 # horizontal element number
     n_vert = 5 # vertical element number
-    n_days = 120 # experiment day number
+    n_days = 0.1 # experiment day number
     timestart = FT(0) # start time (s)
     timeend = FT(n_days * day(param_set)) # end time (s)
@@ -243,12 +244,14 @@ function main()
     end
     # Run the model
-    result = ClimateMachine.invoke!(
-        solver_config;
-        diagnostics_config = dgn_config,
-        user_callbacks = (cbfilter,),
-        check_euclidean_distance = true,
-    )
+    CUDAdrv.@profile begin
+        result = ClimateMachine.invoke!(
+            solver_config;
+            diagnostics_config = dgn_config,
+            user_callbacks = (cbfilter,),
+            check_euclidean_distance = true,
+        )
+    end
 end
 main()
For the first table, we vary the number of tracers and measure the wall-clock time per time step. We are running Julia v1.3.1 on an AWS instance with a V100 GPU, and use the following loop to launch the runs and report the wall-clock time per time step (shown here for two MPI ranks):
for m = 2, n = 0:5
    number_of_ranks = m
    number_of_tracers = 2^n
    @info "Starting...." number_of_ranks number_of_tracers
    run(`mpirun -np $(m) julia --project experiments/AtmosGCM/heldsuarez.jl --monitor-timestep-duration 10steps --number-of-tracers $(2^n)`)
end
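With `--monitor-timestep-duration 10steps`, each run prints "Wall-clock time per time-step" reports like those in the run logs further down this page. One possible way to collect them from a saved log is sketched below. The helper name `step_times` and the idea of saving the output to a string are assumptions made for illustration and are not part of ClimateMachine. The first report is dropped because it is orders of magnitude larger than the rest, presumably because it includes compilation.

```julia
using Statistics  # stdlib, provides `median`

# Hypothetical helper (not part of ClimateMachine): collect the "median (s)"
# value from every wall-clock report found in a log string.
function step_times(log::AbstractString)
    times = Float64[]
    for line in split(log, '\n')
        m = match(r"median \(s\)\s*=\s*(\S+)", line)
        m === nothing && continue
        push!(times, parse(Float64, m.captures[1]))
    end
    return times
end

# Three reports copied from the 2-tracer run log shown later on this page.
sample_log = """
│ median (s) = 5.5339992551999995e+00
│ median (s) = 1.0385914899999999e-02
│ median (s) = 1.0211630600000000e-02
"""

times = step_times(sample_log)
steady = times[2:end]  # drop the first report (assumed to include compilation)
println("steady-state median: ", median(steady), " s")
```

Taking the median over the remaining reports is one way to smooth out run-to-run noise; it is not how the numbers in the table were obtained.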
We get the following performance characteristics:
Ranks  Number of Tracers  Wall-Time Per Step (s)
                          before changes          after changes
-------------------------------------------------------------------------
1      1                  5.8616807999999996e-03  5.7307617999999994e-03
1      2                  6.8119575000000002e-03  6.0563041999999994e-03
1      4                  8.2639034999999993e-03  6.8886373999999997e-03
1      8                  1.2300703499999999e-02  8.7006017000000012e-03
1      16                 1.3299454180000000e-01  1.4463497600000000e-02
1      32                 5.1850621520000006e-01  2.6536463500000003e-02
2      1                  1.7634898700000002e-02
2      2                  1.7925797700000003e-02
2      4                  1.7990231900000000e-02
2      8                  1.8047714700000002e-02
2      16                 1.8218115100000001e-02
2      32                 2.2956788700000001e-02
I just grabbed a random output of the time per time-step and put it into wall_clock_time. The timings did bounce around a little.
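To make the single-rank columns of the table above easier to compare, the ratio of "before changes" to "after changes" can be computed per tracer count. This is only a sketch: the variable names are chosen for illustration, the numbers are copied from the table, and since each entry is a single sample the ratios are indicative only.

```julia
# Single-rank timings copied from the table above (seconds per time step).
tracers = [1, 2, 4, 8, 16, 32]
before = [5.8616807999999996e-03, 6.8119575000000002e-03, 8.2639034999999993e-03,
          1.2300703499999999e-02, 1.3299454180000000e-01, 5.1850621520000006e-01]
after = [5.7307617999999994e-03, 6.0563041999999994e-03, 6.8886373999999997e-03,
         8.7006017000000012e-03, 1.4463497600000000e-02, 2.6536463500000003e-02]

# Element-wise ratio: values above 1 mean the "after changes" run was faster.
speedup = before ./ after

for (n, s) in zip(tracers, speedup)
    println(rpad(string(n), 4), round(s; digits = 2), "x")
end
```

In these samples the ratio stays close to 1 for a small number of tracers and grows sharply at 16 and 32 tracers.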
Below is a profile from a single GPU run with 2 tracers.
❯ /usr/local/cuda-9.0/bin/nvprof julia --project experiments/AtmosGCM/heldsuarez.jl --monitor-timestep-duration 10steps --number-of-tracers 2
==8410== NVPROF is profiling process 8410, command: julia --project experiments/AtmosGCM/heldsuarez.jl --monitor-timestep-duration 10steps --number-of-tracers 2
┌ Info: Model composition
│ param_set = EarthParameterSet()
│ orientation = SphericalOrientation()
│ ref_state = HydrostaticState{DecayingTemperatureProfile{Float32},Float32}(DecayingTemperatureProfile{Float32}(290.0f0, 220.0f0, 8484.2705f0), 0.0f0)
│ turbulence = SmagorinskyLilly{Float32}(0.21f0)
│ hyperdiffusion = StandardHyperDiffusion{Float32}(14400.0f0)
│ moisture = DryModel()
│ precipitation = NoPrecipitation()
│ radiation = NoRadiation()
│ source = (Gravity(), Coriolis(), held_suarez_forcing!, RayleighSponge{Float32}(30000.0f0, 12000.0f0, 0.0011111111f0, Float32[0.0, 0.0, 0.0], 2.0f0))
│ tracers = NTracers{2,Float32}(Float32[1.0, 2.0])
│ boundarycondition = AtmosBC{Impenetrable{FreeSlip},Insulating,Impermeable,ImpermeableTracer}(Impenetrable{FreeSlip}(FreeSlip()), Insulating(), Impermeable(), ImpermeableTracer())
│ init_state_conservative = init_heldsuarez!
└ data_config = HeldSuarezDataConfig{Float32}(255.0f0)
┌ Info: Establishing Atmos GCM configuration for HeldSuarez
│ precision = Float32
│ polynomial order = 5
│ #horiz elems = 5
│ #vert elems = 5
│ domain height = 3.00e+04 m
│ MPI ranks = 1
│ min(Δ_horz) = 167863.59 m
└ min(Δ_vert) = 703.00 m
[ Info: Initializing HeldSuarez
┌ Info: Starting HeldSuarez
│ dt = 9.81818e+01
│ timeend = 8640.00
│ number of steps = 88
└ norm(Q) = 7.6343697997824000e+13
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 5.5339992551999995e+00
│ minimum (s) = 5.5339992551999995e+00
│ median (s) = 5.5339992551999995e+00
└ std (s) = NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 1.0385914899999999e-02
│ minimum (s) = 1.0385914899999999e-02
│ median (s) = 1.0385914899999999e-02
└ std (s) = NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 1.0211630600000000e-02
│ minimum (s) = 1.0211630600000000e-02
│ median (s) = 1.0211630600000000e-02
└ std (s) = NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 1.0108128700000000e-02
│ minimum (s) = 1.0108128700000000e-02
│ median (s) = 1.0108128700000000e-02
└ std (s) = NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 9.9687229000000009e-03
│ minimum (s) = 9.9687229000000009e-03
│ median (s) = 9.9687229000000009e-03
└ std (s) = NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 1.0053819799999999e-02
│ minimum (s) = 1.0053819799999999e-02
│ median (s) = 1.0053819799999999e-02
└ std (s) = NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 1.0057780700000001e-02
│ minimum (s) = 1.0057780700000001e-02
│ median (s) = 1.0057780700000001e-02
└ std (s) = NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 1.0062114200000000e-02
│ minimum (s) = 1.0062114200000000e-02
│ median (s) = 1.0062114200000000e-02
└ std (s) = NaN
┌ Info: Finished
│ norm(Q) = 7.6379534131200000e+13
│ norm(Q) / norm(Q₀) = 1.0004694461822510e+00
└ norm(Q) - norm(Q₀) = 3.5836133376000000e+10
┌ Info: Euclidean distance
│ norm(Q - Qe) = 4.3016309964800000e+11
└ norm(Q - Qe) / norm(Qe) = 5.6345746852457523e-03
==8410== Profiling application: julia --project experiments/AtmosGCM/heldsuarez.jl --monitor-timestep-duration 10steps --number-of-tracers 2
==8410== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 24.17% 213.23ms 178 1.1979ms 1.1633ms 1.3229ms ptxcall_gpu_band_forward_kernel__25
20.83% 183.82ms 178 1.0327ms 1.0061ms 1.0921ms ptxcall_gpu_band_back_kernel__26
10.80% 95.254ms 267 356.76us 342.05us 370.69us ptxcall_gpu_interface_gradients__17
8.69% 76.704ms 267 287.28us 278.46us 312.86us ptxcall_gpu_interface_tendency__23
8.57% 75.619ms 267 283.22us 258.62us 299.68us ptxcall_gpu_volume_gradients__16
5.95% 52.507ms 267 196.65us 187.55us 214.30us ptxcall_gpu_volume_tendency__22
5.80% 51.167ms 1 51.167ms 51.167ms 51.167ms ptxcall_gpu_band_lu_kernel__14
3.50% 30.844ms 20 1.5422ms 1.6320us 6.6805ms [CUDA memcpy DtoH]
2.86% 25.259ms 267 94.603us 93.247us 97.087us ptxcall_gpu_interface_gradients_of_laplacians__21
2.07% 18.307ms 267 68.564us 67.263us 69.888us ptxcall_gpu_interface_divergence_of_gradients__19
1.09% 9.5809ms 268 35.749us 33.280us 37.792us ptxcall_gpu_kernel_nodal_update_auxiliary_state__6
0.81% 7.1712ms 267 26.858us 25.376us 29.056us ptxcall_gpu_volume_divergence_of_gradients__18
0.75% 6.6457ms 267 24.890us 23.552us 26.688us ptxcall_gpu_volume_gradients_of_laplacians__20
0.52% 4.5928ms 178 25.802us 24.480us 28.320us ptxcall_anonymous25_27
0.50% 4.4443ms 150 29.628us 28.671us 38.368us ptxcall_gpu_volume_tendency__10
0.47% 4.1114ms 150 27.409us 27.232us 27.648us ptxcall_anonymous25_12
0.40% 3.5553ms 89 39.946us 39.072us 41.344us ptxcall_gpu_stage_update__28
0.37% 3.2247ms 182 17.718us 15.296us 41.568us ptxcall_copy_kernel__5
0.35% 3.0527ms 59 51.741us 1.4080us 960.35us [CUDA memcpy HtoD]
0.31% 2.7744ms 150 18.496us 17.983us 19.936us ptxcall_gpu_interface_tendency__11
0.30% 2.6561ms 89 29.843us 29.280us 30.592us ptxcall_gpu_solution_update__29
0.28% 2.5039ms 89 28.133us 27.456us 29.440us ptxcall_gpu_stage_update__24
0.19% 1.6869ms 89 18.954us 17.888us 20.639us ptxcall_gpu_kernel_apply_filter__30
0.12% 1.0773ms 150 7.1820us 5.9200us 8.0000us ptxcall_gpu_kernel_set_banded_matrix__13
0.09% 799.32us 150 5.3280us 4.8640us 6.5280us ptxcall_gpu_kernel_set_banded_data__9
0.06% 517.66us 2 258.83us 254.94us 262.72us ptxcall_mapreducedim_kernel_parallel_2
0.04% 323.52us 2 161.76us 153.89us 169.63us ptxcall_gpu_kernel_min_neighbor_distance__1
0.03% 229.44us 158 1.4520us 1.3750us 2.1120us [CUDA memset]
0.02% 209.09us 1 209.09us 209.09us 209.09us ptxcall_mapreducedim_kernel_parallel_8
0.02% 159.74us 1 159.74us 159.74us 159.74us ptxcall_gpu_kernel_init_state_auxiliary__4
0.01% 127.39us 3 42.464us 38.464us 47.520us ptxcall_reduce_kernel_15
0.01% 79.359us 1 79.359us 79.359us 79.359us ptxcall_gpu_kernel_min_neighbor_distance__3
0.01% 45.183us 1 45.183us 45.183us 45.183us ptxcall_reduce_kernel_31
0.00% 18.912us 1 18.912us 18.912us 18.912us ptxcall_gpu_kernel_local_courant__7
API calls: 76.03% 1.62968s 31 52.570ms 311.95us 347.94ms cuModuleLoadDataEx
9.46% 202.81ms 1 202.81ms 202.81ms 202.81ms cuDevicePrimaryCtxRetain
5.50% 117.88ms 73834 1.5960us 1.3390us 554.10us cuEventQuery
3.25% 69.764ms 4239 16.457us 11.149us 2.7691ms cuLaunchKernel
1.80% 38.620ms 20 1.9310ms 41.680us 7.4451ms cuMemcpyDtoH
0.81% 17.359ms 6758 2.5680us 1.6340us 18.214us cuStreamWaitEvent
0.61% 13.100ms 3910 3.3500us 1.4250us 548.71us cuStreamQuery
0.50% 10.745ms 6494 1.6540us 765ns 80.689us cuEventCreate
0.39% 8.3102ms 6494 1.2790us 769ns 15.404us cuEventRecord
0.34% 7.2781ms 10976 663ns 511ns 15.944us cuCtxGetCurrent
0.28% 6.1039ms 53 115.17us 6.1060us 514.12us cuMemAlloc
0.25% 5.4598ms 6494 840ns 618ns 12.691us cuEventDestroy
0.19% 4.1454ms 59 70.260us 7.8060us 1.0051ms cuMemcpyHtoD
0.19% 4.0027ms 31 129.12us 44.343us 461.31us cuModuleUnload
0.16% 3.3973ms 22 154.42us 19.973us 2.8421ms cuMemHostAlloc
0.13% 2.8140ms 158 17.809us 15.331us 60.788us cuMemsetD32
0.03% 716.18us 1 716.18us 716.18us 716.18us cuMemFree
0.02% 489.72us 17 28.807us 5.9450us 257.48us cuStreamCreate
0.01% 250.53us 17 14.736us 6.8140us 49.941us cuCtxSynchronize
0.01% 147.18us 17 8.6570us 4.0860us 26.257us cuStreamDestroy
0.00% 80.993us 31 2.6120us 1.9100us 3.3070us cuCtxPushCurrent
0.00% 66.736us 74 901ns 530ns 1.6470us cuDeviceGetAttribute
0.00% 54.457us 31 1.7560us 1.3620us 7.0930us cuModuleGetFunction
0.00% 51.857us 31 1.6720us 1.1190us 2.6930us cuModuleGetGlobal
0.00% 48.341us 22 2.1970us 1.1900us 7.4510us cuMemHostGetDevicePointer
0.00% 35.389us 34 1.0400us 679ns 5.3470us cuCtxGetDevice
0.00% 25.389us 31 819ns 718ns 1.3540us cuCtxPopCurrent
0.00% 8.0520us 3 2.6840us 1.5450us 3.4050us cuFuncGetAttribute
0.00% 6.9880us 1 6.9880us 6.9880us 6.9880us cuInit
0.00% 6.6200us 5 1.3240us 549ns 2.5390us cuDriverGetVersion
0.00% 5.2570us 4 1.3140us 533ns 2.0740us cuDeviceGetCount
0.00% 4.9140us 1 4.9140us 4.9140us 4.9140us cuCtxSetCurrent
0.00% 2.8130us 2 1.4060us 1.3840us 1.4290us cuDeviceGet
And below is a profile from a single GPU run with 32 tracers.
❯ /usr/local/cuda-9.0/bin/nvprof julia --project experiments/AtmosGCM/heldsuarez.jl --monitor-timestep-duration 10steps --number-of-tracers 32
==8630== NVPROF is profiling process 8630, command: julia --project experiments/AtmosGCM/heldsuarez.jl --monitor-timestep-duration 10steps --number-of-tracers 32
┌ Info: Model composition
│ param_set = EarthParameterSet()
│ orientation = SphericalOrientation()
│ ref_state = HydrostaticState{DecayingTemperatureProfile{Float32},Float32}(DecayingTemperatureProfile{Float32}(290.0f0, 220.0f0, 8484.2705f0), 0.0f0)
│ turbulence = SmagorinskyLilly{Float32}(0.21f0)
│ hyperdiffusion = StandardHyperDiffusion{Float32}(14400.0f0)
│ moisture = DryModel()
│ precipitation = NoPrecipitation()
│ radiation = NoRadiation()
│ source = (Gravity(), Coriolis(), held_suarez_forcing!, RayleighSponge{Float32}(30000.0f0, 12000.0f0, 0.0011111111f0, Float32[0.0, 0.0, 0.0], 2.0f0))
│ tracers = NTracers{32,Float32}(Float32[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0])
│ boundarycondition = AtmosBC{Impenetrable{FreeSlip},Insulating,Impermeable,ImpermeableTracer}(Impenetrable{FreeSlip}(FreeSlip()), Insulating(), Impermeable(), ImpermeableTracer())
│ init_state_conservative = init_heldsuarez!
└ data_config = HeldSuarezDataConfig{Float32}(255.0f0)
┌ Info: Establishing Atmos GCM configuration for HeldSuarez
│ precision = Float32
│ polynomial order = 5
│ #horiz elems = 5
│ #vert elems = 5
│ domain height = 3.00e+04 m
│ MPI ranks = 1
│ min(Δ_horz) = 167863.59 m
└ min(Δ_vert) = 703.00 m
[ Info: Initializing HeldSuarez
┌ Info: Starting HeldSuarez
│ dt = 9.81818e+01
│ timeend = 8640.00
│ number of steps = 88
└ norm(Q) = 7.6345023397888000e+13
┌ Info: Update
│ simtime = 98.18 / 8640.00
│ runtime = 00:01:33
└ norm(Q) = 7.6345367330816000e+13
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 9.5916184646999998e+00
│ minimum (s) = 9.5916184646999998e+00
│ median (s) = 9.5916184646999998e+00
└ std (s) = NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 3.0826927699999999e-02
│ minimum (s) = 3.0826927699999999e-02
│ median (s) = 3.0826927699999999e-02
└ std (s) = NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 2.9374440600000003e-02
│ minimum (s) = 2.9374440600000003e-02
│ median (s) = 2.9374440600000003e-02
└ std (s) = NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 2.9341548200000001e-02
│ minimum (s) = 2.9341548200000001e-02
│ median (s) = 2.9341548200000001e-02
└ std (s) = NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 3.0509419700000002e-02
│ minimum (s) = 3.0509419700000002e-02
│ median (s) = 3.0509419700000002e-02
└ std (s) = NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 2.9288231099999999e-02
│ minimum (s) = 2.9288231099999999e-02
│ median (s) = 2.9288231099999999e-02
└ std (s) = NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 2.9377836400000003e-02
│ minimum (s) = 2.9377836400000003e-02
│ median (s) = 2.9377836400000003e-02
└ std (s) = NaN
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 2.9326005700000001e-02
│ minimum (s) = 2.9326005700000001e-02
│ median (s) = 2.9326005700000001e-02
└ std (s) = NaN
┌ Info: Finished
│ norm(Q) = 7.6380867919872000e+13
│ norm(Q) / norm(Q₀) = 1.0004695653915405e+00
└ norm(Q) - norm(Q₀) = 3.5844521984000000e+10
┌ Info: Euclidean distance
│ norm(Q - Qe) = 4.2996259225600000e+11
└ norm(Q - Qe) / norm(Qe) = 5.6318440474569798e-03
==8630== Profiling application: julia --project experiments/AtmosGCM/heldsuarez.jl --monitor-timestep-duration 10steps --number-of-tracers 32
==8630== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 25.95% 695.69ms 267 2.6056ms 2.5800ms 2.7157ms ptxcall_gpu_interface_tendency__23
19.54% 523.92ms 267 1.9623ms 1.9260ms 2.1054ms ptxcall_gpu_volume_gradients__16
14.65% 392.67ms 267 1.4707ms 1.4535ms 1.5205ms ptxcall_gpu_interface_gradients__17
13.43% 360.04ms 267 1.3485ms 1.3212ms 1.3967ms ptxcall_gpu_volume_tendency__22
7.83% 209.89ms 178 1.1792ms 1.1660ms 1.2611ms ptxcall_gpu_band_forward_kernel__25
6.71% 179.89ms 178 1.0106ms 998.62us 1.0796ms ptxcall_gpu_band_back_kernel__26
2.14% 57.354ms 21 2.7311ms 1.5360us 18.825ms [CUDA memcpy DtoH]
1.91% 51.307ms 1 51.307ms 51.307ms 51.307ms ptxcall_gpu_band_lu_kernel__14
0.94% 25.221ms 267 94.459us 93.279us 97.023us ptxcall_gpu_interface_gradients_of_laplacians__21
0.86% 22.977ms 268 85.734us 84.256us 87.328us ptxcall_gpu_kernel_nodal_update_auxiliary_state__6
0.78% 20.971ms 178 117.81us 116.26us 135.17us ptxcall_anonymous25_27
0.75% 20.140ms 150 134.27us 134.02us 134.75us ptxcall_anonymous25_12
0.70% 18.751ms 89 210.69us 209.63us 211.84us ptxcall_gpu_stage_update__28
0.68% 18.288ms 267 68.492us 67.135us 70.591us ptxcall_gpu_interface_divergence_of_gradients__19
0.51% 13.778ms 89 154.80us 154.14us 155.71us ptxcall_gpu_stage_update__24
0.51% 13.585ms 89 152.64us 151.87us 153.28us ptxcall_gpu_solution_update__29
0.51% 13.567ms 182 74.546us 73.056us 107.10us ptxcall_copy_kernel__5
0.37% 9.9749ms 89 112.08us 109.98us 123.10us ptxcall_gpu_kernel_apply_filter__30
0.27% 7.1531ms 267 26.790us 25.376us 28.672us ptxcall_gpu_volume_divergence_of_gradients__18
0.25% 6.7055ms 59 113.65us 1.4080us 2.3613ms [CUDA memcpy HtoD]
0.24% 6.5529ms 267 24.542us 23.552us 26.848us ptxcall_gpu_volume_gradients_of_laplacians__20
0.18% 4.8596ms 150 32.397us 31.231us 40.160us ptxcall_gpu_volume_tendency__10
0.11% 2.8805ms 150 19.203us 18.399us 21.504us ptxcall_gpu_interface_tendency__11
0.04% 1.1954ms 150 7.9690us 6.5600us 9.2160us ptxcall_gpu_kernel_set_banded_matrix__13
0.03% 933.69us 150 6.2240us 5.8240us 6.9760us ptxcall_gpu_kernel_set_banded_data__9
0.03% 856.03us 4 214.01us 200.54us 227.97us ptxcall_reduce_kernel_15
0.02% 524.13us 2 262.06us 254.46us 269.66us ptxcall_mapreducedim_kernel_parallel_2
0.01% 323.58us 2 161.79us 155.20us 168.38us ptxcall_gpu_kernel_min_neighbor_distance__1
0.01% 245.28us 1 245.28us 245.28us 245.28us ptxcall_gpu_kernel_init_state_auxiliary__4
0.01% 231.97us 159 1.4580us 1.3760us 2.8160us [CUDA memset]
0.01% 212.03us 1 212.03us 212.03us 212.03us ptxcall_reduce_kernel_31
0.01% 210.49us 1 210.49us 210.49us 210.49us ptxcall_mapreducedim_kernel_parallel_8
0.00% 83.040us 1 83.040us 83.040us 83.040us ptxcall_gpu_kernel_min_neighbor_distance__3
0.00% 18.624us 1 18.624us 18.624us 18.624us ptxcall_gpu_kernel_local_courant__7
API calls: 74.34% 3.20947s 31 103.53ms 307.36us 822.88ms cuModuleLoadDataEx
15.11% 652.31ms 415374 1.5700us 1.1730us 573.33us cuEventQuery
4.53% 195.37ms 1 195.37ms 195.37ms 195.37ms cuDevicePrimaryCtxRetain
1.65% 71.253ms 4240 16.805us 10.758us 3.8295ms cuLaunchKernel
1.64% 70.886ms 21 3.3755ms 43.558us 19.948ms cuMemcpyDtoH
0.60% 25.794ms 17 1.5173ms 7.2710us 3.7073ms cuCtxSynchronize
0.41% 17.799ms 6758 2.6330us 1.5980us 544.43us cuStreamWaitEvent
0.34% 14.529ms 4964 2.9260us 1.4640us 26.409us cuStreamQuery
0.26% 11.093ms 6494 1.7080us 747ns 125.90us cuEventCreate
0.19% 8.0869ms 6494 1.2450us 781ns 15.386us cuEventRecord
0.18% 7.8925ms 59 133.77us 8.4800us 2.4269ms cuMemcpyHtoD
0.17% 7.1780ms 10978 653ns 509ns 16.014us cuCtxGetCurrent
0.16% 7.0254ms 6494 1.0810us 623ns 16.530us cuEventDestroy
0.14% 6.2573ms 53 118.06us 6.3470us 521.02us cuMemAlloc
0.10% 4.1826ms 31 134.92us 45.632us 491.68us cuModuleUnload
0.08% 3.3178ms 22 150.81us 19.702us 2.7680ms cuMemHostAlloc
0.07% 2.8385ms 159 17.852us 14.124us 62.253us cuMemsetD32
0.02% 743.41us 2 371.71us 41.525us 701.89us cuMemFree
0.01% 480.70us 17 28.276us 6.2170us 260.83us cuStreamCreate
0.00% 149.65us 17 8.8020us 4.0660us 31.065us cuStreamDestroy
0.00% 76.700us 31 2.4740us 1.8700us 3.3520us cuCtxPushCurrent
0.00% 76.556us 74 1.0340us 532ns 4.6820us cuDeviceGetAttribute
0.00% 51.206us 31 1.6510us 1.0450us 2.3150us cuModuleGetGlobal
0.00% 48.767us 31 1.5730us 1.3080us 2.0740us cuModuleGetFunction
0.00% 44.572us 22 2.0260us 1.2210us 6.8370us cuMemHostGetDevicePointer
0.00% 28.014us 34 823ns 662ns 1.1920us cuCtxGetDevice
0.00% 24.982us 31 805ns 715ns 1.3550us cuCtxPopCurrent
0.00% 8.1770us 1 8.1770us 8.1770us 8.1770us cuInit
0.00% 7.6880us 3 2.5620us 1.3500us 3.2670us cuFuncGetAttribute
0.00% 7.4370us 5 1.4870us 566ns 3.1340us cuDriverGetVersion
0.00% 5.5980us 4 1.3990us 592ns 1.9070us cuDeviceGetCount
0.00% 4.6710us 1 4.6710us 4.6710us 4.6710us cuCtxSetCurrent
0.00% 2.6270us 2 1.3130us 1.2830us 1.3440us cuDeviceGet
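The two nvprof summaries above can be compared kernel by kernel. The sketch below takes the "Avg" column for the six most expensive kernels in both runs and computes the ratio between the 32-tracer and the 2-tracer run. The values are copied from the summaries and converted to microseconds; the kernel name prefixes `ptxcall_gpu_` are dropped for brevity, and the variable names are chosen for illustration.

```julia
# (kernel, avg time with 2 tracers, avg time with 32 tracers), in microseconds,
# copied from the "Avg" column of the two nvprof summaries above.
kernels = [
    ("interface_tendency__23",  287.28, 2605.6),
    ("volume_gradients__16",    283.22, 1962.3),
    ("interface_gradients__17", 356.76, 1470.7),
    ("volume_tendency__22",     196.65, 1348.5),
    ("band_forward_kernel__25", 1197.9, 1179.2),
    ("band_back_kernel__26",    1032.7, 1010.6),
]

# Ratio of per-call time: 32 tracers relative to 2 tracers.
ratios = [(name, t32 / t2) for (name, t2, t32) in kernels]

for (name, r) in ratios
    println(rpad(name, 26), round(r; digits = 2), "x")
end
```

In this data the tendency and gradient kernels take several times longer per call with 32 tracers, while the banded solver kernels take essentially the same time in both runs.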
Here are the detailed kernel metrics for 32 tracers.
❯ /usr/local/cuda-9.0/bin/nvprof --profile-from-start off --metrics all julia --project experiments/AtmosGCM/heldsuarez.jl --monitor-timestep-duration 10steps --number-of-tracers 32
==8711== NVPROF is profiling process 8711, command: julia --project experiments/AtmosGCM/heldsuarez.jl --monitor-timestep-duration 10steps --number-of-tracers 32
┌ Warning: You are using CUDNN 7.6.2 for CUDA 10.1.0 with CUDA toolkit 9.0.176; these might be incompatible.
└ @ CuArrays ~/.julia/packages/CuArrays/A6GUx/src/CuArrays.jl:128
[1590103948.401436] [ip-172-31-35-196:8711 :0] parser.c:1310 UCX WARN unused env variable: UCX_MEMTYPE_CACHE (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
┌ Info: Model composition
│ param_set = EarthParameterSet()
│ orientation = SphericalOrientation()
│ ref_state = HydrostaticState{DecayingTemperatureProfile{Float32},Float32}(DecayingTemperatureProfile{Float32}(290.0f0, 220.0f0, 8484.2705f0), 0.0f0)
│ turbulence = SmagorinskyLilly{Float32}(0.21f0)
│ hyperdiffusion = StandardHyperDiffusion{Float32}(14400.0f0)
│ moisture = DryModel()
│ precipitation = NoPrecipitation()
│ radiation = NoRadiation()
│ source = (Gravity(), Coriolis(), held_suarez_forcing!, RayleighSponge{Float32}(30000.0f0, 12000.0f0, 0.0011111111f0, Float32[0.0, 0.0, 0.0], 2.0f0))
│ tracers = NTracers{32,Float32}(Float32[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0])
│ boundarycondition = AtmosBC{Impenetrable{FreeSlip},Insulating,Impermeable,ImpermeableTracer}(Impenetrable{FreeSlip}(FreeSlip()), Insulating(), Impermeable(), ImpermeableTracer())
│ init_state_conservative = init_heldsuarez!
└ data_config = HeldSuarezDataConfig{Float32}(255.0f0)
┌ Info: Establishing Atmos GCM configuration for HeldSuarez
│ precision = Float32
│ polynomial order = 5
│ #horiz elems = 5
│ #vert elems = 5
│ domain height = 3.00e+04 m
│ MPI ranks = 1
│ min(Δ_horz) = 167863.59 m
└ min(Δ_vert) = 703.00 m
[ Info: Initializing HeldSuarez
┌ Warning: Calling CUDAdrv.@profile only informs an external profiler to start.
│ The user is responsible for launching Julia under a CUDA profiler like `nvprof`.
│
│ For improved usability, launch Julia under the Nsight Systems profiler:
│ $ nsys launch -t cuda,cublas,cudnn,nvtx julia
└ @ CUDAdrv.Profile ~/.julia/packages/CUDAdrv/b1mvw/src/profile.jl:42
==8711== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Starting HeldSuarez
│ dt = 9.81818e+01
│ timeend = 8640.00
│ number of steps = 88
└ norm(Q) = 7.6344847237120000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│ simtime = 98.18 / 8640.00
│ runtime = 00:01:48
└ norm(Q) = 7.6345182781440000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│ simtime = 490.91 / 8640.00
│ runtime = 00:02:49
└ norm(Q) = 7.6346701119488000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│ simtime = 883.64 / 8640.00
│ runtime = 00:03:56
└ norm(Q) = 7.6348211068928000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 2.5658203417399999e+01
│ minimum (s) = 2.5658203417399999e+01
│ median (s) = 2.5658203417399999e+01
└ std (s) = NaN
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│ simtime = 1276.36 / 8640.00
│ runtime = 00:05:08
└ norm(Q) = 7.6349737795584000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│ simtime = 1669.09 / 8640.00
│ runtime = 00:06:24
└ norm(Q) = 7.6351298076672000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│ simtime = 1963.64 / 8640.00
│ runtime = 00:07:25
└ norm(Q) = 7.6352464093184000e+13
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 1.9160640540700001e+01
│ minimum (s) = 1.9160640540700001e+01
│ median (s) = 1.9160640540700001e+01
└ std (s) = NaN
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│ simtime = 2258.18 / 8640.00
│ runtime = 00:08:28
└ norm(Q) = 7.6353655275520000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│ simtime = 2552.73 / 8640.00
│ runtime = 00:09:34
└ norm(Q) = 7.6354838069248000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│ simtime = 2847.27 / 8640.00
│ runtime = 00:10:43
└ norm(Q) = 7.6356029251584000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 2.2144820997299998e+01
│ minimum (s) = 2.2144820997299998e+01
│ median (s) = 2.2144820997299998e+01
└ std (s) = NaN
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│ simtime = 3141.82 / 8640.00
│ runtime = 00:11:55
└ norm(Q) = 7.6357220433920000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│ simtime = 3436.36 / 8640.00
│ runtime = 00:13:11
└ norm(Q) = 7.6358436782080000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│ simtime = 3730.91 / 8640.00
│ runtime = 00:14:29
└ norm(Q) = 7.6359644741632000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 2.5581604319499998e+01
│ minimum (s) = 2.5581604319499998e+01
│ median (s) = 2.5581604319499998e+01
└ std (s) = NaN
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│ simtime = 4025.46 / 8640.00
│ runtime = 00:15:50
└ norm(Q) = 7.6360852701184000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│ simtime = 4320.00 / 8640.00
│ runtime = 00:17:14
└ norm(Q) = 7.6362077437952000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│ simtime = 4614.55 / 8640.00
│ runtime = 00:18:41
└ norm(Q) = 7.6363293786112000e+13
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_reduce_kernel_15" (done)
┌ Info: Update
│ simtime = 4909.09 / 8640.00
│ runtime = 00:20:10
└ norm(Q) = 7.6364510134272000e+13
┌ Info: Wall-clock time per time-step (statistics across MPI ranks)
│ maximum (s) = 2.8795298310000000e+01
│ minimum (s) = 2.8795298310000000e+01
│ median (s) = 2.8795298310000000e+01
└ std (s) = NaN
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_solution_update__29" (done)
Replaying kernel "ptxcall_gpu_kernel_apply_filter__30" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__24" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (done)
Replaying kernel "ptxcall_gpu_band_back_kernel__26" (done)
Replaying kernel "ptxcall_anonymous25_27" (done)
Replaying kernel "ptxcall_gpu_kernel_nodal_update_auxiliary_state__6" (done)
Replaying kernel "ptxcall_gpu_volume_gradients__16" (done)
Replaying kernel "ptxcall_gpu_interface_gradients__17" (done)
Replaying kernel "ptxcall_gpu_volume_divergence_of_gradients__18" (done)
Replaying kernel "ptxcall_gpu_interface_divergence_of_gradients__19" (done)
Replaying kernel "ptxcall_gpu_volume_gradients_of_laplacians__20" (done)
Replaying kernel "ptxcall_gpu_interface_gradients_of_laplacians__21" (done)
Replaying kernel "ptxcall_gpu_volume_tendency__22" (done)
Replaying kernel "ptxcall_gpu_interface_tendency__23" (done)
Replaying kernel "ptxcall_gpu_stage_update__28" (done)
Replaying kernel "ptxcall_copy_kernel__5" (done)
Replaying kernel "ptxcall_gpu_band_forward_kernel__25" (22 of 22)...
l2_subp0_write_tex_hit_sectors
l2_subp1_write_tex_hit_sectors
7 internal events
^C
signal (2): Interrupt
in expression starting at /home/lucas/research/code/ClimateMachine.jl/experiments/AtmosGCM/heldsuarez.jl:257
unknown function (ip: 0x7f14dfa5d12c)
unknown function (ip: 0x7ffd8f48e13f)
unknown function (ip: 0x7f14df785a0a)
unknown function (ip: 0x88)
unknown function (ip: 0xffffffffffffffff)
Allocations: 579745084 (Pool: 579653498; Big: 91586); GC: 437
==8711== Profiling application: julia --project experiments/AtmosGCM/heldsuarez.jl --monitor-timestep-duration 10steps --number-of-tracers 32
==8711== Profiling result:
==8711== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla V100-SXM2-16GB (0)"
Kernel: ptxcall_anonymous25_27
103 inst_per_warp Instructions per warp 652.978401 652.978401 652.978401
103 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
103 warp_execution_efficiency Warp Execution Efficiency 100.00% 100.00% 100.00%
103 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 95.80% 95.80% 95.80%
103 inst_replay_overhead Instruction Replay Overhead 0.000371 0.000963 0.000583
103 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
103 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
103 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
103 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
103 gld_transactions_per_request Global Load Transactions Per Request 3.999989 3.999989 3.999989
103 gst_transactions_per_request Global Store Transactions Per Request 3.999989 3.999989 3.999989
103 shared_store_transactions Shared Store Transactions 0 0 0
103 shared_load_transactions Shared Load Transactions 0 0 0
103 local_load_transactions Local Load Transactions 0 0 0
103 local_store_transactions Local Store Transactions 0 0 0
103 gld_transactions Global Load Transactions 1498500 1498500 1498500
103 gst_transactions Global Store Transactions 749250 749250 749250
103 sysmem_read_transactions System Memory Read Transactions 0 0 0
103 sysmem_write_transactions System Memory Write Transactions 5 5 5
103 l2_read_transactions L2 Read Transactions 1498596 1499604 1499040
103 l2_write_transactions L2 Write Transactions 749287 775859 758273
103 dram_read_transactions Device Memory Read Transactions 1498506 1498922 1498591
103 dram_write_transactions Device Memory Write Transactions 736920 767566 753743
103 global_hit_rate Global Hit Rate in unified l1/tex 33.09% 33.31% 33.22%
103 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
103 gld_requested_throughput Requested Global Load Throughput 329.94GB/s 383.44GB/s 360.17GB/s
103 gst_requested_throughput Requested Global Store Throughput 164.97GB/s 191.72GB/s 180.09GB/s
103 gld_throughput Global Load Throughput 329.94GB/s 383.44GB/s 360.17GB/s
103 gst_throughput Global Store Throughput 164.97GB/s 191.72GB/s 180.09GB/s
103 local_memory_overhead Local Memory Overhead 33.11% 33.32% 33.22%
103 tex_cache_hit_rate Unified Cache Hit Rate 0.00% 0.00% 0.00%
103 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 0.00% 0.00% 0.00%
103 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 100.00% 100.00% 100.00%
103 dram_read_throughput Device Memory Read Throughput 329.97GB/s 383.48GB/s 360.19GB/s
103 dram_write_throughput Device Memory Write Throughput 162.28GB/s 195.81GB/s 181.17GB/s
103 tex_cache_throughput Unified cache to SM throughput 494.91GB/s 575.16GB/s 540.26GB/s
103 l2_tex_read_throughput L2 Throughput (Texture Reads) 329.94GB/s 383.44GB/s 360.17GB/s
103 l2_tex_write_throughput L2 Throughput (Texture Writes) 164.97GB/s 191.72GB/s 180.09GB/s
103 l2_read_throughput L2 Throughput (Reads) 330.16GB/s 383.71GB/s 360.30GB/s
103 l2_write_throughput L2 Throughput (Writes) 164.98GB/s 197.51GB/s 182.25GB/s
103 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
103 sysmem_write_throughput System Memory Write Throughput 1.1273MB/s 1.3101MB/s 1.2306MB/s
103 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
103 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
103 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
103 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
103 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
103 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
103 tex_cache_transactions Unified cache to SM transactions 561946 561946 561946
103 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
103 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
103 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
103 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
103 flop_count_sp Floating Point Operations(Single Precision) 5994000 5994000 5994000
103 flop_count_sp_add Floating Point Operations(Single Precision Add) 5994000 5994000 5994000
103 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 0 0 0
103 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
103 flop_count_sp_special Floating Point Operations(Single Precision Special) 11988000 11988000 11988000
103 inst_executed Instructions Executed 42332892 122315914 81936135
103 inst_issued Instructions Issued 42348613 42373678 42358014
103 dram_utilization Device Memory Utilization Mid (6) High (7) Mid (6)
103 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
103 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 2.03% 2.68% 2.30%
103 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 16.07% 16.62% 16.35%
103 stall_memory_dependency Issue Stall Reasons (Data Request) 19.29% 25.51% 23.00%
103 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
103 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
103 stall_other Issue Stall Reasons (Other) 7.77% 8.75% 8.18%
103 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.18% 0.62% 0.37%
103 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 20.92% 23.41% 21.85%
103 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
103 inst_fp_32 FP Instructions(Single) 17982000 17982000 17982000
103 inst_fp_64 FP Instructions(Double) 0 0 0
103 inst_integer Integer Instructions 965417386 965417386 965417386
103 inst_bit_convert Bit-Convert Instructions 23976000 23976000 23976000
103 inst_control Control-Flow Instructions 53946240 53946240 53946240
103 inst_compute_ld_st Load/Store Instructions 53946000 53946000 53946000
103 inst_misc Misc Instructions 167832480 167832480 167832480
103 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
103 issue_slots Issue Slots 42348613 42373678 42358014
103 cf_issued Issued Control-Flow Instructions 3558961 3558961 3558961
103 cf_executed Executed Control-Flow Instructions 3558961 3558961 3558961
103 ldst_issued Issued Load/Store Instructions 936579 936579 936579
103 ldst_executed Executed Load/Store Instructions 936579 936579 936579
103 atomic_transactions Atomic Transactions 0 0 0
103 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
103 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
103 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
103 l2_tex_read_transactions L2 Transactions (Texture Reads) 1498500 1498500 1498500
103 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.63% 0.65% 0.64%
103 stall_not_selected Issue Stall Reasons (Not Selected) 25.93% 29.23% 27.32%
103 l2_tex_write_transactions L2 Transactions (Texture Writes) 749250 749250 749250
103 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
103 nvlink_total_data_received NVLink Total Data Received 864 864 864
103 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
103 nvlink_user_data_received NVLink User Data Received 0 0 0
103 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
103 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
103 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
103 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
103 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
103 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
103 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
103 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
103 nvlink_transmit_throughput NVLink Transmit Throughput 8.1166MB/s 9.4328MB/s 8.8604MB/s
103 nvlink_receive_throughput NVLink Receive Throughput 6.0875MB/s 7.0746MB/s 6.6453MB/s
103 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
103 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
103 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
103 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
103 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
103 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
103 inst_fp_16 HP Instructions(Half) 0 0 0
103 ipc Executed IPC 0.541810 3.099902 1.818507
103 issued_ipc Issued IPC 3.034042 3.101053 3.069795
103 issue_slot_utilization Issue Slot Utilization 75.85% 77.53% 76.74%
103 sm_efficiency Multiprocessor Activity 92.25% 99.71% 96.86%
103 achieved_occupancy Achieved Occupancy 0.686338 0.694919 0.691002
103 eligible_warps_per_cycle Eligible Warps Per Active Cycle 13.383984 14.656893 13.914015
103 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
103 l2_utilization L2 Cache Utilization Low (1) Low (2) Low (1)
103 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
103 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
103 cf_fu_utilization Control-Flow Function Unit Utilization Low (2) Low (2) Low (2)
103 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
103 special_fu_utilization Special Function Unit Utilization Low (2) Low (2) Low (2)
103 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
103 single_precision_fu_utilization Single-Precision Function Unit Utilization Mid (6) Mid (6) Mid (6)
103 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
103 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
103 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.02% 0.33% 0.28%
103 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
103 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
103 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
103 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
103 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
103 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_gpu_interface_gradients__17
155 inst_per_warp Instructions per warp 1.1750e+04 1.1750e+04 1.1750e+04
155 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
155 warp_execution_efficiency Warp Execution Efficiency 56.25% 56.25% 56.25%
155 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 55.21% 55.21% 55.21%
155 inst_replay_overhead Instruction Replay Overhead 0.001535 0.002052 0.001742
155 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
155 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
155 local_load_transactions_per_request Local Memory Load Transactions Per Request 2.500000 2.500000 2.500000
155 local_store_transactions_per_request Local Memory Store Transactions Per Request 2.500000 2.500000 2.500000
155 gld_transactions_per_request Global Load Transactions Per Request 9.201834 9.202861 9.202299
155 gst_transactions_per_request Global Store Transactions Per Request 9.250000 9.250000 9.250000
155 shared_store_transactions Shared Store Transactions 0 0 0
155 shared_load_transactions Shared Load Transactions 0 0 0
155 local_load_transactions Local Load Transactions 4260000 4260000 4260000
155 local_store_transactions Local Store Transactions 4020000 4020000 4020000
155 gld_transactions Global Load Transactions 11442480 11443758 11443058
155 gst_transactions Global Store Transactions 6549000 6549000 6549000
155 sysmem_read_transactions System Memory Read Transactions 0 0 0
155 sysmem_write_transactions System Memory Write Transactions 5 5 5
155 l2_read_transactions L2 Read Transactions 13529123 13647031 13592782
155 l2_write_transactions L2 Write Transactions 11393200 11426356 11409493
155 dram_read_transactions Device Memory Read Transactions 15091535 15293009 15176859
155 dram_write_transactions Device Memory Write Transactions 10452264 10476862 10465751
155 global_hit_rate Global Hit Rate in unified l1/tex 38.55% 38.68% 38.63%
155 local_hit_rate Local Hit Rate 18.94% 20.44% 19.72%
155 gld_requested_throughput Requested Global Load Throughput 54.246GB/s 56.252GB/s 55.384GB/s
155 gst_requested_throughput Requested Global Store Throughput 30.754GB/s 31.891GB/s 31.399GB/s
155 gld_throughput Global Load Throughput 220.91GB/s 229.09GB/s 225.55GB/s
155 gst_throughput Global Store Throughput 126.43GB/s 131.11GB/s 129.08GB/s
155 local_memory_overhead Local Memory Overhead 54.18% 54.45% 54.30%
155 tex_cache_hit_rate Unified Cache Hit Rate 7.80% 8.28% 8.05%
155 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 11.49% 12.90% 12.31%
155 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 61.69% 61.81% 61.76%
155 dram_read_throughput Device Memory Read Throughput 291.66GB/s 304.74GB/s 299.14GB/s
155 dram_write_throughput Device Memory Write Throughput 201.96GB/s 209.73GB/s 206.29GB/s
155 tex_cache_throughput Unified cache to SM throughput 303.94GB/s 315.25GB/s 310.39GB/s
155 l2_tex_read_throughput L2 Throughput (Texture Reads) 261.98GB/s 272.62GB/s 267.96GB/s
155 l2_tex_write_throughput L2 Throughput (Texture Writes) 204.04GB/s 211.59GB/s 208.32GB/s
155 l2_read_throughput L2 Throughput (Reads) 262.35GB/s 272.73GB/s 267.92GB/s
155 l2_write_throughput L2 Throughput (Writes) 220.30GB/s 228.66GB/s 224.89GB/s
155 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 sysmem_write_throughput System Memory Write Throughput 101.22KB/s 104.96KB/s 103.34KB/s
155 local_load_throughput Local Memory Load Throughput 82.241GB/s 85.283GB/s 83.967GB/s
155 local_store_throughput Local Memory Store Throughput 77.608GB/s 80.479GB/s 79.236GB/s
155 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 gld_efficiency Global Memory Load Efficiency 24.55% 24.56% 24.56%
155 gst_efficiency Global Memory Store Efficiency 24.32% 24.32% 24.32%
155 tex_cache_transactions Unified cache to SM transactions 3931356 3940408 3936815
155 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
155 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
155 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
155 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
155 flop_count_sp Floating Point Operations(Single Precision) 87263985 87264000 87263999
155 flop_count_sp_add Floating Point Operations(Single Precision Add) 14255999 14256000 14255999
155 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 21923993 21924000 21923999
155 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 29160000 29160000 29160000
155 flop_count_sp_special Floating Point Operations(Single Precision Special) 647999 648000 647999
155 inst_executed Instructions Executed 11538000 17625690 14954583
155 inst_issued Instructions Issued 11554816 11561673 11557941
155 dram_utilization Device Memory Utilization Mid (6) High (7) Mid (6)
155 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
155 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 0.21% 1.14% 0.51%
155 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 1.43% 1.68% 1.54%
155 stall_memory_dependency Issue Stall Reasons (Data Request) 91.78% 93.85% 92.87%
155 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
155 stall_sync Issue Stall Reasons (Synchronization) 0.73% 0.93% 0.85%
155 stall_other Issue Stall Reasons (Other) 0.02% 0.02% 0.02%
155 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.09% 0.29% 0.13%
155 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 0.03% 0.04% 0.03%
155 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
155 inst_fp_32 FP Instructions(Single) 66419996 66420000 66419999
155 inst_fp_64 FP Instructions(Double) 0 0 0
155 inst_integer Integer Instructions 68391000 68391012 68391000
155 inst_bit_convert Bit-Convert Instructions 0 0 0
155 inst_control Control-Flow Instructions 2646000 2646008 2646000
155 inst_compute_ld_st Load/Store Instructions 61128000 61128000 61128000
155 inst_misc Misc Instructions 7884000 7884005 7884000
155 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
155 issue_slots Issue Slots 11554816 11561673 11557941
155 cf_issued Issued Control-Flow Instructions 184500 184516 184500
155 cf_executed Executed Control-Flow Instructions 184500 184516 184500
155 ldst_issued Issued Load/Store Instructions 3432000 3432001 3432000
155 ldst_executed Executed Load/Store Instructions 3432000 3432001 3432000
155 atomic_transactions Atomic Transactions 0 0 0
155 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
155 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
155 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
155 l2_tex_read_transactions L2 Transactions (Texture Reads) 13534372 13660443 13594768
155 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 3.27% 4.54% 3.97%
155 stall_not_selected Issue Stall Reasons (Not Selected) 0.06% 0.08% 0.07%
155 l2_tex_write_transactions L2 Transactions (Texture Writes) 10569000 10569000 10569000
155 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1920 1156
155 nvlink_total_data_received NVLink Total Data Received 864 1440 867
155 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
155 nvlink_user_data_received NVLink User Data Received 0 0 0
155 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
155 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
155 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
155 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
155 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
155 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
155 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
155 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
155 nvlink_transmit_throughput NVLink Transmit Throughput 728.76KB/s 1.1962MB/s 747.25KB/s
155 nvlink_receive_throughput NVLink Receive Throughput 546.57KB/s 918.67KB/s 560.43KB/s
155 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
155 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
155 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
155 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
155 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
155 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
155 inst_fp_16 HP Instructions(Half) 0 0 0
155 ipc Executed IPC 0.071277 0.120621 0.091372
155 issued_ipc Issued IPC 0.071367 0.083051 0.076398
155 issue_slot_utilization Issue Slot Utilization 1.78% 2.08% 1.91%
155 sm_efficiency Multiprocessor Activity 85.71% 90.37% 87.35%
155 achieved_occupancy Achieved Occupancy 0.115959 0.123401 0.120441
155 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.076379 0.089024 0.081801
155 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
155 l2_utilization L2 Cache Utilization Low (1) Low (1) Low (1)
155 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
155 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
155 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
155 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 special_fu_utilization Special Function Unit Utilization Low (1) Low (1) Low (1)
155 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
155 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
155 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.36% 0.42% 0.39%
155 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
155 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
155 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
155 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
155 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
155 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_gpu_interface_divergence_of_gradients__19
155 inst_per_warp Instructions per warp 1.6060e+03 1.6060e+03 1.6060e+03
155 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
155 warp_execution_efficiency Warp Execution Efficiency 56.25% 56.25% 56.25%
155 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 55.27% 55.27% 55.27%
155 inst_replay_overhead Instruction Replay Overhead 0.012460 0.014881 0.013454
155 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
155 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
155 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
155 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
155 gld_transactions_per_request Global Load Transactions Per Request 9.198303 9.204780 9.201584
155 gst_transactions_per_request Global Store Transactions Per Request 9.250000 9.250000 9.250000
155 shared_store_transactions Shared Store Transactions 0 0 0
155 shared_load_transactions Shared Load Transactions 0 0 0
155 local_load_transactions Local Load Transactions 0 0 0
155 local_store_transactions Local Store Transactions 0 0 0
155 gld_transactions Global Load Transactions 1945441 1946811 1946134
155 gst_transactions Global Store Transactions 222000 222000 222000
155 sysmem_read_transactions System Memory Read Transactions 0 0 0
155 sysmem_write_transactions System Memory Write Transactions 5 5 5
155 l2_read_transactions L2 Read Transactions 1553181 1560584 1558571
155 l2_write_transactions L2 Write Transactions 260048 286923 272299
155 dram_read_transactions Device Memory Read Transactions 1218056 1248356 1231080
155 dram_write_transactions Device Memory Write Transactions 273989 299321 287240
155 global_hit_rate Global Hit Rate in unified l1/tex 28.13% 28.38% 28.24%
155 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
155 gld_requested_throughput Requested Global Load Throughput 202.88GB/s 207.89GB/s 205.79GB/s
155 gst_requested_throughput Requested Global Store Throughput 21.916GB/s 22.457GB/s 22.231GB/s
155 gld_throughput Global Load Throughput 789.78GB/s 809.53GB/s 801.21GB/s
155 gst_throughput Global Store Throughput 90.101GB/s 92.325GB/s 91.395GB/s
155 local_memory_overhead Local Memory Overhead 12.28% 12.87% 12.58%
155 tex_cache_hit_rate Unified Cache Hit Rate 18.16% 18.48% 18.27%
155 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 36.53% 38.13% 37.43%
155 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 80.40% 81.52% 80.90%
155 dram_read_throughput Device Memory Read Throughput 495.39GB/s 514.32GB/s 506.82GB/s
155 dram_write_throughput Device Memory Write Throughput 112.84GB/s 124.10GB/s 118.25GB/s
155 tex_cache_throughput Unified cache to SM throughput 631.52GB/s 647.28GB/s 640.82GB/s
155 l2_tex_read_throughput L2 Throughput (Texture Reads) 632.16GB/s 647.90GB/s 641.35GB/s
155 l2_tex_write_throughput L2 Throughput (Texture Writes) 90.101GB/s 92.325GB/s 91.395GB/s
155 l2_read_throughput L2 Throughput (Reads) 632.61GB/s 648.60GB/s 641.65GB/s
155 l2_write_throughput L2 Throughput (Writes) 106.26GB/s 118.80GB/s 112.10GB/s
155 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 sysmem_write_throughput System Memory Write Throughput 2.0780MB/s 2.1293MB/s 2.1078MB/s
155 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 gld_efficiency Global Memory Load Efficiency 25.68% 25.69% 25.69%
155 gst_efficiency Global Memory Store Efficiency 24.32% 24.32% 24.32%
155 tex_cache_transactions Unified cache to SM transactions 388327 389841 389136
155 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
155 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
155 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
155 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
155 flop_count_sp Floating Point Operations(Single Precision) 4752000 4752000 4752000
155 flop_count_sp_add Floating Point Operations(Single Precision Add) 1296000 1296000 1296000
155 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 1296000 1296000 1296000
155 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 864000 864000 864000
155 flop_count_sp_special Floating Point Operations(Single Precision Special) 0 0 0
155 inst_executed Instructions Executed 1608000 2409000 1974909
155 inst_issued Instructions Issued 1627903 1630802 1629543
155 dram_utilization Device Memory Utilization High (8) High (8) High (8)
155 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
155 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 0.11% 3.32% 0.93%
155 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 1.51% 1.83% 1.64%
155 stall_memory_dependency Issue Stall Reasons (Data Request) 90.64% 94.33% 92.76%
155 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
155 stall_sync Issue Stall Reasons (Synchronization) 1.20% 1.51% 1.34%
155 stall_other Issue Stall Reasons (Other) 0.20% 0.28% 0.24%
155 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.82% 1.59% 1.11%
155 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 0.22% 0.35% 0.28%
155 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
155 inst_fp_32 FP Instructions(Single) 3456000 3456000 3456000
155 inst_fp_64 FP Instructions(Double) 0 0 0
155 inst_integer Integer Instructions 19521000 19521000 19521000
155 inst_bit_convert Bit-Convert Instructions 0 0 0
155 inst_control Control-Flow Instructions 162000 162000 162000
155 inst_compute_ld_st Load/Store Instructions 4239000 4239000 4239000
155 inst_misc Misc Instructions 1323000 1323000 1323000
155 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
155 issue_slots Issue Slots 1627903 1630802 1629543
155 cf_issued Issued Control-Flow Instructions 16500 16500 16500
155 cf_executed Executed Control-Flow Instructions 16500 16500 16500
155 ldst_issued Issued Load/Store Instructions 247500 247500 247500
155 ldst_executed Executed Load/Store Instructions 247500 247500 247500
155 atomic_transactions Atomic Transactions 0 0 0
155 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
155 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
155 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
155 l2_tex_read_transactions L2 Transactions (Texture Reads) 1553162 1560156 1557847
155 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.70% 1.73% 1.13%
155 stall_not_selected Issue Stall Reasons (Not Selected) 0.50% 0.65% 0.56%
155 l2_tex_write_transactions L2 Transactions (Texture Writes) 222000 222000 222000
155 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
155 nvlink_total_data_received NVLink Total Data Received 864 864 864
155 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
155 nvlink_user_data_received NVLink User Data Received 0 0 0
155 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
155 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
155 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
155 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
155 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
155 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
155 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
155 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
155 nvlink_transmit_throughput NVLink Transmit Throughput 14.962MB/s 15.331MB/s 15.177MB/s
155 nvlink_receive_throughput NVLink Receive Throughput 11.221MB/s 11.498MB/s 11.382MB/s
155 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
155 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
155 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
155 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
155 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
155 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
155 inst_fp_16 HP Instructions(Half) 0 0 0
155 ipc Executed IPC 0.189691 0.291671 0.237549
155 issued_ipc Issued IPC 0.192304 0.232761 0.209994
155 issue_slot_utilization Issue Slot Utilization 4.81% 5.82% 5.25%
155 sm_efficiency Multiprocessor Activity 78.14% 96.19% 92.22%
155 achieved_occupancy Achieved Occupancy 0.286913 0.288411 0.287798
155 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.286789 0.346164 0.313476
155 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
155 l2_utilization L2 Cache Utilization Low (2) Low (2) Low (2)
155 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
155 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
155 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
155 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 special_fu_utilization Special Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
155 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
155 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.36% 0.49% 0.43%
155 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
155 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
155 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
155 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
155 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
155 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_gpu_stage_update__24
52 inst_per_warp Instructions per warp 334.989836 334.989836 334.989836
52 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
52 warp_execution_efficiency Warp Execution Efficiency 100.00% 100.00% 100.00%
52 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 95.32% 95.32% 95.32%
52 inst_replay_overhead Instruction Replay Overhead 0.001949 0.003757 0.002618
52 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
52 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
52 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
52 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
52 gld_transactions_per_request Global Load Transactions Per Request 3.999989 3.999989 3.999989
52 gst_transactions_per_request Global Store Transactions Per Request 3.999989 3.999989 3.999989
52 shared_store_transactions Shared Store Transactions 0 0 0
52 shared_load_transactions Shared Load Transactions 0 0 0
52 local_load_transactions Local Load Transactions 0 0 0
52 local_store_transactions Local Store Transactions 0 0 0
52 gld_transactions Global Load Transactions 2247750 2247750 2247750
52 gst_transactions Global Store Transactions 2247750 2247750 2247750
52 sysmem_read_transactions System Memory Read Transactions 0 0 0
52 sysmem_write_transactions System Memory Write Transactions 5 5 5
52 l2_read_transactions L2 Read Transactions 1560492 1568084 1563760
52 l2_write_transactions L2 Write Transactions 2247776 2271656 2262412
52 dram_read_transactions Device Memory Read Transactions 1498511 1498923 1498587
52 dram_write_transactions Device Memory Write Transactions 2230304 2254581 2242227
52 global_hit_rate Global Hit Rate in unified l1/tex 15.16% 15.28% 15.23%
52 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
52 gld_requested_throughput Requested Global Load Throughput 426.48GB/s 429.97GB/s 428.68GB/s
52 gst_requested_throughput Requested Global Store Throughput 426.48GB/s 429.97GB/s 428.68GB/s
52 gld_throughput Global Load Throughput 426.48GB/s 429.97GB/s 428.68GB/s
52 gst_throughput Global Store Throughput 426.48GB/s 429.97GB/s 428.68GB/s
52 local_memory_overhead Local Memory Overhead 0.00% 0.13% 0.02%
52 tex_cache_hit_rate Unified Cache Hit Rate 15.13% 15.29% 15.22%
52 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 3.93% 4.39% 4.14%
52 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 0.00% 0.00% 0.00%
52 dram_read_throughput Device Memory Read Throughput 284.34GB/s 286.67GB/s 285.80GB/s
52 dram_write_throughput Device Memory Write Throughput 424.70GB/s 430.73GB/s 427.63GB/s
52 tex_cache_throughput Unified cache to SM throughput 568.64GB/s 573.31GB/s 571.58GB/s
52 l2_tex_read_throughput L2 Throughput (Texture Reads) 296.36GB/s 299.16GB/s 298.15GB/s
52 l2_tex_write_throughput L2 Throughput (Texture Writes) 426.48GB/s 429.97GB/s 428.68GB/s
52 l2_read_throughput L2 Throughput (Reads) 296.43GB/s 299.39GB/s 298.23GB/s
52 l2_write_throughput L2 Throughput (Writes) 427.50GB/s 434.40GB/s 431.48GB/s
52 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
52 sysmem_write_throughput System Memory Write Throughput 994.75KB/s 0.9794MB/s 999.89KB/s
52 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
52 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
52 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
52 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
52 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
52 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
52 tex_cache_transactions Unified cache to SM transactions 749259 749259 749259
52 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
52 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
52 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
52 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
52 flop_count_sp Floating Point Operations(Single Precision) 131868000 131868000 131868000
52 flop_count_sp_add Floating Point Operations(Single Precision Add) 23976000 23976000 23976000
52 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 47952000 47952000 47952000
52 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 11988000 11988000 11988000
52 flop_count_sp_special Floating Point Operations(Single Precision Special) 5994000 5994000 5994000
52 inst_executed Instructions Executed 11238850 62750296 35013363
52 inst_issued Instructions Issued 11260851 11278424 11267820
52 dram_utilization Device Memory Utilization High (9) High (9) High (9)
52 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
52 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 1.17% 1.48% 1.29%
52 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 3.56% 4.28% 3.80%
52 stall_memory_dependency Issue Stall Reasons (Data Request) 89.26% 91.28% 90.52%
52 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
52 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
52 stall_other Issue Stall Reasons (Other) 0.11% 0.14% 0.12%
52 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.21% 0.63% 0.37%
52 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 0.85% 1.15% 0.95%
52 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
52 inst_fp_32 FP Instructions(Single) 95904000 95904000 95904000
52 inst_fp_64 FP Instructions(Double) 0 0 0
52 inst_integer Integer Instructions 167833440 167833440 167833440
52 inst_bit_convert Bit-Convert Instructions 0 0 0
52 inst_control Control-Flow Instructions 17982240 17982240 17982240
52 inst_compute_ld_st Load/Store Instructions 35964000 35964000 35964000
52 inst_misc Misc Instructions 17982480 17982480 17982480
52 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
52 issue_slots Issue Slots 11260851 11278424 11267820
52 cf_issued Issued Control-Flow Instructions 1123892 1123892 1123892
52 cf_executed Executed Control-Flow Instructions 1123892 1123892 1123892
52 ldst_issued Issued Load/Store Instructions 1685831 1685831 1685831
52 ldst_executed Executed Load/Store Instructions 1685831 1685831 1685831
52 atomic_transactions Atomic Transactions 0 0 0
52 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
52 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
52 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
52 l2_tex_read_transactions L2 Transactions (Texture Reads) 1560476 1567400 1563315
52 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 1.66% 2.16% 1.86%
52 stall_not_selected Issue Stall Reasons (Not Selected) 0.99% 1.27% 1.09%
52 l2_tex_write_transactions L2 Transactions (Texture Writes) 2247750 2247750 2247750
52 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
52 nvlink_total_data_received NVLink Total Data Received 864 864 864
52 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
52 nvlink_user_data_received NVLink User Data Received 0 0 0
52 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
52 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
52 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
52 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
52 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
52 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
52 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
52 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
52 nvlink_transmit_throughput NVLink Transmit Throughput 6.9944MB/s 7.0517MB/s 7.0305MB/s
52 nvlink_receive_throughput NVLink Receive Throughput 5.2458MB/s 5.2888MB/s 5.2729MB/s
52 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
52 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
52 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
52 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
52 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
52 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
52 inst_fp_16 HP Instructions(Half) 0 0 0
52 ipc Executed IPC 0.431233 0.709484 0.549376
52 issued_ipc Issued IPC 0.604061 0.711792 0.639410
52 issue_slot_utilization Issue Slot Utilization 15.10% 17.79% 15.99%
52 sm_efficiency Multiprocessor Activity 95.77% 99.60% 96.82%
52 achieved_occupancy Achieved Occupancy 0.909721 0.917467 0.914479
52 eligible_warps_per_cycle Eligible Warps Per Active Cycle 1.194290 1.440419 1.282381
52 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
52 l2_utilization L2 Cache Utilization Low (1) Low (2) Low (1)
52 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
52 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (2) Low (1)
52 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
52 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
52 special_fu_utilization Special Function Unit Utilization Low (1) Low (1) Low (1)
52 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
52 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (2) Low (2) Low (2)
52 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
52 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
52 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.74% 6.26% 4.85%
52 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
52 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
52 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
52 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
52 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
52 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_gpu_stage_update__28
52 inst_per_warp Instructions per warp 534.982362 534.982362 534.982362
52 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
52 warp_execution_efficiency Warp Execution Efficiency 100.00% 100.00% 100.00%
52 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 95.23% 95.23% 95.23%
52 inst_replay_overhead Instruction Replay Overhead 0.002053 0.002988 0.002480
52 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
52 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
52 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
52 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
52 gld_transactions_per_request Global Load Transactions Per Request 3.999989 3.999989 3.999989
52 gst_transactions_per_request Global Store Transactions Per Request 3.999989 3.999989 3.999989
52 shared_store_transactions Shared Store Transactions 0 0 0
52 shared_load_transactions Shared Load Transactions 0 0 0
52 local_load_transactions Local Load Transactions 0 0 0
52 local_store_transactions Local Store Transactions 0 0 0
52 gld_transactions Global Load Transactions 3746250 3746250 3746250
52 gst_transactions Global Store Transactions 2247750 2247750 2247750
52 sysmem_read_transactions System Memory Read Transactions 0 0 0
52 sysmem_write_transactions System Memory Write Transactions 5 5 5
52 l2_read_transactions L2 Read Transactions 3052890 3058896 3055542
52 l2_write_transactions L2 Write Transactions 2247775 2271552 2256582
52 dram_read_transactions Device Memory Read Transactions 2997078 2997459 2997199
52 dram_write_transactions Device Memory Write Transactions 2235146 2257506 2247535
52 global_hit_rate Global Hit Rate in unified l1/tex 11.48% 11.58% 11.53%
52 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
52 gld_requested_throughput Requested Global Load Throughput 525.20GB/s 528.92GB/s 527.47GB/s
52 gst_requested_throughput Requested Global Store Throughput 315.12GB/s 317.35GB/s 316.48GB/s
52 gld_throughput Global Load Throughput 525.20GB/s 528.92GB/s 527.47GB/s
52 gst_throughput Global Store Throughput 315.12GB/s 317.35GB/s 316.48GB/s
52 local_memory_overhead Local Memory Overhead 0.00% 0.08% 0.01%
52 tex_cache_hit_rate Unified Cache Hit Rate 11.46% 11.58% 11.53%
52 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 1.80% 2.01% 1.89%
52 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 0.00% 0.00% 0.00%
52 dram_read_throughput Device Memory Read Throughput 420.18GB/s 423.16GB/s 422.00GB/s
52 dram_write_throughput Device Memory Write Throughput 313.41GB/s 318.54GB/s 316.45GB/s
52 tex_cache_throughput Unified cache to SM throughput 630.25GB/s 634.71GB/s 632.96GB/s
52 l2_tex_read_throughput L2 Throughput (Texture Reads) 428.16GB/s 431.26GB/s 430.13GB/s
52 l2_tex_write_throughput L2 Throughput (Texture Writes) 315.12GB/s 317.35GB/s 316.48GB/s
52 l2_read_throughput L2 Throughput (Reads) 428.08GB/s 431.29GB/s 430.22GB/s
52 l2_write_throughput L2 Throughput (Writes) 315.46GB/s 320.62GB/s 317.72GB/s
52 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
52 sysmem_write_throughput System Memory Write Throughput 735.02KB/s 740.23KB/s 738.18KB/s
52 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
52 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
52 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
52 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
52 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
52 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
52 tex_cache_transactions Unified cache to SM transactions 1123885 1123885 1123885
52 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
52 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
52 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
52 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
52 flop_count_sp Floating Point Operations(Single Precision) 263736000 263736000 263736000
52 flop_count_sp_add Floating Point Operations(Single Precision Add) 35964000 35964000 35964000
52 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 101898000 101898000 101898000
52 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 23976000 23976000 23976000
52 flop_count_sp_special Floating Point Operations(Single Precision Special) 11988000 11988000 11988000
52 inst_executed Instructions Executed 17607492 100212896 68441586
52 inst_issued Instructions Issued 17643869 17663490 17652174
52 dram_utilization Device Memory Utilization High (9) High (9) High (9)
52 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
52 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 1.36% 1.70% 1.48%
52 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 3.81% 4.54% 4.08%
52 stall_memory_dependency Issue Stall Reasons (Data Request) 89.03% 90.91% 90.21%
52 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
52 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
52 stall_other Issue Stall Reasons (Other) 0.10% 0.13% 0.12%
52 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.18% 0.47% 0.28%
52 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 1.34% 1.66% 1.47%
52 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
52 inst_fp_32 FP Instructions(Single) 185814000 185814000 185814000
52 inst_fp_64 FP Instructions(Double) 0 0 0
52 inst_integer Integer Instructions 239761440 239761440 239761440
52 inst_bit_convert Bit-Convert Instructions 0 0 0
52 inst_control Control-Flow Instructions 29970240 29970240 29970240
52 inst_compute_ld_st Load/Store Instructions 47952000 47952000 47952000
52 inst_misc Misc Instructions 23976480 23976480 23976480
52 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
52 issue_slots Issue Slots 17643869 17663490 17652174
52 cf_issued Issued Control-Flow Instructions 1685831 1685831 1685831
52 cf_executed Executed Control-Flow Instructions 1685831 1685831 1685831
52 ldst_issued Issued Load/Store Instructions 2247770 2247770 2247770
52 ldst_executed Executed Load/Store Instructions 2247770 2247770 2247770
52 atomic_transactions Atomic Transactions 0 0 0
52 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
52 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
52 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
52 l2_tex_read_transactions L2 Transactions (Texture Reads) 3052248 3059492 3054944
52 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.56% 0.74% 0.63%
52 stall_not_selected Issue Stall Reasons (Not Selected) 1.59% 1.98% 1.74%
52 l2_tex_write_transactions L2 Transactions (Texture Writes) 2247750 2247750 2247750
52 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
52 nvlink_total_data_received NVLink Total Data Received 864 864 864
52 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
52 nvlink_user_data_received NVLink User Data Received 0 0 0
52 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
52 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
52 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
52 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
52 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
52 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
52 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
52 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
52 nvlink_transmit_throughput NVLink Transmit Throughput 5.1681MB/s 5.2047MB/s 5.1904MB/s
52 nvlink_receive_throughput NVLink Receive Throughput 3.8761MB/s 3.9036MB/s 3.8928MB/s
52 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
52 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
52 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
52 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
52 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
52 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
52 inst_fp_16 HP Instructions(Half) 0 0 0
52 ipc Executed IPC 0.462797 0.814774 0.637099
52 issued_ipc Issued IPC 0.691919 0.815913 0.738492
52 issue_slot_utilization Issue Slot Utilization 17.30% 20.40% 18.46%
52 sm_efficiency Multiprocessor Activity 96.88% 99.61% 97.62%
52 achieved_occupancy Achieved Occupancy 0.910604 0.920766 0.915729
52 eligible_warps_per_cycle Eligible Warps Per Active Cycle 1.646381 1.977418 1.769254
52 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
52 l2_utilization L2 Cache Utilization Low (1) Low (2) Low (1)
52 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
52 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (2) Low (1)
52 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
52 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
52 special_fu_utilization Special Function Unit Utilization Low (1) Low (1) Low (1)
52 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
52 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (2) Low (2) Low (2)
52 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
52 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
52 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.90% 9.25% 6.97%
52 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
52 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
52 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
52 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
52 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
52 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_gpu_volume_gradients_of_laplacians__20
155 inst_per_warp Instructions per warp 3.7010e+03 3.7010e+03 3.7010e+03
155 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
155 warp_execution_efficiency Warp Execution Efficiency 96.43% 96.43% 96.43%
155 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 91.83% 91.83% 91.83%
155 inst_replay_overhead Instruction Replay Overhead 0.006148 0.014152 0.009441
155 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 1.567256 1.604154 1.582187
155 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 1.122444 1.124571 1.123434
155 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
155 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
155 gld_transactions_per_request Global Load Transactions Per Request 3.449172 3.488410 3.466619
155 gst_transactions_per_request Global Store Transactions Per Request 3.857143 3.857143 3.857143
155 shared_store_transactions Shared Store Transactions 35357 35424 35388
155 shared_load_transactions Shared Load Transactions 427861 437934 431937
155 local_load_transactions Local Load Transactions 0 0 0
155 local_store_transactions Local Store Transactions 0 0 0
155 gld_transactions Global Load Transactions 235406 238084 236596
155 gst_transactions Global Store Transactions 243000 243000 243000
155 sysmem_read_transactions System Memory Read Transactions 0 0 0
155 sysmem_write_transactions System Memory Write Transactions 5 5 5
155 l2_read_transactions L2 Read Transactions 223475 225859 224461
155 l2_write_transactions L2 Write Transactions 243021 266923 252933
155 dram_read_transactions Device Memory Read Transactions 225020 225444 225190
155 dram_write_transactions Device Memory Write Transactions 238936 264018 251926
155 global_hit_rate Global Hit Rate in unified l1/tex 25.50% 25.71% 25.61%
155 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
155 gld_requested_throughput Requested Global Load Throughput 255.04GB/s 282.98GB/s 272.55GB/s
155 gst_requested_throughput Requested Global Store Throughput 251.16GB/s 278.68GB/s 268.41GB/s
155 gld_throughput Global Load Throughput 245.31GB/s 272.05GB/s 261.34GB/s
155 gst_throughput Global Store Throughput 251.16GB/s 278.68GB/s 268.41GB/s
155 local_memory_overhead Local Memory Overhead 23.22% 23.80% 23.46%
155 tex_cache_hit_rate Unified Cache Hit Rate 5.94% 5.95% 5.95%
155 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 13.65% 13.69% 13.67%
155 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 38.81% 38.97% 38.90%
155 dram_read_throughput Device Memory Read Throughput 232.98GB/s 258.51GB/s 248.74GB/s
155 dram_write_throughput Device Memory Write Throughput 259.15GB/s 300.07GB/s 278.27GB/s
155 tex_cache_throughput Unified cache to SM throughput 1487.0GB/s 1649.7GB/s 1589.1GB/s
155 l2_tex_read_throughput L2 Throughput (Texture Reads) 230.79GB/s 256.04GB/s 246.62GB/s
155 l2_tex_write_throughput L2 Throughput (Texture Writes) 251.16GB/s 278.68GB/s 268.41GB/s
155 l2_read_throughput L2 Throughput (Reads) 233.41GB/s 258.12GB/s 247.93GB/s
155 l2_write_throughput L2 Throughput (Writes) 251.19GB/s 304.23GB/s 279.38GB/s
155 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 sysmem_write_throughput System Memory Write Throughput 5.2919MB/s 5.8717MB/s 5.6554MB/s
155 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 shared_load_throughput Shared Memory Load Throughput 1776.6GB/s 1983.2GB/s 1908.4GB/s
155 shared_store_throughput Shared Memory Store Throughput 146.29GB/s 162.44GB/s 156.36GB/s
155 gld_efficiency Global Memory Load Efficiency 103.64% 104.82% 104.29%
155 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
155 tex_cache_transactions Unified cache to SM transactions 359613 360008 359664
155 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
155 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
155 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
155 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
155 flop_count_sp Floating Point Operations(Single Precision) 57672000 57672000 57672000
155 flop_count_sp_add Floating Point Operations(Single Precision Add) 2106000 2106000 2106000
155 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 24462000 24462000 24462000
155 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 6642000 6642000 6642000
155 flop_count_sp_special Floating Point Operations(Single Precision Special) 2430000 2430000 2430000
155 inst_executed Instructions Executed 4168500 19430250 12931698
155 inst_issued Instructions Issued 4194502 4227494 4206364
155 dram_utilization Device Memory Utilization Mid (6) High (7) Mid (6)
155 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
155 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 5.89% 24.55% 14.49%
155 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 10.80% 16.27% 13.45%
155 stall_memory_dependency Issue Stall Reasons (Data Request) 29.17% 47.96% 38.68%
155 stall_texture Issue Stall Reasons (Texture) 0.00% 0.02% 0.00%
155 stall_sync Issue Stall Reasons (Synchronization) 3.39% 6.66% 5.20%
155 stall_other Issue Stall Reasons (Other) 1.41% 2.38% 1.88%
155 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 1.47% 9.14% 4.06%
155 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 6.68% 13.11% 9.97%
155 shared_efficiency Shared Memory Efficiency 21.67% 22.14% 21.95%
155 inst_fp_32 FP Instructions(Single) 37746000 37746000 37746000
155 inst_fp_64 FP Instructions(Double) 0 0 0
155 inst_integer Integer Instructions 46332000 46332000 46332000
155 inst_bit_convert Bit-Convert Instructions 648000 648000 648000
155 inst_control Control-Flow Instructions 10530000 10530000 10530000
155 inst_compute_ld_st Load/Store Instructions 13446000 13446000 13446000
155 inst_misc Misc Instructions 12474000 12474000 12474000
155 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
155 issue_slots Issue Slots 4194502 4227494 4206364
155 cf_issued Issued Control-Flow Instructions 435750 435750 435750
155 cf_executed Executed Control-Flow Instructions 435750 435750 435750
155 ldst_issued Issued Load/Store Instructions 546000 546000 546000
155 ldst_executed Executed Load/Store Instructions 546000 546000 546000
155 atomic_transactions Atomic Transactions 0 0 0
155 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
155 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
155 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
155 l2_tex_read_transactions L2 Transactions (Texture Reads) 223240 223311 223274
155 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.53% 5.60% 2.44%
155 stall_not_selected Issue Stall Reasons (Not Selected) 7.63% 12.79% 9.81%
155 l2_tex_write_transactions L2 Transactions (Texture Writes) 243000 243000 243000
155 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
155 nvlink_total_data_received NVLink Total Data Received 864 864 864
155 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
155 nvlink_user_data_received NVLink User Data Received 0 0 0
155 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
155 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
155 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
155 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
155 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
155 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
155 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
155 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
155 nvlink_transmit_throughput NVLink Transmit Throughput 38.102MB/s 42.276MB/s 40.719MB/s
155 nvlink_receive_throughput NVLink Receive Throughput 28.576MB/s 31.707MB/s 30.539MB/s
155 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
155 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
155 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
155 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
155 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
155 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
155 inst_fp_16 HP Instructions(Half) 0 0 0
155 ipc Executed IPC 0.502848 1.759222 1.104721
155 issued_ipc Issued IPC 1.313268 1.781347 1.529956
155 issue_slot_utilization Issue Slot Utilization 32.83% 44.53% 38.25%
155 sm_efficiency Multiprocessor Activity 59.15% 88.42% 80.87%
155 achieved_occupancy Achieved Occupancy 0.500672 0.524115 0.514806
155 eligible_warps_per_cycle Eligible Warps Per Active Cycle 3.789821 5.122976 4.493437
155 shared_utilization Shared Memory Utilization Low (1) Low (1) Low (1)
155 l2_utilization L2 Cache Utilization Low (1) Low (2) Low (1)
155 tex_utilization Unified Cache Utilization Low (1) Low (2) Low (1)
155 ldst_fu_utilization Load/Store Function Unit Utilization Low (2) Low (3) Low (2)
155 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
155 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 special_fu_utilization Special Function Unit Utilization Low (2) Low (2) Low (2)
155 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (3) Mid (5) Low (3)
155 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
155 flop_sp_efficiency FLOP Efficiency(Peak Single) 1.01% 15.26% 12.03%
155 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
155 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
155 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
155 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
155 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
155 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_reduce_kernel_15
17 inst_per_warp Instructions per warp 1.3257e+05 1.3257e+05 1.3257e+05
17 branch_efficiency Branch Efficiency 99.99% 99.99% 99.99%
17 warp_execution_efficiency Warp Execution Efficiency 99.99% 99.99% 99.99%
17 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 94.74% 94.74% 94.74%
17 inst_replay_overhead Instruction Replay Overhead 0.000214 0.000295 0.000243
17 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 1.000000 1.000500 1.000029
17 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 1.000000 1.002500 1.001213
17 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
17 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
17 gld_transactions_per_request Global Load Transactions Per Request 3.876269 3.891769 3.885582
17 gst_transactions_per_request Global Store Transactions Per Request 1.000000 1.000000 1.000000
17 shared_store_transactions Shared Store Transactions 1600 1604 1601
17 shared_load_transactions Shared Load Transactions 2000 2001 2000
17 local_load_transactions Local Load Transactions 0 0 0
17 local_store_transactions Local Store Transactions 0 0 0
17 gld_transactions Global Load Transactions 1452151 1457958 1455639
17 gst_transactions Global Store Transactions 80 80 80
17 sysmem_read_transactions System Memory Read Transactions 0 0 0
17 sysmem_write_transactions System Memory Write Transactions 5 5 5
17 l2_read_transactions L2 Read Transactions 1384705 1385594 1384921
17 l2_write_transactions L2 Write Transactions 110 23916 15115
17 dram_read_transactions Device Memory Read Transactions 769507 769673 769553
17 dram_write_transactions Device Memory Write Transactions 85800 109458 96146
17 global_hit_rate Global Hit Rate in unified l1/tex 7.81% 7.84% 7.82%
17 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
17 gld_requested_throughput Requested Global Load Throughput 196.55GB/s 221.44GB/s 209.56GB/s
17 gst_requested_throughput Requested Global Store Throughput 1.3431MB/s 1.5132MB/s 1.4320MB/s
17 gld_throughput Global Load Throughput 191.09GB/s 215.45GB/s 203.57GB/s
17 gst_throughput Global Store Throughput 10.745MB/s 12.106MB/s 11.456MB/s
17 local_memory_overhead Local Memory Overhead 2.90% 3.38% 3.08%
17 tex_cache_hit_rate Unified Cache Hit Rate 7.60% 7.60% 7.60%
17 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 44.73% 44.76% 44.74%
17 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 96.25% 96.25% 96.25%
17 dram_read_throughput Device Memory Read Throughput 100.93GB/s 113.73GB/s 107.62GB/s
17 dram_write_throughput Device Memory Write Throughput 11.774GB/s 16.085GB/s 13.446GB/s
17 tex_cache_throughput Unified cache to SM throughput 213.09GB/s 240.46GB/s 227.21GB/s
17 l2_tex_read_throughput L2 Throughput (Texture Reads) 181.61GB/s 204.61GB/s 193.63GB/s
17 l2_tex_write_throughput L2 Throughput (Texture Writes) 10.745MB/s 12.106MB/s 11.456MB/s
17 l2_read_throughput L2 Throughput (Reads) 181.62GB/s 204.75GB/s 193.68GB/s
17 l2_write_throughput L2 Throughput (Writes) 15.358MB/s 3.4761GB/s 2.1139GB/s
17 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
17 sysmem_write_throughput System Memory Write Throughput 687.67KB/s 774.75KB/s 733.20KB/s
17 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
17 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
17 shared_load_throughput Shared Memory Load Throughput 1.0493GB/s 1.1822GB/s 1.1188GB/s
17 shared_store_throughput Shared Memory Store Throughput 861.20MB/s 968.56MB/s 917.61MB/s
17 gld_efficiency Global Memory Load Efficiency 102.78% 103.19% 102.94%
17 gst_efficiency Global Memory Store Efficiency 12.50% 12.50% 12.50%
17 tex_cache_transactions Unified cache to SM transactions 405173 407770 406169
17 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
17 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
17 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
17 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
17 flop_count_sp Floating Point Operations(Single Precision) 18002400 18002400 18002400
17 flop_count_sp_add Floating Point Operations(Single Precision Add) 20400 20400 20400
17 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 5994000 5994000 5994000
17 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 5994000 5994000 5994000
17 flop_count_sp_special Floating Point Operations(Single Precision Special) 11988000 11988000 11988000
17 inst_executed Instructions Executed 24302018 84844052 67037571
17 inst_issued Instructions Issued 24307218 24309194 24307996
17 dram_utilization Device Memory Utilization Low (2) Low (2) Low (2)
17 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
17 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 3.60% 4.50% 4.19%
17 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 39.16% 41.32% 40.17%
17 stall_memory_dependency Issue Stall Reasons (Data Request) 49.14% 51.81% 50.44%
17 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
17 stall_sync Issue Stall Reasons (Synchronization) 0.55% 0.75% 0.64%
17 stall_other Issue Stall Reasons (Other) 1.53% 1.65% 1.60%
17 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.28% 0.35% 0.31%
17 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 1.15% 1.27% 1.21%
17 shared_efficiency Shared Memory Efficiency 70.89% 70.97% 70.93%
17 inst_fp_32 FP Instructions(Single) 23996400 23996400 23996400
17 inst_fp_64 FP Instructions(Double) 0 0 0
17 inst_integer Integer Instructions 530431626 530431626 530431626
17 inst_bit_convert Bit-Convert Instructions 23976000 23976000 23976000
17 inst_control Control-Flow Instructions 42162800 42162800 42162800
17 inst_compute_ld_st Load/Store Instructions 12069840 12069840 12069840
17 inst_misc Misc Instructions 72624320 72624320 72624320
17 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
17 issue_slots Issue Slots 24307218 24309194 24307996
17 cf_issued Issued Control-Flow Instructions 1883451 1883451 1883451
17 cf_executed Executed Control-Flow Instructions 1883451 1883451 1883451
17 ldst_issued Issued Load/Store Instructions 585139 585139 585139
17 ldst_executed Executed Load/Store Instructions 585139 585139 585139
17 atomic_transactions Atomic Transactions 0 0 0
17 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
17 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
17 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
17 l2_tex_read_transactions L2 Transactions (Texture Reads) 1384603 1384636 1384619
17 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.00% 0.00% 0.00%
17 stall_not_selected Issue Stall Reasons (Not Selected) 1.34% 1.55% 1.44%
17 l2_tex_write_transactions L2 Transactions (Texture Writes) 80 80 80
17 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
17 nvlink_total_data_received NVLink Total Data Received 864 864 864
17 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
17 nvlink_user_data_received NVLink User Data Received 0 0 0
17 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
17 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
17 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
17 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
17 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
17 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
17 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
17 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
17 nvlink_transmit_throughput NVLink Transmit Throughput 4.8352MB/s 5.4475MB/s 5.1553MB/s
17 nvlink_receive_throughput NVLink Receive Throughput 3.6264MB/s 4.0856MB/s 3.8665MB/s
17 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
17 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
17 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
17 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
17 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
17 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
17 inst_fp_16 HP Instructions(Half) 0 0 0
17 ipc Executed IPC 0.542897 1.053843 0.904493
17 issued_ipc Issued IPC 1.009397 1.054096 1.030456
17 issue_slot_utilization Issue Slot Utilization 25.23% 26.35% 25.76%
17 sm_efficiency Multiprocessor Activity 79.60% 96.70% 94.29%
17 achieved_occupancy Achieved Occupancy 0.124956 0.124967 0.124962
17 eligible_warps_per_cycle Eligible Warps Per Active Cycle 1.104713 1.162003 1.132825
17 shared_utilization Shared Memory Utilization Low (1) Low (1) Low (1)
17 l2_utilization L2 Cache Utilization Low (1) Low (1) Low (1)
17 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
17 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
17 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
17 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
17 special_fu_utilization Special Function Unit Utilization Low (1) Low (1) Low (1)
17 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
17 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (3) Low (3) Low (3)
17 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
17 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
17 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.07% 0.59% 0.43%
17 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
17 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
17 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
17 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
17 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
17 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_gpu_kernel_nodal_update_auxiliary_state__6
155 inst_per_warp Instructions per warp 1.4570e+03 1.4570e+03 1.4570e+03
155 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
155 warp_execution_efficiency Warp Execution Efficiency 96.43% 96.43% 96.43%
155 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 92.34% 92.34% 92.34%
155 inst_replay_overhead Instruction Replay Overhead 0.001757 0.002737 0.001949
155 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
155 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
155 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
155 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
155 gld_transactions_per_request Global Load Transactions Per Request 3.484399 3.503443 3.492991
155 gst_transactions_per_request Global Store Transactions Per Request 3.857143 3.857143 3.857143
155 shared_store_transactions Shared Store Transactions 0 0 0
155 shared_load_transactions Shared Load Transactions 0 0 0
155 local_load_transactions Local Load Transactions 0 0 0
155 local_store_transactions Local Store Transactions 0 0 0
155 gld_transactions Global Load Transactions 951241 956440 953586
155 gst_transactions Global Store Transactions 972000 972000 972000
155 sysmem_read_transactions System Memory Read Transactions 0 0 0
155 sysmem_write_transactions System Memory Write Transactions 5 5 5
155 l2_read_transactions L2 Read Transactions 1038567 1040403 1039379
155 l2_write_transactions L2 Write Transactions 972025 995930 982210
155 dram_read_transactions Device Memory Read Transactions 1037623 1039333 1038860
155 dram_write_transactions Device Memory Write Transactions 969012 993763 982481
155 global_hit_rate Global Hit Rate in unified l1/tex 49.38% 49.52% 49.45%
155 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
155 gld_requested_throughput Requested Global Load Throughput 348.81GB/s 358.76GB/s 354.87GB/s
155 gst_requested_throughput Requested Global Store Throughput 326.69GB/s 336.01GB/s 332.37GB/s
155 gld_throughput Global Load Throughput 320.56GB/s 330.06GB/s 326.07GB/s
155 gst_throughput Global Store Throughput 326.69GB/s 336.01GB/s 332.37GB/s
155 local_memory_overhead Local Memory Overhead 51.48% 51.70% 51.58%
155 tex_cache_hit_rate Unified Cache Hit Rate 0.17% 0.17% 0.17%
155 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 13.09% 13.17% 13.11%
155 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 94.67% 96.05% 95.38%
155 dram_read_throughput Device Memory Read Throughput 349.10GB/s 359.25GB/s 355.23GB/s
155 dram_write_throughput Device Memory Write Throughput 328.44GB/s 343.32GB/s 335.95GB/s
155 tex_cache_throughput Unified cache to SM throughput 391.18GB/s 402.34GB/s 397.97GB/s
155 l2_tex_read_throughput L2 Throughput (Texture Reads) 349.00GB/s 358.96GB/s 355.06GB/s
155 l2_tex_write_throughput L2 Throughput (Texture Writes) 326.69GB/s 336.01GB/s 332.37GB/s
155 l2_read_throughput L2 Throughput (Reads) 349.68GB/s 359.57GB/s 355.41GB/s
155 l2_write_throughput L2 Throughput (Writes) 328.01GB/s 344.20GB/s 335.86GB/s
155 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 sysmem_write_throughput System Memory Write Throughput 1.7209MB/s 1.7699MB/s 1.7507MB/s
155 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 gld_efficiency Global Memory Load Efficiency 108.51% 109.10% 108.83%
155 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
155 tex_cache_transactions Unified cache to SM transactions 290966 290966 290966
155 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
155 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
155 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
155 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
155 flop_count_sp Floating Point Operations(Single Precision) 20088000 20088000 20088000
155 flop_count_sp_add Floating Point Operations(Single Precision Add) 4131000 4131000 4131000
155 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 6966000 6966000 6966000
155 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 2025000 2025000 2025000
155 flop_count_sp_special Floating Point Operations(Single Precision Special) 1134000 1134000 1134000
155 inst_executed Instructions Executed 2488500 7649250 5218703
155 inst_issued Instructions Issued 2492872 2495312 2493368
155 dram_utilization Device Memory Utilization High (9) High (9) High (9)
155 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
155 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 2.24% 11.92% 6.24%
155 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 8.04% 10.73% 9.27%
155 stall_memory_dependency Issue Stall Reasons (Data Request) 41.85% 50.95% 46.67%
155 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
155 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
155 stall_other Issue Stall Reasons (Other) 0.29% 0.39% 0.34%
155 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.94% 1.49% 1.17%
155 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 0.48% 0.65% 0.55%
155 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
155 inst_fp_32 FP Instructions(Single) 17334000 17334000 17334000
155 inst_fp_64 FP Instructions(Double) 0 0 0
155 inst_integer Integer Instructions 32400000 32400000 32400000
155 inst_bit_convert Bit-Convert Instructions 324000 324000 324000
155 inst_control Control-Flow Instructions 4050000 4050000 4050000
155 inst_compute_ld_st Load/Store Instructions 16200000 16200000 16200000
155 inst_misc Misc Instructions 3402000 3402000 3402000
155 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
155 issue_slots Issue Slots 2492872 2495312 2493368
155 cf_issued Issued Control-Flow Instructions 173250 173250 173250
155 cf_executed Executed Control-Flow Instructions 173250 173250 173250
155 ldst_issued Issued Load/Store Instructions 556500 556500 556500
155 ldst_executed Executed Load/Store Instructions 556500 556500 556500
155 atomic_transactions Atomic Transactions 0 0 0
155 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
155 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
155 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
155 l2_tex_read_transactions L2 Transactions (Texture Reads) 1038375 1038375 1038375
155 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 31.65% 38.22% 34.79%
155 stall_not_selected Issue Stall Reasons (Not Selected) 0.81% 1.15% 0.98%
155 l2_tex_write_transactions L2 Transactions (Texture Writes) 972000 972000 972000
155 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
155 nvlink_total_data_received NVLink Total Data Received 864 864 864
155 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
155 nvlink_user_data_received NVLink User Data Received 0 0 0
155 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
155 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
155 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
155 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
155 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
155 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
155 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
155 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
155 nvlink_transmit_throughput NVLink Transmit Throughput 12.390MB/s 12.744MB/s 12.605MB/s
155 nvlink_receive_throughput NVLink Receive Throughput 9.2926MB/s 9.5576MB/s 9.4539MB/s
155 nvlink_total_response_data_received NVLink Total Response Data Received 288 1056 298
155 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
155 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
155 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
155 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
155 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
155 inst_fp_16 HP Instructions(Half) 0 0 0
155 ipc Executed IPC 0.249160 0.543066 0.373083
155 issued_ipc Issued IPC 0.248012 0.303364 0.272104
155 issue_slot_utilization Issue Slot Utilization 6.20% 7.58% 6.80%
155 sm_efficiency Multiprocessor Activity 81.64% 95.34% 90.83%
155 achieved_occupancy Achieved Occupancy 0.105718 0.106256 0.106001
155 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.302730 0.373152 0.333547
155 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
155 l2_utilization L2 Cache Utilization Low (1) Low (2) Low (1)
155 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
155 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
155 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
155 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 special_fu_utilization Special Function Unit Utilization Low (1) Low (1) Low (1)
155 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
155 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
155 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.91% 1.71% 1.48%
155 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
155 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
155 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
155 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
155 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
155 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_gpu_volume_gradients__16
155 inst_per_warp Instructions per warp 1.7590e+04 1.7590e+04 1.7590e+04
155 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
155 warp_execution_efficiency Warp Execution Efficiency 56.25% 56.25% 56.25%
155 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 55.76% 55.76% 55.76%
155 inst_replay_overhead Instruction Replay Overhead 0.000718 0.000851 0.000790
155 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 1.117239 1.125548 1.121521
155 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 1.011719 1.016370 1.013733
155 local_load_transactions_per_request Local Memory Load Transactions Per Request 2.500000 2.500000 2.500000
155 local_store_transactions_per_request Local Memory Store Transactions Per Request 2.500000 2.500000 2.500000
155 gld_transactions_per_request Global Load Transactions Per Request 2.651715 2.655303 2.653849
155 gst_transactions_per_request Global Store Transactions Per Request 2.750000 2.750000 2.750000
155 shared_store_transactions Shared Store Transactions 374842 376565 375588
155 shared_load_transactions Shared Load Transactions 3831012 3859505 3845696
155 local_load_transactions Local Load Transactions 9060000 9060000 9060000
155 local_store_transactions Local Store Transactions 12765000 12765000 12765000
155 gld_transactions Global Load Transactions 5039584 5046404 5043640
155 gst_transactions Global Store Transactions 2920500 2920500 2920500
155 sysmem_read_transactions System Memory Read Transactions 0 0 0
155 sysmem_write_transactions System Memory Write Transactions 5 5 5
155 l2_read_transactions L2 Read Transactions 10820602 10878616 10848083
155 l2_write_transactions L2 Write Transactions 17984664 18025907 18003286
155 dram_read_transactions Device Memory Read Transactions 14345828 14434798 14387443
155 dram_write_transactions Device Memory Write Transactions 12843700 12874897 12858857
155 global_hit_rate Global Hit Rate in unified l1/tex 50.20% 50.65% 50.42%
155 local_hit_rate Local Hit Rate 22.62% 22.88% 22.74%
155 gld_requested_throughput Requested Global Load Throughput 63.124GB/s 64.419GB/s 63.834GB/s
155 gst_requested_throughput Requested Global Store Throughput 35.274GB/s 35.997GB/s 35.671GB/s
155 gld_throughput Global Load Throughput 74.467GB/s 75.992GB/s 75.292GB/s
155 gst_throughput Global Store Throughput 43.112GB/s 43.997GB/s 43.597GB/s
155 local_memory_overhead Local Memory Overhead 85.02% 85.18% 85.11%
155 tex_cache_hit_rate Unified Cache Hit Rate 11.46% 11.58% 11.51%
155 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 7.04% 7.47% 7.26%
155 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 35.23% 35.89% 35.55%
155 dram_read_throughput Device Memory Read Throughput 212.30GB/s 217.15GB/s 214.78GB/s
155 dram_write_throughput Device Memory Write Throughput 189.64GB/s 193.92GB/s 191.96GB/s
155 tex_cache_throughput Unified cache to SM throughput 534.11GB/s 544.96GB/s 540.07GB/s
155 l2_tex_read_throughput L2 Throughput (Texture Reads) 160.04GB/s 163.32GB/s 161.77GB/s
155 l2_tex_write_throughput L2 Throughput (Texture Writes) 231.55GB/s 236.30GB/s 234.15GB/s
155 l2_read_throughput L2 Throughput (Reads) 159.98GB/s 163.77GB/s 161.94GB/s
155 l2_write_throughput L2 Throughput (Writes) 265.97GB/s 271.07GB/s 268.75GB/s
155 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 sysmem_write_throughput System Memory Write Throughput 77.395KB/s 78.982KB/s 78.265KB/s
155 local_load_throughput Local Memory Load Throughput 133.74GB/s 136.49GB/s 135.25GB/s
155 local_store_throughput Local Memory Store Throughput 188.44GB/s 192.30GB/s 190.56GB/s
155 shared_load_throughput Shared Memory Load Throughput 227.25GB/s 231.62GB/s 229.63GB/s
155 shared_store_throughput Shared Memory Store Throughput 22.202GB/s 22.625GB/s 22.427GB/s
155 gld_efficiency Global Memory Load Efficiency 84.74% 84.85% 84.78%
155 gst_efficiency Global Memory Store Efficiency 81.82% 81.82% 81.82%
155 tex_cache_transactions Unified cache to SM transactions 9042362 9046345 9044503
155 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
155 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
155 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
155 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
155 flop_count_sp Floating Point Operations(Single Precision) 254826000 254826000 254826000
155 flop_count_sp_add Floating Point Operations(Single Precision Add) 1134000 1134000 1134000
155 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 123444000 123444000 123444000
155 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 6804000 6804000 6804000
155 flop_count_sp_special Floating Point Operations(Single Precision Special) 513000 513000 513000
155 inst_executed Instructions Executed 22279500 26385000 24504416
155 inst_issued Instructions Issued 22295393 22298431 22297140
155 dram_utilization Device Memory Utilization Mid (5) Mid (5) Mid (5)
155 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
155 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 0.24% 2.17% 0.99%
155 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 2.10% 2.55% 2.28%
155 stall_memory_dependency Issue Stall Reasons (Data Request) 63.37% 66.48% 64.92%
155 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
155 stall_sync Issue Stall Reasons (Synchronization) 0.69% 0.95% 0.81%
155 stall_other Issue Stall Reasons (Other) 0.06% 0.11% 0.09%
155 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.04% 0.08% 0.06%
155 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 1.68% 2.14% 1.90%
155 shared_efficiency Shared Memory Efficiency 19.14% 19.28% 19.21%
155 inst_fp_32 FP Instructions(Single) 132219000 132219000 132219000
155 inst_fp_64 FP Instructions(Double) 0 0 0
155 inst_integer Integer Instructions 21222000 21222000 21222000
155 inst_bit_convert Bit-Convert Instructions 54000 54000 54000
155 inst_control Control-Flow Instructions 1998000 1998000 1998000
155 inst_compute_ld_st Load/Store Instructions 233469000 233469000 233469000
155 inst_misc Misc Instructions 11124000 11124000 11124000
155 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
155 issue_slots Issue Slots 22295393 22298431 22297140
155 cf_issued Issued Control-Flow Instructions 135000 135000 135000
155 cf_executed Executed Control-Flow Instructions 135000 135000 135000
155 ldst_issued Issued Load/Store Instructions 13014000 13014000 13014000
155 ldst_executed Executed Load/Store Instructions 13014000 13014000 13014000
155 atomic_transactions Atomic Transactions 0 0 0
155 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
155 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
155 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
155 l2_tex_read_transactions L2 Transactions (Texture Reads) 10814596 10850276 10836560
155 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 27.44% 30.06% 28.76%
155 stall_not_selected Issue Stall Reasons (Not Selected) 0.19% 0.24% 0.21%
155 l2_tex_write_transactions L2 Transactions (Texture Writes) 15685500 15685500 15685500
155 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
155 nvlink_total_data_received NVLink Total Data Received 864 864 864
155 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
155 nvlink_user_data_received NVLink User Data Received 0 0 0
155 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
155 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
155 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
155 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
155 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
155 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
155 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
155 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
155 nvlink_transmit_throughput NVLink Transmit Throughput 557.24KB/s 568.68KB/s 563.51KB/s
155 nvlink_receive_throughput NVLink Receive Throughput 417.93KB/s 426.51KB/s 422.63KB/s
155 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
155 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
155 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
155 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
155 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
155 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
155 inst_fp_16 HP Instructions(Half) 0 0 0
155 ipc Executed IPC 0.096423 0.132104 0.111972
155 issued_ipc Issued IPC 0.096503 0.116254 0.105112
155 issue_slot_utilization Issue Slot Utilization 2.41% 2.91% 2.63%
155 sm_efficiency Multiprocessor Activity 91.33% 93.08% 92.30%
155 achieved_occupancy Achieved Occupancy 0.121941 0.123391 0.122706
155 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.112371 0.134675 0.121493
155 shared_utilization Shared Memory Utilization Low (1) Low (1) Low (1)
155 l2_utilization L2 Cache Utilization Low (1) Low (1) Low (1)
155 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
155 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
155 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
155 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 special_fu_utilization Special Function Unit Utilization Low (1) Low (1) Low (1)
155 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
155 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
155 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.78% 0.95% 0.86%
155 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
155 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
155 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
155 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
155 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
155 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_gpu_band_forward_kernel__25
104 inst_per_warp Instructions per warp 4.0182e+04 4.0182e+04 4.0182e+04
104 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
104 warp_execution_efficiency Warp Execution Efficiency 56.25% 56.25% 56.25%
104 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 55.63% 55.63% 55.63%
104 inst_replay_overhead Instruction Replay Overhead 0.000549 0.000717 0.000631
104 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
104 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
104 local_load_transactions_per_request Local Memory Load Transactions Per Request 2.500000 2.500000 2.500000
104 local_store_transactions_per_request Local Memory Store Transactions Per Request 2.500000 2.500000 2.500000
103 gld_transactions_per_request Global Load Transactions Per Request 2.643685 2.680684 2.660489
103 gst_transactions_per_request Global Store Transactions Per Request 2.750000 2.750000 2.750000
103 shared_store_transactions Shared Store Transactions 0 0 0
103 shared_load_transactions Shared Load Transactions 0 0 0
103 local_load_transactions Local Load Transactions 2546250 2546250 2546250
103 local_store_transactions Local Store Transactions 2530500 2530500 2530500
103 gld_transactions Global Load Transactions 3568975 3618923 3591660
103 gst_transactions Global Store Transactions 123750 123750 123750
103 sysmem_read_transactions System Memory Read Transactions 0 0 0
103 sysmem_write_transactions System Memory Write Transactions 5 5 5
103 l2_read_transactions L2 Read Transactions 5239255 5266701 5251905
103 l2_write_transactions L2 Write Transactions 3156310 3181032 3166212
103 dram_read_transactions Device Memory Read Transactions 6067090 6081890 6074513
103 dram_write_transactions Device Memory Write Transactions 2615284 2641375 2628610
103 global_hit_rate Global Hit Rate in unified l1/tex 19.25% 19.51% 19.39%
103 local_hit_rate Local Hit Rate 8.64% 8.86% 8.75%
103 gld_requested_throughput Requested Global Load Throughput 72.877GB/s 78.486GB/s 76.349GB/s
103 gst_requested_throughput Requested Global Store Throughput 2.4292GB/s 2.6162GB/s 2.5450GB/s
103 gld_throughput Global Load Throughput 86.042GB/s 93.069GB/s 90.278GB/s
103 gst_throughput Global Store Throughput 2.9691GB/s 3.1976GB/s 3.1105GB/s
103 local_memory_overhead Local Memory Overhead 61.79% 62.29% 62.07%
103 tex_cache_hit_rate Unified Cache Hit Rate 11.36% 11.46% 11.40%
103 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 13.79% 14.09% 13.95%
103 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 5.61% 5.78% 5.68%
103 dram_read_throughput Device Memory Read Throughput 145.73GB/s 157.04GB/s 152.69GB/s
103 dram_write_throughput Device Memory Write Throughput 62.918GB/s 68.072GB/s 66.071GB/s
103 tex_cache_throughput Unified cache to SM throughput 243.16GB/s 261.89GB/s 254.72GB/s
103 l2_tex_read_throughput L2 Throughput (Texture Reads) 125.81GB/s 135.50GB/s 131.78GB/s
103 l2_tex_write_throughput L2 Throughput (Texture Writes) 63.682GB/s 68.584GB/s 66.716GB/s
103 l2_read_throughput L2 Throughput (Reads) 125.85GB/s 135.80GB/s 132.01GB/s
103 l2_write_throughput L2 Throughput (Writes) 75.741GB/s 82.082GB/s 79.584GB/s
103 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
103 sysmem_write_throughput System Memory Write Throughput 125.79KB/s 135.47KB/s 131.78KB/s
103 local_load_throughput Local Memory Load Throughput 61.091GB/s 65.793GB/s 64.001GB/s
103 local_store_throughput Local Memory Store Throughput 60.713GB/s 65.386GB/s 63.605GB/s
103 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
103 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
103 gld_efficiency Global Memory Load Efficiency 83.93% 85.11% 84.57%
103 gst_efficiency Global Memory Store Efficiency 81.82% 81.82% 81.82%
103 tex_cache_transactions Unified cache to SM transactions 2529908 2537878 2533502
103 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
103 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
103 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
103 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
103 flop_count_sp Floating Point Operations(Single Precision) 46980000 46980000 46980000
103 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
103 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 23490000 23490000 23490000
103 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
103 flop_count_sp_special Floating Point Operations(Single Precision Special) 5400 5400 5400
103 inst_executed Instructions Executed 9425100 12054600 10701556
103 inst_issued Instructions Issued 9430276 9431857 9431057
103 dram_utilization Device Memory Utilization Low (3) Low (3) Low (3)
103 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
103 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 0.09% 0.29% 0.17%
103 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 3.18% 3.49% 3.30%
103 stall_memory_dependency Issue Stall Reasons (Data Request) 96.05% 96.59% 96.38%
103 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
103 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
103 stall_other Issue Stall Reasons (Other) 0.03% 0.04% 0.04%
103 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.05% 0.10% 0.07%
103 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 0.01% 0.01% 0.01%
103 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
103 inst_fp_32 FP Instructions(Single) 23495400 23495400 23495400
103 inst_fp_64 FP Instructions(Double) 0 0 0
103 inst_integer Integer Instructions 79401600 79401600 79401600
103 inst_bit_convert Bit-Convert Instructions 10800 10800 10800
103 inst_control Control-Flow Instructions 199800 199800 199800
103 inst_compute_ld_st Load/Store Instructions 61630200 61630200 61630200
103 inst_misc Misc Instructions 4179600 4179600 4179600
103 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
103 issue_slots Issue Slots 9430276 9431857 9431057
103 cf_issued Issued Control-Flow Instructions 63000 63000 63000
103 cf_executed Executed Control-Flow Instructions 63000 63000 63000
103 ldst_issued Issued Load/Store Instructions 3426000 3426000 3426000
103 ldst_executed Executed Load/Store Instructions 3426000 3426000 3426000
103 atomic_transactions Atomic Transactions 0 0 0
103 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
103 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
103 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
103 l2_tex_read_transactions L2 Transactions (Texture Reads) 5237522 5245988 5242609
103 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.01% 0.03% 0.02%
103 stall_not_selected Issue Stall Reasons (Not Selected) 0.01% 0.02% 0.02%
103 l2_tex_write_transactions L2 Transactions (Texture Writes) 2654250 2654250 2654250
103 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
103 nvlink_total_data_received NVLink Total Data Received 864 864 864
103 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
103 nvlink_user_data_received NVLink User Data Received 0 0 0
103 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
103 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
103 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
103 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
103 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
103 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
103 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
103 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
103 nvlink_transmit_throughput NVLink Transmit Throughput 905.68KB/s 975.39KB/s 948.83KB/s
103 nvlink_receive_throughput NVLink Receive Throughput 679.26KB/s 731.55KB/s 711.62KB/s
103 nvlink_total_response_data_received NVLink Total Response Data Received 288 480 291
103 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
103 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
103 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
103 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
103 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
103 inst_fp_16 HP Instructions(Half) 0 0 0
103 ipc Executed IPC 0.068260 0.090479 0.078439
103 issued_ipc Issued IPC 0.068197 0.074746 0.070693
103 issue_slot_utilization Issue Slot Utilization 1.70% 1.87% 1.77%
103 sm_efficiency Multiprocessor Activity 96.12% 97.51% 96.99%
103 achieved_occupancy Achieved Occupancy 0.058575 0.058678 0.058625
103 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.068824 0.075366 0.071368
103 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
103 l2_utilization L2 Cache Utilization Low (1) Low (1) Low (1)
103 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
103 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
103 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
103 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
103 special_fu_utilization Special Function Unit Utilization Low (1) Low (1) Low (1)
103 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
103 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
103 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
103 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
103 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.25% 0.28% 0.27%
103 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
103 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
103 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
103 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
103 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
103 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
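The efficiency rows in the block above can be cross-checked against its throughput rows. The following is a minimal sketch, assuming that nvprof defines `gld_efficiency` and `gst_efficiency` as the ratio of requested to actual global memory throughput; the variable and function names are illustrative, and the numbers are the Avg column of the 103-invocation kernel.

```python
# Avg-column throughputs (GB/s) copied from the 103-invocation kernel block above.
gld_requested, gld_actual = 76.349, 90.278   # gld_requested_throughput, gld_throughput
gst_requested, gst_actual = 2.5450, 3.1105   # gst_requested_throughput, gst_throughput


def efficiency(requested, actual):
    """Requested / actual throughput, expressed as a percentage (assumed definition)."""
    return 100.0 * requested / actual


gld_eff = efficiency(gld_requested, gld_actual)
gst_eff = efficiency(gst_requested, gst_actual)

# Reported Avg values in the table: gld_efficiency 84.57%, gst_efficiency 81.82%.
print(f"derived gld_efficiency: {gld_eff:.2f}%")
print(f"derived gst_efficiency: {gst_eff:.2f}%")
```

Under this assumption, an efficiency below 100% means the kernel moved more bytes through the global load/store path than its threads actually requested.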
Kernel: ptxcall_gpu_volume_divergence_of_gradients__18
155 inst_per_warp Instructions per warp 1.0280e+03 1.0280e+03 1.0280e+03
155 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
155 warp_execution_efficiency Warp Execution Efficiency 96.43% 96.43% 96.43%
155 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 93.85% 93.85% 93.85%
155 inst_replay_overhead Instruction Replay Overhead 0.005245 0.011328 0.008139
155 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 1.560938 1.577383 1.568644
155 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 1.054945 1.054945 1.054945
155 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
155 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
155 gld_transactions_per_request Global Load Transactions Per Request 3.604080 3.634386 3.617958
155 gst_transactions_per_request Global Store Transactions Per Request 3.857143 3.857143 3.857143
155 shared_store_transactions Shared Store Transactions 72000 72000 72000
155 shared_load_transactions Shared Load Transactions 983391 993751 988245
155 local_load_transactions Local Load Transactions 0 0 0
155 local_store_transactions Local Store Transactions 0 0 0
155 gld_transactions Global Load Transactions 359507 362530 360891
155 gst_transactions Global Store Transactions 81000 81000 81000
155 sysmem_read_transactions System Memory Read Transactions 0 0 0
155 sysmem_write_transactions System Memory Write Transactions 5 5 5
155 l2_read_transactions L2 Read Transactions 365246 367318 366195
155 l2_write_transactions L2 Write Transactions 81025 104883 91831
155 dram_read_transactions Device Memory Read Transactions 366015 366441 366174
155 dram_write_transactions Device Memory Write Transactions 103259 138183 120506
155 global_hit_rate Global Hit Rate in unified l1/tex 12.13% 12.47% 12.30%
155 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
155 gld_requested_throughput Requested Global Load Throughput 406.27GB/s 436.85GB/s 423.26GB/s
155 gst_requested_throughput Requested Global Store Throughput 85.530GB/s 91.968GB/s 89.106GB/s
155 gld_throughput Global Load Throughput 381.04GB/s 409.63GB/s 397.01GB/s
155 gst_throughput Global Store Throughput 85.530GB/s 91.968GB/s 89.106GB/s
155 local_memory_overhead Local Memory Overhead 12.82% 13.45% 13.13%
155 tex_cache_hit_rate Unified Cache Hit Rate 5.13% 5.15% 5.14%
155 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 12.88% 12.91% 12.89%
155 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 38.76% 38.97% 38.87%
155 dram_read_throughput Device Memory Read Throughput 386.49GB/s 416.00GB/s 402.82GB/s
155 dram_write_throughput Device Memory Write Throughput 115.57GB/s 155.01GB/s 132.57GB/s
155 tex_cache_throughput Unified cache to SM throughput 3122.7GB/s 3357.6GB/s 3253.4GB/s
155 l2_tex_read_throughput L2 Throughput (Texture Reads) 385.52GB/s 414.51GB/s 401.62GB/s
155 l2_tex_write_throughput L2 Throughput (Texture Writes) 85.530GB/s 91.968GB/s 89.106GB/s
155 l2_read_throughput L2 Throughput (Reads) 385.72GB/s 415.71GB/s 402.84GB/s
155 l2_write_throughput L2 Throughput (Writes) 85.727GB/s 117.88GB/s 101.02GB/s
155 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 sysmem_write_throughput System Memory Write Throughput 5.4063MB/s 5.8133MB/s 5.6324MB/s
155 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 shared_load_throughput Shared Memory Load Throughput 4169.7GB/s 4488.3GB/s 4348.6GB/s
155 shared_store_throughput Shared Memory Store Throughput 304.11GB/s 327.00GB/s 316.82GB/s
155 gld_efficiency Global Memory Load Efficiency 106.13% 107.02% 106.61%
155 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
155 tex_cache_transactions Unified cache to SM transactions 739256 739493 739358
155 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
155 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
155 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
155 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
155 flop_count_sp Floating Point Operations(Single Precision) 53784000 53784000 53784000
155 flop_count_sp_add Floating Point Operations(Single Precision Add) 648000 648000 648000
155 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 25920000 25920000 25920000
155 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 1296000 1296000 1296000
155 flop_count_sp_special Floating Point Operations(Single Precision Special) 324000 324000 324000
155 inst_executed Instructions Executed 3155250 5397000 4268893
155 inst_issued Instructions Issued 3172077 3190992 3180698
155 dram_utilization Device Memory Utilization High (7) High (7) High (7)
155 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
155 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 0.38% 13.92% 6.23%
155 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 5.52% 8.64% 6.90%
155 stall_memory_dependency Issue Stall Reasons (Data Request) 38.73% 52.36% 44.45%
155 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
155 stall_sync Issue Stall Reasons (Synchronization) 4.36% 9.57% 6.96%
155 stall_other Issue Stall Reasons (Other) 1.40% 2.31% 1.80%
155 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 1.12% 5.52% 2.81%
155 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 10.28% 17.87% 13.62%
155 shared_efficiency Shared Memory Efficiency 24.67% 24.92% 24.80%
155 inst_fp_32 FP Instructions(Single) 28188000 28188000 28188000
155 inst_fp_64 FP Instructions(Double) 0 0 0
155 inst_integer Integer Instructions 32400000 32400000 32400000
155 inst_bit_convert Bit-Convert Instructions 648000 648000 648000
155 inst_control Control-Flow Instructions 972000 972000 972000
155 inst_compute_ld_st Load/Store Instructions 25272000 25272000 25272000
155 inst_misc Misc Instructions 7938000 7938000 7938000
155 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
155 issue_slots Issue Slots 3172077 3190992 3180698
155 cf_issued Issued Control-Flow Instructions 52500 52500 52500
155 cf_executed Executed Control-Flow Instructions 52500 52500 52500
155 ldst_issued Issued Load/Store Instructions 840000 840000 840000
155 ldst_executed Executed Load/Store Instructions 840000 840000 840000
155 atomic_transactions Atomic Transactions 0 0 0
155 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
155 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
155 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
155 l2_tex_read_transactions L2 Transactions (Texture Reads) 365026 365149 365081
155 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 7.51% 12.79% 9.90%
155 stall_not_selected Issue Stall Reasons (Not Selected) 5.59% 9.50% 7.34%
155 l2_tex_write_transactions L2 Transactions (Texture Writes) 81000 81000 81000
155 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
155 nvlink_total_data_received NVLink Total Data Received 864 864 864
155 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
155 nvlink_user_data_received NVLink User Data Received 0 0 0
155 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
155 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
155 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
155 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
155 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
155 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
155 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
155 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
155 nvlink_transmit_throughput NVLink Transmit Throughput 38.925MB/s 41.856MB/s 40.553MB/s
155 nvlink_receive_throughput NVLink Receive Throughput 29.194MB/s 31.392MB/s 30.415MB/s
155 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
155 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
155 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
155 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
155 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
155 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
155 inst_fp_16 HP Instructions(Half) 0 0 0
155 ipc Executed IPC 0.760504 1.347159 1.005393
155 issued_ipc Issued IPC 0.966492 1.353029 1.163890
155 issue_slot_utilization Issue Slot Utilization 24.16% 33.83% 29.10%
155 sm_efficiency Multiprocessor Activity 54.43% 92.58% 82.26%
155 achieved_occupancy Achieved Occupancy 0.563138 0.580315 0.570436
155 eligible_warps_per_cycle Eligible Warps Per Active Cycle 3.157054 4.310169 3.713334
155 shared_utilization Shared Memory Utilization Low (2) Low (2) Low (2)
155 l2_utilization L2 Cache Utilization Low (1) Low (1) Low (1)
155 tex_utilization Unified Cache Utilization Low (2) Low (3) Low (2)
155 ldst_fu_utilization Load/Store Function Unit Utilization Low (3) Mid (4) Low (3)
155 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
155 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 special_fu_utilization Special Function Unit Utilization Low (1) Low (1) Low (1)
155 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (2) Low (3) Low (2)
155 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
155 flop_sp_efficiency FLOP Efficiency(Peak Single) 5.60% 14.13% 11.42%
155 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
155 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
155 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
155 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
155 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
155 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
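The single-precision FLOP rows for `ptxcall_gpu_volume_divergence_of_gradients__18` can be checked for internal consistency. This sketch assumes the usual nvprof convention that `flop_count_sp` is the sum of adds, multiplies, and twice the FMAs, with special operations excluded; the variable names are illustrative.

```python
# Counts copied from the ptxcall_gpu_volume_divergence_of_gradients__18 block
# (Min, Max, and Avg are identical for these rows).
flop_sp_add = 648_000
flop_sp_mul = 1_296_000
flop_sp_fma = 25_920_000
flop_sp_reported = 53_784_000  # flop_count_sp

# Assumption: each FMA contributes two FLOPs; flop_count_sp_special is not included.
flop_sp_derived = flop_sp_add + flop_sp_mul + 2 * flop_sp_fma

print("derived :", flop_sp_derived)
print("reported:", flop_sp_reported)
print("match   :", flop_sp_derived == flop_sp_reported)
```

The same identity can be applied to the other kernel blocks as a quick sanity check on the dump.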
Kernel: ptxcall_gpu_kernel_apply_filter__30
51 inst_per_warp Instructions per warp 2.1140e+03 2.1140e+03 2.1140e+03
51 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
51 warp_execution_efficiency Warp Execution Efficiency 96.43% 96.43% 96.43%
51 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 95.18% 95.18% 95.18%
51 inst_replay_overhead Instruction Replay Overhead 0.000416 0.000625 0.000465
51 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 1.567679 1.568684 1.568007
51 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 1.019736 1.027548 1.022827
51 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
51 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
51 gld_transactions_per_request Global Load Transactions Per Request 3.561599 3.582441 3.573183
51 gst_transactions_per_request Global Store Transactions Per Request 3.857143 3.857143 3.857143
51 shared_store_transactions Shared Store Transactions 599605 604198 601422
51 shared_load_transactions Shared Load Transactions 4715970 4718995 4716957
51 local_load_transactions Local Load Transactions 0 0 0
51 local_store_transactions Local Store Transactions 0 0 0
51 gld_transactions Global Load Transactions 710539 714697 712849
51 gst_transactions Global Store Transactions 749250 749250 749250
51 sysmem_read_transactions System Memory Read Transactions 0 0 0
51 sysmem_write_transactions System Memory Write Transactions 5 5 5
51 l2_read_transactions L2 Read Transactions 749938 755052 752039
51 l2_write_transactions L2 Write Transactions 749288 773071 759131
51 dram_read_transactions Device Memory Read Transactions 749294 750171 749671
51 dram_write_transactions Device Memory Write Transactions 746599 770097 757856
51 global_hit_rate Global Hit Rate in unified l1/tex 52.81% 52.95% 52.88%
51 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
51 gld_requested_throughput Requested Global Load Throughput 199.34GB/s 218.93GB/s 209.75GB/s
51 gst_requested_throughput Requested Global Store Throughput 194.09GB/s 213.17GB/s 204.23GB/s
51 gld_throughput Global Load Throughput 185.04GB/s 202.92GB/s 194.31GB/s
51 gst_throughput Global Store Throughput 194.09GB/s 213.17GB/s 204.23GB/s
51 local_memory_overhead Local Memory Overhead 53.91% 54.18% 54.04%
51 tex_cache_hit_rate Unified Cache Hit Rate 1.60% 1.60% 1.60%
51 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 13.01% 13.01% 13.01%
51 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 99.82% 99.95% 99.90%
51 dram_read_throughput Device Memory Read Throughput 194.32GB/s 213.43GB/s 204.35GB/s
51 dram_write_throughput Device Memory Write Throughput 195.40GB/s 217.48GB/s 206.58GB/s
51 tex_cache_throughput Unified cache to SM throughput 3338.9GB/s 3667.1GB/s 3513.4GB/s
51 l2_tex_read_throughput L2 Throughput (Texture Reads) 194.20GB/s 213.29GB/s 204.34GB/s
51 l2_tex_write_throughput L2 Throughput (Texture Writes) 194.09GB/s 213.17GB/s 204.23GB/s
51 l2_read_throughput L2 Throughput (Reads) 194.27GB/s 214.82GB/s 204.99GB/s
51 l2_write_throughput L2 Throughput (Writes) 196.16GB/s 218.07GB/s 206.93GB/s
51 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
51 sysmem_write_throughput System Memory Write Throughput 1.3263MB/s 1.4567MB/s 1.3956MB/s
51 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
51 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
51 shared_load_throughput Shared Memory Load Throughput 4886.7GB/s 5368.3GB/s 5143.1GB/s
51 shared_store_throughput Shared Memory Store Throughput 625.30GB/s 684.23GB/s 655.75GB/s
51 gld_efficiency Global Memory Load Efficiency 107.67% 108.30% 107.95%
51 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
51 tex_cache_transactions Unified cache to SM transactions 3222250 3222289 3222262
51 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
51 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
51 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
51 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
51 flop_count_sp Floating Point Operations(Single Precision) 215784000 215784000 215784000
51 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
51 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 107892000 107892000 107892000
51 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
51 flop_count_sp_special Floating Point Operations(Single Precision Special) 324000 324000 324000
51 inst_executed Instructions Executed 8856750 11098500 9911691
51 inst_issued Instructions Issued 8860430 8862337 8860759
51 dram_utilization Device Memory Utilization Mid (5) Mid (6) Mid (5)
51 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
51 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 0.92% 13.90% 7.18%
51 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 27.36% 35.20% 31.23%
51 stall_memory_dependency Issue Stall Reasons (Data Request) 18.00% 30.02% 24.04%
51 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
51 stall_sync Issue Stall Reasons (Synchronization) 6.75% 8.53% 7.61%
51 stall_other Issue Stall Reasons (Other) 0.41% 0.52% 0.46%
51 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.57% 0.85% 0.68%
51 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 15.83% 21.10% 18.28%
51 shared_efficiency Shared Memory Efficiency 42.78% 42.82% 42.80%
51 inst_fp_32 FP Instructions(Single) 108216000 108216000 108216000
51 inst_fp_64 FP Instructions(Double) 0 0 0
51 inst_integer Integer Instructions 30618000 30618000 30618000
51 inst_bit_convert Bit-Convert Instructions 648000 648000 648000
51 inst_control Control-Flow Instructions 972000 972000 972000
51 inst_compute_ld_st Load/Store Instructions 123120000 123120000 123120000
51 inst_misc Misc Instructions 7776000 7776000 7776000
51 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
51 issue_slots Issue Slots 8860430 8862337 8860759
51 cf_issued Issued Control-Flow Instructions 52500 52500 52500
51 cf_executed Executed Control-Flow Instructions 52500 52500 52500
51 ldst_issued Issued Load/Store Instructions 4032000 4032000 4032000
51 ldst_executed Executed Load/Store Instructions 4032000 4032000 4032000
51 atomic_transactions Atomic Transactions 0 0 0
51 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
51 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
51 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
51 l2_tex_read_transactions L2 Transactions (Texture Reads) 749650 749672 749656
51 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 5.43% 7.84% 6.56%
51 stall_not_selected Issue Stall Reasons (Not Selected) 3.48% 4.49% 3.95%
51 l2_tex_write_transactions L2 Transactions (Texture Writes) 749250 749250 749250
51 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
51 nvlink_total_data_received NVLink Total Data Received 864 864 864
51 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
51 nvlink_user_data_received NVLink User Data Received 0 0 0
51 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
51 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
51 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
51 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
51 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
51 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
51 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
51 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
51 nvlink_transmit_throughput NVLink Transmit Throughput 9.5495MB/s 10.488MB/s 10.049MB/s
51 nvlink_receive_throughput NVLink Receive Throughput 7.1621MB/s 7.8663MB/s 7.5364MB/s
51 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
51 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
51 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
51 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
51 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
51 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
51 inst_fp_16 HP Instructions(Half) 0 0 0
51 ipc Executed IPC 0.652266 0.807078 0.758255
51 issued_ipc Issued IPC 0.652537 0.820166 0.728063
51 issue_slot_utilization Issue Slot Utilization 16.31% 20.50% 18.20%
51 sm_efficiency Multiprocessor Activity 83.94% 93.95% 92.44%
51 achieved_occupancy Achieved Occupancy 0.109066 0.109089 0.109078
51 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.856441 1.101611 0.983864
51 shared_utilization Shared Memory Utilization Mid (4) Mid (4) Mid (4)
51 l2_utilization L2 Cache Utilization Low (1) Low (1) Low (1)
51 tex_utilization Unified Cache Utilization Low (3) Low (3) Low (3)
51 ldst_fu_utilization Load/Store Function Unit Utilization Low (3) Mid (4) Low (3)
51 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
51 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
51 special_fu_utilization Special Function Unit Utilization Low (1) Low (1) Low (1)
51 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
51 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (2) Low (2) Low (2)
51 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
51 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
51 flop_sp_efficiency FLOP Efficiency(Peak Single) 10.91% 14.25% 12.63%
51 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
51 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
51 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
51 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
51 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
51 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
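The `issue_slot_utilization` row for `ptxcall_gpu_kernel_apply_filter__30` appears to follow directly from `issued_ipc`. The sketch below assumes four issue slots per SM per cycle (one per warp scheduler on a V100); that figure is an assumption about the hardware and is not stated in the profiler output.

```python
# issued_ipc and issue_slot_utilization (Min, Max, Avg) copied from the
# ptxcall_gpu_kernel_apply_filter__30 block above.
issued_ipc = {"min": 0.652537, "max": 0.820166, "avg": 0.728063}
reported_utilization = {"min": 16.31, "max": 20.50, "avg": 18.20}  # percent

# Assumption: 4 warp schedulers per SM, each able to issue one instruction per cycle.
ISSUE_SLOTS_PER_CYCLE = 4

derived_utilization = {
    column: 100.0 * ipc / ISSUE_SLOTS_PER_CYCLE for column, ipc in issued_ipc.items()
}

for column in ("min", "max", "avg"):
    print(
        f"{column}: derived {derived_utilization[column]:.2f}%, "
        f"reported {reported_utilization[column]:.2f}%"
    )
```

Read this way, the kernel fills roughly one issue slot in five on average, which is consistent with the stall breakdown in the same block.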
Kernel: ptxcall_gpu_volume_tendency__22
155 inst_per_warp Instructions per warp 5.7836e+04 7.4062e+04 5.7956e+04
155 branch_efficiency Branch Efficiency 98.81% 99.21% 98.82%
155 warp_execution_efficiency Warp Execution Efficiency 54.96% 55.26% 54.97%
155 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 53.13% 53.21% 53.14%
155 inst_replay_overhead Instruction Replay Overhead 0.000932 0.001200 0.001000
155 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 1.134168 1.143866 1.138681
155 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 1.510191 1.511871 1.510877
155 local_load_transactions_per_request Local Memory Load Transactions Per Request 2.500000 2.500000 2.500000
155 local_store_transactions_per_request Local Memory Store Transactions Per Request 2.500000 2.500000 2.500000
155 gld_transactions_per_request Global Load Transactions Per Request 2.582710 2.609019 2.595801
155 gst_transactions_per_request Global Store Transactions Per Request 2.750000 2.750000 2.750000
155 shared_store_transactions Shared Store Transactions 1497354 1499020 1498034
155 shared_load_transactions Shared Load Transactions 5032302 5075335 5052329
155 local_load_transactions Local Load Transactions 7376250 7376250 7376250
155 local_store_transactions Local Store Transactions 6918750 6918750 6918750
155 gld_transactions Global Load Transactions 4838707 4887997 4863232
155 gst_transactions Global Store Transactions 915750 915750 915750
155 sysmem_read_transactions System Memory Read Transactions 0 0 0
155 sysmem_write_transactions System Memory Write Transactions 5 5 5
155 l2_read_transactions L2 Read Transactions 11788429 11841602 11813449
155 l2_write_transactions L2 Write Transactions 8918267 8977715 8946567
155 dram_read_transactions Device Memory Read Transactions 14696023 14801572 14745806
155 dram_write_transactions Device Memory Write Transactions 7411133 7475465 7440813
155 global_hit_rate Global Hit Rate in unified l1/tex 11.89% 12.93% 12.47%
155 local_hit_rate Local Hit Rate 8.48% 8.74% 8.59%
155 gld_requested_throughput Requested Global Load Throughput 87.610GB/s 90.272GB/s 89.013GB/s
155 gst_requested_throughput Requested Global Store Throughput 15.572GB/s 16.045GB/s 15.821GB/s
155 gld_throughput Global Load Throughput 100.89GB/s 104.27GB/s 102.69GB/s
155 gst_throughput Global Store Throughput 19.033GB/s 19.611GB/s 19.337GB/s
155 local_memory_overhead Local Memory Overhead 74.00% 74.51% 74.27%
155 tex_cache_hit_rate Unified Cache Hit Rate 3.43% 3.65% 3.55%
155 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 6.26% 6.65% 6.42%
155 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 29.26% 30.78% 30.05%
155 dram_read_throughput Device Memory Read Throughput 306.98GB/s 316.10GB/s 311.38GB/s
155 dram_write_throughput Device Memory Write Throughput 154.83GB/s 159.20GB/s 157.12GB/s
155 tex_cache_throughput Unified cache to SM throughput 791.84GB/s 816.30GB/s 804.78GB/s
155 l2_tex_read_throughput L2 Throughput (Texture Reads) 245.40GB/s 253.09GB/s 249.29GB/s
155 l2_tex_write_throughput L2 Throughput (Texture Writes) 162.83GB/s 167.78GB/s 165.44GB/s
155 l2_read_throughput L2 Throughput (Reads) 245.35GB/s 253.48GB/s 249.46GB/s
155 l2_write_throughput L2 Throughput (Writes) 185.78GB/s 191.32GB/s 188.92GB/s
155 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 sysmem_write_throughput System Memory Write Throughput 108.96KB/s 112.28KB/s 110.71KB/s
155 local_load_throughput Local Memory Load Throughput 153.30GB/s 157.96GB/s 155.76GB/s
155 local_store_throughput Local Memory Store Throughput 143.80GB/s 148.16GB/s 146.10GB/s
155 shared_load_throughput Shared Memory Load Throughput 420.42GB/s 431.87GB/s 426.75GB/s
155 shared_store_throughput Shared Memory Store Throughput 124.53GB/s 128.33GB/s 126.53GB/s
155 gld_efficiency Global Memory Load Efficiency 86.24% 87.12% 86.68%
155 gst_efficiency Global Memory Store Efficiency 81.82% 81.82% 81.82%
155 tex_cache_transactions Unified cache to SM transactions 9524828 9532180 9527950
155 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
155 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
155 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
155 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
155 flop_count_sp Floating Point Operations(Single Precision) 504360511 512469380 512412526
155 flop_count_sp_add Floating Point Operations(Single Precision Add) 12143911 12793178 12788340
155 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 194618926 198185460 198161153
155 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 102978748 103305282 103301878
155 flop_count_sp_special Floating Point Operations(Single Precision Special) 2934217 3583484 3578646
155 inst_executed Instructions Executed 39677249 111092290 58067329
155 inst_issued Instructions Issued 39715215 41013783 39726511
155 dram_utilization Device Memory Utilization Mid (6) Mid (6) Mid (6)
155 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
155 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 2.58% 5.73% 4.08%
155 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 5.19% 6.37% 5.65%
155 stall_memory_dependency Issue Stall Reasons (Data Request) 36.00% 40.97% 38.55%
155 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
155 stall_sync Issue Stall Reasons (Synchronization) 1.57% 2.22% 1.89%
155 stall_other Issue Stall Reasons (Other) 2.93% 4.30% 3.70%
155 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.09% 0.16% 0.11%
155 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 1.91% 2.46% 2.10%
155 shared_efficiency Shared Memory Efficiency 30.97% 31.18% 31.08%
155 inst_fp_32 FP Instructions(Single) 327768619 329720208 329703700
155 inst_fp_64 FP Instructions(Double) 0 0 0
155 inst_integer Integer Instructions 97647980 106235247 97704030
155 inst_bit_convert Bit-Convert Instructions 2278992 2278992 2278992
155 inst_control Control-Flow Instructions 21379944 27218279 21420852
155 inst_compute_ld_st Load/Store Instructions 217161000 217161000 217161000
155 inst_misc Misc Instructions 28322678 32862479 28353912
155 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
155 issue_slots Issue Slots 39715215 41013783 39726511
155 cf_issued Issued Control-Flow Instructions 1427210 1994516 1431489
155 cf_executed Executed Control-Flow Instructions 1427210 1994516 1431489
155 ldst_issued Issued Load/Store Instructions 12353541 12398587 12353910
155 ldst_executed Executed Load/Store Instructions 12353541 12398587 12353910
155 atomic_transactions Atomic Transactions 0 0 0
155 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
155 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
155 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
155 l2_tex_read_transactions L2 Transactions (Texture Reads) 11785232 11828904 11805734
155 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 39.91% 46.25% 43.35%
155 stall_not_selected Issue Stall Reasons (Not Selected) 0.51% 0.68% 0.58%
155 l2_tex_write_transactions L2 Transactions (Texture Writes) 7834500 7834500 7834500
155 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
155 nvlink_total_data_received NVLink Total Data Received 864 864 864
155 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
155 nvlink_user_data_received NVLink User Data Received 0 0 0
155 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
155 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
155 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
155 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
155 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
155 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
155 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
155 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
155 nvlink_transmit_throughput NVLink Transmit Throughput 784.55KB/s 808.39KB/s 797.11KB/s
155 nvlink_receive_throughput NVLink Receive Throughput 588.41KB/s 606.29KB/s 597.83KB/s
155 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
155 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
155 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
155 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
155 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
155 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
155 inst_fp_16 HP Instructions(Half) 0 0 0
155 ipc Executed IPC 0.244187 0.570360 0.424228
155 issued_ipc Issued IPC 0.244445 0.298409 0.264295
155 issue_slot_utilization Issue Slot Utilization 6.11% 7.46% 6.61%
155 sm_efficiency Multiprocessor Activity 84.41% 93.51% 90.65%
155 achieved_occupancy Achieved Occupancy 0.118197 0.122005 0.119778
155 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.285144 0.345793 0.308377
155 shared_utilization Shared Memory Utilization Low (1) Low (1) Low (1)
155 l2_utilization L2 Cache Utilization Low (1) Low (1) Low (1)
155 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
155 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
155 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
155 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 special_fu_utilization Special Function Unit Utilization Low (1) Low (1) Low (1)
155 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
155 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
155 flop_sp_efficiency FLOP Efficiency(Peak Single) 1.79% 2.73% 2.44%
155 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
155 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
155 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
155 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
155 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
155 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_gpu_interface_tendency__23
155 inst_per_warp Instructions per warp 5.6220e+04 8.7998e+04 5.6467e+04
155 branch_efficiency Branch Efficiency 98.21% 99.09% 98.23%
155 warp_execution_efficiency Warp Execution Efficiency 54.26% 55.01% 54.28%
155 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 52.58% 52.92% 52.61%
155 inst_replay_overhead Instruction Replay Overhead 0.001664 0.002169 0.001733
155 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
155 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
155 local_load_transactions_per_request Local Memory Load Transactions Per Request 2.500000 2.500000 2.500000
155 local_store_transactions_per_request Local Memory Store Transactions Per Request 2.500000 2.500000 2.500000
155 gld_transactions_per_request Global Load Transactions Per Request 7.221265 7.227326 7.224633
155 gst_transactions_per_request Global Store Transactions Per Request 7.083333 7.083333 7.083333
155 shared_store_transactions Shared Store Transactions 0 0 0
155 shared_load_transactions Shared Load Transactions 0 0 0
155 local_load_transactions Local Load Transactions 5875500 5875500 5875500
155 local_store_transactions Local Store Transactions 5248500 5248500 5248500
155 gld_transactions Global Load Transactions 26687629 26710027 26700077
155 gst_transactions Global Store Transactions 2358750 2358750 2358750
155 sysmem_read_transactions System Memory Read Transactions 0 0 0
155 sysmem_write_transactions System Memory Write Transactions 5 5 5
155 l2_read_transactions L2 Read Transactions 30655390 30762288 30714960
155 l2_write_transactions L2 Write Transactions 8658848 8695666 8675063
155 dram_read_transactions Device Memory Read Transactions 31885169 32452956 32190176
155 dram_write_transactions Device Memory Write Transactions 7548880 7576679 7563710
155 global_hit_rate Global Hit Rate in unified l1/tex 12.71% 12.93% 12.80%
155 local_hit_rate Local Hit Rate 7.53% 7.92% 7.69%
155 gld_requested_throughput Requested Global Load Throughput 92.171GB/s 95.723GB/s 94.333GB/s
155 gst_requested_throughput Requested Global Store Throughput 8.2856GB/s 8.6049GB/s 8.4799GB/s
155 gld_throughput Global Load Throughput 295.29GB/s 306.70GB/s 302.19GB/s
155 gst_throughput Global Store Throughput 26.084GB/s 27.089GB/s 26.696GB/s
155 local_memory_overhead Local Memory Overhead 33.76% 34.00% 33.87%
155 tex_cache_hit_rate Unified Cache Hit Rate 5.42% 5.64% 5.51%
155 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 15.83% 17.56% 16.73%
155 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 30.82% 30.94% 30.88%
155 dram_read_throughput Device Memory Read Throughput 354.28GB/s 371.81GB/s 364.32GB/s
155 dram_write_throughput Device Memory Write Throughput 83.581GB/s 87.016GB/s 85.605GB/s
155 tex_cache_throughput Unified cache to SM throughput 362.99GB/s 377.21GB/s 371.66GB/s
155 l2_tex_read_throughput L2 Throughput (Texture Reads) 339.45GB/s 352.62GB/s 347.53GB/s
155 l2_tex_write_throughput L2 Throughput (Texture Writes) 84.125GB/s 87.367GB/s 86.098GB/s
155 l2_read_throughput L2 Throughput (Reads) 339.44GB/s 352.80GB/s 347.63GB/s
155 l2_write_throughput L2 Throughput (Writes) 95.832GB/s 99.730GB/s 98.183GB/s
155 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 sysmem_write_throughput System Memory Write Throughput 57.978KB/s 60.212KB/s 59.337KB/s
155 local_load_throughput Local Memory Load Throughput 64.974GB/s 67.478GB/s 66.498GB/s
155 local_store_throughput Local Memory Store Throughput 58.041GB/s 60.277GB/s 59.402GB/s
155 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 gld_efficiency Global Memory Load Efficiency 31.20% 31.23% 31.22%
155 gst_efficiency Global Memory Store Efficiency 31.76% 31.76% 31.76%
155 tex_cache_transactions Unified cache to SM transactions 8199960 8220353 8209511
155 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
155 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
155 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
155 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
155 flop_count_sp Floating Point Operations(Single Precision) 334506426 350619944 350502251
155 flop_count_sp_add Floating Point Operations(Single Precision Add) 18899718 20177192 20166988
155 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 95104236 102216784 102166972
155 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 125398236 126009184 126001318
155 flop_count_sp_special Floating Point Operations(Single Precision Special) 2332518 3609992 3599788
155 inst_executed Instructions Executed 35415834 84420353 57552148
155 inst_issued Instructions Issued 35475522 37982705 35496032
155 dram_utilization Device Memory Utilization Mid (6) Mid (6) Mid (6)
155 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
155 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 0.91% 2.32% 1.49%
155 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 2.78% 3.64% 2.97%
155 stall_memory_dependency Issue Stall Reasons (Data Request) 91.03% 93.20% 92.13%
155 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
155 stall_sync Issue Stall Reasons (Synchronization) 0.50% 0.63% 0.55%
155 stall_other Issue Stall Reasons (Other) 0.61% 0.72% 0.67%
155 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.06% 0.13% 0.08%
155 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 0.11% 0.14% 0.12%
155 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
155 inst_fp_32 FP Instructions(Single) 254370708 258277152 258240176
155 inst_fp_64 FP Instructions(Double) 0 0 0
155 inst_integer Integer Instructions 166851808 183594882 166961789
155 inst_bit_convert Bit-Convert Instructions 0 0 0
155 inst_control Control-Flow Instructions 20581640 32001810 20665128
155 inst_compute_ld_st Load/Store Instructions 150697800 150697800 150697800
155 inst_misc Misc Instructions 28660040 37164210 28724715
155 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
155 issue_slots Issue Slots 35475522 37982705 35496032
155 cf_issued Issued Control-Flow Instructions 1437616 2546926 1446439
155 cf_executed Executed Control-Flow Instructions 1437616 2546926 1446439
155 ldst_issued Issued Load/Store Instructions 8639377 8727067 8640125
155 ldst_executed Executed Load/Store Instructions 8639377 8727067 8640125
155 atomic_transactions Atomic Transactions 0 0 0
155 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
155 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
155 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
155 l2_tex_read_transactions L2 Transactions (Texture Reads) 30654539 30742851 30705896
155 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 1.38% 2.09% 1.80%
155 stall_not_selected Issue Stall Reasons (Not Selected) 0.16% 0.20% 0.17%
155 l2_tex_write_transactions L2 Transactions (Texture Writes) 7607250 7607250 7607250
155 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
155 nvlink_total_data_received NVLink Total Data Received 864 864 864
155 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
155 nvlink_user_data_received NVLink User Data Received 0 0 0
155 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
155 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
155 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
155 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
155 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
155 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
155 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
155 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
155 nvlink_transmit_throughput NVLink Transmit Throughput 417.44KB/s 433.53KB/s 427.23KB/s
155 nvlink_receive_throughput NVLink Receive Throughput 313.08KB/s 325.15KB/s 320.43KB/s
155 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
155 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
155 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
155 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
155 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
155 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
155 inst_fp_16 HP Instructions(Half) 0 0 0
155 ipc Executed IPC 0.124833 0.340911 0.205249
155 issued_ipc Issued IPC 0.124244 0.152726 0.133009
155 issue_slot_utilization Issue Slot Utilization 3.11% 3.82% 3.33%
155 sm_efficiency Multiprocessor Activity 84.53% 89.31% 87.18%
155 achieved_occupancy Achieved Occupancy 0.118253 0.121083 0.119840
155 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.136061 0.168959 0.145864
155 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
155 l2_utilization L2 Cache Utilization Low (1) Low (1) Low (1)
155 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
155 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
155 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
155 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 special_fu_utilization Special Function Unit Utilization Low (1) Low (1) Low (1)
155 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
155 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
155 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.70% 0.97% 0.88%
155 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
155 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
155 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
155 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
155 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
155 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
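A few of the rows for the interface tendency kernel above can be cross-checked against each other. The following is a minimal Julia sketch using the Avg column values copied from the table. It assumes that each DRAM transaction moves 32 bytes, which is consistent with the transaction counts and throughputs reported here. The variable names are illustrative only and are not part of ClimateMachine or nvprof.

```julia
# Avg-column values for ptxcall_gpu_interface_tendency__23, copied from the table above.
gld_requested_throughput = 94.333      # GB/s
gld_throughput           = 302.19      # GB/s
flop_count_sp            = 350502251
dram_read_transactions   = 32190176
dram_write_transactions  = 7563710

# Assumption: one DRAM transaction moves 32 bytes.
bytes_per_transaction = 32

# Ratio of requested to actual global load throughput, as a percentage.
gld_efficiency = 100 * gld_requested_throughput / gld_throughput

# Single-precision flops per byte of DRAM traffic (a rough arithmetic intensity).
dram_bytes = (dram_read_transactions + dram_write_transactions) * bytes_per_transaction
intensity = flop_count_sp / dram_bytes

println("gld_efficiency (%): ", round(gld_efficiency, digits = 2))
println("flop per DRAM byte: ", round(intensity, digits = 3))
```

The ratio comes out close to the reported gld_efficiency of 31.22%, and the kernel performs well under one single-precision flop per DRAM byte. This is in line with the data-request stall share of more than 90% reported for this kernel, and suggests it is limited by memory traffic rather than arithmetic.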
Kernel: ptxcall_gpu_band_back_kernel__26
103 inst_per_warp Instructions per warp 7.4554e+04 7.4554e+04 7.4554e+04
103 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
103 warp_execution_efficiency Warp Execution Efficiency 56.25% 56.25% 56.25%
103 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 54.82% 54.82% 54.82%
103 inst_replay_overhead Instruction Replay Overhead 0.000404 0.000487 0.000439
103 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
103 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
103 local_load_transactions_per_request Local Memory Load Transactions Per Request 2.500000 2.500000 2.500000
103 local_store_transactions_per_request Local Memory Store Transactions Per Request 2.500000 2.500000 2.500000
103 gld_transactions_per_request Global Load Transactions Per Request 2.621333 2.647758 2.632854
103 gst_transactions_per_request Global Store Transactions Per Request 2.750000 2.750000 2.750000
103 shared_store_transactions Shared Store Transactions 0 0 0
103 shared_load_transactions Shared Load Transactions 0 0 0
103 local_load_transactions Local Load Transactions 1905000 1905000 1905000
103 local_store_transactions Local Store Transactions 1891500 1891500 1891500
103 gld_transactions Global Load Transactions 3656759 3693623 3672831
103 gst_transactions Global Store Transactions 123750 123750 123750
103 sysmem_read_transactions System Memory Read Transactions 0 0 0
103 sysmem_write_transactions System Memory Write Transactions 5 5 5
103 l2_read_transactions L2 Read Transactions 4993036 5069547 5037472
103 l2_write_transactions L2 Write Transactions 2397028 2421927 2407366
103 dram_read_transactions Device Memory Read Transactions 5675340 5879775 5770164
103 dram_write_transactions Device Memory Write Transactions 2028850 2052293 2041506
103 global_hit_rate Global Hit Rate in unified l1/tex 14.08% 15.81% 14.84%
103 local_hit_rate Local Hit Rate 6.28% 7.07% 6.60%
103 gld_requested_throughput Requested Global Load Throughput 87.116GB/s 93.356GB/s 90.869GB/s
103 gst_requested_throughput Requested Global Store Throughput 2.8102GB/s 3.0115GB/s 2.9313GB/s
103 gld_throughput Global Load Throughput 101.61GB/s 109.55GB/s 106.33GB/s
103 gst_throughput Global Store Throughput 3.4347GB/s 3.6807GB/s 3.5826GB/s
103 local_memory_overhead Local Memory Overhead 53.52% 54.42% 53.94%
103 tex_cache_hit_rate Unified Cache Hit Rate 8.95% 10.06% 9.45%
103 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 14.26% 15.36% 14.80%
103 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 4.16% 4.42% 4.26%
103 dram_read_throughput Device Memory Read Throughput 158.35GB/s 174.36GB/s 167.05GB/s
103 dram_write_throughput Device Memory Write Throughput 56.460GB/s 60.961GB/s 59.103GB/s
103 tex_cache_throughput Unified cache to SM throughput 256.81GB/s 275.67GB/s 268.15GB/s
103 l2_tex_read_throughput L2 Throughput (Texture Reads) 138.33GB/s 149.82GB/s 145.00GB/s
103 l2_tex_write_throughput L2 Throughput (Texture Writes) 55.933GB/s 59.940GB/s 58.343GB/s
103 l2_read_throughput L2 Throughput (Reads) 139.12GB/s 150.39GB/s 145.84GB/s
103 l2_write_throughput L2 Throughput (Writes) 66.634GB/s 71.915GB/s 69.695GB/s
103 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
103 sysmem_write_throughput System Memory Write Throughput 145.52KB/s 155.94KB/s 151.78KB/s
103 local_load_throughput Local Memory Load Throughput 52.873GB/s 56.661GB/s 55.151GB/s
103 local_store_throughput Local Memory Store Throughput 52.499GB/s 56.259GB/s 54.760GB/s
103 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
103 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
103 gld_efficiency Global Memory Load Efficiency 84.98% 85.83% 85.46%
103 gst_efficiency Global Memory Store Efficiency 81.82% 81.82% 81.82%
103 tex_cache_transactions Unified cache to SM transactions 2310600 2320466 2315582
103 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
103 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
103 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
103 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
103 flop_count_sp Floating Point Operations(Single Precision) 59130000 59130000 59130000
103 flop_count_sp_add Floating Point Operations(Single Precision Add) 810000 810000 810000
103 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 29160000 29160000 29160000
103 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
103 flop_count_sp_special Floating Point Operations(Single Precision Special) 815400 815400 815400
103 inst_executed Instructions Executed 12221700 22366200 17540175
103 inst_issued Instructions Issued 12226667 12227719 12227076
103 dram_utilization Device Memory Utilization Low (3) Low (3) Low (3)
103 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
103 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 1.11% 1.90% 1.42%
103 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 5.22% 5.80% 5.44%
103 stall_memory_dependency Issue Stall Reasons (Data Request) 92.12% 93.38% 92.88%
103 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
103 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
103 stall_other Issue Stall Reasons (Other) 0.09% 0.10% 0.09%
103 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.10% 0.13% 0.11%
103 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 0.01% 0.02% 0.01%
103 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
103 inst_fp_32 FP Instructions(Single) 31595400 31595400 31595400
103 inst_fp_64 FP Instructions(Double) 0 0 0
103 inst_integer Integer Instructions 119988000 119988000 119988000
103 inst_bit_convert Bit-Convert Instructions 10800 10800 10800
103 inst_control Control-Flow Instructions 3439800 3439800 3439800
103 inst_compute_ld_st Load/Store Instructions 53254800 53254800 53254800
103 inst_misc Misc Instructions 9347400 9347400 9347400
103 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
103 issue_slots Issue Slots 12226667 12227719 12227076
103 cf_issued Issued Control-Flow Instructions 279000 279000 279000
103 cf_executed Executed Control-Flow Instructions 279000 279000 279000
103 ldst_issued Issued Load/Store Instructions 3005700 3005700 3005700
103 ldst_executed Executed Load/Store Instructions 3005700 3005700 3005700
103 atomic_transactions Atomic Transactions 0 0 0
103 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
103 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
103 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
103 l2_tex_read_transactions L2 Transactions (Texture Reads) 4960919 5047296 5008558
103 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.00% 0.01% 0.01%
103 stall_not_selected Issue Stall Reasons (Not Selected) 0.02% 0.04% 0.03%
103 l2_tex_write_transactions L2 Transactions (Texture Writes) 2015250 2015250 2015250
103 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
103 nvlink_total_data_received NVLink Total Data Received 864 864 864
103 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
103 nvlink_user_data_received NVLink User Data Received 0 0 0
103 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
103 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
103 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
103 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
103 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
103 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
103 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
103 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
103 nvlink_transmit_throughput NVLink Transmit Throughput 1.0232MB/s 1.0964MB/s 1.0672MB/s
103 nvlink_receive_throughput NVLink Receive Throughput 785.79KB/s 842.07KB/s 819.64KB/s
103 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
103 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
103 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
103 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
103 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
103 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
103 inst_fp_16 HP Instructions(Half) 0 0 0
103 ipc Executed IPC 0.104562 0.175170 0.136577
103 issued_ipc Issued IPC 0.104607 0.114902 0.108720
103 issue_slot_utilization Issue Slot Utilization 2.62% 2.87% 2.72%
103 sm_efficiency Multiprocessor Activity 93.17% 96.42% 94.72%
103 achieved_occupancy Achieved Occupancy 0.058455 0.058861 0.058625
103 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.105543 0.116186 0.109803
103 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
103 l2_utilization L2 Cache Utilization Low (1) Low (1) Low (1)
103 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
103 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
103 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
103 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
103 special_fu_utilization Special Function Unit Utilization Low (1) Low (1) Low (1)
103 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
103 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
103 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
103 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
103 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.32% 0.41% 0.38%
103 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
103 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
103 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
103 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
103 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
103 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
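The single-precision flop rows for the band back kernel above appear to be related in a simple way: the total matches the add and mul counts plus twice the FMA count, with the special operations left out. The Julia sketch below checks this with the values from the table. The weighting of an FMA as two operations is an assumption inferred from these numbers, and the variable names are illustrative only.

```julia
# Values for ptxcall_gpu_band_back_kernel__26 (Min, Max and Avg are identical).
flop_count_sp_add = 810000
flop_count_sp_mul = 0
flop_count_sp_fma = 29160000
flop_count_sp     = 59130000

# Assumption: an FMA counts as two operations and special operations are not included.
reconstructed = flop_count_sp_add + flop_count_sp_mul + 2 * flop_count_sp_fma

println(reconstructed == flop_count_sp)
```

The same relation holds for the Min column of the interface tendency kernel, so the FMA count accounts for most of the reported single-precision work in both kernels.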
Kernel: ptxcall_gpu_interface_gradients_of_laplacians__21
155 inst_per_warp Instructions per warp 3.6900e+03 3.6900e+03 3.6900e+03
155 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
155 warp_execution_efficiency Warp Execution Efficiency 56.25% 56.25% 56.25%
155 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 53.93% 53.93% 53.93%
155 inst_replay_overhead Instruction Replay Overhead 0.014868 0.020550 0.016656
155 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
155 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
155 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
155 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
155 gld_transactions_per_request Global Load Transactions Per Request 9.191900 9.197652 9.194650
155 gst_transactions_per_request Global Store Transactions Per Request 9.250000 9.250000 9.250000
155 shared_store_transactions Shared Store Transactions 0 0 0
155 shared_load_transactions Shared Load Transactions 0 0 0
155 local_load_transactions Local Load Transactions 0 0 0
155 local_store_transactions Local Store Transactions 0 0 0
155 gld_transactions Global Load Transactions 1558027 1559002 1558493
155 gst_transactions Global Store Transactions 666000 666000 666000
155 sysmem_read_transactions System Memory Read Transactions 0 0 0
155 sysmem_write_transactions System Memory Write Transactions 5 5 5
155 l2_read_transactions L2 Read Transactions 1121736 1129357 1125262
155 l2_write_transactions L2 Write Transactions 757344 784877 769611
155 dram_read_transactions Device Memory Read Transactions 1197193 1211628 1205184
155 dram_write_transactions Device Memory Write Transactions 696473 721863 709764
155 global_hit_rate Global Hit Rate in unified l1/tex 48.79% 49.16% 48.96%
155 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
155 gld_requested_throughput Requested Global Load Throughput 120.42GB/s 123.54GB/s 122.12GB/s
155 gst_requested_throughput Requested Global Store Throughput 48.122GB/s 49.371GB/s 48.803GB/s
155 gld_throughput Global Load Throughput 462.91GB/s 475.12GB/s 469.50GB/s
155 gst_throughput Global Store Throughput 197.84GB/s 202.97GB/s 200.64GB/s
155 local_memory_overhead Local Memory Overhead 36.36% 36.72% 36.57%
155 tex_cache_hit_rate Unified Cache Hit Rate 19.69% 20.01% 19.84%
155 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 16.97% 18.43% 17.66%
155 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 85.44% 86.43% 85.94%
155 dram_read_throughput Device Memory Read Throughput 357.12GB/s 367.86GB/s 363.07GB/s
155 dram_write_throughput Device Memory Write Throughput 208.12GB/s 219.00GB/s 213.82GB/s
155 tex_cache_throughput Unified cache to SM throughput 373.23GB/s 383.92GB/s 379.00GB/s
155 l2_tex_read_throughput L2 Throughput (Texture Reads) 333.64GB/s 342.61GB/s 338.70GB/s
155 l2_tex_write_throughput L2 Throughput (Texture Writes) 197.84GB/s 202.97GB/s 200.64GB/s
155 l2_read_throughput L2 Throughput (Reads) 333.72GB/s 343.62GB/s 338.99GB/s
155 l2_write_throughput L2 Throughput (Writes) 226.39GB/s 237.74GB/s 231.85GB/s
155 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 sysmem_write_throughput System Memory Write Throughput 1.5209MB/s 1.5604MB/s 1.5424MB/s
155 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
155 gld_efficiency Global Memory Load Efficiency 26.00% 26.02% 26.01%
155 gst_efficiency Global Memory Store Efficiency 24.32% 24.32% 24.32%
155 tex_cache_transactions Unified cache to SM transactions 313827 315061 314515
155 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
155 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
155 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
155 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
155 flop_count_sp Floating Point Operations(Single Precision) 7884000 7884000 7884000
155 flop_count_sp_add Floating Point Operations(Single Precision Add) 540000 540000 540000
155 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 2052000 2052000 2052000
155 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 3240000 3240000 3240000
155 flop_count_sp_special Floating Point Operations(Single Precision Special) 108000 108000 108000
155 inst_executed Instructions Executed 1806000 5535000 3706587
155 inst_issued Instructions Issued 1832832 1840144 1836104
155 dram_utilization Device Memory Utilization High (7) High (8) High (7)
155 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
155 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 0.50% 2.11% 1.26%
155 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 1.49% 1.79% 1.60%
155 stall_memory_dependency Issue Stall Reasons (Data Request) 89.63% 92.75% 91.27%
155 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
155 stall_sync Issue Stall Reasons (Synchronization) 2.43% 2.98% 2.61%
155 stall_other Issue Stall Reasons (Other) 0.16% 0.21% 0.18%
155 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.72% 1.65% 1.09%
155 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 0.15% 0.26% 0.20%
155 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
155 inst_fp_32 FP Instructions(Single) 6048000 6048000 6048000
155 inst_fp_64 FP Instructions(Double) 0 0 0
155 inst_integer Integer Instructions 17523000 17523000 17523000
155 inst_bit_convert Bit-Convert Instructions 0 0 0
155 inst_control Control-Flow Instructions 1350000 1350000 1350000
155 inst_compute_ld_st Load/Store Instructions 4347000 4347000 4347000
155 inst_misc Misc Instructions 1917000 1917000 1917000
155 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
155 issue_slots Issue Slots 1832832 1840144 1836104
155 cf_issued Issued Control-Flow Instructions 94500 94500 94500
155 cf_executed Executed Control-Flow Instructions 94500 94500 94500
155 ldst_issued Issued Load/Store Instructions 283500 283500 283500
155 ldst_executed Executed Load/Store Instructions 283500 283500 283500
155 atomic_transactions Atomic Transactions 0 0 0
155 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
155 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
155 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
155 l2_tex_read_transactions L2 Transactions (Texture Reads) 1120563 1127677 1124298
155 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.94% 2.84% 1.35%
155 stall_not_selected Issue Stall Reasons (Not Selected) 0.37% 0.51% 0.43%
155 l2_tex_write_transactions L2 Transactions (Texture Writes) 666000 666000 666000
155 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
155 nvlink_total_data_received NVLink Total Data Received 864 864 864
155 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
155 nvlink_user_data_received NVLink User Data Received 0 0 0
155 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
155 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
155 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
155 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
155 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
155 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
155 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
155 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
155 nvlink_transmit_throughput NVLink Transmit Throughput 10.951MB/s 11.235MB/s 11.105MB/s
155 nvlink_receive_throughput NVLink Receive Throughput 8.2129MB/s 8.4260MB/s 8.3291MB/s
155 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
155 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
155 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
155 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
155 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
155 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
155 inst_fp_16 HP Instructions(Half) 0 0 0
155 ipc Executed IPC 0.155944 0.440132 0.280529
155 issued_ipc Issued IPC 0.158479 0.189305 0.171594
155 issue_slot_utilization Issue Slot Utilization 3.96% 4.73% 4.29%
155 sm_efficiency Multiprocessor Activity 83.05% 94.29% 91.73%
155 achieved_occupancy Achieved Occupancy 0.282278 0.286532 0.283861
155 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.231217 0.276409 0.249759
155 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
155 l2_utilization L2 Cache Utilization Low (1) Low (2) Low (1)
155 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
155 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
155 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
155 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 special_fu_utilization Special Function Unit Utilization Low (1) Low (1) Low (1)
155 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
155 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
155 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
155 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.38% 0.59% 0.52%
155 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
155 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
155 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
155 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
155 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
155 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_gpu_solution_update__29
51 inst_per_warp Instructions per warp 146.996861 146.996861 146.996861
51 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
51 warp_execution_efficiency Warp Execution Efficiency 100.00% 100.00% 100.00%
51 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 96.00% 96.00% 96.00%
51 inst_replay_overhead Instruction Replay Overhead 0.001421 0.002164 0.001742
51 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
51 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
51 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
51 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
51 gld_transactions_per_request Global Load Transactions Per Request 3.999989 3.999989 3.999989
51 gst_transactions_per_request Global Store Transactions Per Request 3.999989 3.999989 3.999989
51 shared_store_transactions Shared Store Transactions 0 0 0
51 shared_load_transactions Shared Load Transactions 0 0 0
51 local_load_transactions Local Load Transactions 0 0 0
51 local_store_transactions Local Store Transactions 0 0 0
51 gld_transactions Global Load Transactions 2997000 2997000 2997000
51 gst_transactions Global Store Transactions 2247750 2247750 2247750
51 sysmem_read_transactions System Memory Read Transactions 0 0 0
51 sysmem_write_transactions System Memory Write Transactions 5 5 5
51 l2_read_transactions L2 Read Transactions 2997096 2997712 2997235
51 l2_write_transactions L2 Write Transactions 2247782 2271570 2258118
51 dram_read_transactions Device Memory Read Transactions 2997005 2997075 2997030
51 dram_write_transactions Device Memory Write Transactions 781631 804089 792464
51 global_hit_rate Global Hit Rate in unified l1/tex 42.78% 42.81% 42.80%
51 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
51 gld_requested_throughput Requested Global Load Throughput 569.92GB/s 576.75GB/s 574.71GB/s
51 gst_requested_throughput Requested Global Store Throughput 427.44GB/s 432.56GB/s 431.03GB/s
51 gld_throughput Global Load Throughput 569.92GB/s 576.75GB/s 574.71GB/s
51 gst_throughput Global Store Throughput 427.44GB/s 432.56GB/s 431.03GB/s
51 local_memory_overhead Local Memory Overhead 42.76% 42.82% 42.80%
51 tex_cache_hit_rate Unified Cache Hit Rate 0.00% 0.00% 0.00%
51 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 0.00% 0.00% 0.00%
51 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 100.00% 100.00% 100.00%
51 dram_read_throughput Device Memory Read Throughput 569.92GB/s 576.76GB/s 574.72GB/s
51 dram_write_throughput Device Memory Write Throughput 148.75GB/s 154.60GB/s 151.96GB/s
51 tex_cache_throughput Unified cache to SM throughput 712.41GB/s 720.95GB/s 718.40GB/s
51 l2_tex_read_throughput L2 Throughput (Texture Reads) 569.92GB/s 576.75GB/s 574.71GB/s
51 l2_tex_write_throughput L2 Throughput (Texture Writes) 427.44GB/s 432.56GB/s 431.03GB/s
51 l2_read_throughput L2 Throughput (Reads) 569.94GB/s 576.82GB/s 574.76GB/s
51 l2_write_throughput L2 Throughput (Writes) 429.41GB/s 436.70GB/s 433.02GB/s
51 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
51 sysmem_write_throughput System Memory Write Throughput 997.01KB/s 0.9853MB/s 0.9818MB/s
51 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
51 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
51 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
51 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
51 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
51 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
51 tex_cache_transactions Unified cache to SM transactions 936572 936572 936572
51 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
51 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
51 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
51 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
51 flop_count_sp Floating Point Operations(Single Precision) 53946000 53946000 53946000
51 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
51 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 17982000 17982000 17982000
51 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 17982000 17982000 17982000
51 flop_count_sp_special Floating Point Operations(Single Precision Special) 0 0 0
51 inst_executed Instructions Executed 7492590 27535452 18889511
51 inst_issued Instructions Issued 7502946 7508803 7505542
51 dram_utilization Device Memory Utilization High (9) High (9) High (9)
51 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
51 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 0.27% 0.43% 0.33%
51 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 2.28% 2.72% 2.48%
51 stall_memory_dependency Issue Stall Reasons (Data Request) 94.67% 95.69% 95.28%
51 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
51 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
51 stall_other Issue Stall Reasons (Other) 0.10% 0.13% 0.11%
51 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.09% 0.28% 0.15%
51 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 0.23% 0.32% 0.27%
51 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
51 inst_fp_32 FP Instructions(Single) 35964000 35964000 35964000
51 inst_fp_64 FP Instructions(Double) 0 0 0
51 inst_integer Integer Instructions 113887440 113887440 113887440
51 inst_bit_convert Bit-Convert Instructions 0 0 0
51 inst_control Control-Flow Instructions 5994240 5994240 5994240
51 inst_compute_ld_st Load/Store Instructions 41958000 41958000 41958000
51 inst_misc Misc Instructions 29970480 29970480 29970480
51 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
51 issue_slots Issue Slots 7502946 7508803 7505542
51 cf_issued Issued Control-Flow Instructions 561953 561953 561953
51 cf_executed Executed Control-Flow Instructions 561953 561953 561953
51 ldst_issued Issued Load/Store Instructions 1685831 1685831 1685831
51 ldst_executed Executed Load/Store Instructions 1685831 1685831 1685831
51 atomic_transactions Atomic Transactions 0 0 0
51 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
51 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
51 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
51 l2_tex_read_transactions L2 Transactions (Texture Reads) 2997000 2997000 2997000
51 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.86% 1.04% 0.94%
51 stall_not_selected Issue Stall Reasons (Not Selected) 0.40% 0.51% 0.45%
51 l2_tex_write_transactions L2 Transactions (Texture Writes) 2247750 2247750 2247750
51 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
51 nvlink_total_data_received NVLink Total Data Received 864 864 864
51 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
51 nvlink_user_data_received NVLink User Data Received 0 0 0
51 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
51 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
51 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
51 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
51 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
51 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
51 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
51 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
51 nvlink_transmit_throughput NVLink Transmit Throughput 7.0102MB/s 7.0942MB/s 7.0691MB/s
51 nvlink_receive_throughput NVLink Receive Throughput 5.2577MB/s 5.3207MB/s 5.3018MB/s
51 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
51 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
51 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
51 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
51 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
51 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
51 inst_fp_16 HP Instructions(Half) 0 0 0
51 ipc Executed IPC 0.346575 0.474004 0.408312
51 issued_ipc Issued IPC 0.402231 0.475105 0.434735
51 issue_slot_utilization Issue Slot Utilization 10.06% 11.88% 10.87%
51 sm_efficiency Multiprocessor Activity 96.02% 99.38% 97.18%
51 achieved_occupancy Achieved Occupancy 0.892468 0.896008 0.894539
51 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.644478 0.779110 0.703747
51 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
51 l2_utilization L2 Cache Utilization Low (1) Low (3) Low (1)
51 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
51 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (2) Low (1)
51 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
51 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
51 special_fu_utilization Special Function Unit Utilization Idle (0) Idle (0) Idle (0)
51 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
51 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
51 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
51 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
51 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.51% 2.57% 2.04%
51 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
51 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
51 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
51 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
51 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
51 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_copy_kernel__5
104 inst_per_warp Instructions per warp 562.981353 562.981353 562.981353
104 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
104 warp_execution_efficiency Warp Execution Efficiency 100.00% 100.00% 100.00%
104 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 95.13% 95.13% 95.13%
104 inst_replay_overhead Instruction Replay Overhead 0.000559 0.001271 0.000821
104 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
104 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
104 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
104 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
104 gld_transactions_per_request Global Load Transactions Per Request 3.999989 3.999989 3.999989
104 gst_transactions_per_request Global Store Transactions Per Request 3.999989 3.999989 3.999989
104 shared_store_transactions Shared Store Transactions 0 0 0
104 shared_load_transactions Shared Load Transactions 0 0 0
104 local_load_transactions Local Load Transactions 0 0 0
104 local_store_transactions Local Store Transactions 0 0 0
104 gld_transactions Global Load Transactions 749250 749250 749250
104 gst_transactions Global Store Transactions 749250 749250 749250
104 sysmem_read_transactions System Memory Read Transactions 0 0 0
104 sysmem_write_transactions System Memory Write Transactions 5 5 5
104 l2_read_transactions L2 Read Transactions 749394 750078 749678
104 l2_write_transactions L2 Write Transactions 749270 773201 758698
104 dram_read_transactions Device Memory Read Transactions 749262 749401 749322
104 dram_write_transactions Device Memory Write Transactions 739825 765772 754082
104 global_hit_rate Global Hit Rate in unified l1/tex 0.00% 0.00% 0.00%
104 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
104 gld_requested_throughput Requested Global Load Throughput 269.99GB/s 311.11GB/s 295.07GB/s
104 gst_requested_throughput Requested Global Store Throughput 269.99GB/s 311.11GB/s 295.07GB/s
104 gld_throughput Global Load Throughput 269.99GB/s 311.11GB/s 295.07GB/s
104 gst_throughput Global Store Throughput 269.99GB/s 311.11GB/s 295.07GB/s
104 local_memory_overhead Local Memory Overhead 0.00% 0.00% 0.00%
104 tex_cache_hit_rate Unified Cache Hit Rate 0.00% 0.00% 0.00%
104 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 0.00% 0.00% 0.00%
104 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 0.00% 0.00% 0.00%
104 dram_read_throughput Device Memory Read Throughput 269.99GB/s 311.11GB/s 295.09GB/s
104 dram_write_throughput Device Memory Write Throughput 268.76GB/s 316.40GB/s 296.97GB/s
104 tex_cache_throughput Unified cache to SM throughput 539.99GB/s 622.23GB/s 590.14GB/s
104 l2_tex_read_throughput L2 Throughput (Texture Reads) 269.99GB/s 311.11GB/s 295.07GB/s
104 l2_tex_write_throughput L2 Throughput (Texture Writes) 269.99GB/s 311.11GB/s 295.07GB/s
104 l2_read_throughput L2 Throughput (Reads) 270.28GB/s 311.31GB/s 295.23GB/s
104 l2_write_throughput L2 Throughput (Writes) 270.00GB/s 320.89GB/s 298.79GB/s
104 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
104 sysmem_write_throughput System Memory Write Throughput 1.8450MB/s 2.1259MB/s 2.0163MB/s
104 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
104 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
104 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
104 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
104 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
104 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
104 tex_cache_transactions Unified cache to SM transactions 374633 374633 374633
104 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
104 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
104 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
104 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
104 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
104 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
104 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 0 0 0
104 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
104 flop_count_sp_special Floating Point Operations(Single Precision Special) 11988000 11988000 11988000
104 inst_executed Instructions Executed 25474645 105457667 69311493
104 inst_issued Instructions Issued 25489110 25507044 25495751
104 dram_utilization Device Memory Utilization High (7) High (8) High (7)
104 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
104 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 2.48% 3.29% 2.75%
104 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 19.61% 20.54% 20.06%
104 stall_memory_dependency Issue Stall Reasons (Data Request) 26.96% 34.25% 31.29%
104 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
104 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
104 stall_other Issue Stall Reasons (Other) 5.19% 6.43% 5.71%
104 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.53% 1.39% 0.82%
104 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 15.62% 18.00% 16.58%
104 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
104 inst_fp_32 FP Instructions(Single) 11988000 11988000 11988000
104 inst_fp_64 FP Instructions(Double) 0 0 0
104 inst_integer Integer Instructions 617762746 617762746 617762746
104 inst_bit_convert Bit-Convert Instructions 23976000 23976000 23976000
104 inst_control Control-Flow Instructions 53946240 53946240 53946240
104 inst_compute_ld_st Load/Store Instructions 11988000 11988000 11988000
104 inst_misc Misc Instructions 23976480 23976480 23976480
104 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
104 issue_slots Issue Slots 25489110 25507044 25495751
104 cf_issued Issued Control-Flow Instructions 2435083 2435083 2435083
104 cf_executed Executed Control-Flow Instructions 2435083 2435083 2435083
104 ldst_issued Issued Load/Store Instructions 749266 749266 749266
104 ldst_executed Executed Load/Store Instructions 749266 749266 749266
104 atomic_transactions Atomic Transactions 0 0 0
104 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
104 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
104 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
104 l2_tex_read_transactions L2 Transactions (Texture Reads) 749250 749250 749250
104 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.83% 0.85% 0.84%
104 stall_not_selected Issue Stall Reasons (Not Selected) 20.34% 24.04% 21.95%
104 l2_tex_write_transactions L2 Transactions (Texture Writes) 749250 749250 749250
104 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
104 nvlink_total_data_received NVLink Total Data Received 864 864 864
104 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
104 nvlink_user_data_received NVLink User Data Received 0 0 0
104 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
104 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
104 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
104 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
104 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
104 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
104 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
104 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
104 nvlink_transmit_throughput NVLink Transmit Throughput 13.284MB/s 15.307MB/s 14.518MB/s
104 nvlink_receive_throughput NVLink Receive Throughput 9.9628MB/s 11.480MB/s 10.888MB/s
104 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
104 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
104 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
104 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
104 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
104 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
104 inst_fp_16 HP Instructions(Half) 0 0 0
104 ipc Executed IPC 0.473342 3.118486 1.914070
104 issued_ipc Issued IPC 3.028094 3.120229 3.072590
104 issue_slot_utilization Issue Slot Utilization 75.70% 78.01% 76.81%
104 sm_efficiency Multiprocessor Activity 82.84% 99.61% 95.29%
104 achieved_occupancy Achieved Occupancy 0.912039 0.922041 0.916920
104 eligible_warps_per_cycle Eligible Warps Per Active Cycle 13.953308 15.857738 14.756881
104 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
104 l2_utilization L2 Cache Utilization Low (1) Low (2) Low (1)
104 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
104 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
104 cf_fu_utilization Control-Flow Function Unit Utilization Low (2) Low (2) Low (2)
104 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
104 special_fu_utilization Special Function Unit Utilization Low (3) Low (3) Low (3)
104 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
104 single_precision_fu_utilization Single-Precision Function Unit Utilization Mid (6) High (7) Mid (6)
104 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
104 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
104 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
104 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
104 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
104 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
104 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
104 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
104 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
======== Error: Application returned non-zero code 130
lucas@ip-172-31-35-196 ~/research/code/ClimateMachine.jl lcw/diff_nstate* 23m 51s
❯
The following kernels have local loads and stores (the local_load_transactions and local_store_transactions metrics in the nvprof output):
ptxcall_gpu_interface_gradients__17
ptxcall_gpu_interface_tendency__23
ptxcall_gpu_volume_tendency__22
ptxcall_gpu_volume_gradients__16
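
A list like the one above can be pulled out of the nvprof metrics text mechanically rather than by eye. The following is a minimal sketch, not the procedure actually used here. It assumes the output was saved to a text file and keeps the layout shown above: a `Kernel: <name>` header line, then rows of invocation count, metric name, description, and the Min, Max, and Avg columns. The function name `kernels_with_local_traffic` and the file name in the usage note are hypothetical.

```julia
# Hypothetical helper: scan nvprof metrics text and return the kernels whose
# local load or local store transaction counts are nonzero.
function kernels_with_local_traffic(lines)
    metrics = ("local_load_transactions", "local_store_transactions")
    found = String[]
    kernel = nothing
    for line in lines
        s = strip(line)
        if startswith(s, "Kernel:")
            # Remember which kernel the following metric rows belong to.
            kernel = String(strip(s[length("Kernel:")+1:end]))
            continue
        end
        kernel === nothing && continue
        fields = split(s)
        length(fields) >= 5 || continue
        # Exact match on the metric name, so the *_per_request rows are skipped.
        fields[2] in metrics || continue
        # The last three columns are Min, Max, Avg; a nonzero Max means at
        # least one invocation touched local memory.
        maxval = tryparse(Float64, fields[end-1])
        if maxval !== nothing && maxval > 0 && !(kernel in found)
            push!(found, kernel)
        end
    end
    return found
end
```

Assuming the profile was written to a file such as `nvprof_metrics.txt` (a placeholder name), calling `kernels_with_local_traffic(eachline("nvprof_metrics.txt"))` would return the kernel names in the order they first appear. Checking the Max column rather than Avg is a design choice so that a kernel is reported even if only some of its invocations use local memory.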