@lcw
Last active June 25, 2019 03:26
==50304== Profiling application: julia --project=env/gpu test/DGmethods/compressible_Navier_Stokes/dycoms3d-profiling.jl
==50304== Profiling result:
==50304== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla V100-SXM2-16GB (0)"
Kernel: ptxcall_update__10
55 inst_per_warp Instructions per warp 1.3458e+03 1.3458e+03 1.3458e+03
55 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
55 warp_execution_efficiency Warp Execution Efficiency 100.00% 100.00% 100.00%
55 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 95.54% 95.54% 95.54%
55 inst_replay_overhead Instruction Replay Overhead 0.000950 0.001755 0.001265
55 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
55 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
55 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
55 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
55 gld_transactions_per_request Global Load Transactions Per Request 7.999988 7.999988 7.999988
55 gst_transactions_per_request Global Store Transactions Per Request 7.999988 7.999988 7.999988
55 shared_store_transactions Shared Store Transactions 0 0 0
55 shared_load_transactions Shared Load Transactions 0 0 0
55 local_load_transactions Local Load Transactions 0 0 0
55 local_store_transactions Local Store Transactions 0 0 0
55 gld_transactions Global Load Transactions 2067189 2067189 2067189
55 gst_transactions Global Store Transactions 1378126 1378126 1378126
55 sysmem_read_transactions System Memory Read Transactions 0 0 0
55 sysmem_write_transactions System Memory Write Transactions 5 5 5
55 l2_read_transactions L2 Read Transactions 1378270 1379638 1378862
55 l2_write_transactions L2 Write Transactions 1378169 1402235 1389351
55 dram_read_transactions Device Memory Read Transactions 1378141 1378725 1378273
55 dram_write_transactions Device Memory Write Transactions 1374528 1389753 1383038
55 global_hit_rate Global Hit Rate in unified l1/tex 60.00% 60.00% 60.00%
55 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
55 gld_requested_throughput Requested Global Load Throughput 505.05GB/s 513.70GB/s 510.28GB/s
55 gst_requested_throughput Requested Global Store Throughput 336.70GB/s 342.47GB/s 340.19GB/s
55 gld_throughput Global Load Throughput 505.05GB/s 513.70GB/s 510.28GB/s
55 gst_throughput Global Store Throughput 336.70GB/s 342.47GB/s 340.19GB/s
55 local_memory_overhead Local Memory Overhead 50.00% 50.00% 50.00%
55 tex_cache_hit_rate Unified Cache Hit Rate 20.00% 20.00% 20.00%
55 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 0.00% 0.00% 0.00%
55 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 100.00% 100.00% 100.00%
55 dram_read_throughput Device Memory Read Throughput 336.75GB/s 342.52GB/s 340.22GB/s
55 dram_write_throughput Device Memory Write Throughput 336.95GB/s 345.28GB/s 341.40GB/s
55 tex_cache_throughput Unified cache to SM throughput 589.24GB/s 599.33GB/s 595.34GB/s
55 l2_tex_read_throughput L2 Throughput (Texture Reads) 336.70GB/s 342.47GB/s 340.19GB/s
55 l2_tex_write_throughput L2 Throughput (Texture Writes) 336.70GB/s 342.47GB/s 340.19GB/s
55 l2_read_throughput L2 Throughput (Reads) 337.03GB/s 342.50GB/s 340.37GB/s
55 l2_write_throughput L2 Throughput (Writes) 338.45GB/s 347.09GB/s 342.96GB/s
55 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
55 sysmem_write_throughput System Memory Write Throughput 1.2509MB/s 1.2723MB/s 1.2639MB/s
55 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
55 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
55 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
55 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
55 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
55 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
55 tex_cache_transactions Unified cache to SM transactions 602942 602942 602942
55 flop_count_dp Floating Point Operations(Double Precision) 11025000 11025000 11025000
55 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
55 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 2756250 2756250 2756250
55 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 5512500 5512500 5512500
55 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
55 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
55 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 0 0 0
55 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
55 flop_count_sp_special Floating Point Operations(Single Precision Special) 11025000 11025000 11025000
55 inst_executed Instructions Executed 24203846 115936074 77575324
55 inst_issued Instructions Issued 24226813 24246334 24233940
55 dram_utilization Device Memory Utilization High (9) High (9) High (9)
55 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
55 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 3.35% 5.34% 4.22%
55 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 18.76% 19.45% 19.12%
55 stall_memory_dependency Issue Stall Reasons (Data Request) 36.51% 38.35% 37.17%
55 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
55 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
55 stall_other Issue Stall Reasons (Other) 4.41% 4.58% 4.50%
55 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.63% 1.92% 1.27%
55 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 15.04% 15.59% 15.32%
55 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
55 inst_fp_32 FP Instructions(Single) 11025000 11025000 11025000
55 inst_fp_64 FP Instructions(Double) 8268750 8268750 8268750
55 inst_integer Integer Instructions 520110936 520110936 520110936
55 inst_bit_convert Bit-Convert Instructions 22050000 22050000 22050000
55 inst_control Control-Flow Instructions 63394108 63394108 63394108
55 inst_compute_ld_st Load/Store Instructions 13781250 13781250 13781250
55 inst_misc Misc Instructions 77178938 77178938 77178938
55 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
55 issue_slots Issue Slots 24226813 24246334 24233940
55 cf_issued Issued Control-Flow Instructions 2497879 2497879 2497879
55 cf_executed Executed Control-Flow Instructions 2497879 2497879 2497879
55 ldst_issued Issued Load/Store Instructions 602953 602953 602953
55 ldst_executed Executed Load/Store Instructions 602953 602953 602953
55 atomic_transactions Atomic Transactions 0 0 0
55 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
55 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
55 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
55 l2_tex_read_transactions L2 Transactions (Texture Reads) 1378126 1378134 1378126
55 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 1.68% 1.90% 1.78%
55 stall_not_selected Issue Stall Reasons (Not Selected) 16.31% 16.91% 16.62%
55 l2_tex_write_transactions L2 Transactions (Texture Writes) 1378126 1378126 1378126
55 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
55 nvlink_total_data_received NVLink Total Data Received 864 864 864
55 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
55 nvlink_user_data_received NVLink User Data Received 0 0 0
55 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
55 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
55 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
55 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
55 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
55 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
55 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
55 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
55 nvlink_transmit_throughput NVLink Transmit Throughput 9.0065MB/s 9.1608MB/s 9.0998MB/s
55 nvlink_receive_throughput NVLink Receive Throughput 6.7549MB/s 6.8706MB/s 6.8248MB/s
55 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
55 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
55 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
55 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
55 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
55 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
55 inst_fp_16 HP Instructions(Half) 0 0 0
55 ipc Executed IPC 0.447668 1.793692 1.232652
55 issued_ipc Issued IPC 1.731246 1.792476 1.763761
55 issue_slot_utilization Issue Slot Utilization 43.28% 44.81% 44.09%
55 sm_efficiency Multiprocessor Activity 84.76% 98.88% 93.25%
55 achieved_occupancy Achieved Occupancy 0.487811 0.488845 0.488430
55 eligible_warps_per_cycle Eligible Warps Per Active Cycle 6.244738 6.471777 6.368446
55 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
55 l2_utilization L2 Cache Utilization Low (1) Low (2) Low (1)
55 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
55 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
55 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
55 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
55 special_fu_utilization Special Function Unit Utilization Low (2) Low (2) Low (2)
55 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
55 single_precision_fu_utilization Single-Precision Function Unit Utilization Mid (4) Mid (4) Mid (4)
55 double_precision_fu_utilization Double-Precision Function Unit Utilization Low (1) Low (1) Low (1)
55 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
55 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
55 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.07% 1.17% 1.04%
55 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
55 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
55 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
55 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
55 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_volumeviscterms__6
55 inst_per_warp Instructions per warp 6.9090e+03 7.0545e+03 6.9283e+03
55 branch_efficiency Branch Efficiency 99.17% 99.21% 99.20%
55 warp_execution_efficiency Warp Execution Efficiency 83.95% 85.33% 85.16%
55 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 80.85% 82.17% 82.01%
55 inst_replay_overhead Instruction Replay Overhead 0.000437 0.000749 0.000581
55 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 2.583021 2.625531 2.596400
55 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 2.042786 2.051396 2.046694
55 local_load_transactions_per_request Local Memory Load Transactions Per Request 3.404154 3.532058 3.441186
55 local_store_transactions_per_request Local Memory Store Transactions Per Request 3.407473 3.535293 3.444447
55 gld_transactions_per_request Global Load Transactions Per Request 8.123902 8.130336 8.126401
55 gst_transactions_per_request Global Store Transactions Per Request 8.576923 8.576923 8.576923
55 shared_store_transactions Shared Store Transactions 187681 188472 188039
55 shared_load_transactions Shared Load Transactions 3417337 3473577 3435037
55 local_load_transactions Local Load Transactions 477579 531935 487495
55 local_store_transactions Local Store Transactions 490366 546125 500544
55 gld_transactions Global Load Transactions 1940597 1942134 1941193
55 gst_transactions Global Store Transactions 1639050 1639050 1639050
55 sysmem_read_transactions System Memory Read Transactions 0 0 0
55 sysmem_write_transactions System Memory Write Transactions 5 5 5
55 l2_read_transactions L2 Read Transactions 1925994 1941476 1932356
55 l2_write_transactions L2 Write Transactions 2335074 2434745 2360811
55 dram_read_transactions Device Memory Read Transactions 2054129 2090998 2058486
55 dram_write_transactions Device Memory Write Transactions 1788685 1822611 1801692
55 global_hit_rate Global Hit Rate in unified l1/tex 16.93% 16.97% 16.95%
55 local_hit_rate Local Hit Rate 68.86% 69.00% 68.93%
55 gld_requested_throughput Requested Global Load Throughput 219.57GB/s 235.29GB/s 232.81GB/s
55 gst_requested_throughput Requested Global Store Throughput 176.20GB/s 188.81GB/s 186.82GB/s
55 gld_throughput Global Load Throughput 229.06GB/s 245.48GB/s 242.91GB/s
55 gst_throughput Global Store Throughput 193.44GB/s 207.29GB/s 205.10GB/s
55 local_memory_overhead Local Memory Overhead 26.66% 27.83% 26.89%
55 tex_cache_hit_rate Unified Cache Hit Rate 12.72% 13.40% 12.85%
55 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 6.82% 6.89% 6.87%
55 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 32.00% 32.58% 32.12%
55 dram_read_throughput Device Memory Read Throughput 246.78GB/s 260.25GB/s 257.59GB/s
55 dram_write_throughput Device Memory Write Throughput 214.35GB/s 228.42GB/s 225.45GB/s
55 tex_cache_throughput Unified cache to SM throughput 1593.7GB/s 1702.4GB/s 1685.2GB/s
55 l2_tex_read_throughput L2 Throughput (Texture Reads) 228.29GB/s 243.57GB/s 241.20GB/s
55 l2_tex_write_throughput L2 Throughput (Texture Writes) 257.90GB/s 272.08GB/s 267.74GB/s
55 l2_read_throughput L2 Throughput (Reads) 228.39GB/s 243.91GB/s 241.81GB/s
55 l2_write_throughput L2 Throughput (Writes) 286.04GB/s 304.16GB/s 295.42GB/s
55 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
55 sysmem_write_throughput System Memory Write Throughput 618.77KB/s 663.05KB/s 656.06KB/s
55 local_load_throughput Local Memory Load Throughput 59.548GB/s 65.572GB/s 61.003GB/s
55 local_store_throughput Local Memory Store Throughput 61.142GB/s 67.321GB/s 62.636GB/s
55 shared_load_throughput Shared Memory Load Throughput 1639.8GB/s 1739.5GB/s 1719.4GB/s
55 shared_store_throughput Shared Memory Store Throughput 88.975GB/s 95.147GB/s 94.122GB/s
55 gld_efficiency Global Memory Load Efficiency 95.80% 95.87% 95.84%
55 gst_efficiency Global Memory Store Efficiency 91.09% 91.09% 91.09%
55 tex_cache_transactions Unified cache to SM transactions 3365294 3375850 3366734
55 flop_count_dp Floating Point Operations(Double Precision) 723595584 727663363 725245612
55 flop_count_dp_add Floating Point Operations(Double Precision Add) 120487291 121178932 120767785
55 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 273283088 274806204 273900989
55 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 56542117 56872023 56675848
55 flop_count_sp Floating Point Operations(Single Precision) 19650878 19811928 19716193
55 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
55 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 9825439 9905964 9858096
55 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
55 flop_count_sp_special Floating Point Operations(Single Precision Special) 14517099 14630709 14563180
55 inst_executed Instructions Executed 37354715 102459761 69031730
55 inst_issued Instructions Issued 37372022 38074489 37469414
55 dram_utilization Device Memory Utilization Mid (6) Mid (6) Mid (6)
55 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
55 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 36.07% 38.13% 37.12%
55 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 23.54% 24.71% 24.18%
55 stall_memory_dependency Issue Stall Reasons (Data Request) 8.83% 10.20% 9.43%
55 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
55 stall_sync Issue Stall Reasons (Synchronization) 5.73% 7.31% 6.73%
55 stall_other Issue Stall Reasons (Other) 1.04% 1.12% 1.07%
55 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.29% 1.92% 0.93%
55 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 12.68% 14.12% 13.25%
55 shared_efficiency Shared Memory Efficiency 32.55% 33.07% 32.90%
55 inst_fp_32 FP Instructions(Single) 95329917 96084604 95635847
55 inst_fp_64 FP Instructions(Double) 462250620 464914511 463331027
55 inst_integer Integer Instructions 277652376 279027567 278210378
55 inst_bit_convert Bit-Convert Instructions 5526943 5572199 5545273
55 inst_control Control-Flow Instructions 72594343 73135864 72814035
55 inst_compute_ld_st Load/Store Instructions 61247230 61368395 61296874
55 inst_misc Misc Instructions 26288320 26354491 26315168
55 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
55 issue_slots Issue Slots 37372022 38074489 37469414
55 cf_issued Issued Control-Flow Instructions 2879052 2946863 2888005
55 cf_executed Executed Control-Flow Instructions 2879052 2946863 2888005
55 ldst_issued Issued Load/Store Instructions 2351574 2373273 2354553
55 ldst_executed Executed Load/Store Instructions 2351574 2373273 2354553
55 atomic_transactions Atomic Transactions 0 0 0
55 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
55 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
55 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
55 l2_tex_read_transactions L2 Transactions (Texture Reads) 1925640 1934329 1927544
55 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 1.28% 2.01% 1.60%
55 stall_not_selected Issue Stall Reasons (Not Selected) 5.50% 5.91% 5.69%
55 l2_tex_write_transactions L2 Transactions (Texture Writes) 2129416 2185175 2139594
55 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
55 nvlink_total_data_received NVLink Total Data Received 864 864 864
55 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
55 nvlink_user_data_received NVLink User Data Received 0 0 0
55 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
55 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
55 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
55 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
55 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
55 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
55 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
55 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
55 nvlink_transmit_throughput NVLink Transmit Throughput 4.3507MB/s 4.6621MB/s 4.6130MB/s
55 nvlink_receive_throughput NVLink Receive Throughput 3.2630MB/s 3.4966MB/s 3.4597MB/s
55 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
55 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
55 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
55 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
55 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
55 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
55 inst_fp_16 HP Instructions(Half) 0 0 0
55 ipc Executed IPC 0.572609 1.393068 0.987988
55 issued_ipc Issued IPC 1.338923 1.411893 1.364409
55 issue_slot_utilization Issue Slot Utilization 33.47% 35.30% 34.11%
55 sm_efficiency Multiprocessor Activity 90.99% 96.66% 93.87%
55 achieved_occupancy Achieved Occupancy 0.302693 0.305265 0.304339
55 eligible_warps_per_cycle Eligible Warps Per Active Cycle 2.318277 2.473052 2.373496
55 shared_utilization Shared Memory Utilization Low (1) Low (1) Low (1)
55 l2_utilization L2 Cache Utilization Low (1) Low (1) Low (1)
55 tex_utilization Unified Cache Utilization Low (2) Low (2) Low (2)
55 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
55 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
55 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
55 special_fu_utilization Special Function Unit Utilization Low (1) Low (1) Low (1)
55 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
55 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (2) Low (2) Low (2)
55 double_precision_fu_utilization Double-Precision Function Unit Utilization High (7) High (7) High (7)
55 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
55 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.08% 0.54% 0.51%
55 flop_dp_efficiency FLOP Efficiency(Peak Double) 6.19% 39.88% 37.15%
55 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
55 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
55 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
55 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
55 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_anonymous19_2
1 inst_per_warp Instructions per warp 136.997806 136.997806 136.997806
1 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
1 warp_execution_efficiency Warp Execution Efficiency 100.00% 100.00% 100.00%
1 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 95.71% 95.71% 95.71%
1 inst_replay_overhead Instruction Replay Overhead 0.007605 0.007605 0.007605
1 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
1 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
1 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
1 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
1 gld_transactions_per_request Global Load Transactions Per Request 0.000000 0.000000 0.000000
1 gst_transactions_per_request Global Store Transactions Per Request 7.999988 7.999988 7.999988
1 shared_store_transactions Shared Store Transactions 0 0 0
1 shared_load_transactions Shared Load Transactions 0 0 0
1 local_load_transactions Local Load Transactions 0 0 0
1 local_store_transactions Local Store Transactions 0 0 0
1 gld_transactions Global Load Transactions 0 0 0
1 gst_transactions Global Store Transactions 689063 689063 689063
1 sysmem_read_transactions System Memory Read Transactions 0 0 0
1 sysmem_write_transactions System Memory Write Transactions 5 5 5
1 l2_read_transactions L2 Read Transactions 96 96 96
1 l2_write_transactions L2 Write Transactions 689106 689106 689106
1 dram_read_transactions Device Memory Read Transactions 9 9 9
1 dram_write_transactions Device Memory Write Transactions 669700 669700 669700
1 global_hit_rate Global Hit Rate in unified l1/tex 0.00% 0.00% 0.00%
1 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
1 gld_requested_throughput Requested Global Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 gst_requested_throughput Requested Global Store Throughput 738.08GB/s 738.08GB/s 738.08GB/s
1 gld_throughput Global Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 gst_throughput Global Store Throughput 738.08GB/s 738.08GB/s 738.08GB/s
1 local_memory_overhead Local Memory Overhead 0.00% 0.00% 0.00%
1 tex_cache_hit_rate Unified Cache Hit Rate 0.00% 0.00% 0.00%
1 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 0.00% 0.00% 0.00%
1 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 0.00% 0.00% 0.00%
1 dram_read_throughput Device Memory Read Throughput 9.8716MB/s 9.8716MB/s 9.8716MB/s
1 dram_write_throughput Device Memory Write Throughput 717.34GB/s 717.34GB/s 717.34GB/s
1 tex_cache_throughput Unified cache to SM throughput 369.05GB/s 369.05GB/s 369.05GB/s
1 l2_tex_read_throughput L2 Throughput (Texture Reads) 0.00000B/s 0.00000B/s 0.00000B/s
1 l2_tex_write_throughput L2 Throughput (Texture Writes) 738.08GB/s 738.08GB/s 738.08GB/s
1 l2_read_throughput L2 Throughput (Reads) 105.30MB/s 105.30MB/s 105.30MB/s
1 l2_write_throughput L2 Throughput (Writes) 738.13GB/s 738.13GB/s 738.13GB/s
1 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 sysmem_write_throughput System Memory Write Throughput 5.4842MB/s 5.4842MB/s 5.4842MB/s
1 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 gld_efficiency Global Memory Load Efficiency 0.00% 0.00% 0.00%
1 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
1 tex_cache_transactions Unified cache to SM transactions 86136 86136 86136
1 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
1 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
1 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
1 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
1 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
1 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
1 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 0 0 0
1 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
1 flop_count_sp_special Floating Point Operations(Single Precision Special) 0 0 0
1 inst_executed Instructions Executed 11800443 11800443 11800443
1 inst_issued Instructions Issued 2603705 2603705 2603705
1 dram_utilization Device Memory Utilization High (9) High (9) High (9)
1 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
1 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 1.35% 1.35% 1.35%
1 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 14.74% 14.74% 14.74%
1 stall_memory_dependency Issue Stall Reasons (Data Request) 0.00% 0.00% 0.00%
1 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
1 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
1 stall_other Issue Stall Reasons (Other) 0.88% 0.88% 0.88%
1 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 1.00% 1.00% 1.00%
1 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 21.42% 21.42% 21.42%
1 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
1 inst_fp_32 FP Instructions(Single) 0 0 0
1 inst_fp_64 FP Instructions(Double) 0 0 0
1 inst_integer Integer Instructions 52370280 52370280 52370280
1 inst_bit_convert Bit-Convert Instructions 0 0 0
1 inst_control Control-Flow Instructions 2756352 2756352 2756352
1 inst_compute_ld_st Load/Store Instructions 2756250 2756250 2756250
1 inst_misc Misc Instructions 19294158 19294158 19294158
1 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
1 issue_slots Issue Slots 2603705 2603705 2603705
1 cf_issued Issued Control-Flow Instructions 258405 258405 258405
1 cf_executed Executed Control-Flow Instructions 258405 258405 258405
1 ldst_issued Issued Load/Store Instructions 258405 258405 258405
1 ldst_executed Executed Load/Store Instructions 258405 258405 258405
1 atomic_transactions Atomic Transactions 0 0 0
1 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
1 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
1 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
1 l2_tex_read_transactions L2 Transactions (Texture Reads) 0 0 0
1 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 57.24% 57.24% 57.24%
1 stall_not_selected Issue Stall Reasons (Not Selected) 3.35% 3.35% 3.35%
1 l2_tex_write_transactions L2 Transactions (Texture Writes) 689063 689063 689063
1 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
1 nvlink_total_data_received NVLink Total Data Received 864 864 864
1 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
1 nvlink_user_data_received NVLink User Data Received 0 0 0
1 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
1 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
1 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
1 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
1 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
1 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
1 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
1 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
1 nvlink_transmit_throughput NVLink Transmit Throughput 39.486MB/s 39.486MB/s 39.486MB/s
1 nvlink_receive_throughput NVLink Receive Throughput 29.615MB/s 29.615MB/s 29.615MB/s
1 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
1 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
1 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
1 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
1 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
1 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
1 inst_fp_16 HP Instructions(Half) 0 0 0
1 ipc Executed IPC 1.042892 1.042892 1.042892
1 issued_ipc Issued IPC 1.049568 1.049568 1.049568
1 issue_slot_utilization Issue Slot Utilization 26.24% 26.24% 26.24%
1 sm_efficiency Multiprocessor Activity 80.13% 80.13% 80.13%
1 achieved_occupancy Achieved Occupancy 0.902661 0.902661 0.902661
1 eligible_warps_per_cycle Eligible Warps Per Active Cycle 2.472440 2.472440 2.472440
1 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
1 l2_utilization L2 Cache Utilization Low (2) Low (2) Low (2)
1 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
1 ldst_fu_utilization Load/Store Function Unit Utilization Low (2) Low (2) Low (2)
1 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
1 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 special_fu_utilization Special Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (3) Low (3) Low (3)
1 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
1 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
1 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
1 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
1 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
1 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
1 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.01% 0.01% 0.01%
1 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_faceviscterms__7
55 inst_per_warp Instructions per warp 7.1985e+04 7.5107e+04 7.2514e+04
55 branch_efficiency Branch Efficiency 99.41% 99.51% 99.50%
55 warp_execution_efficiency Warp Execution Efficiency 65.90% 68.32% 67.97%
55 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 63.38% 65.71% 65.38%
55 inst_replay_overhead Instruction Replay Overhead 0.000272 0.000378 0.000320
55 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
55 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
55 local_load_transactions_per_request Local Memory Load Transactions Per Request 3.121848 3.165834 3.142720
55 local_store_transactions_per_request Local Memory Store Transactions Per Request 3.126148 3.170088 3.146907
55 gld_transactions_per_request Global Load Transactions Per Request 14.519745 14.522633 14.521275
55 gst_transactions_per_request Global Store Transactions Per Request 14.000000 14.000000 14.000000
55 shared_store_transactions Shared Store Transactions 0 0 0
55 shared_load_transactions Shared Load Transactions 0 0 0
55 local_load_transactions Local Load Transactions 1227080 1417762 1265009
55 local_store_transactions Local Store Transactions 1260788 1456264 1299678
55 gld_transactions Global Load Transactions 12166094 12168514 12167376
55 gst_transactions Global Store Transactions 4939200 4939200 4939200
55 sysmem_read_transactions System Memory Read Transactions 0 0 0
55 sysmem_write_transactions System Memory Write Transactions 5 5 5
55 l2_read_transactions L2 Read Transactions 9928032 9978195 9950922
55 l2_write_transactions L2 Write Transactions 6862243 7214653 6931546
55 dram_read_transactions Device Memory Read Transactions 12498267 12665274 12542459
55 dram_write_transactions Device Memory Write Transactions 5560021 5660544 5584314
55 global_hit_rate Global Hit Rate in unified l1/tex 43.04% 43.30% 43.14%
55 local_hit_rate Local Hit Rate 73.56% 74.07% 73.77%
55 gld_requested_throughput Requested Global Load Throughput 120.16GB/s 127.44GB/s 126.58GB/s
55 gst_requested_throughput Requested Global Store Throughput 51.907GB/s 55.048GB/s 54.677GB/s
55 gld_throughput Global Load Throughput 286.42GB/s 303.75GB/s 301.71GB/s
55 gst_throughput Global Store Throughput 116.27GB/s 123.31GB/s 122.48GB/s
55 local_memory_overhead Local Memory Overhead 39.57% 40.74% 39.89%
55 tex_cache_hit_rate Unified Cache Hit Rate 18.28% 18.97% 18.49%
55 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 7.87% 8.35% 8.14%
55 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 87.21% 88.22% 87.95%
55 dram_read_throughput Device Memory Read Throughput 298.15GB/s 312.56GB/s 311.01GB/s
55 dram_write_throughput Device Memory Write Throughput 133.25GB/s 139.27GB/s 138.47GB/s
55 tex_cache_throughput Unified cache to SM throughput 316.93GB/s 330.56GB/s 329.00GB/s
55 l2_tex_read_throughput L2 Throughput (Texture Reads) 234.04GB/s 248.15GB/s 246.54GB/s
55 l2_tex_write_throughput L2 Throughput (Texture Writes) 150.55GB/s 155.74GB/s 154.71GB/s
55 l2_read_throughput L2 Throughput (Reads) 233.81GB/s 248.35GB/s 246.75GB/s
55 l2_write_throughput L2 Throughput (Writes) 169.84GB/s 174.95GB/s 171.88GB/s
55 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
55 sysmem_write_throughput System Memory Write Throughput 123.42KB/s 130.89KB/s 130.01KB/s
55 local_load_throughput Local Memory Load Throughput 30.496GB/s 34.102GB/s 31.368GB/s
55 local_store_throughput Local Memory Store Throughput 31.334GB/s 35.029GB/s 32.228GB/s
55 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
55 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
55 gld_efficiency Global Memory Load Efficiency 41.95% 41.96% 41.95%
55 gst_efficiency Global Memory Store Efficiency 44.64% 44.64% 44.64%
55 tex_cache_transactions Unified cache to SM transactions 3306277 3365770 3316951
55 flop_count_dp Floating Point Operations(Double Precision) 1495402080 1503915166 1498295745
55 flop_count_dp_add Floating Point Operations(Double Precision Add) 279845340 281293146 280337374
55 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 546832880 550020060 547916336
55 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 121890980 122581900 122125697
55 flop_count_sp Floating Point Operations(Single Precision) 46071960 46409076 46186530
55 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
55 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 23035980 23204538 23093265
55 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
55 flop_count_sp_special Floating Point Operations(Single Precision Special) 34294660 34532436 34375478
55 inst_executed Instructions Executed 94069547 276018492 179120731
55 inst_issued Instructions Issued 94095762 97848526 94734143
55 dram_utilization Device Memory Utilization Mid (6) Mid (6) Mid (6)
55 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
55 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 26.81% 30.37% 27.90%
55 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 23.96% 25.72% 24.41%
55 stall_memory_dependency Issue Stall Reasons (Data Request) 39.10% 44.24% 43.06%
55 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
55 stall_sync Issue Stall Reasons (Synchronization) 0.01% 0.01% 0.01%
55 stall_other Issue Stall Reasons (Other) 0.29% 0.32% 0.30%
55 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.11% 0.28% 0.20%
55 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 2.57% 2.90% 2.71%
55 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
55 inst_fp_32 FP Instructions(Single) 226646960 228227470 227183900
55 inst_fp_64 FP Instructions(Double) 977232900 982808782 979127956
55 inst_integer Integer Instructions 549128955 551903363 550072439
55 inst_bit_convert Bit-Convert Instructions 13269330 13364190 13301534
55 inst_control Control-Flow Instructions 165941555 167074665 166326756
55 inst_compute_ld_st Load/Store Instructions 38525590 38775290 38612460
55 inst_misc Misc Instructions 36400285 36538723 36447351
55 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
55 issue_slots Issue Slots 94095762 97848526 94734143
55 cf_issued Issued Control-Flow Instructions 7807239 8172847 7869029
55 cf_executed Executed Control-Flow Instructions 7807239 8172847 7869029
55 ldst_issued Issued Load/Store Instructions 2317865 2438523 2338483
55 ldst_executed Executed Load/Store Instructions 2317865 2438523 2338483
55 atomic_transactions Atomic Transactions 0 0 0
55 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
55 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
55 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
55 l2_tex_read_transactions L2 Transactions (Texture Reads) 9915023 9961665 9942257
55 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.07% 0.21% 0.13%
55 stall_not_selected Issue Stall Reasons (Not Selected) 1.20% 1.36% 1.27%
55 l2_tex_write_transactions L2 Transactions (Texture Writes) 6199988 6395464 6238878
55 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
55 nvlink_total_data_received NVLink Total Data Received 864 864 864
55 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
55 nvlink_user_data_received NVLink User Data Received 0 0 0
55 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
55 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
55 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
55 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
55 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
55 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
55 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
55 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
55 nvlink_transmit_throughput NVLink Transmit Throughput 888.62KB/s 942.41KB/s 936.05KB/s
55 nvlink_receive_throughput NVLink Receive Throughput 666.47KB/s 706.80KB/s 702.04KB/s
55 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
55 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
55 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
55 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
55 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
55 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
55 inst_fp_16 HP Instructions(Half) 0 0 0
55 ipc Executed IPC 0.509044 0.716988 0.642518
55 issued_ipc Issued IPC 0.686952 0.734776 0.700049
55 issue_slot_utilization Issue Slot Utilization 17.17% 18.37% 17.50%
55 sm_efficiency Multiprocessor Activity 88.49% 93.16% 91.36%
55 achieved_occupancy Achieved Occupancy 0.160858 0.167531 0.163760
55 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.811133 0.867836 0.824014
55 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
55 l2_utilization L2 Cache Utilization Low (1) Low (1) Low (1)
55 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
55 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
55 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
55 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
55 special_fu_utilization Special Function Unit Utilization Low (1) Low (1) Low (1)
55 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
55 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
55 double_precision_fu_utilization Double-Precision Function Unit Utilization Mid (4) Mid (4) Mid (4)
55 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
55 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.07% 0.25% 0.22%
55 flop_dp_efficiency FLOP Efficiency(Peak Double) 4.28% 16.18% 14.48%
55 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
55 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
55 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
55 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
55 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_volumerhs__8
55 inst_per_warp Instructions per warp 9.2360e+03 9.3814e+03 9.2553e+03
55 branch_efficiency Branch Efficiency 99.37% 99.40% 99.39%
55 warp_execution_efficiency Warp Execution Efficiency 87.34% 88.42% 88.29%
55 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 84.16% 85.20% 85.07%
55 inst_replay_overhead Instruction Replay Overhead 0.000277 0.000514 0.000400
55 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 2.426895 2.482983 2.450226
55 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 2.023625 2.029701 2.026903
55 local_load_transactions_per_request Local Memory Load Transactions Per Request 3.404154 3.532058 3.441186
55 local_store_transactions_per_request Local Memory Store Transactions Per Request 3.407473 3.535293 3.444447
55 gld_transactions_per_request Global Load Transactions Per Request 8.189466 8.195449 8.192852
55 gst_transactions_per_request Global Store Transactions Per Request 8.562483 8.562483 8.562483
55 shared_store_transactions Shared Store Transactions 542888 544518 543767
55 shared_load_transactions Shared Load Transactions 6421564 6569973 6483297
55 local_load_transactions Local Load Transactions 477579 531935 487495
55 local_store_transactions Local Store Transactions 490366 546125 500544
55 gld_transactions Global Load Transactions 5447428 5451408 5449680
55 gst_transactions Global Store Transactions 755211 755211 755211
55 sysmem_read_transactions System Memory Read Transactions 0 0 0
55 sysmem_write_transactions System Memory Write Transactions 5 5 5
55 l2_read_transactions L2 Read Transactions 4725058 4746776 4734693
55 l2_write_transactions L2 Write Transactions 1402773 1514494 1428750
55 dram_read_transactions Device Memory Read Transactions 4817485 4855850 4823111
55 dram_write_transactions Device Memory Write Transactions 957298 1002218 973880
55 global_hit_rate Global Hit Rate in unified l1/tex 21.66% 22.12% 21.90%
55 local_hit_rate Local Hit Rate 67.54% 67.73% 67.64%
55 gld_requested_throughput Requested Global Load Throughput 410.40GB/s 419.41GB/s 414.62GB/s
55 gst_requested_throughput Requested Global Store Throughput 54.478GB/s 55.674GB/s 55.038GB/s
55 gld_throughput Global Load Throughput 430.90GB/s 440.32GB/s 435.28GB/s
55 gst_throughput Global Store Throughput 59.708GB/s 61.019GB/s 60.321GB/s
55 local_memory_overhead Local Memory Overhead 18.65% 19.69% 18.99%
55 tex_cache_hit_rate Unified Cache Hit Rate 19.43% 19.78% 19.52%
55 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 6.93% 7.02% 6.98%
55 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 46.31% 49.50% 47.49%
55 dram_read_throughput Device Memory Read Throughput 381.74GB/s 389.50GB/s 385.24GB/s
55 dram_write_throughput Device Memory Write Throughput 76.010GB/s 79.991GB/s 77.787GB/s
55 tex_cache_throughput Unified cache to SM throughput 2186.3GB/s 2231.2GB/s 2206.8GB/s
55 l2_tex_read_throughput L2 Throughput (Texture Reads) 373.94GB/s 381.78GB/s 377.66GB/s
55 l2_tex_write_throughput L2 Throughput (Texture Writes) 98.680GB/s 103.68GB/s 100.30GB/s
55 l2_read_throughput L2 Throughput (Reads) 374.08GB/s 382.06GB/s 378.18GB/s
55 l2_write_throughput L2 Throughput (Writes) 111.19GB/s 120.55GB/s 114.12GB/s
55 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
55 sysmem_write_throughput System Memory Write Throughput 414.51KB/s 423.61KB/s 418.76KB/s
55 local_load_throughput Local Memory Load Throughput 37.866GB/s 42.342GB/s 38.938GB/s
55 local_store_throughput Local Memory Store Throughput 38.879GB/s 43.471GB/s 39.980GB/s
55 shared_load_throughput Shared Memory Load Throughput 2040.8GB/s 2106.2GB/s 2071.4GB/s
55 shared_store_throughput Shared Memory Store Throughput 172.04GB/s 175.89GB/s 173.73GB/s
55 gld_efficiency Global Memory Load Efficiency 95.22% 95.29% 95.25%
55 gst_efficiency Global Memory Store Efficiency 91.24% 91.24% 91.24%
55 tex_cache_transactions Unified cache to SM transactions 6901702 6914234 6907171
55 flop_count_dp Floating Point Operations(Double Precision) 878585426 882653191 880235453
55 flop_count_dp_add Floating Point Operations(Double Precision Add) 121462832 122154473 121743326
55 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 318879842 320402953 319497743
55 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 119362910 119692812 119496641
55 flop_count_sp Floating Point Operations(Single Precision) 20456046 20617096 20521361
55 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
55 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 9825439 9905964 9858096
55 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 805168 805168 805168
55 flop_count_sp_special Floating Point Operations(Single Precision Special) 16643601 16757210 16689682
55 inst_executed Instructions Executed 49111854 136552365 76027433
55 inst_issued Instructions Issued 49128113 49827653 49224979
55 dram_utilization Device Memory Utilization Mid (6) Mid (6) Mid (6)
55 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
55 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 27.99% 32.20% 29.92%
55 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 30.12% 32.69% 31.45%
55 stall_memory_dependency Issue Stall Reasons (Data Request) 17.02% 20.75% 18.83%
55 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
55 stall_sync Issue Stall Reasons (Synchronization) 3.92% 5.23% 4.65%
55 stall_other Issue Stall Reasons (Other) 0.87% 0.95% 0.91%
55 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.21% 0.83% 0.48%
55 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 6.43% 7.13% 6.80%
55 shared_efficiency Shared Memory Efficiency 28.17% 28.77% 28.52%
55 inst_fp_32 FP Instructions(Single) 107449087 108203773 107755017
55 inst_fp_64 FP Instructions(Double) 573940583 576604463 575020990
55 inst_integer Integer Instructions 413358963 414711037 413907544
55 inst_bit_convert Bit-Convert Instructions 8515404 8560660 8533734
55 inst_control Control-Flow Instructions 94443051 94984567 94662742
55 inst_compute_ld_st Load/Store Instructions 118669105 118790270 118718749
55 inst_misc Misc Instructions 39860442 39943152 39894002
55 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
55 issue_slots Issue Slots 49128113 49827653 49224979
55 cf_issued Issued Control-Flow Instructions 3870910 3938714 3879863
55 cf_executed Executed Control-Flow Instructions 3870910 3938714 3879863
55 ldst_issued Issued Load/Store Instructions 4262574 4284272 4265553
55 ldst_executed Executed Load/Store Instructions 4262574 4284272 4265553
55 atomic_transactions Atomic Transactions 0 0 0
55 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
55 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
55 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
55 l2_tex_read_transactions L2 Transactions (Texture Reads) 4724141 4735618 4728203
55 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 3.00% 4.62% 3.67%
55 stall_not_selected Issue Stall Reasons (Not Selected) 3.16% 3.41% 3.29%
55 l2_tex_write_transactions L2 Transactions (Texture Writes) 1245577 1301336 1255755
55 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
55 nvlink_total_data_received NVLink Total Data Received 864 864 864
55 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
55 nvlink_user_data_received NVLink User Data Received 0 0 0
55 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
55 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
55 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
55 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
55 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
55 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
55 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
55 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
55 nvlink_transmit_throughput NVLink Transmit Throughput 2.9145MB/s 2.9785MB/s 2.9444MB/s
55 nvlink_receive_throughput NVLink Receive Throughput 2.1859MB/s 2.2339MB/s 2.2083MB/s
55 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
55 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
55 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
55 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
55 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
55 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
55 inst_fp_16 HP Instructions(Half) 0 0 0
55 ipc Executed IPC 0.524193 1.157060 0.776593
55 issued_ipc Issued IPC 1.089128 1.160794 1.123778
55 issue_slot_utilization Issue Slot Utilization 27.23% 29.02% 28.09%
55 sm_efficiency Multiprocessor Activity 95.34% 97.74% 96.50%
55 achieved_occupancy Achieved Occupancy 0.184673 0.185497 0.185105
55 eligible_warps_per_cycle Eligible Warps Per Active Cycle 1.419163 1.526940 1.470013
55 shared_utilization Shared Memory Utilization Low (1) Low (1) Low (1)
55 l2_utilization L2 Cache Utilization Low (1) Low (1) Low (1)
55 tex_utilization Unified Cache Utilization Low (2) Low (2) Low (2)
55 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (2) Low (1)
55 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
55 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
55 special_fu_utilization Special Function Unit Utilization Low (1) Low (1) Low (1)
55 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
55 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (2) Low (2) Low (2)
55 double_precision_fu_utilization Double-Precision Function Unit Utilization Mid (5) Mid (5) Mid (5)
55 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
55 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.06% 0.36% 0.33%
55 flop_dp_efficiency FLOP Efficiency(Peak Double) 5.17% 31.14% 27.95%
55 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
55 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
55 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
55 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
55 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_facerhs__9
55 inst_per_warp Instructions per warp 1.2956e+05 1.3581e+05 1.3062e+05
55 branch_efficiency Branch Efficiency 99.34% 99.44% 99.43%
55 warp_execution_efficiency Warp Execution Efficiency 64.56% 67.19% 66.81%
55 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 62.16% 64.69% 64.33%
55 inst_replay_overhead Instruction Replay Overhead 0.000243 0.000276 0.000259
55 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
55 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
55 local_load_transactions_per_request Local Memory Load Transactions Per Request 3.450529 3.476831 3.467822
55 local_store_transactions_per_request Local Memory Store Transactions Per Request 3.372245 3.398068 3.387072
55 gld_transactions_per_request Global Load Transactions Per Request 14.303273 14.304891 14.303965
55 gst_transactions_per_request Global Store Transactions Per Request 14.000000 14.000000 14.000000
55 shared_store_transactions Shared Store Transactions 0 0 0
55 shared_load_transactions Shared Load Transactions 0 0 0
55 local_load_transactions Local Load Transactions 4975064 5413956 5062461
55 local_store_transactions Local Store Transactions 3814784 4205736 3892565
55 gld_transactions Global Load Transactions 18270028 18272095 18270912
55 gst_transactions Global Store Transactions 1852200 1852200 1852200
55 sysmem_read_transactions System Memory Read Transactions 0 0 0
55 sysmem_write_transactions System Memory Write Transactions 5 5 5
55 l2_read_transactions L2 Read Transactions 17610003 17713325 17655886
55 l2_write_transactions L2 Write Transactions 7018930 7678734 7151163
55 dram_read_transactions Device Memory Read Transactions 21222067 21722794 21379870
55 dram_write_transactions Device Memory Write Transactions 4348250 4539694 4390603
55 global_hit_rate Global Hit Rate in unified l1/tex 23.67% 23.95% 23.84%
55 local_hit_rate Local Hit Rate 55.97% 58.04% 56.47%
55 gld_requested_throughput Requested Global Load Throughput 93.915GB/s 97.237GB/s 96.371GB/s
55 gst_requested_throughput Requested Global Store Throughput 9.8912GB/s 10.241GB/s 10.150GB/s
55 gld_throughput Global Load Throughput 218.56GB/s 226.28GB/s 224.27GB/s
55 gst_throughput Global Store Throughput 22.156GB/s 22.940GB/s 22.736GB/s
55 local_memory_overhead Local Memory Overhead 33.57% 34.92% 33.97%
55 tex_cache_hit_rate Unified Cache Hit Rate 20.62% 21.37% 20.81%
55 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 10.10% 11.08% 10.66%
55 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 56.85% 57.42% 57.07%
55 dram_read_throughput Device Memory Read Throughput 258.29GB/s 264.81GB/s 262.44GB/s
55 dram_write_throughput Device Memory Write Throughput 53.539GB/s 54.305GB/s 53.894GB/s
55 tex_cache_throughput Unified cache to SM throughput 292.40GB/s 296.65GB/s 295.00GB/s
55 l2_tex_read_throughput L2 Throughput (Texture Reads) 209.56GB/s 216.21GB/s 214.45GB/s
55 l2_tex_write_throughput L2 Throughput (Texture Writes) 69.683GB/s 72.466GB/s 70.517GB/s
55 l2_read_throughput L2 Throughput (Reads) 211.77GB/s 218.46GB/s 216.72GB/s
55 l2_write_throughput L2 Throughput (Writes) 86.426GB/s 91.854GB/s 87.780GB/s
55 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
55 sysmem_write_throughput System Memory Write Throughput 62.716KB/s 64.935KB/s 64.354KB/s
55 local_load_throughput Local Memory Load Throughput 61.219GB/s 64.763GB/s 62.141GB/s
55 local_store_throughput Local Memory Store Throughput 46.970GB/s 50.310GB/s 47.781GB/s
55 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
55 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
55 gld_efficiency Global Memory Load Efficiency 42.97% 42.97% 42.97%
55 gst_efficiency Global Memory Store Efficiency 44.64% 44.64% 44.64%
55 tex_cache_transactions Unified cache to SM transactions 5982302 6116460 6008213
55 flop_count_dp Floating Point Operations(Double Precision) 2537817584 2554843700 2543604914
55 flop_count_dp_add Floating Point Operations(Double Precision Add) 427569432 430465044 428553500
55 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 942979098 949353438 945146011
55 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 224289956 225671780 224759391
55 flop_count_sp Floating Point Operations(Single Precision) 78639716 79313948 78868857
55 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
55 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 38356910 38694026 38471480
55 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 1925896 1925896 1925896
55 flop_count_sp_special Floating Point Operations(Single Precision Special) 62667964 63143512 62829600
55 inst_executed Instructions Executed 164688747 492168684 331340037
55 inst_issued Instructions Issued 164730224 172248391 166003730
55 dram_utilization Device Memory Utilization Mid (4) Mid (4) Mid (4)
55 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
55 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 43.15% 44.59% 43.79%
55 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 35.00% 36.02% 35.49%
55 stall_memory_dependency Issue Stall Reasons (Data Request) 15.95% 17.08% 16.51%
55 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
55 stall_sync Issue Stall Reasons (Synchronization) 0.01% 0.01% 0.01%
55 stall_other Issue Stall Reasons (Other) 0.26% 0.27% 0.27%
55 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.17% 0.25% 0.20%
55 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 1.39% 1.48% 1.42%
55 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
55 inst_fp_32 FP Instructions(Single) 412256022 415420326 413332157
55 inst_fp_64 FP Instructions(Double) 1653206450 1664358170 1656996562
55 inst_integer Integer Instructions 901637641 907132852 903506492
55 inst_bit_convert Bit-Convert Instructions 31003638 31193358 31068047
55 inst_control Control-Flow Instructions 306895511 309161711 307665913
55 inst_compute_ld_st Load/Store Instructions 72762311 73316191 72955004
55 inst_misc Misc Instructions 73672652 74025455 73792638
55 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
55 issue_slots Issue Slots 164730224 172248391 166003730
55 cf_issued Issued Control-Flow Instructions 14591309 15322497 14714889
55 cf_executed Executed Control-Flow Instructions 14591309 15322497 14714889
55 ldst_issued Issued Load/Store Instructions 4293820 4551500 4337921
55 ldst_executed Executed Load/Store Instructions 4293820 4551500 4337921
55 atomic_transactions Atomic Transactions 0 0 0
55 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
55 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
55 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
55 l2_tex_read_transactions L2 Transactions (Texture Reads) 17428160 17529445 17470960
55 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 1.33% 1.81% 1.58%
55 stall_not_selected Issue Stall Reasons (Not Selected) 0.71% 0.75% 0.72%
55 l2_tex_write_transactions L2 Transactions (Texture Writes) 5666984 6057936 5744765
55 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
55 nvlink_total_data_received NVLink Total Data Received 864 864 864
55 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
55 nvlink_user_data_received NVLink User Data Received 0 0 0
55 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
55 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
55 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
55 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
55 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
55 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
55 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
55 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
55 nvlink_transmit_throughput NVLink Transmit Throughput 451.56KB/s 467.53KB/s 463.36KB/s
55 nvlink_receive_throughput NVLink Receive Throughput 338.67KB/s 350.65KB/s 347.52KB/s
55 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
55 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
55 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
55 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
55 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
55 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
55 inst_fp_16 HP Instructions(Half) 0 0 0
55 ipc Executed IPC 0.487200 0.609204 0.579057
55 issued_ipc Issued IPC 0.593597 0.612294 0.603865
55 issue_slot_utilization Issue Slot Utilization 14.84% 15.31% 15.10%
55 sm_efficiency Multiprocessor Activity 89.27% 93.73% 91.77%
55 achieved_occupancy Achieved Occupancy 0.105739 0.109860 0.107469
55 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.641693 0.661028 0.649953
55 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
55 l2_utilization L2 Cache Utilization Low (1) Low (1) Low (1)
55 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
55 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
55 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
55 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
55 special_fu_utilization Special Function Unit Utilization Low (1) Low (1) Low (1)
55 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
55 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
55 double_precision_fu_utilization Double-Precision Function Unit Utilization Low (3) Low (3) Low (3)
55 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
55 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.06% 0.21% 0.19%
55 flop_dp_efficiency FLOP Efficiency(Peak Double) 4.04% 13.67% 12.50%
55 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
55 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
55 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
55 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
55 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_knl_reverse_indefinite_stack_integral__4
56 inst_per_warp Instructions per warp 2.3300e+03 2.3300e+03 2.3300e+03
56 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
56 warp_execution_efficiency Warp Execution Efficiency 78.12% 78.12% 78.12%
56 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 77.18% 77.18% 77.18%
56 inst_replay_overhead Instruction Replay Overhead 0.006509 0.006509 0.006509
56 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
56 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
56 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
56 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
56 gld_transactions_per_request Global Load Transactions Per Request 6.588689 6.588689 6.588689
56 gst_transactions_per_request Global Store Transactions Per Request 7.000000 7.000000 7.000000
56 shared_store_transactions Shared Store Transactions 0 0 0
56 shared_load_transactions Shared Load Transactions 0 0 0
56 local_load_transactions Local Load Transactions 0 0 0
56 local_store_transactions Local Store Transactions 0 0 0
56 gld_transactions Global Load Transactions 121390 121390 121390
56 gst_transactions Global Store Transactions 128625 128625 128625
56 sysmem_read_transactions System Memory Read Transactions 0 0 0
56 sysmem_write_transactions System Memory Write Transactions 5 5 5
56 l2_read_transactions L2 Read Transactions 117961 119173 118541
56 l2_write_transactions L2 Write Transactions 133622 157226 144033
56 dram_read_transactions Device Memory Read Transactions 120938 121237 121073
56 dram_write_transactions Device Memory Write Transactions 135485 155260 144981
56 global_hit_rate Global Hit Rate in unified l1/tex 17.67% 17.68% 17.68%
56 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
56 gld_requested_throughput Requested Global Load Throughput 24.658GB/s 25.478GB/s 25.340GB/s
56 gst_requested_throughput Requested Global Store Throughput 24.592GB/s 25.410GB/s 25.273GB/s
56 gld_throughput Global Load Throughput 25.994GB/s 26.858GB/s 26.713GB/s
56 gst_throughput Global Store Throughput 27.543GB/s 28.459GB/s 28.305GB/s
56 local_memory_overhead Local Memory Overhead 16.50% 16.50% 16.50%
56 tex_cache_hit_rate Unified Cache Hit Rate 4.31% 4.31% 4.31%
56 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 6.24% 6.27% 6.25%
56 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 25.72% 25.72% 25.72%
56 dram_read_throughput Device Memory Read Throughput 25.901GB/s 26.760GB/s 26.643GB/s
56 dram_write_throughput Device Memory Write Throughput 29.012GB/s 34.175GB/s 31.905GB/s
56 tex_cache_throughput Unified cache to SM throughput 36.896GB/s 38.123GB/s 37.917GB/s
56 l2_tex_read_throughput L2 Throughput (Texture Reads) 25.239GB/s 26.079GB/s 25.938GB/s
56 l2_tex_write_throughput L2 Throughput (Texture Writes) 27.543GB/s 28.459GB/s 28.305GB/s
56 l2_read_throughput L2 Throughput (Reads) 25.517GB/s 26.365GB/s 26.086GB/s
56 l2_write_throughput L2 Throughput (Writes) 28.620GB/s 34.758GB/s 31.696GB/s
56 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
56 sysmem_write_throughput System Memory Write Throughput 1.0964MB/s 1.1328MB/s 1.1267MB/s
56 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
56 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
56 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
56 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
56 gld_efficiency Global Memory Load Efficiency 94.86% 94.86% 94.86%
56 gst_efficiency Global Memory Store Efficiency 89.29% 89.29% 89.29%
56 tex_cache_transactions Unified cache to SM transactions 43074 43078 43075
56 flop_count_dp Floating Point Operations(Double Precision) 459375 459375 459375
56 flop_count_dp_add Floating Point Operations(Double Precision Add) 459375 459375 459375
56 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
56 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
56 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
56 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
56 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 0 0 0
56 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
56 flop_count_sp_special Floating Point Operations(Single Precision Special) 0 0 0
56 inst_executed Instructions Executed 82810 114170 99050
56 inst_issued Instructions Issued 83349 83367 83349
56 dram_utilization Device Memory Utilization Low (1) Low (1) Low (1)
56 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
56 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 0.04% 0.34% 0.16%
56 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 2.81% 2.93% 2.84%
56 stall_memory_dependency Issue Stall Reasons (Data Request) 96.21% 96.74% 96.53%
56 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
56 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
56 stall_other Issue Stall Reasons (Other) 0.01% 0.01% 0.01%
56 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.35% 0.67% 0.46%
56 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 0.00% 0.00% 0.00%
56 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
56 inst_fp_32 FP Instructions(Single) 0 0 0
56 inst_fp_64 FP Instructions(Double) 459375 459375 459375
56 inst_integer Integer Instructions 665175 665175 665175
56 inst_bit_convert Bit-Convert Instructions 0 0 0
56 inst_control Control-Flow Instructions 9800 9800 9800
56 inst_compute_ld_st Load/Store Instructions 919975 919975 919975
56 inst_misc Misc Instructions 9800 9800 9800
56 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
56 issue_slots Issue Slots 83349 83367 83349
56 cf_issued Issued Control-Flow Instructions 637 637 637
56 cf_executed Executed Control-Flow Instructions 637 637 637
56 ldst_issued Issued Load/Store Instructions 36946 36946 36946
56 ldst_executed Executed Load/Store Instructions 36946 36946 36946
56 atomic_transactions Atomic Transactions 0 0 0
56 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
56 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
56 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
56 l2_tex_read_transactions L2 Transactions (Texture Reads) 117865 117869 117865
56 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.01% 0.01% 0.01%
56 stall_not_selected Issue Stall Reasons (Not Selected) 0.00% 0.00% 0.00%
56 l2_tex_write_transactions L2 Transactions (Texture Writes) 128625 128625 128625
56 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
56 nvlink_total_data_received NVLink Total Data Received 864 864 864
56 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
56 nvlink_user_data_received NVLink User Data Received 0 0 0
56 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
56 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
56 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
56 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
56 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
56 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
56 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
56 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
56 nvlink_transmit_throughput NVLink Transmit Throughput 7.8939MB/s 8.1564MB/s 8.1123MB/s
56 nvlink_receive_throughput NVLink Receive Throughput 5.9204MB/s 6.1173MB/s 6.0842MB/s
56 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
56 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
56 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
56 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
56 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
56 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
56 inst_fp_16 HP Instructions(Half) 0 0 0
56 ipc Executed IPC 0.008552 0.011722 0.010010
56 issued_ipc Issued IPC 0.008603 0.008870 0.008706
56 issue_slot_utilization Issue Slot Utilization 0.22% 0.22% 0.22%
56 sm_efficiency Multiprocessor Activity 55.77% 59.09% 57.42%
56 achieved_occupancy Achieved Occupancy 0.015625 0.015625 0.015625
56 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.008603 0.008870 0.008703
56 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
56 l2_utilization L2 Cache Utilization Low (1) Low (1) Low (1)
56 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
56 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
56 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
56 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
56 special_fu_utilization Special Function Unit Utilization Idle (0) Idle (0) Idle (0)
56 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
56 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
56 double_precision_fu_utilization Double-Precision Function Unit Utilization Low (1) Low (1) Low (1)
56 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
56 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
56 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.04% 0.04% 0.04%
56 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
56 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
56 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
56 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
56 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_initauxstate__1
1 inst_per_warp Instructions per warp 339.790000 339.790000 339.790000
1 branch_efficiency Branch Efficiency 99.92% 99.92% 99.92%
1 warp_execution_efficiency Warp Execution Efficiency 97.61% 97.61% 97.61%
1 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 94.35% 94.35% 94.35%
1 inst_replay_overhead Instruction Replay Overhead 0.008911 0.008911 0.008911
1 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
1 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
1 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
1 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
1 gld_transactions_per_request Global Load Transactions Per Request 8.149093 8.149093 8.149093
1 gst_transactions_per_request Global Store Transactions Per Request 8.562478 8.562478 8.562478
1 shared_store_transactions Shared Store Transactions 0 0 0
1 shared_load_transactions Shared Load Transactions 0 0 0
1 local_load_transactions Local Load Transactions 0 0 0
1 local_store_transactions Local Store Transactions 0 0 0
1 gld_transactions Global Load Transactions 718750 718750 718750
1 gst_transactions Global Store Transactions 881079 881079 881079
1 sysmem_read_transactions System Memory Read Transactions 0 0 0
1 sysmem_write_transactions System Memory Write Transactions 5 5 5
1 l2_read_transactions L2 Read Transactions 696127 696127 696127
1 l2_write_transactions L2 Write Transactions 946471 946471 946471
1 dram_read_transactions Device Memory Read Transactions 706128 706128 706128
1 dram_write_transactions Device Memory Write Transactions 845185 845185 845185
1 global_hit_rate Global Hit Rate in unified l1/tex 30.88% 30.88% 30.88%
1 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
1 gld_requested_throughput Requested Global Load Throughput 293.46GB/s 293.46GB/s 293.46GB/s
1 gst_requested_throughput Requested Global Store Throughput 342.37GB/s 342.37GB/s 342.37GB/s
1 gld_throughput Global Load Throughput 306.10GB/s 306.10GB/s 306.10GB/s
1 gst_throughput Global Store Throughput 375.24GB/s 375.24GB/s 375.24GB/s
1 local_memory_overhead Local Memory Overhead 29.80% 29.80% 29.80%
1 tex_cache_hit_rate Unified Cache Hit Rate 3.66% 3.66% 3.66%
1 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 5.86% 5.86% 5.86%
1 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 42.20% 42.20% 42.20%
1 dram_read_throughput Device Memory Read Throughput 300.73GB/s 300.73GB/s 300.73GB/s
1 dram_write_throughput Device Memory Write Throughput 359.95GB/s 359.95GB/s 359.95GB/s
1 tex_cache_throughput Unified cache to SM throughput 362.92GB/s 362.92GB/s 362.92GB/s
1 l2_tex_read_throughput L2 Throughput (Texture Reads) 296.14GB/s 296.14GB/s 296.14GB/s
1 l2_tex_write_throughput L2 Throughput (Texture Writes) 375.24GB/s 375.24GB/s 375.24GB/s
1 l2_read_throughput L2 Throughput (Reads) 296.47GB/s 296.47GB/s 296.47GB/s
1 l2_write_throughput L2 Throughput (Writes) 403.08GB/s 403.08GB/s 403.08GB/s
1 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 sysmem_write_throughput System Memory Write Throughput 2.1805MB/s 2.1805MB/s 2.1805MB/s
1 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 gld_efficiency Global Memory Load Efficiency 95.87% 95.87% 95.87%
1 gst_efficiency Global Memory Store Efficiency 91.24% 91.24% 91.24%
1 tex_cache_transactions Unified cache to SM transactions 213041 213041 213041
1 flop_count_dp Floating Point Operations(Double Precision) 3641925 3641925 3641925
1 flop_count_dp_add Floating Point Operations(Double Precision Add) 1058400 1058400 1058400
1 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 1047375 1047375 1047375
1 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 488775 488775 488775
1 flop_count_sp Floating Point Operations(Single Precision) 139650 139650 139650
1 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
1 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 69825 69825 69825
1 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
1 flop_count_sp_special Floating Point Operations(Single Precision Special) 69825 69825 69825
1 inst_executed Instructions Executed 4994913 4994913 4994913
1 inst_issued Instructions Issued 1754625 1754625 1754625
1 dram_utilization Device Memory Utilization High (8) High (8) High (8)
1 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
1 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 2.54% 2.54% 2.54%
1 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 2.57% 2.57% 2.57%
1 stall_memory_dependency Issue Stall Reasons (Data Request) 35.19% 35.19% 35.19%
1 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
1 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
1 stall_other Issue Stall Reasons (Other) 0.12% 0.12% 0.12%
1 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.90% 0.90% 0.90%
1 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 0.36% 0.36% 0.36%
1 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
1 inst_fp_32 FP Instructions(Single) 2995125 2995125 2995125
1 inst_fp_64 FP Instructions(Double) 3583125 3583125 3583125
1 inst_integer Integer Instructions 31755675 31755675 31755675
1 inst_bit_convert Bit-Convert Instructions 668850 668850 668850
1 inst_control Control-Flow Instructions 1977150 1977150 1977150
1 inst_compute_ld_st Load/Store Instructions 6390825 6390825 6390825
1 inst_misc Misc Instructions 5053125 5053125 5053125
1 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
1 issue_slots Issue Slots 1754625 1754625 1754625
1 cf_issued Issued Control-Flow Instructions 123235 123235 123235
1 cf_executed Executed Control-Flow Instructions 123235 123235 123235
1 ldst_issued Issued Load/Store Instructions 220500 220500 220500
1 ldst_executed Executed Load/Store Instructions 220500 220500 220500
1 atomic_transactions Atomic Transactions 0 0 0
1 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
1 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
1 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
1 l2_tex_read_transactions L2 Transactions (Texture Reads) 695353 695353 695353
1 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 57.61% 57.61% 57.61%
1 stall_not_selected Issue Stall Reasons (Not Selected) 0.72% 0.72% 0.72%
1 l2_tex_write_transactions L2 Transactions (Texture Writes) 881079 881079 881079
1 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
1 nvlink_total_data_received NVLink Total Data Received 864 864 864
1 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
1 nvlink_user_data_received NVLink User Data Received 0 0 0
1 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
1 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
1 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
1 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
1 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
1 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
1 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
1 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
1 nvlink_transmit_throughput NVLink Transmit Throughput 15.700MB/s 15.700MB/s 15.700MB/s
1 nvlink_receive_throughput NVLink Receive Throughput 11.775MB/s 11.775MB/s 11.775MB/s
1 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
1 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
1 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
1 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
1 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
1 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
1 inst_fp_16 HP Instructions(Half) 0 0 0
1 ipc Executed IPC 0.257710 0.257710 0.257710
1 issued_ipc Issued IPC 0.260006 0.260006 0.260006
1 issue_slot_utilization Issue Slot Utilization 6.50% 6.50% 6.50%
1 sm_efficiency Multiprocessor Activity 90.68% 90.68% 90.68%
1 achieved_occupancy Achieved Occupancy 0.560784 0.560784 0.560784
1 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.516023 0.516023 0.516023
1 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
1 l2_utilization L2 Cache Utilization Low (2) Low (2) Low (2)
1 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
1 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
1 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
1 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 special_fu_utilization Special Function Unit Utilization Low (1) Low (1) Low (1)
1 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
1 double_precision_fu_utilization Double-Precision Function Unit Utilization Low (1) Low (1) Low (1)
1 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
1 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.01% 0.01% 0.01%
1 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.74% 0.74% 0.74%
1 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
1 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
1 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
1 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
1 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_knl_indefinite_stack_integral__3
56 inst_per_warp Instructions per warp 1.5749e+06 1.6330e+06 1.5830e+06
56 branch_efficiency Branch Efficiency 99.21% 99.28% 99.26%
56 warp_execution_efficiency Warp Execution Efficiency 61.16% 62.95% 62.68%
56 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 58.95% 60.67% 60.42%
56 inst_replay_overhead Instruction Replay Overhead 0.000129 0.000145 0.000138
56 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 1.923077 1.923077 1.923077
56 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 2.000000 2.000000 2.000000
56 local_load_transactions_per_request Local Memory Load Transactions Per Request 3.304234 3.352957 3.323427
56 local_store_transactions_per_request Local Memory Store Transactions Per Request 3.304854 3.353892 3.324121
56 gld_transactions_per_request Global Load Transactions Per Request 6.511216 6.511515 6.511352
56 gst_transactions_per_request Global Store Transactions Per Request 7.000000 7.000000 7.000000
56 shared_store_transactions Shared Store Transactions 490 490 490
56 shared_load_transactions Shared Load Transactions 91875 91875 91875
56 local_load_transactions Local Load Transactions 686842 738459 697134
56 local_store_transactions Local Store Transactions 701583 754534 712147
56 gld_transactions Global Load Transactions 958744 958788 958764
56 gst_transactions Global Store Transactions 128625 128625 128625
56 sysmem_read_transactions System Memory Read Transactions 0 0 0
56 sysmem_write_transactions System Memory Write Transactions 5 5 5
56 l2_read_transactions L2 Read Transactions 926177 933135 929375
56 l2_write_transactions L2 Write Transactions 839151 908804 861088
56 dram_read_transactions Device Memory Read Transactions 938353 939794 939016
56 dram_write_transactions Device Memory Write Transactions 197412 214647 206049
56 global_hit_rate Global Hit Rate in unified l1/tex 12.67% 12.68% 12.68%
56 local_hit_rate Local Hit Rate 99.70% 99.72% 99.70%
56 gld_requested_throughput Requested Global Load Throughput 11.912GB/s 14.151GB/s 13.954GB/s
56 gst_requested_throughput Requested Global Store Throughput 1.4885GB/s 1.7683GB/s 1.7436GB/s
56 gld_throughput Global Load Throughput 12.426GB/s 14.762GB/s 14.556GB/s
56 gst_throughput Global Store Throughput 1.6671GB/s 1.9805GB/s 1.9528GB/s
56 local_memory_overhead Local Memory Overhead 45.93% 47.51% 46.25%
56 tex_cache_hit_rate Unified Cache Hit Rate 31.04% 31.76% 31.19%
56 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 7.64% 7.65% 7.64%
56 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 88.12% 88.85% 88.27%
56 dram_read_throughput Device Memory Read Throughput 12.176GB/s 14.449GB/s 14.257GB/s
56 dram_write_throughput Device Memory Write Throughput 2.7641GB/s 3.2842GB/s 3.1283GB/s
56 tex_cache_throughput Unified cache to SM throughput 34.260GB/s 39.986GB/s 39.406GB/s
56 l2_tex_read_throughput L2 Throughput (Texture Reads) 12.000GB/s 14.256GB/s 14.057GB/s
56 l2_tex_write_throughput L2 Throughput (Texture Writes) 11.447GB/s 13.181GB/s 12.765GB/s
56 l2_read_throughput L2 Throughput (Reads) 12.088GB/s 14.312GB/s 14.110GB/s
56 l2_write_throughput L2 Throughput (Writes) 11.562GB/s 13.621GB/s 13.073GB/s
56 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
56 sysmem_write_throughput System Memory Write Throughput 67.952KB/s 80.726KB/s 79.599KB/s
56 local_load_throughput Local Memory Load Throughput 9.5711GB/s 10.980GB/s 10.584GB/s
56 local_store_throughput Local Memory Store Throughput 9.7795GB/s 11.218GB/s 10.812GB/s
56 shared_load_throughput Shared Memory Load Throughput 4.7631GB/s 5.6585GB/s 5.5796GB/s
56 shared_store_throughput Shared Memory Store Throughput 26.013MB/s 30.903MB/s 30.472MB/s
56 gld_efficiency Global Memory Load Efficiency 95.86% 95.86% 95.86%
56 gst_efficiency Global Memory Store Efficiency 89.29% 89.29% 89.29%
56 tex_cache_transactions Unified cache to SM transactions 646878 660836 648875
56 flop_count_dp Floating Point Operations(Double Precision) 355176834 359244599 356870035
56 flop_count_dp_add Floating Point Operations(Double Precision Add) 56174791 56866432 56462626
56 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 134551838 136074949 135185903
56 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 29898367 30228269 30035601
56 flop_count_sp Floating Point Operations(Single Precision) 15057128 15218178 15124152
56 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
56 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 7528564 7609089 7562076
56 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
56 flop_count_sp_special Floating Point Operations(Single Precision Special) 10382724 10496333 10430011
56 inst_executed Instructions Executed 25413742 80016720 49727360
56 inst_issued Instructions Issued 25415873 26347375 25544928
56 dram_utilization Device Memory Utilization Low (1) Low (1) Low (1)
56 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
56 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 16.69% 17.24% 16.99%
56 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 72.03% 73.12% 72.30%
56 stall_memory_dependency Issue Stall Reasons (Data Request) 9.59% 10.37% 10.29%
56 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
56 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
56 stall_other Issue Stall Reasons (Other) 0.31% 0.32% 0.31%
56 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.09% 0.15% 0.11%
56 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 0.00% 0.00% 0.00%
56 shared_efficiency Shared Memory Efficiency 6.30% 6.30% 6.30%
56 inst_fp_32 FP Instructions(Single) 69145542 69900228 69459486
56 inst_fp_64 FP Instructions(Double) 231184995 233848875 232293678
56 inst_integer Integer Instructions 125131224 126485483 125695079
56 inst_bit_convert Bit-Convert Instructions 4148818 4194074 4167629
56 inst_control Control-Flow Instructions 49444293 49985809 49669731
56 inst_compute_ld_st Load/Store Instructions 9651394 9770356 9701371
56 inst_misc Misc Instructions 5740170 5806338 5767720
56 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
56 issue_slots Issue Slots 25415873 26347375 25544928
56 cf_issued Issued Control-Flow Instructions 2488960 2579342 2501409
56 cf_executed Executed Control-Flow Instructions 2488960 2579342 2501409
56 ldst_issued Issued Load/Store Instructions 655810 684587 659879
56 ldst_executed Executed Load/Store Instructions 655810 684587 659879
56 atomic_transactions Atomic Transactions 0 0 0
56 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
56 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
56 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
56 l2_tex_read_transactions L2 Transactions (Texture Reads) 925881 925923 925900
56 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.00% 0.00% 0.00%
56 stall_not_selected Issue Stall Reasons (Not Selected) 0.00% 0.00% 0.00%
56 l2_tex_write_transactions L2 Transactions (Texture Writes) 830208 883159 840772
56 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
56 nvlink_total_data_received NVLink Total Data Received 864 864 864
56 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
56 nvlink_user_data_received NVLink User Data Received 0 0 0
56 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
56 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
56 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
56 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
56 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
56 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
56 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
56 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
56 nvlink_transmit_throughput NVLink Transmit Throughput 489.26KB/s 581.22KB/s 573.12KB/s
56 nvlink_receive_throughput NVLink Receive Throughput 366.94KB/s 435.92KB/s 429.84KB/s
56 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
56 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
56 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
56 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
56 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
56 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
56 inst_fp_16 HP Instructions(Half) 0 0 0
56 ipc Executed IPC 0.120392 0.181921 0.148417
56 issued_ipc Issued IPC 0.180164 0.181948 0.180699
56 issue_slot_utilization Issue Slot Utilization 4.50% 4.55% 4.52%
56 sm_efficiency Multiprocessor Activity 58.09% 59.88% 59.04%
56 achieved_occupancy Achieved Occupancy 0.015625 0.015625 0.015625
56 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.180199 0.181998 0.180733
56 shared_utilization Shared Memory Utilization Low (1) Low (1) Low (1)
56 l2_utilization L2 Cache Utilization Low (1) Low (1) Low (1)
56 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
56 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
56 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
56 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
56 special_fu_utilization Special Function Unit Utilization Low (1) Low (1) Low (1)
56 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
56 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
56 double_precision_fu_utilization Double-Precision Function Unit Utilization Low (1) Low (1) Low (1)
56 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
56 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.01% 0.05% 0.05%
56 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.51% 2.36% 2.16%
56 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
56 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
56 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
56 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
56 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
Kernel: ptxcall_knl_dof_iteration__5
1 inst_per_warp Instructions per warp 7.5371e+03 7.5371e+03 7.5371e+03
1 branch_efficiency Branch Efficiency 99.51% 99.51% 99.51%
1 warp_execution_efficiency Warp Execution Efficiency 84.90% 84.90% 84.90%
1 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 81.58% 81.58% 81.58%
1 inst_replay_overhead Instruction Replay Overhead 0.000784 0.000784 0.000784
1 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
1 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
1 local_load_transactions_per_request Local Memory Load Transactions Per Request 3.851323 3.851323 3.851323
1 local_store_transactions_per_request Local Memory Store Transactions Per Request 3.910409 3.910409 3.910409
1 gld_transactions_per_request Global Load Transactions Per Request 8.133350 8.133350 8.133350
1 gst_transactions_per_request Global Store Transactions Per Request 8.562486 8.562486 8.562486
1 shared_store_transactions Shared Store Transactions 0 0 0
1 shared_load_transactions Shared Load Transactions 0 0 0
1 local_load_transactions Local Load Transactions 1825535 1825535 1825535
1 local_store_transactions Local Store Transactions 3133325 3133325 3133325
1 gld_transactions Global Load Transactions 2391205 2391205 2391205
1 gst_transactions Global Store Transactions 1384554 1384554 1384554
1 sysmem_read_transactions System Memory Read Transactions 0 0 0
1 sysmem_write_transactions System Memory Write Transactions 5 5 5
1 l2_read_transactions L2 Read Transactions 2403196 2403196 2403196
1 l2_write_transactions L2 Write Transactions 4851436 4851436 4851436
1 dram_read_transactions Device Memory Read Transactions 2605046 2605046 2605046
1 dram_write_transactions Device Memory Write Transactions 3154602 3154602 3154602
1 global_hit_rate Global Hit Rate in unified l1/tex 24.68% 24.68% 24.68%
1 local_hit_rate Local Hit Rate 78.01% 78.01% 78.01%
1 gld_requested_throughput Requested Global Load Throughput 217.26GB/s 217.26GB/s 217.26GB/s
1 gst_requested_throughput Requested Global Store Throughput 119.49GB/s 119.49GB/s 119.49GB/s
1 gld_throughput Global Load Throughput 226.19GB/s 226.19GB/s 226.19GB/s
1 gst_throughput Global Store Throughput 130.97GB/s 130.97GB/s 130.97GB/s
1 local_memory_overhead Local Memory Overhead 58.87% 58.87% 58.87%
1 tex_cache_hit_rate Unified Cache Hit Rate 21.89% 21.89% 21.89%
1 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 7.30% 7.30% 7.30%
1 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 43.32% 43.32% 43.32%
1 dram_read_throughput Device Memory Read Throughput 246.41GB/s 246.41GB/s 246.41GB/s
1 dram_write_throughput Device Memory Write Throughput 298.40GB/s 298.40GB/s 298.40GB/s
1 tex_cache_throughput Unified cache to SM throughput 434.24GB/s 434.24GB/s 434.24GB/s
1 l2_tex_read_throughput L2 Throughput (Texture Reads) 227.30GB/s 227.30GB/s 227.30GB/s
1 l2_tex_write_throughput L2 Throughput (Texture Writes) 427.35GB/s 427.35GB/s 427.35GB/s
1 l2_read_throughput L2 Throughput (Reads) 227.32GB/s 227.32GB/s 227.32GB/s
1 l2_write_throughput L2 Throughput (Writes) 458.90GB/s 458.90GB/s 458.90GB/s
1 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 sysmem_write_throughput System Memory Write Throughput 495.93KB/s 495.93KB/s 495.92KB/s
1 local_load_throughput Local Memory Load Throughput 172.68GB/s 172.68GB/s 172.68GB/s
1 local_store_throughput Local Memory Store Throughput 296.38GB/s 296.38GB/s 296.38GB/s
1 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 gld_efficiency Global Memory Load Efficiency 96.06% 96.06% 96.06%
1 gst_efficiency Global Memory Store Efficiency 91.24% 91.24% 91.24%
1 tex_cache_transactions Unified cache to SM transactions 1147674 1147674 1147674
1 flop_count_dp Floating Point Operations(Double Precision) 671800066 671800066 671800066
1 flop_count_dp_add Floating Point Operations(Double Precision Add) 118938848 118938848 118938848
1 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 248740453 248740453 248740453
1 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 55380312 55380312 55380312
1 flop_count_sp Floating Point Operations(Single Precision) 22454596 22454596 22454596
1 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
1 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 10824714 10824714 10824714
1 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 805168 805168 805168
1 flop_count_sp_special Floating Point Operations(Single Precision Special) 17216585 17216585 17216585
1 inst_executed Instructions Executed 110795900 110795900 110795900
1 inst_issued Instructions Issued 37483408 37483408 37483408
1 dram_utilization Device Memory Utilization High (7) High (7) High (7)
1 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
1 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 31.89% 31.89% 31.89%
1 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 27.52% 27.52% 27.52%
1 stall_memory_dependency Issue Stall Reasons (Data Request) 19.22% 19.22% 19.22%
1 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
1 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
1 stall_other Issue Stall Reasons (Other) 0.70% 0.70% 0.70%
1 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.38% 0.38% 0.38%
1 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 8.42% 8.42% 8.42%
1 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
1 inst_fp_32 FP Instructions(Single) 120606898 120606898 120606898
1 inst_fp_64 FP Instructions(Double) 439914014 439914014 439914014
1 inst_integer Integer Instructions 273934814 273934814 273934814
1 inst_bit_convert Bit-Convert Instructions 10194859 10194859 10194859
1 inst_control Control-Flow Instructions 90023317 90023317 90023317
1 inst_compute_ld_st Load/Store Instructions 28499634 28499634 28499634
1 inst_misc Misc Instructions 19555738 19555738 19555738
1 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
1 issue_slots Issue Slots 37483408 37483408 37483408
1 cf_issued Issued Control-Flow Instructions 3413189 3413189 3413189
1 cf_executed Executed Control-Flow Instructions 3413189 3413189 3413189
1 ldst_issued Issued Load/Store Instructions 1222796 1222796 1222796
1 ldst_executed Executed Load/Store Instructions 1222796 1222796 1222796
1 atomic_transactions Atomic Transactions 0 0 0
1 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
1 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
1 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
1 l2_tex_read_transactions L2 Transactions (Texture Reads) 2402980 2402980 2402980
1 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 7.92% 7.92% 7.92%
1 stall_not_selected Issue Stall Reasons (Not Selected) 3.97% 3.97% 3.97%
1 l2_tex_write_transactions L2 Transactions (Texture Writes) 4517879 4517879 4517879
1 nvlink_total_data_transmitted NVLink Total Data Transmitted 1152 1152 1152
1 nvlink_total_data_received NVLink Total Data Received 864 864 864
1 nvlink_user_data_transmitted NVLink User Data Transmitted 0 0 0
1 nvlink_user_data_received NVLink User Data Received 0 0 0
1 nvlink_overhead_data_transmitted NVLink Overhead Data Transmitted 1.00% 1.00% 1.00%
1 nvlink_overhead_data_received NVLink Overhead Data Received 1.00% 1.00% 1.00%
1 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
1 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
1 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
1 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
1 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 0 0 0
1 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 0 0 0
1 nvlink_transmit_throughput NVLink Transmit Throughput 3.4870MB/s 3.4870MB/s 3.4870MB/s
1 nvlink_receive_throughput NVLink Receive Throughput 2.6152MB/s 2.6152MB/s 2.6152MB/s
1 nvlink_total_response_data_received NVLink Total Response Data Received 288 288 288
1 nvlink_user_response_data_received NVLink User Response Data Received 0 0 0
1 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
1 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
1 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
1 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
1 inst_fp_16 HP Instructions(Half) 0 0 0
1 ipc Executed IPC 1.170406 1.170406 1.170406
1 issued_ipc Issued IPC 1.171323 1.171323 1.171323
1 issue_slot_utilization Issue Slot Utilization 29.28% 29.28% 29.28%
1 sm_efficiency Multiprocessor Activity 94.39% 94.39% 94.39%
1 achieved_occupancy Achieved Occupancy 0.238542 0.238542 0.238542
1 eligible_warps_per_cycle Eligible Warps Per Active Cycle 1.777020 1.777020 1.777020
1 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
1 l2_utilization L2 Cache Utilization Low (2) Low (2) Low (2)
1 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
1 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
1 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
1 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 special_fu_utilization Special Function Unit Utilization Low (1) Low (1) Low (1)
1 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (2) Low (2) Low (2)
1 double_precision_fu_utilization Double-Precision Function Unit Utilization Mid (6) Mid (6) Mid (6)
1 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
1 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.55% 0.55% 0.55%
1 flop_dp_efficiency FLOP Efficiency(Peak Double) 32.63% 32.63% 32.63%
1 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
1 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
1 nvlink_data_transmission_efficiency NVLink Data Transmission Efficiency 0.00% 0.00% 0.00%
1 nvlink_data_receive_efficiency NVLink Data Receive Efficiency 0.00% 0.00% 0.00%
1 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
~/research/code/CLIMA lcw/dycoms3dperformance* lucas@ip-172-22-40-133 2m 44s
❯ nvprof --print-gpu-trace julia --project=env/gpu test/DGmethods/compressible_Navier_Stokes/dycoms3d-profiling.jl
==42400== NVPROF is profiling process 42400, command: julia --project=env/gpu test/DGmethods/compressible_Navier_Stokes/dycoms3d-profiling.jl
[ Info: ----------------------------------------------------
[ Info: ______ _ _____ __ ________
[ Info: | ____| | |_ _| ... | __ |
[ Info: | | | | | | | . | | | |
[ Info: | | | | | | | | | | |__| |
[ Info: | |____| |____ _| |_| | | | | | |
[ Info: | _____|______|_____|_| |_|_| |_|
[ Info:
[ Info: ----------------------------------------------------
[ Info: Dycoms
[ Info: Resolution:
[ Info: (Δx, Δy, Δz) = (3.50e+01, 3.50e+01, 1.00e+01)
[ Info: (Nex, Ney, Nez) = (6, 6, 38)
[ Info: DoF = 2103750
[ Info: Minimum necessary memory to run this test: 33660000
[ Info: Time step dt: 2.50e-03
[ Info: End time t : 0
[ Info: ----------------------------------------------------
[ Info: Topology Generation...
[ Info: Grid Generation...
[ Info: Space Discretization Generation...
[ Info: Initial Condition Generation...
[ Info: Solve...
┌ Info: Update
│ simtime = 2.5000000000000001e-03
└ runtime = 00:00:21
[ Info: Done
==42400== Profiling application: julia --project=env/gpu test/DGmethods/compressible_Navier_Stokes/dycoms3d-profiling.jl
==42400== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
48.1087s 2.1262ms - - - - - 19.569MB 8.9881GB/s Pageable Device Tesla V100-SXM2 1 7 [CUDA memcpy HtoD]
48.4963s 778.27us - - - - - 7.8278MB 9.8222GB/s Pageable Device Tesla V100-SXM2 1 7 [CUDA memcpy HtoD]
48.9647s 8.7680us - - - - - 64.125KB 6.9747GB/s Pageable Device Tesla V100-SXM2 1 7 [CUDA memcpy HtoD]
49.2753s 153.41us - - - - - 1.5656MB 9.9660GB/s Pageable Device Tesla V100-SXM2 1 7 [CUDA memcpy HtoD]
49.2764s 153.31us - - - - - 1.5656MB 9.9723GB/s Pageable Device Tesla V100-SXM2 1 7 [CUDA memcpy HtoD]
50.0005s 1.6000us - - - - - 200B 119.21MB/s Pageable Device Tesla V100-SXM2 1 7 [CUDA memcpy HtoD]
50.0006s 1.5680us - - - - - 200B 121.64MB/s Pageable Device Tesla V100-SXM2 1 7 [CUDA memcpy HtoD]
50.4303s 1.8827ms - - - - - 19.569MB 10.151GB/s Device Pageable Tesla V100-SXM2 1 7 [CUDA memcpy DtoH]
61.8869s 24.256us (1368 1 1) (125 1 1) 46 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_initauxstate__1 [62]
62.2323s 12.576ms - - - - - 19.569MB 1.5196GB/s Device Pageable Tesla V100-SXM2 1 7 [CUDA memcpy DtoH]
62.3593s 12.732ms - - - - - 19.569MB 1.5009GB/s Device Pageable Tesla V100-SXM2 1 7 [CUDA memcpy DtoH]
62.3789s 737.53us - - - - - 7.8278MB 10.365GB/s Device Pageable Tesla V100-SXM2 1 7 [CUDA memcpy DtoH]
62.3851s 866.84us - - - - - 9.1324MB 10.288GB/s Device Pageable Tesla V100-SXM2 1 7 [CUDA memcpy DtoH]
115.871s 771.01us - - - - - 7.8278MB 9.9147GB/s Pageable Device Tesla V100-SXM2 1 7 [CUDA memcpy HtoD]
116.704s 14.496us (4008 1 1) (256 1 1) 16 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_anonymous19_2 [81]
116.714s 7.9285ms - - - - - 19.569MB 2.4104GB/s Device Pageable Tesla V100-SXM2 1 7 [CUDA memcpy DtoH]
120.388s 1.1766ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [95]
121.113s 60.864us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [106]
124.076s 141.70us (1368 1 1) (125 1 1) 116 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_dof_iteration__5 [117]
124.084s 2.3178ms - - - - - 19.569MB 8.2454GB/s Device Pageable Tesla V100-SXM2 1 7 [CUDA memcpy DtoH]
124.088s 4.3666ms - - - - - 7.8278MB 1.7506GB/s Device Pageable Tesla V100-SXM2 1 7 [CUDA memcpy DtoH]
124.096s 9.1516ms - - - - - 14.351MB 1.5314GB/s Device Pageable Tesla V100-SXM2 1 7 [CUDA memcpy DtoH]
129.207s 1.1791ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [123]
129.208s 61.472us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [126]
131.581s 112.38us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [137]
134.199s 627.84us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [148]
136.727s 184.29us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [159]
142.055s 1.3477ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [170]
142.827s 52.639us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [181]
142.827s 1.1784ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [184]
142.828s 59.903us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [187]
142.828s 116.64us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [190]
142.828s 635.49us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [193]
142.829s 187.07us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [196]
142.829s 1.3139ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [199]
142.830s 49.856us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [202]
142.830s 1.1671ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [205]
142.832s 60.064us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [208]
142.832s 115.87us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [211]
142.832s 646.62us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [214]
142.832s 190.08us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [217]
142.833s 1.3071ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [220]
142.834s 49.984us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [223]
142.834s 1.1686ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [226]
142.835s 59.552us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [229]
142.835s 118.91us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [232]
142.835s 621.69us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [235]
142.836s 187.26us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [238]
142.836s 1.2983ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [241]
142.837s 50.880us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [244]
142.838s 1.1661ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [247]
142.839s 60.288us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [250]
142.839s 116.99us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [253]
142.839s 642.85us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [256]
142.840s 189.18us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [259]
142.840s 1.2913ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [262]
142.841s 49.888us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [265]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
~/research/code/CLIMA lcw/dycoms3dperformance* lucas@ip-172-22-40-133 2m 46s
❯ nvprof --print-gpu-trace julia --project=env/gpu test/DGmethods/compressible_Navier_Stokes/dycoms3d-profiling.jl
==42595== NVPROF is profiling process 42595, command: julia --project=env/gpu test/DGmethods/compressible_Navier_Stokes/dycoms3d-profiling.jl
[ Info: ----------------------------------------------------
[ Info: ______ _ _____ __ ________
[ Info: | ____| | |_ _| ... | __ |
[ Info: | | | | | | | . | | | |
[ Info: | | | | | | | | | | |__| |
[ Info: | |____| |____ _| |_| | | | | | |
[ Info: | _____|______|_____|_| |_|_| |_|
[ Info:
[ Info: ----------------------------------------------------
[ Info: Dycoms
[ Info: Resolution:
[ Info: (Δx, Δy, Δz) = (3.50e+01, 3.50e+01, 1.00e+01)
[ Info: (Nex, Ney, Nez) = (6, 6, 38)
[ Info: DoF = 2103750
[ Info: Minimum necessary memory to run this test: 33660000
[ Info: Time step dt: 2.50e-03
[ Info: End time t : 0
[ Info: ----------------------------------------------------
[ Info: Topology Generation...
[ Info: Grid Generation...
[ Info: Space Discretization Generation...
[ Info: Initial Condition Generation...
[ Info: Solve...
┌ Info: Update
│ simtime = 2.5000000000000001e-03
└ runtime = 00:00:23
[ Info: Done
==42595== Profiling application: julia --project=env/gpu test/DGmethods/compressible_Navier_Stokes/dycoms3d-profiling.jl
==42595== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
48.4541s 2.5663ms - - - - - 19.569MB 7.4468GB/s Pageable Device Tesla V100-SXM2 1 7 [CUDA memcpy HtoD]
48.8630s 791.97us - - - - - 7.8278MB 9.6523GB/s Pageable Device Tesla V100-SXM2 1 7 [CUDA memcpy HtoD]
49.3438s 8.6720us - - - - - 64.125KB 7.0519GB/s Pageable Device Tesla V100-SXM2 1 7 [CUDA memcpy HtoD]
49.6670s 153.73us - - - - - 1.5656MB 9.9453GB/s Pageable Device Tesla V100-SXM2 1 7 [CUDA memcpy HtoD]
49.6681s 153.41us - - - - - 1.5656MB 9.9660GB/s Pageable Device Tesla V100-SXM2 1 7 [CUDA memcpy HtoD]
50.3935s 1.5680us - - - - - 200B 121.64MB/s Pageable Device Tesla V100-SXM2 1 7 [CUDA memcpy HtoD]
50.3935s 1.5360us - - - - - 200B 124.18MB/s Pageable Device Tesla V100-SXM2 1 7 [CUDA memcpy HtoD]
50.8201s 1.8141ms - - - - - 19.569MB 10.535GB/s Device Pageable Tesla V100-SXM2 1 7 [CUDA memcpy DtoH]
62.1151s 25.056us (1368 1 1) (125 1 1) 46 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_initauxstate__1 [62]
62.4557s 12.569ms - - - - - 19.569MB 1.5205GB/s Device Pageable Tesla V100-SXM2 1 7 [CUDA memcpy DtoH]
62.5812s 12.322ms - - - - - 19.569MB 1.5509GB/s Device Pageable Tesla V100-SXM2 1 7 [CUDA memcpy DtoH]
62.6005s 700.51us - - - - - 7.8278MB 10.912GB/s Device Pageable Tesla V100-SXM2 1 7 [CUDA memcpy DtoH]
62.6066s 903.81us - - - - - 9.1324MB 9.8676GB/s Device Pageable Tesla V100-SXM2 1 7 [CUDA memcpy DtoH]
116.611s 784.22us - - - - - 7.8278MB 9.7476GB/s Pageable Device Tesla V100-SXM2 1 7 [CUDA memcpy HtoD]
117.466s 14.624us (4008 1 1) (256 1 1) 16 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_anonymous19_2 [81]
117.477s 7.8260ms - - - - - 19.569MB 2.4419GB/s Device Pageable Tesla V100-SXM2 1 7 [CUDA memcpy DtoH]
121.363s 1.1755ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [95]
122.076s 61.152us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [106]
125.222s 137.25us (1368 1 1) (125 1 1) 116 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_dof_iteration__5 [117]
125.230s 1.9465ms - - - - - 19.569MB 9.8179GB/s Device Pageable Tesla V100-SXM2 1 7 [CUDA memcpy DtoH]
125.232s 1.8698ms - - - - - 7.8278MB 4.0883GB/s Device Pageable Tesla V100-SXM2 1 7 [CUDA memcpy DtoH]
125.238s 8.9647ms - - - - - 14.351MB 1.5633GB/s Device Pageable Tesla V100-SXM2 1 7 [CUDA memcpy DtoH]
130.316s 1.1800ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [123]
130.317s 61.600us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [126]
132.877s 112.29us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [137]
135.754s 644.13us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [148]
138.539s 182.30us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [159]
144.755s 1.3128ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [170]
145.582s 52.608us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [181]
145.582s 1.1833ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [184]
145.583s 61.216us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [187]
145.583s 116.13us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [190]
145.584s 639.20us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [193]
145.584s 189.15us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [196]
145.584s 1.3144ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [199]
145.586s 49.920us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [202]
145.586s 1.1716ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [205]
145.587s 59.840us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [208]
145.587s 118.78us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [211]
145.587s 636.13us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [214]
145.588s 189.18us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [217]
145.588s 1.3193ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [220]
145.589s 49.632us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [223]
145.589s 1.1714ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [226]
145.591s 60.607us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [229]
145.591s 121.02us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [232]
145.591s 640.29us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [235]
145.591s 189.70us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [238]
145.592s 1.3028ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [241]
145.593s 49.536us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [244]
145.593s 1.1697ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [247]
145.594s 61.056us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [250]
145.594s 116.93us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [253]
145.594s 633.60us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [256]
145.595s 186.18us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [259]
145.595s 1.3397ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [262]
145.596s 49.983us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [265]
145.619s 1.1685ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [268]
145.620s 60.224us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [271]
145.621s 117.92us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [274]
145.621s 640.35us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [277]
145.621s 188.26us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [280]
145.621s 1.2885ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [283]
145.623s 50.783us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [286]
145.623s 1.1683ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [289]
145.624s 60.575us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [292]
145.624s 119.14us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [295]
145.624s 631.84us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [298]
145.625s 187.39us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [301]
145.625s 1.2962ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [304]
145.626s 49.152us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [307]
145.626s 1.1688ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [310]
145.628s 60.512us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [313]
145.628s 114.24us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [316]
145.628s 639.71us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [319]
145.628s 184.80us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [322]
145.629s 1.2958ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [325]
145.630s 50.047us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [328]
145.630s 1.1686ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [331]
145.631s 62.623us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [334]
145.631s 116.86us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [337]
145.631s 634.30us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [340]
145.632s 185.66us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [343]
145.632s 1.3195ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [346]
145.633s 49.728us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [349]
145.633s 1.1616ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [352]
145.635s 60.928us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [355]
145.635s 123.20us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [358]
145.635s 633.95us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [361]
145.635s 187.07us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [364]
145.636s 1.2854ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [367]
145.637s 50.463us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [370]
145.637s 1.1627ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [373]
145.638s 62.559us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [376]
145.638s 115.81us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [379]
145.638s 644.51us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [382]
145.639s 186.46us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [385]
145.639s 1.2936ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [388]
145.640s 50.080us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [391]
145.641s 1.1612ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [394]
145.642s 60.256us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [397]
145.642s 114.91us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [400]
145.642s 630.75us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [403]
145.642s 184.42us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [406]
145.643s 1.2992ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [409]
145.644s 51.136us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [412]
145.644s 1.1598ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [415]
145.645s 61.120us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [418]
145.645s 117.50us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [421]
145.645s 648.57us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [424]
145.646s 188.19us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [427]
145.646s 1.2866ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [430]
145.648s 49.568us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [433]
145.648s 1.1581ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [436]
145.649s 60.384us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [439]
145.649s 118.88us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [442]
145.649s 632.70us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [445]
145.650s 188.51us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [448]
145.650s 1.2861ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [451]
145.651s 49.152us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [454]
145.651s 1.1579ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [457]
145.652s 60.608us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [460]
145.652s 116.77us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [463]
145.652s 645.02us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [466]
145.653s 190.37us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [469]
145.653s 1.2973ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [472]
145.655s 49.632us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [475]
145.655s 1.1578ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [478]
145.656s 60.288us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [481]
145.656s 115.68us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [484]
145.656s 640.51us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [487]
145.657s 189.02us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [490]
145.657s 1.3213ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [493]
145.658s 49.920us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [496]
145.658s 1.1574ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [499]
145.659s 60.736us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [502]
145.659s 115.46us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [505]
145.660s 668.54us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [508]
145.660s 188.10us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [511]
145.660s 1.3009ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [514]
145.662s 50.176us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [517]
145.662s 1.1577ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [520]
145.663s 61.280us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [523]
145.663s 118.66us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [526]
145.663s 636.00us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [529]
145.664s 188.61us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [532]
145.664s 1.3748ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [535]
145.665s 50.272us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [538]
145.665s 1.1574ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [541]
145.666s 60.800us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [544]
145.667s 119.04us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [547]
145.667s 629.25us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [550]
145.667s 187.46us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [553]
145.668s 1.3043ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [556]
145.669s 49.056us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [559]
145.669s 1.1573ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [562]
145.670s 60.000us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [565]
145.670s 116.93us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [568]
145.670s 625.18us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [571]
145.671s 187.14us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [574]
145.671s 1.3011ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [577]
145.672s 50.143us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [580]
145.672s 1.1568ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [583]
145.674s 61.279us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [586]
145.674s 115.81us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [589]
145.674s 628.09us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [592]
145.674s 185.54us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [595]
145.675s 1.2713ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [598]
145.676s 50.016us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [601]
145.676s 1.1573ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [604]
145.677s 61.728us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [607]
145.677s 120.99us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [610]
145.677s 629.60us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [613]
145.678s 187.78us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [616]
145.678s 1.2644ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [619]
145.679s 49.600us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [622]
145.679s 1.1570ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [625]
145.681s 60.448us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [628]
145.681s 117.73us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [631]
145.681s 623.33us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [634]
145.681s 188.00us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [637]
145.682s 1.2895ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [640]
145.683s 50.015us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [643]
145.683s 1.1558ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [646]
145.684s 60.255us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [649]
145.684s 116.96us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [652]
145.684s 643.58us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [655]
145.685s 189.50us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [658]
145.685s 1.2793ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [661]
145.686s 49.856us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [664]
145.686s 1.1558ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [667]
145.688s 60.512us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [670]
145.688s 118.21us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [673]
145.688s 634.78us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [676]
145.688s 187.17us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [679]
145.689s 1.2937ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [682]
145.690s 49.600us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [685]
145.690s 1.1556ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [688]
145.691s 60.288us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [691]
145.691s 117.09us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [694]
145.691s 629.92us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [697]
145.692s 188.22us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [700]
145.692s 1.3032ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [703]
145.693s 50.368us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [706]
145.693s 1.1550ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [709]
145.695s 61.088us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [712]
145.695s 118.50us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [715]
145.695s 638.72us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [718]
145.695s 189.22us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [721]
145.696s 1.3137ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [724]
145.697s 50.048us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [727]
145.697s 1.1564ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [730]
145.698s 60.640us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [733]
145.698s 118.24us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [736]
145.698s 626.46us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [739]
145.699s 186.88us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [742]
145.699s 1.2972ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [745]
145.700s 49.472us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [748]
145.700s 1.1573ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [751]
145.702s 61.280us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [754]
145.702s 118.78us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [757]
145.702s 626.91us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [760]
145.702s 187.04us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [763]
145.703s 1.3051ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [766]
145.704s 50.048us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [769]
145.704s 1.1582ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [772]
145.705s 61.024us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [775]
145.705s 121.18us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [778]
145.705s 628.13us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [781]
145.706s 188.35us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [784]
145.706s 1.2917ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [787]
145.707s 50.591us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [790]
145.708s 1.1558ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [793]
145.709s 61.023us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [796]
145.709s 117.82us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [799]
145.709s 622.69us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [802]
145.709s 186.91us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [805]
145.710s 1.2973ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [808]
145.711s 49.184us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [811]
145.711s 1.1562ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [814]
145.712s 60.832us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [817]
145.712s 119.10us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [820]
145.712s 626.78us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [823]
145.713s 188.86us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [826]
145.713s 1.3111ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [829]
145.715s 50.175us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [832]
145.715s 1.1556ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [835]
145.716s 61.151us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [838]
145.716s 119.26us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [841]
145.716s 626.01us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [844]
145.717s 188.99us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [847]
145.717s 1.3008ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [850]
145.718s 50.752us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [853]
145.718s 1.1469ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [856]
145.719s 59.264us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [859]
145.719s 109.92us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [862]
145.719s 592.03us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [865]
145.720s 176.80us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [868]
145.720s 1.2000ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [871]
145.721s 48.671us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [874]
145.721s 1.0580ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [877]
145.722s 57.088us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [880]
145.723s 107.84us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [883]
145.723s 583.01us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [886]
145.723s 178.34us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [889]
145.723s 1.2018ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [892]
145.725s 47.872us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [895]
145.725s 1.0580ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [898]
145.726s 56.991us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [901]
145.726s 109.06us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [904]
145.726s 585.53us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [907]
145.727s 177.41us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [910]
145.727s 1.2074ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [913]
145.728s 48.768us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [916]
145.728s 1.0576ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [919]
145.729s 57.152us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [922]
145.729s 107.71us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [925]
145.729s 588.70us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [928]
145.730s 180.80us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [931]
145.730s 1.1829ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [934]
145.731s 48.608us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [937]
145.731s 1.0595ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [940]
145.732s 56.480us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [943]
145.732s 108.74us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [946]
145.732s 582.72us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [949]
145.733s 175.04us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [952]
145.733s 1.1963ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [955]
145.734s 47.264us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [958]
145.734s 1.0581ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [961]
145.736s 56.767us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [964]
145.736s 110.18us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [967]
145.736s 588.80us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [970]
145.736s 178.43us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [973]
145.736s 1.2053ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [976]
145.738s 48.128us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [979]
145.738s 1.0576ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [982]
145.739s 57.056us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [985]
145.739s 108.13us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [988]
145.739s 574.88us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [991]
145.740s 179.65us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [994]
145.740s 1.1868ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [997]
145.741s 48.159us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [1000]
145.741s 1.0579ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [1003]
145.742s 58.016us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [1006]
145.742s 108.29us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [1009]
145.742s 598.91us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [1012]
145.743s 179.55us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [1015]
145.743s 1.2021ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [1018]
145.744s 48.128us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [1021]
145.744s 1.0588ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [1024]
145.745s 58.047us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [1027]
145.745s 109.76us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [1030]
145.745s 589.66us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [1033]
145.746s 178.50us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [1036]
145.746s 1.2175ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [1039]
145.747s 48.544us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [1042]
145.747s 1.0584ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [1045]
145.749s 57.568us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [1048]
145.749s 107.17us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [1051]
145.749s 582.88us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [1054]
145.749s 177.44us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [1057]
145.749s 1.2104ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [1060]
145.751s 48.512us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [1063]
145.751s 1.0583ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [1066]
145.752s 57.120us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [1069]
145.752s 109.89us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [1072]
145.752s 589.34us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [1075]
145.753s 178.14us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [1078]
145.753s 1.1966ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [1081]
145.754s 48.416us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [1084]
145.754s 1.0590ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [1087]
145.755s 57.760us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [1090]
145.755s 108.64us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [1093]
145.755s 584.64us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [1096]
145.756s 178.02us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [1099]
145.756s 1.2052ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [1102]
145.757s 48.063us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [1105]
145.757s 1.0574ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [1108]
145.758s 57.568us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [1111]
145.758s 106.75us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [1114]
145.759s 585.09us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [1117]
145.759s 176.19us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [1120]
145.759s 1.1889ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [1123]
145.760s 48.192us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [1126]
145.761s 1.0568ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [1129]
145.762s 56.288us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [1132]
145.762s 109.95us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [1135]
145.762s 587.42us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [1138]
145.762s 178.21us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [1141]
145.763s 1.1716ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [1144]
145.764s 48.352us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [1147]
145.764s 1.0570ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [1150]
145.765s 57.375us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [1153]
145.765s 108.74us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [1156]
145.765s 596.25us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [1159]
145.766s 179.17us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [1162]
145.766s 1.2101ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [1165]
145.767s 48.288us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [1168]
145.767s 1.0580ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [1171]
145.768s 57.920us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [1174]
145.768s 109.18us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [1177]
145.768s 582.56us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [1180]
145.769s 175.78us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [1183]
145.769s 1.1949ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [1186]
145.770s 48.416us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [1189]
145.770s 1.0585ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [1192]
145.771s 56.928us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [1195]
145.771s 111.23us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [1198]
145.771s 594.88us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [1201]
145.772s 176.96us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [1204]
145.772s 1.2273ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [1207]
145.774s 48.576us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [1210]
145.774s 1.0591ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [1213]
145.775s 58.112us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [1216]
145.775s 111.36us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [1219]
145.775s 606.11us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [1222]
145.775s 177.22us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [1225]
145.776s 1.1907ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [1228]
145.777s 48.479us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [1231]
145.777s 1.0582ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [1234]
145.778s 56.992us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [1237]
145.778s 108.58us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [1240]
145.778s 599.20us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [1243]
145.779s 177.15us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [1246]
145.779s 1.1971ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [1249]
145.780s 48.384us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [1252]
145.780s 1.0577ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [1255]
145.781s 57.631us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [1258]
145.781s 111.26us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [1261]
145.781s 597.37us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [1264]
145.782s 178.02us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [1267]
145.782s 1.2099ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [1270]
145.783s 48.512us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [1273]
145.783s 1.0587ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [1276]
145.784s 56.800us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [1279]
145.784s 106.21us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [1282]
145.785s 598.88us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [1285]
145.785s 179.68us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [1288]
145.785s 1.1711ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [1291]
145.787s 48.608us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [1294]
145.787s 1.0506ms (36 1 1) (5 5 1) 146 200B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_indefinite_stack_integral__3 [1297]
145.788s 57.376us (36 1 1) (5 5 1) 32 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_knl_reverse_indefinite_stack_integral__4 [1300]
145.788s 110.91us (1368 1 1) (5 5 5) 94 6.0547KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumeviscterms__6 [1303]
145.788s 597.60us (1368 1 1) (25 1 1) 165 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_faceviscterms__7 [1306]
145.788s 175.39us (1368 1 1) (5 5 5) 156 17.773KB 0B - - - - Tesla V100-SXM2 1 7 ptxcall_volumerhs__8 [1309]
145.789s 1.1701ms (1368 1 1) (25 1 1) 255 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_facerhs__9 [1312]
145.790s 48.640us (1002 1 1) (1024 1 1) 40 0B 0B - - - - Tesla V100-SXM2 1 7 ptxcall_update__10 [1315]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
~/research/code/CLIMA lcw/dycoms3dperformance* lucas@ip-172-22-40-133 2m 48s
❯ nvprof -u us julia --project=env/gpu test/DGmethods/compressible_Navier_Stokes/dycoms3d-profiling.jl
==49997== NVPROF is profiling process 49997, command: julia --project=env/gpu test/DGmethods/compressible_Navier_Stokes/dycoms3d-profiling.jl
[ Info: ------------------------------------------------------
[ Info: ______ _ _____ __ ________
[ Info: | ____| | |_ _| ... | __ |
[ Info: | | | | | | | . | | | |
[ Info: | | | | | | | | | | |__| |
[ Info: | |____| |____ _| |_| | | | | | |
[ Info: | _____|______|_____|_| |_|_| |_|
[ Info:
[ Info: ------------------------------------------------------
[ Info: Dycoms
[ Info: Resolution:
[ Info: (Δx, Δy, Δz) = (3.00e+01, 3.00e+01, 5.00e+00)
[ Info: (Nex, Ney, Nez) = (7, 7, 75)
[ Info: DoF = 2756250
[ Info: Minimum necessary memory to run this test: 0.1911 GBs
[ Info: Time step dt: 2.50e-03
[ Info: End time t : 0
[ Info: ------------------------------------------------------
┌ Info: Update
│ simtime = 2.5000000000000001e-03
└ runtime = 00:00:22
==49997== Profiling application: julia --project=env/gpu test/DGmethods/compressible_Navier_Stokes/dycoms3d-profiling.jl
==49997== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
% us us us us
GPU activities: 37.54 2.38e+05 9 2.64e+04 7.01e+03 3.53e+04 [CUDA memcpy DtoH]
22.59 1.43e+05 55 2.60e+03 2.39e+03 2.81e+03 ptxcall_facerhs__9
18.78 1.19e+05 56 2.12e+03 1.93e+03 2.33e+03 ptxcall_knl_indefinite_stack_integral__3
11.09 7.02e+04 55 1.28e+03 1.19e+03 1.37e+03 ptxcall_faceviscterms__7
3.49 2.21e+04 55 401.8850 378.5580 424.2220 ptxcall_volumerhs__8
2.25 1.42e+04 55 258.6800 238.6550 278.7180 ptxcall_volumeviscterms__6
1.93 1.22e+04 8 1.53e+03 1.536000 6.99e+03 [CUDA memcpy HtoD]
1.19 7.55e+03 56 134.8790 127.1990 141.6630 ptxcall_knl_reverse_indefinite_stack_integral__4
1.07 6.78e+03 55 123.2240 119.1360 129.5040 ptxcall_update__10
0.05 317.4060 1 317.4060 317.4060 317.4060 ptxcall_knl_dof_iteration__5
0.01 66.59200 1 66.59200 66.59200 66.59200 ptxcall_initauxstate__1
0.00 30.40000 1 30.40000 30.40000 30.40000 ptxcall_anonymous19_2
API calls: 31.82 2.63e+05 10 2.63e+04 47.38000 2.61e+05 cuModuleUnload
29.85 2.47e+05 9 2.74e+04 7.74e+03 3.63e+04 cuMemcpyDtoH
26.18 2.16e+05 1 2.16e+05 2.16e+05 2.16e+05 cuDevicePrimaryCtxRetain
7.63 6.30e+04 1 6.30e+04 6.30e+04 6.30e+04 cuDevicePrimaryCtxRelease
1.58 1.31e+04 10 1.31e+03 322.0760 4.19e+03 cuModuleLoadDataEx
1.55 1.28e+04 8 1.60e+03 11.97100 7.13e+03 cuMemcpyHtoD
0.81 6.65e+03 390 17.06200 12.02300 115.5020 cuLaunchKernel
0.43 3.57e+03 12 297.3620 15.26800 411.3690 cuMemAlloc
0.06 532.9260 413 1.290000 0.562000 18.70100 cuCtxGetCurrent
0.06 464.7890 389 1.194000 0.791000 18.08000 cuFuncGetAttribute
0.00 35.37100 11 3.215000 2.580000 5.360000 cuCtxPushCurrent
0.00 33.37200 32 1.042000 0.515000 2.605000 cuDeviceGetAttribute
0.00 16.30100 10 1.630000 1.466000 1.932000 cuModuleGetFunction
0.00 12.56900 11 1.142000 0.880000 2.580000 cuCtxPopCurrent
0.00 11.43500 10 1.143000 0.972000 1.262000 cuCtxGetDevice
0.00 7.043000 5 1.408000 0.812000 3.234000 cuDeviceGet
0.00 3.100000 1 3.100000 3.100000 3.100000 cuDriverGetVersion
0.00 2.969000 2 1.484000 0.546000 2.423000 cuDeviceGetCount
0.00 0.873000 1 0.873000 0.873000 0.873000 cuCtxSetCurrent

jkozdon commented Jun 25, 2019

What's going on with CUDA memcpy DtoH: https://gist.github.com/lcw/a04ba357bc572718e235eec17ceeb3a3#file-profile_std-out-L28

Scalar indexing somewhere?

Author

lcw commented Jun 25, 2019

I am not sure why they are being called, but they happen before the time stepping starts. I don’t think it is scalar indexing; I believe that is turned off in the driver.

Author

lcw commented Jun 25, 2019

There are only 10 time steps being timed here.
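
Since the memcpy calls reportedly fall outside the time-stepping loop, one way to read the `GPU activities` table above is to renormalize the kernel times without them. The following is only a rough sketch: the times are copied from the rounded values in that table (microseconds), and the names `kernel_time_us` and `kernel_shares` are made up for illustration.

```python
# Kernel times in microseconds, copied from the "GPU activities" rows of the
# `nvprof -u us` summary above. The [CUDA memcpy DtoH/HtoD] rows are left out
# on purpose. Values are as rounded by nvprof, so the shares are approximate.
kernel_time_us = {
    "ptxcall_facerhs__9": 1.43e5,
    "ptxcall_knl_indefinite_stack_integral__3": 1.19e5,
    "ptxcall_faceviscterms__7": 7.02e4,
    "ptxcall_volumerhs__8": 2.21e4,
    "ptxcall_volumeviscterms__6": 1.42e4,
    "ptxcall_knl_reverse_indefinite_stack_integral__4": 7.55e3,
    "ptxcall_update__10": 6.78e3,
    "ptxcall_knl_dof_iteration__5": 317.406,
    "ptxcall_initauxstate__1": 66.592,
    "ptxcall_anonymous19_2": 30.4,
}


def kernel_shares(times):
    """Return each kernel's percentage of the total kernel-only time."""
    total = sum(times.values())
    return {name: 100.0 * t / total for name, t in times.items()}


shares = kernel_shares(kernel_time_us)

# Print kernels from most to least expensive.
for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{pct:6.2f}%  {name}")
```

Read this way, the two face kernels and the indefinite stack integral account for most of the kernel time.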
