## latency.txt
Latency Comparison Numbers
--------------------------
L1 cache reference/hit                       1.5 ns                      4 cycles
Floating-point add/mult/FMA operation        1.5 ns                      4 cycles
L2 cache reference/hit                       5   ns                      12 ~ 17 cycles
Branch mispredict                            6   ns                      15 ~ 20 cycles
L3 cache hit (unshared cache line)          16   ns                      42 cycles
L3 cache hit (shared line in another core)  25   ns                      65 cycles
Mutex lock/unlock                           25   ns
L3 cache hit (modified in another core)     29   ns                      75 cycles
L3 cache hit (on a remote CPU socket)       40   ns                      100 ~ 300 cycles (40 ~ 116 ns)
QPI hop to another CPU (time per hop)       40   ns
64MB main memory reference (local CPU)      46   ns                      TinyMemBench on "Broadwell" E5-2690v4
64MB main memory reference (remote CPU)     70   ns                      TinyMemBench on "Broadwell" E5-2690v4
256MB main memory reference (local CPU)     75   ns                      TinyMemBench on "Broadwell" E5-2690v4
Intel Optane persistent memory random write 94   ns                      UCSD Non-Volatile Systems Lab
256MB main memory reference (remote CPU)   120   ns                      TinyMemBench on "Broadwell" E5-2690v4
Intel Optane persistent memory random read 305   ns                      UCSD Non-Volatile Systems Lab
Send 4KB over 100 Gbps HPC fabric        1,040   ns        1 us          MVAPICH2 over Intel Omni-Path / Mellanox EDR
Compress 1KB with Google Snappy          3,000   ns        3 us
Send 4KB over 10 Gbps ethernet          10,000   ns       10 us
Write 4KB randomly to NVMe SSD          30,000   ns       30 us          DC P3608 NVMe SSD (best case; QOS 99% is 500us)
Transfer 1MB to/from NVLink GPU         30,000   ns       30 us          ~33GB/sec on NVIDIA 40GB NVLink
Transfer 1MB to/from PCI-E GPU          80,000   ns       80 us          ~12GB/sec on PCI-Express x16 gen 3.0 link
Read 4KB randomly from NVMe SSD        120,000   ns      120 us          DC P3608 NVMe SSD (QOS 99%)
Read 1MB sequentially from NVMe SSD    208,000   ns      208 us          ~4.8GB/sec DC P3608 NVMe SSD
Write 4KB randomly to SATA SSD         500,000   ns      500 us          DC S3510 SATA SSD (QOS 99.9%)
Read 4KB randomly from SATA SSD        500,000   ns      500 us          DC S3510 SATA SSD (QOS 99.9%)
Round trip within same datacenter      500,000   ns      500 us          One-way ping across Ethernet is ~250us
Read 1MB sequentially from SATA SSD  1,818,000   ns    1,818 us    2 ms  ~550MB/sec DC S3510 SATA SSD
Read 1MB sequentially from disk      5,000,000   ns    5,000 us    5 ms  ~200MB/sec server hard disk (seek time would be additional latency)
Random Disk Access (seek+rotation)  10,000,000   ns   10,000 us   10 ms
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms

Total CPU pipeline length?


NVIDIA Tesla GPU values
-----------------------
GPU Shared Memory access                    30   ns                      30~90 cycles (bank conflicts will introduce more latency)
GPU Global Memory access                   200   ns                      200~800 cycles, depending upon GPU generation and access patterns
Launch CUDA kernel on GPU               10,000   ns       10 us          Host CPU instructs GPU to start executing a kernel
Transfer 1MB to/from NVLink GPU         30,000   ns       30 us          ~33GB/sec on NVIDIA 40GB NVLink
Transfer 1MB to/from PCI-E GPU          80,000   ns       80 us          ~12GB/sec on PCI-Express x16 gen 3.0 link

Floating-point add/mult operation?
Shift operation?
Atomic operation in GPU Global Memory?
Total GPU pipeline length?
Launch CUDA kernel (via dynamic parallelism)?


Intel Xeon CPU values
---------------------
Wake up from C1 state                      500   ns                      varies from <0.5us to 2us
Wake up from C3 state                   15,000   ns       15 us          varies from 10us to 50us
Wake up from C6 state                   30,000   ns       30 us          varies from 20us to 60us

Warm up Intel Skylake AVX units         14,000   ns       14 us          AVX units go to sleep after ~675 us


Notes
-----
1 ns = 10^-9 seconds
1 us = 10^-6 seconds = 1,000 ns
1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns

Assumes a CPU clock frequency of 2.6GHz (common for Xeon server CPUs). That's ~0.385ns per clock cycle.
Assumes a GPU clock frequency of 1GHz (NVIDIA Tesla GPUs range from 0.8~1.4GHz). That's 1ns per clock cycle.
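
The cycle counts in the tables can be turned into times using the two assumed clock rates. A minimal sketch in Python; the helper name cycles_to_ns and the constant names are hypothetical, chosen only for illustration, and the results are only as good as the assumed clocks:

```python
CPU_GHZ = 2.6  # assumed CPU clock frequency, from the note above
GPU_GHZ = 1.0  # assumed GPU clock frequency, from the note above


def cycles_to_ns(cycles, clock_ghz):
    """Convert a cycle count to nanoseconds at the given clock (in GHz)."""
    # One cycle lasts 1 / clock_ghz nanoseconds.
    return cycles / clock_ghz


if __name__ == "__main__":
    rows = [
        ("L1 cache reference/hit", 4, CPU_GHZ),
        ("L3 cache hit (unshared cache line)", 42, CPU_GHZ),
        ("L3 cache hit (modified in another core)", 75, CPU_GHZ),
        ("GPU Shared Memory access (lower bound)", 30, GPU_GHZ),
    ]
    for label, cycles, ghz in rows:
        print(f"{label}: {cycles} cycles = {cycles_to_ns(cycles, ghz):.1f} ns")
```

At 2.6GHz this gives values close to the rounded 1.5 ns, 16 ns and 29 ns entries in the main table. A CPU running at a different clock will shift every cycle-derived row proportionally.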

"Local" and "Remote" cache/memory values are from dual-socket Intel Xeon. Larger SMP systems have more hops.

GPU NVLink connections are not always 40GB/sec. They range from 20GB/sec to 150GB/sec, depending upon the server platform design.
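
Because link bandwidth varies by platform, the "Transfer 1MB" and "Read 1MB sequentially" rows can be re-derived for other hardware by dividing size by sustained bandwidth. A minimal sketch in Python, assuming decimal units (1MB = 10^6 bytes, 1GB = 10^9 bytes), which is what the table's figures appear to use; the function name transfer_time_us is hypothetical, and the estimate ignores setup latency such as seek time or kernel launch:

```python
MB = 1_000_000      # assumed decimal megabyte
GB = 1_000_000_000  # assumed decimal gigabyte


def transfer_time_us(size_bytes, bandwidth_bytes_per_sec):
    """Estimate transfer time in microseconds from sustained bandwidth."""
    return size_bytes / bandwidth_bytes_per_sec * 1e6


if __name__ == "__main__":
    # Bandwidth figures are the ones quoted in the tables above.
    links = [
        ("NVLink GPU (~33GB/sec)", 33 * GB),
        ("PCI-E GPU (~12GB/sec)", 12 * GB),
        ("NVMe SSD (~4.8GB/sec)", 4.8 * GB),
        ("SATA SSD (~550MB/sec)", 550 * MB),
        ("Hard disk (~200MB/sec)", 200 * MB),
    ]
    for label, bandwidth in links:
        print(f"1MB over {label}: {transfer_time_us(MB, bandwidth):,.0f} us")
```

Substituting a different NVLink bandwidth into the first entry shows how the 30 us figure would change on another server platform.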


Credit
------
Adapted from:               https://gist.github.com/jboner/2841832
Original curator:           http://research.google.com/people/jeff/
Originally by Peter Norvig: http://norvig.com/21-days.html#answers

Additional Data Gathered/Correlated from:
-----------------------------------------
Memory latency tool:        https://github.com/ssvb/tinymembench
Persistent memory results:  https://arxiv.org/abs/1903.05714 (UCSD Non-Volatile Systems Lab)
CPU data from Agner Fog:    http://www.agner.org/optimize/
CPU cache and QPI data:     https://mechanical-sympathy.blogspot.com/2013/02/cpu-cache-flushing-fallacy.html
Intel performance analysis: https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
Intel Broadwell CPU data:   http://users.atw.hu/instlatx64/GenuineIntel00306D4_Broadwell2_NewMemLat.txt
Intel Skylake CPU data:     http://www.7-cpu.com/cpu/Skylake.html
MVAPICH2 fabric testing:    http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2016/DK_Status_and_Roadmap_MUG16.pdf
NVMe SSD:                   http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-p3608-spec.pdf
SATA SSD:                   http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3510-spec.pdf
GPU optimization:           https://www.olcf.ornl.gov/wp-content/uploads/2013/02/GPU_Opt_Fund-CW1.pdf
CPU/GPU data locality:      https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lecs.pdf
GPU Memory Hierarchy:       https://arxiv.org/pdf/1509.02308
Intel Xeon C-state data:    http://ena-hpc.org/2014/pdf/paper_06.pdf