Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms
Notes | |
----- | |
1 ns = 10^-9 seconds | |
1 us = 10^-6 seconds = 1,000 ns | |
1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns | |
Credit | |
------ | |
By Jeff Dean: http://research.google.com/people/jeff/ | |
Originally by Peter Norvig: http://norvig.com/21-days.html#answers | |
Contributions | |
------------- | |
'Humanized' comparison: https://gist.github.com/hellerbarde/2843375 | |
Visual comparison chart: http://i.imgur.com/k0t1e.png
I just created flash cards for this: https://ankiweb.net/shared/info/3116110484 They can be downloaded using the Anki application: http://ankisrs.net
I'm also missing something like "Send 1 MB over 1 Gbps network (within datacenter, over TCP)". Or does that vary so much that it would be impossible to specify?
If L1 access is a second, then:
L1 cache reference : 0:00:01
Branch mispredict : 0:00:10
L2 cache reference : 0:00:14
Mutex lock/unlock : 0:00:50
Main memory reference : 0:03:20
Compress 1K bytes with Zippy : 1:40:00
Send 1K bytes over 1 Gbps network : 5:33:20
Read 4K randomly from SSD : 3 days, 11:20:00
Read 1 MB sequentially from memory : 5 days, 18:53:20
Round trip within same datacenter : 11 days, 13:46:40
Read 1 MB sequentially from SSD : 23 days, 3:33:20
Disk seek : 231 days, 11:33:20
Read 1 MB sequentially from disk : 462 days, 23:06:40
Send packet CA->Netherlands->CA : 3472 days, 5:20:00
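For anyone who wants to regenerate this scaled-up table themselves, here's a minimal Rust sketch (same language as the signing benchmark later in this thread). The only assumption is the scale factor: 0.5 ns becomes 1 second, i.e. multiply each latency by two and read the result as seconds:

```rust
fn main() {
    // (operation, latency in ns) straight from the table above
    let ops = [
        ("L1 cache reference", 0.5),
        ("Branch mispredict", 5.0),
        ("L2 cache reference", 7.0),
        ("Mutex lock/unlock", 25.0),
        ("Main memory reference", 100.0),
        ("Compress 1K bytes with Zippy", 3_000.0),
        ("Send 1K bytes over 1 Gbps network", 10_000.0),
        ("Read 4K randomly from SSD", 150_000.0),
        ("Read 1 MB sequentially from memory", 250_000.0),
        ("Round trip within same datacenter", 500_000.0),
        ("Read 1 MB sequentially from SSD", 1_000_000.0),
        ("Disk seek", 10_000_000.0),
        ("Read 1 MB sequentially from disk", 20_000_000.0),
        ("Send packet CA->Netherlands->CA", 150_000_000.0),
    ];
    for (name, ns) in ops {
        // 0.5 ns -> 1 s means every nanosecond becomes 2 "seconds".
        let mut secs = (ns * 2.0) as u64;
        let days = secs / 86_400;
        secs %= 86_400;
        let (h, m, s) = (secs / 3_600, (secs % 3_600) / 60, secs % 60);
        if days > 0 {
            println!("{name:35}: {days} days, {h}:{m:02}:{s:02}");
        } else {
            println!("{name:35}: {h}:{m:02}:{s:02}");
        }
    }
}
```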
You can add LTO4 tape seek/access time: ~55 sec, or 55,000,000,000 ns
I'm missing things like sending 1K via Unix pipe/socket/TCP to another process.
Does anybody have numbers on that?
@metakeule it's easily measurable.
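To back that up with a concrete starting point, here's a rough sketch of one way to measure it (Unix only): bounce a 1 KB message over a Unix-domain socket pair between two threads and time the round trips. This is throwaway code, not a rigorous benchmark, and the numbers will vary a lot with OS, scheduler, and load:

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let (mut a, mut b) = UnixStream::pair()?;

    // Echo thread: read 1 KB, write it straight back, 10,000 times.
    let echo = std::thread::spawn(move || -> std::io::Result<()> {
        let mut buf = [0u8; 1024];
        for _ in 0..10_000 {
            b.read_exact(&mut buf)?;
            b.write_all(&buf)?;
        }
        Ok(())
    });

    let msg = [0x42u8; 1024];
    let mut buf = [0u8; 1024];
    let start = Instant::now();
    for _ in 0..10_000 {
        a.write_all(&msg)?;
        a.read_exact(&mut buf)?;
    }
    let elapsed = start.elapsed();
    echo.join().unwrap()?;

    // Each iteration is one round trip, i.e. two 1 KB sends.
    println!("avg round trip: {:?}", elapsed / 10_000);
    Ok(())
}
```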
Related page from "Systems Performance" with similar second scaling mentioned by @kofemann: https://twitter.com/rzezeski/status/398306728263315456/photo/1
An L1D hit on a modern Intel CPU (Nehalem+) is at least 4 cycles. For a typical server/desktop at 2.5 GHz that is at least 1.6 ns.
The fastest L2 hit latency is 11 cycles (Sandy Bridge+), which is 2.75x, not 14x.
Maybe the numbers by Norvig were true at some point, but cache latency numbers have been pretty constant since Nehalem, which was 6 years ago.
Please note that Peter Norvig first published this expanded version (at that location: http://norvig.com/21-days.html#answers) around July 2010 (see the Wayback Machine). Also, note that it was "Approximate timing for various operations on a typical PC".
One light-nanosecond is roughly a foot, which is considerably less than the distance to my monitor right now. It's kind of surprising to realize just how much a CPU can get done in the time it takes light to traverse the average viewing distance...
@jboner, I would like to cite some numbers in a formal publication. Who is the author? Jeff Dean? Which url should I cite? Thanks.
I'd like to see the number for "Append 1 MB to file on disk".
The "Send 1K bytes over 1 Gbps network" doesn't feel right, if you were comparing the 1MB sequential read of memory, SSD, Disk, the Gbps network for 1MB would be faster than disk (x1024), that doesn't feel right.
A great solar system type visualisation: http://joshworth.com/dev/pixelspace/pixelspace_solarsystem.html
Can you update the Notes section with the following?
1 ns = 10^-9 seconds
1 ms = 10^-3 seconds
Thanks.
@misgeatgit Updated
Zippy is nowadays called Snappy. Might be worth updating. Thx for the gist.
Several of the recent comments are spam. The links lead to sites in India which have absolutely nothing to do with latency.
Are there any numbers about latency between NUMA nodes?
Sequential SSD speed is actually more like 500 MB/s, not 1000 MB/s for SATA drives (http://www.tomshardware.com/reviews/ssd-recommendation-benchmark,3269.html).
You really should cite the folks at Berkeley. Their site is interactive, has been up for 20 years, and it is where you "sourced" your visualization. http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
Question: do these numbers not vary from one set of hardware to the next? How can they be accurate for all different types of RAM, CPU, motherboard, hard drive, etc.?
(I am primarily a front-end JS dev; I know little to nothing about this side of programming, where one must consider numbers involving RAM and CPU. Forgive me if I'm missing something obvious.)
The link to the animated presentation is broken, here's the correct one: http://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development
Love this one.
The mentioned gist https://gist.github.com/2843375 is private or was removed. Can someone restore it? Thanks!
It would be nice to be able to compare this to computation times -- How long to do an add, xor, multiply, or branch operation?
Last year, I came up with this concept for an infographic illustrating these latency numbers with time analogies (if 1 CPU cycle = 1 second). Here was the result: http://imgur.com/8LIwV4C
Most of these numbers were valid in 2000-2001; right now some of them are wrong by an order of magnitude (especially reading from main memory, as DRAM bandwidth doubles every 3 years).
µs, not us
I realize this was published some time ago, but the following URLs are no longer reachable/valid:
- https://gist.github.com/2843375
- http://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development/latency.txt
However, the second URL should now be: https://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development/
Oh, and @mpron - nice!
Thank you @jboner
Note: I created my own "fork" of this.
Thank you @GLMeece
Google it
Median human reaction time (to some stimulus showing up on a screen): 270 ms
(value probably increases with age)
https://www.humanbenchmark.com/tests/reactiontime/statistics
Awesome info. Thanks!
Could you please add printf & fprintf to this list
Heh, imagine this transposed into human distances.
1ns = 1 step, or 2 feet.
L1 cache reference = reaching 1 foot across your desk to pick something up
Datacentre roundtrip = 94 mile hike.
Internet roundtrip (California to Netherlands) = Walk around the entire earth. Wait! You're not done. Then walk from London, to Havana. Oh, and then to Jacksonville, Florida. Then you're done.
The last link is giving a 404
The numbers "Read 1 MB sequentially from memory" mean a memory bandwidth of 4 GB/s. That is a very old number. Can you update it? The time should be roughly 1/5th - one core can do about 20 GB/s today, all cores of a 4 or 8 core about 40 GB/s together. I remember seeing 18-19 GB/s in memtest86 for single core on my Ryzen 1800X and there are several benchmarks floating around where all cores do about 40 GB/s. It is very hard to find anything on the web about single core memory bandwidth...
Good information, thanks.
http://ram.userbenchmark.com/
RAM has gotten slightly faster: it is 70 ns now.
Edit: I was wrong. https://developers.redhat.com/blog/2016/03/01/reducing-memory-access-times-with-caches/
Is there an updated version of the latency table?
Nice gist. Thanks @jboner.
Links are dead:
https://gist.github.com/2843375
http://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development/latency.txt
@jboner let's remove them.
https://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development/
This prezi presentation is reversed: the larger numbers are inside the smaller ones, instead of the logical opposite.
The humanized version can be found at: https://gist.github.com/hellerbarde/2843375
Thanks. Updated.
Where is the xkcd version?
This one is nice https://gist.github.com/hellerbarde/2843375#gistcomment-1896153
Just use logarithms directly: https://gist.github.com/negrinho/8a8b45a8958a8653054aa2b349b4cb05
Thanks
Are there any resources where one can test oneself with tasks involving these numbers?
E.g., calculate how much time it will take to read 5 MB from a DB in another datacenter and get it back.
That would be a great test of applying these numbers to some real use cases.
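As a back-of-envelope worked example for that exact task, using only the 2012 figures in the table (and assuming "another datacenter" means an intercontinental hop like the CA->Netherlands number, a DB that reads from SSD, and a 1 Gbps path end to end):

```rust
fn main() {
    // All inputs are the ~2012 approximations from the table, in ms.
    let rtt = 150.0;                          // Send packet CA->Netherlands->CA
    let ssd_read_per_mb = 1.0;                // Read 1 MB sequentially from SSD
    let net_per_mb = 10.0 * 1024.0 / 1000.0;  // 10 us per KB over 1 Gbps ~= 10.24 ms/MB
    let mb = 5.0;

    let total = rtt                  // one request/response round trip
        + mb * ssd_read_per_mb       // the DB reads 5 MB from its SSD
        + mb * net_per_mb;           // 5 MB crosses the 1 Gbps link
    println!("~{total:.0} ms");      // ~206 ms, dominated by RTT + transfer
}
```

So roughly 200 ms: the intercontinental round trip and the network transfer dwarf the storage read.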
I think, given the increased use of GPUs/TPUs, it might be interesting to add numbers like: sending 1 MB over PCI Express to GPU memory, computing 100 prime numbers per GPU core compared to a CPU core, reading 1 MB from GPU memory, etc.
useful information & thanks
Some data in the Berkeley interactive version (https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html) is estimated, e.g. 4 µs in 2019 to read 1 MB sequentially from memory; that seems too fast.
This is a great idea.
How about the time to complete a DNS request: a UDP request and response with a DNS server that has, say, a 1 ms response time and is 5 ms of packet time-of-flight away?
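Taking those stated figures at face value, a back-of-envelope answer: 5 ms time-of-flight out, 1 ms server response, 5 ms back, so 5 + 1 + 5 = 11 ms for a single UDP request/response, ignoring OS overhead and retries.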
What effect on latency does using multiple native threads have, assuming proper mutex locking? Say you have:
- a 1024 ns operation on data in L1 cache
- 2 x mutex lock/unlock (50 ns)
- moving the data from/to main memory (200 ns)
Also, I wonder about malloc latency; can you say anything about it? It is definitely missing, since by owning the data I can compute on it without any lock.
Interesting to see at a glance, but wouldn't it be better to use one unit in the comparison, e.g. a 4K memory page?
Nanoseconds
It's an excellent explanation. I had to search for the video because the account was closed. Here's the result I got: https://www.youtube.com/watch?v=9eyFDBPk4Yw
Send 1K bytes over 1 Gbps network 10,000 ns 10 us
This doesn't look right to me. 1 Gbps = 125,000 KB/s, so the time should be 1/125,000 = 8 × 10^-6 seconds, which is 8,000 ns.
For a direct host-to-host connection with 1000BaseT interfaces, a wire latency of 8µs is correct.
However, if the hosts are connected using SGMII, the Serial Gigabit Media Independent Interface, data is 8b10b encoded, meaning 10 bits are sent for every 8 bits of data, leading to a latency of 10µs.
Jeff may also have been referring to the fact that in a large cluster you'll have a few switches between the hosts, so even where 1000BaseT is in use, the added switching latency (even for switches operating in cut-through mode) for, say, 2 switches can approach 2µs.
In any event, the main thing to take away from these numbers is the orders-of-magnitude difference in latency between the various methods of I/O.
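To make the arithmetic behind those two figures explicit, a tiny sketch (assuming, as in the comment above, that 1K means 1,000 bytes):

```rust
fn main() {
    let bits = 1_000.0 * 8.0; // 1 KB payload in bits
    let rate = 1e9;           // 1 Gbps line rate
    println!("raw wire time: {} us", bits / rate * 1e6);                // 8 us
    println!("with 8b10b:    {} us", bits * (10.0 / 8.0) / rate * 1e6); // 10 us
}
```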
Are these numbers still relevant in 2020? Or this needs an update?
I think hardware is so expensive that they can't update it~
One misleading thing is that different units are used for the 1 Gbps network send versus reading 1 MB from RAM. RAM is at least 20x faster, yet it ranks below the network send, which is misleading. The same 1 MB should have been used for both network and RAM.
need a solar system type visualization for this, so we can really appreciate the change of scale.
Hi, I liked your request and made a comparison. The unit is the mass of the Earth, not its radius.
Operation | Time in Nanoseconds | Astronomical Mass Equivalent |
---|---|---|
L1 cache reference | 0.5 ns | 1/2 Earth or Five times Mars |
Branch mispredict | 5 ns | 5 Earths |
L2 cache reference | 7 ns | 7 Earths |
Mutex lock/unlock | 25 ns | Roughly [Uranus +Neptune] |
Main memory reference | 100 ns | Roughly Saturn + 5 Earths |
Compress 1K bytes with Zippy | 3,000 ns | 10 Jupiters |
Send 1K bytes over 1 Gbps network | 10,000 ns | 20 Times All the Planets of the Solar System |
Read 4K randomly from SSD* | 150,000 ns | 1.6 times Red Dwarf Wolf 359 |
Read 1 MB sequentially from memory | 250,000 ns | Quarter of the Sun |
Round trip within same datacenter | 500,000 ns | Half of the Mass of Sun |
Read 1 MB sequentially from SSD* | 1,000,000 ns | Sun |
Disk seek | 10,000,000 ns | 10 Suns |
Read 1 MB sequentially from disk | 20,000,000 ns | Red Giant R136a2 |
Send packet CA->Netherlands->CA | 150,000,000 ns | An Intermediate Sized Black Hole |
https://docs.google.com/spreadsheets/d/13R6JWSUry3-TcCyWPbBhD2PhCeAD4ZSFqDJYS1SxDyc/edit?usp=sharing
For me the best way of making this "more human relatable" would be to treat nanoseconds as seconds and then convert the large values.
e.g. 150,000,000 s ≈ 4.75 years
I've been doing some more work inspired by this, surfacing more numbers, and adding throughput.
Is there a 2021 updated edition?
@sirupsen I love your project and I'm signed up for the newsletter. Currently making Anki flashcards :)
There are some large discrepancies between your numbers and the ones found here (not sure where these numbers came from):
https://colin-scott.github.io/personal_website/research/interactive_latency.html
I'm curious what's causing them. Specifically, 1MB sequential memory read: 100us vs 3us.
@ellingtonjp My program is getting ~100 us, and this one says 250 us (from 2012). That lines up with some increases in performance since :) Not sure how you got 3 us
@sirupsen I was referring to the numbers here https://colin-scott.github.io/personal_website/research/interactive_latency.html
The 2020 version of "Read 1,000,000 bytes sequentially from memory" shows 3us. Not sure where that comes from though. Yours seems more realistic to me
Ahh, sorry, I read your message too quickly. Yeah, it's unclear to me how someone would get 3 us. The code I use for this is very simple; it took reading the x86 assembly a few times to ensure that the compiler didn't optimize it out. I do summing, which is one of the lightest workloads you could do in a loop like that, so I think it's quite realistic. Maybe that person's loop was optimized out? 🤷
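For anyone who wants to reproduce that kind of measurement, here's a minimal sketch of my own (not @sirupsen's actual code) that uses std::hint::black_box to keep the compiler from eliding the summing loop; treat the result as a rough single-core figure:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    // 1 GiB of ones, so we stream well past every cache level.
    let data = vec![1u8; 1 << 30];
    let start = Instant::now();
    let mut sum = 0u64;
    for &b in &data {
        sum += b as u64;
    }
    // black_box marks `sum` as used, so the loop can't be optimized away.
    black_box(sum);
    let elapsed = start.elapsed();
    let gb_per_s = data.len() as f64 / 1e9 / elapsed.as_secs_f64();
    println!("~{gb_per_s:.1} GB/s sequential read (summing u8s)");
}
```

Whether this is memory-bound or compute-bound depends on how well the compiler vectorizes the loop, so checking the generated assembly, as noted above, is still worthwhile.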
To everyone interested in numbers like this:
@sirupsen's project is really good. He gave an excellent talk on the "napkin math" skill and has a newsletter with monthly challenges for practicing putting these numbers to use.
Newsletter: https://sirupsen.com/napkin/
Github: https://github.com/sirupsen/napkin-math
Talk: https://www.youtube.com/watch?v=IxkSlnrRFqc
:)
Light round trip to the moon 2,510,000,000 ns 2,510,000 us 2,510 ms 2.51 s
What about register access timings?
Markdown version :p

Operation | ns | µs | ms | note
---|---|---|---|---
L1 cache reference | 0.5 ns | | |
Branch mispredict | 5 ns | | |
L2 cache reference | 7 ns | | | 14x L1 cache
Mutex lock/unlock | 25 ns | | |
Main memory reference | 100 ns | | | 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy | 3,000 ns | 3 µs | |
Send 1K bytes over 1 Gbps network | 10,000 ns | 10 µs | |
Read 4K randomly from SSD* | 150,000 ns | 150 µs | | ~1GB/sec SSD
Read 1 MB sequentially from memory | 250,000 ns | 250 µs | |
Round trip within same datacenter | 500,000 ns | 500 µs | |
Read 1 MB sequentially from SSD* | 1,000,000 ns | 1,000 µs | 1 ms | ~1GB/sec SSD, 4X memory
Disk seek | 10,000,000 ns | 10,000 µs | 10 ms | 20x datacenter roundtrip
Read 1 MB sequentially from disk | 20,000,000 ns | 20,000 µs | 20 ms | 80x memory, 20X SSD
Send packet CA -> Netherlands -> CA | 150,000,000 ns | 150,000 µs | 150 ms |
@jboner What do you think about adding cryptography numbers to the list? I feel like that would be a really valuable addition to the list for comparison. Especially as cryptography usage increases and becomes more common.
We could, for instance, add Ed25519 latency for cryptographic signing and verification. In some very rudimentary testing I did locally I got:
- Ed25519 Signing - 254.20µs
- Ed25519 Verification - 368.20µs
You can replicate the results with the following Rust program:
```rust
// Requires the ed25519-zebra and rand crates.
fn main() {
    let msg = b"lfasjhfoihjsofh438948hhfklshfosiuf894y98s";
    let sk = ed25519_zebra::SigningKey::new(rand::thread_rng());

    // Time signing; take the elapsed time before printing, so the
    // println! itself isn't included in the measurement.
    let now = std::time::Instant::now();
    let sig = sk.sign(msg);
    let elapsed = now.elapsed();
    println!("{:?}", sig);
    println!("Signing elapsed: {:.2?}", elapsed);

    // Time verification.
    let vk = ed25519_zebra::VerificationKey::from(&sk);
    let now = std::time::Instant::now();
    vk.verify(&sig, msg).unwrap();
    let elapsed = now.elapsed();
    println!("Verification elapsed: {:.2?}", elapsed);
}
```
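One caveat on the numbers above: timing a single iteration with Instant like this includes warm-up and scheduling noise, so they are only indicative; a harness that averages many iterations (e.g. the criterion crate) would give steadier figures.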
What is "Zippy"? Is it a google internal compression software?
Send 1K bytes over 1 Gbps network 10,000 ns 10 us
This seems misleading: in common networking terminology, 1 Gbps refers to throughput ("size of the pipe"), but this list is about "latency", which is generally independent of throughput; the propagation delay is the same over a 1 Mbps network and a 1 Gbps network.
A better description of this measure sounds like "bit rate," or more specifically the "data signaling rate" (DSR) over some communications medium (like fiber). This also avoids the ambiguity of "over" the network (how much distance?) because DSR measures "aggregate rate at which data passes a point" instead of a segment.
Using this definition (which I just learned a minute ago), perhaps a better label would be:
- Send 1K bytes over 1 Gbps network 10,000 ns 10 us
+ Transfer 1K bytes over a point on a 1 Gbps fiber channel 10,000 ns 10 us
🤷 (also, I didn't check if the math is consistent with this labeling, but I did pull "fiber channel" from the table on the DSR wiki page)
Thanks for sharing your updates.
You could consider adding a context switch for threads right under disk seek:
computer context switches: 1e7 ns
I see "Read 1 MB sequentially from disk", but how about disk write?
The numbers are from Dr. Dean of Google and reflect the duration of typical computer operations in 2010. I hope someone can update them, as it's 2023.
The numbers should still be quite similar. They are based on physical limitations; only a significant technological leap can make a difference.
In any case, these are for estimates, not exact calculations. For example, a 1 MB read differs between SSDs, but it should be somewhere around the millisecond range.
It could be useful to add a column with the sizes in the hierarchy, and also a column with the minimal memory unit sizes, cache line sizes, etc. Then you could also divide the sizes by the latencies, which would give some kind of upper limit on a simple algorithm's throughput. Not really sure if this is useful, though.
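For example, using the table's own numbers: 1 MB read sequentially from memory in 250 µs works out to 1 MB / 250 µs = 4 GB/s, which is exactly the kind of throughput ceiling such a derived column would show.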
Here's a tool to visualize these numbers over time: http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html