pervognsen/dram_latency_then_and_now.md

## dram_latency_then_and_now.md

      
One thing that surprises newer programmers is that the older 8-bit microcomputers from the 70s and 80s were designed
to run at the speed of random memory access to DRAM and ROM. The C64 was released in 1982, the year I was born, and its
6502-family CPU (the 6510) ran at roughly 1 MHz (give or take, depending on NTSC vs PAL). It had a 2-stage pipeline that
overlapped execution of the current instruction with the fetch of the next one. Cycle counting was simple to understand
and master since it was based almost entirely on the number of memory accesses (1 cycle each), with a 1-cycle penalty
for taken branches because the pipelined fetch of the next sequential instruction is wasted. So, the entire
architecture was based on keeping the memory subsystem busy 100% of the time by issuing a read or write every cycle.
Even one-byte instructions with no memory operands, like INX, take the minimum of 2 cycles and end up
redundantly issuing the same memory request two cycles in a row.
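
As a rough illustration of this style of cycle counting, here is a minimal Python sketch that totals the cycles of a small delay loop (`LDX #count`, then `DEX` / `BNE` until X reaches zero). The loop and the helper name `delay_loop_cycles` are made up for this example. The per-instruction counts are the commonly documented 6502 timings, and the sketch assumes the branch never crosses a page boundary (which would cost one more cycle).

```python
# Commonly documented 6502 cycle counts for the instructions in this example.
LDX_IMM = 2          # opcode fetch + operand fetch
DEX = 2              # opcode fetch + discarded fetch of the next byte
BNE_NOT_TAKEN = 2    # opcode fetch + offset fetch
BNE_TAKEN = 3        # same, plus the 1-cycle taken-branch penalty


def delay_loop_cycles(count: int) -> int:
    """Cycles for: LDX #count / loop: DEX / BNE loop (1 <= count <= 255).

    Assumes the branch never crosses a page boundary.
    """
    if not 1 <= count <= 255:
        raise ValueError("count must be a nonzero value that fits in the 8-bit X register")
    taken = count - 1  # BNE is taken on every iteration except the last
    return LDX_IMM + count * DEX + taken * BNE_TAKEN + BNE_NOT_TAKEN


cycles = delay_loop_cycles(255)
print(cycles)                # 1276
print(cycles / 1_000_000)    # seconds at a nominal 1 MHz clock
```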

You can play around with a silicon-accurate 6502 simulator here and see how internal registers change as you step:
http://visual6502.org/JSSim/expert.html

For reference, halfcyc is the cycle counter, phi0 is the phase, and AB is the 16-bit value on the address bus.

Given everything I said, you can infer that a random-access read/write to memory (ROM as well as DRAM) must complete
within a 1 MHz cycle, and therefore a random-access read cycle must take less than 1 microsecond.

Obviously this is very different from modern computers. A modern Intel processor, which is designed for fast L1 access,
still requires 4 to 5 cycles for a load from L1 cache, which is SRAM that's physically close to the load/store unit,
and L2 and L3 accesses are progressively slower, to say nothing of DRAM. So I thought it'd be fun to calculate the
DRAM read cycle time on a C64 and compare it with DRAM in a modern high-end PC. You might be shocked by the result!

The DRAM read cycle on the C64 must fit within the 1 MHz machine cycle, so while it was likely quite a bit faster than
that to provide some safety margin and account for wire delays, let's just use 1 microsecond as a conservative estimate.

Getting 1:1 comparison data from a modern DRAM datasheet is surprisingly hard. DRAMs are by their nature not designed for
true random access. They ideally want to do burst accesses from within the same DRAM row, so a lot of the timing
characteristics are broken down at that granularity, and it's hard to find the sum of all the durations we want.

What we're trying to measure is the following. Assume the DRAM currently has another row open in its SRAM row buffer.

1. We first have to write back the buffered row from the SRAM to its original row of DRAM cells.
2. We then have to precharge the row amplifier to prepare for buffering the new row. This precharging is necessary
   because the DRAM cells hold such a small charge that they won't be able to directly charge the row buffer's input gates
   to the logic-high voltage level. Instead we have a sensor array that is precharged to a high-gain metastable state
   between low and high where it is hyper-sensitive to tiny perturbations that will push it in either direction. Usually
   metastability is resolved over time by noise, but here we rely on the released DRAM charge to do so. (Metastability is
   often presented in digital logic textbooks targeting people with a non-analog background as a mysterious phenomenon,
   but it is in this state that a digital circuit most closely approximates a linear amplifier, so from the linear circuit
   design perspective the low and high states used in digital design are the annoying, degenerate ones!)
3. After precharging, we may then read the addressed DRAM row into the row buffer.
4. Finally, we may select the columns of bytes we're addressing from the row buffer.

These steps constitute a random-access read cycle, where consecutive accesses don't address the same DRAM row.
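
To make the difference between hitting the open row and paying for the full four-step cycle concrete, here is a toy Python model of a single DRAM bank. The class `ToyBank` and its cost values are hypothetical and in arbitrary time units, not taken from any datasheet. It also assumes that a row stays open until a different row is requested.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyBank:
    # Hypothetical per-step costs in arbitrary time units (not datasheet values).
    precharge_cost: int = 3   # steps 1-2: write back the open row, then precharge
    activate_cost: int = 3    # step 3: read the addressed row into the row buffer
    column_cost: int = 1      # step 4: select columns from the row buffer
    open_row: Optional[int] = None

    def read(self, row: int) -> int:
        """Return the cost of a read from `row` and remember which row is open."""
        cost = 0
        if self.open_row != row:
            if self.open_row is not None:
                cost += self.precharge_cost  # another row is open: close it first
            cost += self.activate_cost
            self.open_row = row
        return cost + self.column_cost


bank = ToyBank()
print(bank.read(0))  # 4: no row open yet, so activate + column select
print(bank.read(0))  # 1: same row, only the column select
print(bank.read(1))  # 7: different row, the full random-access cycle
```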

Let's start with a slightly older technology, an end-of-lifed SDR SDRAM part from Micron:
https://www.micron.com/~/media/documents/products/data-sheet/dram/512mb_sdr.pdf

In this datasheet, steps 1 and 2 correspond to the PRECHARGE command, step 3 corresponds to the ACTIVE command,
and step 4 corresponds to the READ command. Let's look at some timings:

    ACTIVE-to-PRECHARGE command:      t_RAS = 37 ns
    PRECHARGE command period:         t_RP  = 15 ns
    ACTIVE-to-ACTIVE command period:  t_RC  = 60 ns
    ACTIVE-to-READ delay:             t_RCD = 15 ns
    Total random-access read cycle:   t     = 127 ns
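
The arithmetic can be double-checked with a few lines of Python. This is only a sketch that follows the approach used here of adding the four datasheet parameters together. The names `SDR_TIMINGS_NS`, `C64_CYCLE_NS`, and `max_interlocked_clock_mhz` are invented for the example.

```python
# Timings in nanoseconds, copied from the values listed above.
SDR_TIMINGS_NS = {"t_RAS": 37, "t_RP": 15, "t_RC": 60, "t_RCD": 15}

C64_CYCLE_NS = 1000  # the conservative 1 microsecond estimate for the C64


def max_interlocked_clock_mhz(read_cycle_ns: float) -> float:
    """Highest clock (in MHz) at which one random read fits in one cycle."""
    return 1000.0 / read_cycle_ns


total_ns = sum(SDR_TIMINGS_NS.values())
print(total_ns)                                       # 127
print(round(C64_CYCLE_NS / total_ns, 2))              # 7.87
print(round(max_interlocked_clock_mhz(total_ns), 2))  # 7.87
```

The speedup over the C64 and the equivalent clock in MHz come out as the same number only because the C64 estimate is exactly 1000 ns.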

So, this is only about 8x faster than the DRAM powering the C64!

Another way of looking at it: If you wanted to run a modern CPU with interlocked DRAM access in the manner of
the 6502, you'd be limited to a clock rate of just under 8 MHz (1 / 127 ns ≈ 7.9 MHz) with this SDR SDRAM chip from Micron.

Let's look at something closer to the cutting edge with a DDR3 SDRAM from Micron:
https://www.micron.com/~/media/documents/products/data-sheet/dram/ddr3/1gb_1_35v_ddr3l.pdf

Let's pull the same timings again. I picked the 800 MHz part, so 1 cycle = 1.25 ns:

    t_RAS = 15 cycles
    t_RP  = 5 cycles
    t_RC  = 20 cycles
    t_RCD = 5 cycles
    Total = 45 cycles

Converting to units of time, that is 56.25 ns, so better than twice as fast as the SDR part. However,
this still would only let us clock a DRAM-synchronous CPU at around 17.8 MHz!

Compare that to the fact that a modern CPU paired with that DDR3 RAM can peak at a rate of over 4 GHz,
with multiple cores, hardware threads, and a high level of instruction-level parallelism per thread!
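
For the DDR3 numbers, a short Python sketch makes the unit conversion explicit and also shows how many CPU cycles one such random access would span. The constant names are made up for the example, and the flat 4 GHz CPU clock is a round-number assumption based on the "over 4 GHz" figure above.

```python
DRAM_CLOCK_MHZ = 800   # the DDR3 part picked above
CPU_CLOCK_GHZ = 4.0    # assumed round number for the peak CPU clock
DDR3_TIMINGS_CYCLES = {"t_RAS": 15, "t_RP": 5, "t_RC": 20, "t_RCD": 5}

cycle_ns = 1000 / DRAM_CLOCK_MHZ                   # duration of one DRAM clock cycle
total_cycles = sum(DDR3_TIMINGS_CYCLES.values())   # same sum as in the text
total_ns = total_cycles * cycle_ns
interlocked_mhz = 1000 / total_ns                  # clock of a DRAM-synchronous CPU
cpu_cycles_per_access = total_ns * CPU_CLOCK_GHZ   # ns * (cycles per ns)

print(total_cycles)                # 45
print(total_ns)                    # 56.25
print(round(interlocked_mhz, 2))   # 17.78
print(cpu_cycles_per_access)       # 225.0
```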
I hope I didn't screw up these calculations or misinterpret the datasheet timings. Corrections welcome!