How Meltdown Works

Algorithm

  1. A secret byte you want to read is stored at inaccessible memory location priv_mem.
  2. The sender triggers an access exception by attempting to read priv_mem.
  3. Due to CPU optimization (out-of-order execution), the load of secret from priv_mem and the use of its value in (4) and (5) below may execute before the exception is triggered.
  4. Calculate an offset into a known array probe by multiplying secret by the width of a cache line. Because secret is a single byte, there are 256 possible offsets, and each one falls on its own cache line.
  5. Load probe[offset], which causes the CPU to cache exactly one chunk of our array, populating one cache line.
  6. The exception finally triggers, clearing the modified registers...but cached data is not excised.
  7. Iterate over all 256 offsets into probe to find out which one loads fast. You've determined the value of secret.
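
The encode and decode steps above can be sketched with a toy model. This is only an illustration under stated assumptions: the `toy_cache` struct, the function names, and the 64-byte `CACHE_LINE` are hypothetical, and a boolean flag stands in for "this line loads fast". Real code would time each load instead, and nothing here reads inaccessible memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CACHE_LINE 64   /* assumed cache line width in bytes */
#define NUM_VALUES 256  /* one probe slot per possible byte value */

/* Toy stand-in for the CPU cache: one "is cached" flag per probe line. */
typedef struct {
    bool line_cached[NUM_VALUES];
} toy_cache;

/* Preparation: make sure no part of probe is cached. */
static void flush_probe(toy_cache *c) {
    memset(c->line_cached, 0, sizeof c->line_cached);
}

/* Steps 4-5: compute the offset from secret and "load" probe[offset]. */
static void transient_touch(toy_cache *c, unsigned char secret) {
    size_t offset = (size_t)secret * CACHE_LINE;
    assert(offset / CACHE_LINE < NUM_VALUES);
    c->line_cached[offset / CACHE_LINE] = true;
}

/* Step 7: scan all 256 offsets and report the one that is cached. */
static int recover_secret(const toy_cache *c) {
    for (int v = 0; v < NUM_VALUES; v++) {
        size_t offset = (size_t)v * CACHE_LINE;
        if (c->line_cached[offset / CACHE_LINE]) {
            return v;
        }
    }
    return -1; /* nothing was cached */
}

/* Whole sequence: flush, transient access, then probe. Step 6 (the
   exception) is modeled by simply keeping the cache state afterwards. */
static int simulate_leak(unsigned char secret) {
    toy_cache c;
    flush_probe(&c);
    transient_touch(&c, secret);
    return recover_secret(&c);
}
```

In this model, `simulate_leak(0x2A)` returns 42: the only flag set is the one indexed by the secret, which mirrors why only one offset loads fast in step 7.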

Notes

  • The probe array is flushed from cache before this process, so only the secret-based offset gets cached.
  • The access exception triggers a memory fault, terminating the application, so the access is performed in another process (for example, a forked child).
  • This could possibly be mitigated in microcode translation by forcing an ordering guarantee on the access check and the subsequent memory loads, but that would cause all memory accesses (including legal ones) to wait for the access check.
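  • The kernel-level fix (kernel page-table isolation, KPTI on Linux) isolates the kernel's memory pages: most kernel memory is left unmapped in the page tables used while user code runs, so a user-mode load has no valid translation to leak through. The performance impact comes from switching page tables on every transition between user and kernel mode.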
  • AMD and most ARM processors are reportedly not affected by this exploit because they do not allow privileged memory reads to be executed before the access check (ARM lists a few cores, such as Cortex-A75, as affected). Hence, there are no cache effects to observe in step 7.
  • AMD (and some ARM, apparently) are vulnerable to the related Spectre attack, which uses branch prediction to trick the CPU into speculatively executing instructions it normally would not, but the vectors to transmit the results are more difficult to achieve.
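
The note about running the faulting access in a separate process can be sketched as follows. This is a minimal POSIX illustration, not exploit code: a page mapped with `PROT_NONE` stands in for priv_mem, and the function name `fault_in_child` is hypothetical. The child dies on the illegal read while the parent keeps running and could go on to probe the cache.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the signal that killed the child, 0 if the child exited
   normally, or -1 if setup failed. */
static int fault_in_child(void) {
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0) {
        return -1;
    }

    /* A page with no access rights plays the role of priv_mem. */
    void *page = mmap(NULL, (size_t)page_size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) {
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(page, (size_t)page_size);
        return -1;
    }
    if (pid == 0) {
        /* Child: the illegal read faults and the kernel kills us. */
        volatile unsigned char *priv_mem = page;
        unsigned char secret = *priv_mem;
        _exit(secret); /* not reached if the read faults */
    }

    /* Parent: survives the child's fault and collects its status. */
    assert(pid > 0);
    int status = 0;
    pid_t reaped = waitpid(pid, &status, 0);
    munmap(page, (size_t)page_size);
    if (reaped < 0) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On Linux the returned signal is typically `SIGSEGV`; some other systems report `SIGBUS` for a protection fault.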

lance Jan 5, 2018

Thanks - this is really clear and concise. A question or two, since this is all really a little out of my league.

  • When exactly does the probe array get flushed? At 5.1?
  • Order guarantee mitigation would also cause some serious performance impacts too, wouldn't it?
  • Is the AMD/ARM approach essentially the same as order guarantee, but it's implemented at the hardware level instead of microcode?


headius (gist owner) Jan 5, 2018

@lance I believe the flushing of the probe array must happen before the illegal access, so you are guaranteed it is uncached by the time you inspect it. I'm not sure why ARM/AMD don't have a perf impact from the ordering guarantee...I've had some trouble locating microcode for their memory accesses.



fred41 Feb 2, 2018

Thank you, well explained.

One small detail in step 4 is not exactly correct:
To calculate the offset, the secret should not be multiplied by the cache line size (usually 64), but by 4096 (the page size) instead.
This is necessary because the hardware prefetcher usually prefetches multiple cache lines in advance, but never prefetches across page boundaries.
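
The stride point in this comment can be illustrated with a toy model. Everything here is an assumption made for illustration: the function name `hot_candidates` is hypothetical, and the "prefetcher" simply pulls in the next cache line when it lies in the same 4096-byte page, which is far simpler than real hardware. The function counts how many of the 256 candidate slots look cached after a single touch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define LINE 64          /* assumed cache line width in bytes */
#define PAGE 4096        /* assumed page size in bytes */
#define NUM_VALUES 256
#define MAX_LINES (NUM_VALUES * PAGE / LINE)

/* Touch probe[secret * stride] in a toy cache with a next-line
   prefetcher, then count how many candidate slots appear cached. */
static int hot_candidates(unsigned char secret, size_t stride) {
    static bool cached[MAX_LINES];
    assert(stride >= LINE && stride <= PAGE && stride % LINE == 0);
    memset(cached, 0, sizeof cached);

    size_t addr = (size_t)secret * stride;
    size_t line = addr / LINE;
    cached[line] = true;

    /* Toy prefetcher: also fetch the next line, but never across a
       page boundary. */
    size_t next = line + 1;
    if (next < MAX_LINES && (next * LINE) / PAGE == addr / PAGE) {
        cached[next] = true;
    }

    /* Probe phase: a candidate is "hot" if its first line is cached. */
    int hot = 0;
    for (int v = 0; v < NUM_VALUES; v++) {
        if (cached[((size_t)v * stride) / LINE]) {
            hot++;
        }
    }
    return hot;
}
```

In this model, `hot_candidates(42, 64)` reports two hot candidates (42 and the prefetched neighbour 43), while `hot_candidates(42, 4096)` reports exactly one, because the prefetched line stays inside the page that belongs to candidate 42.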

