@rygorous
Forked from anonymous/gist:5534375
Last active December 17, 2015 02:19
Instruction decode/dispatch variants

Okay, here are the op splitting/fusing strategies used by different cores, as far as I've been able to discern them:

  • Pentium: Complex instructions are U-pipe only but execute directly, they don't get split.
  • Atom (Bonnell/Saltwell): Certain complex instructions don't get split.
  • Pentium Pro/2/3: All ops get split into, tracked as, and executed as uOps.
  • Pentium 4: This never happened.
  • Pentium M/Core: All ops get split into uOps. Post-split, the core can fuse two types of multi-uOp sequences into a larger fused op used for tracking:
    • For stores, address generation + actual store uOps can get fused.
    • Read-modify (but not read-modify-write) fusion, aka "load-op" fusion. This is what Intel calls "micro-op fusion". The fused uOps are what's used in the scheduler/ROB etc. The complex ops are split into uOps for the purposes of execution, but the "accounting" in the core is all in terms of fused ops.
  • Core2 and later: Like Core, plus "macro-op fusion": Certain arithmetic and branch instructions can get fused into a single arithmetic-then-branch instruction.
  • Original Athlon (K7): Ops get decoded into "macro-ops", not uOps. Macro-ops can contain references to >2 source registers and memory references. Any instruction that generates more than one macro-op is microcoded. The macro-ops are then used for scheduling and dependency tracking. These complex ops are split into uOps right before execution, just like in the Pentium M. The difference from the Pentium M is that AMD decodes directly into macro-ops, while Intel first decodes to uOps and then fuses. Both Intel and AMD can thus treat the instruction "add eax, [mem]" as a single op for the purpose of scheduling, but they arrive there in different ways.
  • Athlon64 (K8) and later: Some instructions can now generate two macro-ops without going through the microcode path. Anything above 2 macro-ops is still microcoded. The rest is fairly similar.
  • Atom (Silvermont): From a decode/dispatch standpoint, this looks very similar to what the K7 did, as far as I can tell.
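
To make the "tracked ops vs. executed uOps" distinction concrete, here is a rough toy model in Python. It is only a sketch of the bookkeeping described in the list above, not a description of how any real decoder is built: the uOp names (`load`, `add`, `store_address`, `store_data`), the `UOPS` table, and the three `track_*` functions are all made up for illustration, and the model only covers simple instructions that fit in a single macro-op on the K7.

```python
# Toy model: how many entries the scheduler/ROB tracks vs. how many uOps execute.
# All names here are hypothetical; real uOp encodings are not public.

# Assumed uOp breakdown for a few simple instructions.
UOPS = {
    "add eax, ebx": ["add"],
    "add eax, [mem]": ["load", "add"],                  # load-op
    "mov [mem], eax": ["store_address", "store_data"],  # store
}

# uOp pairs the Pentium M style model is allowed to fuse after the split.
FUSIBLE_PAIRS = {("load", "add"), ("store_address", "store_data")}


def track_p6(instr):
    """Pentium Pro/2/3 style: split into uOps, track every uOp separately."""
    return [[u] for u in UOPS[instr]]


def track_pentium_m(instr):
    """Pentium M/Core style: split into uOps first, then fuse eligible pairs."""
    uops = UOPS[instr]
    tracked, i = [], 0
    while i < len(uops):
        if tuple(uops[i:i + 2]) in FUSIBLE_PAIRS:
            tracked.append(uops[i:i + 2])  # one fused op for tracking
            i += 2
        else:
            tracked.append(uops[i:i + 1])
            i += 1
    return tracked


def track_k7(instr):
    """K7 style: decode straight to one macro-op, split only at execution.

    Assumes the instruction fits in a single macro-op (true for this table).
    """
    return [list(UOPS[instr])]


def executed_uops(tracked):
    """Number of uOps that actually go down the execution units."""
    return sum(len(entry) for entry in tracked)


if __name__ == "__main__":
    for name, track in [("P6", track_p6), ("Pentium M", track_pentium_m), ("K7", track_k7)]:
        t = track("add eax, [mem]")
        print(f"{name}: {len(t)} tracked op(s), {executed_uops(t)} executed uOp(s)")
```

In this model, `track_pentium_m` and `track_k7` return the same result for "add eax, [mem]" (one tracked entry, two executed uOps), but they get there by different routes: the first splits and then fuses, the second never splits until execution.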