- "Iron law": 1/Perf = time/program = instructions/program (instruction count) * cycles/instruction (CPI) * time/cycle (cycle time)
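- "Amdahl's law": speedup = 1 / time = 1 / ((1-f)+(f/N)), where f is the fraction of the work that is sped up, N is the speedup of that fraction, and time is normalized to the original execution time
- speedup is limited by the sequential bottleneck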
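The two formulas above can be checked with a few lines of arithmetic. This is a minimal sketch; the function names and the workload numbers (1e9 instructions, CPI of 1.5, 1 ns cycle, f = 0.9) are made up for illustration.

```python
def execution_time(instructions, cpi, cycle_time_s):
    """Iron law: time/program = instruction count * CPI * cycle time."""
    return instructions * cpi * cycle_time_s


def amdahl_speedup(f, n):
    """Amdahl's law: fraction f of the work is sped up by a factor n."""
    return 1.0 / ((1.0 - f) + f / n)


# Hypothetical workload: 1e9 instructions, CPI 1.5, 1 ns cycle (1 GHz).
print(execution_time(1e9, 1.5, 1e-9))  # about 1.5 seconds

# 90% of the work sped up 10x gives roughly 5.26x overall.
print(amdahl_speedup(0.9, 10))

# Even with a huge N, the speedup stays just under 1 / (1 - f) = 10.
print(amdahl_speedup(0.9, 1e12))
```

The last call illustrates the sequential bottleneck: the remaining 10% of the work caps the speedup at 10x no matter how large N gets.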
- Three possible data dependences between two instructions: true (RAW), anti (WAR), and output (WAW). These also apply to memory data dependences (not applicable in a simple five-stage pipeline).
- There are also control dependences.
- Consider the load-use hazard. An in-order pipeline must stall its front end until the loaded value is available.
- Locality of reference is an observed property of program execution, two types: temporal and spatial
- Reordering in the memory controller is subject to the same correctness requirements as a pipelined processor (i.e., RAW, WAR, WAW)
- For VM, the program communicates with the VM subsystem through the page fault exception. The page fault, however, is not implicitly handled by hardware. Control is transferred to the OS. The OS allocates a page, creates a VA-PA mapping, and installs the contents of the page into physical memory, then returns control to the running program.
- Virtual memory systems use a fully associative policy for placing pages in physical memory.
- For eviction, LRU is too expensive. An approximation of LRU is used instead. Each page in physical memory maintains a reference bit that is set by the hardware whenever a reference occurs to that page. The OS intermittently clears the reference bits. When the virtual memory system needs to find a page to evict, it randomly chooses a page from the set of pages with cleared reference bits.
- The virtual memory subsystem provides a means of memory protection. VA to PA mappings are cached in the TLB. Each entry has protection bits (RWXC, etc.)
- A page table needs a table entry for EVERY possible page-sized block in the VA space of the process using the page table. Page tables are usually structured into multiple sections. Hardware page table walker?
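The reference-bit approximation of LRU described in the eviction note above can be sketched as a toy model. The class and method names are hypothetical, and the fallback when every page has its bit set is an assumption (the notes do not say what happens in that case).

```python
import random


class FrameTable:
    """Toy model of reference-bit eviction (an approximation of LRU)."""

    def __init__(self, num_frames, seed=0):
        self.ref_bit = [False] * num_frames
        self.rng = random.Random(seed)  # seeded so runs are repeatable

    def touch(self, frame):
        # In a real system the hardware sets this bit on every reference.
        self.ref_bit[frame] = True

    def clear_all(self):
        # In a real system the OS does this intermittently.
        self.ref_bit = [False] * len(self.ref_bit)

    def pick_victim(self):
        # Choose randomly among pages not referenced since the last clear.
        candidates = [f for f, r in enumerate(self.ref_bit) if not r]
        if not candidates:
            # Assumption: if every page was referenced, any frame may go.
            candidates = list(range(len(self.ref_bit)))
        return self.rng.choice(candidates)


frames = FrameTable(4)
frames.clear_all()
frames.touch(0)
frames.touch(2)
victim = frames.pick_victim()
print(victim)  # one of the untouched frames, 1 or 3
```

Pages 0 and 2 were referenced after the last clear, so only frames 1 and 3 are eligible for eviction.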
- Six-stage template (TEM) superscalar pipeline: fetch, decode, dispatch, execute, complete, retire
- Instruction fetch
- Superscalar must be capable of fetching more than one instruction from the I-cache every cycle.
- Fetch groups go into the instruction buffer. Problems arise when the PC is misaligned or when the fetch group contains branches.
- Instruction decode
- May be necessary to further decode to uops.
- Branch address calculation happens here (BTB lookup?)
- Instruction dispatch
- Prior to execution, an instruction must have ALL of its operands. Rather than stalling the decode stage, fetch the operands that are ready and advance these instructions into a separate buffer (i.e., a reservation station). The reservation station can be centralized or distributed.
- Instruction execution
- Instruction completion and retirement
- An instruction is completed when it finishes execution and updates machine state.
- Interrupts are handled here. One option is to stall fetch and decode until all previous work is done, then take the exception.
- Exceptions are induced by instruction execution. They are looked at here.
Chapter 5 - Superscalar Techniques
- Branch target speculation (BTB) and branch condition speculation (predictor, NN,TT,NT,TN bits)
- Register data flow uses register renaming. Separate rename register file and architected register file. ROB at end of pipe.
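To make the two kinds of branch speculation above concrete, here is a rough sketch that pairs a BTB (target speculation) with two history bits per branch (condition speculation). The policy of predicting taken unless the last two outcomes were both not-taken (NN) is an assumption chosen for illustration, not necessarily the predictor FSM the notes refer to. The class name and the addresses are hypothetical.

```python
class BranchPredictor:
    def __init__(self):
        self.btb = {}      # branch PC -> last seen target
        self.history = {}  # branch PC -> last two outcomes, e.g. "NT"

    def predict(self, pc, fallthrough):
        hist = self.history.get(pc, "NN")
        # Assumed policy: predict not-taken only after two not-taken outcomes.
        taken = hist != "NN"
        if taken and pc in self.btb:
            return True, self.btb[pc]
        return False, fallthrough

    def update(self, pc, taken, target):
        # Shift the new outcome into the two-bit history.
        hist = self.history.get(pc, "NN")
        self.history[pc] = hist[1] + ("T" if taken else "N")
        if taken:
            self.btb[pc] = target


bp = BranchPredictor()
outcomes = [True, True, True, False, True]  # a loop branch that exits once
correct = 0
for actual in outcomes:
    predicted_taken, _ = bp.predict(0x40, fallthrough=0x44)
    correct += int(predicted_taken == actual)
    bp.update(0x40, actual, target=0x20)
print(correct, "of", len(outcomes), "predicted correctly")
```

The first prediction misses because the history starts at NN, and the loop exit misses because the history was TT at that point.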
5.3 Memory Data Flow Techniques
Address generation step is required. Involves accessing the specified register and adding the offset value.
Address translation required for virtual memory.
Third step is to access memory. This is different for superscalar.
- For a load instruction, as soon as the address register operand is available, it is issued into the functional unit. Then effective address generation occurs.
- A store must wait for availability of both the address register and the data register before it is issued.
The first pipe stage generates the effective address.
The second pipe stage translates the VA to a PA. Done using the TLB. The TLB is a cache of the page table that is stored in main memory.
A load instruction accesses memory during the third pipe stage. The load data is retrieved and written to either the rename register or the reorder buffer. At this point the load has finished execution, but is not complete.
Store instructions are considered finished (not complete) after the second stage. The register data to be stored to memory is kept in the reorder buffer. When the store is completed, the data is written to memory (this is the very simple case, I guess).
Instead of updating memory at completion, it is possible to move the data to a store buffer at completion. The store instructions are retired when the memory bus becomes available. With a store buffer, a store instruction can be architecturally complete, but not yet retired to memory.
During an exception the store buffer must be drained (not flushed) before the program is suspended (good to know!)
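The three steps for a load described above (effective address generation, VA-to-PA translation through the TLB, memory access) can be sketched as follows. This is a simplified model with assumed details: 4 KiB pages, dictionaries standing in for the register file, TLB, page table, and memory, and a `KeyError` standing in for a page fault.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


def effective_address(regs, base_reg, offset):
    # Stage 1: read the base register and add the offset.
    return regs[base_reg] + offset


def translate(va, tlb, page_table):
    # Stage 2: split the VA into virtual page number and page offset.
    vpn, page_offset = divmod(va, PAGE_SIZE)
    if vpn not in tlb:
        # TLB miss: fetch the mapping from the page table and cache it.
        # A missing page_table entry raises KeyError (models a page fault).
        tlb[vpn] = page_table[vpn]
    return tlb[vpn] * PAGE_SIZE + page_offset


regs = {"r1": 0x1000}
page_table = {0x1: 0x7}  # virtual page 1 -> physical frame 7
tlb = {}
memory = {0x7010: 42}

va = effective_address(regs, "r1", 0x10)
pa = translate(va, tlb, page_table)
value = memory[pa]  # Stage 3: access memory with the physical address
print(hex(va), hex(pa), value)
```

After the first translation the mapping sits in `tlb`, so a second access to the same page skips the page table lookup.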
5.3.2 Ordering of Memory Accesses
- Memory data dependency exists between two load/store instructions if they both reference the same memory location (aliasing, collision of two memory addresses).
- RAW, WAW, WAR dependencies can exist. These all must be enforced to maintain correct semantics of the program.
- To facilitate recovery from exceptions, the sequential state of memory must be preserved.
- Consistency semantics must also be followed.
5.3.3 Load Bypassing and Load Forwarding
- Bypassing allows a trailing load to be executed earlier than a preceding store. Relax ST-LD.
sw 0x1 lw 0x2 <-- bypass store, relax store-load ordering
Memory RAW not considered in bypassing case.
When a store instruction is dispatched to the reservation station, an entry in the reorder buffer is allocated to it. It remains in the reservation station until all source operands are available and it is issued to the execution unit. Once the memory address is generated, the operation is finished and placed in the store buffer. The ROB is also updated.
Two parts of store buffer: Finished (not yet architecturally complete) and Completed (architecturally complete).
When exception occurs, stores in completed portion must be drained. (what about finished?)
Key issue in implementing load bypassing is to check for possible aliasing with preceding stores. Cannot issue the load if there is aliasing. Can use a TAG in the store buffer entries (false positives, causes additional stalling).
What if a load is issued BEFORE a previous store is in the store buffer AND there is aliasing!!! <-- ROLLBACK
Forwarding requires full address checking. Also, the age of each store must be kept.
Load/store instructions can have different reservation stations.....
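The alias check behind load bypassing and load forwarding can be sketched with a small store buffer model. It assumes every preceding store already has its address in the buffer, so the rollback case noted above is not modeled. The class and method names are hypothetical, and list order stands in for store age.

```python
class StoreBuffer:
    def __init__(self):
        self.entries = []  # (address, value), oldest first

    def add_store(self, address, value):
        # A finished store waits here until it is retired to memory.
        self.entries.append((address, value))

    def load(self, address, memory):
        # Search youngest to oldest so the most recent aliasing store wins.
        for st_addr, st_value in reversed(self.entries):
            if st_addr == address:  # full address comparison
                return st_value, "forwarded"
        # No alias with any pending store: the load may bypass them.
        return memory.get(address, 0), "bypassed"

    def retire_oldest(self, memory):
        # Write the oldest store to memory, preserving program order.
        address, value = self.entries.pop(0)
        memory[address] = value


memory = {0x2: 7}
sb = StoreBuffer()
sb.add_store(0x1, 5)
sb.add_store(0x1, 6)

print(sb.load(0x2, memory))  # no alias, so the value is read from memory
print(sb.load(0x1, memory))  # alias, so the youngest store's data is used

sb.retire_oldest(memory)
sb.retire_oldest(memory)
print(memory[0x1])
```

Searching from the youngest entry is what the age requirement buys: with two pending stores to the same address, the load must see the later one.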