nuts-n-bits/Lec 3 notes.md

## Lec 3 notes.md

      
Lecture 3: ISA

Last time


Von Neumann Model
(Stored program + Sequential instruction) as opposed to dataflow
Algorithm
ISA
Moore's law
What is comp arch
Dataflow
ISA vs microarch

Microarchitecture


is a specific impl of the ISA
is not exposed to the software layer (we don't do that at this time)

e.g. pipelining NOT EXPOSED
e.g. Out of order execution NOT EXPOSED
e.g. memory access scheduling policy NOT EXPOSED
e.g. speculative execution NOT EXPOSED (mostly, see Spectre & Meltdown)
e.g. superscalar processing NOT EXPOSED (today)
and many more....


Is it part of the ISA or the uarch?

Opcode "+" ....................................... ISA
# of gen purpose registers ....................... ISA
# of ports to the register file .................. uarch
# of cycles to execute the MUL instr. ............ uarch
pipelining ....................................... uarch


REMEMBER: uarch is an impl of the ISA under specific design constraints and goals.
A design point is a set of design constraints and their importance
design point ==> leads to tradeoffs in both ISA and uarch.

This lecture:

ISA-level tradeoffs
uarch-level tradeoffs
system and task level tradeoffs
(how to divide labour between HW and SW)


MIPS, ARM, ALPHA are all ISAs.
The following are the two LC-3b ADD instruction layouts:
Layout 1:
15                    0
+----+---+---+-+--+---+
|0001|DR |SR1|0|00|SR2|
+----+---+---+-+--+---+

Layout 2:
15                    0
+----+---+---+-+------+
|0001|DR |SR1|1| imm5 |
+----+---+---+-+------+

Types of machines


0-address machine (stack machine)
Compile this for a stack machine: (7 + 5) x 8 x 9 = 864

push 9
push 8
push 5
push 7
add 
mul
mul
pop => 864
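
The stack program above can be checked with a tiny interpreter. This is only a sketch: the tuple-based program encoding and the function name `run_stack_program` are made up for illustration and do not correspond to any real ISA.

```python
def run_stack_program(program):
    """Execute (op, arg) tuples on an operand stack; return the popped result."""
    stack = []
    for op, arg in program:
        if op == "push":
            stack.append(arg)
        elif op == "add":
            # 0-address: both operands come implicitly from the stack top
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "pop":
            return stack.pop()
        else:
            raise ValueError(f"unknown op: {op}")
    raise ValueError("program ended without a pop")


# Same instruction sequence as in the notes: (7 + 5) x 8 x 9
program = [
    ("push", 9), ("push", 8), ("push", 5), ("push", 7),
    ("add", None), ("mul", None), ("mul", None), ("pop", None),
]
print(run_stack_program(program))  # 864
```

Note that no instruction names a source or destination operand, which is what makes it a 0-address machine.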


1-address machine: accumulator machine
2-address machine: x86 and many others
3-address machine: MIPS, LC-3b

Elements of ISA

Instructions


E.g. opcode
E.g. operand specifiers (addressing modes)

Data Types

E.g. int, float, char, binary, decimal, BCD (binary-coded decimal), doubly linked list, queue, bit, vector, string (implicit, explicit)
endianness of data is also an aspect of the ISA

"Semantic gap"

    Programming language
+---------------------------+  High
|  List / DoublyLinkedList  |  
|  struct / Queue / stack   | <-- ISA ?
+---------------------------+ 
|  string / float / decimal | <-- ISA ?
|  bigint                   |  
+---------------------------+
|  int / byte / char        | <-- ISA ?
|                           |
+---------------------------+  Low
      Control signals


Memory organization


Address space
Addressing granularity
byte addressable?
64-bit addressable?  <= some supercomputers
bit addressable?  <= rare
Support for virtual memory?

Registers


How many?
How long?


Why registers
Data temporal locality => reuse of data

Instruction Classes

Operate instructions


arithmetic / logical
fetch / compute / store
implicit sequential ctrl flow

Data movement instr's


Move data between memory and registers
PC++

Control flow instr's


JMP

Elements of ISA (Cont.)

Load/Store (L/S) vs Memory-to-memory (M2M)


L/S: operate only on registers, must load/store to interact with memory.
M2M: can operate directly on mem, can also load/store.


L/S: MIPS, ARM, other RISCs.
M2M: x86, VAX, other CISCs.

Addressing Modes


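Absolute: use an immediate value as the address (LW rt, 10000)
Register indirect: reg as pointer (LW rt, r)
Displacement: reg as pointer + offset (LW rt, r[offset])
Indexed: LW rt, r, index, where r and index are general-purpose registers
Mem indirect: reg -> mem[ptr] -> mem[data], i.e. data = mem[mem[reg]]
Auto inc/dec: reg as pointer, and the reg is incremented/decremented as a side effect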
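
A rough sketch of how each mode above finds its operand, using a toy memory and register file. The memory contents, register names and function names are all hypothetical and chosen only to make the modes distinguishable.

```python
# Toy machine state (values are arbitrary, for illustration only).
mem = [0] * 32
regs = {"r1": 8, "r2": 3}
mem[8] = 20     # mem[r1]: doubles as a pointer for the memory-indirect case
mem[10] = 111   # reached by absolute address 10
mem[11] = 222   # reached by r1 + r2 (indexed)
mem[12] = 333   # reached by r1 + 4 (displacement)
mem[20] = 444   # reached through the pointer stored at mem[8]


def load_absolute(addr):
    return mem[addr]                      # address is an immediate


def load_register_indirect(r):
    return mem[regs[r]]                   # register holds the address


def load_displacement(r, offset):
    return mem[regs[r] + offset]          # register + constant offset


def load_indexed(r, index):
    return mem[regs[r] + regs[index]]     # register + register


def load_memory_indirect(r):
    return mem[mem[regs[r]]]              # two memory accesses


def load_autoincrement(r, step=1):
    value = mem[regs[r]]
    regs[r] += step                       # side effect: pointer advances
    return value
```

Each extra mode saves an explicit address-computation instruction, at the cost of more work for the uarch.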


Why more mem addr modes?
This is a programmer vs. uarch tradeoff.
pro:

better mapping of high-level constructs to machine code
reduced # of instructions and code size (thus lower memory bus bandwidth requirement)

e.g. auto increment is good for traversing memory
e.g. double indirect is good for ** and linked lists etc.
e.g. sparse matrix access


better support for complex data structure.

con:

compiler needs more reasoning to pick the right addr mode
uarch is more painful to implement


Orthogonal ISA

An orthogonal ISA allows all addressing modes to be used on all instruction types.

e.g. VAX:
~13 addr modes
>300 opcodes
2 formats (int/float)
~13 x 300 x 2 = 7800 opcode/mode/format combinations for the uarch to implement

pro:

flexible
easy to write asm
compiler can pick whatever it likes

con:

uarch hard to impl

Other Elements of the ISA (cont.)


Interface with IO devices


mem mapped IO
special IO instructions (IN,OUT in x86)
Tradeoffs?


Privilege modes


user vs superuser
who can execute which instructions


Exception & Interrupt handling

vectored vs. non-vectored interrupts
vectored = the CPU knows who interrupted
non-vectored = the CPU only knows that it was interrupted


Virtual Memory
Access Protection (Segfault?)

and more....
Semantic gap


+---------------------------+  HLL
|    | Compiler             |
|    V                      |
+---------------------------+--- CISC ISA  
|    | uarch                |
|    |                      |
|    |                      |
|    V                      |
+---------------------------+  Control Signals

+---------------------------+  HLL
|    | Compiler             |
|    |                      |
|    |                      |
|    V                      |
+---------------------------+--- RISC ISA  
|    | uarch                |
|    V                      |
+---------------------------+  Control Signals

CISC: VAX INDEX instr. can index 5D array with bounds check with one instr.

Semantic gap tradeoffs


Compiler simplicity: CISC wins¹
Hardware simplicity: RISC wins
Less burden of backwards compatibility: RISC wins

Instr length


Fixed
Variable

Uniformity


Uniform
Non-uniform


Usually:
RISC

Simple instr
Fixed length
Uniform decode
Few addr modes

CISC

Complex instr
Variable length
Non-uniform decode
Many addr modes


References

Footnotes


¹ The compiler has more options to choose from to perform the same job, so implementing a correct compiler is easier. But the compiler has to weigh all the choices to see which one best fits the program, so writing an optimal compiler is not necessarily easier.


## Lec 4 notes.md

      
Lecture 4: ISA tradeoffs (cont.) and the MIPS ISA

ISA tradeoffs (cont.)

Instruction lengths: Fixed vs variable tradeoff

Fixed:

Easier decoding
Easier alignment
Indexable
Can decode multiple instructions concurrently

Variable:

More compact code (ergo lower mem bus bandwidth requirement)
Better extensibility (if done right)


Intel: they profile programs and assign Huffman-style encodings to the instructions (frequently used instructions get shorter encodings)!

Uniformity tradeoff

Uniform means that the same bits always represent the same meanings. I.e. opcodes always in the same location, so are operand specifiers, imm values, etc.
Uniform pro:

Easier decoding (=> simpler hardware)
Enables parallelism: can start decoding target address before opcode is decoded

Con:

Restricts instr format
Wastes bits (and ergo wastes mem bandwidth)


Uniform decode usually implies fixed length; a variable-length ISA probably cannot have uniform decode.


Usually, RISC:

Simple instructions
Fixed length
Uniform decode
Few addressing modes

Usually, CISC:

Complex instructions
Variable length
Non-uniform decode
Many addressing modes


Number of registers tradeoff

The number of registers immediately decides how many bits you need to use to address the registers. More regs => more bits to reference a reg.
Affects uarch: size, access time, power consumption of register file, etc.
Large number of registers Pro:

Better register allocation and optimization by compilers (because fewer saves and restores), essentially a larger "L0" cache
Potentially fewer instructions caused by spilling/filling*

Con:

Larger instr size per instr.
Larger register file size
More power consumption (a larger SRAM register file has more cells and longer wires to drive)


*: If there are not enough registers for some value, it is pushed onto the stack (most compilers), and brought back when there is room. This is called "spilling"/"filling".
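
The "more regs => more bits" point can be made concrete with a small calculation. A minimal sketch, assuming every register operand gets its own fixed-width specifier field; the function names are made up for illustration.

```python
def register_field_bits(num_registers):
    """Bits needed to name one of num_registers registers."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float rounding
    return (num_registers - 1).bit_length()


def specifier_bits(num_registers, operands=3):
    """Total register-specifier bits in an instruction with `operands` register fields."""
    return operands * register_field_bits(num_registers)


print(register_field_bits(8))    # 3  (e.g. the 3-bit DR/SR fields of LC-3b)
print(register_field_bits(32))   # 5  (e.g. the 5-bit rs/rt/rd fields of MIPS)
print(specifier_bits(32))        # 15 of the 32 instruction bits for a 3-register op
print(specifier_bits(64))        # 18
```

Doubling the register count costs one more bit per operand field, which has to come out of the opcode, immediate, or instruction length.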

Addressing modes tradeoff


Immediate (data = the imm field of the instruction) and register (data = reg)
Register indirect (data = mem[reg])
Memory indirect (data = mem[mem[reg]])
More

Displacement, indexed, absolute, autoincrement, autodecrement, ...


having lots of modes:

pro
better support for programming constructs
Implements data structs easily and efficiently.
con
harder for uarch
too many choices for compiler?

Many ways to do the same thing complicate compiler design, see ¹


(Index * Scale) + Displacement
Base + Index + Displacement
Base + (Index * Scale) + Displacement
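
The three forms above are all special cases of one effective-address formula. The sketch below assumes they refer to x86-style scaled-index addressing (the notes do not name the ISA) and that the scale is limited to 1, 2, 4 or 8 as on x86; the function name and the example addresses are hypothetical.

```python
def effective_address(base=0, index=0, scale=1, displacement=0):
    """Compute base + index * scale + displacement."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return base + index * scale + displacement


# Element 3 of an array of 4-byte ints starting at 0x1000
print(hex(effective_address(base=0x1000, index=3, scale=4)))                  # 0x100c

# Field at byte offset 4 of element 2 in an array of 8-byte structs
print(hex(effective_address(base=0x1000, index=2, scale=8, displacement=4)))  # 0x1014
```

One such mode replaces a shift, an add, and another add that a simpler ISA would issue as separate instructions.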

Other tradeoffs


Condition codes vs. not

Condition codes, e.g. the x86 EFLAGS register


VLIW vs single instruction
Precise vs imprecise exceptions

Precise means that when an exception is raised, none of the instructions after the exception point have executed, and all of the instructions before it have.
Pertains to out-of-order execution (OoO)


Virtual memory or not
Aligned accesses?
Hardware interlocks vs software-guaranteed interlocking? (inter-instruction dependency checking)

MIPS = Microprocessor without Interlocked Pipelined Stages


Software vs hardware managed page fault handling
Cache coherence (HW vs. SW)
etc.....

Programmers vs. (Micro)architecture


Many ISA features designed to aid programmers, but complicate uarch, HW design

E.g. virtual memory.
Q: Should the programmer be concerned about whether their code blocks fit into physical memory?
If yes, then the ISA does not support virtual memory
If no, then the ISA supports virtual memory


MIPS requires word memory accesses to be aligned on a 4-byte boundary.
LW/SW instructions must follow this requirement

Not designed to fetch memory bytes that do not lie within a word boundary
Does not offer rotation of unaligned bytes into registers.

MIPS provides separate opcodes (LWL, LWR) for the "infrequent" case of cross-boundary access


But LWL and LWR are slower
And each of them can still only fetch within a word boundary

x86 allows unaligned accesses, including cross-boundary accesses. LD/ST handles them automatically, so compilers need not worry. There is a caveat, however:


Image: x86 manual warning compilers: you should still try to align data, because an unaligned memory access requires 2 separate underlying accesses, just like MIPS. It's just that the uarch handles that for you.
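
Why an unaligned access costs two underlying accesses can be seen by listing the aligned words it touches. A minimal sketch, assuming a 4-byte word and a memory that can only deliver whole aligned words; the names are made up for illustration.

```python
WORD_BYTES = 4


def is_aligned(addr, size=WORD_BYTES):
    """True if addr is a multiple of the access size."""
    return addr % size == 0


def aligned_words_touched(addr, size=WORD_BYTES):
    """Start addresses of every aligned word overlapped by [addr, addr + size)."""
    first = addr // WORD_BYTES * WORD_BYTES
    last = (addr + size - 1) // WORD_BYTES * WORD_BYTES
    return list(range(first, last + 1, WORD_BYTES))


print(aligned_words_touched(8))    # [8]      aligned: one access
print(aligned_words_touched(10))   # [8, 12]  unaligned: two accesses
```

Whether those two accesses are issued by software (MIPS) or hidden inside the uarch (x86) is exactly the tradeoff discussed above.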

Exercise: What are the pros and cons for aligned/unaligned mem accesses?


Pros

 
Cons
 
 
Part 2: MIPS ISA

MIPS R2000 Program visible state:
[ PC ]
+---------------------+
| Program Counter     |
+---------------------+
32-bit

[ Memory ]
+---------------------+
| M[0]                |
| M[1]                |
| M[2]                |
| M[3]                |
| ......              |
| M[N-1]              |
+---------------------+
2^32 locations, 8 bits each,
represented by 32 bit address
(there's some magic going on)

[ Gen purpose regs ]
+---------------------+
| r0                  |
| r1                  |
| r2                  |
| ......              |
| r31                 |
+---------------------+
General purpose register file,
32 registers, 32 bits each

Data format


Most things are 32 bits

instructions and data addrs
signed and unsigned integers


There are also 16-bit half-words and 8-bit bytes
Floating-point numbers

IEEE 754
float: 1 sign bit, 8-bit exponent, 23-bit significand (fraction)
double: 1 sign bit, 11-bit exponent, 52-bit significand (fraction)
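
The single-precision field layout can be inspected directly. A small sketch using Python's standard `struct` module; the helper name `float32_fields` is made up for illustration.

```python
import struct


def float32_fields(x):
    """Split a value, rounded to IEEE 754 single precision, into its bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF        # 23 bits, implicit leading 1 not stored
    return sign, exponent, fraction


print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-2.5))   # (1, 128, 2097152)  i.e. -1.25 x 2^1
```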


Endianness

           Big Endian
MSB                           LSB
[ byte0 | byte1 | byte2 | byte3 ]

           Little Endian
MSB                           LSB
[ byte3 | byte2 | byte1 | byte0 ]

Most of the time, endianness is simply a matter of convention and interoperation.

Endianness can impact performance (rarely and subtly). E.g. to obtain the 16 LSBs of a 32-bit word in memory, LE can read a 16-bit value at the same address, but BE must add an offset of 2 bytes to the address.
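
The two byte orders can be compared with Python's built-in integer/bytes conversions. This is only an illustration of memory layout; the example value is arbitrary.

```python
value = 0x0A0B0C0D

# Index 0 of each bytes object plays the role of the lowest memory address.
big = value.to_bytes(4, byteorder="big")
little = value.to_bytes(4, byteorder="little")
print(big.hex())      # 0a0b0c0d  (most significant byte at the lowest address)
print(little.hex())   # 0d0c0b0a  (least significant byte at the lowest address)

# The low 16 bits of the word sit at offset 0 in LE but at offset 2 in BE.
low16_le = int.from_bytes(little[0:2], byteorder="little")
low16_be = int.from_bytes(big[2:4], byteorder="big")
print(hex(low16_le), hex(low16_be))   # 0xc0d 0xc0d
```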

Instruction format


3 Simple formats

R-type, 3 register operands

[000000| rs  | rt  | rd  |shamt|funct ]
 6 bit   5     5     5    5     6  


I-type, 2 register operands and 16-bit imm

[opcode| rs  | rt  | imm              ]
 6 bit   5     5     16 


J-type, 26-bit imm

[opcode| imm                          ]
 6 bit   26


Simple Encoding

4 bytes per instruction
Must be 4-byte aligned
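
Because the fields sit at fixed bit positions, every field of all three formats can be extracted with constant shifts and masks before the opcode is even examined. A sketch of that idea; the function name is made up, and the two example words (`add $3, $1, $2` and `lw $2, 4($1)`) use the standard MIPS opcode/funct values, which are not listed in these notes.

```python
def decode_fields(word):
    """Extract every R/I/J-type field of a 32-bit MIPS instruction word."""
    return {
        "opcode": (word >> 26) & 0x3F,   # bits 31..26, all formats
        "rs":     (word >> 21) & 0x1F,   # bits 25..21, R and I
        "rt":     (word >> 16) & 0x1F,   # bits 20..16, R and I
        "rd":     (word >> 11) & 0x1F,   # bits 15..11, R only
        "shamt":  (word >> 6) & 0x1F,    # bits 10..6,  R only
        "funct":  word & 0x3F,           # bits 5..0,   R only
        "imm16":  word & 0xFFFF,         # bits 15..0,  I only
        "imm26":  word & 0x3FFFFFF,      # bits 25..0,  J only
    }


add_fields = decode_fields(0x00221820)   # add $3, $1, $2
lw_fields = decode_fields(0x8C220004)    # lw $2, 4($1)
print(add_fields["rs"], add_fields["rt"], add_fields["rd"])      # 1 2 3
print(lw_fields["opcode"], lw_fields["rs"], lw_fields["imm16"])  # 35 1 4
```

The opcode then only selects which of the already-extracted fields are meaningful.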


ALU instructions

E.g. ADD rd, rs, rt.

This is the destination-first (Intel-style) operand order, so the above asm translates to rd = rs + rt.

MIPS encoding of the above asm:
[000000| rs  | rt  | rd  |00000| ADD  ] R-type
 6       5     5     5    5      6


Semantics:

rd := rs + rt
pc := pc + 4


Raises an exception on signed overflow
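
The ADD semantics above can be written as a small state-update function. A sketch under simplifying assumptions: registers are a plain Python list of 32-bit unsigned values, any special behaviour of r0 is ignored, and `OverflowTrap` and `execute_add` are hypothetical names.

```python
class OverflowTrap(Exception):
    """Stands in for the hardware overflow exception."""


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def execute_add(regs, pc, rd, rs, rt):
    """rd := rs + rt, pc := pc + 4; trap on signed 32-bit overflow."""
    total = to_signed32(regs[rs]) + to_signed32(regs[rt])
    if not -(1 << 31) <= total <= (1 << 31) - 1:
        raise OverflowTrap(f"ADD overflow at pc={pc:#x}")  # rd is left unchanged
    regs[rd] = total & 0xFFFFFFFF
    return pc + 4


regs = [0] * 32
regs[1], regs[2] = 7, 5
next_pc = execute_add(regs, 0x00400000, rd=3, rs=1, rt=2)
print(regs[3], hex(next_pc))   # 12 0x400004
```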


Unrelated sidenote

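Q: How do you use a 32-bit immediate value if I-type instructions carry only a 16-bit immediate (and even J-type only 26 bits)?
A: The assembler breaks addiu $5, $5, 0xbeb0063d down into
lui $1, -16720 // upper 16 bits: $1 = 0xbeb00000
ori $1, $1, 1597  // lower 16 bits: 0x063d
addu $5, $5, $1 // the actual add, using the assembled constant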
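
The split into upper and lower halves is easy to verify numerically. A minimal sketch; the helper names are made up, and the signed view of the upper half is only there to show where the -16720 in the listing comes from.

```python
def split_imm32(value):
    """Split a 32-bit constant into (upper 16 bits, lower 16 bits)."""
    value &= 0xFFFFFFFF
    return value >> 16, value & 0xFFFF


def as_signed16(x):
    """Interpret a 16-bit pattern as a two's-complement integer."""
    return x - 0x10000 if x & 0x8000 else x


upper, lower = split_imm32(0xBEB0063D)
print(hex(upper), as_signed16(upper))   # 0xbeb0 -16720  (operand of lui)
print(hex(lower), lower)                # 0x63d 1597     (operand of ori)

# lui shifts the upper half left by 16; ori fills in the lower half.
rebuilt = (upper << 16) | lower
print(hex(rebuilt))                     # 0xbeb0063d
```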


References

Footnotes


¹ Wulf, "Compilers and Computer Architecture," IEEE Computer, 1981.