@heejongahn
Last active April 11, 2023 21:17
2015 Fall Computer Architecture Final

1. Mark whether each of the following statements is true or false.

A. Increasing the block size of a cache always reduces the cache miss rate by increasing the chance to exploit spatial locality. [ T / F ]

False. If the block size grows too large, the cache can hold only a few blocks, so conflict and capacity misses increase and the miss rate can go up.

B. Increasing the cache size does not reduce coherence misses in shared memory multiprocessors. [T/F]

True. Coherence misses come from invalidations caused by other processors' writes, which a larger cache does not prevent; with more blocks resident, even more of them are exposed to invalidation.

C. Increasing the block size does not increase coherence misses in shared memory multiprocessors. [ T / F ]

False. With a larger block, multiple processors are more likely to touch (different words of) the same block, so coherence misses increase (false sharing).

D. A multi-level page table allows faster TLB fills than a single-level page table for the same virtual memory size. [ T / F]

False. Filling the TLB requires walking the page table, and a single-level table needs only one memory access per walk, so it fills the TLB faster.

E. Loop unrolling by compilers improves the performance of dynamically scheduled out-of-order processors [ T / F ]

False. Loop unrolling mainly helps statically scheduled (in-order) processors; a dynamically scheduled out-of-order processor already finds that parallelism in hardware.

F. Register renaming reduces RAW dependency as well as WAR and WAW dependencies. [ T / F ]

False. Register renaming removes only the name dependencies, WAR and WAW; RAW is a true data dependency and cannot be eliminated.

G. To improve the performance of out-of-order processors, completed instructions can be deleted from the ROB in out-of-order. [ T / F ]

False. Instructions must retire from the ROB in order, so that precise exceptions and correct recovery from mis-speculation are possible.

H. In snoop-based coherence, only the node which has a copy of the requested block needs to snoop its cache for a bus transaction. [T / F ]

False. Every node must snoop every bus transaction, because a node cannot know in advance whether its cache holds a copy of the requested block.

I. Shared memory multiprocessors require the uniform access latency from a processor to the memory. [ T / F ]

False. Shared-memory multiprocessors can be built as NUMA (non-uniform memory access) as well as UMA; uniform latency is not required.

2. To support simultaneous multithreading, which of the following structures must be duplicated for each thread (context)? (You can ignore any performance or implementation issues. Choose only the ones which must essentially be duplicated to support multiple contexts with a processor.)

A. Branch predictor
B. Reservation station
C. ALUs
D. ROB
E. Program counter
F. Architectural register file
G. Instruction cache

B, E, F

3. To access the TLB and L1 cache in parallel, the L1 cache can be organized as virtually-indexed, and physically tagged cache. Suppose the L1 cache is a 32KB, 4-way associative cache with 32B block size. (Explain the reason for your answer)

A. How large the page size should be, if a physical address can be at most in one place (one set) in the cache?
  1. L1 cache = 32KB, block size = 32B, 4 ways → one set holds 4 × 32B = 128B.
  2. 32KB / 128B = 2^8 sets → 8 index bits.
  3. 8 bits (index) + 5 bits (block offset, including the byte offset) = 13 bits, all of which must come from the untranslated page offset.
  4. Minimum page size = 2^13 B = 8KB.
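These bit counts can be checked with a short script (a sketch; the parameters are the ones given in the problem):

```python
# Sketch: derive the index/offset bits of the 32KB, 4-way, 32B-block L1
# and the minimum page size for which index + offset fit in the page offset.
from math import log2

cache_size = 32 * 1024   # 32KB
block_size = 32          # 32B
ways       = 4

sets        = cache_size // (block_size * ways)   # 128B per set
index_bits  = int(log2(sets))
offset_bits = int(log2(block_size))
min_page    = 2 ** (index_bits + offset_bits)

print(sets, index_bits, offset_bits, min_page)    # 256 8 5 8192
```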
B. With the same cache configuration, how can you make the system support 1KB page size?

Make it a physically-indexed, physically-tagged cache: the TLB is accessed before the cache, so the index no longer has to fit inside the page offset (at the cost of losing the parallel TLB/L1 access).

4. Suppose you built a parallel system with 1000 processors to reduce the execution time of finishing task A.

A. To reduce the execution time with 1000 processors to 10% of the execution time with the single processor, how much of the original task must be perfectly parallelized? (i.e. how much of the execution time in the original serial run must be running only with a processor even in the parallelized run and how much must be parallelized perfectly with 1000 processors.)

Let p be the fraction of the task that can be perfectly parallelized.

  • 0.1 = (1 - p) + p / 1000
  • 0.9 = p (1 - 1/1000) = (999/1000) p
  • p = 900 / 999 ≈ 0.9009, so about 90.1% must be perfectly parallelized.
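As a sanity check, Amdahl's law can be solved directly for the parallel fraction needed to reach 10% of the serial time (a quick sketch; `n` and `target` come from the problem):

```python
# Sketch: solve (1 - p) + p/n = target for p (Amdahl's law).
n      = 1000    # processors
target = 0.1     # want 10% of the single-processor execution time

p = (1 - target) / (1 - 1 / n)

print(round(p, 4))                    # 0.9009
print(round((1 - p) + p / n, 4))      # 0.1  (plugging p back in)
```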
B. The question 4.A assumes perfect parallelization. What factors can make the parallelization imperfect in real systems? List as many possible factors as you can think of.
  • imperfect load balancing
  • dependencies between instructions / tasks
  • cache coherence overhead
  • limited shared resources (functional units, memory bandwidth)
  • unresolved branch conditions / branch targets / return addresses

5. With a two-level cache hierarchy, L2 miss rates (l2 misses/l2 references) are 10% for application A, and 50% for application B. However, AMAT (Average Memory Access Time) for application A is slower than application B. Describe a scenario that the above situation can happen. Assume the L1, L2, and memory access latencies do not change.

This can happen if B is written to exploit locality far better than A, so B's L1 hit rate is much higher: most of B's references never reach L2, while A's frequent L1 misses dominate its AMAT despite the lower L2 miss rate.
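A concrete set of numbers shows the effect; the latencies and hit rates below are assumptions for illustration, not values from the problem:

```python
# Sketch: AMAT = L1 time + L1 miss rate * (L2 time + L2 miss rate * memory time).
# App A has the lower L2 miss rate but far worse L1 locality.
l1_lat, l2_lat, mem_lat = 1, 10, 100   # cycles (assumed)

def amat(l1_miss_rate, l2_miss_rate):
    return l1_lat + l1_miss_rate * (l2_lat + l2_miss_rate * mem_lat)

amat_a = amat(0.50, 0.10)   # A: 50% L1 misses, 10% L2 miss rate
amat_b = amat(0.02, 0.50)   # B:  2% L1 misses, 50% L2 miss rate

print(round(amat_a, 2), round(amat_b, 2))   # 11.0 2.2
```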

6. Dependency and Scheduling.

X1:  add r1, r2, r3
X2:  or  r3, r2, r1
X3:  sub r2, r4, r5
X4:  sw  r2, 0(r6)
X5:  and r2, r2, r7
X6:  lw  r4, 0(r7)
A. Find all possible dependencies.
  • r1 RAW (X1-X2)
  • r3 WAR (X1-X2)
  • r2 WAR (X1-X3)
  • r2 WAR (X2-X3)
  • r2 RAW (X3-X4) (sw reads r2; it does not write it)
  • r2 RAW (X3-X5)
  • r2 WAW (X3-X5)
  • r2 WAR (X4-X5)
  • r4 WAR (X3-X6)
B. What is the minimum number of cycles to run the instructions with an ideal out-of-order processor? The execution of an instruction takes one cycle, and the unlimited number of instructions can be executed in parallel, if the instructions do not have dependencies. However, register renaming is not supported.

5 cycles (the chain X1 → X2 → X3 → X4 → X5 must run serially; only X6 can overlap, e.g. with X5).

C. What is the minimum number of cycles if unlimited register renaming is supported? Use the same assumption as 6-B.

WAR and WAW dependencies can be removed by renaming. With renamed registers, the remaining dependencies are:

X1:  add r1 , r2, r3
X2:  or  r3a, r2, r1
X3:  sub r2b, r4, r5
X4:  sw  r2b, 0(r6)
X5:  and r2c, r2b, r7
X6:  lw  r4, 0(r7)
  • r1 RAW (X1-X2)
  • r2b RAW (X3-X4)
  • r2b RAW (X3-X5)
  1. Cycle 1: X1, X3, X6
  2. Cycle 2: X2, X4, X5

2 cycles.
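Both cycle counts can be reproduced with a small longest-path scheduler (a sketch; an edge (i, j) means Xj depends on Xi, and the model conservatively puts every dependent instruction, including WAR/WAW, one cycle after its source):

```python
# Sketch: earliest-cycle (longest-path) scheduling over dependency edges.
def min_cycles(n, deps):
    cycle = [1] * (n + 1)                 # cycle[i]: earliest cycle of Xi
    for src, dst in deps:                 # edges go forward in program order
        cycle[dst] = max(cycle[dst], cycle[src] + 1)
    return max(cycle[1:])

all_deps = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5), (3, 6)]
raw_only = [(1, 2), (3, 4), (3, 5)]       # only RAW survives renaming

print(min_cycles(6, all_deps), min_cycles(6, raw_only))   # 5 2
```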

7. Suppose a RISC-type microprocessor, called x331, can fetch, issue and execute up-to W instructions at a cycle (W –wide superscalar out-of-order processor). The processor has a centralized reservation station and a reorder buffer. The number of entries in the reservation station is R, and the number of entries in the reorder buffer is N. The ISA of the processor supports up-to 2 input operands. The processor allows speculative out-of-order execution. (You must explain the reason for your answer)

A. What is the maximum number of instructions which can be executed speculatively? (In other words, how many instructions can be issued and executed, but later squashed without commit?)
B. How many read ports and write ports for reading and writing register values are necessary in the reservation station and the ROB to sustain the maximum execution rate of W?
C. In how many places can the same value of a register exist in the processor?

I don't know...

8. Memory dependency

A. Executing load and store instructions involves address calculation and memory access (cache access) parts. In out-of-order execution processors, certain re-orderings of load and store executions can lead to an incorrect program execution. In the following sequence of store and load (in the program order), describe the condition that the execution of two instructions can lead to an incorrect out-of-order execution. (The condition must include the ordering of address calculation and memory access parts of two instructions.)
A: store  r1, 10(r2)
B: load   r5, 150(r6)
  1. A and B each compute their address, and r2 + 10 == r6 + 150 (they alias).
  2. B performs its memory access first: r5 = MEM[r6 + 150] (the old value).
  3. A performs its memory access afterwards: MEM[r2 + 10] = r1.
  4. r5 ends up with a stale value: the load executed before an older store to the same address.
B. To issue load instructions as early as possible, out-of-order processors can issue load speculatively, even if there is a chance of dependence violation (as described in problem 4.A). Explain the necessary operations for executing loads and stores to support such speculative load execution.
  • keep in-flight loads and stores in queues and compare their addresses to detect violations
  • roll back (squash and re-execute) when the speculation turns out to be wrong
  • load bypassing
  • store-to-load forwarding
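The store-to-load forwarding part can be illustrated with a toy store queue (a sketch; all names here are made up):

```python
# Sketch: a toy store queue. Loads first check older, not-yet-retired
# stores and forward the youngest matching value; otherwise they read memory.
store_queue = []            # (address, value) of in-flight stores, oldest first

def execute_store(addr, value):
    store_queue.append((addr, value))

def execute_load(addr, memory):
    for a, v in reversed(store_queue):   # youngest matching store wins
        if a == addr:
            return v
    return memory[addr]

memory = {100: 7, 200: 9}
execute_store(100, 42)
print(execute_load(100, memory))   # 42  (forwarded from the store queue)
print(execute_load(200, memory))   # 9   (no match, read from memory)
```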

9. The following code fragment is the lock acquire part of a spin lock implementation. It spins until a variable (pointed by R1) becomes zero, and if the variable is zero, it is locked by setting it to 1.

        li    R2,   #1
lockit: lw    R3,   0(R1)
        bnez  R3,   lockit
        sw    R2,   0(R1)
A. Explain why the code fragment does not implement a spin lock correctly. Describe a scenario when the above implementation can allow multiple threads to enter the critical section simultaneously.

Two threads can pass the lw/bnez check at the same time: both load 0 before either stores 1, so both enter the critical section. The load-test-store sequence is not atomic.
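The race can be replayed deterministically (a sketch; the interleaving of the two "threads" is hard-coded):

```python
# Sketch: one bad interleaving of the lw/bnez/sw lock. Both "threads"
# load the lock before either stores, so both pass the zero check.
lock = 0
in_critical = []

r3_t1 = lock              # T1: lw  -> 0, bnez not taken
r3_t2 = lock              # T2: lw  -> 0, bnez not taken
lock = 1                  # T1: sw  -> enters the critical section
in_critical.append("T1")
lock = 1                  # T2: sw  -> also enters
in_critical.append("T2")

print(in_critical)        # ['T1', 'T2']
```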

B. How will you rewrite the code, if a new instruction called FAI (fetch-and-increment). “fai Rx, 0(Ry)” read the content of memory[Ry] into Rx, and increment the memory location by one atomically.

Use fai instead of lw; the read and the increment happen atomically, so only the thread that reads 0 acquires the lock (the li is no longer needed, since fai performs the increment itself):

lock: fai   R3, 0(R1)
      bnez  R3, lock

10. Using load-linked and store-conditional pair, implement the following atomic compare-and-increase:

CAI R1, 0(R3):

IF ( MEMORY [R3] == R1 ) THEN MEMORY[R3] <- MEM[R3] + 1
ELSE do nothing
You may use the following instructions (you may use additional MIPS-like instructions. You can use any registers other than R1 and R3. R0 has always zero value):
ll Rx, disp (Rsrc)
sc Rx, disp (Rdest)
beq Rx, Ry, label
beqz Rx, label
try:      ll    R2, 0(R3)
          bne   R1, R2, done     ; values differ: do nothing (skip the store)
          addi  R2, R2, 1
          sc    R2, 0(R3)
          beqz  R2, try          ; sc failed (reservation lost), retry
done:
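The compare-and-increment logic can be checked against a toy ll/sc model (a sketch; the `Memory` class and its reservation flag are made up, not a real simulator API):

```python
# Sketch: toy ll/sc memory. sc succeeds only while the reservation from
# the matching ll is still held; a failed sc makes the CAI loop retry.
class Memory:
    def __init__(self, val):
        self.val = val
        self.reserved = False
    def ll(self):
        self.reserved = True
        return self.val
    def sc(self, val):
        if not self.reserved:
            return False          # reservation lost: sc fails
        self.val = val
        self.reserved = False
        return True

def cai(mem, r1):
    while True:
        r2 = mem.ll()
        if r1 != r2:
            return                # values differ: do nothing
        if mem.sc(r2 + 1):
            return                # stored successfully

m = Memory(5)
cai(m, 5)
cai(m, 3)
print(m.val)   # 6  (first call incremented, second did nothing)
```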

11. Complete the following table to implement a simple MSI protocol for snoop-based coherence.


(Question 12: skipped)

13. Suppose you are designing a 8-core processor, and the total cache size is 4 MB. Explain at least two reasons why a 4MB shared cache may reduce the on-chip cache misses compared to 8 private caches with 512KB for each core.

  • When several cores access the same data, a shared cache takes a single miss to bring the block in, while private caches take one miss per core that touches it.
  • When the cores' working-set sizes differ, a shared cache uses the same total capacity more efficiently: a core with a large working set can use more than its 512KB share.

14. Cache Coherence : The following table shows a simple MSI protocol for snoop-based coherence.

A. Explain why “Modified” state cannot receive “BusUp” command. (“Error” state in the table)

BusUp is the command issued on a Shared → Modified transition, but a block cannot be held Shared by one cache while another holds it Modified, so a cache in Modified can never observe BusUp.
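A toy snooping-side transition table (a sketch, not the course's exact table) makes the point concrete:

```python
# Sketch: snooping-side MSI transitions as (state, bus command) -> result.
# BusUp is only issued by a cache holding the block Shared, which cannot
# coexist with another cache holding it Modified, hence the ERROR entry.
SNOOP = {
    ("M", "BusRd"):  ("S", "flush"),   # supply dirty data, downgrade
    ("M", "BusRdX"): ("I", "flush"),
    ("M", "BusUp"):  "ERROR",
    ("S", "BusRd"):  ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("S", "BusUp"):  ("I", None),
    ("I", "BusRd"):  ("I", None),
    ("I", "BusRdX"): ("I", None),
    ("I", "BusUp"):  ("I", None),
}
print(SNOOP[("M", "BusUp")])   # ERROR
```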

B. Extend the protocol to add Exclusive state. In Exclusive state, the cache block is clean (not modified), but no other caches have a copy of the cache block. Explain why the addition of exclusive state is beneficial by describing a scenario. You may specify extra conditions, if necessary.

The Exclusive → Modified transition generates no bus transaction. Scenario: a core reads a block that no other cache holds (loading it in Exclusive) and then writes it; in plain MSI the block would be loaded Shared and the write would have to issue a bus upgrade, while with Exclusive the write is silent.

C. If you are allowed to add one more state, how will you reduce the number of write-backs to the memory, as much as possible? Briefly describe the new state.

Add the Owned state from MOESI: when another cache reads a dirty (Modified) block, the owner moves to Owned and supplies the data cache-to-cache instead of writing it back to memory; the write-back is deferred until the Owned block is finally evicted.

15. GPUs may not have hardware caches. In the CUDA programming, explain how GPUs attempt to maximize the throughput even if each memory operation has a long memory latency without caches. (How does it hide the long memory latencies?)

Run a huge number of threads, so that whenever some threads stall on memory, others are ready to execute and the functional units never sit idle (latency hiding through massive multithreading).

16. For certain applications, with the same chip size, GPUs or vector processors can perform much better than a conventional out-of-order superscalar processor. Explain the reason.

Because a GPU fills the area that an out-of-order core spends on dependency-tracking hardware (renaming, reservation stations, ROB) and large caches with a large number of ALUs instead, giving far more raw throughput for data-parallel work.
