heejongahn/prep.md

## prep.md

      
    Raw
  

              prep.md
            
          
    1

1-1. The CPI of a single cycle design is always less than or equal to the CPI of a pipelined implementation.

True. singly cycle design 에서 CPI는 무조건 1. 파이프라인에서는 ideally 1, 실제로는 더 커질 수 있음.
1-2. Since RISC can access the memory only through load and store instructions, RISC must support many different addressing modes to satisfy various memory access needs by different applications.

False. 우리가 배운 MIPS만 보더라도 displacement mode 하나밖에 없음.
1-3. By reducing stalls due to data hazards, adding support for forwarding never increases the execution time of a program.

False. Forwarding을 하는것도 어쨌든 overhead. 운이 나쁘면 실행시간이 증가할수도 있다.
1-4. Latches change the stored value only at the clock edge (rising or falling).

False. (while clock is asserted)
1-5. Reading the stored value in a DRAM cell changes the stored value.

True. (reference 에서 "The read process in DRAM is destructive and removes the charge on the memory cells in an entire row")
1-6.The goal of pipelining is to reduce both the CPI and clock cycle time of single cycle designs.

False. CPI는 1보다 더 낮아지지 않는다.
1-7.The clock cycle time of a pipelined processor is determined by the fastest stage in the pipeline.

False. slowest stage에 의해 결정된다.
1-10. When an external interrupt occurs, the instructions in all five stages of the pipeline must be flushed. Assume the five stage MIPS pipeline used in the lecture.

False. Slide 08 page 24: "let the instructions currently active in the pipeline complete before passing control to the OS interrupt handler"
1-14. 5-to-1 multiplexer needs 2 bit control inputs.

False. 3비트 필요하다.
1-15. Microcode is used to improve the performance of RISC by adding internal instructions.

False. CISC machine에 해당하는 이야기.
1-16. Increasing the block size of a cache always reduces the cache miss rate by increasing the chance to exploit spatial locality.

False. 캐시에 들어갈 수 있는 블락 수가 줄어듬 (competition). 어느 수준까지는 miss rate 감소하지만 그 이후로 다시 증가.
2. From the following list, choose what belong to Instruction Set Architecture (not implementation). Mark all that apply.

A, B, E, F
3. A program uses a function f frequently. Adding a specialized vector instruction can speed up the execution of function f 10 times, but slow down the rest of program execution twice. On what condition does this optimization reduce the execution time of this program?

function f 의 실행 회수가 전체 프로그램에서 차지하는 비율을 p라 하자. f의 실행시간을 t(p), 나머지 모든 함수의 평균 실행시간을 t(r)이라 하자.
average execution time = p * t(p) + (1-p) * t(r)
average execution time with vector instruction = p * t(p) / 10 + 2 * (1-p) * t(r)
(1-p) * t(r) - 9 * p * t(p) / 10 < 0 일때 이득.
4. Assume the 5-stage pipeline used in the lecture. Suppose EX/MEM and MEM/WB to EX forwarding is supported, but MEM/WB to MEM forwarding is not supported. The following MIPS instructions are executed in the processor.

4-1. List all RAW (read-after-write) data dependencies in the code (you may mark it on the code fragment directly with arrows).

lw    R4, 0(R1)
sw    R4, 0(R5)
lw    r5, 0(r5)
addu  r6, r5,   r2

(위 인스트럭션들과 각각 그 앞의 인스트럭션 사이에 dependency
4-2. How many stalls are needed to execute the code fragment in the five-stage pipeline?

sw, addu에서 필요한 두 개.
5. Suppose the longest execution path for a single cycle design is Fetch (IF)Decode (DE) Execution (EX)Memory (MEM)Writeback (WB). The following table shows the latency for each part.

5-1. What is the clock cycle time for the above single cycle design?

450ps
5-2. Suppose each part becomes a stage in a five-stage pipeline design. If the latency overhead due to the pipeline register is 10ps, what is the clock cycle time for the pipeline design?

130ps
5-3. Suppose using a new type of memory can reduce the MEM latency to 60ps from 120ps. What is the clock cycle time after the improvement?

110ps
6. An ideal pipelined design can have the speed up of N from the single cycle design, when N is the number of pipeline stages. However, in reality, the speedup can be less than N. List all possible reasons why the pipeline may not achieve the ideal speedup.


Pipeline register 으로 인한 overhead
각 stage 간의 소모 시간이 다른 경우
Forwarding unit으로 인한 overhead

7. In the five stage pipeline with forwarding support to EX, the first operand of ALU comes from 3-to-1 input multiplexer. As the memory latency increases, MEM stage now takes 2 cycles (from 1 cycle) To adjust the pipeline for the latency increase, the MEM stage becomes MEM1 and MEM2.

7-1. How should the input mux for the ALU be changed, if forwarding to EX needs to be fully supported?

4개로 증가. (Forwarding source가 하나 증가)
7-2. Even with the full forwarding to EX, show an example where the number of stall cycles increase in the new pipeline compared to the old one.

모르겠습니다.
8. Show an example that 2-bit branch history table is better than 1-bit branch history table.

for (int i=0; i<3; i++) {
  if (i % 2 == 0) {
    foo();
  }
}

9. With a two-level cache hierarchy, L2 miss rates (l2 misses/l2 references) are 10% for application A, and 50% for application B. However, AMAT (Average Memory Access Time) for application A is slower than application B. Describe a scenario that the above situation can happen. Assume the L1, L2, and memory access latencies do not change.

B의 L1 miss rate가 A 보다 훨씬 낮으면 된다. A와 다르게 B가 굉장히 좁은 영역의 데이터만 필요로 한다면 B의 L1 hit rate가 훨씬 높을것이고, 이 경우 더 짧은 AMAT를 가질 수 있음.
10. A program uses matrix multiplication very frequently. Adding a specialized vector instruction can speed up the execution of the matrix multiplication 10 times, but slow down the rest of program execution twice. On what condition does this optimization reduce the execution time of this program?

3번 문제와 동일
11. Pipelining

A. What are the CPI of a single-cycle processor, which can execute an instruction every cycle, and the CPI of an ideally pipelined processor? (State the reason for the answer shortly)

1 (By definition)
B. What is the speedup of the ideally pipelined processor compared to the single-cycle design? You must add a factor to answer the question. State the missing factor and explain the reason using the execution time decomposition (Execution time = instruction count * CPI * clock cycle time)

Number of pipeline stages. Ideal case 에서는 모든 스테이지에서 같은 시간이 소요되므로 clock cycle time이 1 / number of pipeline stages 로 줄어든다.
C. Explain two reasons that the clock cycle time of a pipelined processor can be longer than the ideally pipelined processor.


Pipeline register 추가로 인한 overhead
각 stage 간 서로 다른 시간 소요

D. Explain any possible reasons that the CPI of a pipelined processor can be longer than the ideally pipelined processor.


Data hazard
Structural hazard
Control hazard

12. Suppose there are two different branch predictor designs, BF1 and BF2. BF1 can predict a certain pattern of branch sequences very well, and BF2 can predict a different type of patterns well. Explain how you will add both predictors to predict both type of branch patterns. Assume that you have 2048 bits of extra storage available for that, in addition to the two predictors.

2048 bit = 64 words. 한 번 등장했던 주소들의 branching 결과에 대해 BF1이 맞춘 주소들을 저장. 그 주소가 나오면 BF1으로 예측?
13. Suppose you are comparing two processors with different cache designs. Both processors have two-level caches (L1 and L2 caches). Assume a perfect instruction L1 cache. The cache configurations of the two processors are as follows:

Processor A: L1 cache: 16KB, 1 cycle hit latency, L2 cache: 512KB, 10 cycle hit latency,

Processor B: L1 cache: 64KB, 3 cycle hit latency, L2 cache: 2MB, 12 cycle hit latency

For both processors, external memory access latency is 100 cycles. Suppose you will run only two applications with the following characteristics.


Processor A

App1 : 0.8 + 0.1 * 10 + 0.1 * 100 = 11.8
App2 : 0.98 + 0.018 * 10 + 0.002 * 100 = 1.36


Processor B

App1 : 0.9 * 3 + 0.08 * 12 + 0.02 * 100 = 5.66
App2 : 0.99 * 3 + 0.0095 * 12 + 0.0005 * 100 = 3.134


App1 + App2 가 더 작은 B를 쓰는게 낫다.
14. Suppose the width of physical address of a system is 30 bits (the maximum physical address space = 2^30B). Calculate the size of storage (in Bits) for the tags for the following cache configurations. Each tag needs extra 2 bits for valid and modified bits. Ignore extra bits for LRU replacement for set-associative caches. You must show how you calculate the final answer (possibly with figures)

A. 32KB direct-mapped cache, block size = 32B

10 bits for index
3  bits for word offset
2  bits for byte offset
2  bits for valid / modified

tag: 13 bits
B, C: 안 배움
15. Write-back VS. write-through caches:

A. State reasons that write-back caches can be better than write-through caches in general.

write-through 는 cache update가 있을때마다 메모리에 접근해야 하는데 write-back 은 cache가 replace 될 때 한 번만 접근하면 됨.
B. Make a memory access pattern, where a write-back cache is not better than a write-through cache with the same capacity.

cache가 한 번 쓰여질 때마다 추가적인 접근 없이 바로 replace 되는 경우.
16. Stack management: the following code fragments show a factorial function in C and MIPS assembly. Fill A) and B) in MIPS instruction, and explain why those two lines of instructions are necessary.

lw $a0, 0($sp)
lw $ra, 4($sp)

recursive call에서 두 레지스터를 사용해야 하므로 일단 스택에 저장해준 뒤 맨 마지막에 restore 해서 사용한다.
17. Data hazard in pipeline: Assume the 5-stage pipeline used in the lecture.

A. How many stalls are needed to execute the program, if no forwarding is supported.