@edcote
Last active October 3, 2018 03:23
BOOM v2: An Open-Source OoO RISC-V Core

Notes

Link to tech report

The Alpha 21264 has 15 FO4 delays. (An FO4 delay is the delay of an inverter driven by an inverter 4x smaller than itself and driving an inverter 4x bigger than itself.) BOOMv2 is 35 FO4.

BOOMv1 follows the 6-stage pipeline structure of the MIPS R10K: fetch, decode/rename, issue/register-read, execute, memory, and writeback.

Frontend fetches instructions for execution in the backend.

Requires branch prediction techniques. The BTB maintains a set of tables mapping instruction addresses (PCs) to branch targets. The look-up address is checked against the BTB for a tag hit. On a hit, the BTB makes a prediction for the frontend. Hysteresis bits (T/NT) help guide the decision.
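To make the tag-hit and hysteresis idea concrete, here is a minimal behavioral sketch in Python. It is not the BOOM implementation: the direct-mapped organization, the table size, the 4-byte instruction alignment, and the names `SimpleBTB` and `BTBEntry` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BTBEntry:
    tag: int
    target: int
    counter: int = 2  # 2-bit hysteresis: 0-1 predict not-taken, 2-3 predict taken


class SimpleBTB:
    """Hypothetical direct-mapped BTB model (illustrative only)."""

    def __init__(self, num_sets=64):
        self.num_sets = num_sets
        self.entries = [None] * num_sets

    def _index_tag(self, pc):
        word = pc >> 2  # assumption: 4-byte aligned instructions
        return word % self.num_sets, word // self.num_sets

    def predict(self, pc):
        """Return the predicted target on a tag hit with a taken counter, else None."""
        idx, tag = self._index_tag(pc)
        entry = self.entries[idx]
        if entry is not None and entry.tag == tag and entry.counter >= 2:
            return entry.target
        return None

    def update(self, pc, target, taken):
        """Train the entry with the resolved branch outcome."""
        idx, tag = self._index_tag(pc)
        entry = self.entries[idx]
        if entry is None or entry.tag != tag:
            if taken:  # allocate only for taken branches
                self.entries[idx] = BTBEntry(tag, target)
            return
        if taken:
            entry.target = target
            entry.counter = min(3, entry.counter + 1)
        else:
            entry.counter = max(0, entry.counter - 1)
```

A tag mismatch is treated as a miss, so two PCs that share an index do not steal each other's prediction.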

The return address stack (RAS) predicts function returns. Jump-register instructions are difficult to predict because the target depends on a register value. The underlying structure of the RAS is a stack.
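The stack behavior can be sketched as follows. This is a behavioral model under assumptions, not BOOM's RAS: the class name, the depth, and the drop-oldest overflow policy are hypothetical choices.

```python
class ReturnAddressStack:
    """Hypothetical fixed-depth RAS model (illustrative only)."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def push(self, return_addr):
        """On a call: remember the address of the instruction after the call."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # assumption: overflow drops the oldest entry
        self.stack.append(return_addr)

    def pop(self):
        """On a return: predict the most recently pushed address, or None if empty."""
        return self.stack.pop() if self.stack else None
```

A call at PC `0x100` would push `0x104` (assuming 4-byte instructions), and the matching return would pop it as the predicted target.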

The conditional branch predictor (CBP) maintains a set of prediction and hysteresis tables indexed based on the look-up address. It only makes taken/not-taken predictions. Sometimes known as a global history predictor.

The issue window holds all in-flight, un-executed uops. Each issue port selects one of the available ready uops to issue. The larger the window, the more instructions the scheduler can attempt to re-order.
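A small Python sketch of the select step, under assumptions: readiness is modeled as "all source registers are in a ready set", and ties are broken oldest-first. The names `Uop` and `select_ready` are hypothetical and do not come from the BOOM source.

```python
from dataclasses import dataclass


@dataclass
class Uop:
    name: str
    srcs: tuple  # source register numbers
    age: int     # smaller means older


def select_ready(window, ready_regs, num_ports):
    """Remove and return up to num_ports ready uops, oldest first."""
    ready = [u for u in window if all(s in ready_regs for s in u.srcs)]
    ready.sort(key=lambda u: u.age)
    issued = ready[:num_ports]
    for u in issued:
        window.remove(u)
    return issued
```

With a window holding an older `add` that is still waiting on an operand and a younger `mul` whose operand is ready, the `mul` issues first, which is the re-ordering the window exists to enable.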

The register file was hand-crafted: a register bit was built out of foundry-provided standard cells, and the placer was left to route the wires automatically.

Source code review

  • 2bc-table.scala:

class TwobcCounterTable

Provides a two-bit counter table for use by branch predictors, only updated during commit. The table contains P (prediction) and H (hysteresis) bits. Implemented as a saturating counter with four states: strongly not-taken, weakly not-taken, weakly taken, strongly taken. Different implementations of the tables are provided: single-port or dual-port memory.
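The four states map naturally onto a 2-bit saturating counter. The sketch below is a generic textbook version in Python, not a transcription of `TwobcCounterTable`; the class name and the initial state are assumptions.

```python
class TwoBitCounterTable:
    """Generic 2-bit saturating counter table (illustrative only).

    Encoding: 0 = strongly not-taken, 1 = weakly not-taken,
              2 = weakly taken,       3 = strongly taken.
    The upper bit acts as the prediction (P) bit, the lower as hysteresis (H).
    """

    def __init__(self, num_entries):
        self.counters = [1] * num_entries  # assumption: start weakly not-taken

    def predict(self, idx):
        return self.counters[idx] >= 2

    def update(self, idx, taken):
        c = self.counters[idx]
        self.counters[idx] = min(3, c + 1) if taken else max(0, c - 1)
```

The hysteresis shows up when a strongly-taken entry sees one not-taken outcome: it drops to weakly taken and still predicts taken.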

  • base-only.scala:

class BaseOnlyBrPredictor extends BrPredictor

Implements the base class. Does nothing except exercise the base predictor (BIM).

  • bim.scala:

class BimodalTable extends BoomModule

BIM is a table of 2 bit counters. Control logic in three pipeline stages:

  • S0: receive address to predict on
  • S1: perform lookup
  • S2: return the read data

Implemented as nBanks. Each bank has update queue and write queue.

  • bpd-pipeline.scala

Accesses BTB and BPD to feed predictions to the Fetch Unit.

  • F0: select the next PC
  • F1: access I$ and BTB RAMs. Perform BPD hashing.
  • F2: access BPD RAMs. Begin decoding instruction bits.

class BranchPredictionStage extends BoomModule

Contains BTB (BoomBTB <-- abstract class) and BPD (BrPredictor <-- abstract class). RAS appears to be disabled.

  • brpredictor.scala

Abstract class for branch predictors. F0-F4

  • btb-sa.scala

Parameterizable set-associative branch target buffer.

class BTBsa extends BoomBTB

Contains nWays. Each way has valid registers, tag mem, and data mem. Can be configured to include a RAS module. Implementation appears straightforward.

  • btb.scala

Abstract base class for BTB. Includes implementation of RAS (return address stack).

abstract class BoomBTB extends BoomModule

  • dense-btb.scala

Dense BTB with RAS and BIM predictor.

class DenseBTB extends BoomBTB

  • gshare.scala

Implementation of a gshare branch predictor.

class GShareBrPredictor extends BrPredictor
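For reference, the general gshare idea is to XOR the branch PC with a global history register to index a table of 2-bit counters. The Python sketch below shows that textbook scheme only. It is not derived from `GShareBrPredictor`, and the history length, the PC shift, and the class name are assumptions.

```python
class GShare:
    """Textbook gshare model (illustrative only)."""

    def __init__(self, history_bits=8):
        self.mask = (1 << history_bits) - 1
        self.history = 0                            # global branch history register
        self.counters = [1] * (1 << history_bits)   # 2-bit counters, weakly not-taken

    def _index(self, pc):
        # assumption: drop the 2 low PC bits (4-byte instructions), then XOR with history
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        idx = self._index(pc)
        c = self.counters[idx]
        self.counters[idx] = min(3, c + 1) if taken else max(0, c - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

Because the history is part of the index, a branch that strictly alternates taken/not-taken ends up using two different counters and becomes predictable after a short warm-up.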

  • tage-table.scala

TAGE table? used by TAGE branch predictor.

https://comparch.net/2013/06/30/why-tage-is-the-best/

  • tage.scala

Implementation of a TAGE-based branch predictor.

class TageBrPredictor extends BrPredictor

  • configs.scala

Contains diplomacy configurations.

  • consts.scala

List of constants.

  • microop.scala

Not sure where these uops are used. More reading will help.

class MicroOp extends BoomBundle

class CtrlSignals extends BoomBundle

  • parameters.scala

Main configuration object.

  • tile.scala

Interface between Boom and RocketChip. Instantiates core, HellaCache, ICacheFrontEnd, etc.

  • core.scala

Conceptual pipeline stages:

if0 - next-pc select
if1 - I$ access
if2 - instruction return
if3 - enqueue to fetch buffer
if4 - redirect from BPD
dec - decode
ren - rename
dis - dispatch
iss - issue
rrd - register read
exe - execute
mem - memory
sxt - sign extend
wb - writeback
com - commit

class BoomCore extends BoomModule

List of instantiated modules:

  • FpPipeline
  • DecodeUnit (# of decodeWidth)
  • BranchMaskGeneration Logic
  • RenameStage
  • new boom.exu.IssueUnits (implies there are multiple, but not expanded here?)
  • RegisterFile (incl. behavioral version!)
  • Arbiter(new ExeUnitResp) - not sure what for?
  • RegisterRead (believe this is its pipeline stage)
  • DCacheShim (what does Shim mean here?)
  • LoadStoreUnit
  • Rob
  • CsrFile

Contains hardware performance events

  • decode.scala

ChrisC. explains on riscv-hw mailing list why the implementation looks as such.

class DecodeUnit extends BoomModule

  • execute.scala

Execution units. The issue window schedules uops onto a specific execution pipeline.

abstract class ExecutionUnit

class ALUExeUnit extends ExecutionUnit (likewise FPUExeUnit and MemExeUnit)

  • fudecode.scala

Generates the functional unit control signals from the micro-op opcodes.

  • functional_unit.scala

abstract class FunctionalUnit

ALUUnit extends PipelinedFunctionalUnit

MemAddrCalcUnit extends PipelinedFunctionalUnit

FPUUnit extends PipelinedFunctionalUnit

PipelinedMulUnit extends PipelinedFunctionalUnit --> uses class IMul (below)

  • imul.scala

class IMul extends Module

For iterative or unpipelined units. These only handle a single uop at a time.

abstract class IterativeFunctionalUnit

  • issue.scala

abstract class IssueUnit

Contains a table of issue slots; each slot is a Module IssueSlot.

  • issue_ageordered.scala

class IssueUnitCollapsing extends IssueUnit

Dispatch entry logic -> find a slot to store new dispatched uops into.

Issue select logic
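One way to picture an age-ordered, collapsing queue is shown below. This Python sketch assumes that new uops enter at the tail, that issuing an entry lets younger entries shift toward the head, and that select picks the first ready slot. Those behaviors and all names are assumptions for illustration, not a reading of `IssueUnitCollapsing`.

```python
class CollapsingIssueQueue:
    """Hypothetical age-ordered issue queue: slot index encodes age (0 = oldest)."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.slots = []  # each slot is [uop, ready]

    def dispatch(self, uop, ready=False):
        """Dispatch entry logic: place the new uop in the next free (youngest) slot."""
        if len(self.slots) == self.num_slots:
            return False  # queue full, dispatch must stall
        self.slots.append([uop, ready])
        return True

    def wakeup(self, uop):
        """Mark a waiting uop as ready once its operands are available."""
        for slot in self.slots:
            if slot[0] == uop:
                slot[1] = True

    def issue(self):
        """Issue select logic: pick the oldest ready uop, then collapse the hole."""
        for i, (uop, ready) in enumerate(self.slots):
            if ready:
                del self.slots[i]  # younger entries shift toward index 0
                return uop
        return None
```

Keeping the slots compacted means priority by position is the same as priority by age, at the cost of moving entries every time one issues.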

  • issue_slot.scala

class IssueSlot extends BoomModule

  • issue_unordered.scala

class IssueUnitStatic extends IssueUnit

Contains issue table of # issue slots.

  • regfile_custom.scala

class Rf6r3wBitModel <- not synthesizable

  • regfile.scala

abstract class RegisterFile

class RegisterFileBehavioral

Highly configurable.

  • registerread.scala

Handles register read and the bypass network for the OoO backend. Interfaces with the issue window on the enqueue side and the execution pipelines on the dequeue side.

class RegisterRead extends BoomModule

  • rename-busytable.scala

class BusyTableHelper extends BoomModule

Implements a busy bit per register?

  • rename-freelist.scala

Which physical registers are free.

class RenameFreeListHelper extends BoomModule

class RenameFreeList extends BoomModule

List of unallocated physical? logical? registers

  • rename-maptable.scala

Map between logical and physical registers

  • rename.scala

Rename logic. Instantiate map table, free table, busy table for both GPR and FPR.

class RenameStage extends BoomModule
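How the map table, free list, and busy table cooperate can be sketched in a few lines of Python. This is a simplified model under assumptions: one register class, no special handling of x0, no branch snapshots, and a stale physical register that is freed when the overwriting instruction commits. All names are hypothetical.

```python
class Renamer:
    """Hypothetical rename model: map table + free list + busy table."""

    def __init__(self, num_logical=32, num_physical=64):
        self.map_table = list(range(num_logical))              # logical -> physical
        self.free_list = list(range(num_logical, num_physical))
        self.busy = [False] * num_physical                     # True until written back

    def rename(self, dst, srcs):
        """Return (pdst, psrcs, stale_pdst), or None if no physical register is free."""
        if not self.free_list:
            return None  # stall
        psrcs = [self.map_table[s] for s in srcs]  # read sources before remapping dst
        pdst = self.free_list.pop(0)
        stale = self.map_table[dst]
        self.map_table[dst] = pdst
        self.busy[pdst] = True
        return pdst, psrcs, stale

    def writeback(self, pdst):
        """The value is now available: clear the busy bit."""
        self.busy[pdst] = False

    def commit(self, stale):
        """The previous mapping can no longer be read: recycle it."""
        self.free_list.append(stale)
```

Reading the sources before updating the destination mapping matters for instructions such as `add x1, x1, x2`, where the source must see the old mapping of `x1`.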

  • rob.scala

Reorder buffer. Each "dispatch" group gets its own row of the ROB, and each instruction in the dispatch group goes to a different bank. Entries are added to the ROB when an instruction is dispatched.

class Rob extends BoomModule
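The row-and-bank layout can be modeled as a circular buffer of rows, each holding up to `width` entries. The Python sketch below is a simplification under assumptions: a row commits only when all of its entries are done, and exceptions and branch rollback are ignored. The class and method names are hypothetical.

```python
class RobModel:
    """Hypothetical ROB model: one row per dispatch group, one bank per instruction."""

    def __init__(self, num_rows, width):
        self.num_rows = num_rows
        self.width = width
        self.rows = [None] * num_rows
        self.head = 0   # oldest row
        self.tail = 0   # next row to allocate
        self.count = 0

    def dispatch(self, group):
        """Allocate one row for the group; return (row, bank) ids, or None if full."""
        assert 0 < len(group) <= self.width
        if self.count == self.num_rows:
            return None
        row = self.tail
        self.rows[row] = [{"uop": u, "done": False} for u in group]
        self.tail = (self.tail + 1) % self.num_rows
        self.count += 1
        return [(row, bank) for bank in range(len(group))]

    def complete(self, row, bank):
        """Mark an entry as finished executing (possibly out of order)."""
        self.rows[row][bank]["done"] = True

    def commit(self):
        """Retire the head row in program order once every entry in it is done."""
        if self.count == 0:
            return []
        entries = self.rows[self.head]
        if not all(e["done"] for e in entries):
            return []
        self.rows[self.head] = None
        self.head = (self.head + 1) % self.num_rows
        self.count -= 1
        return [e["uop"] for e in entries]
```

Even if a younger row finishes first, `commit` returns nothing until the head row is complete, which is what keeps retirement in program order.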

  • branchchecker.scala

Verify BTB prediction. Catch JAL..... Used by fetch unit.

class BranchChecker extends BoomModule

  • fetch.scala
  • F0: next PC select
  • F1: I$ access
  • F2: I$ response/predecode
  • F3: branch-check/verification/redirect
  • F4: take redirect

class FetchControlUnit extends BoomModule

  • fetchbuffer.scala

Takes a FetchBuffer and converts into a vector of MicroOps.

  • lsu.scala

Contains load address queue, store address queue, and store data queue.

Stores are sent to memory at commit, loads are executed ASAP. If misspeculation discovered, pipeline cleared. Loads "put to sleep" are retried. Load can receive its data by forwarding data out of the store-data queue.

Loads are sent to memory ASAP. In parallel, the SAQ is searched associatively. On a SAQ hit, the memory request is killed on the next cycle.

If the store data is not present, is the load put "to sleep" in the LAQ?
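The three outcomes for a load described above (go to memory, forward from the store-data queue, or sleep) can be sketched as a single decision function. This Python model is an assumption-laden simplification: it matches on exact addresses only, merges the SAQ and SDQ into one list of records, and treats a store whose address is still unknown as a non-match. The function name and record fields are hypothetical.

```python
def execute_load(load_addr, load_age, store_queue):
    """Decide how a load gets its data.

    store_queue: list of dicts with keys
      'age'  (smaller means older),
      'addr' (None while the store address is unknown),
      'data' (None while the store data is not yet available).
    Returns ('memory', None), ('forward', data) or ('sleep', None).
    """
    # Associative search: older stores to the same address.
    older = [s for s in store_queue
             if s["age"] < load_age and s["addr"] == load_addr]
    if not older:
        return ("memory", None)       # no hit: the memory request proceeds
    youngest = max(older, key=lambda s: s["age"])
    if youngest["data"] is not None:
        return ("forward", youngest["data"])  # hit with data: forward it
    return ("sleep", None)            # hit without data: retry the load later
```

Stores with unknown addresses are exactly the case where a load may run ahead incorrectly; the notes above cover that by clearing the pipeline when misspeculation is discovered.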
