edcote/research.md

## research.md

      
A. Caulfield et al., "A Cloud-Scale Acceleration Architecture"

Proposes a new cloud architecture that uses FPGAs to accelerate network-plane functions (e.g., encryption) and applications (e.g., search ranking).
Network flows can be transformed at line rate using FPGAs.
An FPGA is placed in each node between the NIC and the network switch, and is also connected to the CPU over PCIe.  Three scenarios: local compute acceleration (through PCIe), network acceleration, and global application acceleration.
M. Abadi et al., "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems"

A TensorFlow computation is described by a directed graph that represents a dataflow computation, with extensions for maintaining/updating persistent state and for branching and looping control.  Each node has zero or more inputs and zero or more outputs, and represents an instance of an operation.  Values that flow along normal edges are called Tensors.  Special edges called control dependencies can also exist.  No data flows on such edges.
An operation has a name and represents an abstract computation (e.g., matrix multiply or add).  An operation can have attributes that are provided at graph-construction time.  A kernel is an implementation of an operation that can be run on a particular type of device.
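The graph model described above can be sketched in a few lines of plain Python. This is only an illustration of nodes, data edges, control dependencies, attributes, and per-device kernels; it is not TensorFlow's API, and the names `Node`, `KERNELS`, and `run` are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical kernel registry: (operation name, device type) -> implementation.
# Every kernel receives the node's attributes first, then its input values.
KERNELS = {
    ("Const", "CPU"): lambda attrs: attrs["value"],
    ("Add", "CPU"): lambda attrs, a, b: a + b,
    ("MatMul", "CPU"): lambda attrs, a, b: [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a
    ],
}

@dataclass
class Node:
    name: str
    op: str
    inputs: list = field(default_factory=list)        # normal edges: values flow
    control_deps: list = field(default_factory=list)  # control edges: ordering only
    attrs: dict = field(default_factory=dict)         # fixed at graph-construction time

def run(nodes, fetch, device="CPU", trace=None):
    """Evaluate the node named `fetch`, running its dependencies first."""
    by_name = {n.name: n for n in nodes}
    values = {}

    def evaluate(name):
        if name in values:
            return values[name]
        node = by_name[name]
        for dep in node.control_deps:   # must run first, but passes no data
            evaluate(dep)
        args = [evaluate(i) for i in node.inputs]
        values[name] = KERNELS[(node.op, device)](node.attrs, *args)
        if trace is not None:
            trace.append(name)
        return values[name]

    return evaluate(fetch)

graph = [
    Node("a", "Const", attrs={"value": [[1, 2], [3, 4]]}),
    Node("b", "Const", attrs={"value": [[1, 0], [0, 1]]}),
    Node("c", "MatMul", inputs=["a", "b"]),
    Node("x", "Const", attrs={"value": 2}),
    Node("y", "Add", inputs=["x", "x"], control_deps=["c"]),
]
order = []
result = run(graph, "y", trace=order)
```

Fetching `y` forces `c` to run first even though no tensor flows from `c` to `y`, which is the role the notes assign to control dependencies.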
Google TPU

(https://www.nextplatform.com/2017/04/05/first-depth-look-googles-tpu-architecture/)
The TPU is programmable like a CPU or GPU and is not designed for just one neural net model.  It executes CISC instructions on many kinds of networks (convolutional, LSTM, and large fully connected models).  It is still programmable, but uses a matrix as its primitive instead of a vector or scalar.
Many weights must be fetched to feed the matrix multiply unit.  The TPU has two memories, one of which is an external DRAM that holds the model parameters.  The parameters come in and are loaded into the matrix multiply unit from the top.  At the same time, activations (the outputs of the "neurons") can be loaded from the left.  They move through the matrix unit in a systolic manner.
Essentially, a 256x256 systolic data flow engine.
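To make the systolic data flow concrete, here is a minimal cycle-by-cycle sketch of a weight-stationary array in Python. It is a simplified model under my own assumptions (weights preloaded, one multiply-accumulate per cell per cycle, activations skewed by one cycle per row), not a description of the actual TPU pipeline, and it uses a tiny array rather than 256x256.

```python
def systolic_matmul(X, W):
    """Compute X @ W (X is M x K, W is K x N) on a simulated K x N array.

    Cell (k, n) holds weight W[k][n]. Activations enter from the left and
    move one cell to the right per cycle; partial sums move one cell down
    per cycle and leave the array at the bottom row.
    """
    M, K, N = len(X), len(W), len(W[0])
    act = [[0] * N for _ in range(K)]    # activation register in each cell
    psum = [[0] * N for _ in range(K)]   # partial-sum register in each cell
    Y = [[0] * N for _ in range(M)]

    for t in range(M + K + N - 2):
        new_act = [[0] * N for _ in range(K)]
        new_psum = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    m = t - k            # row k of the array is fed k cycles late
                    a = X[m][k] if 0 <= m < M else 0
                else:
                    a = act[k][n - 1]    # value the left neighbour held last cycle
                above = psum[k - 1][n] if k > 0 else 0
                new_act[k][n] = a
                new_psum[k][n] = above + a * W[k][n]
        act, psum = new_act, new_psum
        for n in range(N):               # collect finished sums at the bottom row
            m = t - (K - 1) - n
            if 0 <= m < M:
                Y[m][n] = psum[K - 1][n]
    return Y

product = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Each weight is read from memory once and then reused for every row of activations, which is why the array reduces the weight-fetch pressure mentioned above.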
Drilling down Xeon Skylake Architecture

(https://www.nextplatform.com/2017/08/04/drilling-xeon-skylake-architecture/)
Intel under pressure to increase I/O bandwidth.  Skylake offers 10% normalized IPC boost.  Added AVX-512 vector math units to core.  Skylake shifted to non-inclusive L3 cache.  Smaller L3 than previous, but larger L2 which is helping.
Deep Learning Chip Upstart Takes GPUs to Task

(https://www.nextplatform.com/2016/08/08/deep-learning-chip-upstart-set-take-gpus-task/)
Nervana Systems; Naveen Rao, CEO.  Stripped-down tensor-based architecture.  There is no floating point element and much of the data movement is handled by software.  Nervana calls its replacement for floating point "flexpoint".  The interconnect is non-coherent and uses explicit message passing, with everything managed from software.  SRAM-to-SRAM communication.
An Early Look at Startup Graphcore’s Deep Learning Chip

(https://www.nextplatform.com/2017/03/09/early-look-startup-graphcores-deep-learning-chip/)
Consider neural networks as massive graphs.  Graphcore has a graph processor, or intelligence processing unit (IPU).
"If you look at the underlying machine learning workload, you're trying to capture a digest of data, a set of the features and relationships of those features that you learn from the data. This can be expressed as a neural network model, or more correctly and universally, as a computational graph with a vertex representing some compute function on a set of edges that are representing data with weights associated,"
The model of the neural net is kept in the processor.  Minimize any interactions with memory?
A Dive into Deep Learning Chip Startup Graphcore’s Software Stack

(https://www.nextplatform.com/2017/05/08/dive-deep-learning-chip-startup-graphcores-software-stack/)
Drilling Into Microsoft’s BrainWave Soft Deep Learning Chip

But with the Moore’s Law pace in improvements in price/performance slowing, those who need to get the most work out of a unit of space and power have no choice but to make use of specialized components that are highly tuned for specific workloads.   The trick, it seems, is to get something that is malleable enough to be used for a long time and to do different jobs, even those outside of deep learning.
This is why Microsoft has decided to employ FPGA accelerators (soft DPU) in its infrastructure, and it is the foundation of its BrainWave deep learning stack.
The neat bit is that the BrainWave soft DPU has an adaptive ISA that is completely parameterized and is not tied to any particular kind or level of precision, integer or floating point.
“The goal of our compiler is to take customer pretrained models, developed in CNTK or other frameworks, and seamlessly compile that down onto hardware microservices,”
First In-Depth View of Wave Computing’s DPU Architecture, Systems

The DPU has 16,000 processing elements, over 8,000 arithmetic units, and a unique self-timing mechanism. All cores run at 6.7 GHz using a coarse-grained reconfigurable architecture, a very different animal from what we have seen from some other deep learning chip startups. When no data is being fed through, the DPUs go to sleep. The DPU can be considered a hybrid of an FPGA and a manycore processor that tackles static scheduling of data flow graphs across these thousands of elements.
SIMD Instructions Considered Harmful - David Patterson and Andrew Waterman on Sep 18, 2017

link
SIMD starts off innocently.  Existing registers and ALUs are partitioned into many 8-, 16-, or 32-bit pieces, and computation occurs on the pieces in parallel.  Architects subsequently double the width of the registers, and each doubling adds new instructions, so code written for the narrower registers must be rewritten or recompiled to use the wider ones.  A more elegant option is to use vector architectures to exploit data-level parallelism.  Vector computers gather objects from main memory and put them in long, sequential vector registers.  Pipelined execution units operate efficiently on these vector registers.  Vector architectures then scatter the results from registers back to memory.
SIMD and vector code differ in size: the SIMD version of a loop needs more instructions.  The extra fetches and decoding mean higher energy to perform the same task.
Vector Architectures: Past, Present, and Future - Espasa, R. et al.

link
Good historical perspective presented in this paper.  Future advantages include a reduction in instruction count and improved memory system performance because of the vector style of accessing memory.  Every item of data is actually used.  Information about the memory access pattern is conveyed via the stride value.  Lower power, simpler control logic, etc.
RISC-V Vector Extension Proposal -- Asanovic, K. et al.

Supports auto-vectorization (OpenMP) and explicit SPMD (OpenCL) programming models.  Fits into standard 32-bit encoding space.
Uses Cray-style vectors.  Implementation-dependent vector length.  Same binary runs on different hardware vector lengths.
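A rough Python sketch of why the same binary can run on different hardware vector lengths: the loop asks the hardware how many elements it may process this iteration and advances by that amount (strip-mining). `setvl` here is a hypothetical stand-in for a set-vector-length instruction, not the proposal's exact semantics.

```python
def setvl(requested, hw_max_vl):
    """Hypothetical stand-in for a 'set vector length' instruction:
    grants at most the hardware's maximum vector length."""
    return min(requested, hw_max_vl)

def vector_add(a, b, hw_max_vl):
    """Strip-mined element-wise add; returns (result, number of strips)."""
    n = len(a)
    out = [0] * n
    i = 0
    strips = 0
    while i < n:
        vl = setvl(n - i, hw_max_vl)   # elements handled by this strip
        out[i:i + vl] = [x + y for x, y in zip(a[i:i + vl], b[i:i + vl])]
        i += vl
        strips += 1
    return out, strips

a = list(range(10))
b = [10] * 10
short_hw = vector_add(a, b, hw_max_vl=4)    # narrow implementation
long_hw = vector_add(a, b, hw_max_vl=16)    # wide implementation
```

The loop body never mentions a fixed width, so only the strip count changes between the two runs.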
Up to 32 vector data registers, v0-v31, of at least 4 elements each.  Up to 8 vector predicate registers, with 1 bit per element.  Predication is an architectural feature that provides an alternative to conventional branch instructions.  The technique works by executing instructions from both paths and permitting only those from the taken path to modify architectural state.  The ISA also includes vector configuration CSRs.
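As a small illustration of predication, the Python sketch below evaluates both sides of a per-element `if` and lets a predicate vector (one boolean per element) decide which result is kept. The function name and the particular branch bodies are made up for the example.

```python
def predicated_update(x, threshold):
    """Element-wise equivalent of:
           if x[i] > threshold: y[i] = x[i] * 2
           else:                y[i] = x[i] + 1
       written without a branch per element."""
    pred = [v > threshold for v in x]     # predicate register: 1 bit per element
    then_vals = [v * 2 for v in x]        # 'taken' path, computed for every element
    else_vals = [v + 1 for v in x]        # 'not taken' path, computed for every element
    # Only the path selected by the predicate updates the result.
    return [t if p else e for p, t, e in zip(pred, then_vals, else_vals)]

y = predicated_update([1, 5, 3, 8], threshold=3)
```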
Cray-1A Architecture

info
The architecture has eight 64-bit scalar (S) registers, eight vector (V) registers of 64 64-bit words each, and eight 24-bit address (A) registers.  It also supports a software-managed cache.  Everything in the machine is fully pipelined.  An add operation might take only 5 cycles to start producing results.  The output of the adder can be chained straight into another vector unit (say, a multiplier).
source code
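A back-of-the-envelope model of what chaining buys, as a hedged Python sketch. It assumes fully pipelined units that deliver one element per cycle after their start-up latency; the 5-cycle add latency is the figure from the notes above, while the 7-cycle multiply latency is an assumed value for illustration only.

```python
def unchained_cycles(n, add_latency, mul_latency):
    """The multiply waits until the whole add result vector is written."""
    return (add_latency + n) + (mul_latency + n)

def chained_cycles(n, add_latency, mul_latency):
    """Each add result is forwarded to the multiplier as soon as it appears,
    so only the two start-up latencies are paid before results stream out."""
    return add_latency + mul_latency + n

n = 64                                         # one full vector register
slow = unchained_cycles(n, add_latency=5, mul_latency=7)
fast = chained_cycles(n, add_latency=5, mul_latency=7)
```

Under these assumptions the chained pair finishes in 76 cycles instead of 140.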
Software-Managed Caches: Architectural Support for Real-Time Embedded Systems: Jacob, B.

article
This article covers the case for real-time systems.  Caches are usually disabled in these systems for the sake of determinism.  Real-time system designs are concerned with the worst-case behavior of the system, not the average or best case.  Caches provide only a probabilistic performance boost.
Software-managed caches allow an OS to determine on a cacheline-by-cacheline basis whether or not to cache data.  For example, initialization code is never cached, while the application main loop can be cached.
The Case for VLIW-CMP as a Building Block for Exascale: Jacob, B.

paper
Getting to 1 EFLOP will be hard.
VLIW architectures typically require data forwarding between multiple pipelines, which is expensive.  They also require a more complex register file to support those pipelines.  Clustering of register files and software register renaming can be used to address this.  Clustering encodes the architecture description in the program and makes applications incompatible between implementations.  Not a great idea.  A register renaming solution is described in the paper.  Positive claims on energy use.
Vector Processing Lecture Notes: Patterson, D.

[lecture notes](https://people.eecs.berkeley.edu/~pattrsn/252F96/Lecture06.pdf)
Vector processors have high-level operations that work on linear arrays of numbers: "vectors". For example, A=BxC where A,B,C are 64-element vectors of 64-bit floating point numbers.
Vectors have the following properties: each result is independent of the previous result (long pipeline, compiler ensures no dependencies); a single vector instruction implies lots of work (few instruction fetches); memory access happens with a known pattern (highly interleaved, latency amortized, no caches required?); and branches and branch problems in hardware are reduced.

A vector register is a fixed-length bank holding a single vector.  It has at least 2 read ports and 1 write port.  Has ~8-16 elements.
Vector functional units are fully pipelined and start a new operation on each clock.
Vector load/store units are fully pipelined to load or store a vector.

Example, DAXPY (Y = a*X+Y):
LD F0, a ; load scalar a
LV V1, Rx ; load vector X
MULTS V2, F0, V1 ; vector * scalar multiply
LV V3, Ry ; load vector Y
ADDV V4, V2, V3 ; add
SV Ry, V4 ; store the result
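For reference, the same DAXPY sequence can be mirrored in Python with one statement per vector instruction. This is only a functional sketch; the lists stand in for vector registers and the function name `daxpy` is my own.

```python
def daxpy(a, x, y):
    """Y = a*X + Y, one Python statement per vector instruction."""
    f0 = a                                   # LD    F0, a
    v1 = list(x)                             # LV    V1, Rx
    v2 = [f0 * xi for xi in v1]              # MULTS V2, F0, V1
    v3 = list(y)                             # LV    V3, Ry
    v4 = [p + q for p, q in zip(v2, v3)]     # ADDV  V4, V2, V3
    return v4                                # SV    Ry, V4

result = daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```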

Elements of vector execution time:

Initiation rate: rate at which each functional unit consumes vector elements
Convoy: set of vector instructions that can begin execution in the same clock
Chime: approximate time for a vector operation
m convoys take m chimes; if each vector length is n, then they take approximately m x n clock cycles (ignoring overhead)
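The convoy and chime definitions above can be turned into a small estimator. The Python sketch below groups the DAXPY instructions into convoys greedily under two assumptions of mine: there is one functional unit of each kind, and there is no chaining, so an instruction cannot share a convoy with the producer of one of its operands. The unit names are illustrative.

```python
# (mnemonic, functional unit, destination register, source registers)
DAXPY = [
    ("LV",    "load_store", "V1", []),
    ("MULTS", "multiply",   "V2", ["V1"]),
    ("LV",    "load_store", "V3", []),
    ("ADDV",  "add",        "V4", ["V2", "V3"]),
    ("SV",    "load_store", None, ["V4"]),
]

def group_into_convoys(program):
    """Greedy grouping: start a new convoy on a structural hazard (unit
    already busy) or a data hazard (operand produced in the current convoy)."""
    convoys = []
    busy_units, written = set(), set()
    for mnemonic, unit, dest, sources in program:
        hazard = unit in busy_units or any(s in written for s in sources)
        if not convoys or hazard:
            convoys.append([])
            busy_units, written = set(), set()
        convoys[-1].append(mnemonic)
        busy_units.add(unit)
        if dest is not None:
            written.add(dest)
    return convoys

def estimated_cycles(program, n):
    """m convoys of n-element vectors take roughly m * n cycles."""
    return len(group_into_convoys(program)) * n

convoys = group_into_convoys(DAXPY)
cycles = estimated_cycles(DAXPY, n=64)
```

With these assumptions DAXPY falls into 4 convoys, so 64-element vectors take roughly 256 cycles before start-up overhead is counted.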

Sparse Matrices
do 100 i = 1,n
100 A(K(i)) = A(K(i)) + C(M(i))


The gather operation (LVI) takes an index vector (here K or M) and fetches the vector whose elements are at the addresses [..]

(need additional clarification here)
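One way to read the sparse loop is as two gathers, a dense vector add, and a scatter. The Python sketch below follows that reading; it uses 0-based indices (the Fortran loop is 1-based) and assumes K contains no repeated indices, since otherwise the vectorized form would not match the sequential loop. The helper names are hypothetical.

```python
def gather(memory, index):
    """Fetch the elements of `memory` selected by the index vector."""
    return [memory[i] for i in index]

def scatter(memory, index, values):
    """Store `values` back to the positions named by the index vector."""
    for i, v in zip(index, values):
        memory[i] = v

def sparse_update(A, C, K, M):
    """A(K(i)) = A(K(i)) + C(M(i)) for all i, done with vector operations."""
    a_vals = gather(A, K)                            # gather through K
    c_vals = gather(C, M)                            # gather through M
    sums = [a + c for a, c in zip(a_vals, c_vals)]   # dense vector add
    scatter(A, K, sums)                              # scatter through K
    return A

A = [10, 20, 30, 40]
C = [1, 2, 3]
updated = sparse_update(A, C, K=[3, 0], M=[2, 1])
```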
Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL: Kim, D. et al.

(this is a really good paper)
paper
Sample-based energy simulation methodology.  Uses an FPGA to simulate the performance of an RTL design and to collect samples containing exact RTL state snapshots.  Guarantees at least a four-orders-of-magnitude improvement over CAD gate-level simulation tools.  Results are believed to be within 5% of the true average energy with 99% confidence.
Design performance is evaluated using full-system RTL simulation, during which a set of replayable RTL snapshots is captured.  The average power is computed by replaying the samples on a gate-level power model.  The samples are used to build statistical confidence in the estimate.
Strober generates an FPGA simulator with the ability to capture a full replayable RTL snapshot at any capture point.
The improvement here is to generate the FPGA simulator automatically from the RTL design.  I think this is based on previous work (FAME1, token-based simulation).  There is a FAME1 transform pass.  The simulators are instances of synchronous data flow.
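The sampling idea can be illustrated with a short Python sketch: average the power measured on each replayed snapshot and attach a confidence interval. The normal-approximation interval and the sample values below are my own assumptions for illustration; they are not taken from the paper.

```python
import math
import statistics

def mean_with_confidence(samples, z=2.576):
    """Return (sample mean, half-width of the confidence interval).

    z = 2.576 is the two-sided 99% quantile of the normal distribution,
    used here as a large-sample approximation.
    """
    mean = statistics.mean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, half_width

# Hypothetical per-snapshot power numbers (watts) from gate-level replay.
snapshot_power = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8]
avg, half = mean_with_confidence(snapshot_power)
relative_error_bound = half / avg
```

Taking more snapshots shrinks the half-width roughly with the square root of the sample count, which is how a target such as "within 5% at 99% confidence" would be reached.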
A Case for FAME: FPGA Architecture Model Execution: Tan, Z. et al.

[paper](https://people.eecs.berkeley.edu/~krste/papers/fame-isca2010.pdf)
Architecture research needs a dramatic increase in simulation capacity.  FAME offers a two-orders-of-magnitude increase in capacity over software architecture models.
The direct approach maps the target machine's RTL description to gates.  Resynthesis provides a guaranteed cycle-accurate model.  Quickturn was an early example that provided 1-2 MHz.  The problem is resynthesis: it can take up to 30 hours to resynthesize a design.
In the decoupled approach, a single target clock cycle can be implemented with multiple, or a variable number of, host clock cycles.
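A toy Python illustration of decoupling, under my own assumptions: the target design has a memory with two read ports, the host only has a single-ported memory, so the model spends two host cycles to complete one target cycle. The class and method names are hypothetical.

```python
class DecoupledDualPortMemory:
    """Models a 2-read-port target memory on a 1-port host memory."""

    def __init__(self, size):
        self.data = [0] * size
        self.host_cycles = 0      # cycles of the FPGA host clock
        self.target_cycles = 0    # cycles of the simulated machine

    def _host_read(self, addr):
        # The host memory can serve only one read per host cycle.
        self.host_cycles += 1
        return self.data[addr]

    def target_cycle(self, addr_a, addr_b):
        """One target clock: both reads appear simultaneous to the target."""
        a = self._host_read(addr_a)
        b = self._host_read(addr_b)
        self.target_cycles += 1
        return a, b

mem = DecoupledDualPortMemory(size=8)
mem.data[1], mem.data[2] = 11, 22
for _ in range(3):
    values = mem.target_cycle(1, 2)
```

The target still sees cycle-accurate behaviour; only the ratio of host to target cycles changes.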