An approachable introduction to how modern CPUs got fast, beyond throwing more GHz at the problem

Context

I was helping a few computer science students and enthusiasts understand “how” modern processors got to be “so fast” outside of clock speed increases.

Here is the main excerpt.

Acronyms:  

SIMD: Single Instruction, Multiple Data  

ALU: Arithmetic logic unit, AKA where math happens  

OoO: Out of Order (execution)     

Problem:

Over the last few decades, the want/need for compute has grown exponentially, and vendors have come up with lots of ideas to keep up with that demand. Let’s take it as fact that memory is slow. That means that a not-insignificant amount of the time, we’ll be waiting for data to arrive. Therefore, we want to be working on other things even while we're waiting on memory.

   

Solutions

Modern processors have multiple mechanisms for doing work "concurrently". This post will illustrate how 3 of these mechanisms, used together, can allow for a 16x increase in compute performance.

   

Practical explanation:

First, our baseline example: we have 3 arrays of equal length, say size = 32.

(If you prefer to think in “real” applications, think of the arrays as the bank accounts of individuals: a[] will be net worth, b[] the chequing account and c[] the savings account. Replace “user” with “local bank branch manager”, single entries with single customers, batches with groups of customers, etc.)

Suppose the user is doing:

for (i = 0; i < size; i++)
{
    a[i] = b[i] + c[i];
}

It's "easy" to see that the order in which we execute the iterations doesn't matter. You can convince yourself of this by asking whether the results would be different if we ran this code instead:

for (i = size - 1; i >= 0; i--)
{
    a[i] = b[i] + c[i];
}

The loop above computes the same thing, but in the opposite order: it starts from the last element (index 31) and works backwards, instead of counting upwards from index 0 like the first loop.
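That order-independence can be checked with a short, compilable C sketch: the two functions below are the forward and backward loops from the text (the function names and fixed SIZE are mine, for illustration).

```c
#include <assert.h>

#define SIZE 32

/* Forward loop from the text: walks indices 0 .. 31 upwards. */
void add_forward(float *a, const float *b, const float *c) {
    for (int i = 0; i < SIZE; i++)
        a[i] = b[i] + c[i];
}

/* Backward loop from the text: walks indices 31 .. 0 downwards. */
void add_backward(float *a, const float *b, const float *c) {
    for (int i = SIZE - 1; i >= 0; i--)
        a[i] = b[i] + c[i];
}
```

Running both over the same inputs and comparing the outputs element by element shows they agree; that property is exactly what the hardware tricks below exploit.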

To keep things simple, we'll only cover 3 things modern processors do: SIMD, multiple ALUs and OoO execution.

What can our theoretical processor do?

Let's say our processor has 2 execution pipelines (ALUs) for math, as well as SIMD units that can work on 8 pieces of data at once, and supports OoO execution.

SIMD means that we can split the 32-entry problem into 4 different "batches", in this case batches 0-7, 8-15, 16-23 and 24-31.

Multiple ALUs means we can process multiple "batches" at a time.  

OoO means the processor doesn't have to process data in program order, and can "reorder" work within the same program.

What does SIMD look like?

Instead of doing  

for (i = 0; i < size; i++)
{
    a[i] = b[i] + c[i];
}

We can "split" the program into SIMD batches at the 8-element boundary.  

for (i = 0; i < size; i += 8)
{
    a[i->i+7] = b[i->i+7] + c[i->i+7]; // here I'm saying we're processing in groups of 8 in SIMD
}
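Real SIMD code would use vendor intrinsics (e.g. 8-wide AVX registers on x86) or let the compiler auto-vectorize the plain loop. As a portable sketch, each pass of the outer loop below stands in for one vector instruction operating on 8 lanes at once (the function name and scaffolding are mine):

```c
#include <assert.h>
#include <stddef.h>

/* Portable sketch of 8-wide SIMD batching: each pass of the outer
   loop stands in for ONE vector instruction operating on 8 lanes.
   Assumes size is a multiple of 8, as with size = 32. */
void add_simd8(float *a, const float *b, const float *c, size_t size) {
    for (size_t i = 0; i < size; i += 8)          /* one "batch" per step */
        for (size_t lane = 0; lane < 8; lane++)   /* 8 lanes "at once"    */
            a[i + lane] = b[i + lane] + c[i + lane];
}
```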

What do multiple ALUs + OoO execution look like?

Since we have 2 ALU pipes AND the order doesn't matter, we can have 2 elements in flight at the same time:

for (i = 0; i < size; i += 2)
{
    // important: pipe 1 and pipe 2 are active at the same time

    // pipe 1
    a[i] = b[i] + c[i];

    // pipe 2
    a[i+1] = b[i+1] + c[i+1];
}
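In plain C this is just a 2-way unrolled loop: the two statements in the body don't depend on each other, which is what lets a 2-pipe out-of-order core issue them in the same cycle (a sketch, assuming size is even; the function name is mine):

```c
#include <assert.h>
#include <stddef.h>

/* Two-way unrolled scalar loop: the two statements in the body are
   independent of each other, so a core with 2 ALU pipes can issue
   them in the same cycle. Assumes size is even. */
void add_unrolled2(float *a, const float *b, const float *c, size_t size) {
    for (size_t i = 0; i < size; i += 2) {
        a[i]     = b[i]     + c[i];      /* can go down pipe 1 */
        a[i + 1] = b[i + 1] + c[i + 1];  /* can go down pipe 2 */
    }
}
```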

SIMD, multiple ALUs and OoO at the same time

Bring them together and we get:  

for (i = 0; i < size; i += 16)
{
    // pipe 1
    a[i->i+7] = b[i->i+7] + c[i->i+7]; // one SIMD batch of 8

    // pipe 2
    a[i+8->i+15] = b[i+8->i+15] + c[i+8->i+15]; // a second SIMD batch of 8, in parallel
}
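As a compilable sketch of the combined scheme, each outer step below consumes 16 elements as two independent 8-wide batches, one per ALU pipe (assuming size is a multiple of 16, e.g. 32; the names are mine):

```c
#include <assert.h>
#include <stddef.h>

/* Combined sketch: each outer step consumes 16 elements as two
   independent 8-wide batches, one per ALU pipe. Assumes size is a
   multiple of 16 (e.g. 32). */
void add_simd8_x2(float *a, const float *b, const float *c, size_t size) {
    for (size_t i = 0; i < size; i += 16)
        for (size_t lane = 0; lane < 8; lane++) {
            a[i + lane]     = b[i + lane]     + c[i + lane];       /* pipe 1: batch i .. i+7    */
            a[i + 8 + lane] = b[i + 8 + lane] + c[i + 8 + lane];   /* pipe 2: batch i+8 .. i+15 */
        }
}
```

With 8 lanes per batch and 2 batches per step, each iteration retires 16 additions, which is where the 16x figure in the introduction comes from.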

Beyond the 32 entries

So how does the above help with our memory issue for “bigger” problems? Well, let’s say that instead of size = 32, size is something like 2^25 (33,554,432), approximately 33.5M entries. That’s huge!

Using the bank analogy from early on, there’s no way all 33.5M records for the savings and chequing accounts are going to be in the same filing cabinet! That means that some “batches” are going to be waiting for files to be fetched from other cabinets, or even other branches of the bank! We wouldn’t want to be stuck waiting for every single entry to arrive in numerical order before processing them.

If our processor couldn’t do out of order execution, and a single “misplaced” entry was missing or slow to arrive, we’d have to sit idle, twiddling our thumbs until it arrived.

To illustrate the point, think of what would happen if the record corresponding to the chequing account for person #42 was misplaced. We process in batches of 8, meaning that we can do the first 5 batches (entries 0-39). Since we “have” to process in batches, and the 6th batch includes #42 (entries 40-47), we must wait for it to arrive. But it gets worse! Since we can’t do anything out of order, we can’t start on batch 7 (entries 48-55) either!

Typically, a calculation takes around 3 “cycles”. Fetching data from memory takes around ~100 cycles. In the time we spend waiting for a single operand to arrive, we could have finished another 33 batches!

And that’s if the data/entries are already in memory! Even the fastest modern storage systems take thousands, if not tens of thousands, of cycles to deliver data.

  In terms of our loop from earlier, by adding out of order execution we can now think of the loop as:    

while (batches_completed < (size / batch_size))
{
    // pipe 1
    process next available batch

    // pipe 2
    process next available batch
}

There's some work in the background to keep track of what you're waiting on, but that's a topic for another day.
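That background bookkeeping can be caricatured as a tiny "ready list". This is a toy model of picking the "next available batch", not how real hardware schedulers are implemented, and all the names are my own:

```c
#include <assert.h>
#include <stdbool.h>

#define NBATCH 4  /* e.g. 32 elements / 8 per batch */

/* ready[b] means batch b's data has arrived from memory;
   done[b] means we've already processed it. Instead of waiting
   for batches in numerical order, grab ANY ready, unprocessed one. */
int take_next_ready(const bool ready[NBATCH], bool done[NBATCH]) {
    for (int b = 0; b < NBATCH; b++)
        if (ready[b] && !done[b]) {
            done[b] = true;  /* claim this batch */
            return b;
        }
    return -1;  /* nothing ready: this is where a core would stall */
}
```

If batch 1 is stuck waiting on memory while batches 0, 2 and 3 have arrived, successive calls hand out 0, then 2, then 3: the stall is deferred until there is truly nothing else to do.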

Questions?

If you have questions, please feel free to reach out!

  I’m https://mast.hpc.social/@fclc on mastodon and https://twitter.com/FelixCLC_ on twitter.  

In case I never get around to following this up with a piece on SMT, prefetching, etc., this pair of papers should teach you more than you thought possible about how memory works (https://akkadia.org/drepper/cpumemory.pdf) and how floating point works (https://people.cs.pitt.edu/~cho/cs1541/current/handouts/goldberg.pdf).
