I have had working knowledge of GPUs for a very long time now. I remember playing around with NVIDIA RIVA 128 or something similar with DirectX when they were still 3D graphics accelerators. I have also tried to keep up with the times and did some basic shader programming on a contemporary NVIDIA or AMD GPU.
However, today's GPUs are necessary for another reason – the explosion of AI workloads, including large language models (LLMs). From a GPU perspective, AI workloads are just massive applications of tensor operations such as matrix addition and multiplication. However, how does the modern GPU execute them, which is much more efficient than running the workloads on a CPU?
Consider CUDA, NVIDIA's programming language that extends C to exploit data parallelism on GPUs. In CUDA, you write code for CPU (host code) or GPU (device code). CPU code is just mostly plain C, but CUDA extends the language in two ways: it allows you to define functions for GPUs (kernels) and also provides a way to launch kernels on the GPU.
If we take the example of vector addition where you perform an element-wise addition from vectors A and B in an output vector C, you would have something like the following as the kernel (the thing that runs on the GPU):
__global__
void vecAddKernel(float *A, float *B, float *C, int n) {
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < n) C[i] = A[i] + B[i];
}
The global keyword in CUDA marks a C function as a kernel that runs on the GPU. The input A and B vectors and the output C vector are the A, B, and C parameters. Each of them is of n elements in size. The part where the kernel is doing actual vector addition is
if (i < n) C[i] = A[i] + B[i];
But what are the bounds check and the unique variables determining the element index I?
To answer the question, let's first jump to the host code that is needed to launch the kernel, which would look something like this:
void vecAdd(float *A, float *B, float *C, int n) {
float *A_d, *B_d, *C_d;
int size = n * sizeof(float);
cudaMalloc(&A_d, size);
cudaMalloc(&B_d, size);
cudaMalloc(&C_d, size);
cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);
vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);
cudaFree(A_d);
cudaFree(B_d);
cudaFree(C_d);
}
As there are no particular keywords, it's just a C function with the same parameters. The first thing you probably notice is the cudaMalloc,
cudaMemcpy,
and cudaFree
function calls, which are part of CUDA's heterogeneous memory management. Remember, this function executes on the CPU, which has access to the host memory. But GPUs have their memory, so you must allocate memory on the GPU and copy the input vectors. When the GPU kernel completes, you need to copy the results back to the host memory space and free up the memory used on the GPU.
Also, to launch the kernel, the CUDA syntax looks like a weird templatized function call, but what are the two additional configuration parameters? The first configures the number of blocks in a grid, and the second specifies the number of threads in a block. But what on earth is a grid? In CUDA, a kernel executes on a grid, which is blocks of threads. When we launch a kernel, we determine the grid with the two configuration parameters.
And if we go back to the kernel definition, we can see the unique variables blockDim,
blockIdx, "and
threadIdx` in use, which is the part where we define what part of the grid the kernel executes.
__global__
void vecAddKernel(float *A, float *B, float *C, int n) {
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < n) C[i] = A[i] + B[i];
}
The blockDim
variable specifies block dimensions because CUDA allows blocks to be three-dimensional. We only have a one-dimensional block for our vector example, so that part is straightforward. On the other hand, the blockade
variable tells us which block the kernel is executing. Similarly, the thread
variable tells us which hardware thread executes the kernel.
In the kernel launch, we had the following, which means there are n/256 blocks, rounded up to the nearest integer and 256 threads per block.
void vecAdd(float *A, float *B, float *C, int n) {
/// …
vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
/// …
}
For example, if the vector size n is 1000, we have four blocks and 256 threads per block, 1024 hardware threads. And that's where the boundary check i < n
in the kernel definition comes in handy: the first 1000 threads perform addition, but the remaining 24 threads are essentially a no-op.
So, how is the execution on the GPU different from the CPU?
Although the core logic of the vector addition is the same, instead of having an explicit loop to iterate over the vector elements:
void vecAdd(float *A, float *B, float *C, int n) {
for (int i = 0; i < n; i++) {
C[i] = A[i] + B[i];
}
}
In CUDA, we specify the parallelism of the task as part of the kernel launch and only define the serial part that runs on GPU hardware threads of the vector addition in CUDA.
But more importantly, the cost of a GPU hardware thread is minimal compared to operating system threads you would use on the CPU. That's because GPUs internally have a large register file compared to CPUs, meaning switching between threads is less expensive. You can expect a GPU thread creation and scheduling to take a few GPU cycles, whereas a POSIX thread, for example, can take microseconds to create and schedule. Furthermore, GPUs optimize for parallelism further with warps, which is a set of 32 hardware threads, which all execute the same instruction at a time.
Finally, a crucial part of making GPUs execute code fast is their on-chip memory. CUDA applications need heterogeneous memory management where the program explicitly transfers data in and out of the GPUs.
tl;dr; Although program execution on a GPU resembles something like a program running on CPU with lots and lots of threads, there are two key differences: (1) program configures data parallelism explicitly as part of kernel launch and (2) GPU hardware threads are few orders of magnitude cheaper and optimized for parallel execution with warps, making the approach practical.
Context is maintained by HW. Some details here: https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.