Skip to content

Instantly share code, notes, and snippets.

@ChrisKitching
Last active May 13, 2026 18:55
Show Gist options
  • Select an option

  • Save ChrisKitching/53d9b3503c11881c5ea395ec41aff419 to your computer and use it in GitHub Desktop.

Select an option

Save ChrisKitching/53d9b3503c11881c5ea395ec41aff419 to your computer and use it in GitHub Desktop.
Managed memory microbenchmark (HIP vs. SCALE)
#ifdef __HIP__
#include <hip/hip_runtime.h>
#define cudaError_t hipError_t
#define cudaSuccess hipSuccess
#define cudaMalloc hipMalloc
#define cudaMallocManaged hipMallocManaged
#define cudaDeviceSynchronize hipDeviceSynchronize
#endif
#include <linux/mman.h>
#include <sys/mman.h>
#include <chrono>
#include <cstdio>
using T = float;
//constexpr unsigned int blockSize = 1 << 10;
//constexpr unsigned int numBlocks = 1 << 10;
//constexpr size_t eltsPerThread = 1 << 6;
constexpr unsigned int blockSize = 1 << 10;
constexpr unsigned int numBlocks = 1 << 8;
constexpr size_t eltsPerThread = 1 << 12;
constexpr size_t bytes = blockSize * numBlocks * eltsPerThread * sizeof(T);
__global__ void warmup(){}
__global__ void k(T *dst, const T *src)
{
size_t tid = blockIdx.x * blockSize + threadIdx.x;
T acc = 0;
for (unsigned int i = 0; i < eltsPerThread; i++) {
acc += src[tid * eltsPerThread + i];
}
dst[tid] = acc;
}
void check(cudaError_t e)
{
if (e != cudaSuccess) {
abort();
}
}
int main()
{
printf("Dimensions: %u, %u, %zu\n", blockSize, numBlocks, eltsPerThread);
T *dst = nullptr;
T *src = nullptr;
warmup<<<1, 1>>>();
check(cudaDeviceSynchronize());
check(cudaMalloc(&dst, blockSize * numBlocks * sizeof(T)));
check(cudaMallocManaged(&src, bytes));
{
printf("Starting source data generation to %p (0x%llx bytes).\n", (void *)src, (unsigned long long)bytes);
auto start = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < blockSize * numBlocks * eltsPerThread; i++) {
src[i] = i;
}
auto end = std::chrono::high_resolution_clock::now();
printf("Source number generation took %0.03f ms.\n", std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() * 1e-6);
}
for (unsigned int i = 0; i < 20; i++) {
auto start = std::chrono::high_resolution_clock::now();
k<<<numBlocks, blockSize>>>(dst, src);
check(cudaDeviceSynchronize());
auto end = std::chrono::high_resolution_clock::now();
int64_t ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
printf("Iteration %u: %0.03f ms (%0.03f GiB/s).\n", i, ns * 1e-6, (double)bytes / ns);
}
return 0;
}
@ChrisKitching

ChrisKitching commented May 13, 2026

Copy link
Copy Markdown
Author

How to run

  • Use a machine containing only XNACK-capable GPUs (since the AMD driver refuses to use the hardware feature if any non-xnack devices are present). This prettymuch just means MI-series cards.
  • Build with hipcc, run the HIP version with HSA_XNACK=1 set
  • Run the SCALE version with SCALE nvcc with SCALE_AMD_XNACK=1 set.
  • Optionally mess with the 3 tuning parameters at the top to see the effect of different access patterns.

Commentary

The benchmark allocates a big chunk of managed memory, and then runs gpu kernels against it over and over. This is simulates the iterative style of computation done by CFD workloads, but is by no means unique to that kind of workload. After each iteration, the time gets printed out.

Managed memory is a CUDA feature which allows you to allocate memory once, and leave it up to the driver/runtime to handle moving it between CPU and GPU. This makes programs simpler because you don't have to manually move the data around, but it relies on the platform to do that relocation for you efficiently.

This benchmark represents a unidirectional transfer: the memory starts on the host, and is only used on the GPU. Once all of the memory has been relocated to the GPU, the timings stop changing, since migrations have stopped happening. A workload that alternates host/device use of the memory would be even more strongly affected by the differences in page relocation speed between the platforms.

SCALE performs the page relocations much faster than HIP. This is visible in the dramatically improved timings of the early iterations. Even for a workload that only does a unidirectional transfer, the effect on oveall timing is significant. Copying 32GB of input data at 0.007GB/s is liable to take longer than your entire CFD simulation!

Sample results

MI355

HIP (rocm 7.1)

Dimensions: 512, 256, 2048
Starting source data generation to 0x7ed68f800000 (0x100000000 bytes).
Source number generation took 430.545 ms.
Iteration 0: 648113.565 ms (0.007 GiB/s).
Iteration 1: 1.953 ms (2199.275 GiB/s).
Iteration 2: 1.875 ms (2291.209 GiB/s).
Iteration 3: 1.942 ms (2211.380 GiB/s).
Iteration 4: 1.865 ms (2302.867 GiB/s).
Iteration 5: 1.907 ms (2252.481 GiB/s).

SCALE 1.7.0:

Dimensions: 512, 256, 2048
Starting source data generation to 0x80000000000 (0x100000000 bytes).
Source number generation took 462.847 ms.
Iteration 0: 6215.198 ms (0.691 GiB/s).
Iteration 1: 1.953 ms (2231.218 GiB/s).
Iteration 2: 1.875 ms (2301.723 GiB/s).
Iteration 3: 1.942 ms (2162.029 GiB/s).
Iteration 4: 1.865 ms (2211.207 GiB/s).
Iteration 5: 1.907 ms (2175.330 GiB/s).

MI210

HIP (rocm 7.1)

Dimensions: 1024, 128, 4096
Starting source data generation to 0x7e2fd6800000 (0x80000000 bytes).
Source number generation took 465.251 ms.
Iteration 0: 19632.823 ms (0.109 GiB/s).
Iteration 1: 6459.566 ms (0.332 GiB/s).
Iteration 2: 375.071 ms (5.726 GiB/s).
Iteration 3: 5.745 ms (373.805 GiB/s).
Iteration 4: 5.756 ms (373.088 GiB/s).
Iteration 5: 5.731 ms (374.738 GiB/s).

SCALE 1.7.0:

Dimensions: 1024, 128, 4096
Starting source data generation to 0x80000000000 (0x80000000 bytes).
Source number generation took 472.872 ms.
Iteration 0: 3246.548 ms (0.661 GiB/s).
Iteration 1: 93.030 ms (23.084 GiB/s).
Iteration 2: 5.856 ms (366.704 GiB/s).
Iteration 3: 5.717 ms (375.630 GiB/s).
Iteration 4: 5.700 ms (376.737 GiB/s).
Iteration 5: 5.722 ms (375.282 GiB/s).

@ChrisKitching

Copy link
Copy Markdown
Author

Read more about XNACK on AMD GPUs with SCALE.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment