ouankou / rex_declare_target.md
Last active May 8, 2023 17:55
REX declare target

declare target creates two copies of each declared symbol: one in host memory and one in device memory. Given the following code:

#include <stdio.h>

#pragma omp declare target
int x = 100; // one copy of x lives on the host, a second copy on the device
#pragma omp end declare target
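
A minimal sketch of what this implies (a hypothetical usage, assuming an offloading toolchain that supports printf inside target regions): the host copy and the device copy are distinct, and a write to one is not visible in the other until a target update synchronizes them.

int main(void) {
    x = 42;                          // updates the host copy only
#pragma omp target                   // device still sees the initial value 100
    printf("device x = %d\n", x);
#pragma omp target update to(x)      // copy the host value to the device copy
#pragma omp target                   // device now sees 42
    printf("device x = %d\n", x);
    return 0;
}
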
ouankou / rex_reduction.md
Last active May 8, 2023 18:03
REX reduction

ROSE implements reduction on GPUs using a hybrid approach. The GPU reduces within blocks, and then the results from all blocks are passed back to the host, where they are reduced to the final result on the CPU. The host needs to prepare the reduction buffer for the GPU kernel. While this approach works, it is not fully aligned with the OpenMP specification. Reduction in a target region implies that all computations should take place on the GPU, regardless of speed. Furthermore, copying the partial results between the host and device may introduce additional overhead.
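
A minimal CUDA sketch of this hybrid scheme (illustrative only; the kernel and function names are assumptions, not ROSE's generated code): each block writes one partial sum into a host-prepared buffer, and the host copies the partials back and finishes on the CPU.

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void block_sum(const int *in, int n, int *partial) {
    __shared__ int cache[256];              // assumes blockDim.x <= 256 and a power of two
    int tid = threadIdx.x;
    int local = 0;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        local += in[i];                     // grid-stride local accumulation
    cache[tid] = local;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction in shared memory
        if (tid < s) cache[tid] += cache[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = cache[0];   // one slot per block
}

int hybrid_sum(const int *d_in, int n, int blocks, int threads) {
    int *d_partial, *h_partial, total = 0;
    cudaMalloc(&d_partial, blocks * sizeof(int));   // reduction buffer prepared by the host
    block_sum<<<blocks, threads>>>(d_in, n, d_partial);
    h_partial = (int *)malloc(blocks * sizeof(int));
    cudaMemcpy(h_partial, d_partial, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    for (int b = 0; b < blocks; b++) total += h_partial[b];  // final pass on the CPU
    free(h_partial);
    cudaFree(d_partial);
    return total;
}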

REX, on the other hand, implements reduction purely on the GPU. It begins with a driver kernel (1 block and 1 thread), which launches a child kernel using a user-specified configuration (e.g., 128 blocks and 16 threads per block). The driver kernel creates a reduction buffer on the heap for the child kernel, with a buffer size equal to the number of blocks. Each block of the child kernel computes its local reduced result and stores it in the global reduction buffer.
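
A matching sketch of the pure-GPU scheme, under the same assumptions and reusing block_sum from above as the child kernel (the serial final pass in the driver is an illustrative choice, not necessarily REX's generated code). Device-side launches require compiling with relocatable device code (-rdc=true) on a GPU that supports dynamic parallelism.

__global__ void driver_sum(const int *in, int n, int blocks, int threads, int *result) {
    // reduction buffer on the device heap, one slot per child block;
    // the heap may need enlarging via cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...)
    int *partial = (int *)malloc(blocks * sizeof(int));
    block_sum<<<blocks, threads>>>(in, n, partial);  // user-specified configuration
    cudaDeviceSynchronize();  // wait for the child kernel (classic dynamic-parallelism
                              // pattern; newer CUDA versions replace device-side
                              // synchronization with tail launches)
    int total = 0;
    for (int b = 0; b < blocks; b++) total += partial[b];
    *result = total;          // the final result never leaves the GPU
    free(partial);
}

// launched from the host as: driver_sum<<<1, 1>>>(d_in, n, 128, 16, d_result);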

ouankou / rex_dp.md
Last active January 30, 2023 18:05
void foo() {
  int sum = 10000;
  int sum2 = 10;
//#pragma omp target map(tofrom : sum)
#pragma omp target teams map(tofrom : sum) num_teams(8)
  {
#pragma omp parallel for
    for (int i = 1; i <= 100; i++) {
      sum2 += i;
    }
  }
}

Source

#include <stdio.h>

int main(int argc, char *argv[]) {
#pragma omp parallel
  {
#pragma omp single
    {
      // ...
    }
  }
  return 0;
}

Kernel generation

  1. The LLVM OpenMP runtime API __tgt_target_teams includes the data transfer, so computation and data copy cannot be overlapped when using this API. However, multiple compute kernels can be overlapped under dynamic parallelism, because REX generates all of the CUDA kernel code itself (see the sketch after this list).
  2. Parallel region expansion can be performed at the UPIR level.
  3. To eliminate redundant barriers, we can use the nowait versions of the LLVM OpenMP runtime API.
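
A minimal CUDA sketch of the overlap in item 1 (illustrative, not REX-generated code; the kernels and launch configurations are assumptions): two independent child kernels launched into separate non-blocking device-side streams are allowed to execute concurrently.

__global__ void work_a(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1.0f;
}

__global__ void work_b(float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] *= 2.0f;
}

__global__ void driver(float *a, float *b, int n) {
    cudaStream_t s1, s2;
    // device-side streams must be created with the non-blocking flag
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);
    work_a<<<(n + 127) / 128, 128, 0, s1>>>(a, n);  // no data dependency between the
    work_b<<<(n + 127) / 128, 128, 0, s2>>>(b, n);  // two kernels, so they may overlap
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}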

OpenACC to OpenMP

  1. acc loop seq has no equivalent OpenMP construct; we need to examine its clauses and generate suitable sequential code (see the sketch below).
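
A hedged sketch of one plausible translation (the mapping below is an assumption for illustration, not REX's actual output): the enclosing OpenACC parallel loop becomes an OpenMP target loop, while the inner acc loop seq is emitted as a plain sequential loop.

// OpenACC input:
//   #pragma acc parallel loop
//   for (int i = 0; i < n; i++)
//       #pragma acc loop seq
//       for (int j = 0; j < m; j++)
//           a[i] += b[i * m + j];
//
// One possible OpenMP output:
void sum_rows(float *a, const float *b, int n, int m) {
#pragma omp target teams distribute parallel for map(tofrom: a[0:n]) map(to: b[0:n*m])
    for (int i = 0; i < n; i++) {
        // acc loop seq carries no worksharing, so it becomes an ordinary sequential loop
        for (int j = 0; j < m; j++)
            a[i] += b[i * m + j];
    }
}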

Dialect conversion

void axpy (float* x, float* y, float a, int n) {
    int i;
#pragma omp parallel for num_threads(6)
    for (i = 0; i < n; i++) {
        y[i] = y[i] + a * x[i];
    }
}
ouankou / MLIR-From-C.md
Created June 8, 2021 22:16 — forked from kaushikcfd/MLIR-From-C.md
Calling MLIR kernels from C

Step 1. Lower axpy.mlir to axpy.ll

mlir-opt -lower-affine -convert-loop-to-std -convert-std-to-llvm='emit-c-wrappers=1' axpy.mlir | mlir-translate --mlir-to-llvmir -o axpy.ll

Step 2. Get bitcode for the caller C.

clang -emit-llvm -c call_axpy.c -o call_axpy.bc
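
For the caller itself, a hedged sketch (the axpy signature and memref rank are assumptions about axpy.mlir): with emit-c-wrappers=1, MLIR emits a C-compatible wrapper named _mlir_ciface_axpy that takes pointers to memref descriptors, whose standard 1-D layout is allocated pointer, aligned pointer, offset, size, and stride.

#include <stdint.h>

// standard MLIR descriptor layout for a 1-D memref of f32
typedef struct {
    float *allocated;   // base of the allocation
    float *aligned;     // aligned data pointer that is actually indexed
    int64_t offset;
    int64_t sizes[1];
    int64_t strides[1];
} MemRef1DFloat;

// wrapper emitted by emit-c-wrappers=1 (assumed signature)
void _mlir_ciface_axpy(MemRef1DFloat *x, MemRef1DFloat *y, float a);

void call_axpy(float *x, float *y, float a, int64_t n) {
    MemRef1DFloat mx = {x, x, 0, {n}, {1}};
    MemRef1DFloat my = {y, y, 0, {n}, {1}};
    _mlir_ciface_axpy(&mx, &my, a);
}
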
// simple linear regression: least-squares fit of y = a + b * x
void simple_linear_regression(double* x, double* y, int amount, double* result) {
    double sumx = 0.0, sumy = 0.0, sumxy = 0.0, sumx2 = 0.0;
    for (int i = 0; i < amount; i++) {
        sumx += x[i];
        sumy += y[i];
        sumxy += x[i] * y[i];
        sumx2 += x[i] * x[i];
    }
    double b = (amount * sumxy - sumx * sumy) / (amount * sumx2 - sumx * sumx); // slope
    double a = (sumy - b * sumx) / amount;                                      // intercept
    result[0] = a;
    result[1] = b;
}

#!/bin/bash
if [ -z "$1" ]; then
    LLVM=$HOME/llvm
else
    LLVM=$1
fi
if [ -z "$2" ]; then
    LLVM_VERSION=10
else
    LLVM_VERSION=$2
fi