ouankou / rex_declare_target.md
Last active May 8, 2023 17:55
REX declare target

declare target creates two copies of each declared symbol: one in host memory and one in device memory. Given the following code:

#include <stdio.h>

#pragma omp declare target
int x = 100; // one copy of x lives on the host, a second copy on the device
#pragma omp end declare target
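
A minimal sketch of what this implies (a hypothetical usage, assuming an offloading toolchain that supports printf inside target regions): the host copy and the device copy are distinct, and a write to one is not visible in the other until a target update synchronizes them.

int main(void) {
    x = 42;                          // updates the host copy only
#pragma omp target                   // device still sees the initial value 100
    printf("device x = %d\n", x);
#pragma omp target update to(x)      // copy the host value to the device copy
#pragma omp target                   // device now sees 42
    printf("device x = %d\n", x);
    return 0;
}
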
ouankou / rex_reduction.md
Last active May 8, 2023 18:03
REX reduction

ROSE implements reduction on GPUs using a hybrid approach. The GPU reduces within blocks, and then the results from all blocks are passed back to the host, where they are reduced to the final result on the CPU. The host needs to prepare the reduction buffer for the GPU kernel. While this approach works, it is not fully aligned with the OpenMP specification. Reduction in a target region implies that all computations should take place on the GPU, regardless of speed. Furthermore, copying the partial results between the host and device may introduce additional overhead.
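
A minimal CUDA sketch of this hybrid scheme (illustrative only; the kernel and function names are assumptions, not ROSE's generated code): each block writes one partial sum into a host-prepared buffer, and the host copies the partials back and finishes on the CPU.

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void block_sum(const int *in, int n, int *partial) {
    __shared__ int cache[256];              // assumes blockDim.x <= 256 and a power of two
    int tid = threadIdx.x;
    int local = 0;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        local += in[i];                     // grid-stride local accumulation
    cache[tid] = local;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction in shared memory
        if (tid < s) cache[tid] += cache[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = cache[0];   // one slot per block
}

int hybrid_sum(const int *d_in, int n, int blocks, int threads) {
    int *d_partial, *h_partial, total = 0;
    cudaMalloc(&d_partial, blocks * sizeof(int));   // reduction buffer prepared by the host
    block_sum<<<blocks, threads>>>(d_in, n, d_partial);
    h_partial = (int *)malloc(blocks * sizeof(int));
    cudaMemcpy(h_partial, d_partial, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    for (int b = 0; b < blocks; b++) total += h_partial[b];  // final pass on the CPU
    free(h_partial);
    cudaFree(d_partial);
    return total;
}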

REX, on the other hand, implements reduction purely on the GPU. It begins with a driver kernel (1 block and 1 thread), which launches a child kernel using a user-specified configuration (e.g., 128 blocks and 16 threads per block). The driver kernel creates a reduction buffer on the heap for the child kernel, with a buffer size equal to the number of blocks. Each block of the child kernel computes its local reduced result and stores it in the global reduction buffer.
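
A matching sketch of the pure-GPU scheme, under the same assumptions and reusing block_sum from above as the child kernel (the serial final pass in the driver is an illustrative choice, not necessarily REX's generated code). Device-side launches require compiling with relocatable device code (-rdc=true) on a GPU that supports dynamic parallelism.

__global__ void driver_sum(const int *in, int n, int blocks, int threads, int *result) {
    // reduction buffer on the device heap, one slot per child block;
    // the heap may need enlarging via cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...)
    int *partial = (int *)malloc(blocks * sizeof(int));
    block_sum<<<blocks, threads>>>(in, n, partial);  // user-specified configuration
    cudaDeviceSynchronize();  // wait for the child kernel (classic dynamic-parallelism
                              // pattern; newer CUDA versions replace device-side
                              // synchronization with tail launches)
    int total = 0;
    for (int b = 0; b < blocks; b++) total += partial[b];
    *result = total;          // the final result never leaves the GPU
    free(partial);
}

// launched from the host as: driver_sum<<<1, 1>>>(d_in, n, 128, 16, d_result);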

ouankou / rex_dp.md
Last active January 30, 2023 18:05
void foo() {
  int sum = 10000;
  int sum2 = 10;
//#pragma omp target map(tofrom : sum)
#pragma omp target teams map(tofrom : sum) num_teams(8)
  {
#pragma omp parallel for
    for (int i = 1; i <= 100; i++) {
      sum2 += i;
    }
  }
}

Source

#include <stdio.h>

int main(int argc, char *argv[]) {
#pragma omp parallel
  {
#pragma omp single
    {
      // ...
    }
  }
  return 0;
}

Kernel generation

  1. The LLVM OpenMP runtime API __tgt_target_teams includes the data transfer, so computation and data copy cannot be overlapped when using this API. However, multiple compute kernels can be overlapped under dynamic parallelism, because REX generates all of the CUDA kernel code itself (see the sketch after this list).
  2. Parallel region expansion can be performed at the UPIR level.
  3. To eliminate redundant barriers, we can use the nowait versions of the LLVM OpenMP runtime API.
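
A minimal CUDA sketch of the overlap in item 1 (illustrative, not REX-generated code; the kernels and launch configurations are assumptions): two independent child kernels launched into separate non-blocking device-side streams are allowed to execute concurrently.

__global__ void work_a(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1.0f;
}

__global__ void work_b(float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] *= 2.0f;
}

__global__ void driver(float *a, float *b, int n) {
    cudaStream_t s1, s2;
    // device-side streams must be created with the non-blocking flag
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);
    work_a<<<(n + 127) / 128, 128, 0, s1>>>(a, n);  // no data dependency between the
    work_b<<<(n + 127) / 128, 128, 0, s2>>>(b, n);  // two kernels, so they may overlap
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}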

OpenACC to OpenMP

  1. acc loop seq has no equivalent OpenMP construct; we need to examine its clauses and generate suitable sequential code (see the sketch below).
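
A hedged sketch of one plausible translation (the mapping below is an assumption for illustration, not REX's actual output): the enclosing OpenACC parallel loop becomes an OpenMP target loop, while the inner acc loop seq is emitted as a plain sequential loop.

// OpenACC input:
//   #pragma acc parallel loop
//   for (int i = 0; i < n; i++)
//       #pragma acc loop seq
//       for (int j = 0; j < m; j++)
//           a[i] += b[i * m + j];
//
// One possible OpenMP output:
void sum_rows(float *a, const float *b, int n, int m) {
#pragma omp target teams distribute parallel for map(tofrom: a[0:n]) map(to: b[0:n*m])
    for (int i = 0; i < n; i++) {
        // acc loop seq carries no worksharing, so it becomes an ordinary sequential loop
        for (int j = 0; j < m; j++)
            a[i] += b[i * m + j];
    }
}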

Dialect conversion

void axpy (float* x, float* y, float a, int n) {
    int i;
#pragma omp parallel for num_threads(6)
    for (i = 0; i < n; i++) {
        y[i] = y[i] + a * x[i];
    }
}
ouankou / MLIR-From-C.md
Created June 8, 2021 22:16 — forked from kaushikcfd/MLIR-From-C.md
Calling MLIR kernels from C

Step 1. Lower axpy.mlir to axpy.ll

mlir-opt -lower-affine -convert-loop-to-std -convert-std-to-llvm='emit-c-wrappers=1' axpy.mlir | mlir-translate --mlir-to-llvmir -o axpy.ll

Step 2. Get bitcode for the caller C.

clang -emit-llvm -c call_axpy.c -o call_axpy.bc
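
For the caller itself, a hedged sketch (the axpy signature and memref rank are assumptions about axpy.mlir): with emit-c-wrappers=1, MLIR emits a C-compatible wrapper named _mlir_ciface_axpy that takes pointers to memref descriptors, whose standard 1-D layout is allocated pointer, aligned pointer, offset, size, and stride.

#include <stdint.h>

// standard MLIR descriptor layout for a 1-D memref of f32
typedef struct {
    float *allocated;   // base of the allocation
    float *aligned;     // aligned data pointer that is actually indexed
    int64_t offset;
    int64_t sizes[1];
    int64_t strides[1];
} MemRef1DFloat;

// wrapper emitted by emit-c-wrappers=1 (assumed signature)
void _mlir_ciface_axpy(MemRef1DFloat *x, MemRef1DFloat *y, float a);

void call_axpy(float *x, float *y, float a, int64_t n) {
    MemRef1DFloat mx = {x, x, 0, {n}, {1}};
    MemRef1DFloat my = {y, y, 0, {n}, {1}};
    _mlir_ciface_axpy(&mx, &my, a);
}
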
// simple linear regression: least-squares fit of y = a + b * x
void simple_linear_regression(double* x, double* y, int amount, double* result) {
    double sumx = 0.0, sumy = 0.0, sumxy = 0.0, sumx2 = 0.0;
    for (int i = 0; i < amount; i++) {
        sumx += x[i];
        sumy += y[i];
        sumxy += x[i] * y[i];
        sumx2 += x[i] * x[i];
    }
    double b = (amount * sumxy - sumx * sumy) / (amount * sumx2 - sumx * sumx); // slope
    double a = (sumy - b * sumx) / amount;                                      // intercept
    result[0] = a;
    result[1] = b;
}

#!/bin/bash
if [ -z "$1" ]; then
    LLVM=$HOME/llvm
else
    LLVM=$1
fi
if [ -z "$2" ]; then
    LLVM_VERSION=10
else
    LLVM_VERSION=$2
fi