Kaushik Kulkarni kaushikcfd

632284386 function calls (625416884 primitive calls) in 415.576 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
4208407 51.837 0.000 51.837 0.000 {built-in method <None>.isl_set_intersect}
1 22.305 22.305 307.282 307.282 check.py:511(_check_variable_access_ordered_inner)
15994440 20.396 0.000 65.230 0.000 _isl.py:76(_setup)
4208407 20.060 0.000 164.580 0.000 _isl.py:69123(intersect)
8384512 18.351 0.000 44.778 0.000 symbolic.py:1851(_get_access_range_for_var)
211024990 function calls (207570482 primitive calls) in 131.849 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1055463 12.394 0.000 12.394 0.000 {built-in method <None>.isl_set_intersect}
4852104 6.103 0.000 19.733 0.000 _isl.py:76(_setup)
1 5.677 5.677 79.083 79.083 check.py:511(_check_variable_access_ordered_inner)
4852108 5.145 0.000 11.936 0.000 _isl.py:96(_set_ctx_data)
1055463 5.066 0.000 40.374 0.000 _isl.py:69123(intersect)
1799025626 function calls (1778754086 primitive calls) in 745.897 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
125134140 148.805 0.000 249.071 0.000 __init__.py:750(iname_tags_of_type)
7911 80.098 0.010 563.041 0.071 bounds.py:58(get_usable_inames_for_conditional)
125134144 76.108 0.000 90.871 0.000 data.py:59(filter_iname_tags_by_type)
62838785 44.125 0.000 73.564 0.000 __init__.py:817(insn_inames)
470119215 40.188 0.000 40.188 0.000 {built-in method builtins.isinstance}
1507298925 function calls (1503844394 primitive calls) in 997.954 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 498.631 498.631 604.558 604.558 check.py:511(_check_variable_access_ordered_inner)
7911 72.852 0.009 338.405 0.043 bounds.py:58(get_usable_inames_for_conditional)
200487128/200321772 47.319 0.000 77.196 0.000 __init__.py:576(wrapper)
62830879 41.093 0.000 65.206 0.000 __init__.py:819(insn_inames)
470080723 39.970 0.000 39.970 0.000 {built-in method builtins.isinstance}
5541811597 function calls (5524629947 primitive calls) in 1768.885 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
15797 284.875 0.018 1402.438 0.089 bounds.py:58(get_usable_inames_for_conditional)
249966580 188.420 0.000 288.115 0.000 __init__.py:819(insn_inames)
749420543/749104786 173.127 0.000 255.054 0.000 __init__.py:576(wrapper)
1832461657 153.097 0.000 153.097 0.000 {built-in method builtins.isinstance}
494066972 150.587 0.000 437.911 0.000 bounds.py:86(<genexpr>)
kaushikcfd / MLIR-From-C.md
Last active February 14, 2025 02:55
Calling MLIR kernels from C

Step 1. Lower axpy.mlir to axpy.ll

mlir-opt -lower-affine -convert-loop-to-std -convert-std-to-llvm='emit-c-wrappers=1' axpy.mlir | mlir-translate --mlir-to-llvmir -o axpy.ll

Step 2. Get bitcode for the caller C.

clang -emit-llvm -c call_axpy.c -o call_axpy.bc
---------------------------------------------------------------------------
KERNEL: wrap_form0_cell_integral_otherwise
---------------------------------------------------------------------------
ARGUMENTS:
dat0: type: np:dtype('float64'), shape: (None), dim_tags: (N0:stride:1) aspace: global
dat1: type: np:dtype('float64'), shape: (None, 3), dim_tags: (N1:stride:3, N0:stride:1) aspace: global
dat2: type: np:dtype('float64'), shape: (None), dim_tags: (N0:stride:1) aspace: global
end: ValueArg, type: np:dtype('int32')
layers: type: np:dtype('int32'), shape: (1, 2), dim_tags: (N1:stride:2, N0:stride:1) aspace: global
map0: type: np:dtype('int32'), shape: (None, 125), dim_tags: (N1:stride:125, N0:stride:1) aspace: global
#include <vector>
#include <cstdio>
#include <sys/time.h>
# define TIME_DIFF(t2, t1) ((t2).tv_sec - (t1).tv_sec + ((t2).tv_usec - (t1).tv_usec)*1e-6)
void scpt_gemm(const double *A, const double *B, double*C, int m, int n, int k)
/*
* Performs C = A x B.

Strategy chosen on kernel name

  • Kernels named zero, expression_kernel, uniform_extrusion, etc. are always parallelized over the outer loop, i.e. one thread performs the work of one iteration of “n” (the outer loop variable name in PyOP2).
  • For the kernel form0_cell_integral_otherwise (the matvec kernel), the parallelization strategy is selected through the configuration parameter ~gpu_strategy~, which can be one of:
    • scpt: Single Cell Per Thread, i.e. outer-loop parallelization.
    • user_specified_tile: A tiling parallelization strategy that expects a set of parameters from the user.
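
The name-based selection above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual PyOP2 code: the function `choose_strategy`, the constant names, and the error handling for unlisted kernels are all hypothetical, and the set of outer-loop kernels is incomplete because the note above ends its list with "etc.".

```python
# Hypothetical sketch of strategy selection by kernel name.
# None of these names come from PyOP2; they only mirror the note above.

# Kernels that are always parallelized over the outer loop.
# Incomplete on purpose: the note lists these three and then says "etc.".
OUTER_LOOP_KERNELS = {"zero", "expression_kernel", "uniform_extrusion"}

# The matvec kernel whose strategy is user-configurable.
MATVEC_KERNEL = "form0_cell_integral_otherwise"

# Values of the gpu_strategy configuration parameter named in the note.
KNOWN_STRATEGIES = {"scpt", "user_specified_tile"}


def choose_strategy(kernel_name, gpu_strategy="scpt"):
    """Return the parallelization strategy for a kernel name."""
    if kernel_name == MATVEC_KERNEL:
        # Only the matvec kernel consults the configuration parameter.
        if gpu_strategy not in KNOWN_STRATEGIES:
            raise ValueError(f"unknown gpu_strategy: {gpu_strategy!r}")
        return gpu_strategy
    if kernel_name in OUTER_LOOP_KERNELS:
        # One thread per iteration of the outer loop, regardless of config.
        return "scpt"
    # Assumption: kernels not covered by the note are rejected here.
    raise ValueError(f"no strategy known for kernel: {kernel_name!r}")
```

For example, `choose_strategy("zero", gpu_strategy="user_specified_tile")` still yields `"scpt"`, because the configuration parameter only affects the matvec kernel.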
#include <math.h>
extern "C" __global__ void __launch_bounds__(54) wrap_form0_cell_integral_otherwise(int const start, int const end, double *__restrict__ dat2, double const *__restrict__ dat1, double const *__restrict__ dat0, int const *__restrict__ map0, int const *__restrict__ map1, double const *__restrict__ form_t13, double const *__restrict__ form_t14, double const *__restrict__ form_t15)
{
__shared__ double basis_cnst_mtrix_prftch[64l];
__shared__ double basis_cnst_mtrix_prftch_0[64l];
double form_t0;
double form_t1;
double form_t10;