Kaushik Kulkarni kaushikcfd

## m2l3DOrder8.398.cprofile
         632284386 function calls (625416884 primitive calls) in 415.576 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  4208407   51.837    0.000   51.837    0.000 {built-in method <None>.isl_set_intersect}
        1   22.305   22.305  307.282  307.282 check.py:511(_check_variable_access_ordered_inner)
 15994440   20.396    0.000   65.230    0.000 _isl.py:76(_setup)
  4208407   20.060    0.000  164.580    0.000 _isl.py:69123(intersect)
  8384512   18.351    0.000   44.778    0.000 symbolic.py:1851(_get_access_range_for_var)
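
The listings in this file are standard `cProfile`/`pstats` reports sorted by internal time. As a minimal sketch of how such a report can be produced (the `busy_work` function is a hypothetical stand-in, not the code that was actually profiled here):

```python
import cProfile
import io
import pstats


def busy_work(n):
    """Hypothetical stand-in workload; replace with the call being profiled."""
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
busy_work(100_000)
profiler.disable()

# "tottime" is the sort key that pstats labels "internal time" in its header.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("tottime").print_stats(5)  # keep only the first 5 rows
print(report.getvalue())
```

Sorting by `"cumtime"` instead would rank functions by time including callees, which is the column where `_check_variable_access_ordered_inner` dominates above.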

## m2l3DOrder6.398.cProfile
         211024990 function calls (207570482 primitive calls) in 131.849 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1055463   12.394    0.000   12.394    0.000 {built-in method <None>.isl_set_intersect}
  4852104    6.103    0.000   19.733    0.000 _isl.py:76(_setup)
        1    5.677    5.677   79.083   79.083 check.py:511(_check_variable_access_ordered_inner)
  4852108    5.145    0.000   11.936    0.000 _isl.py:96(_set_ctx_data)
  1055463    5.066    0.000   40.374    0.000 _isl.py:69123(intersect)

## m2l3DOrder6.master.cProfile
         1799025626 function calls (1778754086 primitive calls) in 745.897 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
125134140  148.805    0.000  249.071    0.000 __init__.py:750(iname_tags_of_type)
     7911   80.098    0.010  563.041    0.071 bounds.py:58(get_usable_inames_for_conditional)
125134144   76.108    0.000   90.871    0.000 data.py:59(filter_iname_tags_by_type)
 62838785   44.125    0.000   73.564    0.000 __init__.py:817(insn_inames)
470119215   40.188    0.000   40.188    0.000 {built-in method builtins.isinstance}

## m2l3DOrder6.cProfile
         1507298925 function calls (1503844394 primitive calls) in 997.954 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1  498.631  498.631  604.558  604.558 check.py:511(_check_variable_access_ordered_inner)
     7911   72.852    0.009  338.405    0.043 bounds.py:58(get_usable_inames_for_conditional)
200487128/200321772   47.319    0.000   77.196    0.000 __init__.py:576(wrapper)
 62830879   41.093    0.000   65.206    0.000 __init__.py:819(insn_inames)
470080723   39.970    0.000   39.970    0.000 {built-in method builtins.isinstance}

## big-kernel-cProfile.log
         5541811597 function calls (5524629947 primitive calls) in 1768.885 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    15797  284.875    0.018 1402.438    0.089 bounds.py:58(get_usable_inames_for_conditional)
249966580  188.420    0.000  288.115    0.000 __init__.py:819(insn_inames)
749420543/749104786  173.127    0.000  255.054    0.000 __init__.py:576(wrapper)
1832461657  153.097    0.000  153.097    0.000 {built-in method builtins.isinstance}
494066972  150.587    0.000  437.911    0.000 bounds.py:86(<genexpr>)

## MLIR-From-C.md

      
Calling MLIR kernels from C (last active February 14, 2025 02:55)

Step 1. Lower axpy.mlir to axpy.ll
mlir-opt -lower-affine -convert-loop-to-std -convert-std-to-llvm='emit-c-wrappers=1' axpy.mlir | mlir-translate --mlir-to-llvmir -o axpy.ll

Step 2. Get LLVM IR for the C caller. With -S, clang writes textual IR rather than binary bitcode, despite the .bc extension.
clang -emit-llvm call_axpy.c -S -o call_axpy.bc
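
With `emit-c-wrappers=1`, the lowering emits a `_mlir_ciface_<name>` wrapper for each function, and that wrapper takes pointers to memref descriptor structs instead of bare data pointers. The sketch below only illustrates the descriptor layout a caller would need to declare for a rank-1 `memref<?xf64>`. It uses Python `ctypes` rather than C so that it can run without the compiled kernel; the names `MemRef1D` and `as_memref` are hypothetical, and the actual signature of the `axpy` kernel is not shown in this file.

```python
import ctypes


class MemRef1D(ctypes.Structure):
    """Rank-1 f64 memref descriptor (assumed layout for a memref<?xf64> argument)."""

    _fields_ = [
        ("allocated", ctypes.POINTER(ctypes.c_double)),  # pointer returned by the allocator
        ("aligned", ctypes.POINTER(ctypes.c_double)),    # aligned pointer used for indexing
        ("offset", ctypes.c_int64),                      # element offset into `aligned`
        ("sizes", ctypes.c_int64 * 1),                   # extent of each dimension
        ("strides", ctypes.c_int64 * 1),                 # stride of each dimension, in elements
    ]


def as_memref(buf):
    """Wrap a ctypes double array in a contiguous rank-1 descriptor."""
    ptr = ctypes.cast(buf, ctypes.POINTER(ctypes.c_double))
    return MemRef1D(ptr, ptr, 0, (ctypes.c_int64 * 1)(len(buf)), (ctypes.c_int64 * 1)(1))


x = (ctypes.c_double * 4)(1.0, 2.0, 3.0, 4.0)
x_ref = as_memref(x)
print(x_ref.sizes[0], x_ref.strides[0], x_ref.aligned[2])
```

A C caller would declare an equivalent struct and pass its address to the wrapper; the descriptor does not own the buffer, so `x` must outlive any call that uses `x_ref`.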


## laplace_CG_hex_q4
---------------------------------------------------------------------------
KERNEL: wrap_form0_cell_integral_otherwise
---------------------------------------------------------------------------
ARGUMENTS:
dat0: type: np:dtype('float64'), shape: (None), dim_tags: (N0:stride:1) aspace: global
dat1: type: np:dtype('float64'), shape: (None, 3), dim_tags: (N1:stride:3, N0:stride:1) aspace: global
dat2: type: np:dtype('float64'), shape: (None), dim_tags: (N0:stride:1) aspace: global
end: ValueArg, type: np:dtype('int32')
layers: type: np:dtype('int32'), shape: (1, 2), dim_tags: (N1:stride:2, N0:stride:1) aspace: global
map0: type: np:dtype('int32'), shape: (None, 125), dim_tags: (N1:stride:125, N0:stride:1) aspace: global

## cross_element_vectorization_vs_libxsmm.cpp
#include <vector>
#include <cstdio>
#include <sys/time.h>

#define TIME_DIFF(t2, t1) ((t2).tv_sec - (t1).tv_sec + ((t2).tv_usec - (t1).tv_usec)*1e-6)


void scpt_gemm(const double *A, const double *B, double *C, int m, int n, int k)
/*
 * Performs C = A x B.

## gpu-progress-overview.org

      
Last active: December 30, 2019 05:45

Strategy chosen on kernel name

- Kernels named zero, expression_kernel, uniform_extrusion, etc. are always chosen for
  outer-loop parallelization, i.e. the work of one iteration of “n” (the outer loop
  variable name in PyOP2) is performed by one thread.
- For the kernel form0_cell_integral_otherwise (the matvec kernel), the parallelization
  strategy is selected through the configuration parameter ~gpu_strategy~, which
  can be one of:
  - scpt: Single Cell Per Thread, i.e. outer-loop parallelization.
  - user_specified_tile: a tiling parallelization strategy that expects a set of
    parameters from the user.
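
The selection rule described above can be sketched as a small dispatch function. This is only an illustration under stated assumptions: the function and constant names are hypothetical, the set of always-outer-loop kernels is limited to the three names listed (the notes say "etc."), and raising an error for unknown kernels is an assumed behavior, not something the notes specify.

```python
# Hypothetical names throughout; only the kernel names and the two
# strategy strings are taken from the notes above.
OUTER_LOOP_KERNELS = {"zero", "expression_kernel", "uniform_extrusion"}
MATVEC_KERNEL = "form0_cell_integral_otherwise"
GPU_STRATEGIES = {"scpt", "user_specified_tile"}


def choose_strategy(kernel_name, gpu_strategy="scpt"):
    """Return the parallelization strategy for a kernel, keyed on its name."""
    if kernel_name in OUTER_LOOP_KERNELS:
        # These kernels ignore the configuration: one thread per outer iteration.
        return "scpt"
    if kernel_name == MATVEC_KERNEL:
        if gpu_strategy not in GPU_STRATEGIES:
            raise ValueError(f"unknown gpu_strategy: {gpu_strategy!r}")
        return gpu_strategy
    # Assumption: kernels outside both groups are rejected.
    raise ValueError(f"no strategy known for kernel: {kernel_name!r}")


print(choose_strategy("zero", "user_specified_tile"))
print(choose_strategy("form0_cell_integral_otherwise", "user_specified_tile"))
```

Keying on the kernel name keeps the simple kernels on the outer-loop path regardless of configuration, so `gpu_strategy` only affects the matvec kernel.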
    
  
## wrong_bounds.cu
#include <math.h>


extern "C" __global__ void __launch_bounds__(54) wrap_form0_cell_integral_otherwise(int const start, int const end, double *__restrict__ dat2, double const *__restrict__ dat1, double const *__restrict__ dat0, int const *__restrict__ map0, int const *__restrict__ map1, double const *__restrict__ form_t13, double const *__restrict__ form_t14, double const *__restrict__ form_t15)
{
  __shared__ double basis_cnst_mtrix_prftch[64l];
  __shared__ double basis_cnst_mtrix_prftch_0[64l];
  double form_t0;
  double form_t1;
  double form_t10;