@Mezzano
Last active August 29, 2016 16:03
caffe make runtest -j4
iCar-G3-446:/dev/disk/mounted/sda1/opencl/caffe # make runtest -j4
CXX/LD -o .build_release/test/test_all.testbin src/caffe/test/test_caffe_main.cpp
In file included from ../caffe_requirements/ViennaCL-1.7.1/viennacl/scheduler/preset.hpp:21:0,
from ../caffe_requirements/ViennaCL-1.7.1/viennacl/linalg/opencl/kernels/vector.hpp:25,
from ../caffe_requirements/ViennaCL-1.7.1/viennacl/linalg/opencl/vector_operations.hpp:35,
from ../caffe_requirements/ViennaCL-1.7.1/viennacl/linalg/vector_operations.hpp:39,
from ../caffe_requirements/ViennaCL-1.7.1/viennacl/vector.hpp:33,
from ./include/caffe/greentea/greentea.hpp:40,
from ./include/caffe/common.hpp:25,
from ./include/caffe/blob.hpp:8,
from ./include/caffe/caffe.hpp:7,
from src/caffe/test/test_caffe_main.cpp:3:
../caffe_requirements/ViennaCL-1.7.1/viennacl/device_specific/forwards.h:129:20: warning: ‘std::string viennacl::device_specific::generate_value_kernel_argument(const string&, const string&)’ defined but not used [-Wunused-function]
static std::string generate_value_kernel_argument(std::string const & scalartype, std::string const & name)
^
../caffe_requirements/ViennaCL-1.7.1/viennacl/device_specific/forwards.h:135:20: warning: ‘std::string viennacl::device_specific::generate_pointer_kernel_argument(const string&, const string&, const string&)’ defined but not used [-Wunused-function]
static std::string generate_pointer_kernel_argument(std::string const & address_space, std::string const & scalartype, std::string const & name)
^
.build_release/tools/caffe
caffe: command line brew
usage: caffe <command> <args>
commands:
train train or finetune a model
test score a model
device_query show GPU diagnostic information
time benchmark model execution time
autotune autotune a model
Flags from tools/caffe.cpp:
-gpu (Optional; run in GPU mode on given device IDs separated by ','.Use
'-gpu all' to run on all available GPUs. The effective training batch
size is multiplied by the number of devices.) type: string default: ""
-iterations (The number of iterations to run.) type: int32 default: 50
-level (Optional; network level.) type: int32 default: 0
-model (The model definition protocol buffer text file.) type: string
default: ""
-phase (Optional; network phase (TRAIN or TEST). Only used for 'time'.)
type: string default: ""
-sighup_effect (Optional; action to take when a SIGHUP signal is received:
snapshot, stop or none.) type: string default: "snapshot"
-sigint_effect (Optional; action to take when a SIGINT signal is received:
snapshot, stop or none.) type: string default: "stop"
-snapshot (Optional; the snapshot solver state to resume training.)
type: string default: ""
-solver (The solver definition protocol buffer text file.) type: string
default: ""
-stage (Optional; network stages (not to be confused with phase), separated
by ','.) type: string default: ""
-weights (Optional; the pretrained weights to initialize finetuning,
separated by ','. Cannot be set simultaneously with snapshot.)
type: string default: ""
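(For reference, a typical invocation assembled from the commands and flags listed above might look as follows; the prototxt paths and device ID are placeholders, not taken from this run.)
caffe train -solver path/to/solver.prototxt -gpu 0
caffe time -model path/to/train_val.prototxt -iterations 50 -gpu 0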
.build_release/test/test_all.testbin 0 --gtest_shuffle
Setting to use device 0
Note: Randomizing tests' orders with a seed of 49319 .
[==========] Running 2046 tests from 274 test cases.
[----------] Global test environment set-up.
[----------] 5 tests from BenchmarkTest/1, where TypeParam = caffe::CPUDevice<double>
[ RUN ] BenchmarkTest/1.TestTimerSeconds
[ OK ] BenchmarkTest/1.TestTimerSeconds (318 ms)
[ RUN ] BenchmarkTest/1.TestTimerConstructor
[ OK ] BenchmarkTest/1.TestTimerConstructor (0 ms)
[ RUN ] BenchmarkTest/1.TestTimerMilliSeconds
[ OK ] BenchmarkTest/1.TestTimerMilliSeconds (314 ms)
[ RUN ] BenchmarkTest/1.TestTimerStop
[ OK ] BenchmarkTest/1.TestTimerStop (0 ms)
[ RUN ] BenchmarkTest/1.TestTimerStart
[ OK ] BenchmarkTest/1.TestTimerStart (1 ms)
[----------] 5 tests from BenchmarkTest/1 (634 ms total)
[----------] 9 tests from InnerProductLayerTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN ] InnerProductLayerTest/0.TestSetUpTranposeTrue
[ OK ] InnerProductLayerTest/0.TestSetUpTranposeTrue (9 ms)
[ RUN ] InnerProductLayerTest/0.TestForward
[ OK ] InnerProductLayerTest/0.TestForward (16 ms)
[ RUN ] InnerProductLayerTest/0.TestSetUp
[ OK ] InnerProductLayerTest/0.TestSetUp (1 ms)
[ RUN ] InnerProductLayerTest/0.TestGradient
[ OK ] InnerProductLayerTest/0.TestGradient (1950 ms)
[ RUN ] InnerProductLayerTest/0.TestGradientTranspose
[ OK ] InnerProductLayerTest/0.TestGradientTranspose (2308 ms)
[ RUN ] InnerProductLayerTest/0.TestBackwardTranspose
[ OK ] InnerProductLayerTest/0.TestBackwardTranspose (2 ms)
[ RUN ] InnerProductLayerTest/0.TestForwardTranspose
[ OK ] InnerProductLayerTest/0.TestForwardTranspose (1 ms)
[ RUN ] InnerProductLayerTest/0.TestForwardNoBatch
[ OK ] InnerProductLayerTest/0.TestForwardNoBatch (0 ms)
[ RUN ] InnerProductLayerTest/0.TestSetUpTranposeFalse
[ OK ] InnerProductLayerTest/0.TestSetUpTranposeFalse (1 ms)
[----------] 9 tests from InnerProductLayerTest/0 (4291 ms total)
[----------] 5 tests from ImageDataLayerTest/1, where TypeParam = caffe::CPUDevice<double>
[ RUN ] ImageDataLayerTest/1.TestResize
[ OK ] ImageDataLayerTest/1.TestResize (764 ms)
[ RUN ] ImageDataLayerTest/1.TestSpace
[ OK ] ImageDataLayerTest/1.TestSpace (247 ms)
[ RUN ] ImageDataLayerTest/1.TestShuffle
[ OK ] ImageDataLayerTest/1.TestShuffle (1038 ms)
[ RUN ] ImageDataLayerTest/1.TestReshape
[ OK ] ImageDataLayerTest/1.TestReshape (242 ms)
[ RUN ] ImageDataLayerTest/1.TestRead
[ OK ] ImageDataLayerTest/1.TestRead (1012 ms)
[----------] 5 tests from ImageDataLayerTest/1 (3304 ms total)
[----------] 2 tests from SigmoidCrossEntropyLossLayerTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN ] SigmoidCrossEntropyLossLayerTest/3.TestGradient
[ OK ] SigmoidCrossEntropyLossLayerTest/3.TestGradient (1 ms)
[ RUN ] SigmoidCrossEntropyLossLayerTest/3.TestSigmoidCrossEntropyLoss
[ OK ] SigmoidCrossEntropyLossLayerTest/3.TestSigmoidCrossEntropyLoss (0 ms)
[----------] 2 tests from SigmoidCrossEntropyLossLayerTest/3 (2 ms total)
[----------] 2 tests from HingeLossLayerTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN ] HingeLossLayerTest/3.TestGradientL2
[ OK ] HingeLossLayerTest/3.TestGradientL2 (0 ms)
[ RUN ] HingeLossLayerTest/3.TestGradientL1
[ OK ] HingeLossLayerTest/3.TestGradientL1 (0 ms)
[----------] 2 tests from HingeLossLayerTest/3 (1 ms total)
[----------] 8 tests from RMSPropSolverTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithWeightDecay
[ OK ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithWeightDecay (303 ms)
[ RUN ] RMSPropSolverTest/0.TestLeastSquaresUpdateWithEverythingAccumShare
[ OK ] RMSPropSolverTest/0.TestLeastSquaresUpdateWithEverythingAccumShare (44 ms)
[ RUN ] RMSPropSolverTest/0.TestLeastSquaresUpdateWithEverythingAccum
[ OK ] RMSPropSolverTest/0.TestLeastSquaresUpdateWithEverythingAccum (31 ms)
[ RUN ] RMSPropSolverTest/0.TestSnapshotShare
[ OK ] RMSPropSolverTest/0.TestSnapshotShare (270 ms)
[ RUN ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithEverything
[ OK ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithEverything (721 ms)
[ RUN ] RMSPropSolverTest/0.TestSnapshot
[ OK ] RMSPropSolverTest/0.TestSnapshot (192 ms)
[ RUN ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithEverythingShare
[ OK ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithEverythingShare (1171 ms)
[ RUN ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithRmsDecay
[ OK ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithRmsDecay (721 ms)
[----------] 8 tests from RMSPropSolverTest/0 (3455 ms total)
[----------] 1 test from SolverTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN ] SolverTest/0.TestInitTrainTestNets
[ OK ] SolverTest/0.TestInitTrainTestNets (34 ms)
[----------] 1 test from SolverTest/0 (34 ms total)
[----------] 3 tests from BlobMathTest/2, where TypeParam = caffe::GPUDevice<float>
[ RUN ] BlobMathTest/2.TestSumOfSquares
[ OK ] BlobMathTest/2.TestSumOfSquares (74 ms)
[ RUN ] BlobMathTest/2.TestScaleData
[ OK ] BlobMathTest/2.TestScaleData (86 ms)
[ RUN ] BlobMathTest/2.TestAsum
[ OK ] BlobMathTest/2.TestAsum (38 ms)
[----------] 3 tests from BlobMathTest/2 (199 ms total)
[----------] 12 tests from SGDSolverTest/2, where TypeParam = caffe::GPUDevice<float>
[ RUN ] SGDSolverTest/2.TestLeastSquaresUpdate
OpenCL compiler error/warning:
(1491:0) : error : function: 'InitAccRegisters' hasn't the corresponding declaration
(1536:0) : error : function: 'MultiplyAccumulate' hasn't the corresponding declaration
(1664:0) : error : function: 'XgemmBody' hasn't the corresponding declaration
(1668:0) : error : function: 'StoreResults' hasn't the corresponding declaration
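(These messages read as if the device's OpenCL compiler requires an explicit prototype before each helper function is used later in the concatenated kernel source. A minimal sketch of such forward declarations, assuming they are placed after the realM/realN typedefs and with signatures copied from the definitions that appear further down in this dump, is shown below; XgemmBody is defined past the point where this log is cut off, so its prototype is left out.)
// Sketch of a possible workaround, not part of the original log:
// forward declarations matching the helper definitions shown later in the dump.
void InitAccRegisters(realM cpm[NWI][MWI/VWM]);
inline void MultiplyAccumulate(realM cpm[NWI][MWI/VWM], realM apm[MWI/VWM], realN bpm[NWI/VWN]);
inline void StoreResults(__global realM* cgm, realM cpm[NWI][MWI/VWM], const int kSizeM,
                         const real alpha, const real beta);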
OpenCL code program was:
#define KWG 16
#define KWI 2
#define MDIMA 8
#define MDIMC 8
#define STRM 0
#define MWG 32
#define NDIMC 8
#define SB 0
#define STRN 0
#define VWM 1
#define PAD_WPTX 1
#define VWN 1
#define NWG 64
#define PAD_WPTY 1
#define COPY_VW 1
#define PAD_DIMX 8
#define COPY_DIMY 8
#define NDIMB 8
#define COPY_DIMX 8
#define SA 0
#define COPY_WPT 1
#define TRA_SHUFFLE 0
#define PAD_DIMY 8
#define TRA_PAD 0
#define TRA_DIM 4
#define PADTRA_WPT 1
#define PADTRA_PAD 0
#define TRA_WPT 1
#define PADTRA_TILE 8
#define PRECISION 32
#define ROUTINE_GEMM
// =================================================================================================
// Parameters set by the tuner or by the database. Here they are given a basic default value in case
// this file is used outside of the CLBlast library.
#ifndef PRECISION
#define PRECISION 32 // Data-types: half, single or double precision, complex or regular
#endif
// =================================================================================================
// Enable support for half-precision
#if PRECISION == 16
#pragma OPENCL EXTENSION cl_khr_fp16: enable
#endif
// Enable support for double-precision
#if PRECISION == 64 || PRECISION == 6464
#if __OPENCL_VERSION__ <= CL_VERSION_1_1
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#endif
#endif
// Half-precision
#if PRECISION == 16
typedef half real;
typedef half2 real2;
typedef half4 real4;
typedef half8 real8;
typedef half16 real16;
#define ZERO 0
#define ONE 1
#define SMALLEST -1.0e14
// Single-precision
#elif PRECISION == 32
typedef float real;
typedef float2 real2;
typedef float4 real4;
typedef float8 real8;
typedef float16 real16;
#define ZERO 0.0f
#define ONE 1.0f
#define SMALLEST -1.0e37f
// Double-precision
#elif PRECISION == 64
typedef double real;
typedef double2 real2;
typedef double4 real4;
typedef double8 real8;
typedef double16 real16;
#define ZERO 0.0
#define ONE 1.0
#define SMALLEST -1.0e37
// Complex single-precision
#elif PRECISION == 3232
typedef struct cfloat {float x; float y;} real;
typedef struct cfloat2 {real x; real y;} real2;
typedef struct cfloat4 {real x; real y; real z; real w;} real4;
typedef struct cfloat8 {real s0; real s1; real s2; real s3;
real s4; real s5; real s6; real s7;} real8;
typedef struct cfloat16 {real s0; real s1; real s2; real s3;
real s4; real s5; real s6; real s7;
real s8; real s9; real sA; real sB;
real sC; real sD; real sE; real sF;} real16;
#define ZERO 0.0f
#define ONE 1.0f
#define SMALLEST -1.0e37f
// Complex double-precision
#elif PRECISION == 6464
typedef struct cdouble {double x; double y;} real;
typedef struct cdouble2 {real x; real y;} real2;
typedef struct cdouble4 {real x; real y; real z; real w;} real4;
typedef struct cdouble8 {real s0; real s1; real s2; real s3;
real s4; real s5; real s6; real s7;} real8;
typedef struct cdouble16 {real s0; real s1; real s2; real s3;
real s4; real s5; real s6; real s7;
real s8; real s9; real sA; real sB;
real sC; real sD; real sE; real sF;} real16;
#define ZERO 0.0
#define ONE 1.0
#define SMALLEST -1.0e37
#endif
// Single-element version of a complex number
#if PRECISION == 3232
typedef float singlereal;
#elif PRECISION == 6464
typedef double singlereal;
#else
typedef real singlereal;
#endif
// =================================================================================================
// Don't use the non-IEEE754 compliant OpenCL built-in mad() instruction per default. For specific
// devices, this is enabled (see src/routine.cc).
#ifndef USE_CL_MAD
#define USE_CL_MAD 0
#endif
// Sets a variable to zero
#if PRECISION == 3232 || PRECISION == 6464
#define SetToZero(a) a.x = ZERO; a.y = ZERO
#else
#define SetToZero(a) a = ZERO
#endif
// Sets a variable to zero (only the imaginary part)
#if PRECISION == 3232 || PRECISION == 6464
#define ImagToZero(a) a.y = ZERO
#else
#define ImagToZero(a)
#endif
// Sets a variable to one
#if PRECISION == 3232 || PRECISION == 6464
#define SetToOne(a) a.x = ONE; a.y = ZERO
#else
#define SetToOne(a) a = ONE
#endif
// The absolute value (component-wise)
#if PRECISION == 3232 || PRECISION == 6464
#define AbsoluteValue(value) value.x = fabs(value.x); value.y = fabs(value.y)
#else
#define AbsoluteValue(value) value = fabs(value)
#endif
// Adds two complex variables
#if PRECISION == 3232 || PRECISION == 6464
#define Add(c, a, b) c.x = a.x + b.x; c.y = a.y + b.y
#else
#define Add(c, a, b) c = a + b
#endif
// Multiply two complex variables (used in the defines below)
#if PRECISION == 3232 || PRECISION == 6464
#define MulReal(a, b) a.x*b.x - a.y*b.y
#define MulImag(a, b) a.x*b.y + a.y*b.x
#endif
// The scalar multiply function
#if PRECISION == 3232 || PRECISION == 6464
#define Multiply(c, a, b) c.x = MulReal(a,b); c.y = MulImag(a,b)
#else
#define Multiply(c, a, b) c = a * b
#endif
// The scalar multiply-add function
#if PRECISION == 3232 || PRECISION == 6464
#define MultiplyAdd(c, a, b) c.x += MulReal(a,b); c.y += MulImag(a,b)
#else
#if USE_CL_MAD == 1
#define MultiplyAdd(c, a, b) c = mad(a, b, c)
#else
#define MultiplyAdd(c, a, b) c += a * b
#endif
#endif
// The scalar AXPBY function
#if PRECISION == 3232 || PRECISION == 6464
#define AXPBY(e, a, b, c, d) e.x = MulReal(a,b) + MulReal(c,d); e.y = MulImag(a,b) + MulImag(c,d)
#else
#define AXPBY(e, a, b, c, d) e = a*b + c*d
#endif
// The complex conjugate operation for complex transforms
#if PRECISION == 3232 || PRECISION == 6464
#define COMPLEX_CONJUGATE(value) value.x = value.x; value.y = -value.y
#else
#define COMPLEX_CONJUGATE(value) value = value
#endif
// =================================================================================================
// Shuffled workgroup indices to avoid partition camping, see below. For specific devices, this is
// enabled (see src/routine.cc).
#ifndef USE_STAGGERED_INDICES
#define USE_STAGGERED_INDICES 0
#endif
// Staggered/shuffled group indices to avoid partition camping (AMD GPUs). Formulas are taken from:
// http://docs.nvidia.com/cuda/samples/6_Advanced/transpose/doc/MatrixTranspose.pdf
// More details: https://github.com/CNugteren/CLBlast/issues/53
#if USE_STAGGERED_INDICES == 1
inline size_t GetGroupIDFlat() {
return get_group_id(0) + get_num_groups(0) * get_group_id(1);
}
inline size_t GetGroupID1() {
return (GetGroupIDFlat()) % get_num_groups(1);
}
inline size_t GetGroupID0() {
return ((GetGroupIDFlat() / get_num_groups(1)) + GetGroupID1()) % get_num_groups(0);
}
#else
inline size_t GetGroupID1() { return get_group_id(1); }
inline size_t GetGroupID0() { return get_group_id(0); }
#endif
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
// Parameters set by the tuner or by the database. Here they are given a basic default value in case
// this kernel file is used outside of the CLBlast library.
// For the 'fast' copy kernel
#ifndef COPY_DIMX
#define COPY_DIMX 8 // Local workgroup size in the first dimension (x)
#endif
#ifndef COPY_DIMY
#define COPY_DIMY 8 // Local workgroup size in the second dimension (y)
#endif
#ifndef COPY_WPT
#define COPY_WPT 1 // Work per thread in the first dimension (x)
#endif
#ifndef COPY_VW
#define COPY_VW 1 // Vector width in the second dimension (y)
#endif
// For the padding/copy kernels and the conversion kernels
#ifndef PAD_DIMX
#define PAD_DIMX 8 // Local workgroup size in the first dimension (x)
#endif
#ifndef PAD_DIMY
#define PAD_DIMY 8 // Local workgroup size in the second dimension (y)
#endif
#ifndef PAD_WPTX
#define PAD_WPTX 1 // Work per thread in the first dimension (x)
#endif
#ifndef PAD_WPTY
#define PAD_WPTY 1 // Work per thread in the second dimension (y)
#endif
// For the 'fast' transpose kernel
#ifndef TRA_DIM
#define TRA_DIM 8 // Number of local threads in the two dimensions (x,y)
#endif
#ifndef TRA_WPT
#define TRA_WPT 1 // Work per thread in one dimension and vector-width in the other
#endif
#ifndef TRA_PAD
#define TRA_PAD 0 // Padding of the local memory to avoid bank-conflicts
#endif
#ifndef TRA_SHUFFLE
#define TRA_SHUFFLE 0 // Shuffling of the global indices to avoid global memory bank-conflicts
#endif
// For the padding/transpose kernels
#ifndef PADTRA_TILE
#define PADTRA_TILE 8 // Number of local threads in the two dimensions (x,y)
#endif
#ifndef PADTRA_WPT
#define PADTRA_WPT 1 // Amount of work per thread
#endif
#ifndef PADTRA_PAD
#define PADTRA_PAD 0 // Padding of the local memory to avoid bank-conflicts
#endif
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
// Data-widths
#if COPY_VW == 1
typedef real realC;
#elif COPY_VW == 2
typedef real2 realC;
#elif COPY_VW == 4
typedef real4 realC;
#elif COPY_VW == 8
typedef real8 realC;
#elif COPY_VW == 16
typedef real16 realC;
#endif
// =================================================================================================
// Fast copy kernel. Requires 'ld' and the number of threads in dimension 0 to be a multiple of
// COPY_VW. Also requires both matrices to be of the same dimensions and without offset.
__kernel __attribute__((reqd_work_group_size(COPY_DIMX, COPY_DIMY, 1)))
void CopyMatrixFast(const int ld,
__global const realC* restrict src,
__global realC* dest,
const __constant real* restrict arg_alpha) {
const real alpha = arg_alpha[0];
#pragma unroll
for (int w_one=0; w_one<COPY_WPT; ++w_one) {
const int id_one = get_global_id(0);
const int id_two = (get_group_id(1)*COPY_WPT + w_one) * COPY_DIMY + get_local_id(1);
const int id = id_two*(ld/COPY_VW) + id_one;
realC result;
#if COPY_VW == 1
Multiply(result, alpha, src[id]);
#elif COPY_VW == 2
Multiply(result.x, alpha, src[id].x);
Multiply(result.y, alpha, src[id].y);
#elif COPY_VW == 4
Multiply(result.x, alpha, src[id].x);
Multiply(result.y, alpha, src[id].y);
Multiply(result.z, alpha, src[id].z);
Multiply(result.w, alpha, src[id].w);
#elif COPY_VW == 8
Multiply(result.s0, alpha, src[id].s0);
Multiply(result.s1, alpha, src[id].s1);
Multiply(result.s2, alpha, src[id].s2);
Multiply(result.s3, alpha, src[id].s3);
Multiply(result.s4, alpha, src[id].s4);
Multiply(result.s5, alpha, src[id].s5);
Multiply(result.s6, alpha, src[id].s6);
Multiply(result.s7, alpha, src[id].s7);
#elif COPY_VW == 16
Multiply(result.s0, alpha, src[id].s0);
Multiply(result.s1, alpha, src[id].s1);
Multiply(result.s2, alpha, src[id].s2);
Multiply(result.s3, alpha, src[id].s3);
Multiply(result.s4, alpha, src[id].s4);
Multiply(result.s5, alpha, src[id].s5);
Multiply(result.s6, alpha, src[id].s6);
Multiply(result.s7, alpha, src[id].s7);
Multiply(result.s8, alpha, src[id].s8);
Multiply(result.s9, alpha, src[id].s9);
Multiply(result.sA, alpha, src[id].sA);
Multiply(result.sB, alpha, src[id].sB);
Multiply(result.sC, alpha, src[id].sC);
Multiply(result.sD, alpha, src[id].sD);
Multiply(result.sE, alpha, src[id].sE);
Multiply(result.sF, alpha, src[id].sF);
#endif
dest[id] = result;
}
}
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
// Copies a matrix from source to destination. The output is padded with zero values in case the
// destination matrix dimensions are larger than the source matrix dimensions. Additionally, the ld
// value and offset can be different.
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void CopyPadMatrix(const int src_one, const int src_two,
const int src_ld, const int src_offset,
__global const real* restrict src,
const int dest_one, const int dest_two,
const int dest_ld, const int dest_offset,
__global real* dest,
const __constant real* restrict arg_alpha,
const int do_conjugate) {
const real alpha = arg_alpha[0];
// Loops over the work per thread in both dimensions
#pragma unroll
for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
#pragma unroll
for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
if (id_two < dest_two && id_one < dest_one) {
// Loads data if the thread IDs are within bounds of the source matrix. Otherwise, set the
// value to be written to zero.
real value;
SetToZero(value);
if (id_two < src_two && id_one < src_one) {
value = src[id_two*src_ld + id_one + src_offset];
}
// Stores the value in the destination matrix
if (do_conjugate == 1) { COMPLEX_CONJUGATE(value); }
Multiply(dest[id_two*dest_ld + id_one + dest_offset], alpha, value);
}
}
}
}
// =================================================================================================
// Same as above, but now un-pads a matrix. This kernel reads data from a padded source matrix, but
// writes only the actual data back to the destination matrix. Again, the ld value and offset can
// be different.
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void CopyMatrix(const int src_one, const int src_two,
const int src_ld, const int src_offset,
__global const real* restrict src,
const int dest_one, const int dest_two,
const int dest_ld, const int dest_offset,
__global real* dest,
const __constant real* restrict arg_alpha,
const int upper, const int lower,
const int diagonal_imag_zero) {
const real alpha = arg_alpha[0];
// Loops over the work per thread in both dimensions
#pragma unroll
for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
#pragma unroll
for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
// Masking in case of triangular matrices: updates only the upper or lower part
bool condition = true;
#if defined(ROUTINE_SYRK) || defined(ROUTINE_HERK) || defined(ROUTINE_SYR2K) || defined(ROUTINE_HER2K)
if (upper == 1) { condition = (id_two >= id_one); }
else if (lower == 1) { condition = (id_two <= id_one); }
#endif
if (condition) {
// Copies the value into the destination matrix. This is always within bounds of the source
// matrix, as we know that the destination matrix is smaller or equal to the source.
if (id_two < dest_two && id_one < dest_one) {
real value = src[id_two*src_ld + id_one + src_offset];
if (diagonal_imag_zero == 1 && id_one == id_two) { ImagToZero(value); }
Multiply(dest[id_two*dest_ld + id_one + dest_offset], alpha, value);
}
}
}
}
}
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
// Data-widths
#if TRA_WPT == 1
typedef real realT;
#elif TRA_WPT == 2
typedef real2 realT;
#elif TRA_WPT == 4
typedef real4 realT;
#elif TRA_WPT == 8
typedef real8 realT;
#elif TRA_WPT == 16
typedef real16 realT;
#endif
// =================================================================================================
// Transposes and copies a matrix. Requires both matrices to be of the same dimensions and without
// offset. A more general version is available in 'padtranspose.opencl'.
__kernel __attribute__((reqd_work_group_size(TRA_DIM, TRA_DIM, 1)))
void TransposeMatrixFast(const int ld,
__global const realT* restrict src,
__global realT* dest,
const __constant real* restrict arg_alpha) {
const real alpha = arg_alpha[0];
// Sets the group identifiers. They might be 'shuffled' around to distribute work in a different
// way over workgroups, breaking memory-bank dependencies.
const int gid0 = get_group_id(0);
#if TRA_SHUFFLE == 1
const int gid1 = (get_group_id(0) + get_group_id(1)) % get_num_groups(0);
#else
const int gid1 = get_group_id(1);
#endif
// Local memory to store a tile of the matrix (for coalescing)
__local realT tile[TRA_WPT*TRA_DIM][TRA_DIM + TRA_PAD];
// Loops over the work per thread
#pragma unroll
for (int w_one=0; w_one<TRA_WPT; ++w_one) {
// Computes the identifiers for the source matrix. Note that the local and global dimensions
// do not correspond to each other!
const int id_one = gid1 * TRA_DIM + get_local_id(0);
const int id_two = (gid0 * TRA_DIM + get_local_id(1))*TRA_WPT + w_one;
// Loads data into the local memory
realT value = src[id_two*(ld/TRA_WPT) + id_one];
tile[get_local_id(0)*TRA_WPT + w_one][get_local_id(1)] = value;
}
// Synchronizes all threads in a workgroup
barrier(CLK_LOCAL_MEM_FENCE);
// Loads transposed data from the local memory
realT v[TRA_WPT];
#pragma unroll
for (int w_one=0; w_one<TRA_WPT; ++w_one) {
v[w_one] = tile[get_local_id(1)*TRA_WPT + w_one][get_local_id(0)];
}
// Performs the register-level transpose of the vectorized data
realT results[TRA_WPT];
#if TRA_WPT == 1
results[0] = v[0];
#elif TRA_WPT == 2
results[0] = (realT) {v[0].x, v[1].x};
results[1] = (realT) {v[0].y, v[1].y};
#elif TRA_WPT == 4
results[0] = (realT) {v[0].x, v[1].x, v[2].x, v[3].x};
results[1] = (realT) {v[0].y, v[1].y, v[2].y, v[3].y};
results[2] = (realT) {v[0].z, v[1].z, v[2].z, v[3].z};
results[3] = (realT) {v[0].w, v[1].w, v[2].w, v[3].w};
#elif TRA_WPT == 8
results[0] = (realT) {v[0].s0, v[1].s0, v[2].s0, v[3].s0, v[4].s0, v[5].s0, v[6].s0, v[7].s0};
results[1] = (realT) {v[0].s1, v[1].s1, v[2].s1, v[3].s1, v[4].s1, v[5].s1, v[6].s1, v[7].s1};
results[2] = (realT) {v[0].s2, v[1].s2, v[2].s2, v[3].s2, v[4].s2, v[5].s2, v[6].s2, v[7].s2};
results[3] = (realT) {v[0].s3, v[1].s3, v[2].s3, v[3].s3, v[4].s3, v[5].s3, v[6].s3, v[7].s3};
results[4] = (realT) {v[0].s4, v[1].s4, v[2].s4, v[3].s4, v[4].s4, v[5].s4, v[6].s4, v[7].s4};
results[5] = (realT) {v[0].s5, v[1].s5, v[2].s5, v[3].s5, v[4].s5, v[5].s5, v[6].s5, v[7].s5};
results[6] = (realT) {v[0].s6, v[1].s6, v[2].s6, v[3].s6, v[4].s6, v[5].s6, v[6].s6, v[7].s6};
results[7] = (realT) {v[0].s7, v[1].s7, v[2].s7, v[3].s7, v[4].s7, v[5].s7, v[6].s7, v[7].s7};
#elif TRA_WPT == 16
results[ 0] = (realT) {v[0].s0, v[1].s0, v[2].s0, v[3].s0, v[4].s0, v[5].s0, v[6].s0, v[7].s0, v[8].s0, v[9].s0, v[10].s0, v[11].s0, v[12].s0, v[13].s0, v[14].s0, v[15].s0};
results[ 1] = (realT) {v[0].s1, v[1].s1, v[2].s1, v[3].s1, v[4].s1, v[5].s1, v[6].s1, v[7].s1, v[8].s1, v[9].s1, v[10].s1, v[11].s1, v[12].s1, v[13].s1, v[14].s1, v[15].s1};
results[ 2] = (realT) {v[0].s2, v[1].s2, v[2].s2, v[3].s2, v[4].s2, v[5].s2, v[6].s2, v[7].s2, v[8].s2, v[9].s2, v[10].s2, v[11].s2, v[12].s2, v[13].s2, v[14].s2, v[15].s2};
results[ 3] = (realT) {v[0].s3, v[1].s3, v[2].s3, v[3].s3, v[4].s3, v[5].s3, v[6].s3, v[7].s3, v[8].s3, v[9].s3, v[10].s3, v[11].s3, v[12].s3, v[13].s3, v[14].s3, v[15].s3};
results[ 4] = (realT) {v[0].s4, v[1].s4, v[2].s4, v[3].s4, v[4].s4, v[5].s4, v[6].s4, v[7].s4, v[8].s4, v[9].s4, v[10].s4, v[11].s4, v[12].s4, v[13].s4, v[14].s4, v[15].s4};
results[ 5] = (realT) {v[0].s5, v[1].s5, v[2].s5, v[3].s5, v[4].s5, v[5].s5, v[6].s5, v[7].s5, v[8].s5, v[9].s5, v[10].s5, v[11].s5, v[12].s5, v[13].s5, v[14].s5, v[15].s5};
results[ 6] = (realT) {v[0].s6, v[1].s6, v[2].s6, v[3].s6, v[4].s6, v[5].s6, v[6].s6, v[7].s6, v[8].s6, v[9].s6, v[10].s6, v[11].s6, v[12].s6, v[13].s6, v[14].s6, v[15].s6};
results[ 7] = (realT) {v[0].s7, v[1].s7, v[2].s7, v[3].s7, v[4].s7, v[5].s7, v[6].s7, v[7].s7, v[8].s7, v[9].s7, v[10].s7, v[11].s7, v[12].s7, v[13].s7, v[14].s7, v[15].s7};
results[ 8] = (realT) {v[0].s8, v[1].s8, v[2].s8, v[3].s8, v[4].s8, v[5].s8, v[6].s8, v[7].s8, v[8].s8, v[9].s8, v[10].s8, v[11].s8, v[12].s8, v[13].s8, v[14].s8, v[15].s8};
results[ 9] = (realT) {v[0].s9, v[1].s9, v[2].s9, v[3].s9, v[4].s9, v[5].s9, v[6].s9, v[7].s9, v[8].s9, v[9].s9, v[10].s9, v[11].s9, v[12].s9, v[13].s9, v[14].s9, v[15].s9};
results[10] = (realT) {v[0].sA, v[1].sA, v[2].sA, v[3].sA, v[4].sA, v[5].sA, v[6].sA, v[7].sA, v[8].sA, v[9].sA, v[10].sA, v[11].sA, v[12].sA, v[13].sA, v[14].sA, v[15].sA};
results[11] = (realT) {v[0].sB, v[1].sB, v[2].sB, v[3].sB, v[4].sB, v[5].sB, v[6].sB, v[7].sB, v[8].sB, v[9].sB, v[10].sB, v[11].sB, v[12].sB, v[13].sB, v[14].sB, v[15].sB};
results[12] = (realT) {v[0].sC, v[1].sC, v[2].sC, v[3].sC, v[4].sC, v[5].sC, v[6].sC, v[7].sC, v[8].sC, v[9].sC, v[10].sC, v[11].sC, v[12].sC, v[13].sC, v[14].sC, v[15].sC};
results[13] = (realT) {v[0].sD, v[1].sD, v[2].sD, v[3].sD, v[4].sD, v[5].sD, v[6].sD, v[7].sD, v[8].sD, v[9].sD, v[10].sD, v[11].sD, v[12].sD, v[13].sD, v[14].sD, v[15].sD};
results[14] = (realT) {v[0].sE, v[1].sE, v[2].sE, v[3].sE, v[4].sE, v[5].sE, v[6].sE, v[7].sE, v[8].sE, v[9].sE, v[10].sE, v[11].sE, v[12].sE, v[13].sE, v[14].sE, v[15].sE};
results[15] = (realT) {v[0].sF, v[1].sF, v[2].sF, v[3].sF, v[4].sF, v[5].sF, v[6].sF, v[7].sF, v[8].sF, v[9].sF, v[10].sF, v[11].sF, v[12].sF, v[13].sF, v[14].sF, v[15].sF};
#endif
// Multiplies by alpha and then stores the results into the destination matrix
#pragma unroll
for (int w_two=0; w_two<TRA_WPT; ++w_two) {
realT result;
#if TRA_WPT == 1
Multiply(result, alpha, results[w_two]);
#elif TRA_WPT == 2
Multiply(result.x, alpha, results[w_two].x);
Multiply(result.y, alpha, results[w_two].y);
#elif TRA_WPT == 4
Multiply(result.x, alpha, results[w_two].x);
Multiply(result.y, alpha, results[w_two].y);
Multiply(result.z, alpha, results[w_two].z);
Multiply(result.w, alpha, results[w_two].w);
#elif TRA_WPT == 8
Multiply(result.s0, alpha, results[w_two].s0);
Multiply(result.s1, alpha, results[w_two].s1);
Multiply(result.s2, alpha, results[w_two].s2);
Multiply(result.s3, alpha, results[w_two].s3);
Multiply(result.s4, alpha, results[w_two].s4);
Multiply(result.s5, alpha, results[w_two].s5);
Multiply(result.s6, alpha, results[w_two].s6);
Multiply(result.s7, alpha, results[w_two].s7);
#elif TRA_WPT == 16
Multiply(result.s0, alpha, results[w_two].s0);
Multiply(result.s1, alpha, results[w_two].s1);
Multiply(result.s2, alpha, results[w_two].s2);
Multiply(result.s3, alpha, results[w_two].s3);
Multiply(result.s4, alpha, results[w_two].s4);
Multiply(result.s5, alpha, results[w_two].s5);
Multiply(result.s6, alpha, results[w_two].s6);
Multiply(result.s7, alpha, results[w_two].s7);
Multiply(result.s8, alpha, results[w_two].s8);
Multiply(result.s9, alpha, results[w_two].s9);
Multiply(result.sA, alpha, results[w_two].sA);
Multiply(result.sB, alpha, results[w_two].sB);
Multiply(result.sC, alpha, results[w_two].sC);
Multiply(result.sD, alpha, results[w_two].sD);
Multiply(result.sE, alpha, results[w_two].sE);
Multiply(result.sF, alpha, results[w_two].sF);
#endif
const int id_one = gid0*TRA_DIM + get_local_id(0);
const int id_two = (gid1*TRA_DIM + get_local_id(1))*TRA_WPT + w_two;
dest[id_two*(ld/TRA_WPT) + id_one] = result;
}
}
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
// Transposes a matrix from source to destination. The output is padded with zero values in case the
// destination matrix dimensions are larger than the transposed source matrix dimensions.
__kernel __attribute__((reqd_work_group_size(PADTRA_TILE, PADTRA_TILE, 1)))
void TransposePadMatrix(const int src_one, const int src_two,
const int src_ld, const int src_offset,
__global const real* restrict src,
const int dest_one, const int dest_two,
const int dest_ld, const int dest_offset,
__global real* dest,
const __constant real* restrict arg_alpha,
const int do_conjugate) {
const real alpha = arg_alpha[0];
// Local memory to store a tile of the matrix (for coalescing)
__local real tile[PADTRA_WPT*PADTRA_TILE][PADTRA_WPT*PADTRA_TILE + PADTRA_PAD];
// Loop over the work per thread
#pragma unroll
for (int w_one=0; w_one<PADTRA_WPT; ++w_one) {
#pragma unroll
for (int w_two=0; w_two<PADTRA_WPT; ++w_two) {
// Computes the identifiers for the source matrix. Note that the local and global dimensions
// do not correspond to each other!
const int id_src_one = (get_group_id(1)*PADTRA_WPT + w_two) * PADTRA_TILE + get_local_id(0);
const int id_src_two = (get_group_id(0)*PADTRA_WPT + w_one) * PADTRA_TILE + get_local_id(1);
// Loads data into the local memory if the thread IDs are within bounds of the source matrix.
// Otherwise, set the local memory value to zero.
real value;
SetToZero(value);
if (id_src_two < src_two && id_src_one < src_one) {
value = src[id_src_two*src_ld + id_src_one + src_offset];
}
tile[get_local_id(1)*PADTRA_WPT + w_two][get_local_id(0)*PADTRA_WPT + w_one] = value;
}
}
// Synchronizes all threads in a workgroup
barrier(CLK_LOCAL_MEM_FENCE);
// Loop over the work per thread
#pragma unroll
for (int w_one=0; w_one<PADTRA_WPT; ++w_one) {
#pragma unroll
for (int w_two=0; w_two<PADTRA_WPT; ++w_two) {
// Computes the identifiers for the destination matrix
const int id_dest_one = (get_group_id(0)*PADTRA_WPT + w_one) * PADTRA_TILE + get_local_id(0);
const int id_dest_two = (get_group_id(1)*PADTRA_WPT + w_two) * PADTRA_TILE + get_local_id(1);
// Stores the transposed value in the destination matrix
if ((id_dest_one < dest_one) && (id_dest_two < dest_two)) {
real value = tile[get_local_id(0)*PADTRA_WPT + w_two][get_local_id(1)*PADTRA_WPT + w_one];
if (do_conjugate == 1) { COMPLEX_CONJUGATE(value); }
Multiply(dest[id_dest_two*dest_ld + id_dest_one + dest_offset], alpha, value);
}
}
}
}
// =================================================================================================
// Transposes a matrix, while considering possible padding in the source matrix. Data is read from a
// padded source matrix, but only the actual data is written back to the transposed destination
// matrix. This kernel optionally checks for upper/lower triangular matrices.
__kernel __attribute__((reqd_work_group_size(PADTRA_TILE, PADTRA_TILE, 1)))
void TransposeMatrix(const int src_one, const int src_two,
const int src_ld, const int src_offset,
__global const real* restrict src,
const int dest_one, const int dest_two,
const int dest_ld, const int dest_offset,
__global real* dest,
const __constant real* restrict arg_alpha,
const int upper, const int lower,
const int diagonal_imag_zero) {
const real alpha = arg_alpha[0];
// Local memory to store a tile of the matrix (for coalescing)
__local real tile[PADTRA_WPT*PADTRA_TILE][PADTRA_WPT*PADTRA_TILE + PADTRA_PAD];
// Loop over the work per thread
#pragma unroll
for (int w_one=0; w_one<PADTRA_WPT; ++w_one) {
#pragma unroll
for (int w_two=0; w_two<PADTRA_WPT; ++w_two) {
// Computes the identifiers for the source matrix. Note that the local and global dimensions
// do not correspond to each other!
const int id_src_one = (get_group_id(1)*PADTRA_WPT + w_two) * PADTRA_TILE + get_local_id(0);
const int id_src_two = (get_group_id(0)*PADTRA_WPT + w_one) * PADTRA_TILE + get_local_id(1);
// Loads data into the local memory if the thread IDs are within bounds of the source matrix.
if ((id_src_one < src_one) && (id_src_two < src_two)) {
real value = src[id_src_two*src_ld + id_src_one + src_offset];
tile[get_local_id(1)*PADTRA_WPT + w_two][get_local_id(0)*PADTRA_WPT + w_one] = value;
}
}
}
// Synchronizes all threads in a workgroup
barrier(CLK_LOCAL_MEM_FENCE);
// Loop over the work per thread
#pragma unroll
for (int w_one=0; w_one<PADTRA_WPT; ++w_one) {
#pragma unroll
for (int w_two=0; w_two<PADTRA_WPT; ++w_two) {
// Computes the identifiers for the destination matrix
const int id_dest_one = (get_group_id(0)*PADTRA_WPT + w_one) * PADTRA_TILE + get_local_id(0);
const int id_dest_two = (get_group_id(1)*PADTRA_WPT + w_two) * PADTRA_TILE + get_local_id(1);
// Masking in case of triangular matrices: updates only the upper or lower part
bool condition = true;
#if defined(ROUTINE_SYRK) || defined(ROUTINE_HERK) || defined(ROUTINE_SYR2K) || defined(ROUTINE_HER2K)
if (upper == 1) { condition = (id_dest_one >= id_dest_two); }
else if (lower == 1) { condition = (id_dest_one <= id_dest_two); }
#endif
if (condition) {
// Stores the transposed value in the destination matrix
if ((id_dest_one < dest_one) && (id_dest_two < dest_two)) {
real value = tile[get_local_id(0)*PADTRA_WPT + w_two][get_local_id(1)*PADTRA_WPT + w_one];
if (diagonal_imag_zero == 1 && id_dest_one == id_dest_two) { ImagToZero(value); }
Multiply(dest[id_dest_two*dest_ld + id_dest_one + dest_offset], alpha, value);
}
}
}
}
}
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
#if defined(ROUTINE_SYMM)
// Kernel to populate a squared symmetric matrix, given that the triangle which holds the data is
// stored as the lower-triangle of the input matrix. This uses the padding kernel's parameters.
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void SymmLowerToSquared(const int src_dim,
const int src_ld, const int src_offset,
__global const real* restrict src,
const int dest_dim,
const int dest_ld, const int dest_offset,
__global real* dest) {
// Loops over the work per thread in both dimensions
#pragma unroll
for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
#pragma unroll
for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
if (id_two < dest_dim && id_one < dest_dim) {
// Loads data from the lower-symmetric matrix
real result;
SetToZero(result);
if (id_two < src_dim && id_one < src_dim) {
if (id_two <= id_one) { result = src[id_two*src_ld + id_one + src_offset]; }
else { result = src[id_one*src_ld + id_two + src_offset]; }
}
// Stores the result in the destination matrix
dest[id_two*dest_ld + id_one + dest_offset] = result;
}
}
}
}
// Same as above, but now the matrix' data is stored in the upper-triangle
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void SymmUpperToSquared(const int src_dim,
const int src_ld, const int src_offset,
__global const real* restrict src,
const int dest_dim,
const int dest_ld, const int dest_offset,
__global real* dest) {
// Loops over the work per thread in both dimensions
#pragma unroll
for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
#pragma unroll
for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
if (id_two < dest_dim && id_one < dest_dim) {
// Loads data from the upper-symmetric matrix
real result;
SetToZero(result);
if (id_two < src_dim && id_one < src_dim) {
if (id_one <= id_two) { result = src[id_two*src_ld + id_one + src_offset]; }
else { result = src[id_one*src_ld + id_two + src_offset]; }
}
// Stores the result in the destination matrix
dest[id_two*dest_ld + id_one + dest_offset] = result;
}
}
}
}
#endif
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
#if defined(ROUTINE_TRMM)
// Kernel to populate a squared triangular matrix, given that the triangle which holds the data is
// stored as the lower-triangle of the input matrix. This uses the padding kernel's parameters.
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void TriaLowerToSquared(const int src_dim,
const int src_ld, const int src_offset,
__global const real* restrict src,
const int dest_dim,
const int dest_ld, const int dest_offset,
__global real* dest,
const int unit_diagonal) {
// Loops over the work per thread in both dimensions
#pragma unroll
for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
#pragma unroll
for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
if (id_two < dest_dim && id_one < dest_dim) {
// Loads data from the lower-triangular matrix
real result;
SetToZero(result);
if (id_two < src_dim && id_one < src_dim) {
if (id_two <= id_one) { result = src[id_two*src_ld + id_one + src_offset]; }
if (id_two == id_one && unit_diagonal) { SetToOne(result); }
// Else: result is zero
}
// Stores the result in the destination matrix
dest[id_two*dest_ld + id_one + dest_offset] = result;
}
}
}
}
// Same as above, but now the matrix' data is stored in the upper-triangle
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void TriaUpperToSquared(const int src_dim,
const int src_ld, const int src_offset,
__global const real* restrict src,
const int dest_dim,
const int dest_ld, const int dest_offset,
__global real* dest,
const int unit_diagonal) {
// Loops over the work per thread in both dimensions
#pragma unroll
for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
#pragma unroll
for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
if (id_two < dest_dim && id_one < dest_dim) {
// Loads data from the upper-triangular matrix
real result;
SetToZero(result);
if (id_two < src_dim && id_one < src_dim) {
if (id_one <= id_two) { result = src[id_two*src_ld + id_one + src_offset]; }
if (id_one == id_two && unit_diagonal) { SetToOne(result); }
// Else: result is zero
}
// Stores the result in the destination matrix
dest[id_two*dest_ld + id_one + dest_offset] = result;
}
}
}
}
#endif
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
#if defined(ROUTINE_HEMM) && (PRECISION == 3232 || PRECISION == 6464)
// Kernel to populate a squared hermitian matrix, given that the triangle which holds the data is
// stored as the lower-triangle of the input matrix. This uses the padding kernel's parameters.
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void HermLowerToSquared(const int src_dim,
const int src_ld, const int src_offset,
__global const real* restrict src,
const int dest_dim,
const int dest_ld, const int dest_offset,
__global real* dest) {
// Loops over the work per thread in both dimensions
#pragma unroll
for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
#pragma unroll
for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
if (id_two < dest_dim && id_one < dest_dim) {
// Loads data from the lower-hermitian matrix
real result;
SetToZero(result);
if (id_two < src_dim && id_one < src_dim) {
if (id_two <= id_one) {
result = src[id_two*src_ld + id_one + src_offset];
if (id_one == id_two) { result.y = ZERO; }
}
else {
result = src[id_one*src_ld + id_two + src_offset];
COMPLEX_CONJUGATE(result);
}
}
// Stores the result in the destination matrix
dest[id_two*dest_ld + id_one + dest_offset] = result;
}
}
}
}
// Same as above, but now the matrix' data is stored in the upper-triangle
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void HermUpperToSquared(const int src_dim,
const int src_ld, const int src_offset,
__global const real* restrict src,
const int dest_dim,
const int dest_ld, const int dest_offset,
__global real* dest) {
// Loops over the work per thread in both dimensions
#pragma unroll
for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
#pragma unroll
for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
if (id_two < dest_dim && id_one < dest_dim) {
// Loads data from the upper-hermitian matrix
real result;
SetToZero(result);
if (id_two < src_dim && id_one < src_dim) {
if (id_one <= id_two) {
result = src[id_two*src_ld + id_one + src_offset];
if (id_one == id_two) { result.y = ZERO; }
}
else {
result = src[id_one*src_ld + id_two + src_offset];
COMPLEX_CONJUGATE(result);
}
}
// Stores the result in the destination matrix
dest[id_two*dest_ld + id_one + dest_offset] = result;
}
}
}
}
#endif
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
// Parameters set by the tuner or by the database. Here they are given a basic default value in case
// this kernel file is used outside of the CLBlast library.
#ifndef MWG
#define MWG 8 // Tile-size in dimension M (e.g. 64, 128)
#endif
#ifndef NWG
#define NWG 8 // Tile-size in dimension N (e.g. 64, 128)
#endif
#ifndef KWG
#define KWG 8 // Tile-size in dimension K (e.g. 8, 16)
#endif
#ifndef MDIMC
#define MDIMC 8 // Threads per workgroup in M-dimension (e.g. 8, 16, 32)
#endif
#ifndef NDIMC
#define NDIMC 8 // Threads per workgroup in N-dimension (e.g. 8, 16, 32)
#endif
#ifndef MDIMA
#define MDIMA 8 // Re-shaped tile dimension of matrix A: KDIMA * MDIMA
#endif
#ifndef NDIMB
#define NDIMB 8 // Re-shaped tile dimension of matrix B: KDIMB * NDIMB
#endif
#ifndef KWI
#define KWI 1 // Unroll factor of the KWG loop (smaller or equal than KWG)
#endif
#ifndef VWM
#define VWM 1 // Vector width of matrices A and C
#endif
#ifndef VWN
#define VWN 1 // Vector width of matrix B
#endif
#ifndef STRM
#define STRM 0 // Use strided access within a thread in the M-dimension (1) or not (0)
#endif
#ifndef STRN
#define STRN 0 // Use strided access within a thread in the N-dimension (1) or not (0)
#endif
#ifndef SA
#define SA 0 // Use local/shared memory to cache matrix A (1) or not (0)
#endif
#ifndef SB
#define SB 0 // Use local/shared memory to cache matrix B (1) or not (0)
#endif
// Helper parameters based on the above tuning parameters
#define MWI (MWG/MDIMC) // Work per work-item (M-dimension)
#define NWI (NWG/NDIMC) // Work per work-item (N-dimension)
#define KDIMA ((MDIMC*NDIMC)/(MDIMA)) // Re-shaped tile dimension of matrix A: KDIMA * MDIMA
#define KDIMB ((MDIMC*NDIMC)/(NDIMB)) // Re-shaped tile dimension of matrix B: KDIMB * NDIMB
#define MWA (MWG/MDIMA) // Amount of loads-per-thread for matrix A (M-dimension)
#define KWA (KWG/KDIMA) // Amount of loads-per-thread for matrix A (K-dimension)
#define KWB (KWG/KDIMB) // Amount of loads-per-thread for matrix B (K-dimension)
#define NWB (NWG/NDIMB) // Amount of loads-per-thread for matrix B (N-dimension)
// Settings
#ifndef USE_VECTOR_MAD
#define USE_VECTOR_MAD 0 // Unroll (0) or don't (1) unroll the vector MAD manually
#endif
#ifndef GLOBAL_MEM_FENCE
#define GLOBAL_MEM_FENCE 0 // Global synchronisation barrier for potential better performance
#endif
// =================================================================================================
// Data-widths in dimension M
#if VWM == 1
typedef real realM;
#elif VWM == 2
typedef real2 realM;
#elif VWM == 4
typedef real4 realM;
#elif VWM == 8
typedef real8 realM;
#elif VWM == 16
typedef real16 realM;
#endif
// Data-widths in dimension N
#if VWN == 1
typedef real realN;
#elif VWN == 2
typedef real2 realN;
#elif VWN == 4
typedef real4 realN;
#elif VWN == 8
typedef real8 realN;
#elif VWN == 16
typedef real16 realN;
#endif
// =================================================================================================
// Initializes the accumulation registers to zero
void InitAccRegisters(realM cpm[NWI][MWI/VWM]) {
#pragma unroll
for (int mi=0; mi<MWI/VWM; ++mi) {
#pragma unroll
for (int ni=0; ni<NWI; ++ni) {
#if VWM == 1
SetToZero(cpm[ni][mi]);
#elif VWM == 2
SetToZero(cpm[ni][mi].x);
SetToZero(cpm[ni][mi].y);
#elif VWM == 4
SetToZero(cpm[ni][mi].x);
SetToZero(cpm[ni][mi].y);
SetToZero(cpm[ni][mi].z);
SetToZero(cpm[ni][mi].w);
#elif VWM == 8
SetToZero(cpm[ni][mi].s0);
SetToZero(cpm[ni][mi].s1);
SetToZero(cpm[ni][mi].s2);
SetToZero(cpm[ni][mi].s3);
SetToZero(cpm[ni][mi].s4);
SetToZero(cpm[ni][mi].s5);
SetToZero(cpm[ni][mi].s6);
SetToZero(cpm[ni][mi].s7);
#elif VWM == 16
SetToZero(cpm[ni][mi].s0);
SetToZero(cpm[ni][mi].s1);
SetToZero(cpm[ni][mi].s2);
SetToZero(cpm[ni][mi].s3);
SetToZero(cpm[ni][mi].s4);
SetToZero(cpm[ni][mi].s5);
SetToZero(cpm[ni][mi].s6);
SetToZero(cpm[ni][mi].s7);
SetToZero(cpm[ni][mi].s8);
SetToZero(cpm[ni][mi].s9);
SetToZero(cpm[ni][mi].sA);
SetToZero(cpm[ni][mi].sB);
SetToZero(cpm[ni][mi].sC);
SetToZero(cpm[ni][mi].sD);
SetToZero(cpm[ni][mi].sE);
SetToZero(cpm[ni][mi].sF);
#endif
}
}
}
// =================================================================================================
// Caches global off-chip memory into local (shared) memory on-chip. This function is specific for
// caching the A input matrix.
#if SA == 1
inline void GlobalToLocalA(const __global realM* restrict agm, __local realM* alm,
const int kSizeM, const int tid, const int kwg) {
const int la0 = tid % MDIMA;
const int la1 = tid / MDIMA;
#pragma unroll
for (int mia=0; mia<MWA/VWM; ++mia) {
#pragma unroll
for (int kia=0; kia<KWA; ++kia) {
// Computes the indices based on strided/non-strided access
#if STRM == 0
int mg = mia + la0*(MWA/VWM);
#elif STRM == 1
int mg = la0 + mia*MDIMA;
#endif
// Computes the indices for the global memory
int kg = kia + la1*KWA;
int idm = mg + GetGroupID0() * (MWG/VWM);
int idk = kg + kwg;
// Loads the data from global memory (not transposed) into the local memory
alm[kg*(MWG/VWM) + mg] = agm[idk*(kSizeM/VWM) + idm];
}
}
}
#endif
// Same as above, but now for the B input matrix
#if SB == 1
inline void GlobalToLocalB(const __global realN* restrict bgm, __local realN* blm,
const int kSizeN, const int tid, const int kwg) {
const int lb0 = tid % NDIMB;
const int lb1 = tid / NDIMB;
#pragma unroll
for (int kib=0; kib<KWB; ++kib) {
#pragma unroll
for (int nib=0; nib<NWB/VWN; ++nib) {
// Computes the indices based on strided/non-strided access
#if STRN == 0
int ng = nib + lb0*(NWB/VWN);
#elif STRN == 1
int ng = lb0 + nib*NDIMB;
#endif
// Computes the indices for the global memory
int kg = kib + lb1*KWB;
int idn = ng + GetGroupID1() * (NWG/VWN);
int idk = kg + kwg;
// Loads the data from global memory (transposed) into the local memory
blm[kg*(NWG/VWN) + ng] = bgm[idk*(kSizeN/VWN) + idn];
}
}
}
#endif
// =================================================================================================
// Caches global off-chip memory directly into per-thread private memory (registers). This function
// is specific for caching the A input matrix.
#if SA == 0
inline void GlobalToPrivateA(const __global realM* restrict agm, realM apm[MWI/VWM],
const int kSizeM, const int idk, const int kwg) {
#pragma unroll
for (int mi=0; mi<MWI/VWM; ++mi) {
// Computes the indices based on strided/non-strided access
#if STRM == 0
int mg = mi + get_local_id(0)*(MWI/VWM);
#elif STRM == 1
int mg = get_local_id(0) + mi*MDIMC;
#endif
// Computes the indices for the global memory
int idm = mg + GetGroupID0() * (MWG/VWM);
// Loads the data from global memory (not transposed) and stores into registers
apm[mi] = agm[idk*(kSizeM/VWM) + idm];
}
}
#endif
// Same as above, but now for the B input matrix
#if SB == 0
inline void GlobalToPrivateB(const __global realN* restrict bgm, realN bpm[NWI/VWN],
const int kSizeN, const int idk) {
#pragma unroll
for (int ni=0; ni<NWI/VWN; ++ni) {
// Computes the indices based on strided/non-strided access
#if STRN == 0
int ng = ni + get_local_id(1)*(NWI/VWN);
#elif STRN == 1
int ng = get_local_id(1) + ni*NDIMC;
#endif
// Computes the indices for the global memory
int idn = ng + GetGroupID1() * (NWG/VWN);
// Loads the data from global memory (transposed) and stores into registers
bpm[ni] = bgm[idk*(kSizeN/VWN) + idn];
}
}
#endif
// =================================================================================================
// Caches on-chip local memory into per-thread private memory (registers). This function is specific
// for caching the A input matrix.
#if SA == 1
inline void LocalToPrivateA(__local realM* alm, realM apm[MWI/VWM], const int kg) {
#pragma unroll
for (int mi=0; mi<MWI/VWM; ++mi) {
#if STRM == 0
int mg = mi + get_local_id(0)*(MWI/VWM);
#elif STRM == 1
int mg = get_local_id(0) + mi*MDIMC;
#endif
apm[mi] = alm[kg*(MWG/VWM) + mg];
}
}
#endif
// Same as above, but now for the B input matrix
#if SB == 1
inline void LocalToPrivateB(__local realN* blm, realN bpm[NWI/VWN], const int kg) {
#pragma unroll
for (int ni=0; ni<NWI/VWN; ++ni) {
#if STRN == 0
int ng = ni + get_local_id(1)*(NWI/VWN);
#elif STRN == 1
int ng = get_local_id(1) + ni*NDIMC;
#endif
bpm[ni] = blm[kg*(NWG/VWN) + ng];
}
}
#endif
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
// The vectorised multiply-add function
inline realM MultiplyAddVector(realM cvec, const realM avec, const real bval) {
#if USE_VECTOR_MAD == 1
cvec += avec * bval;
#else
#if VWM == 1
MultiplyAdd(cvec, avec, bval);
#elif VWM == 2
MultiplyAdd(cvec.x , avec.x, bval);
MultiplyAdd(cvec.y , avec.y, bval);
#elif VWM == 4
MultiplyAdd(cvec.x , avec.x, bval);
MultiplyAdd(cvec.y , avec.y, bval);
MultiplyAdd(cvec.z , avec.z, bval);
MultiplyAdd(cvec.w , avec.w, bval);
#elif VWM == 8
MultiplyAdd(cvec.s0, avec.s0, bval);
MultiplyAdd(cvec.s1, avec.s1, bval);
MultiplyAdd(cvec.s2, avec.s2, bval);
MultiplyAdd(cvec.s3, avec.s3, bval);
MultiplyAdd(cvec.s4, avec.s4, bval);
MultiplyAdd(cvec.s5, avec.s5, bval);
MultiplyAdd(cvec.s6, avec.s6, bval);
MultiplyAdd(cvec.s7, avec.s7, bval);
#elif VWM == 16
MultiplyAdd(cvec.s0, avec.s0, bval);
MultiplyAdd(cvec.s1, avec.s1, bval);
MultiplyAdd(cvec.s2, avec.s2, bval);
MultiplyAdd(cvec.s3, avec.s3, bval);
MultiplyAdd(cvec.s4, avec.s4, bval);
MultiplyAdd(cvec.s5, avec.s5, bval);
MultiplyAdd(cvec.s6, avec.s6, bval);
MultiplyAdd(cvec.s7, avec.s7, bval);
MultiplyAdd(cvec.s8, avec.s8, bval);
MultiplyAdd(cvec.s9, avec.s9, bval);
MultiplyAdd(cvec.sA, avec.sA, bval);
MultiplyAdd(cvec.sB, avec.sB, bval);
MultiplyAdd(cvec.sC, avec.sC, bval);
MultiplyAdd(cvec.sD, avec.sD, bval);
MultiplyAdd(cvec.sE, avec.sE, bval);
MultiplyAdd(cvec.sF, avec.sF, bval);
#endif
#endif
return cvec;
}
// Performs the actual computation: Cpm += Apm * Bpm
inline void MultiplyAccumulate(realM cpm[NWI][MWI/VWM], realM apm[MWI/VWM], realN bpm[NWI/VWN]) {
#pragma unroll
for (int ni=0; ni<NWI/VWN; ++ni) {
#pragma unroll
for (int mi=0; mi<MWI/VWM; ++mi) {
const realM aval = apm[mi];
#if VWN == 1
cpm[ni*VWN + 0][mi] = MultiplyAddVector(cpm[ni*VWN + 0][mi], aval, bpm[ni]);
#elif VWN == 2
cpm[ni*VWN + 0][mi] = MultiplyAddVector(cpm[ni*VWN + 0][mi], aval, bpm[ni].x);
cpm[ni*VWN + 1][mi] = MultiplyAddVector(cpm[ni*VWN + 1][mi], aval, bpm[ni].y);
#elif VWN == 4
cpm[ni*VWN + 0][mi] = MultiplyAddVector(cpm[ni*VWN + 0][mi], aval, bpm[ni].x);
cpm[ni*VWN + 1][mi] = MultiplyAddVector(cpm[ni*VWN + 1][mi], aval, bpm[ni].y);
cpm[ni*VWN + 2][mi] = MultiplyAddVector(cpm[ni*VWN + 2][mi], aval, bpm[ni].z);
cpm[ni*VWN + 3][mi] = MultiplyAddVector(cpm[ni*VWN + 3][mi], aval, bpm[ni].w);
#elif VWN == 8
cpm[ni*VWN + 0][mi] = MultiplyAddVector(cpm[ni*VWN + 0][mi], aval, bpm[ni].s0);
cpm[ni*VWN + 1][mi] = MultiplyAddVector(cpm[ni*VWN + 1][mi], aval, bpm[ni].s1);
cpm[ni*VWN + 2][mi] = MultiplyAddVector(cpm[ni*VWN + 2][mi], aval, bpm[ni].s2);
cpm[ni*VWN + 3][mi] = MultiplyAddVector(cpm[ni*VWN + 3][mi], aval, bpm[ni].s3);
cpm[ni*VWN + 4][mi] = MultiplyAddVector(cpm[ni*VWN + 4][mi], aval, bpm[ni].s4);
cpm[ni*VWN + 5][mi] = MultiplyAddVector(cpm[ni*VWN + 5][mi], aval, bpm[ni].s5);
cpm[ni*VWN + 6][mi] = MultiplyAddVector(cpm[ni*VWN + 6][mi], aval, bpm[ni].s6);
cpm[ni*VWN + 7][mi] = MultiplyAddVector(cpm[ni*VWN + 7][mi], aval, bpm[ni].s7);
#elif VWN == 16
cpm[ni*VWN + 0 ][mi] = MultiplyAddVector(cpm[ni*VWN + 0 ][mi], aval, bpm[ni].s0);
cpm[ni*VWN + 1 ][mi] = MultiplyAddVector(cpm[ni*VWN + 1 ][mi], aval, bpm[ni].s1);
cpm[ni*VWN + 2 ][mi] = MultiplyAddVector(cpm[ni*VWN + 2 ][mi], aval, bpm[ni].s2);
cpm[ni*VWN + 3 ][mi] = MultiplyAddVector(cpm[ni*VWN + 3 ][mi], aval, bpm[ni].s3);
cpm[ni*VWN + 4 ][mi] = MultiplyAddVector(cpm[ni*VWN + 4 ][mi], aval, bpm[ni].s4);
cpm[ni*VWN + 5 ][mi] = MultiplyAddVector(cpm[ni*VWN + 5 ][mi], aval, bpm[ni].s5);
cpm[ni*VWN + 6 ][mi] = MultiplyAddVector(cpm[ni*VWN + 6 ][mi], aval, bpm[ni].s6);
cpm[ni*VWN + 7 ][mi] = MultiplyAddVector(cpm[ni*VWN + 7 ][mi], aval, bpm[ni].s7);
cpm[ni*VWN + 8 ][mi] = MultiplyAddVector(cpm[ni*VWN + 8 ][mi], aval, bpm[ni].s8);
cpm[ni*VWN + 9 ][mi] = MultiplyAddVector(cpm[ni*VWN + 9 ][mi], aval, bpm[ni].s9);
cpm[ni*VWN + 10][mi] = MultiplyAddVector(cpm[ni*VWN + 10][mi], aval, bpm[ni].sA);
cpm[ni*VWN + 11][mi] = MultiplyAddVector(cpm[ni*VWN + 11][mi], aval, bpm[ni].sB);
cpm[ni*VWN + 12][mi] = MultiplyAddVector(cpm[ni*VWN + 12][mi], aval, bpm[ni].sC);
cpm[ni*VWN + 13][mi] = MultiplyAddVector(cpm[ni*VWN + 13][mi], aval, bpm[ni].sD);
cpm[ni*VWN + 14][mi] = MultiplyAddVector(cpm[ni*VWN + 14][mi], aval, bpm[ni].sE);
cpm[ni*VWN + 15][mi] = MultiplyAddVector(cpm[ni*VWN + 15][mi], aval, bpm[ni].sF);
#endif
}
}
}
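// Hedged sketch (not part of the dumped kernel source): ignoring the vector widths, the loop
// nest above is equivalent to a plain outer-product update of the per-work-item register tile:
//   for (int n = 0; n < NWI; ++n)
//     for (int m = 0; m < MWI; ++m)
//       c[n][m] += a[m] * b[n];
// i.e. each work-item accumulates an MWI x NWI block of C from its private A and B fragments.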
// =================================================================================================
// Merges the results in Cpm with the global array in Cgm. This also performs the multiplication
// with the constants: Cgm = alpha*A*B + beta*Cgm = alpha*Cpm + beta*Cgm
inline void StoreResults(__global realM* cgm, realM cpm[NWI][MWI/VWM], const int kSizeM,
const real alpha, const real beta) {
#pragma unroll
for (int ni=0; ni<NWI; ++ni) {
#pragma unroll
for (int mi=0; mi<MWI/VWM; ++mi) {
#if STRM == 0
int mg = mi + get_local_id(0)*(MWI/VWM);
#elif STRM == 1
int mg = get_local_id(0) + mi*MDIMC;
#endif
#if STRN == 0
int ng = ni + get_local_id(1)*NWI;
#elif STRN == 1
int ng = ni%VWN + get_local_id(1)*VWN + (ni/VWN)*VWN*NDIMC;
#endif
int idm = mg + GetGroupID0() * (MWG/VWM);
int idn = ng + GetGroupID1() * NWG;
// The final multiplication with alpha and the addition with beta*C
int index = idn*(kSizeM/VWM) + idm;
realM result;
realM xval = cpm[ni][mi];
realM yval = cgm[index];
#if VWM == 1
AXPBY(result, alpha, xval, beta, yval);
#elif VWM == 2
AXPBY(result.x, alpha, xval.x, beta, yval.x);
AXPBY(result.y, alpha, xval.y, beta, yval.y);
#elif VWM == 4
AXPBY(result.x, alpha, xval.x, beta, yval.x);
AXPBY(result.y, alpha, xval.y, beta, yval.y);
AXPBY(result.z, alpha, xval.z, beta, yval.z);
AXPBY(result.w, alpha, xval.w, beta, yval.w);
#elif VWM == 8
AXPBY(result.s0, alpha, xval.s0, beta, yval.s0);
AXPBY(result.s1, alpha, xval.s1, beta, yval.s1);
AXPBY(result.s2, alpha, xval.s2, beta, yval.s2);
AXPBY(result.s3, alpha, xval.s3, beta, yval.s3);
AXPBY(result.s4, alpha, xval.s4, beta, yval.s4);
AXPBY(result.s5, alpha, xval.s5, beta, yval.s5);
AXPBY(result.s6, alpha, xval.s6, beta, yval.s6);
AXPBY(result.s7, alpha, xval.s7, beta, yval.s7);
#elif VWM == 16
AXPBY(result.s0, alpha, xval.s0, beta, yval.s0);
AXPBY(result.s1, alpha, xval.s1, beta, yval.s1);
AXPBY(result.s2, alpha, xval.s2, beta, yval.s2);
AXPBY(result.s3, alpha, xval.s3, beta, yval.s3);
AXPBY(result.s4, alpha, xval.s4, beta, yval.s4);
AXPBY(result.s5, alpha, xval.s5, beta, yval.s5);
AXPBY(result.s6, alpha, xval.s6, beta, yval.s6);
AXPBY(result.s7, alpha, xval.s7, beta, yval.s7);
AXPBY(result.s8, alpha, xval.s8, beta, yval.s8);
AXPBY(result.s9, alpha, xval.s9, beta, yval.s9);
AXPBY(result.sA, alpha, xval.sA, beta, yval.sA);
AXPBY(result.sB, alpha, xval.sB, beta, yval.sB);
AXPBY(result.sC, alpha, xval.sC, beta, yval.sC);
AXPBY(result.sD, alpha, xval.sD, beta, yval.sD);
AXPBY(result.sE, alpha, xval.sE, beta, yval.sE);
AXPBY(result.sF, alpha, xval.sF, beta, yval.sF);
#endif
cgm[index] = result;
}
}
}
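// Hedged aside (not part of the dumped kernel source): AXPBY is defined earlier in the kernel
// string and is assumed to compute result = alpha*x + beta*y per component, so the store above
// realises Cgm = alpha*(A*B) + beta*Cgm one realM vector at a time. The index computation
// assumes C is column-major with leading dimension kSizeM, packed into vectors of width VWM:
//   index = (GetGroupID1()*NWG + ng) * (kSizeM/VWM) + (GetGroupID0()*(MWG/VWM) + mg);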
// =================================================================================================
// Main body of the matrix-multiplication algorithm. It calls the (inlined) functions above.
inline void XgemmBody(const int kSizeM, const int kSizeN, const int kSizeK,
const __global realM* restrict agm, const __global realN* restrict bgm,
__global realM* cgm, realM cpm[NWI][MWI/VWM]
#if SA == 1 && SB == 1
, __local realM* alm, __local realN* blm
#elif SA == 1
, __local realM* alm
#elif SB == 1
, __local realN* blm
#endif
) {
// Allocates workitem-private memory (registers)
realM apm[MWI/VWM];
realN bpm[NWI/VWN];
// Combined thread identifier (volatile to disable caching)
#if SA == 1 || SB == 1
volatile int tid = get_local_id(0) + MDIMC*get_local_id(1);
#endif
// Initializes the accumulation registers
InitAccRegisters(cpm);
// Loops over all workgroup tiles
for (int kwg=0; kwg<kSizeK; kwg+=KWG) {
// Loads data: off-chip --> local (matrix A)
#if SA == 1
GlobalToLocalA(agm, alm, kSizeM, tid, kwg);
#endif
// Loads data: off-chip --> local (matrix B)
#if SB == 1
GlobalToLocalB(bgm, blm, kSizeN, tid, kwg);
#endif
#if SA == 1 || SB == 1
barrier(CLK_LOCAL_MEM_FENCE);
#endif
// Loops over all workitem tiles, unrolled by a factor KWI
for (int pwi=0; pwi<KWG; pwi+=KWI) {
#pragma unroll
for (int pit=0; pit<KWI; ++pit) {
#if SA == 0 || SB == 0
int idk = kwg + pwi + pit;
#endif
#if SA == 1 || SB == 1
int kg = pwi+pit;
#endif
// Loads data: local --> private (matrix A)
#if SA == 1
LocalToPrivateA(alm, apm, kg);
// Loads data: off-chip --> private (matrix A)
#else
GlobalToPrivateA(agm, apm, kSizeM, idk, kwg);
#endif
// Loads data: local --> private (matrix B)
#if SB == 1
LocalToPrivateB(blm, bpm, kg);
// Loads data: off-chip --> private (matrix B)
#else
GlobalToPrivateB(bgm, bpm, kSizeN, idk);
#endif
// Performs the accumulation (Cpm += Apm * Bpm)
MultiplyAccumulate(cpm, apm, bpm);
}
}
#if SA == 1 || SB == 1
barrier(CLK_LOCAL_MEM_FENCE);
#endif
}
#if GLOBAL_MEM_FENCE == 1
barrier(CLK_GLOBAL_MEM_FENCE);
#endif
}
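// Hedged summary (not part of the dumped kernel source): XgemmBody implements a three-level
// tiling scheme. Each workgroup processes an MWG x NWG tile of C, stepping over K in chunks of
// KWG that are optionally staged in local memory (SA/SB); each work-item keeps an MWI x NWI
// sub-tile in registers, and the innermost K loop is unrolled by a factor KWI. With illustrative
// parameters MWG = NWG = 64 and MDIMC = NDIMC = 8, a workgroup of 64 work-items covers the
// tile with MWI = NWI = 8 per work-item.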
// =================================================================================================
// The upper-triangular and lower-triangular kernels are only used in special cases
#if defined(ROUTINE_SYRK) || defined(ROUTINE_HERK) || defined(ROUTINE_SYR2K) || defined(ROUTINE_HER2K)
// Main entry point of the kernel. This is the upper-triangular version.
__kernel __attribute__((reqd_work_group_size(MDIMC, NDIMC, 1)))
void XgemmUpper(const int kSizeN, const int kSizeK,
const __constant real* restrict arg_alpha,
const __constant real* restrict arg_beta,
const __global realM* restrict agm,
const __global realN* restrict bgm,
__global realM* cgm) {
const real alpha = arg_alpha[0];
const real beta = arg_beta[0];
// Skip these threads if they do not contain threads contributing to the upper-triangle
if (GetGroupID1()*NWG < GetGroupID0()*MWG) {
return;
}
// Allocates workgroup-private memory (local memory)
#if SA == 1
__local realM alm[KWG * MWG/VWM];
#endif
#if SB == 1
__local realN blm[KWG * NWG/VWN];
#endif
// Computes the matrix-multiplication and stores the result in register memory
realM cpm[NWI][MWI/VWM];
#if SA == 1 && SB == 1
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm, alm, blm);
#elif SA == 1
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm, alm);
#elif SB == 1
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm, blm);
#else
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm);
#endif
// Stores an MWG * NWG tile of results and performs the multiplication with alpha and beta
StoreResults(cgm, cpm, kSizeN, alpha, beta);
}
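// Hedged aside (not part of the dumped kernel source): the triangular kernels pass kSizeN as
// both the M and N dimension of XgemmBody and StoreResults because they are only used for the
// SYRK/HERK/SYR2K/HER2K routines listed above, where C is square. The group-ID test at the top
// skips workgroup tiles whose origin lies on the wrong side of the diagonal.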
// Main entry point of the kernel. This is the lower-triangular version.
__kernel __attribute__((reqd_work_group_size(MDIMC, NDIMC, 1)))
void XgemmLower(const int kSizeN, const int kSizeK,
const __constant real* restrict arg_alpha,
const __constant real* restrict arg_beta,
const __global realM* restrict agm,
const __global realN* restrict bgm,
__global realM* cgm) {
const real alpha = arg_alpha[0];
const real beta = arg_beta[0];
// Skip these threads if they do not contain threads contributing to the lower-triangle
if (GetGroupID1()*NWG > GetGroupID0()*MWG) {
return;
}
// Allocates workgroup-private memory (local memory)
#if SA == 1
__local realM alm[KWG * MWG/VWM];
#endif
#if SB == 1
__local realN blm[KWG * NWG/VWN];
#endif
// Computes the matrix-multiplication and stores the result in register memory
realM cpm[NWI][MWI/VWM];
#if SA == 1 && SB == 1
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm, alm, blm);
#elif SA == 1
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm, alm);
#elif SB == 1
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm, blm);
#else
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm);
#endif
// Stores an MWG * NWG tile of results and performs the multiplication with alpha and beta
StoreResults(cgm, cpm, kSizeN, alpha, beta);
}
// =================================================================================================
// If not using a triangular version, include the regular kernel
#else
// Main entry point of the kernel. This is the regular full version.
__kernel __attribute__((reqd_work_group_size(MDIMC, NDIMC, 1)))
void Xgemm(const int kSizeM, const int kSizeN, const int kSizeK,
const __constant real* restrict arg_alpha,
const __constant real* restrict arg_beta,
const __global realM* restrict agm,
const __global realN* restrict bgm,
__global realM* cgm) {
const real alpha = arg_alpha[0];
const real beta = arg_beta[0];
// Allocates workgroup-private memory (local memory)
#if SA == 1
__local realM alm[KWG * MWG/VWM];
#endif
#if SB == 1
__local realN blm[KWG * NWG/VWN];
#endif
// Computes the matrix-multiplication and stores the result in register memory
realM cpm[NWI][MWI/VWM];
#if SA == 1 && SB == 1
XgemmBody(kSizeM, kSizeN, kSizeK, agm, bgm, cgm, cpm, alm, blm);
#elif SA == 1
XgemmBody(kSizeM, kSizeN, kSizeK, agm, bgm, cgm, cpm, alm);
#elif SB == 1
XgemmBody(kSizeM, kSizeN, kSizeK, agm, bgm, cgm, cpm, blm);
#else
XgemmBody(kSizeM, kSizeN, kSizeK, agm, bgm, cgm, cpm);
#endif
// Stores an MWG * NWG tile of results and performs the multiplication with alpha and beta
StoreResults(cgm, cpm, kSizeM, alpha, beta);
}
#endif
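// Hedged host-side sketch (not part of the dumped kernel source): the reqd_work_group_size
// attribute fixes the local size at (MDIMC, NDIMC, 1). Assuming kSizeM and kSizeN are already
// padded to multiples of MWG and NWG, a plausible launch of the regular kernel would be:
//   size_t local[2]  = {MDIMC, NDIMC};
//   size_t global[2] = {(kSizeM * MDIMC) / MWG, (kSizeN * NDIMC) / NWG};
//   clEnqueueNDRangeKernel(queue, xgemm_kernel, 2, NULL, global, local, 0, NULL, NULL);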
// =================================================================================================
// End of the C++11 raw string literal
F0829 11:54:00.688835 5841 greentea_math_functions.cpp:254] Check failed: static_cast<int>(status) == static_cast<int>(clblast::StatusCode::kSuccess) (-11 vs. 0) GREENTEA ERROR: CLBlast error
*** Check failure stack trace: ***
@ 0x76ea4c24 (unknown)
@ 0x76ea6cfc (unknown)
@ 0x76ea4814 (unknown)
@ 0x76ea7648 (unknown)
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
Makefile:657: recipe for target 'runtest' failed
make: *** [runtest] Aborted
iCar-G3-446:/dev/disk/mounted/sda1/opencl/caffe #
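[Hedged note on the failure above: CLBlast status codes mirror OpenCL error codes for negative values, and -11 corresponds to CL_BUILD_PROGRAM_FAILURE, so the Xgemm kernel source dumped above most likely failed to compile for this device; that is what trips the CHECK in greentea_math_functions.cpp:254 and aborts the runtest target.]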