@Mezzano
Last active August 29, 2016 16:03
caffe make runtest -j4
iCar-G3-446:/dev/disk/mounted/sda1/opencl/caffe # make runtest -j4
CXX/LD -o .build_release/test/test_all.testbin src/caffe/test/test_caffe_main.cpp
In file included from ../caffe_requirements/ViennaCL-1.7.1/viennacl/scheduler/preset.hpp:21:0,
from ../caffe_requirements/ViennaCL-1.7.1/viennacl/linalg/opencl/kernels/vector.hpp:25,
from ../caffe_requirements/ViennaCL-1.7.1/viennacl/linalg/opencl/vector_operations.hpp:35,
from ../caffe_requirements/ViennaCL-1.7.1/viennacl/linalg/vector_operations.hpp:39,
from ../caffe_requirements/ViennaCL-1.7.1/viennacl/vector.hpp:33,
from ./include/caffe/greentea/greentea.hpp:40,
from ./include/caffe/common.hpp:25,
from ./include/caffe/blob.hpp:8,
from ./include/caffe/caffe.hpp:7,
from src/caffe/test/test_caffe_main.cpp:3:
../caffe_requirements/ViennaCL-1.7.1/viennacl/device_specific/forwards.h:129:20: warning: ‘std::string viennacl::device_specific::generate_value_kernel_argument(const string&, const string&)’ defined but not used [-Wunused-function]
static std::string generate_value_kernel_argument(std::string const & scalartype, std::string const & name)
^
../caffe_requirements/ViennaCL-1.7.1/viennacl/device_specific/forwards.h:135:20: warning: ‘std::string viennacl::device_specific::generate_pointer_kernel_argument(const string&, const string&, const string&)’ defined but not used [-Wunused-function]
static std::string generate_pointer_kernel_argument(std::string const & address_space, std::string const & scalartype, std::string const & name)
^
.build_release/tools/caffe
caffe: command line brew
usage: caffe <command> <args>
commands:
train train or finetune a model
test score a model
device_query show GPU diagnostic information
time benchmark model execution time
autotune autotune a model
Flags from tools/caffe.cpp:
-gpu (Optional; run in GPU mode on given device IDs separated by ','.Use
'-gpu all' to run on all available GPUs. The effective training batch
size is multiplied by the number of devices.) type: string default: ""
-iterations (The number of iterations to run.) type: int32 default: 50
-level (Optional; network level.) type: int32 default: 0
-model (The model definition protocol buffer text file.) type: string
default: ""
-phase (Optional; network phase (TRAIN or TEST). Only used for 'time'.)
type: string default: ""
-sighup_effect (Optional; action to take when a SIGHUP signal is received:
snapshot, stop or none.) type: string default: "snapshot"
-sigint_effect (Optional; action to take when a SIGINT signal is received:
snapshot, stop or none.) type: string default: "stop"
-snapshot (Optional; the snapshot solver state to resume training.)
type: string default: ""
-solver (The solver definition protocol buffer text file.) type: string
default: ""
-stage (Optional; network stages (not to be confused with phase), separated
by ','.) type: string default: ""
-weights (Optional; the pretrained weights to initialize finetuning,
separated by ','. Cannot be set simultaneously with snapshot.)
type: string default: ""
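(For reference, a typical invocation assembled from the commands and flags listed above might look as follows; the prototxt paths and device ID are placeholders, not taken from this run.)
caffe train -solver path/to/solver.prototxt -gpu 0
caffe time -model path/to/train_val.prototxt -iterations 50 -gpu 0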
.build_release/test/test_all.testbin 0 --gtest_shuffle
Setting to use device 0
Note: Randomizing tests' orders with a seed of 49319 .
[==========] Running 2046 tests from 274 test cases.
[----------] Global test environment set-up.
[----------] 5 tests from BenchmarkTest/1, where TypeParam = caffe::CPUDevice<double>
[ RUN ] BenchmarkTest/1.TestTimerSeconds
[ OK ] BenchmarkTest/1.TestTimerSeconds (318 ms)
[ RUN ] BenchmarkTest/1.TestTimerConstructor
[ OK ] BenchmarkTest/1.TestTimerConstructor (0 ms)
[ RUN ] BenchmarkTest/1.TestTimerMilliSeconds
[ OK ] BenchmarkTest/1.TestTimerMilliSeconds (314 ms)
[ RUN ] BenchmarkTest/1.TestTimerStop
[ OK ] BenchmarkTest/1.TestTimerStop (0 ms)
[ RUN ] BenchmarkTest/1.TestTimerStart
[ OK ] BenchmarkTest/1.TestTimerStart (1 ms)
[----------] 5 tests from BenchmarkTest/1 (634 ms total)
[----------] 9 tests from InnerProductLayerTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN ] InnerProductLayerTest/0.TestSetUpTranposeTrue
[ OK ] InnerProductLayerTest/0.TestSetUpTranposeTrue (9 ms)
[ RUN ] InnerProductLayerTest/0.TestForward
[ OK ] InnerProductLayerTest/0.TestForward (16 ms)
[ RUN ] InnerProductLayerTest/0.TestSetUp
[ OK ] InnerProductLayerTest/0.TestSetUp (1 ms)
[ RUN ] InnerProductLayerTest/0.TestGradient
[ OK ] InnerProductLayerTest/0.TestGradient (1950 ms)
[ RUN ] InnerProductLayerTest/0.TestGradientTranspose
[ OK ] InnerProductLayerTest/0.TestGradientTranspose (2308 ms)
[ RUN ] InnerProductLayerTest/0.TestBackwardTranspose
[ OK ] InnerProductLayerTest/0.TestBackwardTranspose (2 ms)
[ RUN ] InnerProductLayerTest/0.TestForwardTranspose
[ OK ] InnerProductLayerTest/0.TestForwardTranspose (1 ms)
[ RUN ] InnerProductLayerTest/0.TestForwardNoBatch
[ OK ] InnerProductLayerTest/0.TestForwardNoBatch (0 ms)
[ RUN ] InnerProductLayerTest/0.TestSetUpTranposeFalse
[ OK ] InnerProductLayerTest/0.TestSetUpTranposeFalse (1 ms)
[----------] 9 tests from InnerProductLayerTest/0 (4291 ms total)
[----------] 5 tests from ImageDataLayerTest/1, where TypeParam = caffe::CPUDevice<double>
[ RUN ] ImageDataLayerTest/1.TestResize
[ OK ] ImageDataLayerTest/1.TestResize (764 ms)
[ RUN ] ImageDataLayerTest/1.TestSpace
[ OK ] ImageDataLayerTest/1.TestSpace (247 ms)
[ RUN ] ImageDataLayerTest/1.TestShuffle
[ OK ] ImageDataLayerTest/1.TestShuffle (1038 ms)
[ RUN ] ImageDataLayerTest/1.TestReshape
[ OK ] ImageDataLayerTest/1.TestReshape (242 ms)
[ RUN ] ImageDataLayerTest/1.TestRead
[ OK ] ImageDataLayerTest/1.TestRead (1012 ms)
[----------] 5 tests from ImageDataLayerTest/1 (3304 ms total)
[----------] 2 tests from SigmoidCrossEntropyLossLayerTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN ] SigmoidCrossEntropyLossLayerTest/3.TestGradient
[ OK ] SigmoidCrossEntropyLossLayerTest/3.TestGradient (1 ms)
[ RUN ] SigmoidCrossEntropyLossLayerTest/3.TestSigmoidCrossEntropyLoss
[ OK ] SigmoidCrossEntropyLossLayerTest/3.TestSigmoidCrossEntropyLoss (0 ms)
[----------] 2 tests from SigmoidCrossEntropyLossLayerTest/3 (2 ms total)
[----------] 2 tests from HingeLossLayerTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN ] HingeLossLayerTest/3.TestGradientL2
[ OK ] HingeLossLayerTest/3.TestGradientL2 (0 ms)
[ RUN ] HingeLossLayerTest/3.TestGradientL1
[ OK ] HingeLossLayerTest/3.TestGradientL1 (0 ms)
[----------] 2 tests from HingeLossLayerTest/3 (1 ms total)
[----------] 8 tests from RMSPropSolverTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithWeightDecay
[ OK ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithWeightDecay (303 ms)
[ RUN ] RMSPropSolverTest/0.TestLeastSquaresUpdateWithEverythingAccumShare
[ OK ] RMSPropSolverTest/0.TestLeastSquaresUpdateWithEverythingAccumShare (44 ms)
[ RUN ] RMSPropSolverTest/0.TestLeastSquaresUpdateWithEverythingAccum
[ OK ] RMSPropSolverTest/0.TestLeastSquaresUpdateWithEverythingAccum (31 ms)
[ RUN ] RMSPropSolverTest/0.TestSnapshotShare
[ OK ] RMSPropSolverTest/0.TestSnapshotShare (270 ms)
[ RUN ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithEverything
[ OK ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithEverything (721 ms)
[ RUN ] RMSPropSolverTest/0.TestSnapshot
[ OK ] RMSPropSolverTest/0.TestSnapshot (192 ms)
[ RUN ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithEverythingShare
[ OK ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithEverythingShare (1171 ms)
[ RUN ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithRmsDecay
[ OK ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithRmsDecay (721 ms)
[----------] 8 tests from RMSPropSolverTest/0 (3455 ms total)
[----------] 1 test from SolverTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN ] SolverTest/0.TestInitTrainTestNets
[ OK ] SolverTest/0.TestInitTrainTestNets (34 ms)
[----------] 1 test from SolverTest/0 (34 ms total)
[----------] 3 tests from BlobMathTest/2, where TypeParam = caffe::GPUDevice<float>
[ RUN ] BlobMathTest/2.TestSumOfSquares
[ OK ] BlobMathTest/2.TestSumOfSquares (74 ms)
[ RUN ] BlobMathTest/2.TestScaleData
[ OK ] BlobMathTest/2.TestScaleData (86 ms)
[ RUN ] BlobMathTest/2.TestAsum
[ OK ] BlobMathTest/2.TestAsum (38 ms)
[----------] 3 tests from BlobMathTest/2 (199 ms total)
[----------] 12 tests from SGDSolverTest/2, where TypeParam = caffe::GPUDevice<float>
[ RUN ] SGDSolverTest/2.TestLeastSquaresUpdate
OpenCL compiler error/warning:
(1491:0) : error : function: 'InitAccRegisters' hasn't the corresponding declaration
(1536:0) : error : function: 'MultiplyAccumulate' hasn't the corresponding declaration
(1664:0) : error : function: 'XgemmBody' hasn't the corresponding declaration
(1668:0) : error : function: 'StoreResults' hasn't the corresponding declaration
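(These messages read as if the device's OpenCL compiler requires an explicit prototype before each helper function is used later in the concatenated kernel source. A minimal sketch of such forward declarations, assuming they are placed after the realM/realN typedefs and with signatures copied from the definitions that appear further down in this dump, is shown below; XgemmBody is defined past the point where this log is cut off, so its prototype is left out.)
// Sketch of a possible workaround, not part of the original log:
// forward declarations matching the helper definitions shown later in the dump.
void InitAccRegisters(realM cpm[NWI][MWI/VWM]);
inline void MultiplyAccumulate(realM cpm[NWI][MWI/VWM], realM apm[MWI/VWM], realN bpm[NWI/VWN]);
inline void StoreResults(__global realM* cgm, realM cpm[NWI][MWI/VWM], const int kSizeM,
                         const real alpha, const real beta);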
OpenCL code program was:
#define KWG 16
#define KWI 2
#define MDIMA 8
#define MDIMC 8
#define STRM 0
#define MWG 32
#define NDIMC 8
#define SB 0
#define STRN 0
#define VWM 1
#define PAD_WPTX 1
#define VWN 1
#define NWG 64
#define PAD_WPTY 1
#define COPY_VW 1
#define PAD_DIMX 8
#define COPY_DIMY 8
#define NDIMB 8
#define COPY_DIMX 8
#define SA 0
#define COPY_WPT 1
#define TRA_SHUFFLE 0
#define PAD_DIMY 8
#define TRA_PAD 0
#define TRA_DIM 4
#define PADTRA_WPT 1
#define PADTRA_PAD 0
#define TRA_WPT 1
#define PADTRA_TILE 8
#define PRECISION 32
#define ROUTINE_GEMM
// =================================================================================================
// Parameters set by the tuner or by the database. Here they are given a basic default value in case
// this file is used outside of the CLBlast library.
#ifndef PRECISION
#define PRECISION 32 // Data-types: half, single or double precision, complex or regular
#endif
// =================================================================================================
// Enable support for half-precision
#if PRECISION == 16
#pragma OPENCL EXTENSION cl_khr_fp16: enable
#endif
// Enable support for double-precision
#if PRECISION == 64 || PRECISION == 6464
#if __OPENCL_VERSION__ <= CL_VERSION_1_1
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#endif
#endif
// Half-precision
#if PRECISION == 16
typedef half real;
typedef half2 real2;
typedef half4 real4;
typedef half8 real8;
typedef half16 real16;
#define ZERO 0
#define ONE 1
#define SMALLEST -1.0e14
// Single-precision
#elif PRECISION == 32
typedef float real;
typedef float2 real2;
typedef float4 real4;
typedef float8 real8;
typedef float16 real16;
#define ZERO 0.0f
#define ONE 1.0f
#define SMALLEST -1.0e37f
// Double-precision
#elif PRECISION == 64
typedef double real;
typedef double2 real2;
typedef double4 real4;
typedef double8 real8;
typedef double16 real16;
#define ZERO 0.0
#define ONE 1.0
#define SMALLEST -1.0e37
// Complex single-precision
#elif PRECISION == 3232
typedef struct cfloat {float x; float y;} real;
typedef struct cfloat2 {real x; real y;} real2;
typedef struct cfloat4 {real x; real y; real z; real w;} real4;
typedef struct cfloat8 {real s0; real s1; real s2; real s3;
real s4; real s5; real s6; real s7;} real8;
typedef struct cfloat16 {real s0; real s1; real s2; real s3;
real s4; real s5; real s6; real s7;
real s8; real s9; real sA; real sB;
real sC; real sD; real sE; real sF;} real16;
#define ZERO 0.0f
#define ONE 1.0f
#define SMALLEST -1.0e37f
// Complex double-precision
#elif PRECISION == 6464
typedef struct cdouble {double x; double y;} real;
typedef struct cdouble2 {real x; real y;} real2;
typedef struct cdouble4 {real x; real y; real z; real w;} real4;
typedef struct cdouble8 {real s0; real s1; real s2; real s3;
real s4; real s5; real s6; real s7;} real8;
typedef struct cdouble16 {real s0; real s1; real s2; real s3;
real s4; real s5; real s6; real s7;
real s8; real s9; real sA; real sB;
real sC; real sD; real sE; real sF;} real16;
#define ZERO 0.0
#define ONE 1.0
#define SMALLEST -1.0e37
#endif
// Single-element version of a complex number
#if PRECISION == 3232
typedef float singlereal;
#elif PRECISION == 6464
typedef double singlereal;
#else
typedef real singlereal;
#endif
// =================================================================================================
// Don't use the non-IEEE754 compliant OpenCL built-in mad() instruction per default. For specific
// devices, this is enabled (see src/routine.cc).
#ifndef USE_CL_MAD
#define USE_CL_MAD 0
#endif
// Sets a variable to zero
#if PRECISION == 3232 || PRECISION == 6464
#define SetToZero(a) a.x = ZERO; a.y = ZERO
#else
#define SetToZero(a) a = ZERO
#endif
// Sets a variable to zero (only the imaginary part)
#if PRECISION == 3232 || PRECISION == 6464
#define ImagToZero(a) a.y = ZERO
#else
#define ImagToZero(a)
#endif
// Sets a variable to one
#if PRECISION == 3232 || PRECISION == 6464
#define SetToOne(a) a.x = ONE; a.y = ZERO
#else
#define SetToOne(a) a = ONE
#endif
// The absolute value (component-wise)
#if PRECISION == 3232 || PRECISION == 6464
#define AbsoluteValue(value) value.x = fabs(value.x); value.y = fabs(value.y)
#else
#define AbsoluteValue(value) value = fabs(value)
#endif
// Adds two complex variables
#if PRECISION == 3232 || PRECISION == 6464
#define Add(c, a, b) c.x = a.x + b.x; c.y = a.y + b.y
#else
#define Add(c, a, b) c = a + b
#endif
// Multiply two complex variables (used in the defines below)
#if PRECISION == 3232 || PRECISION == 6464
#define MulReal(a, b) a.x*b.x - a.y*b.y
#define MulImag(a, b) a.x*b.y + a.y*b.x
#endif
// The scalar multiply function
#if PRECISION == 3232 || PRECISION == 6464
#define Multiply(c, a, b) c.x = MulReal(a,b); c.y = MulImag(a,b)
#else
#define Multiply(c, a, b) c = a * b
#endif
// The scalar multiply-add function
#if PRECISION == 3232 || PRECISION == 6464
#define MultiplyAdd(c, a, b) c.x += MulReal(a,b); c.y += MulImag(a,b)
#else
#if USE_CL_MAD == 1
#define MultiplyAdd(c, a, b) c = mad(a, b, c)
#else
#define MultiplyAdd(c, a, b) c += a * b
#endif
#endif
// The scalar AXPBY function
#if PRECISION == 3232 || PRECISION == 6464
#define AXPBY(e, a, b, c, d) e.x = MulReal(a,b) + MulReal(c,d); e.y = MulImag(a,b) + MulImag(c,d)
#else
#define AXPBY(e, a, b, c, d) e = a*b + c*d
#endif
// The complex conjugate operation for complex transforms
#if PRECISION == 3232 || PRECISION == 6464
#define COMPLEX_CONJUGATE(value) value.x = value.x; value.y = -value.y
#else
#define COMPLEX_CONJUGATE(value) value = value
#endif
// =================================================================================================
// Shuffled workgroup indices to avoid partition camping, see below. For specific devices, this is
// enabled (see src/routine.cc).
#ifndef USE_STAGGERED_INDICES
#define USE_STAGGERED_INDICES 0
#endif
// Staggered/shuffled group indices to avoid partition camping (AMD GPUs). Formulas are taken from:
// http://docs.nvidia.com/cuda/samples/6_Advanced/transpose/doc/MatrixTranspose.pdf
// More details: https://github.com/CNugteren/CLBlast/issues/53
#if USE_STAGGERED_INDICES == 1
inline size_t GetGroupIDFlat() {
return get_group_id(0) + get_num_groups(0) * get_group_id(1);
}
inline size_t GetGroupID1() {
return (GetGroupIDFlat()) % get_num_groups(1);
}
inline size_t GetGroupID0() {
return ((GetGroupIDFlat() / get_num_groups(1)) + GetGroupID1()) % get_num_groups(0);
}
#else
inline size_t GetGroupID1() { return get_group_id(1); }
inline size_t GetGroupID0() { return get_group_id(0); }
#endif
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
// Parameters set by the tuner or by the database. Here they are given a basic default value in case
// this kernel file is used outside of the CLBlast library.
// For the 'fast' copy kernel
#ifndef COPY_DIMX
#define COPY_DIMX 8 // Local workgroup size in the first dimension (x)
#endif
#ifndef COPY_DIMY
#define COPY_DIMY 8 // Local workgroup size in the second dimension (y)
#endif
#ifndef COPY_WPT
#define COPY_WPT 1 // Work per thread in the first dimension (x)
#endif
#ifndef COPY_VW
#define COPY_VW 1 // Vector width in the second dimension (y)
#endif
// For the padding/copy kernels and the conversion kernels
#ifndef PAD_DIMX
#define PAD_DIMX 8 // Local workgroup size in the first dimension (x)
#endif
#ifndef PAD_DIMY
#define PAD_DIMY 8 // Local workgroup size in the second dimension (y)
#endif
#ifndef PAD_WPTX
#define PAD_WPTX 1 // Work per thread in the first dimension (x)
#endif
#ifndef PAD_WPTY
#define PAD_WPTY 1 // Work per thread in the second dimension (y)
#endif
// For the 'fast' transpose kernel
#ifndef TRA_DIM
#define TRA_DIM 8 // Number of local threads in the two dimensions (x,y)
#endif
#ifndef TRA_WPT
#define TRA_WPT 1 // Work per thread in one dimension and vector-width in the other
#endif
#ifndef TRA_PAD
#define TRA_PAD 0 // Padding of the local memory to avoid bank-conflicts
#endif
#ifndef TRA_SHUFFLE
#define TRA_SHUFFLE 0 // Shuffling of the global indices to avoid global memory bank-conflicts
#endif
// For the padding/transpose kernels
#ifndef PADTRA_TILE
#define PADTRA_TILE 8 // Number of local threads in the two dimensions (x,y)
#endif
#ifndef PADTRA_WPT
#define PADTRA_WPT 1 // Amount of work per thread
#endif
#ifndef PADTRA_PAD
#define PADTRA_PAD 0 // Padding of the local memory to avoid bank-conflicts
#endif
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
// Data-widths
#if COPY_VW == 1
typedef real realC;
#elif COPY_VW == 2
typedef real2 realC;
#elif COPY_VW == 4
typedef real4 realC;
#elif COPY_VW == 8
typedef real8 realC;
#elif COPY_VW == 16
typedef real16 realC;
#endif
// =================================================================================================
// Fast copy kernel. Requires 'ld' and the number of threads in dimension 0 to be a multiple of
// COPY_VW. Also requires both matrices to be of the same dimensions and without offset.
__kernel __attribute__((reqd_work_group_size(COPY_DIMX, COPY_DIMY, 1)))
void CopyMatrixFast(const int ld,
__global const realC* restrict src,
__global realC* dest,
const __constant real* restrict arg_alpha) {
const real alpha = arg_alpha[0];
#pragma unroll
for (int w_one=0; w_one<COPY_WPT; ++w_one) {
const int id_one = get_global_id(0);
const int id_two = (get_group_id(1)*COPY_WPT + w_one) * COPY_DIMY + get_local_id(1);
const int id = id_two*(ld/COPY_VW) + id_one;
realC result;
#if COPY_VW == 1
Multiply(result, alpha, src[id]);
#elif COPY_VW == 2
Multiply(result.x, alpha, src[id].x);
Multiply(result.y, alpha, src[id].y);
#elif COPY_VW == 4
Multiply(result.x, alpha, src[id].x);
Multiply(result.y, alpha, src[id].y);
Multiply(result.z, alpha, src[id].z);
Multiply(result.w, alpha, src[id].w);
#elif COPY_VW == 8
Multiply(result.s0, alpha, src[id].s0);
Multiply(result.s1, alpha, src[id].s1);
Multiply(result.s2, alpha, src[id].s2);
Multiply(result.s3, alpha, src[id].s3);
Multiply(result.s4, alpha, src[id].s4);
Multiply(result.s5, alpha, src[id].s5);
Multiply(result.s6, alpha, src[id].s6);
Multiply(result.s7, alpha, src[id].s7);
#elif COPY_VW == 16
Multiply(result.s0, alpha, src[id].s0);
Multiply(result.s1, alpha, src[id].s1);
Multiply(result.s2, alpha, src[id].s2);
Multiply(result.s3, alpha, src[id].s3);
Multiply(result.s4, alpha, src[id].s4);
Multiply(result.s5, alpha, src[id].s5);
Multiply(result.s6, alpha, src[id].s6);
Multiply(result.s7, alpha, src[id].s7);
Multiply(result.s8, alpha, src[id].s8);
Multiply(result.s9, alpha, src[id].s9);
Multiply(result.sA, alpha, src[id].sA);
Multiply(result.sB, alpha, src[id].sB);
Multiply(result.sC, alpha, src[id].sC);
Multiply(result.sD, alpha, src[id].sD);
Multiply(result.sE, alpha, src[id].sE);
Multiply(result.sF, alpha, src[id].sF);
#endif
dest[id] = result;
}
}
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
// Copies a matrix from source to destination. The output is padded with zero values in case the
// destination matrix dimensions are larger than the source matrix dimensions. Additionally, the ld
// value and offset can be different.
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void CopyPadMatrix(const int src_one, const int src_two,
const int src_ld, const int src_offset,
__global const real* restrict src,
const int dest_one, const int dest_two,
const int dest_ld, const int dest_offset,
__global real* dest,
const __constant real* restrict arg_alpha,
const int do_conjugate) {
const real alpha = arg_alpha[0];
// Loops over the work per thread in both dimensions
#pragma unroll
for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
#pragma unroll
for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
if (id_two < dest_two && id_one < dest_one) {
// Loads data if the thread IDs are within bounds of the source matrix. Otherwise, set the
// value to be written to zero.
real value;
SetToZero(value);
if (id_two < src_two && id_one < src_one) {
value = src[id_two*src_ld + id_one + src_offset];
}
// Stores the value in the destination matrix
if (do_conjugate == 1) { COMPLEX_CONJUGATE(value); }
Multiply(dest[id_two*dest_ld + id_one + dest_offset], alpha, value);
}
}
}
}
// =================================================================================================
// Same as above, but now un-pads a matrix. This kernel reads data from a padded source matrix, but
// writes only the actual data back to the destination matrix. Again, the ld value and offset can
// be different.
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void CopyMatrix(const int src_one, const int src_two,
const int src_ld, const int src_offset,
__global const real* restrict src,
const int dest_one, const int dest_two,
const int dest_ld, const int dest_offset,
__global real* dest,
const __constant real* restrict arg_alpha,
const int upper, const int lower,
const int diagonal_imag_zero) {
const real alpha = arg_alpha[0];
// Loops over the work per thread in both dimensions
#pragma unroll
for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
#pragma unroll
for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
// Masking in case of triangular matrices: updates only the upper or lower part
bool condition = true;
#if defined(ROUTINE_SYRK) || defined(ROUTINE_HERK) || defined(ROUTINE_SYR2K) || defined(ROUTINE_HER2K)
if (upper == 1) { condition = (id_two >= id_one); }
else if (lower == 1) { condition = (id_two <= id_one); }
#endif
if (condition) {
// Copies the value into the destination matrix. This is always within bounds of the source
// matrix, as we know that the destination matrix is smaller or equal to the source.
if (id_two < dest_two && id_one < dest_one) {
real value = src[id_two*src_ld + id_one + src_offset];
if (diagonal_imag_zero == 1 && id_one == id_two) { ImagToZero(value); }
Multiply(dest[id_two*dest_ld + id_one + dest_offset], alpha, value);
}
}
}
}
}
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
// Data-widths
#if TRA_WPT == 1
typedef real realT;
#elif TRA_WPT == 2
typedef real2 realT;
#elif TRA_WPT == 4
typedef real4 realT;
#elif TRA_WPT == 8
typedef real8 realT;
#elif TRA_WPT == 16
typedef real16 realT;
#endif
// =================================================================================================
// Transposes and copies a matrix. Requires both matrices to be of the same dimensions and without
// offset. A more general version is available in 'padtranspose.opencl'.
__kernel __attribute__((reqd_work_group_size(TRA_DIM, TRA_DIM, 1)))
void TransposeMatrixFast(const int ld,
__global const realT* restrict src,
__global realT* dest,
const __constant real* restrict arg_alpha) {
const real alpha = arg_alpha[0];
// Sets the group identifiers. They might be 'shuffled' around to distribute work in a different
// way over workgroups, breaking memory-bank dependencies.
const int gid0 = get_group_id(0);
#if TRA_SHUFFLE == 1
const int gid1 = (get_group_id(0) + get_group_id(1)) % get_num_groups(0);
#else
const int gid1 = get_group_id(1);
#endif
// Local memory to store a tile of the matrix (for coalescing)
__local realT tile[TRA_WPT*TRA_DIM][TRA_DIM + TRA_PAD];
// Loops over the work per thread
#pragma unroll
for (int w_one=0; w_one<TRA_WPT; ++w_one) {
// Computes the identifiers for the source matrix. Note that the local and global dimensions
// do not correspond to each other!
const int id_one = gid1 * TRA_DIM + get_local_id(0);
const int id_two = (gid0 * TRA_DIM + get_local_id(1))*TRA_WPT + w_one;
// Loads data into the local memory
realT value = src[id_two*(ld/TRA_WPT) + id_one];
tile[get_local_id(0)*TRA_WPT + w_one][get_local_id(1)] = value;
}
// Synchronizes all threads in a workgroup
barrier(CLK_LOCAL_MEM_FENCE);
// Loads transposed data from the local memory
realT v[TRA_WPT];
#pragma unroll
for (int w_one=0; w_one<TRA_WPT; ++w_one) {
v[w_one] = tile[get_local_id(1)*TRA_WPT + w_one][get_local_id(0)];
}
// Performs the register-level transpose of the vectorized data
realT results[TRA_WPT];
#if TRA_WPT == 1
results[0] = v[0];
#elif TRA_WPT == 2
results[0] = (realT) {v[0].x, v[1].x};
results[1] = (realT) {v[0].y, v[1].y};
#elif TRA_WPT == 4
results[0] = (realT) {v[0].x, v[1].x, v[2].x, v[3].x};
results[1] = (realT) {v[0].y, v[1].y, v[2].y, v[3].y};
results[2] = (realT) {v[0].z, v[1].z, v[2].z, v[3].z};
results[3] = (realT) {v[0].w, v[1].w, v[2].w, v[3].w};
#elif TRA_WPT == 8
results[0] = (realT) {v[0].s0, v[1].s0, v[2].s0, v[3].s0, v[4].s0, v[5].s0, v[6].s0, v[7].s0};
results[1] = (realT) {v[0].s1, v[1].s1, v[2].s1, v[3].s1, v[4].s1, v[5].s1, v[6].s1, v[7].s1};
results[2] = (realT) {v[0].s2, v[1].s2, v[2].s2, v[3].s2, v[4].s2, v[5].s2, v[6].s2, v[7].s2};
results[3] = (realT) {v[0].s3, v[1].s3, v[2].s3, v[3].s3, v[4].s3, v[5].s3, v[6].s3, v[7].s3};
results[4] = (realT) {v[0].s4, v[1].s4, v[2].s4, v[3].s4, v[4].s4, v[5].s4, v[6].s4, v[7].s4};
results[5] = (realT) {v[0].s5, v[1].s5, v[2].s5, v[3].s5, v[4].s5, v[5].s5, v[6].s5, v[7].s5};
results[6] = (realT) {v[0].s6, v[1].s6, v[2].s6, v[3].s6, v[4].s6, v[5].s6, v[6].s6, v[7].s6};
results[7] = (realT) {v[0].s7, v[1].s7, v[2].s7, v[3].s7, v[4].s7, v[5].s7, v[6].s7, v[7].s7};
#elif TRA_WPT == 16
results[ 0] = (realT) {v[0].s0, v[1].s0, v[2].s0, v[3].s0, v[4].s0, v[5].s0, v[6].s0, v[7].s0, v[8].s0, v[9].s0, v[10].s0, v[11].s0, v[12].s0, v[13].s0, v[14].s0, v[15].s0};
results[ 1] = (realT) {v[0].s1, v[1].s1, v[2].s1, v[3].s1, v[4].s1, v[5].s1, v[6].s1, v[7].s1, v[8].s1, v[9].s1, v[10].s1, v[11].s1, v[12].s1, v[13].s1, v[14].s1, v[15].s1};
results[ 2] = (realT) {v[0].s2, v[1].s2, v[2].s2, v[3].s2, v[4].s2, v[5].s2, v[6].s2, v[7].s2, v[8].s2, v[9].s2, v[10].s2, v[11].s2, v[12].s2, v[13].s2, v[14].s2, v[15].s2};
results[ 3] = (realT) {v[0].s3, v[1].s3, v[2].s3, v[3].s3, v[4].s3, v[5].s3, v[6].s3, v[7].s3, v[8].s3, v[9].s3, v[10].s3, v[11].s3, v[12].s3, v[13].s3, v[14].s3, v[15].s3};
results[ 4] = (realT) {v[0].s4, v[1].s4, v[2].s4, v[3].s4, v[4].s4, v[5].s4, v[6].s4, v[7].s4, v[8].s4, v[9].s4, v[10].s4, v[11].s4, v[12].s4, v[13].s4, v[14].s4, v[15].s4};
results[ 5] = (realT) {v[0].s5, v[1].s5, v[2].s5, v[3].s5, v[4].s5, v[5].s5, v[6].s5, v[7].s5, v[8].s5, v[9].s5, v[10].s5, v[11].s5, v[12].s5, v[13].s5, v[14].s5, v[15].s5};
results[ 6] = (realT) {v[0].s6, v[1].s6, v[2].s6, v[3].s6, v[4].s6, v[5].s6, v[6].s6, v[7].s6, v[8].s6, v[9].s6, v[10].s6, v[11].s6, v[12].s6, v[13].s6, v[14].s6, v[15].s6};
results[ 7] = (realT) {v[0].s7, v[1].s7, v[2].s7, v[3].s7, v[4].s7, v[5].s7, v[6].s7, v[7].s7, v[8].s7, v[9].s7, v[10].s7, v[11].s7, v[12].s7, v[13].s7, v[14].s7, v[15].s7};
results[ 8] = (realT) {v[0].s8, v[1].s8, v[2].s8, v[3].s8, v[4].s8, v[5].s8, v[6].s8, v[7].s8, v[8].s8, v[9].s8, v[10].s8, v[11].s8, v[12].s8, v[13].s8, v[14].s8, v[15].s8};
results[ 9] = (realT) {v[0].s9, v[1].s9, v[2].s9, v[3].s9, v[4].s9, v[5].s9, v[6].s9, v[7].s9, v[8].s9, v[9].s9, v[10].s9, v[11].s9, v[12].s9, v[13].s9, v[14].s9, v[15].s9};
results[10] = (realT) {v[0].sA, v[1].sA, v[2].sA, v[3].sA, v[4].sA, v[5].sA, v[6].sA, v[7].sA, v[8].sA, v[9].sA, v[10].sA, v[11].sA, v[12].sA, v[13].sA, v[14].sA, v[15].sA};
results[11] = (realT) {v[0].sB, v[1].sB, v[2].sB, v[3].sB, v[4].sB, v[5].sB, v[6].sB, v[7].sB, v[8].sB, v[9].sB, v[10].sB, v[11].sB, v[12].sB, v[13].sB, v[14].sB, v[15].sB};
results[12] = (realT) {v[0].sC, v[1].sC, v[2].sC, v[3].sC, v[4].sC, v[5].sC, v[6].sC, v[7].sC, v[8].sC, v[9].sC, v[10].sC, v[11].sC, v[12].sC, v[13].sC, v[14].sC, v[15].sC};
results[13] = (realT) {v[0].sD, v[1].sD, v[2].sD, v[3].sD, v[4].sD, v[5].sD, v[6].sD, v[7].sD, v[8].sD, v[9].sD, v[10].sD, v[11].sD, v[12].sD, v[13].sD, v[14].sD, v[15].sD};
results[14] = (realT) {v[0].sE, v[1].sE, v[2].sE, v[3].sE, v[4].sE, v[5].sE, v[6].sE, v[7].sE, v[8].sE, v[9].sE, v[10].sE, v[11].sE, v[12].sE, v[13].sE, v[14].sE, v[15].sE};
results[15] = (realT) {v[0].sF, v[1].sF, v[2].sF, v[3].sF, v[4].sF, v[5].sF, v[6].sF, v[7].sF, v[8].sF, v[9].sF, v[10].sF, v[11].sF, v[12].sF, v[13].sF, v[14].sF, v[15].sF};
#endif
// Multiplies by alpha and then stores the results into the destination matrix
#pragma unroll
for (int w_two=0; w_two<TRA_WPT; ++w_two) {
realT result;
#if TRA_WPT == 1
Multiply(result, alpha, results[w_two]);
#elif TRA_WPT == 2
Multiply(result.x, alpha, results[w_two].x);
Multiply(result.y, alpha, results[w_two].y);
#elif TRA_WPT == 4
Multiply(result.x, alpha, results[w_two].x);
Multiply(result.y, alpha, results[w_two].y);
Multiply(result.z, alpha, results[w_two].z);
Multiply(result.w, alpha, results[w_two].w);
#elif TRA_WPT == 8
Multiply(result.s0, alpha, results[w_two].s0);
Multiply(result.s1, alpha, results[w_two].s1);
Multiply(result.s2, alpha, results[w_two].s2);
Multiply(result.s3, alpha, results[w_two].s3);
Multiply(result.s4, alpha, results[w_two].s4);
Multiply(result.s5, alpha, results[w_two].s5);
Multiply(result.s6, alpha, results[w_two].s6);
Multiply(result.s7, alpha, results[w_two].s7);
#elif TRA_WPT == 16
Multiply(result.s0, alpha, results[w_two].s0);
Multiply(result.s1, alpha, results[w_two].s1);
Multiply(result.s2, alpha, results[w_two].s2);
Multiply(result.s3, alpha, results[w_two].s3);
Multiply(result.s4, alpha, results[w_two].s4);
Multiply(result.s5, alpha, results[w_two].s5);
Multiply(result.s6, alpha, results[w_two].s6);
Multiply(result.s7, alpha, results[w_two].s7);
Multiply(result.s8, alpha, results[w_two].s8);
Multiply(result.s9, alpha, results[w_two].s9);
Multiply(result.sA, alpha, results[w_two].sA);
Multiply(result.sB, alpha, results[w_two].sB);
Multiply(result.sC, alpha, results[w_two].sC);
Multiply(result.sD, alpha, results[w_two].sD);
Multiply(result.sE, alpha, results[w_two].sE);
Multiply(result.sF, alpha, results[w_two].sF);
#endif
const int id_one = gid0*TRA_DIM + get_local_id(0);
const int id_two = (gid1*TRA_DIM + get_local_id(1))*TRA_WPT + w_two;
dest[id_two*(ld/TRA_WPT) + id_one] = result;
}
}
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
// Transposes a matrix from source to destination. The output is padded with zero values in case the
// destination matrix dimensions are larger than the transposed source matrix dimensions.
__kernel __attribute__((reqd_work_group_size(PADTRA_TILE, PADTRA_TILE, 1)))
void TransposePadMatrix(const int src_one, const int src_two,
const int src_ld, const int src_offset,
__global const real* restrict src,
const int dest_one, const int dest_two,
const int dest_ld, const int dest_offset,
__global real* dest,
const __constant real* restrict arg_alpha,
const int do_conjugate) {
const real alpha = arg_alpha[0];
// Local memory to store a tile of the matrix (for coalescing)
__local real tile[PADTRA_WPT*PADTRA_TILE][PADTRA_WPT*PADTRA_TILE + PADTRA_PAD];
// Loop over the work per thread
#pragma unroll
for (int w_one=0; w_one<PADTRA_WPT; ++w_one) {
#pragma unroll
for (int w_two=0; w_two<PADTRA_WPT; ++w_two) {
// Computes the identifiers for the source matrix. Note that the local and global dimensions
// do not correspond to each other!
const int id_src_one = (get_group_id(1)*PADTRA_WPT + w_two) * PADTRA_TILE + get_local_id(0);
const int id_src_two = (get_group_id(0)*PADTRA_WPT + w_one) * PADTRA_TILE + get_local_id(1);
// Loads data into the local memory if the thread IDs are within bounds of the source matrix.
// Otherwise, set the local memory value to zero.
real value;
SetToZero(value);
if (id_src_two < src_two && id_src_one < src_one) {
value = src[id_src_two*src_ld + id_src_one + src_offset];
}
tile[get_local_id(1)*PADTRA_WPT + w_two][get_local_id(0)*PADTRA_WPT + w_one] = value;
}
}
// Synchronizes all threads in a workgroup
barrier(CLK_LOCAL_MEM_FENCE);
// Loop over the work per thread
#pragma unroll
for (int w_one=0; w_one<PADTRA_WPT; ++w_one) {
#pragma unroll
for (int w_two=0; w_two<PADTRA_WPT; ++w_two) {
// Computes the identifiers for the destination matrix
const int id_dest_one = (get_group_id(0)*PADTRA_WPT + w_one) * PADTRA_TILE + get_local_id(0);
const int id_dest_two = (get_group_id(1)*PADTRA_WPT + w_two) * PADTRA_TILE + get_local_id(1);
// Stores the transposed value in the destination matrix
if ((id_dest_one < dest_one) && (id_dest_two < dest_two)) {
real value = tile[get_local_id(0)*PADTRA_WPT + w_two][get_local_id(1)*PADTRA_WPT + w_one];
if (do_conjugate == 1) { COMPLEX_CONJUGATE(value); }
Multiply(dest[id_dest_two*dest_ld + id_dest_one + dest_offset], alpha, value);
}
}
}
}
// =================================================================================================
// Transposes a matrix, while considering possible padding in the source matrix. Data is read from a
// padded source matrix, but only the actual data is written back to the transposed destination
// matrix. This kernel optionally checks for upper/lower triangular matrices.
__kernel __attribute__((reqd_work_group_size(PADTRA_TILE, PADTRA_TILE, 1)))
void TransposeMatrix(const int src_one, const int src_two,
const int src_ld, const int src_offset,
__global const real* restrict src,
const int dest_one, const int dest_two,
const int dest_ld, const int dest_offset,
__global real* dest,
const __constant real* restrict arg_alpha,
const int upper, const int lower,
const int diagonal_imag_zero) {
const real alpha = arg_alpha[0];
// Local memory to store a tile of the matrix (for coalescing)
__local real tile[PADTRA_WPT*PADTRA_TILE][PADTRA_WPT*PADTRA_TILE + PADTRA_PAD];
// Loop over the work per thread
#pragma unroll
for (int w_one=0; w_one<PADTRA_WPT; ++w_one) {
#pragma unroll
for (int w_two=0; w_two<PADTRA_WPT; ++w_two) {
// Computes the identifiers for the source matrix. Note that the local and global dimensions
// do not correspond to each other!
const int id_src_one = (get_group_id(1)*PADTRA_WPT + w_two) * PADTRA_TILE + get_local_id(0);
const int id_src_two = (get_group_id(0)*PADTRA_WPT + w_one) * PADTRA_TILE + get_local_id(1);
// Loads data into the local memory if the thread IDs are within bounds of the source matrix.
if ((id_src_one < src_one) && (id_src_two < src_two)) {
real value = src[id_src_two*src_ld + id_src_one + src_offset];
tile[get_local_id(1)*PADTRA_WPT + w_two][get_local_id(0)*PADTRA_WPT + w_one] = value;
}
}
}
// Synchronizes all threads in a workgroup
barrier(CLK_LOCAL_MEM_FENCE);
// Loop over the work per thread
#pragma unroll
for (int w_one=0; w_one<PADTRA_WPT; ++w_one) {
#pragma unroll
for (int w_two=0; w_two<PADTRA_WPT; ++w_two) {
// Computes the identifiers for the destination matrix
const int id_dest_one = (get_group_id(0)*PADTRA_WPT + w_one) * PADTRA_TILE + get_local_id(0);
const int id_dest_two = (get_group_id(1)*PADTRA_WPT + w_two) * PADTRA_TILE + get_local_id(1);
// Masking in case of triangular matrices: updates only the upper or lower part
bool condition = true;
#if defined(ROUTINE_SYRK) || defined(ROUTINE_HERK) || defined(ROUTINE_SYR2K) || defined(ROUTINE_HER2K)
if (upper == 1) { condition = (id_dest_one >= id_dest_two); }
else if (lower == 1) { condition = (id_dest_one <= id_dest_two); }
#endif
if (condition) {
// Stores the transposed value in the destination matrix
if ((id_dest_one < dest_one) && (id_dest_two < dest_two)) {
real value = tile[get_local_id(0)*PADTRA_WPT + w_two][get_local_id(1)*PADTRA_WPT + w_one];
if (diagonal_imag_zero == 1 && id_dest_one == id_dest_two) { ImagToZero(value); }
Multiply(dest[id_dest_two*dest_ld + id_dest_one + dest_offset], alpha, value);
}
}
}
}
}
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
#if defined(ROUTINE_SYMM)
// Kernel to populate a squared symmetric matrix, given that the triangle which holds the data is
// stored as the lower-triangle of the input matrix. This uses the padding kernel's parameters.
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void SymmLowerToSquared(const int src_dim,
const int src_ld, const int src_offset,
__global const real* restrict src,
const int dest_dim,
const int dest_ld, const int dest_offset,
__global real* dest) {
// Loops over the work per thread in both dimensions
#pragma unroll
for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
#pragma unroll
for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
if (id_two < dest_dim && id_one < dest_dim) {
// Loads data from the lower-symmetric matrix
real result;
SetToZero(result);
if (id_two < src_dim && id_one < src_dim) {
if (id_two <= id_one) { result = src[id_two*src_ld + id_one + src_offset]; }
else { result = src[id_one*src_ld + id_two + src_offset]; }
}
// Stores the result in the destination matrix
dest[id_two*dest_ld + id_one + dest_offset] = result;
}
}
}
}
// Same as above, but now the matrix' data is stored in the upper-triangle
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void SymmUpperToSquared(const int src_dim,
const int src_ld, const int src_offset,
__global const real* restrict src,
const int dest_dim,
const int dest_ld, const int dest_offset,
__global real* dest) {
// Loops over the work per thread in both dimensions
#pragma unroll
for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
#pragma unroll
for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
if (id_two < dest_dim && id_one < dest_dim) {
// Loads data from the upper-symmetric matrix
real result;
SetToZero(result);
if (id_two < src_dim && id_one < src_dim) {
if (id_one <= id_two) { result = src[id_two*src_ld + id_one + src_offset]; }
else { result = src[id_one*src_ld + id_two + src_offset]; }
}
// Stores the result in the destination matrix
dest[id_two*dest_ld + id_one + dest_offset] = result;
}
}
}
}
#endif
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
#if defined(ROUTINE_TRMM)
// Kernel to populate a squared triangular matrix, given that the triangle which holds the data is
// stored as the lower-triangle of the input matrix. This uses the padding kernel's parameters.
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void TriaLowerToSquared(const int src_dim,
const int src_ld, const int src_offset,
__global const real* restrict src,
const int dest_dim,
const int dest_ld, const int dest_offset,
__global real* dest,
const int unit_diagonal) {
// Loops over the work per thread in both dimensions
#pragma unroll
for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
#pragma unroll
for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
if (id_two < dest_dim && id_one < dest_dim) {
// Loads data from the lower-triangular matrix
real result;
SetToZero(result);
if (id_two < src_dim && id_one < src_dim) {
if (id_two <= id_one) { result = src[id_two*src_ld + id_one + src_offset]; }
if (id_two == id_one && unit_diagonal) { SetToOne(result); }
// Else: result is zero
}
// Stores the result in the destination matrix
dest[id_two*dest_ld + id_one + dest_offset] = result;
}
}
}
}
// Same as above, but now the matrix' data is stored in the upper-triangle
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void TriaUpperToSquared(const int src_dim,
const int src_ld, const int src_offset,
__global const real* restrict src,
const int dest_dim,
const int dest_ld, const int dest_offset,
__global real* dest,
const int unit_diagonal) {
// Loops over the work per thread in both dimensions
#pragma unroll
for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
#pragma unroll
for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
if (id_two < dest_dim && id_one < dest_dim) {
// Loads data from the upper-triangular matrix
real result;
SetToZero(result);
if (id_two < src_dim && id_one < src_dim) {
if (id_one <= id_two) { result = src[id_two*src_ld + id_one + src_offset]; }
if (id_one == id_two && unit_diagonal) { SetToOne(result); }
// Else: result is zero
}
// Stores the result in the destination matrix
dest[id_two*dest_ld + id_one + dest_offset] = result;
}
}
}
}
#endif
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
#if defined(ROUTINE_HEMM) && (PRECISION == 3232 || PRECISION == 6464)
// Kernel to populate a squared hermitian matrix, given that the triangle which holds the data is
// stored as the lower-triangle of the input matrix. This uses the padding kernel's parameters.
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void HermLowerToSquared(const int src_dim,
const int src_ld, const int src_offset,
__global const real* restrict src,
const int dest_dim,
const int dest_ld, const int dest_offset,
__global real* dest) {
// Loops over the work per thread in both dimensions
#pragma unroll
for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
#pragma unroll
for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
if (id_two < dest_dim && id_one < dest_dim) {
// Loads data from the lower-hermitian matrix
real result;
SetToZero(result);
if (id_two < src_dim && id_one < src_dim) {
if (id_two <= id_one) {
result = src[id_two*src_ld + id_one + src_offset];
if (id_one == id_two) { result.y = ZERO; }
}
else {
result = src[id_one*src_ld + id_two + src_offset];
COMPLEX_CONJUGATE(result);
}
}
// Stores the result in the destination matrix
dest[id_two*dest_ld + id_one + dest_offset] = result;
}
}
}
}
// Same as above, but now the matrix' data is stored in the upper-triangle
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void HermUpperToSquared(const int src_dim,
const int src_ld, const int src_offset,
__global const real* restrict src,
const int dest_dim,
const int dest_ld, const int dest_offset,
__global real* dest) {
// Loops over the work per thread in both dimensions
#pragma unroll
for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
#pragma unroll
for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
if (id_two < dest_dim && id_one < dest_dim) {
// Loads data from the upper-hermitian matrix
real result;
SetToZero(result);
if (id_two < src_dim && id_one < src_dim) {
if (id_one <= id_two) {
result = src[id_two*src_ld + id_one + src_offset];
if (id_one == id_two) { result.y = ZERO; }
}
else {
result = src[id_one*src_ld + id_two + src_offset];
COMPLEX_CONJUGATE(result);
}
}
// Stores the result in the destination matrix
dest[id_two*dest_ld + id_one + dest_offset] = result;
}
}
}
}
#endif
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
// Parameters set by the tuner or by the database. Here they are given a basic default value in case
// this kernel file is used outside of the CLBlast library.
#ifndef MWG
#define MWG 8 // Tile-size in dimension M (e.g. 64, 128)
#endif
#ifndef NWG
#define NWG 8 // Tile-size in dimension N (e.g. 64, 128)
#endif
#ifndef KWG
#define KWG 8 // Tile-size in dimension K (e.g. 8, 16)
#endif
#ifndef MDIMC
#define MDIMC 8 // Threads per workgroup in M-dimension (e.g. 8, 16, 32)
#endif
#ifndef NDIMC
#define NDIMC 8 // Threads per workgroup in N-dimension (e.g. 8, 16, 32)
#endif
#ifndef MDIMA
#define MDIMA 8 // Re-shaped tile dimension of matrix A: KDIMA * MDIMA
#endif
#ifndef NDIMB
#define NDIMB 8 // Re-shaped tile dimension of matrix B: KDIMB * NDIMB
#endif
#ifndef KWI
#define KWI 1 // Unroll factor of the KWG loop (smaller or equal than KWG)
#endif
#ifndef VWM
#define VWM 1 // Vector width of matrices A and C
#endif
#ifndef VWN
#define VWN 1 // Vector width of matrix B
#endif
#ifndef STRM
#define STRM 0 // Use strided access within a thread in the M-dimension (1) or not (0)
#endif
#ifndef STRN
#define STRN 0 // Use strided access within a thread in the N-dimension (1) or not (0)
#endif
#ifndef SA
#define SA 0 // Use local/shared memory to cache matrix A (1) or not (0)
#endif
#ifndef SB
#define SB 0 // Use local/shared memory to cache matrix B (1) or not (0)
#endif
// Helper parameters based on the above tuning parameters
#define MWI (MWG/MDIMC) // Work per work-item (M-dimension)
#define NWI (NWG/NDIMC) // Work per work-item (N-dimension)
#define KDIMA ((MDIMC*NDIMC)/(MDIMA)) // Re-shaped tile dimension of matrix A: KDIMA * MDIMA
#define KDIMB ((MDIMC*NDIMC)/(NDIMB)) // Re-shaped tile dimension of matrix B: KDIMB * NDIMB
#define MWA (MWG/MDIMA) // Amount of loads-per-thread for matrix A (M-dimension)
#define KWA (KWG/KDIMA) // Amount of loads-per-thread for matrix A (K-dimension)
#define KWB (KWG/KDIMB) // Amount of loads-per-thread for matrix B (K-dimension)
#define NWB (NWG/NDIMB) // Amount of loads-per-thread for matrix B (N-dimension)
// Settings
#ifndef USE_VECTOR_MAD
#define USE_VECTOR_MAD 0 // Unroll (0) or don't (1) unroll the vector MAD manually
#endif
#ifndef GLOBAL_MEM_FENCE
#define GLOBAL_MEM_FENCE 0 // Global synchronisation barrier for potential better performance
#endif
// =================================================================================================
// Data-widths in dimension M
#if VWM == 1
typedef real realM;
#elif VWM == 2
typedef real2 realM;
#elif VWM == 4
typedef real4 realM;
#elif VWM == 8
typedef real8 realM;
#elif VWM == 16
typedef real16 realM;
#endif
// Data-widths in dimension N
#if VWN == 1
typedef real realN;
#elif VWN == 2
typedef real2 realN;
#elif VWN == 4
typedef real4 realN;
#elif VWN == 8
typedef real8 realN;
#elif VWN == 16
typedef real16 realN;
#endif
// =================================================================================================
// Initializes the accumulation registers to zero
void InitAccRegisters(realM cpm[NWI][MWI/VWM]) {
#pragma unroll
for (int mi=0; mi<MWI/VWM; ++mi) {
#pragma unroll
for (int ni=0; ni<NWI; ++ni) {
#if VWM == 1
SetToZero(cpm[ni][mi]);
#elif VWM == 2
SetToZero(cpm[ni][mi].x);
SetToZero(cpm[ni][mi].y);
#elif VWM == 4
SetToZero(cpm[ni][mi].x);
SetToZero(cpm[ni][mi].y);
SetToZero(cpm[ni][mi].z);
SetToZero(cpm[ni][mi].w);
#elif VWM == 8
SetToZero(cpm[ni][mi].s0);
SetToZero(cpm[ni][mi].s1);
SetToZero(cpm[ni][mi].s2);
SetToZero(cpm[ni][mi].s3);
SetToZero(cpm[ni][mi].s4);
SetToZero(cpm[ni][mi].s5);
SetToZero(cpm[ni][mi].s6);
SetToZero(cpm[ni][mi].s7);
#elif VWM == 16
SetToZero(cpm[ni][mi].s0);
SetToZero(cpm[ni][mi].s1);
SetToZero(cpm[ni][mi].s2);
SetToZero(cpm[ni][mi].s3);
SetToZero(cpm[ni][mi].s4);
SetToZero(cpm[ni][mi].s5);
SetToZero(cpm[ni][mi].s6);
SetToZero(cpm[ni][mi].s7);
SetToZero(cpm[ni][mi].s8);
SetToZero(cpm[ni][mi].s9);
SetToZero(cpm[ni][mi].sA);
SetToZero(cpm[ni][mi].sB);
SetToZero(cpm[ni][mi].sC);
SetToZero(cpm[ni][mi].sD);
SetToZero(cpm[ni][mi].sE);
SetToZero(cpm[ni][mi].sF);
#endif
}
}
}
// =================================================================================================
// Caches global off-chip memory into local (shared) memory on-chip. This function is specific for
// caching the A input matrix.
#if SA == 1
inline void GlobalToLocalA(const __global realM* restrict agm, __local realM* alm,
const int kSizeM, const int tid, const int kwg) {
const int la0 = tid % MDIMA;
const int la1 = tid / MDIMA;
#pragma unroll
for (int mia=0; mia<MWA/VWM; ++mia) {
#pragma unroll
for (int kia=0; kia<KWA; ++kia) {
// Computes the indices based on strided/non-strided access
#if STRM == 0
int mg = mia + la0*(MWA/VWM);
#elif STRM == 1
int mg = la0 + mia*MDIMA;
#endif
// Computes the indices for the global memory
int kg = kia + la1*KWA;
int idm = mg + GetGroupID0() * (MWG/VWM);
int idk = kg + kwg;
// Loads the data from global memory (not transposed) into the local memory
alm[kg*(MWG/VWM) + mg] = agm[idk*(kSizeM/VWM) + idm];
}
}
}
#endif
// Same as above, but now for the B input matrix
#if SB == 1
inline void GlobalToLocalB(const __global realN* restrict bgm, __local realN* blm,
const int kSizeN, const int tid, const int kwg) {
const int lb0 = tid % NDIMB;
const int lb1 = tid / NDIMB;
#pragma unroll
for (int kib=0; kib<KWB; ++kib) {
#pragma unroll
for (int nib=0; nib<NWB/VWN; ++nib) {
// Computes the indices based on strided/non-strided access
#if STRN == 0
int ng = nib + lb0*(NWB/VWN);
#elif STRN == 1
int ng = lb0 + nib*NDIMB;
#endif
// Computes the indices for the global memory
int kg = kib + lb1*KWB;
int idn = ng + GetGroupID1() * (NWG/VWN);
int idk = kg + kwg;
// Loads the data from global memory (transposed) into the local memory
blm[kg*(NWG/VWN) + ng] = bgm[idk*(kSizeN/VWN) + idn];
}
}
}
#endif
// =================================================================================================
// Caches global off-chip memory directly into per-thread private memory (registers). This function
// is specific for caching the A input matrix.
#if SA == 0
inline void GlobalToPrivateA(const __global realM* restrict agm, realM apm[MWI/VWM],
const int kSizeM, const int idk, const int kwg) {
#pragma unroll
for (int mi=0; mi<MWI/VWM; ++mi) {
// Computes the indices based on strided/non-strided access
#if STRM == 0
int mg = mi + get_local_id(0)*(MWI/VWM);
#elif STRM == 1
int mg = get_local_id(0) + mi*MDIMC;
#endif
// Computes the indices for the global memory
int idm = mg + GetGroupID0() * (MWG/VWM);
// Loads the data from global memory (not transposed) and stores into registers
apm[mi] = agm[idk*(kSizeM/VWM) + idm];
}
}
#endif
// Same as above, but now for the B input matrix
#if SB == 0
inline void GlobalToPrivateB(const __global realN* restrict bgm, realN bpm[NWI/VWN],
const int kSizeN, const int idk) {
#pragma unroll
for (int ni=0; ni<NWI/VWN; ++ni) {
// Computes the indices based on strided/non-strided access
#if STRN == 0
int ng = ni + get_local_id(1)*(NWI/VWN);
#elif STRN == 1
int ng = get_local_id(1) + ni*NDIMC;
#endif
// Computes the indices for the global memory
int idn = ng + GetGroupID1() * (NWG/VWN);
// Loads the data from global memory (transposed) and stores into registers
bpm[ni] = bgm[idk*(kSizeN/VWN) + idn];
}
}
#endif
// =================================================================================================
// Caches on-chip local memory into per-thread private memory (registers). This function is specific
// for caching the A input matrix.
#if SA == 1
inline void LocalToPrivateA(__local realM* alm, realM apm[MWI/VWM], const int kg) {
#pragma unroll
for (int mi=0; mi<MWI/VWM; ++mi) {
#if STRM == 0
int mg = mi + get_local_id(0)*(MWI/VWM);
#elif STRM == 1
int mg = get_local_id(0) + mi*MDIMC;
#endif
apm[mi] = alm[kg*(MWG/VWM) + mg];
}
}
#endif
// Same as above, but now for the B input matrix
#if SB == 1
inline void LocalToPrivateB(__local realN* blm, realN bpm[NWI/VWN], const int kg) {
#pragma unroll
for (int ni=0; ni<NWI/VWN; ++ni) {
#if STRN == 0
int ng = ni + get_local_id(1)*(NWI/VWN);
#elif STRN == 1
int ng = get_local_id(1) + ni*NDIMC;
#endif
bpm[ni] = blm[kg*(NWG/VWN) + ng];
}
}
#endif
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
// The vectorised multiply-add function
inline realM MultiplyAddVector(realM cvec, const realM avec, const real bval) {
#if USE_VECTOR_MAD == 1
cvec += avec * bval;
#else
#if VWM == 1
MultiplyAdd(cvec, avec, bval);
#elif VWM == 2
MultiplyAdd(cvec.x , avec.x, bval);
MultiplyAdd(cvec.y , avec.y, bval);
#elif VWM == 4
MultiplyAdd(cvec.x , avec.x, bval);
MultiplyAdd(cvec.y , avec.y, bval);
MultiplyAdd(cvec.z , avec.z, bval);
MultiplyAdd(cvec.w , avec.w, bval);
#elif VWM == 8
MultiplyAdd(cvec.s0, avec.s0, bval);
MultiplyAdd(cvec.s1, avec.s1, bval);
MultiplyAdd(cvec.s2, avec.s2, bval);
MultiplyAdd(cvec.s3, avec.s3, bval);
MultiplyAdd(cvec.s4, avec.s4, bval);
MultiplyAdd(cvec.s5, avec.s5, bval);
MultiplyAdd(cvec.s6, avec.s6, bval);
MultiplyAdd(cvec.s7, avec.s7, bval);
#elif VWM == 16
MultiplyAdd(cvec.s0, avec.s0, bval);
MultiplyAdd(cvec.s1, avec.s1, bval);
MultiplyAdd(cvec.s2, avec.s2, bval);
MultiplyAdd(cvec.s3, avec.s3, bval);
MultiplyAdd(cvec.s4, avec.s4, bval);
MultiplyAdd(cvec.s5, avec.s5, bval);
MultiplyAdd(cvec.s6, avec.s6, bval);
MultiplyAdd(cvec.s7, avec.s7, bval);
MultiplyAdd(cvec.s8, avec.s8, bval);
MultiplyAdd(cvec.s9, avec.s9, bval);
MultiplyAdd(cvec.sA, avec.sA, bval);
MultiplyAdd(cvec.sB, avec.sB, bval);
MultiplyAdd(cvec.sC, avec.sC, bval);
MultiplyAdd(cvec.sD, avec.sD, bval);
MultiplyAdd(cvec.sE, avec.sE, bval);
MultiplyAdd(cvec.sF, avec.sF, bval);
#endif
#endif
return cvec;
}
// Performs the actual computation: Cpm += Apm * Bpm
inline void MultiplyAccumulate(realM cpm[NWI][MWI/VWM], realM apm[MWI/VWM], realN bpm[NWI/VWN]) {
#pragma unroll
for (int ni=0; ni<NWI/VWN; ++ni) {
#pragma unroll
for (int mi=0; mi<MWI/VWM; ++mi) {
const realM aval = apm[mi];
#if VWN == 1
cpm[ni*VWN + 0][mi] = MultiplyAddVector(cpm[ni*VWN + 0][mi], aval, bpm[ni]);
#elif VWN == 2
cpm[ni*VWN + 0][mi] = MultiplyAddVector(cpm[ni*VWN + 0][mi], aval, bpm[ni].x);
cpm[ni*VWN + 1][mi] = MultiplyAddVector(cpm[ni*VWN + 1][mi], aval, bpm[ni].y);
#elif VWN == 4
cpm[ni*VWN + 0][mi] = MultiplyAddVector(cpm[ni*VWN + 0][mi], aval, bpm[ni].x);
cpm[ni*VWN + 1][mi] = MultiplyAddVector(cpm[ni*VWN + 1][mi], aval, bpm[ni].y);
cpm[ni*VWN + 2][mi] = MultiplyAddVector(cpm[ni*VWN + 2][mi], aval, bpm[ni].z);
cpm[ni*VWN + 3][mi] = MultiplyAddVector(cpm[ni*VWN + 3][mi], aval, bpm[ni].w);
#elif VWN == 8
cpm[ni*VWN + 0][mi] = MultiplyAddVector(cpm[ni*VWN + 0][mi], aval, bpm[ni].s0);
cpm[ni*VWN + 1][mi] = MultiplyAddVector(cpm[ni*VWN + 1][mi], aval, bpm[ni].s1);
cpm[ni*VWN + 2][mi] = MultiplyAddVector(cpm[ni*VWN + 2][mi], aval, bpm[ni].s2);
cpm[ni*VWN + 3][mi] = MultiplyAddVector(cpm[ni*VWN + 3][mi], aval, bpm[ni].s3);
cpm[ni*VWN + 4][mi] = MultiplyAddVector(cpm[ni*VWN + 4][mi], aval, bpm[ni].s4);
cpm[ni*VWN + 5][mi] = MultiplyAddVector(cpm[ni*VWN + 5][mi], aval, bpm[ni].s5);
cpm[ni*VWN + 6][mi] = MultiplyAddVector(cpm[ni*VWN + 6][mi], aval, bpm[ni].s6);
cpm[ni*VWN + 7][mi] = MultiplyAddVector(cpm[ni*VWN + 7][mi], aval, bpm[ni].s7);
#elif VWN == 16
cpm[ni*VWN + 0 ][mi] = MultiplyAddVector(cpm[ni*VWN + 0 ][mi], aval, bpm[ni].s0);
cpm[ni*VWN + 1 ][mi] = MultiplyAddVector(cpm[ni*VWN + 1 ][mi], aval, bpm[ni].s1);
cpm[ni*VWN + 2 ][mi] = MultiplyAddVector(cpm[ni*VWN + 2 ][mi], aval, bpm[ni].s2);
cpm[ni*VWN + 3 ][mi] = MultiplyAddVector(cpm[ni*VWN + 3 ][mi], aval, bpm[ni].s3);
cpm[ni*VWN + 4 ][mi] = MultiplyAddVector(cpm[ni*VWN + 4 ][mi], aval, bpm[ni].s4);
cpm[ni*VWN + 5 ][mi] = MultiplyAddVector(cpm[ni*VWN + 5 ][mi], aval, bpm[ni].s5);
cpm[ni*VWN + 6 ][mi] = MultiplyAddVector(cpm[ni*VWN + 6 ][mi], aval, bpm[ni].s6);
cpm[ni*VWN + 7 ][mi] = MultiplyAddVector(cpm[ni*VWN + 7 ][mi], aval, bpm[ni].s7);
cpm[ni*VWN + 8 ][mi] = MultiplyAddVector(cpm[ni*VWN + 8 ][mi], aval, bpm[ni].s8);
cpm[ni*VWN + 9 ][mi] = MultiplyAddVector(cpm[ni*VWN + 9 ][mi], aval, bpm[ni].s9);
cpm[ni*VWN + 10][mi] = MultiplyAddVector(cpm[ni*VWN + 10][mi], aval, bpm[ni].sA);
cpm[ni*VWN + 11][mi] = MultiplyAddVector(cpm[ni*VWN + 11][mi], aval, bpm[ni].sB);
cpm[ni*VWN + 12][mi] = MultiplyAddVector(cpm[ni*VWN + 12][mi], aval, bpm[ni].sC);
cpm[ni*VWN + 13][mi] = MultiplyAddVector(cpm[ni*VWN + 13][mi], aval, bpm[ni].sD);
cpm[ni*VWN + 14][mi] = MultiplyAddVector(cpm[ni*VWN + 14][mi], aval, bpm[ni].sE);
cpm[ni*VWN + 15][mi] = MultiplyAddVector(cpm[ni*VWN + 15][mi], aval, bpm[ni].sF);
#endif
}
}
}
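// Hedged sketch (not part of the dumped kernel source): ignoring the vector widths, the loop
// nest above is equivalent to a plain outer-product update of the per-work-item register tile:
//   for (int n = 0; n < NWI; ++n)
//     for (int m = 0; m < MWI; ++m)
//       c[n][m] += a[m] * b[n];
// i.e. each work-item accumulates an MWI x NWI block of C from its private A and B fragments.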
// =================================================================================================
// Merges the results in Cpm with the global array in Cgm. This also performs the multiplication
// with the constants: Cgm = alpha*A*B + beta*Cgm = alpha*Cpm + beta*Cgm
inline void StoreResults(__global realM* cgm, realM cpm[NWI][MWI/VWM], const int kSizeM,
const real alpha, const real beta) {
#pragma unroll
for (int ni=0; ni<NWI; ++ni) {
#pragma unroll
for (int mi=0; mi<MWI/VWM; ++mi) {
#if STRM == 0
int mg = mi + get_local_id(0)*(MWI/VWM);
#elif STRM == 1
int mg = get_local_id(0) + mi*MDIMC;
#endif
#if STRN == 0
int ng = ni + get_local_id(1)*NWI;
#elif STRN == 1
int ng = ni%VWN + get_local_id(1)*VWN + (ni/VWN)*VWN*NDIMC;
#endif
int idm = mg + GetGroupID0() * (MWG/VWM);
int idn = ng + GetGroupID1() * NWG;
// The final multiplication with alpha and the addition with beta*C
int index = idn*(kSizeM/VWM) + idm;
realM result;
realM xval = cpm[ni][mi];
realM yval = cgm[index];
#if VWM == 1
AXPBY(result, alpha, xval, beta, yval);
#elif VWM == 2
AXPBY(result.x, alpha, xval.x, beta, yval.x);
AXPBY(result.y, alpha, xval.y, beta, yval.y);
#elif VWM == 4
AXPBY(result.x, alpha, xval.x, beta, yval.x);
AXPBY(result.y, alpha, xval.y, beta, yval.y);
AXPBY(result.z, alpha, xval.z, beta, yval.z);
AXPBY(result.w, alpha, xval.w, beta, yval.w);
#elif VWM == 8
AXPBY(result.s0, alpha, xval.s0, beta, yval.s0);
AXPBY(result.s1, alpha, xval.s1, beta, yval.s1);
AXPBY(result.s2, alpha, xval.s2, beta, yval.s2);
AXPBY(result.s3, alpha, xval.s3, beta, yval.s3);
AXPBY(result.s4, alpha, xval.s4, beta, yval.s4);
AXPBY(result.s5, alpha, xval.s5, beta, yval.s5);
AXPBY(result.s6, alpha, xval.s6, beta, yval.s6);
AXPBY(result.s7, alpha, xval.s7, beta, yval.s7);
#elif VWM == 16
AXPBY(result.s0, alpha, xval.s0, beta, yval.s0);
AXPBY(result.s1, alpha, xval.s1, beta, yval.s1);
AXPBY(result.s2, alpha, xval.s2, beta, yval.s2);
AXPBY(result.s3, alpha, xval.s3, beta, yval.s3);
AXPBY(result.s4, alpha, xval.s4, beta, yval.s4);
AXPBY(result.s5, alpha, xval.s5, beta, yval.s5);
AXPBY(result.s6, alpha, xval.s6, beta, yval.s6);
AXPBY(result.s7, alpha, xval.s7, beta, yval.s7);
AXPBY(result.s8, alpha, xval.s8, beta, yval.s8);
AXPBY(result.s9, alpha, xval.s9, beta, yval.s9);
AXPBY(result.sA, alpha, xval.sA, beta, yval.sA);
AXPBY(result.sB, alpha, xval.sB, beta, yval.sB);
AXPBY(result.sC, alpha, xval.sC, beta, yval.sC);
AXPBY(result.sD, alpha, xval.sD, beta, yval.sD);
AXPBY(result.sE, alpha, xval.sE, beta, yval.sE);
AXPBY(result.sF, alpha, xval.sF, beta, yval.sF);
#endif
cgm[index] = result;
}
}
}
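// Hedged aside (not part of the dumped kernel source): AXPBY is defined earlier in the kernel
// string and is assumed to compute result = alpha*x + beta*y per component, so the store above
// realises Cgm = alpha*(A*B) + beta*Cgm one realM vector at a time. The index computation
// assumes C is column-major with leading dimension kSizeM, packed into vectors of width VWM:
//   index = (GetGroupID1()*NWG + ng) * (kSizeM/VWM) + (GetGroupID0()*(MWG/VWM) + mg);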
// =================================================================================================
// Main body of the matrix-multiplication algorithm. It calls the (inlined) functions above.
inline void XgemmBody(const int kSizeM, const int kSizeN, const int kSizeK,
const __global realM* restrict agm, const __global realN* restrict bgm,
__global realM* cgm, realM cpm[NWI][MWI/VWM]
#if SA == 1 && SB == 1
, __local realM* alm, __local realN* blm
#elif SA == 1
, __local realM* alm
#elif SB == 1
, __local realN* blm
#endif
) {
// Allocates workitem-private memory (registers)
realM apm[MWI/VWM];
realN bpm[NWI/VWN];
// Combined thread identifier (volatile to disable caching)
#if SA == 1 || SB == 1
volatile int tid = get_local_id(0) + MDIMC*get_local_id(1);
#endif
// Initializes the accumulation registers
InitAccRegisters(cpm);
// Loops over all workgroup tiles
for (int kwg=0; kwg<kSizeK; kwg+=KWG) {
// Loads data: off-chip --> local (matrix A)
#if SA == 1
GlobalToLocalA(agm, alm, kSizeM, tid, kwg);
#endif
// Loads data: off-chip --> local (matrix B)
#if SB == 1
GlobalToLocalB(bgm, blm, kSizeN, tid, kwg);
#endif
#if SA == 1 || SB == 1
barrier(CLK_LOCAL_MEM_FENCE);
#endif
// Loops over all workitem tiles, unrolled by a factor KWI
for (int pwi=0; pwi<KWG; pwi+=KWI) {
#pragma unroll
for (int pit=0; pit<KWI; ++pit) {
#if SA == 0 || SB == 0
int idk = kwg + pwi + pit;
#endif
#if SA == 1 || SB == 1
int kg = pwi+pit;
#endif
// Loads data: local --> private (matrix A)
#if SA == 1
LocalToPrivateA(alm, apm, kg);
// Loads data: off-chip --> private (matrix A)
#else
GlobalToPrivateA(agm, apm, kSizeM, idk, kwg);
#endif
// Loads data: local --> private (matrix B)
#if SB == 1
LocalToPrivateB(blm, bpm, kg);
// Loads data: off-chip --> private (matrix B)
#else
GlobalToPrivateB(bgm, bpm, kSizeN, idk);
#endif
// Performs the accumulation (Cpm += Apm * Bpm)
MultiplyAccumulate(cpm, apm, bpm);
}
}
#if SA == 1 || SB == 1
barrier(CLK_LOCAL_MEM_FENCE);
#endif
}
#if GLOBAL_MEM_FENCE == 1
barrier(CLK_GLOBAL_MEM_FENCE);
#endif
}
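// Hedged summary (not part of the dumped kernel source): XgemmBody implements a three-level
// tiling scheme. Each workgroup processes an MWG x NWG tile of C, stepping over K in chunks of
// KWG that are optionally staged in local memory (SA/SB); each work-item keeps an MWI x NWI
// sub-tile in registers, and the innermost K loop is unrolled by a factor KWI. With illustrative
// parameters MWG = NWG = 64 and MDIMC = NDIMC = 8, a workgroup of 64 work-items covers the
// tile with MWI = NWI = 8 per work-item.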
// =================================================================================================
// The upper-triangular and lower-triangular kernels are only used in special cases
#if defined(ROUTINE_SYRK) || defined(ROUTINE_HERK) || defined(ROUTINE_SYR2K) || defined(ROUTINE_HER2K)
// Main entry point of the kernel. This is the upper-triangular version.
__kernel __attribute__((reqd_work_group_size(MDIMC, NDIMC, 1)))
void XgemmUpper(const int kSizeN, const int kSizeK,
const __constant real* restrict arg_alpha,
const __constant real* restrict arg_beta,
const __global realM* restrict agm,
const __global realN* restrict bgm,
__global realM* cgm) {
const real alpha = arg_alpha[0];
const real beta = arg_beta[0];
// Skip these threads if they do not contain threads contributing to the upper-triangle
if (GetGroupID1()*NWG < GetGroupID0()*MWG) {
return;
}
// Allocates workgroup-private memory (local memory)
#if SA == 1
__local realM alm[KWG * MWG/VWM];
#endif
#if SB == 1
__local realN blm[KWG * NWG/VWN];
#endif
// Computes the matrix-multiplication and stores the result in register memory
realM cpm[NWI][MWI/VWM];
#if SA == 1 && SB == 1
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm, alm, blm);
#elif SA == 1
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm, alm);
#elif SB == 1
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm, blm);
#else
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm);
#endif
// Stores an MWG * NWG tile of results and performs the multiplication with alpha and beta
StoreResults(cgm, cpm, kSizeN, alpha, beta);
}
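// Hedged aside (not part of the dumped kernel source): the triangular kernels pass kSizeN as
// both the M and N dimension of XgemmBody and StoreResults because they are only used for the
// SYRK/HERK/SYR2K/HER2K routines listed above, where C is square. The group-ID test at the top
// skips workgroup tiles whose origin lies on the wrong side of the diagonal.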
// Main entry point of the kernel. This is the lower-triangular version.
__kernel __attribute__((reqd_work_group_size(MDIMC, NDIMC, 1)))
void XgemmLower(const int kSizeN, const int kSizeK,
const __constant real* restrict arg_alpha,
const __constant real* restrict arg_beta,
const __global realM* restrict agm,
const __global realN* restrict bgm,
__global realM* cgm) {
const real alpha = arg_alpha[0];
const real beta = arg_beta[0];
// Skip these threads if they do not contain threads contributing to the lower-triangle
if (GetGroupID1()*NWG > GetGroupID0()*MWG) {
return;
}
// Allocates workgroup-private memory (local memory)
#if SA == 1
__local realM alm[KWG * MWG/VWM];
#endif
#if SB == 1
__local realN blm[KWG * NWG/VWN];
#endif
// Computes the matrix-multiplication and stores the result in register memory
realM cpm[NWI][MWI/VWM];
#if SA == 1 && SB == 1
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm, alm, blm);
#elif SA == 1
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm, alm);
#elif SB == 1
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm, blm);
#else
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm);
#endif
// Stores an MWG * NWG tile of results and performs the multiplication with alpha and beta
StoreResults(cgm, cpm, kSizeN, alpha, beta);
}
// =================================================================================================
// If not using a triangular version, include the regular kernel
#else
// Main entry point of the kernel. This is the regular full version.
__kernel __attribute__((reqd_work_group_size(MDIMC, NDIMC, 1)))
void Xgemm(const int kSizeM, const int kSizeN, const int kSizeK,
const __constant real* restrict arg_alpha,
const __constant real* restrict arg_beta,
const __global realM* restrict agm,
const __global realN* restrict bgm,
__global realM* cgm) {
const real alpha = arg_alpha[0];
const real beta = arg_beta[0];
// Allocates workgroup-private memory (local memory)
#if SA == 1
__local realM alm[KWG * MWG/VWM];
#endif
#if SB == 1
__local realN blm[KWG * NWG/VWN];
#endif
// Computes the matrix-multiplication and stores the result in register memory
realM cpm[NWI][MWI/VWM];
#if SA == 1 && SB == 1
XgemmBody(kSizeM, kSizeN, kSizeK, agm, bgm, cgm, cpm, alm, blm);
#elif SA == 1
XgemmBody(kSizeM, kSizeN, kSizeK, agm, bgm, cgm, cpm, alm);
#elif SB == 1
XgemmBody(kSizeM, kSizeN, kSizeK, agm, bgm, cgm, cpm, blm);
#else
XgemmBody(kSizeM, kSizeN, kSizeK, agm, bgm, cgm, cpm);
#endif
// Stores an MWG * NWG tile of results and performs the multiplication with alpha and beta
StoreResults(cgm, cpm, kSizeM, alpha, beta);
}
#endif
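// Hedged host-side sketch (not part of the dumped kernel source): the reqd_work_group_size
// attribute fixes the local size at (MDIMC, NDIMC, 1). Assuming kSizeM and kSizeN are already
// padded to multiples of MWG and NWG, a plausible launch of the regular kernel would be:
//   size_t local[2]  = {MDIMC, NDIMC};
//   size_t global[2] = {(kSizeM * MDIMC) / MWG, (kSizeN * NDIMC) / NWG};
//   clEnqueueNDRangeKernel(queue, xgemm_kernel, 2, NULL, global, local, 0, NULL, NULL);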
// =================================================================================================
// End of the C++11 raw string literal
F0829 11:54:00.688835 5841 greentea_math_functions.cpp:254] Check failed: static_cast<int>(status) == static_cast<int>(clblast::StatusCode::kSuccess) (-11 vs. 0) GREENTEA ERROR: CLBlast error
*** Check failure stack trace: ***
@ 0x76ea4c24 (unknown)
@ 0x76ea6cfc (unknown)
@ 0x76ea4814 (unknown)
@ 0x76ea7648 (unknown)
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
@ 0x74bdc138 caffe::greentea_gpu_gemm<>()
Makefile:657: recipe for target 'runtest' failed
make: *** [runtest] Aborted
iCar-G3-446:/dev/disk/mounted/sda1/opencl/caffe #
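[Hedged note on the failure above: CLBlast status codes mirror OpenCL error codes for negative values, and -11 corresponds to CL_BUILD_PROGRAM_FAILURE, so the Xgemm kernel source dumped above most likely failed to compile for this device; that is what trips the CHECK in greentea_math_functions.cpp:254 and aborts the runtest target.]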