- https://software.intel.com/en-us/node/468480
- https://developer.apple.com/library/mac/documentation/Accelerate/Reference/BLAS_Ref/index.html
- http://blog.csdn.net/seven_first/article/details/47378697#1-caffecpugemm-函数
void cblas_sgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA,
const enum CBLAS_TRANSPOSE TransB, const int M, const int N,
const int K, const float alpha, const float *A,
const int lda, const float *B, const int ldb,
const float beta, float *C, const int ldc);
C = alpha * op(A) * op(B) + beta * C
op(A) = A' if TransA is set, otherwise op(A) = A; op(B) is similar. op(A) is M x K, op(B) is K x N, and C is M x N.
lda must always be at least the number of columns of A as stored in memory, regardless of whether TransA is set; ldb is similar.
Matrices are stored in row-major order on the CPU but in column-major order on the GPU (cuBLAS). So caffe_cpu_gemm computes C = A*B directly, while caffe_gpu_gemm computes C' = B'*A', which leaves the row-major result C = A*B in memory.