Allan MacKinnon allanmac

  • Dispatch3 Inc.
  • South Florida, USA
  • X @pixelio
__global__
void fmaTest(float* const values)
{
const unsigned int tidx = threadIdx.x;
const float b = values[ tidx];
float a = values[2*tidx];
a = __fmaf_rn(a, b, 0.73f);
a = __fmaf_rn(a, b, 0.37f);
allanmac / gmem.cu
Created February 6, 2013 20:20
Exercise vector load and store. Also evaluate LDG.CT.
//
//
//
#define WARP_SIZE 32
#define RESTRICT __restrict
//
allanmac / smid.cu
Last active December 12, 2015 09:19
Probe the CUDA special registers %smid and %nsmid.
#include <stdio.h>
//
//
//
#define DEVICE_INTRINSIC_QUALIFIERS __device__ __forceinline__
DEVICE_INTRINSIC_QUALIFIERS
unsigned int
allanmac / geff.cu
Last active December 14, 2015 01:29
Example kernel used to observe Global Load/Store Efficiency metrics in the Visual Profiler.
#include <stdio.h>
//
//
//
#define TYPE unsigned int
#define REPS 1
allanmac / natural.cu
Last active December 14, 2015 06:38
The `setp` and `selp` instructions are your friends.
extern "C"
__global__
void natural(const unsigned int b,
const unsigned int c,
const unsigned int y,
const unsigned int z,
const unsigned int id,
unsigned int* const out)
{
const bool flag = (id == 1);
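The preview above is cut off right after the predicate is computed, so the rest of the kernel is not reproduced here. As a rough illustration of the idea in the description, the sketch below shows a branch-free select written as a ternary, which a compiler can lower to a `setp` (set predicate) followed by a `selp` (select on predicate) in PTX rather than a branch. The helper names `select_u32` and `pick_for_id` are hypothetical and not taken from the gist; the code is written so that it builds as plain C++ and, under nvcc, as host/device CUDA.

```cpp
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__ __forceinline__
#else
#define HOST_DEVICE inline
#endif

// Hypothetical helper (not from the gist): pick one of two values
// based on a predicate, with no branch in the source.
HOST_DEVICE
unsigned int select_u32(const bool         flag,
                        const unsigned int if_true,
                        const unsigned int if_false)
{
  return flag ? if_true : if_false;
}

// Hypothetical usage mirroring the preview's predicate (id == 1).
// Which values the real kernel selects between is an assumption.
HOST_DEVICE
unsigned int pick_for_id(const unsigned int id,
                         const unsigned int b,
                         const unsigned int c)
{
  const bool flag = (id == 1);
  return select_u32(flag, b, c);
}
```

Whether the compiler actually emits `setp`/`selp` depends on the toolchain and target, so the generated PTX is worth inspecting (for example with `nvcc -ptx`).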
allanmac / fdimf.cu
Last active December 14, 2015 17:49
Inspecting fdimf() output.
#define KERNEL_QUALIFIERS extern "C" __global__
KERNEL_QUALIFIERS
void fdimfTest(const float x, const float y, float* const fout)
{
fout[threadIdx.x] = fdimf(x,y);
}
KERNEL_QUALIFIERS
void fdimfTest2(const float x, const float y, float* const fout)
allanmac / shflmax.cu
Created March 14, 2013 20:38
Butterfly max.
#define KERNEL_QUALIFIERS extern "C" __global__
KERNEL_QUALIFIERS
void shflmax(const int* const vin, int* const vout)
{
int v = vin[threadIdx.x];
v = max(v,__shfl_xor(v,16));
v = max(v,__shfl_xor(v, 8));
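The preview stops after the first two exchange steps. A hedged sketch of the full butterfly pattern follows: each lane repeatedly takes the max of its own value and the value held by the lane whose index differs in one bit (XOR masks 16, 8, 4, 2, 1), so after five steps every lane holds the warp-wide maximum. The device function uses `__shfl_xor_sync`, the CUDA 9+ replacement for the `__shfl_xor` call in the gist, with a full-warp mask as an assumption. The host function `butterfly_max_reference` is a hypothetical model of the same data movement that runs without a GPU.

```cpp
#include <algorithm>
#include <array>

constexpr int kWarpSize = 32; // assumption: 32-lane warps

#ifdef __CUDACC__
// Device version: all 32 lanes are assumed to participate.
__device__ __forceinline__
int warp_max(int v)
{
  for (int mask = kWarpSize / 2; mask > 0; mask >>= 1)
    v = max(v, __shfl_xor_sync(0xFFFFFFFFu, v, mask));
  return v;
}
#endif

// Host reference model: lanes[i] plays the role of lane i's register.
inline std::array<int, kWarpSize>
butterfly_max_reference(std::array<int, kWarpSize> lanes)
{
  for (int mask = kWarpSize / 2; mask > 0; mask >>= 1)
  {
    // Snapshot so every lane reads its partner's pre-step value,
    // as a shuffle does.
    const std::array<int, kWarpSize> prev = lanes;

    for (int lane = 0; lane < kWarpSize; ++lane)
      lanes[lane] = std::max(prev[lane], prev[lane ^ mask]);
  }
  return lanes;
}
```

Unlike a tree reduction that leaves the result in lane 0 only, the XOR butterfly leaves the maximum in every lane, which avoids a final broadcast.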
allanmac / bfe64.cu
Last active December 16, 2015 23:29
What is the best way to extract up to 32 bits that straddle the 32-bit boundary of a 64-bit word given a constant starting position and number of bits? On sm_35 the SHF.R.CLAMP opcode can accomplish this in two instructions. For sm_12-sm_30 devices as many as four instructions are required.
#include <stdio.h>
//
//
//
#define DEVICE_INTRINSIC_QUALIFIERS __device__ __forceinline__
//
//
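The gist's own implementation is not visible in this preview. As a minimal sketch of the portable approach the description alludes to, the function below extracts a field that straddles the 32-bit boundary using two shifts, an OR, and an AND on the two 32-bit halves. The name `bfe64_straddle` and its parameter layout are hypothetical. It assumes `0 < pos < 32`, `1 <= len <= 32`, and `pos + len > 32`, so no shift amount ever reaches 32.

```cpp
#include <assert.h>
#include <stdint.h>

#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__ __forceinline__
#else
#define HOST_DEVICE inline
#endif

// Hypothetical helper: extract `len` bits starting at bit `pos` of the
// 64-bit word formed by (hi:lo), where the field crosses bit 32.
HOST_DEVICE
uint32_t bfe64_straddle(const uint32_t lo,
                        const uint32_t hi,
                        const unsigned pos,
                        const unsigned len)
{
  // Preconditions stated in the lead-in; they keep every shift in 0..31.
  assert(pos > 0 && pos < 32 && len >= 1 && len <= 32 && pos + len > 32);

  // Low 32 bits of (hi:lo) >> pos, built from the two halves.
  const uint32_t shifted = (lo >> pos) | (hi << (32 - pos));

  // Keep only the lowest `len` bits; len == 32 yields an all-ones mask.
  const uint32_t mask = 0xFFFFFFFFu >> (32 - len);

  return shifted & mask;
}
```

When `pos` and `len` are compile-time constants the mask folds to an immediate. On devices with a funnel shifter, the two shifts and the OR could presumably be replaced by `__funnelshift_r(lo, hi, pos)`, which matches the shorter sequence the description mentions for sm_35.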
allanmac / Makefile
Last active December 17, 2015 23:48
Demonstrate the impact of the resident block limit on grids with 32-thread "tinyblocks".
all:
	nvcc -m 32 -Xptxas=-v,-abi=no \
	     -gencode=arch=compute_11,code=sm_11 \
	     -gencode=arch=compute_12,code=sm_12 \
	     -gencode=arch=compute_20,code=sm_21 \
	     -gencode=arch=compute_30,code=sm_30 \
	     -gencode=arch=compute_35,code=sm_35 \
	     blocks.cu -o blocks
allanmac / mem4.cu
Last active December 19, 2015 14:49
Difference between 4 sequential 32-bit loads and 1 128-bit (4x32-bit) vector load.
#define KERNEL_QUALIFIERS extern "C" __global__
//
//
//
#define REPEAT1() \
REPS(0)
#define REPEAT4() \