Allan MacKinnon allanmac

  • Dispatch3 Inc.
  • South Florida, USA
  • X @pixelio
allanmac / short4.cu
Created February 7, 2014 03:06
Why I2I?
// -*- compile-command: "nvcc -m 32 -arch sm_35 -Xptxas=-v,-abi=no -cubin short4.cu"; -*-
#include <stdint.h>
typedef uint32_t u32;
typedef uint64_t u64;
typedef union
{
short4 s16v4;
allanmac / int_ipc.cu
Last active August 29, 2015 14:06
Try to push an integer math kernel's IPC metric as high as possible!
// -*- compile-command: "nvcc -m 32 -arch sm_50 -Xptxas=-v,-abi=no int_ipc.cu -o int_ipc" ; -*-
//
//
//
#define KERNEL_QUALIFIERS __global__
#define KERNEL_QUALIFIERS_EXTERN extern KERNEL_QUALIFIERS
#define KERNEL_QUALIFIERS_EXTERN_C extern "C" KERNEL_QUALIFIERS
allanmac / xmad.cu
Created October 7, 2014 22:07
Try to generate XMAD instructions
// -*- compile-command: "nvcc -m 32 -arch sm_50 -Xptxas=-v,-abi=no -cubin xmad.cu"; -*-
//
//
//
#define KERNEL_QUALIFIERS extern "C" __global__
#define RESTRICT __restrict__
//
allanmac / kth.cu
Last active August 29, 2015 14:12
Acquire the indices of all set bits in a 32-bit word. For Fermi and Kepler the indices will be captured from MSB to LSB. For Maxwell the indices are ordered from LSB to MSB.
// -*- compile-command: "nvcc -m 32 -arch compute_20 -Xptxas=-v,-abi=no -cubin kth.cu"; -*-
#include <stdio.h>
#include <stdint.h>
//
//
//
typedef uint8_t u8;
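
The two orderings named in the description can be sketched as follows. This is a minimal illustration, not the gist's actual code (the preview above is truncated): the function names, the kernel, and the host wrapper `run_set_bits` are hypothetical, and the sketch simply uses the standard `__clz()` and `__ffs()` intrinsics to walk the word from either end.

```cuda
#include <stdint.h>

// Hypothetical helper: stores the index of every set bit in `word`, highest
// bit first, and returns the number of indices written (at most 32).
__device__ int set_bits_msb_first(uint32_t word, uint8_t* out)
{
  int n = 0;
  while (word != 0) {
    const int idx = 31 - __clz((int)word); // position of the highest set bit
    out[n++] = (uint8_t)idx;
    word &= ~(1u << idx);                  // clear that bit
  }
  return n;
}

// Hypothetical helper: same result set, lowest bit first.
__device__ int set_bits_lsb_first(uint32_t word, uint8_t* out)
{
  int n = 0;
  while (word != 0) {
    out[n++] = (uint8_t)(__ffs((int)word) - 1); // __ffs() is 1-based
    word &= word - 1;                           // clear the lowest set bit
  }
  return n;
}

__global__ void set_bits_kernel(uint32_t word, uint8_t* msb, uint8_t* lsb, int* count)
{
  *count = set_bits_msb_first(word, msb);
  set_bits_lsb_first(word, lsb);
}

// Host wrapper for experimentation; error checking is omitted for brevity.
// `msb` and `lsb` must each have room for 32 entries.
int run_set_bits(uint32_t word, uint8_t* msb, uint8_t* lsb)
{
  uint8_t *d_msb, *d_lsb;
  int *d_count, count = -1;
  cudaMalloc(&d_msb, 32);
  cudaMalloc(&d_lsb, 32);
  cudaMalloc(&d_count, sizeof(int));
  set_bits_kernel<<<1, 1>>>(word, d_msb, d_lsb, d_count);
  cudaMemcpy(msb, d_msb, 32, cudaMemcpyDeviceToHost);
  cudaMemcpy(lsb, d_lsb, 32, cudaMemcpyDeviceToHost);
  cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_msb);
  cudaFree(d_lsb);
  cudaFree(d_count);
  return count;
}
```

For the word `0x80000005` the first variant yields 31, 2, 0 and the second yields 0, 2, 31. Which ordering is cheaper on a given architecture depends on the instructions the compiler selects, so inspecting the generated SASS is the only reliable check.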
allanmac / ipc.cu
Created January 5, 2015 16:14
Try to push a kernel's IPC metric as high as possible!
// -*- compile-command: "nvcc -m 32 -arch sm_50 -Xptxas=-v,-abi=no ipc.cu -o ipc" ; -*-
//
//
//
#define KERNEL_QUALIFIERS __global__
#define KERNEL_QUALIFIERS_EXTERN extern KERNEL_QUALIFIERS
#define KERNEL_QUALIFIERS_EXTERN_C extern "C" KERNEL_QUALIFIERS
allanmac / scan_64.cu
Last active August 29, 2015 14:13
Signed 64-bit reduce-add using shuffles (untested)
// -*- compile-command: "nvcc -m 32 -arch sm_50 -Xptxas=-v,-abi=no -cubin scan_64.cu" ; -*-
#include <stdint.h>
//
//
//
#define KERNEL_QUALIFIERS __global__
#define KERNEL_QUALIFIERS_EXTERN extern KERNEL_QUALIFIERS
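
One way to reduce signed 64-bit values with 32-bit shuffles is to split each value into two 32-bit halves, shuffle the halves separately, and reassemble them before adding. The sketch below assumes CUDA 9 or later (it uses `__shfl_xor_sync`; toolchains from the gist's era offered the unsynchronized `__shfl_xor` instead) and a full warp of 32 active lanes. The names `warp_reduce_add_s64`, `reduce_kernel`, and `run_warp_reduce` are hypothetical and are not taken from the gist.

```cuda
#include <stdint.h>

// Butterfly (XOR) reduction: after five steps every lane holds the sum of all
// 32 lanes. Each 64-bit value travels as two 32-bit shuffles.
__device__ int64_t warp_reduce_add_s64(int64_t v)
{
  for (int offset = 16; offset > 0; offset >>= 1) {
    uint32_t lo = (uint32_t)((uint64_t)v);        // low 32 bits
    uint32_t hi = (uint32_t)((uint64_t)v >> 32);  // high 32 bits
    lo = __shfl_xor_sync(0xffffffffu, lo, offset);
    hi = __shfl_xor_sync(0xffffffffu, hi, offset);
    v += (int64_t)(((uint64_t)hi << 32) | lo);    // reassemble, then add
  }
  return v;
}

__global__ void reduce_kernel(const int64_t* in, int64_t* out)
{
  const int64_t sum = warp_reduce_add_s64(in[threadIdx.x]);
  if (threadIdx.x == 0)
    *out = sum;
}

// Host wrapper for experimentation; expects exactly 32 input values.
// Error checking is omitted for brevity.
int64_t run_warp_reduce(const int64_t* host_in)
{
  int64_t *d_in, *d_out, result = 0;
  cudaMalloc(&d_in, 32 * sizeof(int64_t));
  cudaMalloc(&d_out, sizeof(int64_t));
  cudaMemcpy(d_in, host_in, 32 * sizeof(int64_t), cudaMemcpyHostToDevice);
  reduce_kernel<<<1, 32>>>(d_in, d_out);
  cudaMemcpy(&result, d_out, sizeof(int64_t), cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  return result;
}
```

Recent toolkits also provide 64-bit overloads of the shuffle intrinsics, so the manual split is mainly useful for seeing what the hardware actually moves.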
allanmac / f16.cu
Last active August 29, 2015 14:13
Blind attempt at getting f16v2 ops to work... Totally untested.
// -*- compile-command: "nvcc -m 32 -arch sm_52 -Xptxas=-v,-abi=no -cubin f16.cu" ; -*-
#include <stdint.h>
//
//
//
#define KERNEL_QUALIFIERS __global__
#define KERNEL_QUALIFIERS_EXTERN extern KERNEL_QUALIFIERS
allanmac / ldg.cu
Created March 5, 2015 00:20
ld.global.nc (LDG.CI) operations not being generated when const+restrict pointers are within a const struct passed as a kernel argument
//
//
//
#define WARP_SIZE 32
#define KERNEL_QUALIFIERS extern "C" __global__
#define DEVICE_INTRINSIC_QUALIFIERS __device__ __forceinline__
#define RESTRICT __restrict__
//
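
To make the situation in the description concrete, the sketch below passes `const` + `__restrict__` pointers inside a struct kernel argument and requests the read-only load explicitly with `__ldg()`, which is available on compute capability 3.5 and later. The struct `scale_args`, the kernel `scale_kernel`, and the wrapper `run_scale` are hypothetical and are not the gist's code; whether the compiler would emit `ld.global.nc` here without the intrinsic depends on the toolchain version.

```cuda
#define RESTRICT __restrict__

// Hypothetical argument block: qualified pointers wrapped in a struct.
struct scale_args
{
  const float* RESTRICT in;
  float*       RESTRICT out;
  float                 scale;
  int                   n;
};

extern "C" __global__ void scale_kernel(const scale_args args)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < args.n) {
    // Explicit read-only data cache load instead of relying on the compiler
    // to infer it from the const/restrict qualifiers.
    args.out[i] = args.scale * __ldg(args.in + i);
  }
}

// Host wrapper for experimentation; error checking is omitted for brevity.
void run_scale(const float* in, float* out, int n, float scale)
{
  float *d_in, *d_out;
  cudaMalloc(&d_in, n * sizeof(float));
  cudaMalloc(&d_out, n * sizeof(float));
  cudaMemcpy(d_in, in, n * sizeof(float), cudaMemcpyHostToDevice);
  const scale_args args = { d_in, d_out, scale, n };
  scale_kernel<<<(n + 127) / 128, 128>>>(args);
  cudaMemcpy(out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
}
```

Compiling with `-cubin` and disassembling with `cuobjdump -sass` shows whether the load was emitted as a non-coherent (read-only) load.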
allanmac / malloc.cu
Last active October 29, 2015 15:57
Allocate more than 4GB
#include <stdio.h>
//
//
//
static
void
cuda_assert(const cudaError_t code, const char* const file, const int line, const bool abort)
{
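
A minimal sketch of requesting more than 4 GB follows. It assumes a 64-bit build (with `-m 32`, `size_t` cannot represent the size), and the helpers `five_gib` and `try_alloc_bytes` are hypothetical names, not the gist's `cuda_assert`. The main pitfall it illustrates is computing the byte count in `size_t` rather than `int`, where `5 << 30` would overflow.

```cuda
#include <stdio.h>
#include <stddef.h>

// A 32-bit size_t cannot hold a byte count above 4 GB.
static_assert(sizeof(size_t) == 8, "build as a 64-bit application");

// Widen to size_t before shifting so the arithmetic is done in 64 bits.
static size_t five_gib(void)
{
  return (size_t)5 << 30;
}

// Hypothetical helper: tries to allocate and immediately free `bytes` of
// device memory, returning the first error encountered.
static cudaError_t try_alloc_bytes(const size_t bytes)
{
  void* d_ptr = NULL;
  const cudaError_t err = cudaMalloc(&d_ptr, bytes);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc(%zu) failed: %s\n", bytes, cudaGetErrorString(err));
    return err;
  }
  return cudaFree(d_ptr);
}
```

Calling `try_alloc_bytes(five_gib())` succeeds only when the device has that much free memory; `cudaMemGetInfo` can be queried first to find out.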
allanmac / fmuladd.cu
Last active December 10, 2015 08:38
__global__
void fmuladdTest(float* const values)
{
const unsigned int tidx = threadIdx.x;
const float b = values[ tidx];
float a = values[2*tidx];
a = __fmul_rn(a, b);
a = __fadd_rn(a, 0.73f);
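
The preview above is cut off, so the following is only a sketch of what the `__fmul_rn` / `__fadd_rn` pairing is generally used for: those intrinsics are never contracted into a fused multiply-add, so the product is rounded before the addition. The kernel `mul_add_variants` and the wrapper `run_mul_add` are hypothetical; they compute the same expression once with an explicit FMA (`__fmaf_rn`, one rounding) and once with the separate intrinsics (two roundings).

```cuda
__global__ void mul_add_variants(float a, float b, float c,
                                 float* fused, float* separate)
{
  // Single rounding: a * b + c is evaluated exactly, then rounded once.
  *fused = __fmaf_rn(a, b, c);
  // Two roundings: the product is rounded to float before the add, and the
  // compiler does not merge these two intrinsics into an FMA.
  *separate = __fadd_rn(__fmul_rn(a, b), c);
}

// Host wrapper for experimentation; error checking is omitted for brevity.
void run_mul_add(float a, float b, float c, float* fused, float* separate)
{
  float* d_out;
  float  h_out[2] = { 0.0f, 0.0f };
  cudaMalloc(&d_out, 2 * sizeof(float));
  mul_add_variants<<<1, 1>>>(a, b, c, d_out, d_out + 1);
  cudaMemcpy(h_out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_out);
  *fused    = h_out[0];
  *separate = h_out[1];
}
```

With `a = b = 1 + 2^-12` and `c = -(1 + 2^-11)`, the exact product is `1 + 2^-11 + 2^-24`. Rounding it to `float` drops the `2^-24` term, so the separate path returns 0 while the fused path returns `2^-24`.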