Skip to content

Instantly share code, notes, and snippets.

View allanmac's full-sized avatar

Allan MacKinnon allanmac

  • Dispatch3 Inc.
  • South Florida, USA
  • 20:36 (UTC -04:00)
  • X @pixelio
View GitHub Profile
@allanmac
allanmac / malloc.cu
Last active October 29, 2015 15:57
Allocate more than 4GB
#include <stdio.h>
//
//
//
static
void
cuda_assert(const cudaError_t code, const char* const file, const int line, const bool abort)
{
@allanmac
allanmac / short4.cu
Created February 7, 2014 03:06
Why I2I?
// -*- compile-command: "nvcc -m 32 -arch sm_35 -Xptxas=-v,-abi=no -cubin short4.cu"; -*-
#include <stdint.h>
typedef uint32_t u32;
typedef uint64_t u64;
typedef union
{
short4 s16v4;
@allanmac
allanmac / sha256.cu
Last active November 10, 2023 01:26
A CUDA SHA-256 subroutine using macro expansion
// -*- compile-command: "nvcc -m 32 -arch sm_35 -Xptxas=-v,-abi=no -cubin sha256.cu"; -*-
//
// Copyright 2013 Allan MacKinnon <allanmac@alum.mit.edu>
//
// Permission is hereby granted, free of charge, to any person obtaining
// a copy of this software and associated documentation files (the
// "Software"), to deal in the Software without restriction, including
// without limitation the rights to use, copy, modify, merge, publish,
// distribute, sublicense, and/or sell copies of the Software, and to
@allanmac
allanmac / float3 SoA to AoS
Last active June 30, 2018 16:12
A strategy for converting a float3 SoA into AoS without using shared memory.
===============================================================================================
Load three arrays (x, y and z) in SoA order, repack them and store them in AoS order.
Strategy: each warp permutes its load lane with:
(rowNum + (laneId() * 3)) & 31
This will convert SoA into AoS but with x/y/z staggered across rows of registers.
@allanmac
allanmac / warp_scan.inl
Last active December 25, 2015 00:28
The macro at the bottom of "warp_scan.inl" is used to declare an optimal CUDA warp scan primitive without using C++ templates and specialization. The macro supports 32-bit PTX types (u32/s32/f32) and can generate inclusive and exclusive scans over any appropriate PTX two-argument operator (add,sub,min,max,mul,div,rem,etc). See examples below.
#pragma once
//
//
//
#define PXL_WARP_SCAN_SHFL(_op,_vT,_opT,_regC,_exc,_exc0,_excP)
////////////////////////////////////////////////////////////////////////
//
@allanmac
allanmac / peer.cu
Created September 15, 2013 18:42
Test all device pairings for peer-to-peer memory access support.
#include <stdio.h>
#include <cuda.h>
int main(int argc, char** argv)
{
cuInit(0);
int count;
@allanmac
allanmac / symarg.cu
Created August 30, 2013 20:59
Inspecting the difference between kernel arguments and __constant__ variables.
//
//
//
#define KERNEL_QUALIFIERS extern "C" __global__
//
//
//
@allanmac
allanmac / scan.cu
Last active September 24, 2016 12:20
Inclusive and exclusive warp-level scan snippets. Evaluating SHFL vs. shared implementations. Also evaluating the simplest transformation of an inclusive scan into an exclusive scan. It's only two ops on sm_3x.
#include <stdio.h>
//
//
//
#define WARP_SIZE 32
#define VOLATILE volatile
@allanmac
allanmac / namespace.cu
Created July 26, 2013 01:03
Namespaces and shared structs.
#define KERNEL_QUALIFIERS __global__
#define VOLATILE volatile
#define DEVICE_INTRINSIC_QUALIFIERS __device__ __forceinline__
#define DEVICE_STATIC_FUNCTION_QUALIFIERS static DEVICE_FUNCTION_QUALIFIERS
#define DEVICE_STATIC_INTRINSIC_QUALIFIERS static DEVICE_INTRINSIC_QUALIFIERS
//
@allanmac
allanmac / sync.cu
Created July 14, 2013 21:21
Examine the SASS that's generated for barrier reduction operations: __syncthreads_count(), __syncthreads_or(), __syncthreads_and() as well as the regular __syncthreads() barrier op. Somewhat surprisingly these are not mapped to a number of SASS ops. The barrier reductions are executed and the result is moved from a "barrier register" to a regula…
//
//
//
#define KERNEL_QUALIFIERS extern "C" __global__
//
//
//