Allan MacKinnon allanmac

  • Dispatch3 Inc.
  • South Florida, USA
  • X @pixelio
allanmac / assert_cuda.c
Last active September 17, 2024 18:34
A tiny example of CUDA + OpenGL interop with write-only surfaces and CUDA kernels. Uses GLFW+GLAD.
//
//
//
#include <stdlib.h>
#include <stdio.h>
//
//
//
allanmac / sha256.cu
Last active November 10, 2023 01:26
A CUDA SHA-256 subroutine using macro expansion
// -*- compile-command: "nvcc -m 32 -arch sm_35 -Xptxas=-v,-abi=no -cubin sha256.cu"; -*-
//
// Copyright 2013 Allan MacKinnon <allanmac@alum.mit.edu>
//
// Permission is hereby granted, free of charge, to any person obtaining
// a copy of this software and associated documentation files (the
// "Software"), to deal in the Software without restriction, including
// without limitation the rights to use, copy, modify, merge, publish,
// distribute, sublicense, and/or sell copies of the Software, and to
allanmac / shflrot.cu
Last active September 21, 2023 22:59
Experiments with shfl.idx/up/down to see how negative indices or offsets are handled. The shuffled value and its predicate are returned.
#include <stdio.h>
//
//
//
#define WARP_SIZE 32
#define KERNEL_QUALIFIERS extern "C" __global__
#define DEVICE_INTRINSIC_QUALIFIERS __device__ __forceinline__
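As a host-side reference for what such an experiment might show, here is a minimal C sketch that models one lane of `shfl` in the idx, up, and down modes. It assumes the per-lane pseudocode given for `shfl` in the PTX ISA documentation (only the low 5 bits of the index or offset are used, and the clamp operand `c` packs a segment mask and a clamp value); it is not taken from shflrot.cu. The names `shfl_model`, `shfl_mode`, `shfl_result`, and `self_check` are hypothetical, and real hardware results should be confirmed by running the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WARP_SIZE 32

typedef enum { SHFL_IDX, SHFL_UP, SHFL_DOWN } shfl_mode;          /* hypothetical */
typedef struct { uint32_t value; bool pred; } shfl_result;        /* hypothetical */

/* Model of one lane executing shfl, assuming the PTX ISA pseudocode:
   b is the source lane or offset, c packs clamp (bits 4:0) and mask (bits 12:8). */
static shfl_result shfl_model(shfl_mode mode, const uint32_t src[WARP_SIZE],
                              int lane, int32_t b, uint32_t c)
{
    int bval = (int)((uint32_t)b & 31u);   /* only the low 5 bits are kept */
    int cval = (int)(c & 31u);
    int mask = (int)((c >> 8) & 31u);
    int max_lane = (lane & mask) | (cval & ~mask & 31);
    int min_lane = lane & mask;
    int j;
    bool pred;

    switch (mode) {
    case SHFL_UP:   j = lane - bval;                    pred = (j >= max_lane); break;
    case SHFL_DOWN: j = lane + bval;                    pred = (j <= max_lane); break;
    default:        j = min_lane | (bval & ~mask & 31); pred = (j <= max_lane); break;
    }
    if (!pred) j = lane;                   /* out of range: read own lane */

    shfl_result r;
    r.value = src[j];
    r.pred  = pred;
    return r;
}

/* Under this model an index of -1 behaves like lane 31 in idx mode. */
static void self_check(void)
{
    uint32_t src[WARP_SIZE];
    for (int i = 0; i < WARP_SIZE; i++) src[i] = 100u + (uint32_t)i;

    shfl_result r = shfl_model(SHFL_IDX, src, 5, -1, 0x1f);
    assert(r.pred && r.value == 131u);
}
```

In this model a negative offset in up or down mode turns into an offset of 31 after masking, so most lanes fall out of range and return their own value with a false predicate.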
allanmac / README.md
Last active June 10, 2023 11:11
Macros for neatly error checking OpenCL API functions.

Simply adding two parentheses cl(...) gives you error checking for OpenCL API functions that return a cl_int error code.

The second macro, cl_ok(err), checks API functions that return their error code through an output argument.

The header also includes a useful function for converting OpenCL errors to strings:

char const * clGetErrorString(cl_int const err);
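The two macros described above could be sketched roughly as follows. This is an illustration of the pattern, not the gist's header: it assumes `cl(...)` works by pasting the `cl` prefix back onto the wrapped call, and it uses stand-in definitions of `cl_int` and `CL_SUCCESS` so that it compiles without the OpenCL headers. The functions `clDemoReturnsError` and `clDemoCreate`, the helper `cl_assert`, and `self_check` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-ins so the sketch builds without <CL/cl.h> (assumption). */
typedef int cl_int;
#define CL_SUCCESS 0

/* Hypothetical API-style functions with the usual "cl" prefix. */
static cl_int clDemoReturnsError(int fail)
{
    return fail ? -5 : CL_SUCCESS;          /* error code is the return value */
}

static int clDemoCreate(int fail, cl_int *err)
{
    *err = fail ? -6 : CL_SUCCESS;          /* error code is an output argument */
    return 42;
}

/* Hypothetical checker: reports a failure and optionally exits. */
static cl_int cl_assert(cl_int code, const char *file, int line, int abort_on_error)
{
    if (code != CL_SUCCESS) {
        fprintf(stderr, "\"%s\", line %d: OpenCL error %d\n", file, line, code);
        if (abort_on_error)
            exit(EXIT_FAILURE);
    }
    return code;
}

/* cl(Foo(args)) expands to a checked call of clFoo(args). */
#define cl(...)    cl_assert((cl##__VA_ARGS__), __FILE__, __LINE__, 0)
/* cl_ok(err) checks an error code that was written through an argument. */
#define cl_ok(err) cl_assert((err), __FILE__, __LINE__, 0)

static void self_check(void)
{
    cl_int err;
    assert(cl(DemoReturnsError(0)) == CL_SUCCESS);
    int handle = clDemoCreate(0, &err);
    assert(cl_ok(err) == CL_SUCCESS && handle == 42);
}
```

The sketch passes `0` for `abort_on_error` so that failures are only reported; a production header would more likely abort on the first error.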
allanmac / warp_scan.cu
Created August 6, 2016 18:48
Inclusive vs. exclusive warp scan
#include <stdio.h>
#include <stdint.h>
#define WARP_SIZE 32
//
//
//
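To make the inclusive/exclusive distinction concrete, here is a small host-side C model of a 32-lane scan. It is a sketch, not the gist's kernel: each pass stands in for one shuffle-up step of a log-step (Hillis-Steele) scan, and the exclusive result is derived by subtracting each lane's own input, which is only one of several ways to obtain it. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define WARP_SIZE 32

/* Inclusive scan: out[i] = in[0] + ... + in[i].
   Each pass models a shuffle-up by d: lanes i >= d add the value
   that lane (i - d) held before the pass. */
static void warp_scan_inclusive(const uint32_t in[WARP_SIZE], uint32_t out[WARP_SIZE])
{
    uint32_t cur[WARP_SIZE], next[WARP_SIZE];

    for (int i = 0; i < WARP_SIZE; i++)
        cur[i] = in[i];

    for (int d = 1; d < WARP_SIZE; d *= 2) {
        for (int i = 0; i < WARP_SIZE; i++)
            next[i] = (i >= d) ? cur[i] + cur[i - d] : cur[i];
        for (int i = 0; i < WARP_SIZE; i++)
            cur[i] = next[i];
    }

    for (int i = 0; i < WARP_SIZE; i++)
        out[i] = cur[i];
}

/* Exclusive scan: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
   Obtained here by removing each lane's own input from the inclusive sum.
   in and out must not alias. */
static void warp_scan_exclusive(const uint32_t in[WARP_SIZE], uint32_t out[WARP_SIZE])
{
    warp_scan_inclusive(in, out);
    for (int i = 0; i < WARP_SIZE; i++)
        out[i] -= in[i];
}

/* With an input of all ones, lane i gets i + 1 (inclusive) or i (exclusive). */
static void self_check(void)
{
    uint32_t in[WARP_SIZE], inc[WARP_SIZE], exc[WARP_SIZE];
    for (int i = 0; i < WARP_SIZE; i++) in[i] = 1u;

    warp_scan_inclusive(in, inc);
    warp_scan_exclusive(in, exc);
    for (int i = 0; i < WARP_SIZE; i++) {
        assert(inc[i] == (uint32_t)i + 1u);
        assert(exc[i] == (uint32_t)i);
    }
}
```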
allanmac / cub_sort.cu
Last active June 17, 2022 17:29
Benchmark CUB Radix Sort with uniformly random data
//
// Build:
//
// nvcc -lcurand --generate-code arch=compute_50,code=compute_50 --generate-code arch=compute_75,code=compute_75 -D CUB_SORT_TYPE=uint32_t -o sort_cub_32 cub_sort.cu
// nvcc -lcurand --generate-code arch=compute_50,code=compute_50 --generate-code arch=compute_75,code=compute_75 -D CUB_SORT_TYPE=uint64_t -o sort_cub_64 cub_sort.cu
//
#define THRUST_IGNORE_CUB_VERSION_CHECK
#include <curand.h>
allanmac / ck_2.cu
Last active May 19, 2022 03:08
Concurrent kernel test that demonstrates _different_ kernels running concurrently. Hacked from NVIDIA's example. ck_2.cu has two kernels each requiring half of an sm_50 multiprocessor's shared memory. Kernel "a" is run on 5 out of 6 launches, otherwise kernel "b" is launched. ck_6.cu has six kernels.
/*
* Copyright 1993-2015 NVIDIA Corporation. All rights reserved.
*
* Please refer to the NVIDIA end user license agreement (EULA) associated
* with this source code for terms and conditions that govern your use of
* this software. Any use, reproduction, disclosure, or distribution of
* this software and related documentation outside the terms of the EULA
* is strictly prohibited.
*
*/
allanmac / sort.cu
Last active May 30, 2021 14:06
Thrust Radix Sort benchmark
// -*- compile-command: "nvcc -D THRUST_SORT_TYPE=uint32_t -arch sm_50 -o sort sort_32.cu"; -*-
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <algorithm>
#include <cstdlib>
allanmac / lop3.cu
Last active March 25, 2020 23:54
Test to see if the bit hack "Conditionally set or clear bits without branching" maps to a single Maxwell LOP3.LUT opcode
// -*- compile-command: "nvcc -m 32 -arch sm_50 -Xptxas=-v,-abi=no -cubin lop3.cu"; -*-
#define KERNEL_QUALIFIERS extern "C" __global__
//
// Bit hack: "Conditionally set or clear bits without branching"
// http://graphics.stanford.edu/~seander/bithacks.html#ConditionalSetOrClearBitsWithoutBranching
//
// This bit hack *should* map to a single LOP3.LUT opcode:
//
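For reference, the bit hack named above can be written in plain C as below. The two branch-free forms are the ones listed on the Stanford bit hacks page; the branching version and the `self_check` function are added here for comparison. The constant `LOP3_LUT_MUX` is an illustration only: it applies the truth-table convention from the PTX `lop3` documentation (inputs 0xF0, 0xCC, 0xAA) under the assumption that the operands are ordered as (w, m, -f), and the compiler is free to pick a different order and therefore a different immediate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reference: if (f) w |= m; else w &= ~m; */
static uint32_t set_or_clear_branchy(uint32_t w, uint32_t m, bool f)
{
    return f ? (w | m) : (w & ~m);
}

/* Bit hack, XOR form: w ^= (-f ^ w) & m; */
static uint32_t set_or_clear_xor(uint32_t w, uint32_t m, bool f)
{
    uint32_t n = 0u - (uint32_t)f;          /* all ones if f, else zero */
    return w ^ ((n ^ w) & m);
}

/* Bit hack, select form: w = (w & ~m) | (-f & m); */
static uint32_t set_or_clear_mux(uint32_t w, uint32_t m, bool f)
{
    uint32_t n = 0u - (uint32_t)f;
    return (w & ~m) | (n & m);
}

/* Truth table of the select form as one three-input function, assuming
   operand order (a, b, c) = (w, m, -f): evaluates to 0xB8. */
#define LOP3_LUT_MUX (((0xF0u & ~0xCCu) | (0xAAu & 0xCCu)) & 0xFFu)

static void self_check(void)
{
    uint32_t w = 0xF0F0F0F0u, m = 0x0000FFFFu;

    assert(set_or_clear_branchy(w, m, true)  == 0xF0F0FFFFu);
    assert(set_or_clear_branchy(w, m, false) == 0xF0F00000u);
    assert(set_or_clear_xor(w, m, true)  == set_or_clear_branchy(w, m, true));
    assert(set_or_clear_xor(w, m, false) == set_or_clear_branchy(w, m, false));
    assert(set_or_clear_mux(w, m, true)  == set_or_clear_branchy(w, m, true));
    assert(set_or_clear_mux(w, m, false) == set_or_clear_branchy(w, m, false));
    assert(LOP3_LUT_MUX == 0xB8u);
}
```

Because the whole expression depends on only three 32-bit inputs, it is a candidate for a single three-input logic instruction, which is what the gist sets out to test.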
allanmac / probe_bw.cu
Last active October 10, 2019 15:27
Measure achieved bandwidth when performing 128, 256 or 512 byte transactions on a multi-megabyte extent. This appears to reproduce @Genoil's original findings: https://devtalk.nvidia.com/default/topic/878455/cuda-programming-and-performance/gtx750ti-and-buffers-gt-1gb-on-win7
// -*- compile-command: "nvcc -m 64 -arch compute_30 -Xptxas=-v -o probe_bw probe_bw.cu"; -*-
//
// Copyright 2015 Allan MacKinnon <allanmac@alum.mit.edu>
//
// Permission is hereby granted, free of charge, to any person obtaining
// a copy of this software and associated documentation files (the
// "Software"), to deal in the Software without restriction, including
// without limitation the rights to use, copy, modify, merge, publish,
// distribute, sublicense, and/or sell copies of the Software, and to