Muhammad Osama neoblizz

neoblizz /
Created December 8, 2022 01:12
HIPIFY CUDA code (entire directory)
find . -type f \( -iname \*.cu -o -iname \*.cuh -o -iname \*.cpp -o -iname \*.hpp -o -iname \*.hxx -o -iname \*.cxx \) -exec hipify-perl -inplace -experimental -print-stats {} \;
neoblizz /
Created November 21, 2022 18:41
Install Latest Doxygen (UNIX)

I got Doxygen version 1.9.6 (6bc5f864d0c847a74944d6e9e4a42346e8c18b28) to build using the following instructions on Ubuntu 18.04.

sudo apt-get install graphviz bison flex
git clone
cd doxygen
mkdir build
cd build
cmake -G "Unix Makefiles" ..
# Default installation at /usr/local/bin/doxygen
neoblizz / launch.cuh
Last active October 23, 2022 03:21
C++ wrapper around cooperative groups launch API.
#pragma once
// Includes CUDA
#include <cuda_runtime.h>
#include <cooperative_groups.h>
#include <utility>
namespace cg = cooperative_groups;
neoblizz /
Last active March 26, 2022 02:01
Simple Speed-of-Light Analysis of SpMM and GEMM

Hardware Speed-of-Light Analysis

The following numbers are based on NVIDIA's Volta microarchitecture. To perform a similar analysis for a newer architecture, I recommend replacing the numbers below with the values reported by the deviceQuery CUDA sample or listed on the architecture's Wikipedia page.

CUDA Cores = SM * Cores per SM (SM = 80, Cores/SM = 64)
Maximum Clock Rate = Clock Rate (kHz) * 1e-6 (GHz)
Maximum Throughput (type == floats, doubles or half) =
    CUDA Cores * Maximum Clock Rate * Type Ratio (device properties) (GFLOP/s)
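
The formulas above can be sketched in plain C++ as follows. This is a minimal illustration, not the author's code: the `DeviceSpec` struct and the function names are hypothetical, and the clock rate and type ratio used in the usage note are assumed example values rather than numbers taken from this analysis. Only SM = 80 and Cores/SM = 64 come from the text.

```cpp
#include <cassert>

// Hypothetical container for the inputs used by the formulas above.
struct DeviceSpec {
  int sm_count;       // number of SMs (80 in the text)
  int cores_per_sm;   // CUDA cores per SM (64 in the text)
  double clock_khz;   // maximum clock rate in kHz (assumed to come from the device properties)
  double type_ratio;  // throughput ratio for the chosen type (assumed 1.0 for floats)
};

// CUDA Cores = SM * Cores per SM
int cuda_cores(const DeviceSpec& d) {
  return d.sm_count * d.cores_per_sm;
}

// Maximum Clock Rate (GHz) = Clock Rate (kHz) * 1e-6
double max_clock_ghz(const DeviceSpec& d) {
  assert(d.clock_khz > 0.0);
  return d.clock_khz * 1e-6;
}

// Maximum Throughput (GFLOP/s) = CUDA Cores * Maximum Clock Rate * Type Ratio,
// exactly as written above, with no extra factor applied.
double max_throughput_gflops(const DeviceSpec& d) {
  return cuda_cores(d) * max_clock_ghz(d) * d.type_ratio;
}
```

With an assumed clock of 1,530,000 kHz and a type ratio of 1.0, `DeviceSpec{80, 64, 1530000.0, 1.0}` gives 5120 CUDA cores and roughly 7833.6 GFLOP/s under this formula.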

Maximum Memory Bandwidth = 
neoblizz / sssp.cpp
Last active March 15, 2022 21:01
Parallel SSSP using C++20.
#include <vector>
#include <algorithm>
#include <execution>
#include <mutex>
#include <utility>
#include <ranges>
struct frontier_t {
// Underlying representation of frontier.
std::vector<int> active_vertices;
neoblizz /
Last active January 17, 2022 19:45
Envisioning `__ignore__` support in NVCC with a simple example.

How crazy is it to imagine an NVCC-supported keyword, something like __ignore__, that causes whatever it precedes (a function, variable, object, etc.) to be ignored on the device side (in __device__ and __global__ code)? This would solve the problem of complicated containers that support both host and device code: their constructors and destructors, which run only on the host, would simply be ignored on the device when the container is passed as a member of a larger class or struct. For example:

__global__ void kernel(foo_t foo) {
  auto idx = threadIdx.x;
  auto ptr = foo.get_ptr();
  ptr[idx] = idx;
neoblizz /
Last active December 7, 2021 19:33
Capturing conditional inheritance in CPP (with pixel shaders as a toy example)

The Problem

We have a top-level object that the user wants to interact with, such as a pixel on the screen. Depending on what that pixel contains, we may want to color or shade it differently: if the pixel represents cloth, it gets the texture and color of cloth; if it represents metal, it is shiny and metal-like... you get the point. To represent this object in C++, we have a number of options. The most obvious one is a single function that colors (or applies some sort of texture to) the pixel and handles each material/color as a separate case inside that function.

Obvious approach

void apply_texture(pixel_t* p, texture_t t) {
  if(t == texture_t::cloth) {
    // apply cloth
  } else if (t == texture_t::skin) {
neoblizz /
Last active May 4, 2021 11:47
CUDA-based implementation to introduce sparsity.
#include <stdio.h>
#include <stdlib.h>
#include <ctime>
#include <random>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
neoblizz /
Created February 9, 2021 12:15 — forked from mbinna/
Effective Modern CMake


Getting Started

For a brief user-level introduction to CMake, watch C++ Weekly, Episode 78, Intro to CMake by Jason Turner. LLVM’s CMake Primer provides a good high-level introduction to the CMake syntax. Go read it now.

After that, watch Mathieu Ropert’s CppCon 2017 talk Using Modern CMake Patterns to Enforce a Good Modular Design (slides). It provides a thorough explanation of what modern CMake is and why it is so much better than “old school” CMake. The modular design ideas in this talk are based on the book [Large-Scale C++ Software Design](

neoblizz /
Last active March 27, 2020 21:15
Symphony Examples

2D Merge-Path Search (by Duane Merrill)

  • Input: Diagonal index, lengths of lists A and B, iterators (pointers) to lists A and B
  • Output: The 2D coordinate (x,y) of the intersection of the merge decision path with the specified grid diagonal
CoordinateT MergePathSearch(int diagonal, int a_len, int b_len, AIteratorT a,
                            BIteratorT b) {
  // Diagonal search range (in x coordinate space)
  int x_min = max(diagonal - b_len, 0);
  int x_max = min(diagonal, a_len);
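
A self-contained sketch of the whole search, for reference. It follows the commonly described merge-path binary search along a diagonal and makes several assumptions not confirmed by the listing: both lists are sorted `int` arrays, ties are resolved by consuming from list A first, and the `Coordinate` struct and the name `merge_path_search` are hypothetical stand-ins for the `CoordinateT` and `MergePathSearch` shown above.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical coordinate type; the definition of CoordinateT is not shown above.
struct Coordinate {
  int x;  // number of elements consumed from list A
  int y;  // number of elements consumed from list B
};

// Finds where the merge decision path crosses the given diagonal.
// Assumes a and b point to sorted arrays of length a_len and b_len.
Coordinate merge_path_search(int diagonal, int a_len, int b_len,
                             const int* a, const int* b) {
  assert(diagonal >= 0 && diagonal <= a_len + b_len);

  // Diagonal search range (in x coordinate space)
  int x_min = std::max(diagonal - b_len, 0);
  int x_max = std::min(diagonal, a_len);

  // Binary search along the diagonal for the crossing point.
  while (x_min < x_max) {
    int pivot = x_min + (x_max - x_min) / 2;
    if (a[pivot] <= b[diagonal - pivot - 1]) {
      // a[pivot] is merged before b[diagonal - pivot - 1]: move right.
      x_min = pivot + 1;
    } else {
      // The crossing point is at or to the left of pivot.
      x_max = pivot;
    }
  }
  return Coordinate{x_min, diagonal - x_min};
}
```

For A = {1, 3, 5}, B = {2, 4, 6} and diagonal 3, the first three merged elements are 1, 2, 3, so the sketch returns (2, 1): two elements taken from A and one from B.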