Muhammad Osama neoblizz

neoblizz /
Last active Mar 26, 2022
Simple Speed-of-Light Analysis of SpMM and GEMM

Hardware Speed-of-Light Analysis

The following numbers are based on NVIDIA's Volta microarchitecture. To perform a similar analysis for a newer architecture, I recommend replacing the numbers below with values from the deviceQuery CUDA sample or the architecture's Wikipedia page.

CUDA Cores = SM * Cores per SM (SM = 80, Cores/SM = 64)
Maximum Clock Rate = Clock Rate (KHz) * 1e-6 (GHz)
Maximum Throughput (type == floats, doubles or half) =
    CUDA Cores * Maximum Clock Rate * Type Ratio (device properties) (GFLOP/s)
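
As a minimal sketch, the throughput formula above can be written as a small C++ helper. The function name `peak_gflops` and its parameter names are hypothetical and do not come from the gist. The clock rate is expected in kHz, the unit used in the formula above. Vendor peak figures usually count a fused multiply-add as two operations; the formula above does not state that factor, so this sketch leaves it to the caller through `type_ratio`.

```cpp
// Hypothetical helper (not from the gist): peak arithmetic throughput in
// GFLOP/s, computed as CUDA Cores * Maximum Clock Rate (GHz) * Type Ratio.
double peak_gflops(int sm_count, int cores_per_sm, double clock_rate_khz,
                   double type_ratio) {
  // CUDA Cores = SM * Cores per SM
  const double cuda_cores = static_cast<double>(sm_count) * cores_per_sm;
  // Maximum Clock Rate = Clock Rate (kHz) * 1e-6 (GHz)
  const double clock_ghz = clock_rate_khz * 1e-6;
  return cuda_cores * clock_ghz * type_ratio;
}
```

With the Volta values above (SM = 80, Cores/SM = 64) and an assumed clock of 1,530,000 kHz (an illustrative value, not taken from the text), `peak_gflops(80, 64, 1530000.0, 1.0)` gives roughly 7,834 GFLOP/s before any FMA factor is applied.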

Maximum Memory Bandwidth = 
neoblizz / sssp.cpp
Last active Mar 15, 2022
Parallel SSSP using C++20.
#include <vector>
#include <algorithm>
#include <execution>
#include <mutex>
#include <utility>
#include <ranges>
struct frontier_t {
// Underlying representation of frontier.
std::vector<int> active_vertices;
neoblizz /
Last active Jan 17, 2022
Envisioning `__ignore__` support in NVCC with a simple example.

How crazy is it to imagine an NVCC-supported keyword, something like __ignore__, where if you use it in front of an expression (function, variable, object, etc.), that expression is ignored on the device side (in __device__ and __global__ code)? This would solve the issue with complicated containers that support both host and device code: their constructors/destructors, which run on the host, would simply be ignored on the device when the container is passed as a member of a larger class or struct. For example:

__global__ void kernel(foo_t foo) {
  auto idx = threadIdx.x;
  auto ptr = foo.get_ptr();
  ptr[idx] = idx;
neoblizz /
Last active Dec 7, 2021
Capturing conditional inheritance in CPP (with pixel shaders as a toy example)

The Problem

We have a top-level object that the user wants to interact with, such as a pixel on the screen. Depending on what that pixel contains, we may choose to color/shade it differently. If the pixel represents cloth, it may have the texture and color of cloth; if it represents metal, it may be shiny and metal-like... you get the point. To represent this object in C++, we have a number of options. The most obvious one is a single function that colors (or applies some sort of texture to) the pixel and contains the different specializations for the materials/colors within that function.

Obvious approach

void apply_texture(pixel_t* p, texture_t t) {
  if(t == texture_t::cloth) {
    // apply cloth
  } else if (t == texture_t::skin) {
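
A self-contained sketch of this branching approach is given below. The definitions of `texture_t` and `pixel_t`, the `metal` enumerator, and the `shading` field are assumptions made for illustration; the gist's actual types are not shown in this preview.

```cpp
#include <string>

// Hypothetical material tags; the real enumerators may differ.
enum class texture_t { cloth, skin, metal };

// Hypothetical pixel type that only records which shading was applied.
struct pixel_t {
  std::string shading = "none";
};

// One function holds every material-specific branch.
void apply_texture(pixel_t* p, texture_t t) {
  if (t == texture_t::cloth) {
    p->shading = "cloth";  // apply cloth
  } else if (t == texture_t::skin) {
    p->shading = "skin";   // apply skin
  } else if (t == texture_t::metal) {
    p->shading = "metal";  // apply metal
  }
}
```

Every new material adds another branch to this one function, which is the cost of keeping all specializations in a single place.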
neoblizz /
Last active May 4, 2021
CUDA-based implementation to introduce sparsity.
#include <stdio.h>
#include <stdlib.h>
#include <ctime>
#include <random>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>

Effective Modern CMake

Getting Started

For a brief user-level introduction to CMake, watch C++ Weekly, Episode 78, Intro to CMake by Jason Turner. LLVM’s CMake Primer provides a good high-level introduction to the CMake syntax. Go read it now.

After that, watch Mathieu Ropert’s CppCon 2017 talk Using Modern CMake Patterns to Enforce a Good Modular Design (slides). It provides a thorough explanation of what modern CMake is and why it is so much better than “old school” CMake. The modular design ideas in this talk are based on the book [Large-Scale C++ Software Design](

neoblizz /
Last active Mar 27, 2020
Symphony Examples

2D Merge-Path Search (by Duane Merrill)

  • Input: Diagonal index, lengths of lists A and B, iterators (pointers) to lists A and B
  • Output: The 2D coordinate (x,y) of the intersection of the merge decision path with the specified grid diagonal
CoordinateT MergePathSearch(int diagonal, int a_len, int b_len, AIteratorT a,
                            BIteratorT b) {
  // Diagonal search range (in x coordinate space)
  int x_min = max(diagonal - b_len, 0);
  int x_max = min(diagonal, a_len);
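
A complete search with the input and output described above can be sketched as follows. This follows the commonly published form of the merge-path binary search and is not necessarily the gist's exact code. The names `coordinate_t` and `merge_path_search` are hypothetical, and both lists are assumed to be sorted in ascending order.

```cpp
#include <algorithm>

// Hypothetical 2D coordinate type: x indexes list A, y indexes list B.
struct coordinate_t {
  int x;
  int y;
};

// Binary search along one grid diagonal for the point where the merge
// decision path crosses it. Assumes a and b are sorted ascending and
// support operator[] (raw pointers or random-access iterators).
template <typename AIteratorT, typename BIteratorT>
coordinate_t merge_path_search(int diagonal, int a_len, int b_len,
                               AIteratorT a, BIteratorT b) {
  // Diagonal search range (in x coordinate space)
  int x_min = std::max(diagonal - b_len, 0);
  int x_max = std::min(diagonal, a_len);

  while (x_min < x_max) {
    const int pivot = x_min + (x_max - x_min) / 2;
    if (a[pivot] <= b[diagonal - 1 - pivot]) {
      x_min = pivot + 1;  // the path takes a[pivot]; move up A, down B
    } else {
      x_max = pivot;      // the path takes from B first; move up B, down A
    }
  }
  return coordinate_t{std::min(x_min, a_len), diagonal - x_min};
}
```

For A = {1, 3, 5, 7} and B = {2, 4, 6, 8}, diagonal 4 yields (2, 2): the first four merged elements are 1, 2, 3, 4, two from each list.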
neoblizz /
Created Mar 26, 2020
CUPTI Events for NVIDIA Volta V100 (32 GB)

CUPTI Events for Volta V100 (32 GB)

The following is the list of CUPTI events that you can profile on an NVIDIA Volta V100 (32 GB) GPU in a DGX Station.




Stevey's Google Platforms Rant

I was at Amazon for about six and a half years, and now I've been at Google for that long. One thing that struck me immediately about the two companies -- an impression that has been reinforced almost daily -- is that Amazon does everything wrong, and Google does everything right. Sure, it's a sweeping generalization, but a surprisingly accurate one. It's pretty crazy. There are probably a hundred or even two hundred different ways you can compare the two companies, and Google is superior in all but three of them, if I recall correctly. I actually did a spreadsheet at one point but Legal wouldn't let me show it to anyone, even though recruiting loved it.

I mean, just to give you a very brief taste: Amazon's recruiting process is fundamentally flawed by having teams hire for themselves, so their hiring bar is incredibly inconsistent across teams, despite various efforts they've made to level it out. And their operations are a mess; they don't real

neoblizz / test_helpers.hxx
Created Apr 23, 2019
Simplified SSSP-Gunrock
// ----------------------------------------------------------------
// Gunrock -- Fast and Efficient GPU Graph Library
// ----------------------------------------------------------------
// This source code is distributed under the terms of LICENSE.TXT
// in the root directory of this source distribution.
// ----------------------------------------------------------------
* @file
* test_helpers.hxx