Chris Sullivan csullivan

  • NVIDIA
  • Portland
@csullivan
csullivan / test_group_gemm.py
Created May 21, 2025 23:50
Roughly analogous performance to the fp8xfp8 first FC layer of the Mixture-of-Experts kernel in triton-lang/triton's python/triton_kernels _p_matmul_ogs.py when the routing is exactly uniform (even, with no variance) across all experts
import pytest
from typing import Optional
import torch
import triton
import triton.language as tl
DEVICE = "cuda"
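The gist preview stops at the imports. As a point of reference for what "exactly uniform routing" means here, a minimal torch-only sketch follows: when every one of the E experts receives the same number of tokens, the group GEMM degenerates to a plain batched matmul, which is roughly the baseline one would compare a Triton group GEMM against. All names in the sketch (E, T, K, N, reference_group_gemm) are illustrative assumptions, not taken from the gist.

def reference_group_gemm(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # x: (E, T // E, K) activations, already routed uniformly to the experts
    # w: (E, K, N) one FC weight matrix per expert
    # With uniform (no-variance) routing every expert sees an identical batch,
    # so the group GEMM reduces to a batched matmul.
    return torch.bmm(x, w)

if __name__ == "__main__":
    E, T, K, N = 8, 1024, 512, 1024   # experts, total tokens, FC dimensions
    assert T % E == 0                 # exactly uniform routing
    x = torch.randn(E, T // E, K, device=DEVICE, dtype=torch.float16)
    w = torch.randn(E, K, N, device=DEVICE, dtype=torch.float16)
    out = reference_group_gemm(x, w)
    print(out.shape)                  # torch.Size([8, 128, 1024])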

Creamy Basil Cashew Pesto Pasta

Ingredients

  • Pasta: 3/4 of a 14.5 oz box (approx. 300–310 g), cooked in very salty water (“like the ocean”)
  • Basil: 1 package (12 g) lightly dried chopped basil or a large handful of fresh leaves or 3–4 tbsp dried basil
  • Cashews: 1/3 to 1/2 cup unsalted
  • Parmesan: 1/2 cup grated
  • Garlic: 1 large clove (or 2 medium), peeled
@csullivan
csullivan / cereal_polymorphic_serialization.cc
Last active March 13, 2025 22:44
polymorphic data serialization example with cereal
#include <iostream>
#include <vector>
#include <memory>
#include <cereal/archives/binary.hpp>
#include <cereal/types/vector.hpp>
#include <cereal/types/string.hpp>
#include <cereal/types/base_class.hpp>
#include <cereal/types/memory.hpp>
#include <cereal/access.hpp>
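The preview above ends at the include list. A minimal sketch of how the polymorphic round trip typically continues is below, assuming a simple Base/Derived pair and an in-memory std::stringstream; these types are illustrative and not the gist's actual classes. Note that <cereal/types/polymorphic.hpp> provides CEREAL_REGISTER_TYPE, which is what lets cereal reconstruct a Derived object through a Base pointer.

#include <sstream>
#include <cereal/types/polymorphic.hpp>  // CEREAL_REGISTER_TYPE lives here

struct Base {
  virtual ~Base() = default;
  std::string name;
  template <class Archive>
  void serialize(Archive& ar) { ar(name); }
};

struct Derived : Base {
  std::vector<int> payload;
  template <class Archive>
  void serialize(Archive& ar) { ar(cereal::base_class<Base>(this), payload); }
};

CEREAL_REGISTER_TYPE(Derived);

int main() {
  std::stringstream ss;
  {
    cereal::BinaryOutputArchive oar(ss);
    std::shared_ptr<Base> obj = std::make_shared<Derived>();
    oar(obj);  // serialized through the Base pointer
  }
  std::shared_ptr<Base> restored;
  {
    cereal::BinaryInputArchive iar(ss);
    iar(restored);  // cereal re-creates a Derived from the registered type info
  }
  std::cout << (dynamic_cast<Derived*>(restored.get()) != nullptr) << std::endl;  // prints 1
  return 0;
}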
@csullivan
csullivan / 2024_09_26_nsys_single_instance_wgmma_register_and_shared_layout.txt
Created September 26, 2024 16:55
Performance comparison: 5% gain using wgmma with LHS in registers vs shared. [1] https://github.com/csullivan/wgmma-intrin
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- -------- -------- -------- -------- ----------- ----------------------------------------------------------------------------------------------------
---- 2495089 101 24703.9 24736.0 24544 27520 302.9 wgmma_f16_m64n256k16_kernel_shared_layout(__half *, __half *, __half *)
---- 2361204 101 23378.3 23423.0 23231 25600 245.6 wgmma_f16_m64n256k16_register_layout_kernel(__half *, __half *, __half *)
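(From the averages above, 24703.9 ns / 23378.3 ns ≈ 1.057, so the register-layout kernel is roughly 5–6% faster than the shared-layout variant, consistent with the stated 5% gain.)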
@csullivan
csullivan / passthrough_notes.md
Last active May 14, 2024 12:12
QEMU/KVM GPU passthrough on 18.04 LTS notes

For blacklisting an NVIDIA GTX 1070 GPU on Ubuntu 18.04, with Intel integrated graphics driving the host display

Blacklist the nouveau driver so the host never binds the GPU:

sudo bash -c "echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf" 
sudo bash -c "echo blacklist nouveau > /etc/modprobe.d/blacklist-nvidia-nouveau.conf"

/etc/default/grub:
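(The preview is truncated here. The usual edit at this step, stated as an assumption rather than the gist's verbatim content, is adding intel_iommu=on to GRUB_CMDLINE_LINUX_DEFAULT and then running sudo update-grub so the IOMMU groups needed for VFIO passthrough are exposed.)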

sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.11 python3.11-distutils python3.11-venv libpython3.11-dev
curl -sS https://bootstrap.pypa.io/get-pip.py | python3.11
@csullivan
csullivan / note.md
Last active September 7, 2023 06:50
CUTLASS CMake configuration for Hopper (sm90a)
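(The gist body is not included in this preview. The configuration it refers to is typically, as an assumption, a matter of passing -DCUTLASS_NVCC_ARCHS=90a to cmake so that CUTLASS builds kernels using Hopper-specific features such as wgmma.)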
@csullivan
csullivan / sharded_decode.py
Last active September 6, 2023 23:30
Sharded decode; the sharding rewrite is done after FuseOpsByPattern (cuBLAS/CUTLASS BYOC), with debug tracing calls
# Ignore `tvm.save_and_copy_tensor` packed functions inserted for debugging
@R.function
def decode(input_ids1: R.Tensor((1, 1), dtype="int32"), all_seq_len: R.Shape(["n"]), kv_cache: R.Tuple(R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Object, R.Obj
@csullivan
csullivan / test_torch_distributed.py
Created July 17, 2023 05:44
Minimal NCCL torch.distributed example
import os
import torch
import torch.distributed as dist

def read_file_and_all_reduce():
    # Get the rank and world size from environment variables
    rank = int(os.environ['LOCAL_RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    # initialize the process group (the preview is truncated here; the rest is
    # a minimal completion, not the gist's verbatim continuation)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    tensor = torch.ones(1, device="cuda") * rank
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    print(f"rank {rank}/{world_size}: {tensor.item()}")
    dist.destroy_process_group()
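Typically launched with torchrun, which sets LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for each process, e.g. torchrun --nproc_per_node=2 test_torch_distributed.py (assuming the script's truncated main block calls read_file_and_all_reduce()).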