@syadegari
Created September 24, 2025 16:16
root@computeinstance-e00ccn3b72xn1s8y6y:/workspace# mkdir -p /workspace/syad && cd /workspace/syad
git clone --depth=1 --branch main --single-branch https://github.com/HazyResearch/ThunderKittens ThunderKittens 2>/dev/null || true
# --- versions ---
. /etc/os-release 2>/dev/null; UBU="${NAME:-Linux} ${VERSION_ID:-$(uname -r)}"
CUDA=$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9.]\+\).*/\1/p' | head -1 || echo "")
TORCH=$(python -c "import torch; print(torch.__version__)" 2>/dev/null || echo "not found")
TRT=$(python -c "import tensorrt as trt; print(trt.__version__)" 2>/dev/null || echo "not found")
GPU=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | paste -sd ' / ' - || echo "not found")
HASH=$(git -C ThunderKittens rev-parse --short=12 HEAD 2>/dev/null || echo "unknown")
echo
echo "Ubuntu: $UBU"
echo "CUDA: ${CUDA:-not found}"
echo "PyTorch: $TORCH"
echo "TensorRT: $TRT"
echo "GPU: $GPU"
echo "TK HEAD (main): $HASH"
echo
set -x
export THUNDERKITTENS_ROOT=/workspace/syad/ThunderKittens
cd /workspace/syad/ThunderKittens/kernels/matmul/educational/
set +x
make clean && make && ./matmul || echo "educational failed"
set -x
cd ../H100/
set +x
make clean && make && ./matmul || echo "H100 failed"
Ubuntu: Ubuntu 24.04
CUDA: 12.8
PyTorch: 2.7.0a0+7c8ec84dab.nv25.03
TensorRT: 10.9.0.34
GPU: NVIDIA H100 80GB HBM3
TK HEAD (main): 2ba96ceedfb1
+ export THUNDERKITTENS_ROOT=/workspace/syad/ThunderKittens
+ THUNDERKITTENS_ROOT=/workspace/syad/ThunderKittens
+ cd /workspace/syad/ThunderKittens/kernels/matmul/educational/
+ set +x
rm -f matmul
nvcc level_08.cu -DNDEBUG -Xcompiler=-fPIE -Xcompiler -fopenmp --expt-extended-lambda --expt-relaxed-constexpr -Xcompiler=-Wno-psabi -Xcompiler=-fno-strict-aliasing --use_fast_math -forward-unknown-to-host-compiler -O3 -Xnvlink=--verbose -Xptxas=--verbose -Xptxas=--warn-on-spills -std=c++20 -MD -MT -MF -x cu -lrt -lpthread -ldl -DKITTENS_HOPPER -arch=sm_90a -lcuda -lcudadevrt -lcudart_static -lcublas -lgomp -I/workspace/syad/ThunderKittens/include -I/workspace/syad/ThunderKittens/prototype -o matmul
ptxas info : (C7508) Potential Performance Loss: 'setmaxnreg' ignored; unable to determine register count at entry.
ptxas info : 3 bytes gmem
ptxas info : Compiling entry function '_Z6kernel14matmul_globals' for 'sm_90a'
ptxas info : Function properties for _Z6kernel14matmul_globals
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 154 registers, used 16 barriers, 128 bytes smem
ptxas info : Compile time = 73.157 ms
nvlink info : 0 bytes gmem
-------------------- M=4096 N=4096 K=4096 --------------------
Allocated host memory
Initialized matrices
Performed CPU matrix multiplication
Allocated device memory
Copied matrices to device
Avg Kernel execution time: 268.414 us
Achieved performance: 512.041 TFLOPs
Copied result back to host
Converted result back to float
Max error: 0.0982647
Error count: 0
Total count: 16777216
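
As a sanity check on the educational run above, the reported throughput can be recomputed from the logged problem size and average kernel time. This is a minimal sketch, assuming the benchmark counts 2*M*N*K floating-point operations per matrix multiply (the log does not state the formula); the variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Recompute throughput from the logged problem size and average kernel time.
# Assumption: one GEMM is counted as 2*M*N*K floating-point operations.
M=4096; N=4096; K=4096
avg_us=268.414   # "Avg Kernel execution time" from the log, in microseconds

# awk is used because bash arithmetic is integer-only.
tflops=$(awk -v m="$M" -v n="$N" -v k="$K" -v us="$avg_us" \
  'BEGIN { printf "%.1f", (2 * m * n * k) / (us * 1e-6) / 1e12 }')

echo "Recomputed: ${tflops} TFLOP/s"
```

Under that assumption the result agrees with the logged 512.041 figure to the printed precision.
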
+ cd ../H100/
+ set +x
rm -f matmul
nvcc matmul.cu -DNDEBUG -Xcompiler=-fPIE -Xcompiler -fopenmp --expt-extended-lambda --expt-relaxed-constexpr -Xcompiler=-Wno-psabi -Xcompiler=-fno-strict-aliasing --use_fast_math -forward-unknown-to-host-compiler -O3 -Xnvlink=--verbose -Xptxas=--verbose -Xptxas=--warn-on-spills -std=c++20 -MD -MT -MF -x cu -lrt -lpthread -ldl -DKITTENS_HOPPER -arch=sm_90a -lcuda -lcudadevrt -lcudart_static -lcublas -lgomp -I/workspace/syad/ThunderKittens/include -I/workspace/syad/ThunderKittens/prototype -o matmul
ptxas info : 3 bytes gmem
ptxas info : Compiling entry function '_ZN7kittens9prototype3lcf6kernelI15matmul_templateILi2ELi4ELi8EEEEvNT_6layout7globalsE' for 'sm_90a'
ptxas info : Function properties for _ZN7kittens9prototype3lcf6kernelI15matmul_templateILi2ELi4ELi8EEEEvNT_6layout7globalsE
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 168 registers, used 16 barriers, 80 bytes smem
ptxas info : Compile time = 244.284 ms
nvlink info : 0 bytes gmem
-------------------- M=4096 N=4096 K=4096 --------------------
Block size: 128x256
Allocated host memory
Initialized matrices
Performed CPU matrix multiplication
Allocated device memory
Copied matrices to device
Launching warmup kernel with grid (132, 1), block (384)
Launching kernel with grid (132, 1), block (384)
terminate called after throwing an instance of 'std::runtime_error'
what(): Error in tile TMA descriptor creation: unspecified launch failure
Parameters:
batch: 1
depth: 1
rows: 4096
cols: 4096
ST::rows: 64
ST::cols: 64
cuTensorMapEncodeTiled arguments:
tma_map: 140730429461440
tma_format: 9
tma_dim: 5
global_addr: 139840108625920
global_addr memory type: valid device memory
gmem_shape: 140730429460672 [64, 4096, 64, 1, 1]
gmem_stride: 140730429459776 [8192, 128, 33554432, 33554432]
smem_shape: 140730429459712 [64, 64, 1, 1, 1]
smem_stride: 140730429459744 [1, 1, 1, 1, 1]
tma_interleave: 0
tma_swizzle: 3
tma_l2Promotion: 0
tma_oobFill: 0
Aborted (core dumped)
H100 failed
root@computeinstance-e00ccn3b72xn1s8y6y:/workspace/syad/ThunderKittens/kernels/matmul/H100#
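
The `gmem_shape` and `gmem_stride` values in the error dump above can be re-derived from the logged `rows`, `cols`, `depth`, and `batch`. This is a minimal sketch under two assumptions that the log does not confirm: elements are 2 bytes wide (which would fit `tma_format: 9` if that value is the driver's BFLOAT16 enum), and the column dimension is split into 64-element (128-byte) chunks (which would fit `tma_swizzle: 3` if that value is the 128-byte swizzle mode). The variable names are hypothetical and this is not ThunderKittens' own code.

```shell
#!/usr/bin/env bash
# Re-derive the logged gmem_shape / gmem_stride for the failing descriptor.
# Assumptions (not confirmed by the log): 2-byte elements, and columns
# split into 64-element (128-byte) chunks.
batch=1; depth=1; rows=4096; cols=4096   # values from the "Parameters" dump
elem_bytes=2
chunk=64

row_bytes=$((cols * elem_bytes))            # bytes between consecutive rows
chunk_bytes=$((chunk * elem_bytes))         # bytes between column chunks
plane_bytes=$((rows * cols * elem_bytes))   # bytes in one rows x cols plane

shape="[$chunk, $rows, $((cols / chunk)), $depth, $batch]"
stride="[$row_bytes, $chunk_bytes, $plane_bytes, $((depth * plane_bytes))]"

echo "gmem_shape:  $shape"
echo "gmem_stride: $stride"
```

Both lines reproduce the logged arrays, so the descriptor arguments look internally consistent for a 4096x4096 matrix under these assumptions. That does not by itself explain the `unspecified launch failure`, which still needs to be debugged separately.
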