Mark Saroufim (msaroufim)
🤖 Putting the finishing touches on my robot army
import ast
from pathlib import Path
from typing import Set, Dict
from collections import defaultdict

def analyze_imports(file_path: str) -> Dict[str, Set[str]]:
    """Analyze Python file imports and return a dictionary of package dependencies."""
    # Map each top-level package to the full module paths imported from it.
    deps: Dict[str, Set[str]] = defaultdict(set)
    for node in ast.walk(ast.parse(Path(file_path).read_text())):
        if isinstance(node, ast.Import):
            for alias in node.names:
                deps[alias.name.split(".")[0]].add(alias.name)
        elif isinstance(node, ast.ImportFrom) and node.module:
            deps[node.module.split(".")[0]].add(node.module)
    return dict(deps)
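A quick usage sketch (the target filename is hypothetical):

deps = analyze_imports("my_module.py")
for package, modules in sorted(deps.items()):
    print(package, "->", sorted(modules))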
msaroufim / 🍿.md
Created November 7, 2024 18:34
Project Popcorn: Generate SOTA kernels with LLMs in public

TL;DR: We're building an LLM that can code-generate efficient CUDA kernels, in public. Today, models like ChatGPT are terrible at systems programming: they don't seem to understand how GPUs work and frequently hallucinate. However, projects like llm.c, which pair a smart human in the loop with an LLM, have shown that this should be possible. There's a lot we need to innovate on: how we create more kernel tokens, what the right abstractions for LLMs are, and how to scale test-time compute. Considering how hard this is, we want to do everything in public on Discord. We will share infra, loss curves, and chat messages there, and try to include as many people as possible so we can actually crack this problem.
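The loop this implies — generate a kernel, verify it against a trusted reference, keep it only if it is both correct and faster — can be sketched in a few lines. This is purely illustrative, not Popcorn's actual infrastructure:

import time
import torch

def time_fn(fn, inputs, iters: int = 50) -> float:
    # Crude wall-clock timing; a real harness would use CUDA events and warmup.
    start = time.perf_counter()
    for _ in range(iters):
        fn(*inputs)
    return time.perf_counter() - start

def accept_candidate(candidate, reference, make_inputs) -> bool:
    # Gate an LLM-generated kernel: numerically correct first, faster second.
    inputs = make_inputs()
    if not torch.allclose(candidate(*inputs), reference(*inputs), atol=1e-4):
        return False
    return time_fn(candidate, inputs) < time_fn(reference, inputs)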

Logistics

We're a distributed research effort, so we mostly chat async on discord.gg/gpumode in the popcorn channel

If you prefer longer-form content, you can check out https://drive.google.com/drive/folders/1nt2KcRRKb8YdySxkRxUu5PR4c7UPM_rK

Top level goals for the n

run_vit_b.py run_vit_b_quant.py
(ao) [marksaroufim@devvm4567.ash0 ~/ao/tutorials/quantize_vit (main)]$ python run_vit_b_quant.py
Downloading: "https://download.pytorch.org/models/vit_b_16-c867db91.pth" to /home/marksaroufim/.cache/torch/hub/checkpoints/vit_b_16-c867db91.pth
100%|█████████████████████████████████████████████████████████████████████████████████| 330M/330M [00:01<00:00, 209MB/s]
AUTOTUNE convolution(1x3x224x224, 768x3x16x16)
triton_convolution_4 0.1184 ms 100.0%
convolution 0.1450 ms 81.7%
triton_convolution_3 0.2024 ms 58.5%
triton_convolution_5 0.2268 ms 52.2%
triton_convolution_6 0.2445 ms 48.4%
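The AUTOTUNE table above is TorchInductor benchmarking candidate convolution implementations (Triton templates vs. the ATen fallback) and keeping the fastest. It appears when the model is compiled with autotuning enabled, roughly like this (a sketch mirroring the tutorial's model; the quantization step is elided):

import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1).eval().cuda()
compiled = torch.compile(model, mode="max-autotune")
with torch.no_grad():
    # The first call triggers compilation and the per-op autotuning above.
    compiled(torch.randn(1, 3, 224, 224, device="cuda"))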
*Nim Sum Dim Sum*, a bustling local dumpling restaurant, has two game-theory-loving servers named, you guessed it, Alice and Bob. Its dining area can be represented as a two-dimensional grid of \(R\) rows (numbered \(1..R\) from top to bottom) by \(C\) columns (numbered \(1..C\) from left to right).
Currently, both of them are standing at coordinates \((1, 1)\) where there is a big cart of dim sum. Their job is to work together to push the cart to a customer at coordinates \((R, C)\). To make the job more interesting, they've turned it into a game.
Alice and Bob will take turns pushing the cart. On Alice's turn, the cart must be moved between \(1\) and \(A\) units down. On Bob's turn, the cart must be moved between \(1\) and \(B\) units to the right. The cart may not be moved out of the grid. If the cart is already at row \(R\) on Alice's turn or column \(C\) on Bob's turn, then that person loses their turn.
The "winner" is the person to ultimately move the cart to \((R, C)\) and thus get all the recognit
import torch
# For context: a Python bool is a full heap-allocated PyObject, so sys.getsizeof
# reports 28 bytes for what is logically a single bit of information.
# >>> import sys
# >>> size_of_bool = sys.getsizeof(True)  # or sys.getsizeof(False)
# >>> print(size_of_bool)
# 28
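For contrast (a sketch assuming a recent PyTorch): a torch.bool tensor stores one byte per element in a flat buffer, so a million flags cost about 1 MB rather than a million 28-byte PyObjects.

import sys
import torch

print(sys.getsizeof(True))  # 28 on 64-bit CPython: a full heap object

flags = torch.zeros(1_000_000, dtype=torch.bool)
print(flags.element_size())                   # 1 byte per element
print(flags.element_size() * flags.numel())   # ~1e6 bytes for the whole buffer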
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import os
import glob
from datetime import datetime
from setuptools import find_packages, setup
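The preview stops before the setup() call. For orientation, a minimal file with these imports typically continues along these lines (the name and version scheme here are placeholders, not torchao's actual values):

version = datetime.now().strftime("%Y.%m.%d")  # date-stamped nightly-style version
setup(
    name="example-package",  # placeholder; not the real package name
    version=version,
    packages=find_packages(),
)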
INFO:/home/marksaroufim/.local/lib/python3.10/site-packages/torchao/prototype/galore/kernels/custom_autotune.py:
Autotune Best Config: BLOCK_M: 32, BLOCK_N: 128, BLOCK_K: 32, SPLIT_K: 1, num_warps: 4, num_ctas: 1, num_stages: 3
INFO:/home/marksaroufim/.local/lib/python3.10/site-packages/torchao/prototype/galore/kernels/custom_autotune.py:
Autotune Best Config: BLOCK_M: 16, BLOCK_N: 32, BLOCK_K: 32, SPLIT_K: 1, num_warps: 2, num_ctas: 1, num_stages: 5
INFO:/home/marksaroufim/.local/lib/python3.10/site-packages/torchao/prototype/galore/kernels/custom_autotune.py:
Autotune Best Config: BLOCK_M: 32, BLOCK_N: 128, BLOCK_K: 32, SPLIT_K: 1, num_warps: 4, num_ctas: 1, num_stages: 3
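These "Autotune Best Config" lines are what Triton-style autotuning prints after timing each candidate config and keeping the fastest. The mechanism, sketched (the two configs are taken from the log; the kernel body is a stand-in, not GaLore's actual matmul):

import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_M": 32, "BLOCK_N": 128, "BLOCK_K": 32, "SPLIT_K": 1},
                      num_warps=4, num_stages=3),
        triton.Config({"BLOCK_M": 16, "BLOCK_N": 32, "BLOCK_K": 32, "SPLIT_K": 1},
                      num_warps=2, num_stages=5),
    ],
    key=["M", "N", "K"],  # re-run the tuning sweep when these change
)
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr, SPLIT_K: tl.constexpr):
    # Stand-in body; a real kernel computes one BLOCK_M x BLOCK_N output tile.
    pid = tl.program_id(0)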
~ nvcc -O3 --use_fast_math attention_forward.cu -o attention_forward -lcublas
⚡ ~ ./attention_forward 1
Using kernel 1
-0.529510 -0.529510
0.889394 0.889394
0.881674 0.881674
0.651789 0.651789
-0.483486 -0.483486
Results match!
block_size 32 | time 7618.906250 ms
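The harness prints a few values from the custom CUDA kernel next to a trusted reference before declaring "Results match!". The same pattern in PyTorch terms (a sketch; shapes and tolerance are assumptions):

import torch
import torch.nn.functional as F

q, k, v = (torch.randn(1, 12, 64, 64) for _ in range(3))
out_ref = F.scaled_dot_product_attention(q, k, v)  # trusted reference
out_custom = out_ref.clone()                       # stand-in for the custom kernel's output

# Print a handful of values side by side, as the C harness does, then assert.
for a, b in zip(out_ref.flatten()[:5], out_custom.flatten()[:5]):
    print(f"{a.item():.6f} {b.item():.6f}")
assert torch.allclose(out_ref, out_custom, atol=1e-4)
print("Results match!")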
import time
from typing import Callable, List
import torch
torch.set_printoptions(threshold=10000)
# Llama-7B
SIZES = [torch.Size([32000, 4096]), torch.Size([4096, 4096]), torch.Size([4096, 4096]), torch.Size([4096, 4096]), torch.Size([4096, 4096]), torch.Size([11008, 4096]), torch.Size([4096, 11008]), torch.Size([11008, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096, 4096]), torch.Size([4096, 4096]), torch.Size([4096, 4096]), torch.Size([4096, 4096]), torch.Size([11008, 4096]), torch.Size([4096, 11008]), torch.Size([11008, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096, 4096]), torch.Size([4096, 4096]), torch.Size([4096, 4096]), torch.Size([4096, 4096]), torch.Size([11008, 4096]), torch.Size([4096, 11008]), torch.Size([11008, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096, 4096]), torch.Size([4096, 4096]), torch.Size([4096, 4096]), torch.Size([4096, 4096]), torch.Size([11008, 4096]), torch.Size([4096, 11008]), torch.Size([11008, 4096]), torch.Size([40
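SIZES enumerates Llama-7B's weight shapes layer by layer (the preview truncates the list), and the time/Callable imports suggest a per-shape timing loop. A sketch of that harness (the op being timed is an assumption):

def benchmark(fn: Callable[[torch.Tensor], torch.Tensor], sizes: List[torch.Size]) -> None:
    # Time fn once per weight shape; a real harness would warm up and average.
    for size in sizes:
        x = torch.randn(size)
        start = time.perf_counter()
        fn(x)
        print(f"{tuple(size)}: {(time.perf_counter() - start) * 1e3:.3f} ms")

# Subset only, since the full SIZES list is cut off above.
subset = [torch.Size([32000, 4096]), torch.Size([4096, 4096]), torch.Size([4096])]
benchmark(lambda t: t.to(torch.float16), subset)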