Stas Bekman (stas00)

stas00 / static_kv_cache.py
Created March 2, 2024 02:56 — forked from ArthurZucker/static_kv_cache.py
simple static kv cache script
from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache
import torch
from typing import Optional

device = "cuda"

# Copied from the gpt-fast repo
def multinomial_sample_one_no_sync(probs_sort):  # Does multinomial sampling without a cuda synchronization
    q = torch.empty_like(probs_sort).exponential_(1)
    return torch.argmax(probs_sort / q, dim=-1, keepdim=True).to(dtype=torch.int)
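
A minimal sketch of how this sampler is typically driven (the sample_next_token wrapper and the temperature handling are illustrative, not part of the gist):

def sample_next_token(logits, temperature=1.0):
    # logits: [batch, vocab] at the last generated position
    probs = torch.softmax(logits / max(temperature, 1e-5), dim=-1)
    # exponential-race trick: argmax(probs / Exp(1)) is a multinomial draw
    # that avoids the host sync torch.multinomial would trigger
    return multinomial_sample_one_no_sync(probs)
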
stas00 / Mellanox OFED cheat sheet
Created March 1, 2024 02:40 — forked from githubfoam/Mellanox OFED cheat sheet
Mellanox OFED cheat sheet
--------------------------------------------------------------------------
# ofed_info -s
--------------------------------------------------------------------------
Find the Mellanox adapter type and firmware/driver version
ConnectX-4 card (the example lspci output below is from a ConnectX-3)
# lspci | grep Mellanox
0a:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
# lspci -vv -s 0a:00.0 | grep "Part number" -A 3
# lspci | grep Mellanox | awk '{print $1}' | xargs -i -r mstvpd {}
stas00 / mm_bmm-perf.py
Created February 16, 2024 00:27 — forked from malfet/mm_bmm-perf.py
Measure performance difference of `torch.mm` vs `torch.bmm`
# Benchmark relative performance of torch.mm and torch.bmm with single batch
import torch
import time

def benchmark_fn(fn, args, warmup=5, cycles=300, use_kineto=False) -> float:
    if use_kineto:
        with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA]) as p:
            fn(*args)
        return sum([e.cuda_time for e in p.key_averages()])
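
A minimal usage sketch for the preview above (the shapes are illustrative, and the gist's non-kineto wallclock path is cut off in this preview):

# illustrative driver, assuming a CUDA device is available
m, k, n = 1024, 1024, 1024
a = torch.rand(m, k, device="cuda")
b = torch.rand(k, n, device="cuda")
mm_usec = benchmark_fn(torch.mm, (a, b), use_kineto=True)
bmm_usec = benchmark_fn(torch.bmm, (a.unsqueeze(0), b.unsqueeze(0)), use_kineto=True)
print(f"mm: {mm_usec:.1f} usec vs bmm(batch=1): {bmm_usec:.1f} usec")
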
stas00 / mfu_compute.py
Created January 5, 2024 23:28 — forked from Chillee/mfu_compute.py
Compute Flop Utilization in PyTorch
import torch
from torch.utils.flop_counter import FlopCounterMode
from triton.testing import do_bench

def get_flops_achieved(f):
    flop_counter = FlopCounterMode(display=False)
    with flop_counter:
        f()
    total_flops = flop_counter.get_total_flops()
    ms_per_iter = do_bench(f)
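    # (the preview cuts off here; the rest of this function and the usage below
    #  are a sketch of what the full gist reports, not its verbatim code)
    iters_per_sec = 1e3 / ms_per_iter
    print(f"{iters_per_sec * total_flops / 1e12:.1f} TF/s achieved")

# usage sketch: model, shapes and dtype are illustrative
model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)]).cuda().half()
inp = torch.randn(64, 4096, device="cuda", dtype=torch.half)
get_flops_achieved(lambda: model(inp).sum().backward())
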
stas00 / vram.ipynb
Created December 18, 2023 03:12
memory allocations breakdown
stas00 / calc_transformer_flops.py
Created November 22, 2023 01:16 — forked from Quentin-Anthony/calc_transformer_flops.py
Transformer FLOPs with Dense/MoE
import argparse
import math

# Helper function to pretty-print FLOP counts
def convert_flops(params):
    if params == 0:
        return "0"
    size_name = ("", "KFLOPs", "MFLOPs", "GFLOPs", "TFLOPs", "PFLOPs", "EFLOPs", "ZFLOPs", "YFLOPs")
    i = int(math.floor(math.log(params, 1000)))
    p = math.pow(1000, i)
    s = round(params / p, 2)  # (completion of the truncated preview)
    return f"{s} {size_name[i]}"
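
For a rough sense of scale, a minimal sketch using the helper above with the standard dense-training estimate FLOPs ~ 6 * parameters * tokens (a simplification of what the full script computes; the numbers are illustrative):

# back-of-the-envelope dense training FLOPs: ~6 FLOPs per parameter per token
params = 7e9     # e.g. a 7B-parameter model (illustrative)
tokens = 1e12    # e.g. 1T training tokens (illustrative)
print(convert_flops(6 * params * tokens))   # -> "42.0 ZFLOPs"
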
stas00 / calc_transformer_params.py
Created November 22, 2023 01:15 — forked from Quentin-Anthony/calc_transformer_params.py
Transformer Parameter Count
import argparse
import math

# Helper function to pretty-print parameter counts
def convert_params(params):
    if params == 0:
        return "0"
    size_name = ("", "K", "M", "B", "T", "P", "E", "Z", "Y")
    i = int(math.floor(math.log(params, 1000)))
    p = math.pow(1000, i)
    s = round(params / p, 2)  # (completion of the truncated preview)
    return f"{s} {size_name[i]}"
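
A minimal sketch of the kind of estimate the full script automates, using the common rule of thumb params ~ 12 * layers * hidden^2 plus token embeddings (the shapes below are illustrative and the constant ignores biases, norms, and gated-MLP variants):

hidden, layers, vocab = 4096, 32, 32000   # illustrative 7B-class shape
params = layers * 12 * hidden**2 + vocab * hidden
print(convert_params(params))   # -> "6.57 B"
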
stas00 / benchmark_dist_init.py
Last active November 20, 2023 20:33
Profiling `init_process_group('nccl')`
# run as:
# python -u -m torch.distributed.run --nproc_per_node=8 --rdzv_endpoint localhost:6000 --rdzv_backend c10d benchmark_dist_init.py
import torch
import os
import cProfile
import torch.distributed as dist
import timeit
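
The preview stops at the imports; a minimal sketch of the profiling step itself (it assumes the RANK/WORLD_SIZE/LOCAL_RANK/MASTER_* env vars that torch.distributed.run sets, and uses cProfile as the imports suggest; the full gist may structure this differently):

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

start = timeit.default_timer()
# profile the NCCL process-group bring-up itself
cProfile.run("dist.init_process_group('nccl')", sort="cumtime")
print(f"rank {dist.get_rank()}: init took {timeit.default_timer() - start:.2f}s (under cProfile)")

dist.destroy_process_group()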

Connect via SSH to a Slurm compute job that runs as an Enroot container

Being able to SSH directly into a compute job lets you use all of your remote development tools, such as your IDE's debugger (VSCode, PyCharm, ...), for GPU jobs as well.

  • Slurm: Scheduling system that many HPC clusters use
  • Enroot: Container system like Docker for NVIDIA GPUs

General problem:

stas00 / sft_trainer.py
Created October 13, 2023 17:53 — forked from lewtun/sft_trainer.py
Fine-tuning Mistral 7B with TRL & DeepSpeed ZeRO-3
# This is a modified version of TRL's `SFTTrainer` example (https://github.com/huggingface/trl/blob/main/examples/scripts/sft_trainer.py),
# adapted to run with DeepSpeed ZeRO-3 and Mistral-7B-v0.1. The settings below were run on 1 node of 8 x A100 (80GB) GPUs.
#
# Usage:
# - Install the latest transformers & accelerate versions: `pip install -U transformers accelerate`
# - Install deepspeed: `pip install deepspeed==0.9.5`
# - Install TRL from main: `pip install git+https://github.com/huggingface/trl.git`
# - Clone the repo: `git clone https://github.com/huggingface/trl.git`
# - Copy this Gist into trl/examples/scripts
# - Run from root of trl repo with: accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero3.yaml --gradient_accumulation_steps 8 examples/scripts/sft_trainer.py