Birch-san

@Birch-san
Birch-san / llama-convert.md
Created June 1, 2023 18:24
Converting LLaMA model weights to huggingface format + safetensors

Loading LLaMA via Huggingface + Safetensors, with 4-bit quantization

Let's say we're trying to load a LLaMA model via AutoModelForCausalLM.from_pretrained with 4-bit quantization in order to run inference with it:

python generate.py

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, LlamaTokenizerFast, LlamaForCausalLM
import transformers
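
The preview cuts off at the imports; a minimal sketch of such a 4-bit load follows. The repo id, dtype, and quantization settings here are illustrative assumptions, not values taken from the gist:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = 'huggyllama/llama-7b'  # hypothetical example repo; point this at your converted checkpoint
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map='auto',
)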
@Birch-san
Birch-san / fine-tuning.md
Last active December 27, 2023 17:24
Fine-tuning LLaMA-7B on ~12GB VRAM with QLoRA, 4-bit quantization

nvidia-smi reported that this required 11181MiB, at least to train on the prompt sequence lengths that occur early in the alpaca dataset (~337-token prompts).
You can get this down to about 10.9GB by modifying qlora.py to run torch.cuda.empty_cache() after PEFT has been applied to your loaded model and before you begin training.
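
As a rough illustration of where that call sits (using a small stand-in model here rather than the 4-bit LLaMA that qlora.py actually loads, and a made-up LoRA config):

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained('gpt2')  # stand-in; qlora.py loads the quantized LLaMA instead
peft_config = LoraConfig(r=8, lora_alpha=16, target_modules=['c_attn'], lora_dropout=0.05)
model = get_peft_model(model, peft_config)

# emptying PyTorch's allocator cache after PEFT wrapping, before training begins,
# is what reclaims the extra ~0.3GB mentioned above
torch.cuda.empty_cache()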

Setup

All instructions are written assuming your command-line shell is bash.

Clone repository:

@Birch-san
Birch-san / gist:daf94f0dd0fc4b87ad530db6f77b6a55
Created June 1, 2023 16:10
Falcon-40B-Instruct parameter names
# for i, x in model.named_parameters():
#     print(i)
transformer.word_embeddings.weight
transformer.h.0.ln_attn.weight
transformer.h.0.ln_attn.bias
transformer.h.0.ln_mlp.weight
transformer.h.0.ln_mlp.bias
transformer.h.0.self_attention.query_key_value.weight
transformer.h.0.self_attention.dense.weight
transformer.h.0.mlp.dense_h_to_4h.weight
@Birch-san
Birch-san / opencv-cuda.md
Last active May 4, 2024 23:58
Building OpenCV with CUDA acceleration

For CUDA 12, see Installing CUDA 12.1.1 + PyTorch nightly + Python 3.10 on Ubuntu 22.10 for how to install Nvidia driver 530, gcc 12 and CUDA 12.1.1 libraries.
If you want CUDA 11.8, then you can use the latest Nvidia driver from the Production branch (525) with gcc 11.

Activate your conda environment, if you haven't done so already.

CUDA 11:
Make sure gcc 11 is the default gcc for your OS, or select gcc 11 explicitly.
CUDA 12:
Make sure gcc 12 is the default gcc for your OS, or select gcc 12 explicitly.
Check that CUDA_DIR below points to the CUDA installation you wish to use.

@Birch-san
Birch-san / magma-readme.md
Created April 27, 2023 21:58
Build magma from source
@Birch-san
Birch-san / CUDA-12-1-1-pytorch.md
Last active April 28, 2024 10:22
Installing CUDA 12.1.1 + PyTorch nightly + Python 3.10 on Ubuntu 22.10

Should you keep your NVIDIA driver?

The CUDA 12.1.1 toolkit is gonna offer to install Nvidia driver 530 for us. It's from the New Feature branch, so it's likely to be newer than the default Nvidia driver you would've installed via apt-get (apt would prefer to give you 525, i.e. the Production branch).

If you're confident that you already have a new enough Nvidia driver for CUDA 12.1.1, and you'd like to keep your driver: feel free to skip this "uninstall driver" step.

But if you're not sure, or you know your driver is too old: let's uninstall it. CUDA will install a new driver for us later.

@Birch-san
Birch-san / attn_scores_buffer.py
Created April 9, 2023 10:50
Compute size of buffer required to fit q_proj @ k_proj.T attention scores
float_width=2 # float16
cond_count=2 # uncond and cond for 1 sample
attn_heads=8 # SD1.5 isn't optimized for flash attn, so all layers have 8 heads, lol
vae_scale_factor=8
px_height=px_width=768
latent_height=px_height/vae_scale_factor
latent_width=px_width/vae_scale_factor
q_proj_tokens=k_proj_tokens=latent_height*latent_width
qk_bytes = cond_count*attn_heads*float_width*q_proj_tokens*k_proj_tokens
qk_mb = qk_bytes/1024**2
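# e.g. at 768x768: latent 96x96 = 9216 q/k tokens,
# so qk_bytes = 2*8*2*9216*9216 = 2,717,908,992 bytes = 2592 MiB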
from torch import FloatTensor, load, baddbmm, zeros
from dataclasses import dataclass
import torch
from os.path import join
@dataclass
class Fixtures:
    q_proj: FloatTensor
    k_proj: FloatTensor
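
The preview stops at the fixture definitions; a guess at how they might be used to materialize the scores buffer (the paths, shapes, and scaling below are assumptions on my part, not the gist's code):

fixtures_dir = '/tmp/fixtures'  # hypothetical location of saved projections
fixtures = Fixtures(
    q_proj=load(join(fixtures_dir, 'q_proj.pt')),
    k_proj=load(join(fixtures_dir, 'k_proj.pt')),
)
q, k = fixtures.q_proj, fixtures.k_proj  # assumed shape: (batch*heads, tokens, head_dim)
# q_proj @ k_proj.T via baddbmm into a zero buffer, with the usual 1/sqrt(head_dim) scaling
scores = baddbmm(
    zeros(q.size(0), q.size(1), k.size(1), dtype=q.dtype, device=q.device),
    q,
    k.transpose(-1, -2),
    beta=0,
    alpha=q.size(-1) ** -0.5,
)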
@Birch-san
Birch-san / topk_softmax_denominator.py
Created April 3, 2023 22:45
Reducing the softmax denominator to sum only as many attention scores as the in-distribution checkpoint would've, so that its outputs have in-distribution magnitudes
from torch import FloatTensor
vae_scale_factor = 8
typical_self_attn_key_length = (512/vae_scale_factor) * (512/vae_scale_factor)
desired_self_attn_key_length = (768/vae_scale_factor) * (768/vae_scale_factor)
key_length_factor=desired_self_attn_key_length/typical_self_attn_key_length if is_self_attn else 1.
def softmax(x: FloatTensor, dim=-1) -> FloatTensor:
    maxes = x.max(dim, keepdim=True).values
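
The function is cut off there. A plausible sketch of what the title describes, summing only the top typical_self_attn_key_length scores in the denominator, is below; this is my guess at the continuation, not the gist's actual code:

from torch import FloatTensor

def topk_softmax(x: FloatTensor, k: int, dim: int = -1) -> FloatTensor:
    # subtract the per-row max for numerical stability, as in a standard softmax
    maxes = x.max(dim, keepdim=True).values
    exp = (x - maxes).exp()
    # sum only the k largest scores, i.e. as many as the in-distribution
    # (512x512-trained) checkpoint would have summed
    denom = exp.topk(k, dim=dim).values.sum(dim, keepdim=True)
    return exp / denom

# e.g. k = int(typical_self_attn_key_length)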
@Birch-san
Birch-san / scaled_softmax.py
Created April 3, 2023 00:16
Questionable softmax
from torch import FloatTensor
vae_scale_factor = 8
typical_self_attn_key_length = (512/vae_scale_factor) * (512/vae_scale_factor)
desired_self_attn_key_length = (200/vae_scale_factor) * (200/vae_scale_factor)
key_length_factor=desired_self_attn_key_length/typical_self_attn_key_length if is_self_attn else 1.
def softmax(x: FloatTensor, dim=-1) -> FloatTensor:
    key_tokens = x.size(-1)
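
This preview is truncated too. One guess at how the "questionable" scaling might continue is to rescale the denominator by the ratio of typical to actual key tokens, so output magnitudes stay in-distribution; again, this is an assumption rather than the gist's code:

def scaled_softmax(x: FloatTensor, dim: int = -1) -> FloatTensor:
    key_tokens = x.size(-1)
    maxes = x.max(dim, keepdim=True).values
    exp = (x - maxes).exp()
    # pretend only the typical number of key tokens contributed to the sum
    denom = exp.sum(dim, keepdim=True) * (typical_self_attn_key_length / key_tokens)
    return exp / denom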