Sam Shleifer (sshleifer)
@sshleifer
sshleifer / apps.md
Last active September 1, 2023 15:12
My favorite apps and workflow stuff (for Mac/iOS/Python)
// Place your key bindings in this file to override the defaults
[
// Switching between editor and terminal
{
"key": "ctrl+j",
"command": "workbench.action.terminal.focus",
"when": "editorFocus || !editorIsOpen"
},
{
"key": "ctrl+j",
// assumed completion: the gist preview cuts off here; this is the usual
// counterpart binding that jumps from the terminal back to the editor
"command": "workbench.action.focusActiveEditorGroup",
"when": "terminalFocus"
}
]

How BartConfig controls when LayerNorm is applied

Six groups of models inherit from BartForConditionalGeneration. The major differences between them are:

  • pretraining objective & data
  • finetuning objective & data
  • number of layers and dimension of each layer
  • when layernorm is applied

This document focuses on layernorm timing.
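To make "timing" concrete, here is a minimal sketch of the two placements for a single sublayer. The flag name normalize_before is borrowed from fairseq-style configs and is an assumption here, not necessarily the exact BartConfig attribute:

import torch
import torch.nn as nn

class SketchBlock(nn.Module):
    # one residual sublayer with a configurable LayerNorm position
    def __init__(self, d, normalize_before):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.ffn = nn.Linear(d, d)
        self.normalize_before = normalize_before

    def forward(self, x):
        if self.normalize_before:
            # pre-norm: normalize the input, then add the residual
            return x + self.ffn(self.norm(x))
        # post-norm (original Transformer): add the residual, then normalize
        return self.norm(x + self.ffn(x))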

import torch
import torch.nn.functional as F

d = 8         # model dim
seq_len = 13
bs = 1
wt = torch.rand((d, d))           # weight of a d -> d linear layer
x = torch.rand((seq_len, bs, d))  # activations: (seq_len, batch, d)
# shard the feature dim of x and the matching input columns of wt in half
x_r0, x_r1 = x[:, :, :d//2], x[:, :, d//2:]
wt_r0, wt_r1 = wt[:, :d//2], wt[:, d//2:]
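If the splits are meant to mimic a row-parallel linear layer (each rank holds half of the input features and the matching half of the weight columns), the partial F.linear outputs should sum to the unsharded result. This check is my addition to the snippet:

out_full = F.linear(x, wt)                                   # x @ wt.T
out_sharded = F.linear(x_r0, wt_r0) + F.linear(x_r1, wt_r1)  # per-rank partial sums
assert torch.allclose(out_full, out_sharded, atol=1e-5)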
git fetch
git checkout paper-v2
export SD=/data/users/sshleifer/fairseq-py/roberta_azure
train_roberta_base () {
export NCCL_DEBUG="warn"
./fb_sweep/bmr.py -g 8 -t 1 -n 8 --dl 12 --embed-dim 768 \
--bs 32 --li 50 --epg 0 --mu 2000000 --ebs 2048 --arch prenorm \
--resume-failed --nw 0 -p bl \
--opt adam --local-checkpoints-dir $SD --checkpoints-dir $SD --use-fused-softmax \
--ddp fully_sharded "$@"
}
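Because the function ends with "$@", any extra flags pass straight through to the sweep script, e.g. (hypothetical override; later flags typically win in argparse):

train_roberta_base --bs 16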

The way I test things quickly with srun:

(1) on devfair:

srun --gres=gpu:8 --partition=devaccel --nodes=1 --cpus-per-task 64 \
    --ntasks-per-node 1 --mem=400G --constraint volta32gb \
    --time="2-00:00:00" --pty /bin/zsh -l

(2) in the resulting shell:

  1. Remove the optimizer state and save it (to $HOME, for example):
MODEL_DIR=/large_experiments/xlmg/models/moe/52B/xlmg.52b.fp16.bm_none.tps2048.transformer_lm_gpt2_bigger.dl24.demb1024.dffn4096.moe_w0.01.all.share.adam.b2_0.98.eps1e-08.cl0.0.lr0.0003.sqrt_world_size.wu715.dr0.0.atdr0.0.wd0.01.ms2.uf1.mu572204.s1.ngpu128

python scripts/remove_opt_state.py \
 $MODEL_DIR/checkpoint_1_105000/checkpoint_1_105000 \
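The command above is truncated in the preview. The core operation is just dropping the optimizer state from a saved checkpoint; a generic sketch of that idea (not the actual scripts/remove_opt_state.py, and the fairseq key name is an assumption):

import torch

ckpt = torch.load("checkpoint.pt", map_location="cpu")
ckpt.pop("last_optimizer_state", None)  # assumed fairseq key for optimizer state
torch.save(ckpt, "checkpoint_no_opt.pt")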
@sshleifer
sshleifer / adam8bit_fair_usage.md
Last active July 28, 2021 22:02
How to use adam8bit

Setup

To use it on the FAIR cluster gshard branch, you need the following dependencies (run from inside the fairseq env, assuming CUDA 11.0):

pip install -i https://test.pypi.org/simple/ bitsandbytes-cuda110 -U
pip install -U fairscale

WARNING: if you don't do this step, your checkpoints will not be usable!
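For reference, the underlying 8-bit optimizer can be constructed directly from bitsandbytes like this. This is a generic sketch of the bitsandbytes API, not the fairseq/gshard wiring, and the model and hyperparameters are placeholders:

import bitsandbytes as bnb
import torch.nn as nn

model = nn.Linear(768, 768)  # stand-in for a real model
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=3e-4, betas=(0.9, 0.98))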

Results

| Params        | Fraction Embedding |
|---------------|--------------------|
| 209,190,912   | 19% |
| 265,814,016   | 15% |
| 354,418,688   | 15% |
| 455,081,984   | 12% |
| 1,312,817,152 | 8%  |
| 1,715,470,336 | 6%  |
| 2,875,195,392 | 5%  |
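The falling embedding fraction makes sense: embedding parameters scale only with vocab × d_model, while the rest of the model also scales with depth. A rough check with assumed numbers (a RoBERTa-style vocab of ~50k and d_model = 768 are guesses, not taken from the table):

vocab, d_model = 50_257, 768
emb = vocab * d_model          # ~38.6M embedding parameters
print(emb / 209_190_912)       # ~0.185, consistent with the 19% row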