Sam Shleifer (sshleifer)
// Place your key bindings in this file to override the defaults
[
  // Switching between editor and terminal
  {
    "key": "ctrl+j",
    "command": "workbench.action.terminal.focus",
    "when": "editorFocus || !editorIsOpen"
  },
  {
    "key": "ctrl+j",
    // likely completion (the preview was truncated): the reverse binding back to the editor
    "command": "workbench.action.focusActiveEditorGroup",
    "when": "terminalFocus"
  }
]
import torch
import torch.nn.functional as F

d = 8          # feature dimension
seq_len = 13
bs = 1
wt = torch.rand((d, d))           # linear weight, (out_features, in_features)
x = torch.rand((seq_len, bs, d))  # input, (seq_len, batch, features)
# split the input features and the weight's input dimension into two halves
x_r0, x_r1 = x[:, :, :d//2], x[:, :, d//2:]
wt_r0, wt_r1 = wt[:, :d//2], wt[:, d//2:]
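Under F.linear's (out_features, in_features) weight convention, the two half-width projections sum to the full one; a minimal sanity check of that equivalence (my reading of the snippet's intent, since the gist preview cuts off here):

full = F.linear(x, wt)                                    # x @ wt.T
partial = F.linear(x_r0, wt_r0) + F.linear(x_r1, wt_r1)   # sum of half-width products
assert torch.allclose(full, partial, atol=1e-6)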
git fetch
git checkout paper-v2
export SD=/data/users/sshleifer/fairseq-py/roberta_azure
train_roberta_base () {
  export NCCL_DEBUG="warn"
  ./fb_sweep/bmr.py -g 8 -t 1 -n 8 --dl 12 --embed-dim 768 \
    --bs 32 --li 50 --epg 0 --mu 2000000 --ebs 2048 --arch prenorm \
    --resume-failed --nw 0 -p bl \
    --opt adam --local-checkpoints-dir $SD --checkpoints-dir $SD --use-fused-softmax \
    --ddp fully_sharded "$@"
}
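The trailing "$@" forwards any extra flags to bmr.py, so one-off variants can reuse the function, e.g. (the extra flag is illustrative and assumes bmr.py accepts it):

train_roberta_base              # launch with the defaults above
train_roberta_base --dry-run    # hypothetical extra flag, forwarded via "$@"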
  1. Remove the optimizer state and save it to $HOME, for example:
MODEL_DIR=/large_experiments/xlmg/models/moe/52B/xlmg.52b.fp16.bm_none.tps2048.transformer_lm_gpt2_bigger.dl24.demb1024.dffn4096.moe_w0.01.all.share.adam.b2_0.98.eps1e-08.cl0.0.lr0.0003.sqrt_world_size.wu715.dr0.0.atdr0.0.wd0.01.ms2.uf1.mu572204.s1.ngpu128

python scripts/remove_opt_state.py \
 $MODEL_DIR/checkpoint_1_105000/checkpoint_1_105000 \

The way I test things quickly with srun:

(1) on devfair:

srun --gres=gpu:8 --partition=devaccel --nodes=1 --cpus-per-task 64 \
    --ntasks-per-node 1 --mem=400G --constraint volta32gb \
    --time="2-00:00:00" --pty /bin/zsh -l

(2) on the resultant shell:

@sshleifer
sshleifer / adam8bit_fair_usage.md
Last active July 28, 2021 22:02
How to use adam8bit

Setup

To use it on the FAIR cluster gshard branch, install the following dependencies (from inside the fairseq env, assuming CUDA 11.0):

pip install -i https://test.pypi.org/simple/ bitsandbytes-cuda110 -U
pip install -U fairscale

WARNING: if you don't do this step, your checkpoints will not be usable!
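For reference, the optimizer can also be constructed directly from bitsandbytes outside fairseq; a minimal sketch (the model and hyperparameters here are placeholders, not from this gist):

import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model
# Adam8bit stores optimizer state in int8, cutting optimizer memory roughly 4x vs fp32 Adam
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=3e-4, betas=(0.9, 0.98))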

Results

Params 209,190,912. Fraction Embedding: 19%
Params 265,814,016. Fraction Embedding: 15%
Params 354,418,688. Fraction Embedding: 15%
Params 455,081,984. Fraction Embedding: 12%
Params 1,312,817,152. Fraction Embedding: 8%
Params 1,715,470,336. Fraction Embedding: 6%
Params 2,875,195,392. Fraction Embedding: 5%
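The embedding fraction shrinks as models grow because embedding parameters scale linearly with the embed dim while the transformer trunk scales with its square times depth. A rough check on the first row (the vocab size and embed dim are my assumptions, not stated in the gist):

vocab, d = 51200, 768                        # assumed: ~50k BPE vocab (padded), base-model width
embed_params = vocab * d                     # 39,321,600
print(f"{embed_params / 209_190_912:.0%}")   # -> 19%, consistent with the first row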
@sshleifer
sshleifer / optim_cmds.md
Last active July 22, 2021 23:39
gshard optimizer experiment cmds

Setup

  • git clone git@github.com:fairinternal/fairseq-py.git && cd fairseq-py && git checkout stable-emb
  • if you don't have the fairseq conda env, follow these instructions
  • pip install numpy==1.20 (optional, but some people needed this)
  • pip install fairscale (should be > 0.3.7, as of writing)
  • on FAIR cluster: pip install -i https://test.pypi.org/simple/ bitsandbytes-cuda110 -U
  • OR on AWS: pip install -i https://test.pypi.org/simple/ bitsandbytes-cuda111 -U

Common Logic for all commands

Edit this as needed

@sshleifer
sshleifer / sharded_data_doc.md
Last active April 15, 2021 09:11
Construct+Use sharded dataset in fairseq

Constructing a sharded dataset

  • cat all your raw text into one huge file in /scratch/
  • run your favorite BPE on that file (about 20 minutes for 160 GB with 20 workers), writing the result to /scratch.

Then we filter newlines, collapsing runs of blank lines to a single blank and dropping the -- group separators that grep inserts:

grep -A1 . /scratch/rc_train_big.bpe | grep -v "^--$" > /scratch/rc.filtered.train.bpe
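On a toy file (contents invented for illustration), the effect is that multi-blank gaps collapse to single blank separators:

printf 'a\n\n\n\nb\nc\n\nd\n' > /tmp/toy.bpe
grep -A1 . /tmp/toy.bpe | grep -v "^--$"
# output: a, <blank>, b, c, <blank>, d  (the triple blank collapsed to one)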