Fluid made standalone Gmail and Trello apps for cmd-tab switching.
// Place your key bindings in this file to override the defaults
[
    // Switching between editor and terminal
    {
        "key": "ctrl+j",
        "command": "workbench.action.terminal.focus",
        "when": "editorFocus || !editorIsOpen"
    },
    {
        "key": "ctrl+j",
        // The original snippet is truncated here; the complementary binding
        // presumably returns focus to the editor when the terminal has focus:
        "command": "workbench.action.focusActiveEditorGroup",
        "when": "terminalFocus"
    }
]
Six groups of models inherit from BartForConditionalGeneration. The major differences between them are:
- pretraining objective & data
- finetuning objective & data
- number of layers and the dimension of each layer
- when layernorm is applied

This document focuses on layernorm timing.
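The two layernorm placements can be sketched as follows — a minimal illustration with a generic residual sublayer, not the exact Hugging Face implementation:

```python
import torch
import torch.nn as nn

# Sketch of the two layernorm placements. "Post-norm" (original BART/BERT
# style) normalizes after the residual add; "pre-norm" normalizes the
# sublayer input and leaves the residual stream unnormalized.
def postnorm_block(x, sublayer, ln):
    return ln(x + sublayer(x))

def prenorm_block(x, sublayer, ln):
    return x + sublayer(ln(x))
```

Pre-norm is often reported as more stable to train at depth, which is one reason the variants differ on this point.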
import torch
import torch.nn.functional as F

# Toy shapes for experimenting with splitting a matmul across workers.
d = 8
seq_len = 13
bs = 1
wt = torch.rand((d, d))           # weight matrix
x = torch.rand((seq_len, bs, d))  # activations
# Split the activations along the feature dim, and the weight along its columns.
x_r0, x_r1 = x[:, :, :d//2], x[:, :, d//2:]
wt_r0, wt_r1 = wt[:, :d//2], wt[:, d//2:]
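A self-contained continuation of the snippet above (my addition, not in the original gist), checking the column-parallel identity: concatenating the two half products reproduces the full matmul.

```python
import torch

# Same setup as above: split the weight matrix by columns.
d, seq_len, bs = 8, 13, 1
wt = torch.rand((d, d))
x = torch.rand((seq_len, bs, d))
wt_r0, wt_r1 = wt[:, :d//2], wt[:, d//2:]

# Column-parallel identity: x @ [W0 | W1] == [x @ W0 | x @ W1].
full = x @ wt
halves = torch.cat([x @ wt_r0, x @ wt_r1], dim=-1)
assert torch.allclose(full, halves, atol=1e-6)
```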
git fetch
git checkout paper-v2
export SD=/data/users/sshleifer/fairseq-py/roberta_azure
train_roberta_base () {
    export NCCL_DEBUG="warn"
    ./fb_sweep/bmr.py -g 8 -t 1 -n 8 --dl 12 --embed-dim 768 \
        --bs 32 --li 50 --epg 0 --mu 2000000 --ebs 2048 --arch prenorm \
        --resume-failed --nw 0 -p bl \
        --opt adam --local-checkpoints-dir $SD --checkpoints-dir $SD --use-fused-softmax \
        --ddp fully_sharded "$@"
}
The way I test things quickly with srun:

(1) on devfair:

srun --gres=gpu:8 --partition=devaccel --nodes=1 --cpus-per-task 64 \
    --ntasks-per-node 1 --mem=400G --constraint volta32gb \
    --time="2-00:00:00" --pty /bin/zsh -l

(2) on the resultant shell:
- remove optimizer state and save to $HOME, for example:

MODEL_DIR=/large_experiments/xlmg/models/moe/52B/xlmg.52b.fp16.bm_none.tps2048.transformer_lm_gpt2_bigger.dl24.demb1024.dffn4096.moe_w0.01.all.share.adam.b2_0.98.eps1e-08.cl0.0.lr0.0003.sqrt_world_size.wu715.dr0.0.atdr0.0.wd0.01.ms2.uf1.mu572204.s1.ngpu128
python scripts/remove_opt_state.py \
    $MODEL_DIR/checkpoint_1_105000/checkpoint_1_105000 \
Params 209,190,912. Fraction Embedding: 19%
Params 265,814,016. Fraction Embedding: 15%
Params 354,418,688. Fraction Embedding: 15%
Params 455,081,984. Fraction Embedding: 12%
Params 1,312,817,152. Fraction Embedding: 8%
Params 1,715,470,336. Fraction Embedding: 6%
Params 2,875,195,392. Fraction Embedding: 5%
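The shrinking embedding fraction follows from standard transformer parameter accounting; a rough sketch, assuming tied embeddings of vocab_size × d and the usual ~12·d² parameters per layer (both assumptions — the exact configs behind the table above are unknown):

```python
# Rough transformer parameter accounting. Embedding params grow linearly
# in d while the transformer body grows quadratically, so the embedding
# fraction shrinks as models get wider/deeper, matching the table's trend.
def embed_fraction(d, n_layers, vocab_size=50_000):
    embed = vocab_size * d
    body = n_layers * 12 * d * d
    return embed / (embed + body)

small = embed_fraction(d=768, n_layers=12)
large = embed_fraction(d=2048, n_layers=24)
```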