Sam Shleifer (sshleifer)
sshleifer / anki_setup.md
Created March 6, 2021 19:27
Anki Setup
sshleifer / time_dbart_generate.py
Created October 26, 2020 17:29
Timing Generate
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import time
from tqdm import tqdm
from pathlib import Path
import pandas as pd
models = ['sshleifer/distilbart-cnn-12-3',
'sshleifer/distilbart-cnn-12-6',
'sshleifer/distilbart-cnn-6-6',
sshleifer / latex_style.md
Created October 19, 2020 15:19
Sasha's latex style rules

Avoid:

  • [!h] for figures/tables.
  • two datasets in one plot
  • NameError: introducing terms that haven't been defined.
sshleifer / download_summ_data.py
Created October 7, 2020 19:19
Fetching summarization datasets
from pathlib import Path
import fire
from tqdm import tqdm
DS_TO_KEY = {
'gigaword': ('document', 'summary'),
'xsum': ('document', 'summary'),
'aeslc': ('email_body', 'subject_line'),
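For context, a minimal sketch of how a name-to-keys mapping like DS_TO_KEY can be used with the datasets library to write source/target text files (dump_split is a hypothetical helper, not the gist's actual code):

from pathlib import Path
from datasets import load_dataset

def dump_split(name, split, out_dir):
    # Look up which fields hold the source text and the reference summary.
    src_key, tgt_key = DS_TO_KEY[name]
    ds = load_dataset(name, split=split)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / f'{split}.source', 'w') as src_f, open(out_dir / f'{split}.target', 'w') as tgt_f:
        for ex in ds:
            # One example per line; newlines inside fields are flattened to spaces.
            src_f.write(ex[src_key].replace('\n', ' ') + '\n')
            tgt_f.write(ex[tgt_key].replace('\n', ' ') + '\n')

For example, dump_split('xsum', 'test', 'xsum') would write xsum/test.source and xsum/test.target.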

How BartConfig controls when LayerNorm is applied

Six groups of models inherit from BartForConditionalGeneration. The major differences between them are:

  • pretraining objective & data
  • finetuning objective & data
  • number of layers and dimension of each layer
  • when layernorm is applied

This document focuses on layernorm timing.
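As a rough illustration of the pre-LN vs. post-LN distinction, here is a toy encoder layer toggled by a normalize_before flag (a simplified sketch in the spirit of the old BartConfig flags, not the actual transformers implementation):

import torch.nn as nn

class ToyEncoderLayer(nn.Module):
    # normalize_before=False: post-LN, LayerNorm runs after each residual add (BART-style).
    # normalize_before=True:  pre-LN, LayerNorm runs on the input to attention/FFN (mBART/Pegasus-style).
    def __init__(self, d_model=16, n_heads=2, normalize_before=False):
        super().__init__()
        self.normalize_before = normalize_before
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))
        self.ffn_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        residual = x
        if self.normalize_before:
            x = self.attn_norm(x)
        x, _ = self.self_attn(x, x, x)
        x = residual + x
        if not self.normalize_before:
            x = self.attn_norm(x)
        residual = x
        if self.normalize_before:
            x = self.ffn_norm(x)
        x = residual + self.ffn(x)
        if not self.normalize_before:
            x = self.ffn_norm(x)
        return x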

export b="s3://models.huggingface.co/bert"
# Mirror a model's files from the stas namespace to the facebook namespace on S3.
stas_to_fb () {
  src=$1
  shift
  aws s3 sync $b/stas/$src $b/facebook/$src $@
}
stas_to_allenai () {
  src=$1
  shift
sshleifer / dynb.md
Last active September 9, 2020 19:25

Problem:

  • In WMT datasets, there is wide variation in the length of examples: some are one sentence, some are 10 sentences.
  • The max batch size that fits on a V100 is roughly (4, 512).
  • You end up with lots of batches of shape (4, 12) or (4, small_int) that don't fully utilize the GPU.

Dynamic batch size: try to organize batches to contain roughly 4*512 = 2048 tokens, so one batch might be shaped (4, 512) and another (32, 64).
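A minimal sketch of the idea as greedy, length-sorted bucketing with a token budget (illustrative only, not fairseq's actual algorithm; batch_by_tokens is a hypothetical helper):

def batch_by_tokens(examples, max_tokens=2048):
    # examples: list of token-id lists. Sorting by length keeps padding waste low;
    # a batch is flushed once padding every example in it to the longest member
    # would exceed the token budget.
    batches, batch, longest = [], [], 0
    for ex in sorted(examples, key=len):
        new_longest = max(longest, len(ex))
        if batch and new_longest * (len(batch) + 1) > max_tokens:
            batches.append(batch)
            batch, longest = [], 0
            new_longest = len(ex)
        batch.append(ex)
        longest = new_longest
    if batch:
        batches.append(batch)
    return batches

With max_tokens=2048, long examples end up in batches shaped roughly (4, 512) and short ones in batches like (32, 64).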

Details of Fairseq Solution:

sshleifer / finetune_pegasus_xsum.sh
Last active September 8, 2020 21:19
took 25hr
python finetune.py \
--task summarization \
--learning_rate=3e-4 \
--do_train \
--do_predict \
--val_check_interval 0.25 --n_val 1000 \
--data_dir xsum \
--max_source_length 512 --max_target_length=56 \
--freeze_embeds \
--model_name_or_path google/pegasus-large \
sshleifer / mbart_lb.md
Last active September 2, 2020 15:13
Experiment results from distilling mbart-large-en-ro and finetuning mbart-large-cc25
for file in */*bleu.json
do
   echo "$file:"
   sed -n '/^\s*$/!{p;q}' "$file"
   echo "------"
done

En-Ro test BLEU (distil-mbart unless otherwise specified, before post-processing).