Sam Shleifer sshleifer

🏠
Working from home
View GitHub Profile
@sshleifer
sshleifer / pegasus.png
Last active August 21, 2020 15:10
Pegasus Thumbnail
sshleifer / lang_tag_logic.md
Created August 19, 2020 01:00
Language Tagging Process
sshleifer / marian_constituents.py
Created August 18, 2020 15:37
Marian Multilingual Groups
# three-letter code -> (group/language name, {constituents...})
# If a group appears on the target side, its constituents can be used as target language codes;
# if it appears on the source side, the constituents are supported natively, with no special codes needed.
{'aav': ('Austro-Asiatic languages',
{'hoc', 'hoc_Latn', 'kha', 'khm', 'khm_Latn', 'mnw', 'vie', 'vie_Hani'}),
'afa': ('Afro-Asiatic languages',
{'acm', 'afb', 'amh', 'apc', 'ara', 'arq', 'ary', 'arz', 'hau_Latn', 'heb', 'kab', 'mlt', 'rif_Latn', 'shy_Latn', 'som', 'thv', 'tir'}),
'afr': ('Afrikaans', {'afr'}),
'alv': ('Atlantic-Congo languages',
{'ewe', 'fuc', 'fuv', 'ibo', 'kin', 'lin', 'lug', 'nya', 'run', 'sag', 'sna', 'swh', 'toi_Latn', 'tso', 'umb', 'wol', 'xho', 'yor', 'zul'}),
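As a sketch of how this mapping might be consumed (the dict above is truncated to two entries here; `GROUPS` and `is_valid_target` are illustrative names, not part of any Marian API):

```python
# Hypothetical helper over the group mapping above.
GROUPS = {
    'aav': ('Austro-Asiatic languages',
            {'hoc', 'hoc_Latn', 'kha', 'khm', 'khm_Latn', 'mnw', 'vie', 'vie_Hani'}),
    'afr': ('Afrikaans', {'afr'}),
}

def is_valid_target(group_code: str, lang_code: str) -> bool:
    """True if lang_code can be used as a target language code for this group."""
    if group_code not in GROUPS:
        return False
    _, constituents = GROUPS[group_code]
    return lang_code in constituents
```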
sshleifer / fairseq_model_inputs.md
Last active August 19, 2020 18:10
breakpoint at /home/shleifer/fairseq/fairseq/tasks/fairseq_task.py(385)train_step()

(first, wget fairseq_wmt_enro.tgz from s3)

During training, fairseq passes mbart dynamically sized batches (up to 128 tokens), in a dict called sample, with the following relevant keys:

  • target (our labels): no bos, ends with [2, tgt_lang_code]
  • net_input.src_tokens (our input_ids): ends with [2, 250004]
  • net_input.prev_output_tokens (our decoder_input_ids): starts with 250020, ends with 2. This is the "shift_tokens_right" version of target.
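The relation between target and prev_output_tokens can be sketched as rotating the final token (the language code) to the front. This is a minimal illustration of the shapes described above; the real fairseq/transformers implementations also handle padding, which is ignored here:

```python
def shift_tokens_right(target):
    """Rotate the last token (the target language code) to the front."""
    return [target[-1]] + target[:-1]

# target: no bos, ends with [2, tgt_lang_code] (250020 = ro_RO)
target = [9, 8, 7, 2, 250020]
prev_output_tokens = shift_tokens_right(target)
# prev_output_tokens starts with 250020 and ends with 2
```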

Here are the logs from my breakpoint:

sshleifer / s3_wmt.sh
Created August 11, 2020 15:31
s3 translation dataset upload workflow
tar -czvf wmt16_en_ru.tgz wmt16_en_ru
# wmt16_en_ru/
# wmt16_en_ru/train.source
# wmt16_en_ru/train.target
# wmt16_en_ru/test.target
# wmt16_en_ru/test.source
# wmt16_en_ru/val.source
# wmt16_en_ru/val.target
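The same packaging step can be done from Python with the stdlib, mirroring the directory layout in the comments above. The actual S3 upload (e.g. via `boto3`'s `upload_file`) is omitted since no bucket is named in the gist:

```python
import os
import tarfile

# Recreate the layout shown in the tar listing above (empty files for illustration).
os.makedirs('wmt16_en_ru', exist_ok=True)
for split in ('train', 'val', 'test'):
    for side in ('source', 'target'):
        open(f'wmt16_en_ru/{split}.{side}', 'w').close()

# Equivalent of: tar -czvf wmt16_en_ru.tgz wmt16_en_ru
with tarfile.open('wmt16_en_ru.tgz', 'w:gz') as tgz:
    tgz.add('wmt16_en_ru')
```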
sshleifer / .vimrc
Last active August 10, 2020 05:01
vimrc
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Filename: .vimrc "
" Maintainer: Sam Shleifer <sshleifer@gmail.com> "
" URL: http://github.com/sshleifer/dotfiles "
" "
" "
" Sections: "
" 01. Plugins ................. using vundle "
" 02. python .................. General autocmd events "
" 03. Vim options ............ Colors, fonts, etc. "
sshleifer / generate_cc25.sh
Last active June 21, 2020 19:06
Broken Script to translate with cc25
export langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
export CC25=/Users/shleifer/cc25_pretrain
export outfile=pred_en_ro.txt
export PRETRAIN=$CC25/model.pt
fairseq-generate tmp/ --path $PRETRAIN \
--task translation_from_pretrained_bart -t en_XX -s ro_RO --bpe 'sentencepiece' \
--sentencepiece-vocab $CC25/sentence.bpe.model --sacrebleu --remove-bpe 'sentencepiece' \
--max-sentences 32 --langs $langs --beam 5 > $outfile
# by stas00 and sshleifer
import nlp
from tqdm import tqdm
dataset = 'wmt19'
s = 'ru'
t = 'en'
pair = f'{s}-{t}'

Stanford CoreNLP Setup

ptb_tokenize () {
    cat $1 | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > $2
}

sudo apt install openjdk-8-jre-headless
sudo apt-get install ant
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
unzip stanford-corenlp-full-2018-10-05.zip
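A rough Python port of the ptb_tokenize shell function above, for scripting the same step. It assumes the CoreNLP jars from the unzip step are already on the Java CLASSPATH; only the command construction is certain here, the rest mirrors the shell pipeline:

```python
import subprocess

# Base Java invocation, taken from the shell function above.
PTB_CMD = ['java', 'edu.stanford.nlp.process.PTBTokenizer',
           '-ioFileList', '-preserveLines']

def ptb_tokenize(infile: str, outfile: str) -> None:
    """Stream infile through PTBTokenizer and write the result to outfile,
    like: cat $1 | java ... > $2"""
    with open(infile) as src, open(outfile, 'w') as dst:
        subprocess.run(PTB_CMD, stdin=src, stdout=dst, check=True)
```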