杨海宏 RamonYeung

  • MIT, The Alibaba DAMO Academy
  • Hangzhou, China
@RamonYeung
RamonYeung / BPE
Created June 27, 2019 08:33 — forked from ranihorev/BPE
Byte Pair Encoding example (Source: Sennrich et al. - https://arxiv.org/abs/1508.07909)
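import re, collections

def get_stats(vocab):
    # vocab maps a space-separated symbol sequence to its word frequency
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        # count each adjacent symbol pair, weighted by the word's frequency
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs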
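The preview above stops at the pair-counting step. As a rough sketch of how those counts could drive the merge loop described in Sennrich et al., the block below repeats `get_stats` so it runs on its own and adds a `merge_vocab` helper. That helper, its regex, the toy vocabulary, and the choice of three merges are assumptions made for illustration; none of them appear in the excerpt shown here.

```python
import collections
import re


def get_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs


def merge_vocab(pair, vocab):
    """Fuse every occurrence of `pair` into one symbol (assumed helper)."""
    bigram = re.escape(" ".join(pair))
    # Lookarounds keep the match aligned to whole, space-delimited symbols.
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}


# Toy vocabulary (illustrative data; "</w>" marks the end of a word).
vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

merges = []
for _ in range(3):  # number of merges is an arbitrary choice here
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)  # ties resolve to the first pair counted
    vocab = merge_vocab(best, vocab)
    merges.append(best)

print(merges)
print(vocab)
```

With this toy data the three learned merges build up the suffix `est</w>` one step at a time. Ties between equally frequent pairs are resolved by dictionary insertion order, which is a property of this sketch rather than something the algorithm requires.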
''' Script for downloading all GLUE data.
Note: for legal reasons, we are unable to host MRPC.
You can either use the version hosted by the SentEval team, which is already tokenized,
or you can download the original data from (https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B-3604ED519838/MSRParaphraseCorpus.msi) and extract the data from it manually.
For Windows users, you can run the .msi file. For Mac and Linux users, consider an external library such as 'cabextract' (see below for an example).
You should then rename and place specific files in a folder (see below for an example).
mkdir MRPC
cabextract MSRParaphraseCorpus.msi -d MRPC