@aravindpai
Created May 22, 2020 16:13
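# Merges the most frequent pair everywhere it occurs in the corpus.
# Accepts the best pair (a tuple of two symbols) and the corpus,
# a dict keyed by space-separated symbol sequences.
# Returns the modified corpus as a new dict.
import re


def merge_vocab(pair, corpus_in):
    corpus_out = {}
    # Escape the pair so regex metacharacters in symbols match literally.
    bigram = re.escape(' '.join(pair))
    # The lookarounds require whitespace (or the string edge) on both sides,
    # so only whole symbols match, never part of a longer symbol.
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    merged = ''.join(pair)
    for word in corpus_in:
        # A callable replacement keeps backslashes in symbols from being
        # interpreted as escape sequences by re.sub.
        w_out = p.sub(lambda m: merged, word)
        corpus_out[w_out] = corpus_in[word]
    return corpus_out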
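
A minimal usage sketch, assuming the corpus maps space-separated symbol sequences to word counts. The toy corpus, the `</w>` end-of-word marker, and the chosen pair `('e', 's')` are illustrative assumptions and do not come from the gist. The function is repeated so the block runs on its own.

```python
import re


def merge_vocab(pair, corpus_in):
    corpus_out = {}
    bigram = re.escape(' '.join(pair))
    # Match the pair only when both symbols are whole, whitespace-delimited tokens.
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    merged = ''.join(pair)
    for word, count in corpus_in.items():
        corpus_out[p.sub(lambda m: merged, word)] = count
    return corpus_out


# Hypothetical toy corpus: each key is a word split into symbols.
toy_corpus = {
    'l o w </w>': 5,
    'l o w e r </w>': 2,
    'n e w e s t </w>': 6,
    'w i d e s t </w>': 3,
}

# Merge the adjacent symbols 'e' and 's' into the single symbol 'es'.
merged_corpus = merge_vocab(('e', 's'), toy_corpus)
for word, count in merged_corpus.items():
    print(word, count)
```

Only the entries that contain `e` immediately followed by `s` as separate symbols change; the counts carry over unchanged. Because of the lookarounds, a sequence such as `he s t` is left alone: the `e` there is the tail of the longer symbol `he`, not a symbol of its own.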