@aravindpai
Created May 22, 2020 16:13
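# Merges the most frequent pair in the corpus.
# Accepts the best pair (a tuple of two symbols) and the corpus
# (a dict keyed by space-separated symbol strings).
# Returns the modified corpus.
import re

def merge_vocab(pair, corpus_in):
    corpus_out = {}
    # Escape the pair so regex metacharacters in symbols are matched literally
    bigram = re.escape(' '.join(pair))
    # Lookarounds make the match cover whole symbols, not parts of longer ones
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in corpus_in:
        # Replace each occurrence of the spaced pair with the merged symbol
        w_out = p.sub(''.join(pair), word)
        corpus_out[w_out] = corpus_in[word]
    return corpus_out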
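
A minimal usage sketch for the function above. The toy corpus, its counts, and the `</w>` end-of-word marker are assumptions made for illustration and are not part of the original gist; the function is repeated so the snippet runs on its own.

```python
import re

def merge_vocab(pair, corpus_in):
    corpus_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in corpus_in:
        w_out = p.sub(''.join(pair), word)
        corpus_out[w_out] = corpus_in[word]
    return corpus_out

# Hypothetical toy corpus: space-separated symbols mapped to counts
corpus = {
    'l o w </w>': 5,
    'l o w e r </w>': 2,
    'n e w e s t </w>': 6,
    'w i d e s t </w>': 3,
}

# Merge the pair ('e', 's') wherever the two symbols are adjacent
merged = merge_vocab(('e', 's'), corpus)

# The lookarounds prevent a match against part of a longer symbol:
# here 's' is already part of 'st', so 'e st' is left untouched
guarded = merge_vocab(('e', 's'), {'w i d e st </w>': 3})
```

After the call, `merged` keeps the counts but rewrites the keys that contain the pair, for example `'n e w e s t </w>'` becomes `'n e w es t </w>'`. Keys without the adjacent pair pass through unchanged.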