@vadimkantorov
Last active November 5, 2019 20:45
Compare two ARPA language models with KenLM
# Usage: python3 find_domain_words --ours chats.arpa --theirs ru_wiyalen_no_punkt.arpa.binary > domain_words.txt
import argparse
import kenlm
parser = argparse.ArgumentParser()
parser.add_argument('--ours', required = True)
parser.add_argument('--theirs', required = True)
args = parser.parse_args()
ours = kenlm.LanguageModel(args.ours)
theirs = kenlm.LanguageModel(args.theirs)
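# Collect the unigram vocabulary from our model. This reads --ours as text,
# so it must be a plain ARPA file; unigram entries start with a negative log10 probability.
vocab = []
with open(args.ours) as f:
    for l in f:
        if l.startswith('-'):
            vocab.append(l.split()[1])
        # Stop at the "\2-grams:" section header: the unigrams are done.
        if '2-grams' in l:
            break
# Score each word with both models and keep the difference of log probabilities.
scores = []
for w in vocab:
    log_prob_ours, log_prob_theirs = ours.score(w), theirs.score(w)
    scores.append((w, log_prob_ours, log_prob_theirs, log_prob_ours - log_prob_theirs))
# Print words sorted by log-probability ratio, most specific to our model first.
for w, log_prob_ours, log_prob_theirs, log_prob_ratio in sorted(scores, key = lambda s: s[-1], reverse = True):
    print(w, log_prob_ratio)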
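The ranking idea above can be tried without KenLM installed. The following is a minimal sketch, not the script's method: it reads raw unigram log10 probabilities straight from ARPA text, whereas the script scores each word through `kenlm.LanguageModel.score`, so the numbers will generally differ. The two ARPA excerpts and the helper names `read_unigrams` and `rank_by_log_ratio` are hypothetical and exist only for illustration.

```python
# Hypothetical, tiny ARPA excerpts (illustration only, not real model data).
OURS_ARPA = "\\data\\\nngram 1=3\nngram 2=1\n\n\\1-grams:\n" \
            "-0.5\thello\t-0.3\n-1.0\tworld\t-0.2\n-2.0\tkenlm\n\n" \
            "\\2-grams:\n-0.4\thello world\n\n\\end\\\n"
THEIRS_ARPA = "\\data\\\nngram 1=3\n\n\\1-grams:\n" \
              "-0.4\thello\n-2.5\tworld\n-2.0\tkenlm\n\n\\end\\\n"


def read_unigrams(lines):
    """Return {word: log10 probability} from the unigram section of ARPA lines."""
    unigrams = {}
    in_unigrams = False
    for line in lines:
        line = line.strip()
        if line == '\\1-grams:':
            in_unigrams = True
        elif line.startswith('\\'):
            # Any other section marker ends the unigram section.
            if in_unigrams:
                break
        elif in_unigrams and line:
            fields = line.split()
            unigrams[fields[1]] = float(fields[0])
    return unigrams


def rank_by_log_ratio(ours, theirs):
    """Rank words present in both models by ours - theirs, largest first."""
    shared = [w for w in ours if w in theirs]
    ratios = [(w, ours[w] - theirs[w]) for w in shared]
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)


ours_unigrams = read_unigrams(OURS_ARPA.splitlines())
theirs_unigrams = read_unigrams(THEIRS_ARPA.splitlines())
ranking = rank_by_log_ratio(ours_unigrams, theirs_unigrams)
for word, log_ratio in ranking:
    print(word, log_ratio)
```

Words near the top are relatively more probable under the first model than under the second, which is the sense in which the script treats them as domain words.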