@kylemcdonald
Created December 10, 2014 17:33
Word-subword pairs sorted by combined frequency (the product of the two word counts).
# Run this as:
#   python yinxyz.py > pairs.txt
# and in another shell:
#   sort -rn pairs.txt | head -500 | cut -f2-
# The word list is from http://norvig.com/ngrams/count_1w.txt

# Load word -> frequency from the tab-separated word list.
with open('count_1w.txt') as f:
    pairs = [line.strip().split('\t') for line in f]
count = {}
for w, c in pairs:
    count[w] = int(c)

# Sort by length so the inner loop can stop early.
words = list(count.keys())
words.sort(key=len)
words4 = [x for x in words if len(x) > 4]
words6 = [x for x in words if len(x) > 6]

for xyz in words6:
    mid = xyz[1:-1]  # drop the first and last letter so y sits strictly inside xyz
    for y in words4:
        if len(y) > len(mid):
            break  # every remaining candidate is longer than mid
        if y in mid:  # could be faster, see http://stackoverflow.com/a/6934237/940196
            match = count[y] * count[xyz]
            # Parenthesized print works under both Python 2 and Python 3.
            print('\t'.join([str(match), y, xyz]))
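The comment in the inner loop notes that the substring test could be faster. One possible alternative, sketched below, is to enumerate the substrings of each word's interior and look them up in the `count` dictionary, rather than scanning the whole word list for every word. This is an assumption about what might help, not necessarily the approach described in the linked Stack Overflow answer. The function name `subword_pairs`, its `min_sub` and `min_word` parameters, and the toy frequencies are all hypothetical and chosen only for illustration.

```python
def subword_pairs(count, min_sub=5, min_word=7):
    """Yield (score, subword, word) for each subword strictly inside word.

    count maps word -> frequency. The default thresholds mirror the
    script: subwords longer than 4 letters, words longer than 6 letters.
    """
    for word, word_count in count.items():
        if len(word) < min_word:
            continue
        mid = word[1:-1]  # drop first and last letter: the match must be interior
        seen = set()      # report each subword once per word, like the original
        # Enumerate every long-enough substring of mid and look it up directly.
        for start in range(len(mid) - min_sub + 1):
            for end in range(start + min_sub, len(mid) + 1):
                sub = mid[start:end]
                if sub in count and sub not in seen:
                    seen.add(sub)
                    yield count[sub] * word_count, sub, word


# Toy frequencies, invented purely for illustration (not from count_1w.txt).
toy = {"mothers": 10, "brothers": 5, "other": 20}
for score, sub, word in sorted(subword_pairs(toy), reverse=True):
    print("\t".join([str(score), sub, word]))
```

Each word costs a number of dictionary lookups that depends only on its own length, so the work no longer grows with the size of the word list for every word. Sorting inside Python here stands in for the `sort -rn` step on the real data.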