@neilmayhew
Last active June 5, 2021 22:10
Python script to print a counted list of all digraphs in a set of files
#!/usr/bin/env python3
import fileinput, re

counts = {}
for l in fileinput.input():
    l = l.lower()
    # The lookahead does not consume the second letter, so matches overlap:
    # "abc" yields both "ab" and "bc"
    for m in re.finditer(r'[a-z](?=[a-z])', l):
        graph = l[m.start():m.end()+1]
        counts[graph] = counts.get(graph, 0) + 1

# Output in frequency order, highest first
for k, v in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(k, v)
@neilmayhew

@HughP: OK, just comment out that line (with #) and you should get what you want.

A full implementation would take command-line options for graph length (di-, tri-, etc.), case, and sort order (alphabetic vs. frequency).
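A rough sketch of what such a version might look like, using `argparse`. The option names (`--length`, `--keep-case`, `--alphabetic`) and the helper names `count_graphs`, `sort_counts` and `main` are made up for illustration and are not part of the gist. "Case" is assumed here to mean "don't lowercase the input".

```python
import argparse
import fileinput
import re


def count_graphs(lines, n=2, fold_case=True):
    """Count overlapping runs of n ASCII letters in an iterable of lines."""
    letter = '[a-z]' if fold_case else '[A-Za-z]'
    # Consume one letter and look ahead for the remaining n-1 letters,
    # so matches can overlap ("abc" gives "ab" and "bc" for n=2).
    pattern = re.compile('%s(?=%s{%d})' % (letter, letter, n - 1))
    counts = {}
    for line in lines:
        if fold_case:
            line = line.lower()
        for m in pattern.finditer(line):
            graph = line[m.start():m.start() + n]
            counts[graph] = counts.get(graph, 0) + 1
    return counts


def sort_counts(counts, alphabetic=False):
    """Return (graph, count) pairs, alphabetically or by descending count."""
    if alphabetic:
        return sorted(counts.items())
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


def main(argv=None):
    # Hypothetical option names, chosen only for this sketch
    parser = argparse.ArgumentParser(description='Count n-graphs in files')
    parser.add_argument('-n', '--length', type=int, default=2,
                        help='graph length (2 = digraphs, 3 = trigraphs)')
    parser.add_argument('--keep-case', action='store_true',
                        help='do not lowercase the input')
    parser.add_argument('--alphabetic', action='store_true',
                        help='sort by graph instead of by frequency')
    parser.add_argument('files', nargs='*',
                        help='input files (stdin if none are given)')
    args = parser.parse_args(argv)

    counts = count_graphs(fileinput.input(args.files),
                          n=args.length, fold_case=not args.keep_case)
    for graph, count in sort_counts(counts, args.alphabetic):
        print(graph, count)
```

To use it as a script, call `main()` under the usual `if __name__ == '__main__':` guard. Keeping the counting in its own function also makes it easy to try from a REPL, e.g. `count_graphs(["some text"], n=3)`.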

@neilmayhew

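To manually change it to search for trigraphs instead of digraphs, make the lookahead cover the next two letters and widen the slice. (Two separate (?=[a-z]) lookaheads would both test the same next character, so strings like "ab " would be counted.)

for m in re.finditer(r'[a-z](?=[a-z]{2})', l):
    graph = l[m.start():m.end()+2]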

@HughP

HughP commented Mar 26, 2015

@neilmayhew doesn't the [a-z] portion exclude punctuation marks and capitals?
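One way to check this is to run the pattern on a short sample, with and without the lowercasing step. This is only a quick experiment: the sample string and the helper name `digraphs` are invented for illustration.

```python
import re


def digraphs(text):
    """List the overlapping letter pairs the gist's pattern finds in text."""
    return [text[m.start():m.end() + 1]
            for m in re.finditer(r'[a-z](?=[a-z])', text)]


sample = "The U.S.A., isn't it?"

# With the lowercasing step, capitals are folded and counted;
# pairs broken up by punctuation or spaces are skipped.
print(digraphs(sample.lower()))

# Without it, any pair containing a capital is skipped as well.
print(digraphs(sample))
```

So punctuation is always excluded, and capitals are excluded only if the `l = l.lower()` line is commented out.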
