@arrowtype
Last active March 1, 2023 23:42
A simple Python script to count and rank the frequency of words in a text file, e.g. for verifying that you are kerning important pairs for specific content
"""
Simple Python script to count word frequency in a given text document.

Started from
https://www.geeksforgeeks.org/python-count-occurrences-of-each-word-in-given-text-file/

Usage: Update the file path below, then run in the command line.
"""

# Relative path to a .txt file
fileToRank = "proofing/moby-dick.txt"

# Relative path to a .txt file to create, with frequency counts
rankingFile = fileToRank.replace(".txt", "--words_counted.txt")

# Create an empty dictionary
d = dict()

# Open the file in read mode; the "with" block closes it when done
with open(fileToRank, "r") as text:
    # Loop through each line of the file
    for line in text:
        # Remove leading and trailing whitespace, including the newline character
        line = line.strip()

        # Convert the characters in line to
        # lowercase to avoid case mismatch
        line = line.lower()

        # Split the line into words
        words = line.split(" ")

        # Iterate over each word in line
        for word in words:
            # Skip empty strings produced by consecutive spaces
            if word == "":
                continue

            # Check if the word is already in dictionary
            if word in d:
                # Increment count of word by 1
                d[word] = d[word] + 1
            else:
                # Add the word to dictionary with count 1
                d[word] = 1

# Sort the dictionary by the frequency count values, in descending order
sortedDict = {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}

# Make a new text file, then write results to that
with open(rankingFile, 'w') as f:
    for key in list(sortedDict.keys()):
        f.write(f"{key}: {sortedDict[key]}\n")
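
The counting and sorting steps in the script can also be written with `collections.Counter` from the Python standard library. The following is a minimal sketch, not part of the original gist: the function name `count_words` and the in-memory `sample` lines are hypothetical, chosen so the example runs without a text file. It keeps the same normalization as the script (strip, lowercase, split on single spaces).

```python
from collections import Counter


def count_words(lines):
    """Count lowercase, space-separated words across an iterable of lines.

    Hypothetical helper for illustration; not part of the original script.
    """
    counts = Counter()
    for line in lines:
        # Same normalization as the script: strip, lowercase, split on spaces
        words = line.strip().lower().split(" ")
        # Drop the empty strings produced by consecutive spaces
        counts.update(word for word in words if word != "")
    return counts


# Hypothetical sample input standing in for the lines of a text file
sample = ["The whale and the sea", "the  ship"]

# most_common() returns (word, count) pairs sorted by count, highest first
ranked = count_words(sample).most_common()

for word, count in ranked:
    print(f"{word}: {count}")
```

To use this on a real file, an open file object can be passed directly, since it iterates line by line: `count_words(open(fileToRank, "r"))`, ideally inside a `with` block.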
the: 14322
of: 6560
and: 6255
a: 4607
to: 4521
in: 4066
that: 2734
his: 2485
it: 1765
i: 1723
as: 1705
with: 1703
he: 1681
but: 1667
is: 1579
was: 1576
for: 1529
all: 1342
at: 1297
this: 1232
by: 1151
from: 1086
not: 1065
be: 986
on: 951
so: 877
one: 781
you: 770
had: 763
have: 755
# etc... (Moby Dick has about 31,436 unique words)
# from https://www.gutenberg.org/files/2701/2701-0.txt, counted with Project Gutenberg text removed
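
Because the script splits only on spaces, punctuation stays attached to words, so `whale`, `whale,` and `whale.` are counted as separate entries, and the unique-word total above includes those variants. The following is a minimal sketch of one way to merge them, assuming it is acceptable to trim ASCII punctuation from both ends of each token. The function name `count_clean_words` and the sample lines are hypothetical. Typographic quotes and dashes are not covered by `string.punctuation` and would need to be added to the stripped characters if the text contains them.

```python
import string
from collections import Counter


def count_clean_words(lines):
    """Count words after trimming ASCII punctuation from each token.

    Hypothetical helper for illustration; not part of the original script.
    """
    counts = Counter()
    for line in lines:
        # split() with no argument splits on any run of whitespace
        for token in line.lower().split():
            # Trim punctuation from both ends; inner apostrophes and hyphens stay
            word = token.strip(string.punctuation)
            if word:
                counts[word] += 1
    return counts


# Hypothetical sample input standing in for the lines of a text file
sample = ['"Whale!" cried the mate.', "The whale, the whale."]
counts = count_clean_words(sample)

for word, count in counts.most_common():
    print(f"{word}: {count}")
```

For kerning checks, the untrimmed counts may still be useful, since pairs of a letter followed by punctuation also need kerning.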