Last active
March 1, 2023 23:42
A simple Python script to count and rank the frequency of words in a text file, e.g. for verifying that you are kerning important pairs for specific content
"""
Simple Python script to count word frequency in a given text document.

Started from
https://www.geeksforgeeks.org/python-count-occurrences-of-each-word-in-given-text-file/

Usage: Update the file path below, then run in the command line.
"""

# Relative path to a .txt file
fileToRank = "proofing/moby-dick.txt"

# Relative path to a .txt file to create, with frequency counts
rankingFile = fileToRank.replace(".txt", "--words_counted.txt")

# Create an empty dictionary to hold the count for each word
d = dict()

# Open the file in read mode; the with block closes it when the loop is done
with open(fileToRank, "r") as text:
    # Loop through each line of the file
    for line in text:
        # Remove leading and trailing whitespace, including the newline character
        line = line.strip()

        # Convert the characters in line to
        # lowercase to avoid case mismatch
        line = line.lower()

        # Split the line into words (punctuation stays attached to each word)
        words = line.split(" ")

        # Iterate over each word in line
        for word in words:
            # Skip the empty strings left by consecutive spaces
            if word == "":
                continue

            # Check if the word is already in dictionary
            if word in d:
                # Increment count of word by 1
                d[word] = d[word] + 1
            else:
                # Add the word to dictionary with count 1
                d[word] = 1

# Sort the dictionary by the frequency count values, in descending order
sortedDict = dict(sorted(d.items(), key=lambda item: item[1], reverse=True))

# Make a new text file, then write one "word: count" line per word
with open(rankingFile, "w") as f:
    for key, count in sortedDict.items():
        f.write(f"{key}: {count}\n")
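
Because the script splits on spaces only, punctuation stays attached to each word, so "whale" and "whale," are counted as separate entries. If that matters for your content, here is a minimal sketch of one way to count words without the punctuation, using collections.Counter from the standard library. The function name count_words and the sample lines are made up for illustration, and the pattern assumes ASCII letters and straight apostrophes only; it is not part of the script above.

```python
import re
from collections import Counter


def count_words(lines):
    """Count lowercase words, dropping any surrounding punctuation."""
    counts = Counter()
    for line in lines:
        # Assumption: a word is a run of a-z, optionally joined by
        # straight apostrophes (e.g. "don't"). Accented letters and
        # curly apostrophes would need a broader pattern.
        counts.update(re.findall(r"[a-z]+(?:'[a-z]+)*", line.lower()))
    return counts


# Hypothetical sample input; any iterable of strings works,
# including an open file object.
sample = ["Call me Ishmael.", "Call me, call ME!"]
counts = count_words(sample)

# most_common() returns (word, count) pairs, highest count first
for word, n in counts.most_common():
    print(f"{word}: {n}")
```

Counter also replaces the manual "if word in d" check, since missing keys start at zero.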
the: 14322
of: 6560
and: 6255
a: 4607
to: 4521
in: 4066
that: 2734
his: 2485
it: 1765
i: 1723
as: 1705
with: 1703
he: 1681
but: 1667
is: 1579
was: 1576
for: 1529
all: 1342
at: 1297
this: 1232
by: 1151
from: 1086
not: 1065
be: 986
on: 951
so: 877
one: 781
you: 770
had: 763
have: 755
# etc... (Moby Dick has about 31,436 unique words)
# from https://www.gutenberg.org/files/2701/2701-0.txt, counted with Project Gutenberg text removed
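
For the kerning use case, the word counts can be taken one step further into letter-pair counts. The following is only a rough sketch under stated assumptions: the function name rank_pairs is hypothetical, the input is a plain dict of word counts (here a few values copied from the listing above), and each adjacent pair of characters in a word is weighted by how often that word occurs. Pairs that span a space between words are not counted.

```python
from collections import Counter


def rank_pairs(word_counts):
    """Weight each adjacent letter pair by the count of the word it is in."""
    pairs = Counter()
    for word, n in word_counts.items():
        # zip(word, word[1:]) yields each neighboring pair of characters
        for left, right in zip(word, word[1:]):
            pairs[left + right] += n
    return pairs


# A few counts copied from the listing above
word_counts = {"the": 14322, "that": 2734, "this": 1232, "with": 1703}
pairs = rank_pairs(word_counts)

# Print the pairs from most to least frequent
for pair, n in pairs.most_common():
    print(f"{pair}: {n}")
```

Run against the full counts file, the top of this ranking suggests which pairs are worth checking first in the content being proofed.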