@fernandrone
Last active February 1, 2021 15:12
Top N most-used words in a text
#!/usr/bin/env sh
#
# Simple script that prints out the top N most-used words in a text from standard input.
#
# Inspired by https://buttondown.email/hillelwayne/archive/donald-knuth-was-framed/. The short
# Unix pipeline first appeared in https://www.cs.tufts.edu/~nr/cs257/archive/don-knuth/pearls-2.pdf
topw() {
    # Split into one word per line, lowercase, count duplicates, then keep the N most frequent.
    tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed "${N}q"
}
help() {
    cat << EOF
Usage: $(basename "$0") [OPTION]...
Print the top N most-used words in a text read from standard input. A word is
a non-empty sequence of ASCII letters (A-Za-z); every other character acts as
a delimiter, and case is ignored.

  -n N  number of top words to list [default 5]
  -h    display this help and exit
EOF
}
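N=5

while getopts "hn:" option; do
    case $option in
        h)
            help
            exit 0;;
        n)
            # getopts stores the option's argument in OPTARG (works for "-n 20" and "-n20").
            N="$OPTARG";;
        \?)
            # getopts has already printed a diagnostic for the invalid option.
            help >&2
            exit 1;;
    esac
done

topw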

fernandrone commented May 30, 2020

To install, download the script to a directory in your PATH, e.g.:

wget -O ~/topw https://gist.githubusercontent.com/fernandrone/7809a5e919b142508b6a45838cde139b/raw/65bc8aee6cc9cd8f41324f732788a58089ab1a70/topw
chmod +x ~/topw
sudo mv ~/topw /usr/local/bin/topw

Then pipe any text into it:

$ man bash | topw
   4200 the
   1822 is
   1251 to
   1221 a
   1147 of
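
To see what each stage of the pipeline contributes, here is a minimal sketch on a tiny input. The sample sentence and the function name `top2` are made up for illustration, and N is hard-coded to 2 instead of being read from `-n`:

```shell
# Hypothetical helper: the same pipeline as topw, with N fixed at 2.
top2() {
    # 1. tr -cs A-Za-z '\n' : replace every run of non-letters with a single newline
    # 2. tr A-Z a-z         : fold upper case to lower case
    # 3. sort               : bring identical words next to each other
    # 4. uniq -c            : collapse adjacent duplicates, prefixing each with its count
    # 5. sort -rn           : order by count, largest first
    # 6. sed 2q             : print the first 2 lines, then quit
    tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed 2q
}

# Made-up sample text: "the" appears 3 times, "cat" twice, "hat" once.
printf '%s\n' 'The cat. The hat! THE cat?' | top2
```

This should print `3 the` followed by `2 cat`; the amount of leading padding before each count depends on the `uniq` implementation.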

The same with the complete works of Shakespeare (the file also includes some legal notes from Project Gutenberg, which are counted as well):

$ curl -s https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt | topw -n 20
  27660 the
  26784 and
  22538 i
  19819 to
  18191 of
  14746 a
  13860 you
  12489 my
  11549 that
  11123 in
   9784 is
   8960 d
   8740 not
   8341 for
   8016 with
   7777 me
   7737 it
   7723 s
   7130 be
   6885 your
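
The single-letter entries `d` and `s` in this list are most likely a side effect of how words are split rather than real words: an apostrophe is not a letter, so contractions and possessives are cut in two. A small sketch of the first two pipeline stages on a made-up line (the function name `split_words` is hypothetical):

```shell
# Hypothetical helper: only the word-splitting and lowercasing stages of topw.
split_words() {
    tr -cs A-Za-z '\n' | tr A-Z a-z
}

# Made-up sample line with one contraction and one possessive.
printf "He lov'd the king's crown\n" | split_words
```

This prints `he`, `lov`, `d`, `the`, `king`, `s` and `crown`, one per line, so every `'d` and `'s` in the text adds to the counts of `d` and `s`.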
