@digitalist
Created March 21, 2021 08:07
Quick and dirty Linux one-liner: count words from an HTML mirror
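# TODO: clean up UTF-8 characters with tr
# Requires html2text on PATH; ~/temp must already exist or the redirect fails.
# Output: one "count word" line per distinct word, most frequent first.
find . -type f -name '*.html' -exec html2text {} \; |
  tr -s '[:punct:][:space:]' '\n' |
  tr '[:upper:]' '[:lower:]' |
  sort | uniq -c | sort -bnr > ~/temp/words.txt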
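To try the counting stage without a mirror or html2text, here is a minimal sketch that wraps the same tr/sort/uniq steps in a function. The name `count_words` is hypothetical and not part of the gist. The `grep -v '^$'` step (dropping an empty line when the input starts with punctuation) and the alphabetical tie-break in the final sort are additions of this sketch.

```shell
# count_words (hypothetical helper): read plain text on stdin and print
# "count word" lines, most frequent first, ties ordered alphabetically.
count_words() {
    tr -s '[:punct:][:space:]' '\n' |   # one token per line, separators squeezed
        tr '[:upper:]' '[:lower:]' |    # fold case so "The" and "the" merge
        grep -v '^$' |                  # drop an empty line left by leading separators
        sort | uniq -c |                # count identical adjacent lines
        sort -k1,1nr -k2                # by count descending, then by word
}

# Sample input: "the" appears twice, "cat" and "dog" once each.
printf 'The cat. The dog!\n' | count_words
```

For the sample input, the first line of output should be the count 2 for `the`, followed by `cat` and `dog` with a count of 1 each. The width of the padding before each count depends on the `uniq` implementation.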