Bash Shell Regex Most Common Words from Web Page, Sorted to File
#Count the most common words on a web page and write them to a file, using the wget, sed, tr, awk, sort and uniq command-line utilities.
#Running both the sed and the awk method and comparing the results gives the greatest accuracy. The example is for YouTube video comments.
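#sed stream editor method:
#strip HTML tags, split into one word per line, convert to lowercase, count, and keep the top 200 (or $1) words
#wget needs the page URL as an argument, placed before the trailing backslash
wget -O - \
| sed -e 's/<[^>]*>//g' | tr -cs A-Za-z\' '\n' \
| tr A-Z a-z | sort | uniq -c | sort -k1,1nr -k2 | sed "${1:-200}q" > conan.txt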
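To see what each stage of the sed method contributes without fetching a page, the same commands can be run on a small inline HTML string. This is a minimal sketch: the function name `top_words` and the sample text are made up for illustration, while the tag-stripping and counting commands are the ones from the pipeline above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads HTML on stdin and prints the N most
# frequent words (default 200), one "count word" pair per line.
top_words() {
  local limit="${1:-200}"
  sed -e 's/<[^>]*>//g' \
    | tr -cs "A-Za-z'" '\n' \
    | tr 'A-Z' 'a-z' \
    | sort | uniq -c | sort -k1,1nr -k2 \
    | sed "${limit}q"
}

# Made-up sample input standing in for a downloaded page.
sample='<p>The cat and the dog.</p><p>The cat sat.</p>'
printf '%s\n' "$sample" | top_words 3
```

On this sample the most frequent word is `the` with a count of 3, followed by `cat` with 2. Because tags are deleted rather than replaced with a space, words separated only by a tag (for example `cat</b>dog`) are merged into one word.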
#awk method:
#wget needs the page URL as an argument, placed before the trailing backslash
wget -O - \
| awk '{ gsub(/<[^>]*>/,"") # remove HTML tags (anything between < and >)
$0=tolower($0) # convert all to lowercase
gsub(/[^a-z]+/," ") # replace each run of non-letter characters with a single space
for (i=1;i<=NF;i++) a[$i]++ # count each word in array a
}END{for (i in a) print a[i],i|"sort -nr|head -200"}' > miley-awk25.txt
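The two methods do not always agree, which is why comparing them is useful: the sed method keeps apostrophes inside words, while the awk method treats every non-letter as a separator. The sketch below runs the awk logic on a made-up sentence to show this. The function name `top_words_awk`, the `limit` variable and the sample text are assumptions for illustration, and the final sort adds a secondary key so that ties come out in a stable order.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads HTML on stdin and prints the N most
# frequent words (default 200) using awk for cleaning and counting.
top_words_awk() {
  awk -v limit="${1:-200}" '
    {
      gsub(/<[^>]*>/, "")      # strip HTML tags
      $0 = tolower($0)         # lowercase; assigning $0 re-splits the fields
      gsub(/[^a-z]+/, " ")     # each run of non-letters becomes one space
      for (i = 1; i <= NF; i++) count[$i]++
    }
    END {
      cmd = "sort -k1,1nr -k2 | head -n " limit
      for (word in count) print count[word], word | cmd
      close(cmd)
    }'
}

# Made-up sample input containing an apostrophe.
printf '%s\n' "It's the cat, the CAT and the dog." | top_words_awk 3
```

Here `It's` is split into `it` and `s`, each counted once, whereas the sed method would count `it's` as a single word. The top entry is again `the` with a count of 3.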