@atombrella
Created January 18, 2021 08:15
Scrape website with wget
# useful for extracting a website to text
# --random-wait only varies the delay set by --wait, so give it a base delay
wget --mirror --wait=1 --random-wait -R gif,png,jpg,webp,svg,css,js <website>/
# convert every downloaded file to text; quoting keeps paths with spaces intact
mkdir -p ../txt
find . -type f -exec sh -c 'html2text "$1" > "../txt/$(basename "$1").txt"' _ {} \;
# count words of 4 to 15 letters, least frequent first
cat ../txt/* | tr '[:punct:]' ' ' | tr 'A-Z' 'a-z' | tr ' ' '\n' | grep -E '^[a-z]{4,15}$' | sort | uniq -c | sort -n
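The counting step can also be wrapped in a function so it can be tried on a small input before running it over a whole mirror. This is only a sketch: `word_freq` is a hypothetical name that is not part of the gist, and it assumes plain ASCII text on stdin.

```shell
# word_freq: hypothetical helper, not part of the original gist.
# Reads text on stdin and prints "count word" lines for words of
# 4 to 15 letters, least frequent first.
word_freq() {
  tr '[:punct:]' ' ' | tr 'A-Z' 'a-z' | tr ' ' '\n' |
    grep -E '^[a-z]{4,15}$' | sort | uniq -c | sort -n
}

# Example: printf 'Wget mirrors sites. Mirrors are handy, wget too.\n' | word_freq
```

With the function defined, the last step becomes `cat ../txt/* | word_freq`.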