@benlancaster
Last active August 29, 2015 14:12
Scrape and sanitise allaboutcircuits.com
#!/usr/bin/env bash
wget \
--reject=ogv,mp4,pdf \
--exclude-domains forum.allaboutcircuits.com \
--domains=sub.allaboutcircuits.com,allaboutcircuits.com \
--recursive \
--span-hosts \
--level=0 \
--convert-links \
--page-requisites \
--execute=robots=off \
--no-verbose \
--no-use-server-timestamps \
--exclude-directories=videos,worksheets \
--no-remove-listing \
http://www.allaboutcircuits.com/
find . -type f -name '*.html' -print0 | while IFS= read -r -d '' file;
do
# Replace rogue pseudo-tags with HTML entities or real tags.
# In a sed replacement a bare & means "the matched text", so it is escaped as \&.
gsed -i 's/<The/The/g' "$file"
gsed -i 's/<DELTA>/\&#916;/g' "$file"
gsed -i 's/<SIGMA>/\&#931;/g' "$file"
gsed -i 's/<PI>/\&#928;/g' "$file"
gsed -i 's/<sp>//g' "$file"
gsed -i 's/<\/sp>//g' "$file"
gsed -i 's/<plusminus)>/\&#177;/g' "$file"
gsed -i 's/<superscript>/<sup>/g' "$file"
gsed -i 's/<\/superscript>/<\/sup>/g' "$file"
gsed -i 's/<italic>/<i>/g' "$file"
gsed -i 's/<\/italic>/<\/i>/g' "$file"
gsed -i 's/<Onega>/\&#937;/g' "$file"
gsed -i 's/<phi-2>/\&#966;/g' "$file"
gsed -i 's/<hypertarget>diodeparameter<\/hypertarget>//g' "$file"
gsed -i 's/<pageref>03442.png<\/pageref>//g' "$file"
gsed -i 's/<points>/\&lt;points\&gt;/g' "$file"
# Keep only the main article element.
output=$(xidel "$file" --e "//article[@class='articlemain']" --input-format=html --output-format=xml)
printf '%s\n' "$output" > "$file"
tidy -q \
--show-warnings false \
--drop-proprietary-attributes true \
--numeric-entities true \
--add-xml-decl false \
--hide-comments true \
--doctype omit \
-asxml \
-modify \
-indent \
"$file"
# Strip ads, scripts and breadcrumb navigation.
output=$(xml ed -N x=http://www.w3.org/1999/xhtml -d "//x:div[@align='google-ads']|//x:script|//x:div[@id='google-ads']|//x:ul[contains(@class,'breadcrumb')]" "$file")
printf '%s\n' "$output" > "$file"
done
for i in {1..6};
do
/Applications/calibre.app/Contents/MacOS/ebook-convert "www.allaboutcircuits.com/vol_${i}/index.html" "vol${i}.azw3"
done;
@benlancaster

Tested on Mac OS X 10.9.

The gsed lines are needed to clean up some rogue markup in the source files.
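
To illustrate what those substitutions do, here is a minimal sketch that runs three of them over a sample string. It assumes a sed on the PATH (GNU or BSD both handle these expressions); the function name fix_markup and the sample input are made up for the example. Note that a bare & in a sed replacement stands for the matched text, which is why the entity replacements are written with \&.

```shell
#!/usr/bin/env bash
# Hypothetical helper: applies three of the script's substitutions to stdin.
fix_markup() {
  sed -e 's/<DELTA>/\&#916;/g' \
      -e 's/<italic>/<i>/g' \
      -e 's/<\/italic>/<\/i>/g'
}

# Sample input invented for illustration.
printf '%s\n' '<DELTA>V across <italic>R</italic>' | fix_markup
# prints: &#916;V across <i>R</i>

# Without the backslash, & re-inserts the matched tag instead of an ampersand.
printf '%s\n' '<PI>' | sed 's/<PI>/&#928;/g'
# prints: <PI>#928;
```

Running the pipeline by hand like this is a quick way to check a new substitution before letting it edit files in place with -i.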

Notes:

  • gsed is GNU sed installed via Homebrew; on systems where sed is already GNU sed (most Linux distributions), use sed instead
  • tidy is the W3C HTML5 fork of the Tidy library (installed with brew install --HEAD tidy)
  • xml is XmlStarlet (brew install xmlstarlet)
  • xidel and wget are also from Homebrew (brew install xidel and brew install wget, respectively)
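
Given the tools listed above, a pre-flight check along these lines could be run before starting the scrape. This is only a sketch and not part of the original script: the function names missing_tools and pick_sed and the SED variable are hypothetical, and the fallback to plain sed assumes it is GNU sed (BSD sed treats -i differently).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not part of the original script.

# Print the name of each given tool that is not found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Prefer gsed (Homebrew on OS X); otherwise assume sed is GNU sed.
pick_sed() {
  if command -v gsed >/dev/null 2>&1; then
    printf 'gsed\n'
  else
    printf 'sed\n'
  fi
}

SED=$(pick_sed)
missing=$(missing_tools "$SED" wget xidel tidy xml)
if [ -n "$missing" ]; then
  printf 'Missing tools:\n%s\n' "$missing" >&2
fi
```

With something like this in place, the gsed calls in the script could be written as "$SED" so the same file works on both OS X and Linux.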
