@pedrokoblitz, created June 11, 2014 20:03
crawler2mysql (bash)
#!/bin/bash
# a basic crawler in bash
# https://github.com/jashmenn/bashpider
# usage: crawl.sh urlfile.txt <numprocs>
URLS_FILE=$1
BANDWIDTH=2300   # total bandwidth budget in KB/s (only used by the optional rate limit below)
CRAWLERS=$2
mkdir -p data/pages
# To rate-limit each individual crawler, uncomment RATE_LIMIT below and add the
# --limit-rate line to WGET_CMD. I would suggest you oversubscribe instead,
# otherwise some crawlers will be starved while waiting for slow neighbors.
# (e.g. BANDWIDTH=2300 split over 4 crawlers gives each one about 575 KB/s)
#
# RATE_LIMIT=$(($BANDWIDTH/$CRAWLERS))
# --limit-rate=${RATE_LIMIT}k \
# wget options shared by every crawler: retry a few times, use short timeouts,
# crawl recursively to depth 5 with polite random waits, and skip binary/static
# assets via a single comma-separated reject list
WGET_CMD="wget \
--tries=5 \
--dns-timeout=30 \
--connect-timeout=5 \
--read-timeout=5 \
--timestamping \
--directory-prefix=data/pages \
--wait=2 \
--random-wait \
--recursive \
--level=5 \
--no-parent \
--no-verbose \
--reject=jpg,gif,png,css,pdf,bz2,gz,zip,mov,fla,xml \
--no-check-certificate"
# feed one URL per line to xargs; $WGET_CMD is intentionally left unquoted so
# the shell splits it back into separate arguments for each wget invocation
xargs -P "$CRAWLERS" -I _URL_ $WGET_CMD _URL_ < "$URLS_FILE"
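# example run (hypothetical file name): crawl every URL listed in urls.txt with
# 8 parallel wget processes, saving the fetched pages under data/pages/
#
#   ./crawl.sh urls.txt 8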
#!/bin/bash
# escaping single quotes in an awk one-liner: prints the first whitespace-delimited field wrapped in single quotes
awk 'BEGIN {FS=" ";} {printf "'\''%s'\'' ", $1}'
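# quick check of the line above (hypothetical input, GNU awk assumed):
#   echo "foo bar" | awk 'BEGIN {FS=" ";} {printf "'\''%s'\'' ", $1}'
# prints 'foo' followed by a space (no trailing newline)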
# escaping double quotes in awk: prefix each double quote with a backslash (the trailing 1 prints every modified line)
echo '"Landkauf" Bund' | awk '{gsub("\"", "\\\"")}1'
bash crawler: http://eigenjoy.com/2010/09/06/a-crawler-using-wget-and-xargs/
defensive bash programming: http://www.kfirlavi.com/blog/2012/11/14/defensive-bash-programming/
writing robust shell scripts: http://www.davidpashley.com/articles/writing-robust-shell-scripts/
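A minimal sketch, in the spirit of the two articles above, of how crawl.sh could start more defensively; the options are standard bash, and the variable names and usage string are the ones from the crawler script above:

#!/bin/bash
set -u           # treat unset variables as errors (catches a missing <numprocs>)
set -e           # exit as soon as any command fails
set -o pipefail  # a failing stage anywhere in a pipeline fails the pipeline
readonly URLS_FILE=${1:?usage: crawl.sh urlfile.txt <numprocs>}
readonly CRAWLERS=${2:?usage: crawl.sh urlfile.txt <numprocs>}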