Created
June 11, 2014 20:03
bash crawler2mysql
#!/bin/bash
# a basic crawler in bash
# https://github.com/jashmenn/bashpider
# usage: crawl.sh urlfile.txt <numprocs>
URLS_FILE=$1
BANDWIDTH=2300   # total kB/s budget across all crawlers
CRAWLERS=$2
mkdir -p data/pages
# add this in below if you want to limit the rate of an individual crawler,
# though I would suggest you oversubscribe, otherwise some crawlers will be
# starved while waiting for slow neighbors.
#
# RATE_LIMIT=$(($BANDWIDTH/$CRAWLERS))
# --limit-rate=${RATE_LIMIT}k \
WGET_CMD="wget \
  --tries=5 \
  --dns-timeout=30 \
  --connect-timeout=5 \
  --read-timeout=5 \
  --timestamping \
  --directory-prefix=data/pages \
  --wait=2 \
  --random-wait \
  --recursive \
  --level=5 \
  --no-parent \
  --no-verbose \
  --reject=jpg,gif,png,css,pdf,bz2,gz,zip,mov,fla,xml \
  --no-check-certificate"
# $WGET_CMD is deliberately unquoted so it word-splits into arguments;
# --reject takes a comma-separated suffix list, which avoids the shell
# glob-expanding bare patterns like *.jpg against the current directory.
xargs -P "$CRAWLERS" -I _URL_ $WGET_CMD _URL_ < "$URLS_FILE"
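The commented-out limit splits BANDWIDTH evenly across the crawler processes; a minimal sketch of that arithmetic (CRAWLERS=4 here is an illustrative value, not something the script fixes):

```shell
# Per-crawler cap if the commented-out --limit-rate line is enabled.
BANDWIDTH=2300            # total kB/s, as set in the script above
CRAWLERS=4                # illustrative: 4 parallel wget processes
RATE_LIMIT=$((BANDWIDTH / CRAWLERS))
echo "--limit-rate=${RATE_LIMIT}k"   # prints --limit-rate=575k
```

Integer division rounds down, so oversubscribing slightly under-uses the budget rather than exceeding it.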
#!/bin/bash
# escaping single quotes in awk: wrap each line's first field in '...'
# (the '\'' sequence ends the quoted string, emits a literal ', and reopens it)
awk 'BEGIN {FS=" ";} {printf "'\''%s'\'' ", $1}'
# escaping double quotes in awk: prefix each " with a backslash
# (the trailing 1 is an always-true pattern that prints each modified line)
echo '"Landkauf" Bund' | awk '{gsub("\"", "\\\"")}1'
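A quick demonstration of both one-liners; the input strings are illustrative, and the second output reflects gawk's handling of a backslash in the gsub replacement text:

```shell
# Single-quote the first field of each input line.
printf 'foo bar\nbaz qux\n' | awk 'BEGIN {FS=" ";} {printf "'\''%s'\'' ", $1}'
# prints: 'foo' 'baz'

# Backslash-escape every double quote in the input.
echo '"Landkauf" Bund' | awk '{gsub("\"", "\\\"")}1'
# prints (gawk): \"Landkauf\" Bund
```

The escaped output is handy when the result will be embedded inside another double-quoted context, e.g. a generated SQL or JSON string.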
bash crawler
http://eigenjoy.com/2010/09/06/a-crawler-using-wget-and-xargs/
defensive bash programming
http://www.kfirlavi.com/blog/2012/11/14/defensive-bash-programming/
writing robust shell scripts
http://www.davidpashley.com/articles/writing-robust-shell-scripts/