@mckelvin
Created January 3, 2013 07:18
#!/bin/bash
# Start time (nanoseconds since the epoch)
START=$(date +%s%N)
# Output directory
RES_DIR=search_one_intro
mkdir -p "$RES_DIR"
# Make one request up front to initialize the session cookie
COOKIE_FILE=./current.cookie
curl -c "$COOKIE_FILE" -s "http://**/UTADB/teacher/search_one_intro.jsp?teacher_id=00398"
# teacher.list is the list of teacher IDs, one ID per line
for eachline in $(cat teacher.list)
do
{
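# Skip this ID if it has already been crawled (file exists and is non-empty)
if [ -s "$RES_DIR/$eachline.html" ]; then
    continue
fi
# Fetch the teacher's intro page, then the photo served for the same session
wget -O "$RES_DIR/$eachline.html" --load-cookies="$COOKIE_FILE" "http://***|/UTADB/teacher/search_one_intro.jsp?teacher_id=$eachline"
wget -O "$RES_DIR/$eachline.jpg" --load-cookies="$COOKIE_FILE" "http://***|/UTADB/teacher/search_intro_pic_show.jsp"
# If the crawl failed (page is missing or empty), wait before moving on
if [ ! -s "$RES_DIR/$eachline.html" ]; then
    sleep 1
fi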
}
done
# End time
END=$(date +%s%N)
# Convert the elapsed nanoseconds to milliseconds
echo "it takes $(( (END - START) / 1000000 )) ms."
# Regex-match every email address in the crawled pages and save the matches
grep -E -o --color "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" */*.html > emails_gen.txt
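# Optional follow-up, a minimal sketch that is not part of the original script:
# because grep is given several files, each line of emails_gen.txt has the form
# "path/to/file.html:address". Assuming the file paths contain no colon, the
# helper below strips that prefix and removes duplicate addresses.
# The names dedupe_emails and emails_unique.txt are hypothetical.
dedupe_emails() {
    # Keep everything after the first colon, then sort and drop duplicates
    cut -d: -f2- "$1" | sort -u
}
if [ -s emails_gen.txt ]; then
    dedupe_emails emails_gen.txt > emails_unique.txt
fi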