
wget \
--recursive \
--no-clobber \
--page-requisites \
--html-extension \
--convert-links \
--restrict-file-names=windows \
--domains \
--no-parent \

Extract links from a BBC responsive site

HTTP_USER_AGENT="Mozilla/5.0 (iPhone; Mobile; AppleWebKit; Safari)"

wget --spider --no-directories --no-parent --force-html --recursive \
 --level=$MAX_DEPTH --no-clobber \
To spider a site as a logged-in user:
1. Post the form data required to log in (--post-data): _every_ input in the form that has a name, even if it has no value.
2. Save the cookies that get generated (--save-cookies), including session cookies (--keep-session-cookies), which are not saved when --save-cookies alone is specified.
3. Load the cookies, continue saving the session cookies, and recursively (-r) spider (--spider) the site, ignoring (-R) /logout.
# log in and save the cookies
wget --post-data='username=my_username&password=my_password&next=' --save-cookies=cookies.txt --keep-session-cookies
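The final step of the logged-in spider (load the cookies, keep saving session cookies, spider recursively, skip /logout) could look roughly like the sketch below. It is not the gist author's script: the function name `spider_as_user`, the `WGET` override, and the default file name `cookies.txt` are assumptions made for illustration. The flags are the ones named in the steps, plus --load-cookies to read the saved file back in.

```shell
# Hypothetical helper: spider a site using the cookies saved at login.
# Set WGET=echo to preview the command line without touching the network.
spider_as_user() {
  local start_url="$1"
  local cookies="${2:-cookies.txt}"   # assumed cookie file name
  "${WGET:-wget}" \
    --load-cookies="$cookies" \
    --save-cookies="$cookies" \
    --keep-session-cookies \
    --recursive \
    --spider \
    --reject=logout \
    "$start_url"
}
```

Calling `WGET=echo spider_as_user http://example.com/` prints the assembled arguments instead of running wget. Depending on the wget version, --reject may still fetch HTML pages in order to scan them for links, so it is worth checking your wget manual for --reject-regex if the logout URL must never be requested.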
wget --spider -o wget.log -e robots=off --wait 3 -r -p -S http://
grep -ri 'http://' wget.log | grep -E -v '(files/|\.jpg|\.jpeg|\.gif|\.css|\.js|\.pdf|\.png|\.xls)' | awk '{print $3}'|sort|uniq|sort > site_map.txt
cat $1 |grep -i -E -v '(\.jpg|\.jpeg|\.gif|\.css|\.js|\.pdf|\.png|\.xls|\.ico|\.txt|\.doc|yandexbot|googlebot|YandexDirect|\/upload\/|" 404 |" 301 |" 302 )'|perl -MURI::Escape -lne 'print uri_unescape($_)'|grep yandsearch|awk '{print $1}'|sort|uniq|wc -l
wget -r --spider --delete-after --force-html -D "$DOMAINS" -l "$DEPTH" "$HOME" 2>&1 \
| grep '^--' | awk '{ print $3 }' | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' | sort | uniq > "$OUTPUT"
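The filtering half of that pipeline can be tried without crawling anything. The sketch below wraps it in a function, here called `extract_urls` (a hypothetical name), that reads a wget log on standard input. It assumes wget's usual request lines, which start with `--`, carry a timestamp, and have the URL as the third whitespace-separated field.

```shell
# Read a wget log on stdin and print the unique page URLs it requested.
# Request lines look like: --2017-09-14 12:00:00--  http://example.com/
extract_urls() {
  grep '^--' \
    | awk '{ print $3 }' \
    | grep -Ev '\.(css|js|png|gif|jpg)$' \
    | sort -u
}
```

A crawl would then end in `2>&1 | extract_urls > "$OUTPUT"`. The extended-regex form (`grep -E`) is used here only to avoid the backslash-heavy basic syntax; the set of excluded extensions is the same.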
# simple function to check the HTTP response code before downloading a remote file
# example usage:
# if validate_url "$url" >/dev/null; then dosomething; else echo "does not exist"; fi
function validate_url(){
  # prints "true" and returns 0 only if a 200 status line appears; otherwise returns non-zero
  wget -S --spider "$1" 2>&1 | grep -q 'HTTP/[0-9.]* 200' && echo "true"
}
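When a URL redirects, `wget -S` prints one status line per hop, so it can help to look only at the last one. The following is a minimal sketch under that assumption; `final_status` and `url_is_ok` are hypothetical names, not part of the gist above.

```shell
# Read `wget -S --spider URL 2>&1` output on stdin and print the status
# code of the last response, i.e. the one reached after any redirects.
final_status() {
  awk '$1 ~ /^HTTP\// { code = $2 } END { if (code != "") print code }'
}

# Hypothetical wrapper: succeed only when the final status is 200.
url_is_ok() {
  [ "$(wget -S --spider "$1" 2>&1 | final_status)" = "200" ]
}
```

Usage would be along the lines of `if url_is_ok "$url"; then dosomething; else echo "does not exist"; fi`. Keeping the parsing in its own function means it can be checked against saved wget output without any network access.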
from scrapy import log
from scrapy.item import Item
from scrapy.http import Request
from scrapy.contrib.spiders import XMLFeedSpider
def NextURL():
Generate a list of URLs to crawl. You can query a database or come up with some other means.
Note that if you generate URLs to crawl from a scraped URL then you're better off using a

Best UNIX Shell tools

This is a list of shell command usages I can't live without on UNIX-based systems.


Mac OS X

Using Homebrew (yes, I am opinionated), you can install these tools from the following packages:

#!/usr/bin/env python
import os
import sys
from mega import Mega
mega = Mega({'verbose': True})
m = mega.login('megauseremail', 'megapass')
jadedgnome / install-megatools.txt
Last active Sep 14, 2017
Install megatools on Debian 7 64bit & required dependencies
sudo apt-get update && sudo apt-get install libglib2.0-dev libtool autoconf glib-networking fuse curl wget gettext gobject-introspection libcurl4-openssl-dev -y
sudo apt-get install lib32gmp-dev lib32gmp10 lib32gmpxx4 libgmp-dev libgmp10 libgmp3-dev -y
wget && tar xvf nettle-3.0.tar.gz && cd nettle-3.0/ && ./configure && make && sudo make install && cd ../
wget && tar xvf megatools-1.9.93.tar.gz && cd megatools-1.9.93/ && ./configure && make && sudo make install