
wget-spider.md

Extract links from a BBC responsive site

DOMAIN="m.bbc.co.uk"
SERVICE="hindi"
HTTP_USER_AGENT="Mozilla/5.0 (iPhone; Mobile; AppleWebKit; Safari)"
EXCLUDE_EXTENSIONS="\.\(txt\|css\|js\|png\|gif\|jpg\)$"
MAX_DEPTH="3"

# the gist is truncated here; a plausible completion that feeds the spider
# log through grep, filtering out static assets via $EXCLUDE_EXTENSIONS
wget --spider --no-directories --no-parent --force-html --recursive \
  --level=$MAX_DEPTH --no-clobber \
  --user-agent="$HTTP_USER_AGENT" \
  "http://$DOMAIN/$SERVICE" 2>&1 \
  | grep '^--' | awk '{ print $3 }' \
  | grep -v "$EXCLUDE_EXTENSIONS" | sort -u
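A quick way to exercise the script and eyeball the results (the script and output file names here are placeholders, not from the original gist):

./wget-spider.sh > bbc-links.txt
wc -l bbc-links.txt   # number of links collected
head bbc-links.txt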
download_site.sh
# mirror domain.com for offline browsing
wget \
--recursive \
--no-clobber \
--page-requisites \
--html-extension \
--convert-links \
--restrict-file-names=windows \
--domains domain.com \
--no-parent \
domain.com
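For larger sites it is worth pacing the crawl. A variant of the same command, sketched with only standard GNU wget flags: --mirror bundles recursion with timestamping, --adjust-extension is the current spelling of --html-extension, and --wait pauses between requests:

wget \
--mirror \
--page-requisites \
--adjust-extension \
--convert-links \
--wait=1 \
--no-parent \
--domains domain.com \
domain.com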
gist:6a72f50ff24aac30d1fe
from scrapy import log
from scrapy.item import Item
from scrapy.http import Request
from scrapy.contrib.spiders import XMLFeedSpider


def NextURL():
    """
    Generate a list of URLs to crawl. You can query a database or come up with some other means.
    Note that if you generate URLs to crawl from a scraped URL then you're better off using a
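To try a one-off spider like this without generating a full Scrapy project, the stock scrapy CLI can run a single file (myspider.py and the output name are placeholders):

scrapy runspider myspider.py -o scraped_urls.json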
validate.sh
#!/bin/bash
# simple function to check the HTTP response code before downloading a remote file
# example usage:
# if [ "$(validate_url "$url")" = "true" ]; then dosomething; else echo "does not exist"; fi
function validate_url() {
  # match any HTTP version, not just 1.1, and quote the URL argument
  if wget -S --spider "$1" 2>&1 | grep -q 'HTTP/.* 200'; then echo "true"; fi
}
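For example, gating a download on the check (the URL is a placeholder):

url="http://example.com/file.tar.gz"
if [ "$(validate_url "$url")" = "true" ]; then
  wget "$url"
else
  echo "does not exist"
fi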
spider.sh
#!/bin/bash
# use a distinct variable name so the script does not clobber $HOME
START_URL="http://www.yourdomain.com/some/page"
DOMAINS="yourdomain.com"
DEPTH=2
OUTPUT="./urls.csv"
wget -r --spider --delete-after --force-html -D "$DOMAINS" -l $DEPTH "$START_URL" 2>&1 \
  | grep '^--' | awk '{ print $3 }' | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' | sort | uniq > "$OUTPUT"
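Once urls.csv exists, standard text tools can summarize the crawl; for instance, counting pages per top-level path segment (a quick sketch, not part of the original gist):

cut -d/ -f4 urls.csv | sort | uniq -c | sort -rn | head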
README.md

Best UNIX Shell tools

This is a list of shell-command usages I can't live without on UNIX-based systems.

Install

Mac OS X

Using Homebrew (yes, I am opinionated) you can install these tools with the following packages:
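For example, a typical install line looks like this (the gist's actual package list is cut off here; these package names are illustrative):

brew install wget coreutils jq tree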

mega-upload.py
#!/usr/bin/env python
#
# requires: https://github.com/richardasaurus/mega.py
#
import os
import sys

from mega import Mega

mega = Mega({'verbose': True})
m = mega.login('megauseremail', 'megapass')
# the gist is truncated here; completion assumed from the mega.py README
file = m.upload('myfile.doc')
print(m.get_upload_link(file))
rtmp_meta.rb
require 'forwardable'

module RtmpMeta
  class Parser
    PATTERN = /duration\s+(?<duration>\d+\.?\d+)$/

    attr_reader :raw_data

    def initialize raw_data
      @raw_data = raw_data
    end
rename.sh
#!/bin/bash
# Rename the output html file from redditPostArchiver with the reddit thread title.
# https://github.com/sJohnsonStoever/redditPostArchiver

for f in *.html; do
  title=$( awk 'BEGIN{IGNORECASE=1;FS="<title>|</title>";RS=EOF} {print $2}' "$f" )
  mv -i "$f" "${title//[^a-zA-Z0-9\._\- ]}_$f"
done
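A dry run of the same loop, echoing the planned renames instead of performing them, can help verify the extracted titles first:

for f in *.html; do
  title=$( awk 'BEGIN{IGNORECASE=1;FS="<title>|</title>";RS=EOF} {print $2}' "$f" )
  echo mv -i "$f" "${title//[^a-zA-Z0-9\._\- ]}_$f"
done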
copyPBar.sh
# copy with progress bar
alias cp='rsync -p --progress'
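Usage is unchanged from plain cp (the file names below are placeholders); prefix with a backslash to bypass the alias when the real cp is needed:

cp big-movie.mkv /Volumes/backup/   # shows per-file progress via rsync
\cp -r src/ dest/                   # bypasses the alias, runs the real cp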