Solonarv/README

## README
qntm.org/ra scraper
===================

Simple Python script to scrape qntm.org/ra, downloading every chapter. Requires curl or equivalent.

The `filtertoc` file does most of the heavy lifting. Usage:

    $ curl http://qntm.org/ra | ./filtertoc | sh

This will download every chapter into a separate file, named '$chapter_number $chapter_slug.html' (the slug is the last part of the qntm.org URL, and the chapter number is zero-padded to two digits).
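
As a rough illustration of the naming scheme, here is a small sketch that builds the same kind of command line `filtertoc` prints for each chapter. The helper name `curl_command` and the sample slugs are made up for this example; they are not real chapter names.

```python
def curl_command(slug, number):
    # Same format string as filtertoc: URL first, then a zero-padded
    # chapter number and the slug as the output filename.
    return 'curl "http://qntm.org/%s" > "%02d %s.html"' % (slug, number, slug)

# Hypothetical slugs, for illustration only.
sample_slugs = ["first-slug", "second-slug"]

# Chapters are numbered from 1 in the order they appear.
commands = [curl_command(slug, n) for n, slug in enumerate(sample_slugs, 1)]

for command in commands:
    print(command)
```

Each printed line is a complete shell command, which is why the output can be piped straight into `sh`.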

You can put the command into a separate file, called e.g. `scrape`, and then download everything by just running `./scrape`. This isn't smart and *will* redownload files you already have. It doesn't generate any more traffic than downloading manually, so don't worry about that.
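
If the redownloading bothers you, one possible workaround (not part of this repo) is a tiny filter that drops commands whose output file already exists. The sketch below assumes the command lines look exactly like the ones `filtertoc` prints; the names `missing_only` and `TARGET` are hypothetical.

```python
import os
import re

# Matches the output filename at the end of a line such as:
#   curl "http://qntm.org/slug" > "01 slug.html"
TARGET = re.compile(r'> "([^"]+)"$')

def missing_only(commands, exists=os.path.exists):
    """Yield only the commands whose output file is not already present.

    Lines that do not look like a filtertoc command are passed through
    unchanged. `exists` is a parameter so the check can be swapped out.
    """
    for command in commands:
        match = TARGET.search(command)
        if match is None or not exists(match.group(1)):
            yield command

# Demo with a fake existence check instead of touching the filesystem.
demo = [
    'curl "http://qntm.org/first-slug" > "01 first-slug.html"',
    'curl "http://qntm.org/second-slug" > "02 second-slug.html"',
]
already_have = {"01 first-slug.html"}
for command in missing_only(demo, exists=lambda name: name in already_have):
    print(command)
```

Wrapped in a script that reads lines from standard input, this could sit between `./filtertoc` and `sh` in the pipeline. Note that it only checks whether a file exists, so a partially downloaded chapter would be skipped too.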

License is DWTHYW (Do What The Hell You Want License). This took me 20 minutes, so I don't care what people do with it.

## filtertoc
#!/usr/bin/python

# Script to convert http://qntm.org/ra into a series of curl commands to download the individual chapters.
# Uses iterators in order to avoid holding the entire web page in memory

from itertools import takewhile, dropwhile, imap, count, islice, ifilter
from re import compile

extractslug=compile(r"href='/([A-Za-z0-9_\-]+)'")

def stdin():
  try:
    while True:
      yield raw_input()
  except EOFError:
    pass  # raw_input() raises EOFError at end of input

data = stdin()

# Snip out everything except actual ToC
data=islice(dropwhile(lambda s: "Today in Ra..." not in s, data), 2, None)
data=takewhile(lambda s: "</ul>" not in s, data)

data=ifilter(lambda s: "href" in s, data)
data=imap(lambda s: extractslug.search(s).group(1), data)
data=imap(lambda s, n: 'curl "http://qntm.org/%s" > "%02d %s.html"' % (s, n, s), data, count(1))

for line in data:
  print line