Matthew Phillips (phillipsm)
...
/**
 * A container object to house our incoming HTTP request
 *
 * @author Matt Phillips <mphillips@law.harvard.edu>
 * @license http://www.gnu.org/licenses/lgpl.html GNU Lesser Public License
 */
class http_request {
phillipsm / gist:8601065
Created January 24, 2014 16:43
wget command
# Construct wget command
command = 'wget '
command = command + '--quiet ' # turn off wget's output
command = command + '--tries=' + str(settings.NUMBER_RETRIES) + ' ' # number of retries (assuming no 404 or the like)
command = command + '--wait=' + str(settings.WAIT_BETWEEN_TRIES) + ' ' # number of seconds between requests (lighten the load on a page that has a lot of assets)
command = command + '--quota=' + settings.ARCHIVE_QUOTA + ' ' # only store this amount
command = command + '--random-wait ' # random wait between .5 seconds and --wait=
command = command + '--limit-rate=' + settings.ARCHIVE_LIMIT_RATE + ' ' # we'll be performing multiple archives at once. let's not download too much in one stream
command = command + '--adjust-extension ' # if a page is served up at .asp, adjust to .html. (this is the new --html-extension flag)
command = command + '--span-hosts ' # sometimes things like images are hosted at a CDN. let's span-hosts to get those
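A minimal sketch of how the assembled command might be finished and run; the target URL and the subprocess call below are assumptions, not part of the original gist.

import shlex
import subprocess

target_url = 'http://example.com/'  # hypothetical page to archive
command = command + target_url      # assumed final step: append the URL
subprocess.check_call(shlex.split(command))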
function check_status() {
  // Check our status service to see if we have archiving jobs pending
  var request = $.ajax({
    url: status_url + newLinky.linky_id,
    type: "GET",
    dataType: "json",
    cache: false
  });
}
phillipsm / gist:0ed98b2585f0ada5a769
Last active November 25, 2022 14:02
Example of parsing a table using BeautifulSoup and requests in Python
import requests
from bs4 import BeautifulSoup
# We've now imported the two packages that will do the heavy lifting
# for us, requests and BeautifulSoup
# Let's put the URL of the page we want to scrape in a variable
# so that our code down below can be a little cleaner
url_to_scrape = 'http://apps2.polkcountyiowa.gov/inmatesontheweb/'
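The preview cuts off here. A plausible continuation, sketched on the assumption that the inmate list sits in a table with class inmatesList, as the "Build list of inmates" gist below suggests:

r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, 'html.parser')
# Walk each row of the inmate table; the selector is borrowed from the
# later gist and is an assumption here
for table_row in soup.select(".inmatesList tr"):
    table_cells = table_row.findAll('td')
    if table_cells:
        print(table_cells[0].text.strip())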
phillipsm / gist:c832c825c994735b31fe
Last active August 29, 2015 14:21
All material for dgmde15


Still dumping material in here.

phillipsm / gist:404780e419c49a5b62a8
Last active April 22, 2024 11:55
Inmate scraping script
import requests
from bs4 import BeautifulSoup
import time
# We've now imported the two packages that will do the heavy lifting
# for us, requests and BeautifulSoup
# This is the URL that lists the current inmates
# Should this URL go away, an archive is available at
# http://perma.cc/2HZR-N38X
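The time import suggests the script pauses between requests. A minimal sketch of that pattern; the one-second delay and the inmates_links list are assumptions borrowed from the step-by-step gists below:

for inmate_link in inmates_links:
    r = requests.get(inmate_link)
    time.sleep(1)  # assumed delay: lighten the load on the server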
phillipsm / gist:2bdb5f622cbabe107c5b
Created June 24, 2015 20:14
Import our packages
import requests
from bs4 import BeautifulSoup
phillipsm / gist:7199f931a2de6787c0b6
Created June 24, 2015 20:16
Build list of inmates
url_to_scrape = 'http://apps2.polkcountyiowa.gov/inmatesontheweb/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, 'html.parser')

inmates_links = []
for table_row in soup.select(".inmatesList tr"):
    table_cells = table_row.findAll('td')

inmates = []
for inmate_link in inmates_links[:10]:
    r = requests.get(inmate_link)
    soup = BeautifulSoup(r.text, 'html.parser')

    inmate_details = {}
    inmate_profile_rows = soup.select("#inmateProfile tr")
    inmate_details['age'] = inmate_profile_rows[0].findAll('td')[0].text.strip()
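Both loops above are cut off by the preview: nothing is ever appended to inmates_links or inmates. A sketch of plausible completions; the cell position, link layout, and profile row index are assumptions (only the 'city' key is confirmed by the aggregation gist below):

# Sketch of the missing tail of the first loop: collect each inmate's
# detail-page link (cell position and link layout assumed)
for table_row in soup.select(".inmatesList tr"):
    table_cells = table_row.findAll('td')
    if table_cells:
        link = table_cells[0].find('a')
        if link is not None:
            inmates_links.append(link['href'])

# Sketch of the missing tail of the second loop, inside its body:
#     inmate_details['city'] = inmate_profile_rows[1].findAll('td')[0].text.strip()
#     inmates.append(inmate_details)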
phillipsm / gist:29d4cb4addb5c5a21ae7
Created June 24, 2015 20:22
Sum and print aggregations
inmate_cities = {}
for inmate in inmates:
    if inmate['city'] in inmate_cities:
        inmate_cities[inmate['city']] += 1
    else:
        inmate_cities[inmate['city']] = 1
print(inmate_cities)
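The same tally can be written more compactly with collections.Counter; a sketch, not part of the original gist:

from collections import Counter

inmate_cities = Counter(inmate['city'] for inmate in inmates)
print(inmate_cities)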