
@leonardreidy
leonardreidy / extract-contacts
Created June 25, 2013 23:14
Python function that takes an input file containing HTML (in .html or .txt format) and the name of an output file, and uses the BeautifulSoup library to extract the name of the institution stored in an <h2> tag and the contents of a set of <td> tags that contain profile information from a Higher Education Directory.
from bs4 import BeautifulSoup

def preproc(infile, outfile):
    # open input file for reading
    file = open(infile, 'r')
    # create BeautifulSoup object with the file contents
    soup = BeautifulSoup(file)
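The preview cuts off here. A minimal sketch of how the rest of preproc might continue, based only on the description above: the institution name is assumed to sit in the page's <h2> tag and the profile fields in its <td> tags, and the output layout is a guess.

from bs4 import BeautifulSoup

def preproc(infile, outfile):
    # open the input file and parse it with BeautifulSoup
    with open(infile, 'r') as f:
        soup = BeautifulSoup(f, 'html.parser')

    # the institution name is stored in an <h2> tag (per the description)
    h2 = soup.find('h2')
    institution = h2.get_text(strip=True) if h2 else ''

    # the profile information is stored in a set of <td> tags
    fields = [td.get_text(strip=True) for td in soup.find_all('td')]

    # write the institution name followed by the profile fields (assumed layout)
    with open(outfile, 'w') as out:
        out.write(institution + '\n')
        out.write('\n'.join(fields))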
@leonardreidy
leonardreidy / postproc
Created June 26, 2013 19:24
A Python script that iterates through all of the files in a given directory and runs the preproc() function from the previous gist on each of them.
import os

# get a list of the files in the current directory
a = os.listdir(os.getcwd())

def postproc(a):
    # for every file in the directory
    for i in a:
        # call the preproc function on said file and generate the appropriate outfile
        preproc(i, "out" + str(a.index(i)) + ".txt")
@leonardreidy
leonardreidy / simple-email-extractor
Created July 5, 2013 03:09
A simple Python script to iterate through all the (HTML) files in a directory, extract the email addresses from each, and write a comma-separated list to an outfile for each HTML file.
import os
from bs4 import BeautifulSoup
# get a list of the files in the current directory
here = os.listdir(os.getcwd())
# define preprocessing method to extract email addresses from a given
# html file
def preproc(infile, outfile):
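The preview stops at the function signature. A minimal sketch of a body that matches the description, continuing from the imports above and assuming the addresses sit in href="mailto:" anchors (as the later prep-contacts gist suggests):

def preproc(infile, outfile):
    with open(infile, 'r') as f:
        soup = BeautifulSoup(f, 'html.parser')

    # collect the address part of every mailto: link
    emails = [a['href'].replace('mailto:', '', 1)
              for a in soup.find_all('a', href=True)
              if a['href'].startswith('mailto:')]

    # write the addresses as a single comma-separated list
    with open(outfile, 'w') as out:
        out.write(', '.join(emails))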
@leonardreidy
leonardreidy / extract-names-and-schools
Created July 5, 2013 05:42
Simple script to strip out the administrator names and school names of contacts in a certain online directory.
# A simple program to extract the administrator name and school name from
# the html files of an online directory, then output one file each for
# the lists of names and schools, using the json.dumps() approach to generate
# simple json output
import json
from bs4 import BeautifulSoup

def extractor(infile, outfile1, outfile2):
    file = open(infile, 'r')
    soup = BeautifulSoup(file)
    commonsoup = soup('strong')
    names = []
    schools = []
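The preview ends with the two empty lists. A possible continuation, under the assumption that administrator names and school names alternate inside the <strong> tags; the actual layout of the directory pages isn't shown in the preview.

import json
from bs4 import BeautifulSoup

def extractor(infile, outfile1, outfile2):
    with open(infile, 'r') as f:
        soup = BeautifulSoup(f, 'html.parser')

    commonsoup = soup('strong')
    names = []
    schools = []

    # assumed layout: administrator name and school name alternate in the <strong> tags
    for i, tag in enumerate(commonsoup):
        text = tag.get_text(strip=True)
        if i % 2 == 0:
            names.append(text)
        else:
            schools.append(text)

    # one simple JSON file each for the names and the schools
    with open(outfile1, 'w') as f1:
        f1.write(json.dumps(names))
    with open(outfile2, 'w') as f2:
        f2.write(json.dumps(schools))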
@leonardreidy
leonardreidy / prep-contacts-for-ponymailer
Created July 5, 2013 14:07
Parse an HTML file with Beautiful Soup, find the emails and names, and output them as JSON, ready for ponymailer.rb. Emails are found via href="mailto" links and names inside <strong> tags. The program creates a single list that contains both names and emails, then outputs it as JSON, ready for ponymailer to send.
# A simple python script to extract names and emails from
# a certain online directory
import os, json
from bs4 import BeautifulSoup

# get a list of the files in the current directory
inputfiles = os.listdir(os.getcwd())

def postproc(inputfiles):
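The preview stops at the function signature. A minimal sketch of a body consistent with the description, continuing from the imports above: names pulled from <strong> tags, emails from mailto: links, paired into one list and dumped as JSON. The .html filter, the zip pairing, and the output filename are assumptions.

def postproc(inputfiles):
    contacts = []
    for infile in inputfiles:
        if not infile.endswith('.html'):      # assumed filter on html files
            continue
        with open(infile, 'r') as f:
            soup = BeautifulSoup(f, 'html.parser')

        # names live in <strong> tags, emails in mailto: links (per the description)
        names = [s.get_text(strip=True) for s in soup('strong')]
        emails = [a['href'].replace('mailto:', '', 1)
                  for a in soup.find_all('a', href=True)
                  if a['href'].startswith('mailto:')]

        # pair each name with its email into a single list of [name, email] entries
        contacts.extend([n, e] for n, e in zip(names, emails))

    # 'contacts.json' is an assumed output name, ready for ponymailer to read
    with open('contacts.json', 'w') as out:
        out.write(json.dumps(contacts))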
@leonardreidy
leonardreidy / ponymailer
Created July 5, 2013 15:30
A simple Ruby script that uses pony.rb to mass-mail a list of contacts specified in JSON format.
#
# Note that this program does not have any error-handling code, so if it fails,
# you will just get some error messages at the command line. It won't skip the
# offending email and move on with the task; it will fail entirely.
#
# If you don't know how to write error-handling code, keep an eye on the command
# prompt periodically. If the script does fail, as it occasionally will because a
# server rejects an email or something like that, look in your Gmail 'sent' folder
# to identify the last email sent, then compare it to the list of contacts. It will
# almost certainly have failed on the next email in the list.
@leonardreidy
leonardreidy / dirlister
Created September 16, 2013 18:22
Write a list of the contents of the current directory to a csv file in Python.
import glob
# glob does shell-style wildcard pattern matching and expansion, and it returns
# a list of strings - nice and easy to work with
# List files starting with anything and ending with anything:
# glob.glob("*.*")
#
# to list only text files, for example, try:
# glob.glob("*.txt")
@leonardreidy
leonardreidy / Twizzer-v0.0.1.py
Created November 11, 2013 05:14
A simple script to stream data from Twitter. Twizzer streams data with a given set of filters, then strips out the text fields and appends a datetime to them, which it streams to stdout and to a file simultaneously. (Script is based on the work of YouTube user Sentdex, with modifications suggested by YouTube user Satish Chandra and some of my own…
#! /usr/bin/env python
# Twizzer-0-0-1.py: Simple script for pulling streaming data from Twitter using
# the credentials of a given user. You will need a developer account for this
# to work, because of the way Twitter API 1.1 handles authentication etc. This
# script is based on a very similar script by YouTube user SentDex, with some
# modifications suggested by YouTube user Satish Chandra, and some of my own to
# resolve stdout encoding issues and user interaction.
#
# ROADMAP
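The preview ends at the ROADMAP header. A minimal sketch of the streaming loop the description outlines, written against the old (pre-4.0) tweepy StreamListener API that the Sentdex tutorials of that era used; the credentials, filter terms, and output filename are placeholders, not values from the original script.

import json
from datetime import datetime

from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

# placeholder credentials from a Twitter developer account
CKEY, CSECRET = 'consumer-key', 'consumer-secret'
ATOKEN, ASECRET = 'access-token', 'access-secret'

class Twizzer(StreamListener):
    def on_data(self, raw):
        # strip out the text field and prepend the current datetime
        text = json.loads(raw).get('text', '')
        line = datetime.now().isoformat() + ' ' + text
        print(line)                            # stream to stdout...
        with open('tweets.txt', 'a') as out:   # ...and to a file simultaneously
            out.write(line + '\n')
        return True

    def on_error(self, status):
        print(status)

auth = OAuthHandler(CKEY, CSECRET)
auth.set_access_token(ATOKEN, ASECRET)
Stream(auth, Twizzer()).filter(track=['python'])  # placeholder filter terms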
@leonardreidy
leonardreidy / gistbox-clipper
Created November 11, 2013 16:15
Testing new GistBox Clipper
// delays each call to fn until `time` ms after the most recent invocation
function throttle( fn, time ) {
    var t = 0;
    return function() {
        var args = arguments, ctx = this;
        clearTimeout(t);
        t = setTimeout( function() {
            fn.apply( ctx, args );
        }, time );
    };
}
@leonardreidy
leonardreidy / jam-frags.py
Created November 20, 2013 22:23
Python BS4 fragments for extracting content from a recent web-based JAM.
from bs4 import BeautifulSoup
# open the infile for reading
file = open(infile, 'r')
# convert the contents of the infile to a Beautiful Soup object
soup = BeautifulSoup(file)
# create lists, a list containing bs4.element.Tag items generated by using
# the .select() syntax - the texts and their author names are contained in
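The preview cuts off mid-comment. A generic sketch of the .select() pattern it refers to; the CSS selectors below are placeholders, since the markup of the JAM pages isn't shown in the preview.

# placeholder selectors - the real class names from the JAM pages are not in the preview
texts = soup.select('div.text')
authors = soup.select('span.author')

# pair each text fragment with its author name
for author, text in zip(authors, texts):
    print(author.get_text(strip=True), ':', text.get_text(strip=True))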