@leonardreidy
Created June 25, 2013 23:14
Python function that takes an input file containing HTML (in .html or .txt format) and the name of an output file, then uses the BeautifulSoup library to extract the name of the institution, stored in an <h2> tag, and the contents of the <td> tags that hold its profile information in a Higher Education Directory. The extracted values are written to the output file separated by commas.
from bs4 import BeautifulSoup


def preproc(infile, outfile):
    # Open the input file for reading; 'with' closes it once parsing is done
    with open(infile, 'r') as src:
        # Create a BeautifulSoup object from the file contents,
        # naming the parser explicitly so results do not vary by environment
        soup = BeautifulSoup(src, 'html.parser')
    # Use 'with' to open the output file so the interpreter
    # takes care of flushing and closing it afterwards
    with open(outfile, 'w', encoding='utf-8') as out:
        # Find the first <h2> (the school title) and write it
        out.write(soup('h2')[0].string + ",")
        # Iterate over every <tr> tag in the document
        for row in soup('tr'):
            # Drill down to the direct children of each row
            for cell in row:
                # .string is None for an empty tag or a tag with several
                # children, so write only items that hold a single string
                if cell.string is not None:
                    out.write(cell.string + ",")
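
The None check in the inner loop is easier to see on a small in-memory document. The following is a minimal sketch, not part of the original gist: the helper name extract_fields and the SAMPLE_HTML markup are hypothetical, and the real directory pages may be structured differently. It applies the same traversal as preproc but returns a list instead of writing to a file.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption for illustration only).
SAMPLE_HTML = (
    "<html><body>"
    "<h2>Example University</h2>"
    "<table>"
    "<tr><td>City</td><td>Springfield</td></tr>"
    "<tr><td>Phone</td><td></td></tr>"
    "<tr><td>Type</td><td><b>Public</b> <i>4-year</i></td></tr>"
    "</table>"
    "</body></html>"
)


def extract_fields(html):
    """Return the first <h2> string followed by every non-None cell string."""
    soup = BeautifulSoup(html, "html.parser")
    fields = [str(soup("h2")[0].string)]
    for row in soup("tr"):
        for cell in row:
            # .string is None for an empty tag and for a tag with
            # more than one child, so both kinds of cell are skipped
            if cell.string is not None:
                fields.append(str(cell.string))
    return fields


print(",".join(extract_fields(SAMPLE_HTML)))
# prints: Example University,City,Springfield,Phone,Type
```

The empty phone cell and the cell containing two child tags are both dropped, which is why such values never reach the output. If those cells matter, calling get_text() on each cell instead of reading .string would be one way to keep them.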