@leonardreidy
Created November 20, 2013 22:23
Python BS4 fragments for extracting content from a recent web-based JAM.
from bs4 import BeautifulSoup
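# open the infile for reading; infile is assumed to hold the path to the
# saved HTML page, and the file is assumed to be UTF-8 encoded
with open(infile, 'r', encoding='utf-8') as f:
    # convert the contents of the infile to a Beautiful Soup object,
    # naming the parser explicitly so the result does not depend on
    # which parsers happen to be installed
    soup = BeautifulSoup(f, 'html.parser')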
# build lists, a list of bs4.element.Tag items matched by the CSS selector
# passed to .select(); each message and its author name sit in an li
# element whose nested divs hold the content of interest
lists = soup.select('li.message-container-li')
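# extract the message text for the first Tag in the lists list
# (.encode("utf-8") returns bytes on Python 3; drop it to keep a str)
lists[0].div.find('div', {"class":"message-text cf"}).get_text().encode("utf-8")
# extract the message author name from the first a element in the same Tag
lists[0].a.get_text().encode("utf-8")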
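
The fragments above handle only the first item in the list. A minimal sketch of how the same lookups might be applied to every message follows. The sample markup, the `extract_messages` helper, and the names "Alice" and "Bob" are hypothetical, modelled on the selectors used above rather than taken from the real page, and the sketch assumes a bs4 version that provides `select_one`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the selectors used above.
SAMPLE_HTML = """
<ul>
  <li class="message-container-li">
    <a href="#">Alice</a>
    <div><div class="message-text cf">First message</div></div>
  </li>
  <li class="message-container-li">
    <a href="#">Bob</a>
    <div><div class="message-text cf">Second message</div></div>
  </li>
</ul>
"""


def extract_messages(html):
    """Return a list of (author, text) tuples, one per message li."""
    soup = BeautifulSoup(html, 'html.parser')
    messages = []
    for item in soup.select('li.message-container-li'):
        author_tag = item.a
        text_tag = item.select_one('div.message-text')
        if author_tag is None or text_tag is None:
            continue  # skip items that lack an author link or a message div
        messages.append((author_tag.get_text(strip=True),
                         text_tag.get_text(strip=True)))
    return messages


messages = extract_messages(SAMPLE_HTML)
for author, text in messages:
    print(author, '|', text)
```

Checking for `None` before calling `get_text()` keeps one malformed list item from raising an `AttributeError` and stopping the whole extraction.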