A Python script to get the winners of the Food and Drink categories in the Chicago Reader's Best of Chicago 2011 list. It just prints the results to the command line, but writing them to a CSV with the DictWriter class in Python's csv module should be relatively easy.
from bs4 import BeautifulSoup
from urllib2 import urlopen
from time import sleep # be nice

BASE_URL = "http://www.chicagoreader.com"

def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html, "lxml")

def get_category_links(section_url):
    soup = make_soup(section_url)
    boccat = soup.find("dl", "boccat")
    category_links = [BASE_URL + dd.a["href"] for dd in boccat.findAll("dd")]
    return category_links

def get_category_winner(category_url):
    soup = make_soup(category_url)
    category = soup.find("h1", "headline").string
    winner = [h2.string for h2 in soup.findAll("h2", "boc1")]
    runners_up = [h2.string for h2 in soup.findAll("h2", "boc2")]
    return {"category": category,
            "category_url": category_url,
            "winner": winner,
            "runners_up": runners_up}

if __name__ == '__main__':
    food_n_drink = ("http://www.chicagoreader.com/chicago/"
                    "best-of-chicago-2011-food-drink/BestOf?oid=4106228")

    categories = get_category_links(food_n_drink)
    data = [] # a list to store our dictionaries
    for category in categories:
        winner = get_category_winner(category)
        data.append(winner)
        sleep(1) # be nice

    print data
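
The description mentions writing the results to a CSV with DictWriter. Here is a minimal sketch of how that might look, written for Python 3 rather than the Python 2 of the script above. The helper names `join_names` and `write_winners_csv`, the `FIELDNAMES` column order, and the choice to join multiple names with "; " are assumptions made for illustration, not part of the original gist.

```python
import csv

# Column order for the CSV; these match the keys returned by get_category_winner.
FIELDNAMES = ["category", "category_url", "winner", "runners_up"]


def join_names(names):
    """Join a list of names into one cell, skipping missing (None) entries."""
    return "; ".join(name.strip() for name in names if name)


def write_winners_csv(data, out_file):
    """Write the list of category dictionaries to an open text file as CSV."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES)
    writer.writeheader()
    for record in data:
        writer.writerow({
            # category may be None if the headline was not found
            "category": (record["category"] or "").strip(),
            "category_url": record["category_url"],
            # winner and runners_up are lists, so flatten each into one cell
            "winner": join_names(record["winner"]),
            "runners_up": join_names(record["runners_up"]),
        })
```

One way to call it, assuming `data` is the list built in the main block: `with open("winners.csv", "w", newline="", encoding="utf-8") as f: write_winners_csv(data, f)`.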
@tkeville
tkeville commented Nov 6, 2013

I am super new to this programming lark, so apologies if I am missing something obvious. I was trying to run the script above just to see whether I could get it to run, and I got the following:
Traceback (most recent call last):
  File "/home/tom/scraping.py", line 34, in <module>
    winner = get_category_winner(category)
  File "/home/tom/scraping.py", line 19, in get_category_winner
    category = soup.find("h1", "headline").string
AttributeError: 'NoneType' object has no attribute 'string'

I don't know what that is about?

@nbau21
nbau21 commented Jun 25, 2014

Thanks, this helped more than anything I've read.

@cihanmehmet

Lines 31, 12, and 9 are incorrect.
Can you fix them?

@jamesfacts

Thanks very much for your instructions! Spent a little time playing around with this script today and learned a lot.

@Goggler
Goggler commented Feb 8, 2015

I had to sign in to thank you for your example code and the explanation that goes with it. I was able to write my first Python application pretty quickly based on your example.
Probably the most helpful example I've read.

@mattcummins

thanks, really helpful approach.

@aznyellojersey

@tkeville, I believe the error is due to having two concurrent variables named "category", one in the get_category_winner function and again in the for loop.

EDIT: @gjreda is it possible to commit to this gist?

@KeynesYouDigIt

Yeah, I am getting the same issue as @tkeville, so I tried putting in gibberish arguments for the tags and got the same result. Turns out it's returning an empty object (None)... super weird.

@aznyellojersey that is also a flaw, but I don't think that's the only flaw in this code.

Anyone got help for some noobs?

@KeynesYouDigIt

Ah lol, either the site has changed or the dl elements are gone/renamed, at least at the URL I am working with...

@tatumbla

Yes, the site has changed. The "h1", "headline" structure is now a div with id="storyline" and class="boc1". You need to read the BeautifulSoup docs to work out what the right search arguments are. The process stays the same, however.
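
Since `soup.find` returns None when nothing matches, one way to keep the script from crashing when the markup changes is to check the result before reading `.string`. The sketch below is an illustration only: `find_text` is a hypothetical helper, and the tag name and class are passed in by the caller because the site's current markup is not something this example can confirm.

```python
def find_text(soup, name, css_class):
    """Return the stripped text of the first matching tag, or None.

    `soup` is expected to be a BeautifulSoup object (or anything with a
    compatible find(name, css_class) method).
    """
    tag = soup.find(name, css_class)
    if tag is None:
        # No such tag on the page, e.g. because the site layout changed.
        return None
    text = tag.string
    # .string is itself None when the tag has more than one child node.
    return text.strip() if text is not None else None
```

In `get_category_winner`, `category = find_text(soup, "h1", "headline")` would then yield None instead of raising AttributeError, which makes it easier to spot which selector needs updating.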

@NicholasBravobi

It works... but it returns "u" in front of the text. Does anyone know why, and how we can remove it? Thanks for assisting.

The script above returns:

[{'category': u"Best restaurant that's been around forever and is still worth the trip\xa0", 'runners_up': [u'Frontera Grill', u'Chicago Diner ', u'Sabatino\u2019s', u'Twin Anchors'], 'winner': [u'Lula Cafe'], 'category_url': 'http://www.chicagoreader.com/chicago/BestOf?category=1979894&year=2011'}, {'category': u'Best fancy restaurant in Chicago\xa0', ...}]

@coldKnight

@NicholasBravobi the "u" denotes that the string is a unicode string.
You can get rid of it by returning category.encode('utf-8') as the value for "category" in the get_category_winner function, and by encoding each string in the winner and runners_up lists the same way.
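
Worth noting: the u prefix only shows up when Python 2 prints the repr of a unicode string, which is what happens when a whole list or dict is printed. Printing a single string shows it without the prefix. The sample output also has trailing non-breaking spaces (\xa0) and stray whitespace, so here is a small sketch that tidies each record. `clean_text` and `clean_record` are hypothetical helpers, not part of the gist, and the code should run on either Python 2 or Python 3.

```python
def clean_text(text):
    """Replace non-breaking spaces and trim whitespace; pass None through."""
    if text is None:
        return None
    return text.replace(u"\xa0", u" ").strip()


def clean_record(record):
    """Return a copy of one category dictionary with every string tidied."""
    return {
        "category": clean_text(record["category"]),
        "category_url": record["category_url"],
        # winner and runners_up are lists, so clean them item by item
        "winner": [clean_text(name) for name in record["winner"]],
        "runners_up": [clean_text(name) for name in record["runners_up"]],
    }
```

Calling `data = [clean_record(r) for r in data]` before printing or saving would apply it to every category.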

@rahulwrites

This is my 1st attempt at Python.
Here's what I get when I run the code provided:

Traceback (most recent call last):
  File "python-tut1.py", line 31, in <module>
    categories = get_category_links(food_n_drink)
  File "python-tut1.py", line 12, in get_category_links
    soup = make_soup(section_url)
  File "python-tut1.py", line 9, in make_soup
    return BeautifulSoup(html, "lxml")
  File "/usr/local/lib/python2.7/site-packages/bs4/__init__.py", line 156, in __init__
    % ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
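
This error means the lxml parser library is not installed. Installing it (for example with pip) is one fix. Another is to fall back to "html.parser", the parser built into Python that Beautiful Soup also accepts. A minimal sketch of that fallback follows; `pick_parser` is a hypothetical helper name, not something from the gist or from bs4.

```python
def pick_parser():
    """Return "lxml" if the lxml package can be imported, else "html.parser"."""
    try:
        import lxml  # noqa: F401 (imported only to check that it is installed)
    except ImportError:
        # Fall back to the parser that ships with Python's standard library.
        return "html.parser"
    return "lxml"
```

In `make_soup`, the last line would then become `return BeautifulSoup(html, pick_parser())`. The two parsers can build slightly different trees from malformed HTML, so results may differ a little between them.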
