@gjreda
Created March 3, 2013 15:42
A Python script to get the winners of the Food and Drink categories in the Chicago Reader's Best of Chicago 2011 list. It just prints to the command line, but the results should be relatively easy to write to a CSV file using the DictWriter class in Python's csv module.
from bs4 import BeautifulSoup
from urllib2 import urlopen
from time import sleep # be nice

BASE_URL = "http://www.chicagoreader.com"

def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html, "lxml")

def get_category_links(section_url):
    soup = make_soup(section_url)
    boccat = soup.find("dl", "boccat")
    category_links = [BASE_URL + dd.a["href"] for dd in boccat.findAll("dd")]
    return category_links

def get_category_winner(category_url):
    soup = make_soup(category_url)
    category = soup.find("h1", "headline").string
    winner = [h2.string for h2 in soup.findAll("h2", "boc1")]
    runners_up = [h2.string for h2 in soup.findAll("h2", "boc2")]
    return {"category": category,
            "category_url": category_url,
            "winner": winner,
            "runners_up": runners_up}

if __name__ == '__main__':
    food_n_drink = ("http://www.chicagoreader.com/chicago/"
                    "best-of-chicago-2011-food-drink/BestOf?oid=4106228")

    categories = get_category_links(food_n_drink)
    data = [] # a list to store our dictionaries
    for category in categories:
        winner = get_category_winner(category)
        data.append(winner)
        sleep(1) # be nice

    print data
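
The description mentions writing the results to a CSV with csv.DictWriter. Below is a minimal sketch of how that might look, written for Python 3 (the script above is Python 2). The helper name write_winners_csv, the FIELDNAMES list, and the choice to join list values with "; " are assumptions made for illustration; the sample row is copied from the output a commenter posted below.

```python
import csv
import io

# Column order for the CSV; these are the keys returned by get_category_winner.
FIELDNAMES = ["category", "category_url", "winner", "runners_up"]

# One sample row, shaped like the dictionaries the script collects in `data`.
sample_data = [
    {"category": "Best restaurant that's been around forever and is still worth the trip",
     "category_url": "http://www.chicagoreader.com/chicago/BestOf?category=1979894&year=2011",
     "winner": ["Lula Cafe"],
     "runners_up": ["Frontera Grill", "Chicago Diner", "Twin Anchors"]},
]

def write_winners_csv(rows, out_file):
    """Write one CSV row per category (hypothetical helper)."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        flat = dict(row)
        # "winner" and "runners_up" are lists, so flatten them into one cell.
        flat["winner"] = "; ".join(row["winner"])
        flat["runners_up"] = "; ".join(row["runners_up"])
        writer.writerow(flat)

# Write to an in-memory buffer here; a real file works the same way.
buffer = io.StringIO()
write_winners_csv(sample_data, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open("winners.csv", "w", newline="") in place of the buffer.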
@tkeville

tkeville commented Nov 6, 2013

I am super new to this programming lark, so apologies if I am completely missing something obvious. I was trying to run the above just to see if I could get it to run, and I get the following:
Traceback (most recent call last):
  File "/home/tom/scraping.py", line 34, in <module>
    winner = get_category_winner(category)
  File "/home/tom/scraping.py", line 19, in get_category_winner
    category = soup.find("h1", "headline").string
AttributeError: 'NoneType' object has no attribute 'string'

I don't know what that is about?

@noelbautista91

Thanks, this helped more than anything I've read.

@cihanmehmet

Lines 31, 12, and 9 are incorrect.
Can you fix them?

@jamesfacts

Thanks very much for your instructions! Spent a little time playing around with this script today and learned a lot.

@Goggler

Goggler commented Feb 8, 2015

I had to sign in to thank you for your example code and the explanation that goes with it. I was able to write my first Python application pretty quickly based on your example.
Probably the most helpful example I've read.

@mattcummins

Thanks, really helpful approach.

@aznyellojersey

@tkeville, I believe the error is due to having two concurrent variables named "category", one in the get_category_winner function and again in the for loop.

EDIT: @gjreda is it possible to commit to this gist?
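
For what it's worth, the two `category` names live in different scopes (one is local to get_category_winner, the other is the loop variable in the main block), so they should not interfere with each other. A small Python 3 sketch of that scoping behavior, using a hypothetical get_label function and placeholder URLs rather than the real scraper:

```python
def get_label(url):
    # This `category` is local to get_label; it exists only during the call.
    category = "label for " + url
    return category

results = []
for category in ["url-a", "url-b"]:
    # This `category` is the loop variable in the enclosing (module) scope.
    results.append(get_label(category))
    # The assignment inside get_label did not touch the loop variable.
    assert category in ("url-a", "url-b")

print(results)
```

Reusing the name can still be confusing to read, so renaming one of the two is a reasonable cleanup, but it would not explain the AttributeError.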

@KeynesYouDigIt

Yeah, I am getting the same issue as @tkeville, so I tried putting in gibberish arguments for the tags and got the same result. Turns out it's returning None (an empty object of type NoneType)... super weird.

@aznyellojersey that is also a flaw, but I don't think that's the only flaw in this code.

Anyone got help for some noobs?

@KeynesYouDigIt

Ah lol, either the site has changed or the dl elements are gone/renamed, at least at the URL I am working with...

@tatumbla

Yes, the site has changed. The "h1", "headline" structure is now a div with id="storyline" and class="boc1". You need to read the BeautifulSoup docs to determine what the right search string needs to be. However, the process stays the same.
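
Because soup.find returns None when nothing matches, one way to make the script fail more gently when the markup changes is to check the result before reading from it. A minimal Python 3 sketch, assuming a hypothetical helper named find_text; it only relies on the find and get_text methods that BeautifulSoup objects provide:

```python
def find_text(soup, name, css_class):
    """Return the stripped text of the first matching tag, or None if absent.

    `soup` is expected to be a BeautifulSoup object (or any tag-like object
    with a compatible find method). find_text itself is a hypothetical helper.
    """
    tag = soup.find(name, css_class)
    if tag is None:
        # No such element on the page, e.g. because the site layout changed.
        return None
    return tag.get_text(strip=True)
```

In get_category_winner this could replace the direct .string access, for example category = find_text(soup, "h1", "headline"), followed by a check for None before using the value.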

@NicholasBravobi

It works, but it returns a "u" in front of each string. Does anyone know why, and how we can remove it? Thanks for your help.

Above script returns:

[{'category': u"Best restaurant that's been around forever and is still worth the trip\xa0", 'runners_up': [u'Frontera Grill', u'Chicago Diner ', u'Sabatino\u2019s', u'Twin Anchors'], 'winner': [u'Lula Cafe'], 'category_url': 'http://www.chicagoreader.com/chicago/BestOf?category=1979894&year=2011'}, {'category': u'Best fancy restaurant in Chicago\xa0', ...}]

@vatsl

vatsl commented Apr 18, 2016

@NicholasBravobi the "u" denotes that the string is a unicode string.
You can get rid of it by encoding the values returned by the get_category_winner function, for example category.encode('utf-8') for "category". Since winner and runners_up are lists, each element needs encoding, e.g. [w.encode('utf-8') for w in winner].
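
Note that the "u" is only part of how Python 2 displays unicode strings inside a printed list or dict; it is not part of the text itself, and Python 3 does not show it. The trailing \xa0 in the output is a non-breaking space that can be stripped. A small Python 3 sketch, assuming a hypothetical clean_text helper and using strings copied from the output above:

```python
def clean_text(value):
    """Replace non-breaking spaces and trim surrounding whitespace."""
    return value.replace("\xa0", " ").strip()

category = "Best fancy restaurant in Chicago\xa0"
runners_up = ["Frontera Grill", "Chicago Diner ", "Sabatino\u2019s", "Twin Anchors"]

# repr() shows escapes such as \xa0; print() of the cleaned string shows plain text.
print(repr(category))
print(clean_text(category))

cleaned_runners_up = [clean_text(name) for name in runners_up]
```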

@rahulwrites

This is my 1st attempt at Python.
Here's what I get when I run the code provided:

Traceback (most recent call last):
  File "python-tut1.py", line 31, in <module>
    categories = get_category_links(food_n_drink)
  File "python-tut1.py", line 12, in get_category_links
    soup = make_soup(section_url)
  File "python-tut1.py", line 9, in make_soup
    return BeautifulSoup(html, "lxml")
  File "/usr/local/lib/python2.7/site-packages/bs4/__init__.py", line 156, in __init__
    % ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
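
This error usually means the lxml package is not installed. Installing it (for example with pip install lxml) is one fix; another is to pass "html.parser", the parser from Python's standard library, to BeautifulSoup instead. A minimal Python 3 sketch of choosing between the two, where pick_parser is a hypothetical helper and not part of bs4:

```python
import importlib.util

def pick_parser():
    """Return "lxml" when that package is importable, else "html.parser"."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    # Fall back to the parser that ships with Python's standard library.
    return "html.parser"

print(pick_parser())
```

In make_soup, the call would then become BeautifulSoup(html, pick_parser()). The two parsers can build slightly different trees from badly formed HTML, so results may vary between them.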

@gaosaij

gaosaij commented Jul 10, 2017

Why do we need to use "if __name__ == '__main__':" and what does this mean?
I found that if I delete this, the program still works.
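
In short, Python sets __name__ to "__main__" when a file is run directly, and to the module's name when the file is imported. The guard keeps the scraping from starting when another script only wants to import the functions. A small Python 3 sketch, where run_mode and main are hypothetical names used for illustration:

```python
def run_mode(module_name):
    """Describe how a module whose __name__ is `module_name` was loaded."""
    return "run as a script" if module_name == "__main__" else "imported"

def main():
    print("This file was " + run_mode(__name__))

if __name__ == "__main__":
    # Only reached when the file is executed directly, not when it is imported.
    main()
```

Without the guard the program still works when run directly, which matches what you saw; the difference only shows up when the file is imported from elsewhere.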
